Avoiding Double Payments in a Distributed Payments System
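My impression is that one of Airbnb's key scenarios is bank card payment, whereas we mainly take payments through WeChat Pay and Alipay. The difference is that a card payment is initiated directly from the server, while a WeChat or Alipay payment is initiated by the user. So what matters most to them is never triggering the charge request twice, i.e. exactly once; what matters most to us is ensuring that a user has at most one valid payment request at any given time. The two problems are related but distinct: avoiding sending two payment requests to WeChat/Alipay solves our problem, but it does not handle WeChat/Alipay timeouts. So we can borrow some of their ideas, but we must watch for the differences and not copy them mechanically.
On a first read, the article is mainly about how to achieve exactly-once under eventual consistency by means of idempotency, plus the series of problems that guaranteeing idempotency raises and their solutions. Eventual consistency is emphasized because in Airbnb's SOA architecture a service calls a service that calls yet another service, and 2PC would be too heavyweight there.
Next they plug their own “Orpheus”, a general-purpose idempotency library. I will skip the library itself and focus below on the design ideas the article mentions.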
Pre-RPC, RPC, Post-RPC
- Pre-RPC: Details of the payment request are recorded in the database.
- RPC: The request is made live to the external service over network and the response is received. This is a place to do one or more idempotent computations or RPCs (for example, query service for the status of a transaction first if it’s a retry-attempt).
- Post-RPC: Details of the response from the external service are recorded in the database, including its successfulness and whether a bad request is retryable or not.
To maintain data integrity, we adhere to two simple ground rules:
- No service interaction over networks in Pre and Post-RPC phases
- No database interactions in the RPC phases
We essentially want to avoid mixing network communication with database work. (How did they end up mixing database work with network calls in the first place? That is quite something.)
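The database writes of the Pre-RPC phase and of the Post-RPC phase are each combined into a single database transaction (one transaction per phase, not one transaction spanning both).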
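To make the three phases concrete, here is a minimal sketch of how I picture them, using Python's built-in `sqlite3`. The `payments` table, the `charge` function, and the `gateway` callable are all names I made up for illustration; this is not how Orpheus is implemented.

```python
import sqlite3

def charge(conn, gateway, key, amount):
    # Pre-RPC: record the request in one database transaction; no network calls.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO payments (idem_key, amount, status) "
            "VALUES (?, ?, 'pending')",
            (key, amount),
        )
    # RPC: talk to the external service; no database access in this phase.
    result = gateway(key, amount)
    # Post-RPC: record the outcome in one database transaction; no network calls.
    with conn:
        conn.execute(
            "UPDATE payments SET status = ? WHERE idem_key = ?",
            (result, key),
        )
    return result

# Hypothetical setup: an in-memory database and a fake gateway that always succeeds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (idem_key TEXT PRIMARY KEY, amount INTEGER, status TEXT)"
)
status = charge(conn, lambda key, amount: "succeeded", "key-1", 100)
```

Each `with conn:` block commits on success and rolls back on an exception, so each phase's writes land together or not at all, and the network call sits strictly between the two transactions.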
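Handling Retryable and Non-Retryable Exceptions Correctly
The article mostly decides which exceptions are retryable from a functional standpoint; it does not look at the question from the server's side (load, performance).
Clients Play a Vital Role: the caller matters too and has to take on more responsibility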
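A rough sketch of what a functional classification could look like. The mapping from HTTP status codes to the two categories is my own assumption for illustration, not something taken from the article, and the class and function names are hypothetical.

```python
class RetryableError(Exception):
    """Transient failure: sending the same request again may succeed."""

class NonRetryableError(Exception):
    """Deterministic failure: the same request will fail again."""

def classify(http_status):
    # Assumed mapping, for illustration only:
    # rate limiting and server-side errors are treated as transient.
    if http_status == 429 or 500 <= http_status < 600:
        return RetryableError
    # Other client errors (validation and the like) will not fix themselves.
    if 400 <= http_status < 500:
        return NonRetryableError
    # Anything else is not an error in this sketch.
    return None
```

A load-aware version would presumably also look at how busy the server is before allowing a retry, which is the part I find missing in the article.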
- Pass in a unique idempotency key for every new request; reuse the same idempotency key for retries.
- Persist these idempotency keys to the database before calling the service (to later use for retries).
- Properly consume successful responses and subsequently unassign (or nullify) idempotency keys.
- Ensure mutation of the request payload between retry attempts is not allowed.
- Carefully devise and configure auto-retry strategies based on business needs (using exponential backoff or randomized wait times (“jitter”) to avoid the thundering herd problem).
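The client-side rules above can be sketched roughly as follows. `call_with_retries`, `TransientError`, and `flaky_send` are hypothetical names, the backoff uses the "full jitter" variant as one possible choice, and persisting the key to a database before the call is left out to keep the sketch short.

```python
import random
import time
import uuid

class TransientError(Exception):
    """Stands in for a timeout or another retryable failure."""

def backoff_delay(attempt, base=0.1, cap=5.0):
    # Full jitter: a random wait between 0 and the capped exponential bound.
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(send, payload, max_attempts=4, sleep=time.sleep):
    key = str(uuid.uuid4())   # one fresh key per new request
    frozen = dict(payload)    # snapshot, so every retry sends the same payload
    for attempt in range(max_attempts):
        try:
            return send(key, dict(frozen))  # the same key is reused on retries
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))

# Demo: a fake service that times out twice, then succeeds.
seen_keys = []

def flaky_send(key, payload):
    seen_keys.append(key)
    if len(seen_keys) < 3:
        raise TransientError("timeout")
    return "ok"

# A no-op sleep is passed in so the demo does not actually wait.
result = call_with_retries(flaky_send, {"amount": 100}, sleep=lambda s: None)
```

All three attempts in the demo carry the same idempotency key, which is what lets the server recognize them as one request.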
How to Choose an Idempotency Key?
It depends on the business case; the article distinguishes two kinds: request-level and entity-level.
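As I understand the two kinds of key, a request-level key is unique per request, while a per-entity (entity-level) key is derived from the business object so that an action can happen at most once for that object. A small sketch, with helper names and the key format being my own assumptions:

```python
import uuid

def request_level_key():
    # A fresh random key for every new request.
    return str(uuid.uuid4())

def entity_level_key(entity_type, entity_id, action):
    # A deterministic key: the same entity and action always give the same key,
    # so a second attempt at the action collides with the first.
    return f"{entity_type}-{entity_id}-{action}"

refund_key = entity_level_key("payment", 1234, "refund")
```

With the entity-level form, two separate refund requests for payment 1234 share one key, so only one refund can go through; with the request-level form they would count as two distinct requests.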
Recording the Response
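As I read it, once a response reaches a deterministic end state, i.e. a non-retryable error or a success, it is stored, and every later request with the same idempotency key gets the stored response back. Why do this step? The reason given is to “maintain and monitor idempotent behavior”. It looks like a cache that never expires, so you also need something like hot/cold data separation to keep the data from blowing up the database.
Every Request Needs an Expiring Lease
This mainly addresses users clicking multiple times or clients with very aggressive retry strategies. The implementation the article lists is a database row-level lock that expires.
Avoid Reading from Replicas
Agreed.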
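A minimal in-memory sketch of the idea, assuming a response is just a `(status, body)` pair and using hypothetical status names; a real system would store this in the database next to the idempotency key.

```python
class ResponseStore:
    # Statuses treated as deterministic end states in this sketch.
    FINAL = ("success", "non_retryable_error")

    def __init__(self):
        self._saved = {}

    def handle(self, key, process):
        # Replay the stored response if this key already reached an end state.
        if key in self._saved:
            return self._saved[key]
        status, body = process()
        # Only end states are stored; retryable errors are never recorded.
        if status in self.FINAL:
            self._saved[key] = (status, body)
        return status, body

# Demo: the first call fails with a retryable error, the second succeeds,
# and the third is answered from the store without running process() again.
calls = []

def process():
    calls.append(1)
    if len(calls) == 1:
        return ("retryable_error", None)
    return ("success", {"charged": 100})

store = ResponseStore()
first = store.handle("key-1", process)
second = store.handle("key-1", process)
third = store.handle("key-1", process)
```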
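One way such an expiring lease might look, sketched with `sqlite3`. The `leases` table, the `acquire_lease` function, and the 30-second TTL are assumptions of mine, and the current time is passed in explicitly so the behavior is easy to follow.

```python
import sqlite3

def acquire_lease(conn, key, now, ttl=30.0):
    # Both statements run in one transaction, so takeover and insert are atomic.
    with conn:
        # Take over an existing lease only if it has already expired.
        cur = conn.execute(
            "UPDATE leases SET expires_at = ? WHERE idem_key = ? AND expires_at <= ?",
            (now + ttl, key, now),
        )
        if cur.rowcount == 1:
            return True
        # Otherwise try to create the lease; a live lease makes this a no-op.
        cur = conn.execute(
            "INSERT OR IGNORE INTO leases (idem_key, expires_at) VALUES (?, ?)",
            (key, now + ttl),
        )
        return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leases (idem_key TEXT PRIMARY KEY, expires_at REAL)")

got_first = acquire_lease(conn, "key-1", now=0.0)    # no lease yet, so it is granted
got_second = acquire_lease(conn, "key-1", now=10.0)  # lease still live, so it is refused
got_third = acquire_lease(conn, "key-1", now=31.0)   # lease expired, so it is taken over
```

A double click or an over-eager retry arriving while the first request is still in flight is refused, and if the holder crashes, the lease simply expires instead of blocking the key forever.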