Notes on Avoiding Double Payments in a Distributed Payments System

Thoughts on a blog post

Avoiding Double Payments in a Distributed Payments System

Original Post: https://medium.com/airbnb-engineering/avoiding-double-payments-in-a-distributed-payments-system-2981f6b070bb

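My impression is that bank card payment is a key scenario for Airbnb, whereas we mainly take payments through WeChat Pay and Alipay. The difference is that a card charge is initiated directly from the server, while WeChat and Alipay payments are initiated by the user. So their top priority is never triggering the charge request twice, that is, exactly once; ours is making sure the user has at most one valid payment request at any given time. The two problems are related but not the same: avoiding sending two payment requests to WeChat/Alipay solves our problem, but it does not handle WeChat/Alipay timeouts. So we can borrow some of their ideas, while watching for the differences and not copying them mechanically.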

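On a first read, the article is mainly about how to achieve exactly-once through idempotency under eventual consistency, plus the problems that guaranteeing idempotency raises and how they solve them. Eventual consistency is emphasized because in Airbnb's SOA architecture a service calls a service that calls another service, and 2PC would be too heavyweight for that.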

Next they pitch their own “Orpheus”, a general-purpose idempotency library. I will skip the library itself and focus on the design ideas mentioned in the post.

Pre-RPC, RPC, Post-RPC

Pre-RPC: Details of the payment request are recorded in the database.
RPC: The request is made live to the external service over network and the response is received. This is a place to do one or more idempotent computations or RPCs (for example, query service for the status of a transaction first if it’s a retry-attempt).
Post-RPC: Details of the response from the external service are recorded in the database, including its successfulness and whether a bad request is retryable or not.

To maintain data integrity, we adhere to two simple ground rules:

   No service interaction over networks in Pre and Post-RPC phases
   No database interactions in the RPC phases

We essentially want to avoid mixing network communication with database work. (How do they manage the database/network mixing in practice? They are pretty impressive.)

Each of the Pre-RPC and Post-RPC phases is combined into a single database transaction.
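The three phases and the two ground rules can be sketched roughly as follows. This is only an illustration of the idea, not how Orpheus is implemented: the table `payment_requests`, the function `charge`, and the `call_gateway` callback are all hypothetical names, and SQLite stands in for the real database.

```python
import sqlite3


def make_db():
    # Hypothetical schema: one row per payment request.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE payment_requests ("
        " idempotency_key TEXT PRIMARY KEY,"
        " amount_cents INTEGER NOT NULL,"
        " status TEXT NOT NULL,"
        " response TEXT)"
    )
    return db


def charge(db, idempotency_key, amount_cents, call_gateway):
    # Pre-RPC: record the request in one transaction. No network calls here.
    with db:
        db.execute(
            "INSERT OR IGNORE INTO payment_requests VALUES (?, ?, 'PENDING', NULL)",
            (idempotency_key, amount_cents),
        )

    # RPC: talk to the external service. No database work here.
    response = call_gateway(idempotency_key, amount_cents)

    # Post-RPC: record the outcome in one transaction. No network calls here.
    status = "SUCCEEDED" if response["ok"] else "FAILED"
    with db:
        db.execute(
            "UPDATE payment_requests SET status = ?, response = ? "
            "WHERE idempotency_key = ?",
            (status, response["body"], idempotency_key),
        )
    return response
```

Because no transaction is open while the network call is in flight, a slow or hanging gateway cannot hold database locks or connections.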

Handling Retryable and Non-Retryable Exceptions Correctly

This part decides which exceptions are retryable mostly from a functional point of view; it does not consider the server's side (load, performance).
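A minimal sketch of such a classification, assuming the external service reports HTTP-style status codes. The set of retryable codes and the names `Outcome` and `classify` are my own assumptions, not something the article specifies.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"


# Assumed set of transient failures: timeouts, throttling, unavailable upstream.
RETRYABLE_STATUS = {408, 429, 502, 503, 504}


def classify(status_code):
    if 200 <= status_code < 300:
        return Outcome.SUCCESS
    if status_code in RETRYABLE_STATUS:
        return Outcome.RETRYABLE
    # Anything unknown is treated as non-retryable in this sketch, so that an
    # ambiguous failure is never blindly retried into a second charge.
    return Outcome.NON_RETRYABLE
```

Treating 429 and 503 as retryable is also where server load could come in: a caller could combine this with backoff instead of retrying immediately.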

Clients Play a Vital Role: the caller matters too and has to take on more responsibility

Pass in a unique idempotency key for every new request; reuse the same idempotency key for retries.
Persist these idempotency keys to the database before calling the service (to later use for retries).
Properly consume successful responses and subsequently unassign (or nullify) idempotency keys.
Ensure mutation of the request payload between retry attempts is not allowed.
Carefully devise and configure auto-retry strategies based on business needs (using exponential backoff or randomized wait times (“jitter”) to avoid the thundering herd problem).
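The client-side duties above could look something like this. It is a sketch under assumptions: `send` is a hypothetical function that performs the call and raises `RetryableError` on a transient failure, and persisting the key to a database before the first call is left out for brevity.

```python
import random
import time
import uuid


class RetryableError(Exception):
    pass


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random):
    # Exponential backoff with "full jitter": each delay is drawn uniformly
    # from [0, min(cap, base * 2**n)].
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


def call_with_retries(send, payload, attempts=4, sleep=time.sleep):
    key = str(uuid.uuid4())   # one key per logical request, reused on retries
    frozen = dict(payload)    # snapshot, so retries cannot send a mutated payload
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return send(key, dict(frozen))
        except RetryableError:
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
```

The jitter spreads retries from many clients over time, which is what avoids the thundering herd after a shared outage.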

How to Choose an Idempotency Key?

It depends on the business logic: the key is either request-level or entity-level.
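As a rough illustration of the two kinds of key: a request-level key is random and unique per request, while an entity-level key is derived deterministically from the business entity and the action, so the same action on the same entity always maps to the same key. The function names and the exact key format are assumptions for illustration.

```python
import uuid


def request_level_key():
    # Random and unique per logical request.
    return str(uuid.uuid4())


def entity_level_key(entity_type, entity_id, action):
    # Deterministic: the same entity and action always yield the same key,
    # e.g. a given payment can only be refunded once under this key.
    return f"{entity_type}-{entity_id}-{action}"
```

For example, `entity_level_key("payment", 1234, "refund")` gives `payment-1234-refund`, no matter how many different requests try to refund that payment.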

Recording the Response

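As I read it, once a response reaches a deterministic end state, that is, Non-Retryable Error or Success, it gets stored, and every later request with the same key is answered from this stored copy. Why do this? The reason given is to “maintain and monitor idempotent behavior”. It looks like a cache that never expires, so you would also need measures such as hot/cold data separation to keep this data from blowing up the database.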
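A minimal in-memory sketch of this behavior, assuming responses are dicts with a `state` field. The class name, the state names, and the use of a plain dict instead of a database table are all assumptions.

```python
class IdempotentExecutor:
    # States treated as deterministic end states in this sketch.
    TERMINAL = {"success", "non_retryable_error"}

    def __init__(self):
        self._responses = {}  # idempotency key -> recorded response

    def execute(self, key, operation):
        # Replay the recorded response if this key already reached an end state.
        if key in self._responses:
            return self._responses[key]
        response = operation()
        # Only terminal outcomes are recorded; retryable errors stay retryable.
        if response["state"] in self.TERMINAL:
            self._responses[key] = response
        return response
```

Since nothing ever removes entries from `_responses`, the sketch also shows why the stored responses grow without bound unless old data is archived.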

Every Request Needs an Expiring Lease

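This mainly addresses users clicking several times, or clients with very aggressive retry strategies. The implementation they describe is a database row-level lock that expires.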
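One way to approximate an expiring row-level lock is a lease row with an expiry timestamp. This is a sketch, not the article's implementation: the `leases` table, the `try_acquire` function, and the 30-second default are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3


def make_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE leases ("
        " idempotency_key TEXT PRIMARY KEY,"
        " expires_at REAL NOT NULL)"
    )
    return db


def try_acquire(db, key, now, ttl=30.0):
    """Return True if the caller now holds the lease for `key`."""
    with db:
        # Take over an existing lease only if it has already expired.
        cur = db.execute(
            "UPDATE leases SET expires_at = ? "
            "WHERE idempotency_key = ? AND expires_at <= ?",
            (now + ttl, key, now),
        )
        if cur.rowcount == 1:
            return True
        # Otherwise try to create the lease; ignored if a live one exists.
        cur = db.execute(
            "INSERT OR IGNORE INTO leases VALUES (?, ?)",
            (key, now + ttl),
        )
        return cur.rowcount == 1
```

A second click that arrives while the first request still holds the lease gets `False` and can be rejected right away; if the holder crashes, the lease expires by itself and a later retry can proceed.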

Avoid Reading from Replica Databases

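Agreed. With replication lag, a replica may not yet contain an idempotency record that was just written to the master, so a retry could be mistaken for a new request.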