为什么介绍这篇 paper?
- Raft 作为简单易懂的共识算法的代表,已经有了许多业界实现,而基于这些实现,一个庞大的软件生态也被构建了起来。通过了解 raft,可以帮助我们更加深入地了解这个软件生态。什么是可以做的,什么是不能做的,能做的事情的极限在什么地方。K8S
- 通过 raft 的设计思路,我们可以了解他们是怎么做的 invent & simplify。
Raft 在解决什么问题?
Consensus Algorithm。Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. 和 Paxos 做比较,raft 更加地 “understandable”。
一个共识算法的模型是什么样的?Replicated State Machine。
(All Pictures below are from https://raft.github.io/raft.pdf)
共识算法要解决的核心问题就是要让 Replicated log 保持一致,记录一致并且顺序一致,即使有些机器会 fail。
如何衡量一个共识算法是不是“好”的呢?
- Safety,never returning an incorrect result under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.
- Liveness,as long as any majority of the servers are operational and can communicate with each other and with clients.
- Not Depend on Timing
- Performance depends on majority of servers
对于“易懂性”,raft 是怎么做的?
- 将算法的各个部分相对独立地切分开。The first technique is the well-known approach of problem decomposition: wher- ever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently.
- 选主
- 复制
- 安全性
- 成员变更
- 通过一些算法的限制和假设,减少整个算法的复杂性。减少 corner case 和 if 分支。 simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible.
log 里不能有空洞
Log 不一致的情况的限制
使用“随机”来降低复杂性
说了这么多,Raft 到底是怎么做的?
Raft Basics
Raft Cluster 是由多个 node 的组成的。这些 node 的身份只可能是
- Leader
- Follower
- Candidate
Term 用来作为逻辑时钟。每一次重新的选主,term 都会单调增加。Raft 会保证每个 term 里最多只有一个 leader。
RPC call 是 raft 算法中 node 间通信的方法。主要有
- RequestVote RPCs are initiated by candidates during elections
- AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat
Leader Election
- Heartbeat
- Random to simplify
- Term +1
- Win by majority
- Vote only 1 at one term, first come first serve
- [safety] only node containing all committed entries can win
Log Replication
- Committed. The leader decides when it is safe to apply a log en- try to the state machines; such an entry is called commit- ted.
- Durable and eventually executed
- leader 维护最大的可以 commit 的 log index 并在 AppendEntries 里同步给 follower
- [safety]
- 如果两个 node 上的 log 的 term 和 index 都相同,那么它们的 command 也是相同的。
- 如果两个 node 上的 log 的 term 和 index 都相同,那么它们之前所有的 log entries 都是相同的。
- The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log.
- The second property is guaranteed by a simple con- sistency check performed by AppendEntries. When send- ing an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.
正常情况下一切都好说,但是如果 leader 宕机了。。。
Leader 强制 follower 覆盖成自己的 log。但是这个地方的限制还是太弱了。
为了解决这个问题,引入了“间接提交”。
Raft never commits log entries from previous terms by count- ing replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property.
参数调优
broadcastTime ≪ electionTimeout ≪ MTBF
( MTBF is the average time between failures for a single server. )
都是相对时间。
成员变更
成员变更要解决的首要问题就是“如何避免脑裂”
采用的方法是两阶段变更(2-phase approach)
C old -> C old, new -> C new
引入了三个问题
- 新节点没有历史记录。join as non-voting member
- leader in C old, new may not in new config. Leader 会退化为 follower when C new is committed, 并且在 committing C new 的时候,leader 不会把自己作为 majority。
- 在 C old 但是不在 C new 的机器可能会扰乱集群。如果 node 认为当前 leader 还存活(没过 election timeout),就不会回应 RequestVote 的请求。
Log Compaction
这个和 raft 本身关系不大。
Raft 在现实中的应用
- etcd
- consul
总结
Feature
- 只有 leader 负责和 client 通信。如果集群没有 leader 的话就无法对外提供读写服务。
- follower 无法提供读服务,没有 paxos 里的 listener。单集群很难 scale。scale 需要重新设计 multi-raft。
- commit 要多数派同意,因此集群内部的 node 数量是有限制的。通常是 5 个。
Invent and Simplify
- 使用更多的 constraints 来减少系统的 edge case
- 适当引入 random 反而会简化系统设计
- 将大的系统尽量 breakdown 成相对独立的子系统,使用单独的一个章节来介绍子系统之间的联系 (For writing)
TLA+
导演剪辑版