Introduction
Raft is a consensus algorithm designed to be understandable while providing the same guarantees as Paxos. Developed by Diego Ongaro and John Ousterhout in 2013, Raft achieves consensus among distributed nodes through leader election and log replication. It is widely used in production systems like etcd, Consul, and CockroachDB.
The Consensus Problem
Distributed consensus requires multiple nodes to agree on a single value despite failures. Key challenges include:
- Network Partitions: nodes may be unable to communicate with each other
- Node Failures: servers may crash and restart
- Message Loss: network packets may be dropped or delayed
- Byzantine Faults: malicious nodes (not addressed by Raft)
Raft Overview
Raft decomposes consensus into three independent subproblems:
- Leader Election: Select a single leader to manage the replicated log
- Log Replication: Leader accepts commands and replicates them to followers
- Safety: Ensure replicated logs remain consistent
Node States
Every Raft node exists in one of three states:
Leader: Handles all client requests and replicates log entries
Follower: Passive nodes that respond to leader and candidate requests
Candidate: Nodes campaigning to become the leader
```python
from enum import Enum
from dataclasses import dataclass
from typing import Dict, List, Optional
import random
import time


class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class LogEntry:
    term: int
    index: int
    command: str


class RaftNode:
    def __init__(self, node_id: str, cluster_nodes: List[str]):
        self.node_id = node_id
        self.cluster_nodes = cluster_nodes

        # Persistent state (survives restarts)
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: List[LogEntry] = []

        # Volatile state (all nodes)
        self.commit_index = 0
        self.last_applied = 0
        self.state = NodeState.FOLLOWER

        # Volatile state (leaders only)
        self.next_index: Dict[str, int] = {}   # For each server, index of next log entry to send
        self.match_index: Dict[str, int] = {}  # For each server, highest log index known replicated

        # Timeouts
        self.election_timeout = self.random_election_timeout()
        self.last_heartbeat = time.time()

    def random_election_timeout(self) -> float:
        """Election timeout randomized between 150-300 ms."""
        return random.uniform(0.15, 0.3)

    def reset_election_timer(self):
        """Reset the election timeout after hearing from a leader or candidate."""
        self.election_timeout = self.random_election_timeout()
        self.last_heartbeat = time.time()

    def time_since_last_heartbeat(self) -> float:
        """Time elapsed since the last heartbeat."""
        return time.time() - self.last_heartbeat

    def start_election(self):
        """Transition to candidate and start an election."""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        votes_received = 1  # Vote for self
        print(f"[{self.node_id}] Starting election for term {self.current_term}")

        # Request votes from all other nodes
        for node in self.cluster_nodes:
            if node != self.node_id:
                if self.request_vote(node):
                    votes_received += 1

        # Won the election if a majority voted for us
        if votes_received > len(self.cluster_nodes) / 2:
            self.become_leader()
        else:
            # Didn't win; revert to follower
            self.state = NodeState.FOLLOWER
            print(f"[{self.node_id}] Lost election with {votes_received} votes")

    def become_leader(self):
        """Initialize leader state after winning an election."""
        self.state = NodeState.LEADER
        print(f"[{self.node_id}] Became leader for term {self.current_term}")
        for node in self.cluster_nodes:
            if node != self.node_id:
                self.next_index[node] = len(self.log) + 1
                self.match_index[node] = 0
        # Send initial heartbeat to establish authority
        self.send_heartbeats()

    def request_vote(self, candidate_id: str) -> bool:
        """
        Receiver side of the RequestVote RPC (simplified: in a real
        implementation the candidate's last log index/term arrive as
        RPC arguments over the network).
        Grant the vote if:
          1. We haven't voted this term, or already voted for this candidate
          2. The candidate's log is at least as up-to-date as ours
        """
        last_log_index = len(self.log)
        last_log_term = self.log[-1].term if self.log else 0
        if self.voted_for is None or self.voted_for == candidate_id:
            # Simplified stand-ins for the candidate's values from the RPC
            candidate_last_log_index = last_log_index
            candidate_last_log_term = last_log_term
            if candidate_last_log_term > last_log_term or (
                candidate_last_log_term == last_log_term
                and candidate_last_log_index >= last_log_index
            ):
                self.voted_for = candidate_id
                self.reset_election_timer()
                return True
        return False

    def send_heartbeats(self):
        """Leader sends heartbeats (empty AppendEntries) to all followers."""
        if self.state != NodeState.LEADER:
            return
        for node in self.cluster_nodes:
            if node != self.node_id:
                self.append_entries(node, heartbeat=True)

    def append_entries(self, follower_id: str, heartbeat: bool = False) -> bool:
        """
        Replicate log entries (or send a heartbeat) to one follower.
        Simplified: a real implementation would make a network RPC.
        """
        prev_log_index = self.next_index.get(follower_id, 1) - 1
        prev_log_term = self.log[prev_log_index - 1].term if prev_log_index > 0 else 0
        entries: List[LogEntry] = []
        if not heartbeat:
            # Gather the entries the follower is missing
            next_idx = self.next_index[follower_id]
            if next_idx <= len(self.log):
                entries = self.log[next_idx - 1:]
        print(f"[{self.node_id}] Sending "
              f"{'heartbeat' if heartbeat else 'entries'} to {follower_id}")
        # In a real implementation this would be a network RPC; simulate success
        return True
```
Leader Election
Leader election ensures at most one leader per term. The process:
- Follower Timeout: if a follower doesn't receive a heartbeat within its election timeout, it becomes a candidate
- Request Votes: the candidate increments its term, votes for itself, and sends RequestVote RPCs to all other nodes
- Collect Votes: nodes grant their vote if they haven't voted this term and the candidate's log is at least as up-to-date as their own
- Become Leader: if the candidate receives votes from a majority, it becomes leader
- New Term: if another node establishes leadership, the candidate reverts to follower; if the election times out with no winner, it starts a new election
```java
// Java/Spring Boot: Raft Leader Election
@Service
public class RaftLeaderElection {

    private static final Logger log = LoggerFactory.getLogger(RaftLeaderElection.class);

    private final String nodeId;
    private final List<String> clusterNodes;
    private final AtomicInteger currentTerm = new AtomicInteger(0);
    private final AtomicReference<NodeState> state = new AtomicReference<>(NodeState.FOLLOWER);
    private volatile String votedFor = null;
    private volatile Instant lastHeartbeat = Instant.now();
    private final Random random = new Random();
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    @Autowired
    public RaftLeaderElection(
            @Value("${raft.node.id}") String nodeId,
            @Value("${raft.cluster.nodes}") List<String> clusterNodes
    ) {
        this.nodeId = nodeId;
        this.clusterNodes = clusterNodes;
        startElectionTimer();
    }

    private void startElectionTimer() {
        // Random election timeout between 150-300ms
        long timeout = 150 + random.nextInt(150);
        scheduler.schedule(() -> {
            if (state.get() != NodeState.LEADER) {
                Duration sinceHeartbeat = Duration.between(lastHeartbeat, Instant.now());
                if (sinceHeartbeat.toMillis() > timeout) {
                    startElection();
                }
            }
            startElectionTimer(); // Reschedule
        }, timeout, TimeUnit.MILLISECONDS);
    }

    private void startElection() {
        // Transition to candidate
        state.set(NodeState.CANDIDATE);
        int newTerm = currentTerm.incrementAndGet();
        votedFor = nodeId; // Vote for self
        log.info("[{}] Starting election for term {}", nodeId, newTerm);

        AtomicInteger votesReceived = new AtomicInteger(1); // Self vote
        CountDownLatch votingComplete = new CountDownLatch(clusterNodes.size() - 1);

        // Request votes from all other nodes
        for (String node : clusterNodes) {
            if (!node.equals(nodeId)) {
                CompletableFuture.supplyAsync(() -> requestVote(node, newTerm))
                    .thenAccept(granted -> {
                        if (granted) {
                            votesReceived.incrementAndGet();
                        }
                        votingComplete.countDown();
                    });
            }
        }

        // Wait for votes with timeout
        try {
            votingComplete.await(100, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        // Check if won election
        int votes = votesReceived.get();
        int majority = (clusterNodes.size() / 2) + 1;
        if (votes >= majority && state.get() == NodeState.CANDIDATE) {
            becomeLeader();
        } else {
            log.info("[{}] Lost election with {} votes (needed {})", nodeId, votes, majority);
            state.set(NodeState.FOLLOWER);
        }
    }

    private boolean requestVote(String targetNode, int term) {
        try {
            RequestVoteRequest request = new RequestVoteRequest(
                term,
                nodeId,
                getLastLogIndex(),
                getLastLogTerm()
            );
            // Send RPC to target node
            RequestVoteResponse response = sendRequestVoteRPC(targetNode, request);
            // If response term is higher, revert to follower
            if (response.getTerm() > currentTerm.get()) {
                currentTerm.set(response.getTerm());
                state.set(NodeState.FOLLOWER);
                votedFor = null;
            }
            return response.isVoteGranted();
        } catch (Exception e) {
            log.warn("[{}] Failed to request vote from {}: {}",
                nodeId, targetNode, e.getMessage());
            return false;
        }
    }

    private void becomeLeader() {
        state.set(NodeState.LEADER);
        log.info("[{}] Became leader for term {}", nodeId, currentTerm.get());
        // Start sending heartbeats
        startHeartbeats();
    }

    private void startHeartbeats() {
        scheduler.scheduleAtFixedRate(() -> {
            if (state.get() == NodeState.LEADER) {
                sendHeartbeats();
            }
        }, 0, 50, TimeUnit.MILLISECONDS); // Heartbeat every 50ms
    }

    private void sendHeartbeats() {
        for (String node : clusterNodes) {
            if (!node.equals(nodeId)) {
                CompletableFuture.runAsync(() -> {
                    try {
                        sendAppendEntriesRPC(node, true);
                    } catch (Exception e) {
                        log.warn("[{}] Failed to send heartbeat to {}", nodeId, node);
                    }
                });
            }
        }
    }

    @RaftRPC
    public RequestVoteResponse handleRequestVote(RequestVoteRequest request) {
        int term = currentTerm.get();
        // Reply false if term < currentTerm
        if (request.getTerm() < term) {
            return new RequestVoteResponse(term, false);
        }
        // If RPC request has higher term, update and convert to follower
        if (request.getTerm() > term) {
            currentTerm.set(request.getTerm());
            state.set(NodeState.FOLLOWER);
            votedFor = null;
        }
        // Grant vote if we haven't voted or already voted for this candidate,
        // and the candidate's log is at least as up-to-date
        boolean logUpToDate = isLogUpToDate(
            request.getLastLogIndex(),
            request.getLastLogTerm()
        );
        if ((votedFor == null || votedFor.equals(request.getCandidateId()))
                && logUpToDate) {
            votedFor = request.getCandidateId();
            lastHeartbeat = Instant.now();
            return new RequestVoteResponse(currentTerm.get(), true);
        }
        return new RequestVoteResponse(currentTerm.get(), false);
    }

    private boolean isLogUpToDate(int candidateLastIndex, int candidateLastTerm) {
        int lastLogIndex = getLastLogIndex();
        int lastLogTerm = getLastLogTerm();
        // Candidate's log is at least as up-to-date if:
        // - its last term is greater, OR
        // - last terms are equal and its last index is greater or equal
        return candidateLastTerm > lastLogTerm ||
            (candidateLastTerm == lastLogTerm &&
             candidateLastIndex >= lastLogIndex);
    }
}
```
Log Replication
Once elected, the leader accepts client commands and replicates them to followers:
- Client Request: Leader appends command to its log
- Replicate: Leader sends AppendEntries RPCs to followers in parallel
- Commit: Once majority of followers acknowledge, entry is committed
- Apply: Leader applies committed entry to state machine, returns result to client
- Propagate: Leader includes commitIndex in subsequent AppendEntries
```javascript
// Node.js: Raft Log Replication
class RaftLogReplication {
  constructor(nodeId, clusterNodes) {
    this.nodeId = nodeId;
    this.clusterNodes = clusterNodes;

    // Persistent state
    this.currentTerm = 0;
    this.log = []; // Array of {term, index, command}

    // Volatile state
    this.commitIndex = 0;
    this.lastApplied = 0;
    this.state = 'follower';

    // Leader state
    this.nextIndex = {};  // Next log index to send to each follower
    this.matchIndex = {}; // Highest log index replicated to each follower

    // State machine
    this.stateMachine = new Map();
  }

  /**
   * Client submits command to leader
   */
  async submitCommand(command) {
    if (this.state !== 'leader') {
      throw new Error('Not the leader');
    }

    // Append to local log
    const entry = {
      term: this.currentTerm,
      index: this.log.length + 1,
      command: command
    };
    this.log.push(entry);
    console.log(
      `[${this.nodeId}] Appended command to log: ` +
      `index=${entry.index}, term=${entry.term}`
    );

    // Replicate to followers
    const replicated = await this.replicateToFollowers(entry);
    if (replicated) {
      // Majority acknowledged: commit, then apply through the common
      // path so lastApplied stays in sync with commitIndex
      this.commitIndex = entry.index;
      this.applyCommittedEntries();
      return { success: true, index: entry.index, term: entry.term };
    }
    return { success: false, error: 'Failed to achieve majority replication' };
  }

  /**
   * Replicate log entry to followers
   */
  async replicateToFollowers(entry) {
    const replicationPromises = [];
    for (const follower of this.clusterNodes) {
      if (follower === this.nodeId) continue;
      replicationPromises.push(this.sendAppendEntries(follower, [entry]));
    }

    // Wait for responses
    const results = await Promise.allSettled(replicationPromises);

    // Count successful replications
    const successCount = results.filter(
      r => r.status === 'fulfilled' && r.value.success
    ).length;

    // Include self in count
    const total = successCount + 1;
    const majority = Math.floor(this.clusterNodes.length / 2) + 1;
    console.log(
      `[${this.nodeId}] Replication result: ${total}/${this.clusterNodes.length} ` +
      `(needed ${majority})`
    );
    return total >= majority;
  }

  /**
   * Send AppendEntries RPC to follower
   */
  async sendAppendEntries(follower, entries) {
    const prevLogIndex = this.nextIndex[follower] - 1;
    const prevLogTerm = prevLogIndex > 0
      ? this.log[prevLogIndex - 1].term
      : 0;

    const request = {
      term: this.currentTerm,
      leaderId: this.nodeId,
      prevLogIndex: prevLogIndex,
      prevLogTerm: prevLogTerm,
      entries: entries,
      leaderCommit: this.commitIndex
    };

    try {
      const response = await this.rpcCall(follower, 'appendEntries', request);
      if (response.success) {
        // Update nextIndex and matchIndex for follower
        this.nextIndex[follower] = prevLogIndex + entries.length + 1;
        this.matchIndex[follower] = prevLogIndex + entries.length;
        console.log(
          `[${this.nodeId}] Successfully replicated to ${follower}: ` +
          `matchIndex=${this.matchIndex[follower]}`
        );
        // Try to advance commitIndex
        this.advanceCommitIndex();
      } else {
        // Log inconsistency: decrement nextIndex and retry
        this.nextIndex[follower] = Math.max(1, this.nextIndex[follower] - 1);
        console.log(
          `[${this.nodeId}] Log inconsistency with ${follower}, ` +
          `retry with nextIndex=${this.nextIndex[follower]}`
        );
      }
      return response;
    } catch (error) {
      console.error(
        `[${this.nodeId}] Failed to replicate to ${follower}: ${error.message}`
      );
      return { success: false };
    }
  }

  /**
   * Advance commitIndex based on replicated entries
   */
  advanceCommitIndex() {
    // Find highest index replicated to a majority
    const indices = Object.values(this.matchIndex);
    indices.push(this.log.length); // Include leader's own log
    indices.sort((a, b) => b - a);

    const majority = Math.floor(this.clusterNodes.length / 2) + 1;
    const newCommitIndex = indices[majority - 1];

    // Only commit entries from the current term (Raft's commit rule)
    if (newCommitIndex > this.commitIndex) {
      const entry = this.log[newCommitIndex - 1];
      if (entry && entry.term === this.currentTerm) {
        console.log(
          `[${this.nodeId}] Advancing commitIndex: ` +
          `${this.commitIndex} -> ${newCommitIndex}`
        );
        this.commitIndex = newCommitIndex;
        // Apply newly committed entries
        this.applyCommittedEntries();
      }
    }
  }

  /**
   * Apply committed entries to state machine
   */
  applyCommittedEntries() {
    while (this.lastApplied < this.commitIndex) {
      this.lastApplied++;
      const entry = this.log[this.lastApplied - 1];
      this.applyToStateMachine(entry);
    }
  }

  /**
   * Apply entry to state machine
   */
  applyToStateMachine(entry) {
    console.log(
      `[${this.nodeId}] Applying to state machine: ` +
      `index=${entry.index}, command=${entry.command}`
    );
    // Parse command (simplified: "SET key value" or "GET key")
    const parts = entry.command.split(' ');
    const operation = parts[0];
    if (operation === 'SET') {
      const key = parts[1];
      const value = parts.slice(2).join(' ');
      this.stateMachine.set(key, value);
    }
  }

  /**
   * Handle AppendEntries RPC from leader
   */
  handleAppendEntries(request) {
    // Reply false if term < currentTerm
    if (request.term < this.currentTerm) {
      return { term: this.currentTerm, success: false };
    }

    // Update term and convert to follower if needed
    if (request.term > this.currentTerm) {
      this.currentTerm = request.term;
      this.state = 'follower';
    }

    // Reset election timer (received valid heartbeat/append)
    this.resetElectionTimer();

    // Reply false if log doesn't contain an entry at prevLogIndex with prevLogTerm
    if (request.prevLogIndex > 0) {
      const prevEntry = this.log[request.prevLogIndex - 1];
      if (!prevEntry || prevEntry.term !== request.prevLogTerm) {
        return { term: this.currentTerm, success: false };
      }
    }

    // Append new entries (simplified: a full implementation only truncates
    // when an existing entry actually conflicts with a new one)
    if (request.entries && request.entries.length > 0) {
      const startIndex = request.prevLogIndex;
      this.log = this.log.slice(0, startIndex);
      this.log.push(...request.entries);
      console.log(
        `[${this.nodeId}] Appended ${request.entries.length} entries from leader`
      );
    }

    // Update commitIndex if leader's commitIndex is higher
    if (request.leaderCommit > this.commitIndex) {
      this.commitIndex = Math.min(request.leaderCommit, this.log.length);
      this.applyCommittedEntries();
    }

    return { term: this.currentTerm, success: true };
  }
}
```
Safety Properties
Raft guarantees several safety properties:
Election Safety
At most one leader can be elected in a given term.
Leader Append-Only
A leader never overwrites or deletes entries in its own log; it only appends.
Log Matching
If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index.
Leader Completeness
If a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.
State Machine Safety
If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
```csharp
// C#: Raft Safety Checks
using System;
using System.Collections.Generic;
using System.Linq;

public class RaftSafetyValidator
{
    private readonly RaftNode _node;

    public RaftSafetyValidator(RaftNode node)
    {
        _node = node;
    }

    /// <summary>
    /// Verify Election Safety: at most one leader per term
    /// </summary>
    public bool ValidateElectionSafety(List<RaftNode> allNodes)
    {
        var leadersByTerm = new Dictionary<int, List<string>>();
        foreach (var node in allNodes)
        {
            if (node.State == NodeState.Leader)
            {
                if (!leadersByTerm.ContainsKey(node.CurrentTerm))
                {
                    leadersByTerm[node.CurrentTerm] = new List<string>();
                }
                leadersByTerm[node.CurrentTerm].Add(node.NodeId);
            }
        }
        // Check that each term has at most one leader
        foreach (var kvp in leadersByTerm)
        {
            if (kvp.Value.Count > 1)
            {
                Console.WriteLine(
                    $"SAFETY VIOLATION: Multiple leaders in term {kvp.Key}: " +
                    $"{string.Join(", ", kvp.Value)}"
                );
                return false;
            }
        }
        return true;
    }

    /// <summary>
    /// Verify Log Matching Property
    /// </summary>
    public bool ValidateLogMatching(RaftNode node1, RaftNode node2)
    {
        var minLength = Math.Min(node1.Log.Count, node2.Log.Count);
        for (int i = 0; i < minLength; i++)
        {
            var entry1 = node1.Log[i];
            var entry2 = node2.Log[i];
            // If entries have same index and term, commands must match
            if (entry1.Index == entry2.Index && entry1.Term == entry2.Term)
            {
                if (entry1.Command != entry2.Command)
                {
                    Console.WriteLine(
                        $"SAFETY VIOLATION: Log entries at index {entry1.Index} " +
                        $"have same term but different commands"
                    );
                    return false;
                }
            }
        }
        return true;
    }

    /// <summary>
    /// Verify State Machine Safety
    /// </summary>
    public bool ValidateStateMachineSafety(List<RaftNode> allNodes)
    {
        // Check that all nodes have applied same commands up to min commit index
        var minCommitIndex = allNodes.Min(n => n.CommitIndex);
        for (int i = 0; i < minCommitIndex; i++)
        {
            var commands = new HashSet<string>();
            foreach (var node in allNodes)
            {
                if (i < node.Log.Count)
                {
                    commands.Add(node.Log[i].Command);
                }
            }
            if (commands.Count > 1)
            {
                Console.WriteLine(
                    $"SAFETY VIOLATION: Nodes have different commands at index {i + 1}"
                );
                return false;
            }
        }
        return true;
    }

    /// <summary>
    /// Verify Leader Completeness
    /// </summary>
    public bool ValidateLeaderCompleteness(RaftNode currentLeader, RaftNode previousLeader)
    {
        if (currentLeader.CurrentTerm <= previousLeader.CurrentTerm)
        {
            return true; // Not applicable
        }
        // Current leader must have all committed entries from previous terms
        for (int i = 0; i < previousLeader.CommitIndex; i++)
        {
            if (i >= currentLeader.Log.Count)
            {
                Console.WriteLine(
                    $"SAFETY VIOLATION: Current leader missing committed entry at index {i + 1}"
                );
                return false;
            }
            var prevEntry = previousLeader.Log[i];
            var currEntry = currentLeader.Log[i];
            if (prevEntry.Term != currEntry.Term || prevEntry.Command != currEntry.Command)
            {
                Console.WriteLine(
                    $"SAFETY VIOLATION: Leader has different entry at committed index {i + 1}"
                );
                return false;
            }
        }
        return true;
    }
}
```
Cluster Membership Changes
Raft supports dynamic cluster membership through joint consensus:
- Leader receives configuration change request
- Leader commits C_old,new (joint consensus configuration)
- Once C_old,new is committed, leader commits C_new
- Old servers can be shut down after C_new is committed
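The key rule during the joint phase is that an entry commits only when it is acknowledged by majorities of *both* configurations. The check can be sketched as follows; `has_majority` and `joint_commit_ok` are hypothetical helper names standing in for a real replication layer:

```python
# Sketch of joint-consensus quorum checking.
# During C_old,new, agreement requires a majority of BOTH configurations.

def has_majority(acks: set, members: list) -> bool:
    """True if a majority of `members` appears in `acks`."""
    return len(acks & set(members)) > len(members) // 2

def joint_commit_ok(acks: set, c_old: list, c_new: list) -> bool:
    """C_old,new rule: commit only with majorities in both configs."""
    return has_majority(acks, c_old) and has_majority(acks, c_new)

c_old = ["n1", "n2", "n3"]
c_new = ["n2", "n3", "n4", "n5"]

# n2 and n3 ack: a majority of c_old (2 of 3) but not of c_new (2 of 4)
print(joint_commit_ok({"n2", "n3"}, c_old, c_new))        # False
# Adding n4 yields a majority in both configurations
print(joint_commit_ok({"n2", "n3", "n4"}, c_old, c_new))  # True
```

This dual-majority requirement is what prevents the old and new configurations from making independent, conflicting decisions during the transition.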
Performance Characteristics
- Latency: O(1) round trips for a successful operation (leader to a majority of followers)
- Throughput: limited by the leader's capacity and network bandwidth
- Availability: tolerates (N-1)/2 failures in an N-node cluster
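The availability figure follows directly from majority quorums; a quick arithmetic check for common cluster sizes:

```python
# Majority quorum size and fault tolerance for common cluster sizes
for n in (3, 4, 5, 7):
    quorum = n // 2 + 1       # smallest majority
    tolerated = (n - 1) // 2  # failures survivable: (N-1)/2
    print(f"N={n}: quorum={quorum}, tolerates {tolerated} failure(s)")
```

Note that a 4-node cluster tolerates no more failures than a 3-node one, which is why Raft clusters are typically deployed with an odd number of nodes.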
Raft vs Paxos
| Aspect | Raft | Paxos |
|---|---|---|
| Understandability | Designed for clarity | Complex, hard to understand |
| Leader | Strong leader | No designated leader |
| Log Structure | Continuous log | Arbitrary gaps allowed |
| Performance | Similar | Similar |
| Implementation | Many open-source implementations | Fewer implementations |
Production Use Cases
- etcd: Kubernetes' distributed configuration store
- Consul: service discovery and configuration
- CockroachDB: distributed SQL database
- TiKV: distributed key-value store
Summary
Raft provides understandable distributed consensus through leader election and log replication. By decomposing the problem into distinct subproblems and maintaining strong invariants, Raft achieves linearizable operations with high availability. Its clarity makes it the consensus algorithm of choice for many modern distributed systems.
Key takeaways:
- Leader-based design simplifies implementation and reasoning
- Randomized election timeouts prevent split votes
- Majority quorum ensures safety despite failures
- Log replication provides strong consistency guarantees
- Safety properties are mathematically proven and maintained