Introduction
Raft is a consensus algorithm designed to be understandable while providing the same guarantees as Paxos. Developed by Diego Ongaro and John Ousterhout in 2013, Raft achieves consensus among distributed nodes through leader election and log replication. It is widely used in production systems like etcd, Consul, and CockroachDB.
The Consensus Problem
Distributed consensus requires multiple nodes to agree on a single value despite failures. Key challenges include:
- Network Partitions: nodes may be unable to communicate with each other
- Node Failures: servers may crash and restart
- Message Loss: network packets may be dropped or delayed
- Byzantine Faults: malicious nodes (not addressed by Raft)
Raft Overview
Raft decomposes consensus into three independent subproblems:
- Leader Election: Select a single leader to manage the replicated log
- Log Replication: Leader accepts commands and replicates them to followers
- Safety: Ensure replicated logs remain consistent
Node States
Every Raft node exists in one of three states:
Leader: Handles all client requests and replicates log entries
Follower: Passive nodes that respond to leader and candidate requests
Candidate: Nodes campaigning to become the leader
```python
from enum import Enum
from dataclasses import dataclass
from typing import Dict, List, Optional
import random
import time


class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class LogEntry:
    term: int
    index: int
    command: str


class RaftNode:
    def __init__(self, node_id: str, cluster_nodes: List[str]):
        self.node_id = node_id
        self.cluster_nodes = cluster_nodes

        # Persistent state (survives restarts)
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: List[LogEntry] = []

        # Volatile state (all nodes)
        self.commit_index = 0
        self.last_applied = 0
        self.state = NodeState.FOLLOWER

        # Volatile state (leaders only)
        self.next_index: Dict[str, int] = {}   # For each server, index of next log entry to send
        self.match_index: Dict[str, int] = {}  # For each server, highest log index known replicated

        # Timeouts
        self.election_timeout = self.random_election_timeout()
        self.last_heartbeat = time.time()

    def random_election_timeout(self) -> float:
        """Election timeout randomized between 150-300 ms."""
        return random.uniform(0.15, 0.3)

    def reset_election_timer(self):
        """Reset the election timeout after hearing from a leader or candidate."""
        self.election_timeout = self.random_election_timeout()
        self.last_heartbeat = time.time()

    def time_since_last_heartbeat(self) -> float:
        """Time elapsed since the last heartbeat."""
        return time.time() - self.last_heartbeat

    def start_election(self):
        """Transition to candidate and start an election."""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        votes_received = 1  # Vote for self
        print(f"[{self.node_id}] Starting election for term {self.current_term}")

        # Request votes from all other nodes
        for node in self.cluster_nodes:
            if node != self.node_id:
                if self.request_vote(node):
                    votes_received += 1

        # Won the election if a majority voted for us
        if votes_received > len(self.cluster_nodes) / 2:
            self.become_leader()
        else:
            # Didn't win; revert to follower
            self.state = NodeState.FOLLOWER
            print(f"[{self.node_id}] Lost election with {votes_received} votes")

    def become_leader(self):
        """Initialize leader state after winning an election."""
        self.state = NodeState.LEADER
        print(f"[{self.node_id}] Became leader for term {self.current_term}")
        for node in self.cluster_nodes:
            if node != self.node_id:
                self.next_index[node] = len(self.log) + 1
                self.match_index[node] = 0
        # Send initial heartbeat to establish authority
        self.send_heartbeats()

    def request_vote(self, candidate_id: str) -> bool:
        """
        Receiver side of the RequestVote RPC (simplified: in a real
        implementation the candidate's last log index/term arrive as
        RPC arguments over the network).
        Grant the vote if:
          1. We haven't voted this term, or already voted for this candidate
          2. The candidate's log is at least as up-to-date as ours
        """
        last_log_index = len(self.log)
        last_log_term = self.log[-1].term if self.log else 0
        if self.voted_for is None or self.voted_for == candidate_id:
            # Simplified stand-ins for the candidate's values from the RPC
            candidate_last_log_index = last_log_index
            candidate_last_log_term = last_log_term
            if candidate_last_log_term > last_log_term or (
                candidate_last_log_term == last_log_term
                and candidate_last_log_index >= last_log_index
            ):
                self.voted_for = candidate_id
                self.reset_election_timer()
                return True
        return False

    def send_heartbeats(self):
        """Leader sends heartbeats (empty AppendEntries) to all followers."""
        if self.state != NodeState.LEADER:
            return
        for node in self.cluster_nodes:
            if node != self.node_id:
                self.append_entries(node, heartbeat=True)

    def append_entries(self, follower_id: str, heartbeat: bool = False) -> bool:
        """
        Replicate log entries (or send a heartbeat) to one follower.
        Simplified: a real implementation would make a network RPC.
        """
        prev_log_index = self.next_index.get(follower_id, 1) - 1
        prev_log_term = self.log[prev_log_index - 1].term if prev_log_index > 0 else 0
        entries: List[LogEntry] = []
        if not heartbeat:
            # Gather the entries the follower is missing
            next_idx = self.next_index[follower_id]
            if next_idx <= len(self.log):
                entries = self.log[next_idx - 1:]
        print(f"[{self.node_id}] Sending "
              f"{'heartbeat' if heartbeat else 'entries'} to {follower_id}")
        # In a real implementation this would be a network RPC; simulate success
        return True
```
Leader Election
Leader election ensures at most one leader per term. The process:
- Follower Timeout: if a follower doesn't receive a heartbeat within its election timeout, it becomes a candidate
- Request Votes: the candidate increments its term, votes for itself, and sends RequestVote RPCs to all other nodes
- Collect Votes: nodes grant their vote if they haven't voted this term and the candidate's log is at least as up-to-date as their own
- Become Leader: if the candidate receives votes from a majority, it becomes leader
- New Term: if another node establishes leadership, the candidate reverts to follower; if the election times out with no winner, it starts a new election
```java
// Java/Spring Boot: Raft Leader Election
@Service
public class RaftLeaderElection {

    private static final Logger log = LoggerFactory.getLogger(RaftLeaderElection.class);

    private final String nodeId;
    private final List<String> clusterNodes;
    private final AtomicInteger currentTerm = new AtomicInteger(0);
    private final AtomicReference<NodeState> state = new AtomicReference<>(NodeState.FOLLOWER);
    private volatile String votedFor = null;
    private volatile Instant lastHeartbeat = Instant.now();
    private final Random random = new Random();
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    @Autowired
    public RaftLeaderElection(
            @Value("${raft.node.id}") String nodeId,
            @Value("${raft.cluster.nodes}") List<String> clusterNodes
    ) {
        this.nodeId = nodeId;
        this.clusterNodes = clusterNodes;
        startElectionTimer();
    }

    private void startElectionTimer() {
        // Random election timeout between 150-300ms
        long timeout = 150 + random.nextInt(150);
        scheduler.schedule(() -> {
            if (state.get() != NodeState.LEADER) {
                Duration sinceHeartbeat = Duration.between(lastHeartbeat, Instant.now());
                if (sinceHeartbeat.toMillis() > timeout) {
                    startElection();
                }
            }
            startElectionTimer(); // Reschedule
        }, timeout, TimeUnit.MILLISECONDS);
    }

    private void startElection() {
        // Transition to candidate
        state.set(NodeState.CANDIDATE);
        int newTerm = currentTerm.incrementAndGet();
        votedFor = nodeId; // Vote for self
        log.info("[{}] Starting election for term {}", nodeId, newTerm);

        AtomicInteger votesReceived = new AtomicInteger(1); // Self vote
        CountDownLatch votingComplete = new CountDownLatch(clusterNodes.size() - 1);

        // Request votes from all other nodes
        for (String node : clusterNodes) {
            if (!node.equals(nodeId)) {
                CompletableFuture.supplyAsync(() -> requestVote(node, newTerm))
                    .thenAccept(granted -> {
                        if (granted) {
                            votesReceived.incrementAndGet();
                        }
                        votingComplete.countDown();
                    });
            }
        }

        // Wait for votes with timeout
        try {
            votingComplete.await(100, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        // Check if won election
        int votes = votesReceived.get();
        int majority = (clusterNodes.size() / 2) + 1;
        if (votes >= majority && state.get() == NodeState.CANDIDATE) {
            becomeLeader();
        } else {
            log.info("[{}] Lost election with {} votes (needed {})", nodeId, votes, majority);
            state.set(NodeState.FOLLOWER);
        }
    }

    private boolean requestVote(String targetNode, int term) {
        try {
            RequestVoteRequest request = new RequestVoteRequest(
                term,
                nodeId,
                getLastLogIndex(),
                getLastLogTerm()
            );
            // Send RPC to target node
            RequestVoteResponse response = sendRequestVoteRPC(targetNode, request);
            // If response term is higher, revert to follower
            if (response.getTerm() > currentTerm.get()) {
                currentTerm.set(response.getTerm());
                state.set(NodeState.FOLLOWER);
                votedFor = null;
            }
            return response.isVoteGranted();
        } catch (Exception e) {
            log.warn("[{}] Failed to request vote from {}: {}",
                nodeId, targetNode, e.getMessage());
            return false;
        }
    }

    private void becomeLeader() {
        state.set(NodeState.LEADER);
        log.info("[{}] Became leader for term {}", nodeId, currentTerm.get());
        // Start sending heartbeats
        startHeartbeats();
    }

    private void startHeartbeats() {
        scheduler.scheduleAtFixedRate(() -> {
            if (state.get() == NodeState.LEADER) {
                sendHeartbeats();
            }
        }, 0, 50, TimeUnit.MILLISECONDS); // Heartbeat every 50ms
    }

    private void sendHeartbeats() {
        for (String node : clusterNodes) {
            if (!node.equals(nodeId)) {
                CompletableFuture.runAsync(() -> {
                    try {
                        sendAppendEntriesRPC(node, true);
                    } catch (Exception e) {
                        log.warn("[{}] Failed to send heartbeat to {}", nodeId, node);
                    }
                });
            }
        }
    }

    @RaftRPC
    public RequestVoteResponse handleRequestVote(RequestVoteRequest request) {
        int term = currentTerm.get();
        // Reply false if term < currentTerm
        if (request.getTerm() < term) {
            return new RequestVoteResponse(term, false);
        }
        // If RPC request has higher term, update and convert to follower
        if (request.getTerm() > term) {
            currentTerm.set(request.getTerm());
            state.set(NodeState.FOLLOWER);
            votedFor = null;
        }
        // Grant vote if we haven't voted or already voted for this candidate,
        // and the candidate's log is at least as up-to-date
        boolean logUpToDate = isLogUpToDate(
            request.getLastLogIndex(),
            request.getLastLogTerm()
        );
        if ((votedFor == null || votedFor.equals(request.getCandidateId()))
                && logUpToDate) {
            votedFor = request.getCandidateId();
            lastHeartbeat = Instant.now();
            return new RequestVoteResponse(currentTerm.get(), true);
        }
        return new RequestVoteResponse(currentTerm.get(), false);
    }

    private boolean isLogUpToDate(int candidateLastIndex, int candidateLastTerm) {
        int lastLogIndex = getLastLogIndex();
        int lastLogTerm = getLastLogTerm();
        // Candidate's log is at least as up-to-date if:
        // - its last term is greater, OR
        // - last terms are equal and its last index is greater or equal
        return candidateLastTerm > lastLogTerm ||
            (candidateLastTerm == lastLogTerm &&
             candidateLastIndex >= lastLogIndex);
    }
}
```
Log Replication
Once elected, the leader accepts client commands and replicates them to followers:
- Client Request: Leader appends command to its log
- Replicate: Leader sends AppendEntries RPCs to followers in parallel
- Commit: Once majority of followers acknowledge, entry is committed
- Apply: Leader applies committed entry to state machine, returns result to client
- Propagate: Leader includes commitIndex in subsequent AppendEntries
```javascript
// Node.js: Raft Log Replication
class RaftLogReplication {
  constructor(nodeId, clusterNodes) {
    this.nodeId = nodeId;
    this.clusterNodes = clusterNodes;

    // Persistent state
    this.currentTerm = 0;
    this.log = []; // Array of {term, index, command}

    // Volatile state
    this.commitIndex = 0;
    this.lastApplied = 0;
    this.state = 'follower';

    // Leader state
    this.nextIndex = {};  // Next log index to send to each follower
    this.matchIndex = {}; // Highest log index replicated to each follower

    // State machine
    this.stateMachine = new Map();
  }

  /**
   * Client submits command to leader
   */
  async submitCommand(command) {
    if (this.state !== 'leader') {
      throw new Error('Not the leader');
    }

    // Append to local log
    const entry = {
      term: this.currentTerm,
      index: this.log.length + 1,
      command: command
    };
    this.log.push(entry);
    console.log(
      `[${this.nodeId}] Appended command to log: ` +
      `index=${entry.index}, term=${entry.term}`
    );

    // Replicate to followers
    const replicated = await this.replicateToFollowers(entry);
    if (replicated) {
      // Majority acknowledged: commit, then apply through the common
      // path so lastApplied stays in sync with commitIndex
      this.commitIndex = entry.index;
      this.applyCommittedEntries();
      return { success: true, index: entry.index, term: entry.term };
    }
    return { success: false, error: 'Failed to achieve majority replication' };
  }

  /**
   * Replicate log entry to followers
   */
  async replicateToFollowers(entry) {
    const replicationPromises = [];
    for (const follower of this.clusterNodes) {
      if (follower === this.nodeId) continue;
      replicationPromises.push(this.sendAppendEntries(follower, [entry]));
    }

    // Wait for responses
    const results = await Promise.allSettled(replicationPromises);

    // Count successful replications
    const successCount = results.filter(
      r => r.status === 'fulfilled' && r.value.success
    ).length;

    // Include self in count
    const total = successCount + 1;
    const majority = Math.floor(this.clusterNodes.length / 2) + 1;
    console.log(
      `[${this.nodeId}] Replication result: ${total}/${this.clusterNodes.length} ` +
      `(needed ${majority})`
    );
    return total >= majority;
  }

  /**
   * Send AppendEntries RPC to follower
   */
  async sendAppendEntries(follower, entries) {
    const prevLogIndex = this.nextIndex[follower] - 1;
    const prevLogTerm = prevLogIndex > 0
      ? this.log[prevLogIndex - 1].term
      : 0;

    const request = {
      term: this.currentTerm,
      leaderId: this.nodeId,
      prevLogIndex: prevLogIndex,
      prevLogTerm: prevLogTerm,
      entries: entries,
      leaderCommit: this.commitIndex
    };

    try {
      const response = await this.rpcCall(follower, 'appendEntries', request);
      if (response.success) {
        // Update nextIndex and matchIndex for follower
        this.nextIndex[follower] = prevLogIndex + entries.length + 1;
        this.matchIndex[follower] = prevLogIndex + entries.length;
        console.log(
          `[${this.nodeId}] Successfully replicated to ${follower}: ` +
          `matchIndex=${this.matchIndex[follower]}`
        );
        // Try to advance commitIndex
        this.advanceCommitIndex();
      } else {
        // Log inconsistency: decrement nextIndex and retry
        this.nextIndex[follower] = Math.max(1, this.nextIndex[follower] - 1);
        console.log(
          `[${this.nodeId}] Log inconsistency with ${follower}, ` +
          `retry with nextIndex=${this.nextIndex[follower]}`
        );
      }
      return response;
    } catch (error) {
      console.error(
        `[${this.nodeId}] Failed to replicate to ${follower}: ${error.message}`
      );
      return { success: false };
    }
  }

  /**
   * Advance commitIndex based on replicated entries
   */
  advanceCommitIndex() {
    // Find highest index replicated to a majority
    const indices = Object.values(this.matchIndex);
    indices.push(this.log.length); // Include leader's own log
    indices.sort((a, b) => b - a);

    const majority = Math.floor(this.clusterNodes.length / 2) + 1;
    const newCommitIndex = indices[majority - 1];

    // Only commit entries from the current term (Raft's commit rule)
    if (newCommitIndex > this.commitIndex) {
      const entry = this.log[newCommitIndex - 1];
      if (entry && entry.term === this.currentTerm) {
        console.log(
          `[${this.nodeId}] Advancing commitIndex: ` +
          `${this.commitIndex} -> ${newCommitIndex}`
        );
        this.commitIndex = newCommitIndex;
        // Apply newly committed entries
        this.applyCommittedEntries();
      }
    }
  }

  /**
   * Apply committed entries to state machine
   */
  applyCommittedEntries() {
    while (this.lastApplied < this.commitIndex) {
      this.lastApplied++;
      const entry = this.log[this.lastApplied - 1];
      this.applyToStateMachine(entry);
    }
  }

  /**
   * Apply entry to state machine
   */
  applyToStateMachine(entry) {
    console.log(
      `[${this.nodeId}] Applying to state machine: ` +
      `index=${entry.index}, command=${entry.command}`
    );
    // Parse command (simplified: "SET key value" or "GET key")
    const parts = entry.command.split(' ');
    const operation = parts[0];
    if (operation === 'SET') {
      const key = parts[1];
      const value = parts.slice(2).join(' ');
      this.stateMachine.set(key, value);
    }
  }

  /**
   * Handle AppendEntries RPC from leader
   */
  handleAppendEntries(request) {
    // Reply false if term < currentTerm
    if (request.term < this.currentTerm) {
      return { term: this.currentTerm, success: false };
    }

    // Update term and convert to follower if needed
    if (request.term > this.currentTerm) {
      this.currentTerm = request.term;
      this.state = 'follower';
    }

    // Reset election timer (received valid heartbeat/append)
    this.resetElectionTimer();

    // Reply false if log doesn't contain an entry at prevLogIndex with prevLogTerm
    if (request.prevLogIndex > 0) {
      const prevEntry = this.log[request.prevLogIndex - 1];
      if (!prevEntry || prevEntry.term !== request.prevLogTerm) {
        return { term: this.currentTerm, success: false };
      }
    }

    // Append new entries (simplified: a full implementation only truncates
    // when an existing entry actually conflicts with a new one)
    if (request.entries && request.entries.length > 0) {
      const startIndex = request.prevLogIndex;
      this.log = this.log.slice(0, startIndex);
      this.log.push(...request.entries);
      console.log(
        `[${this.nodeId}] Appended ${request.entries.length} entries from leader`
      );
    }

    // Update commitIndex if leader's commitIndex is higher
    if (request.leaderCommit > this.commitIndex) {
      this.commitIndex = Math.min(request.leaderCommit, this.log.length);
      this.applyCommittedEntries();
    }

    return { term: this.currentTerm, success: true };
  }
}
```
Safety Properties
Raft guarantees several safety properties:
Election Safety
At most one leader can be elected in a given term.
Leader Append-Only
A leader never overwrites or deletes entries in its own log; it only appends.
Log Matching
If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index.
Leader Completeness
If a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.
State Machine Safety
If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
```csharp
// C#: Raft Safety Checks
using System;
using System.Collections.Generic;
using System.Linq;

public class RaftSafetyValidator
{
    private readonly RaftNode _node;

    public RaftSafetyValidator(RaftNode node)
    {
        _node = node;
    }

    /// <summary>
    /// Verify Election Safety: at most one leader per term
    /// </summary>
    public bool ValidateElectionSafety(List<RaftNode> allNodes)
    {
        var leadersByTerm = new Dictionary<int, List<string>>();
        foreach (var node in allNodes)
        {
            if (node.State == NodeState.Leader)
            {
                if (!leadersByTerm.ContainsKey(node.CurrentTerm))
                {
                    leadersByTerm[node.CurrentTerm] = new List<string>();
                }
                leadersByTerm[node.CurrentTerm].Add(node.NodeId);
            }
        }
        // Check that each term has at most one leader
        foreach (var kvp in leadersByTerm)
        {
            if (kvp.Value.Count > 1)
            {
                Console.WriteLine(
                    $"SAFETY VIOLATION: Multiple leaders in term {kvp.Key}: " +
                    $"{string.Join(", ", kvp.Value)}"
                );
                return false;
            }
        }
        return true;
    }

    /// <summary>
    /// Verify Log Matching Property
    /// </summary>
    public bool ValidateLogMatching(RaftNode node1, RaftNode node2)
    {
        var minLength = Math.Min(node1.Log.Count, node2.Log.Count);
        for (int i = 0; i < minLength; i++)
        {
            var entry1 = node1.Log[i];
            var entry2 = node2.Log[i];
            // If entries have same index and term, commands must match
            if (entry1.Index == entry2.Index && entry1.Term == entry2.Term)
            {
                if (entry1.Command != entry2.Command)
                {
                    Console.WriteLine(
                        $"SAFETY VIOLATION: Log entries at index {entry1.Index} " +
                        $"have same term but different commands"
                    );
                    return false;
                }
            }
        }
        return true;
    }

    /// <summary>
    /// Verify State Machine Safety
    /// </summary>
    public bool ValidateStateMachineSafety(List<RaftNode> allNodes)
    {
        // Check that all nodes have applied same commands up to min commit index
        var minCommitIndex = allNodes.Min(n => n.CommitIndex);
        for (int i = 0; i < minCommitIndex; i++)
        {
            var commands = new HashSet<string>();
            foreach (var node in allNodes)
            {
                if (i < node.Log.Count)
                {
                    commands.Add(node.Log[i].Command);
                }
            }
            if (commands.Count > 1)
            {
                Console.WriteLine(
                    $"SAFETY VIOLATION: Nodes have different commands at index {i + 1}"
                );
                return false;
            }
        }
        return true;
    }

    /// <summary>
    /// Verify Leader Completeness
    /// </summary>
    public bool ValidateLeaderCompleteness(RaftNode currentLeader, RaftNode previousLeader)
    {
        if (currentLeader.CurrentTerm <= previousLeader.CurrentTerm)
        {
            return true; // Not applicable
        }
        // Current leader must have all committed entries from previous terms
        for (int i = 0; i < previousLeader.CommitIndex; i++)
        {
            if (i >= currentLeader.Log.Count)
            {
                Console.WriteLine(
                    $"SAFETY VIOLATION: Current leader missing committed entry at index {i + 1}"
                );
                return false;
            }
            var prevEntry = previousLeader.Log[i];
            var currEntry = currentLeader.Log[i];
            if (prevEntry.Term != currEntry.Term || prevEntry.Command != currEntry.Command)
            {
                Console.WriteLine(
                    $"SAFETY VIOLATION: Leader has different entry at committed index {i + 1}"
                );
                return false;
            }
        }
        return true;
    }
}
```
Cluster Membership Changes
Raft supports dynamic cluster membership through joint consensus:
- Leader receives configuration change request
- Leader commits C_old,new (joint consensus configuration)
- Once C_old,new is committed, leader commits C_new
- Old servers can be shut down after C_new is committed
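The key rule during the joint phase is that an entry commits only when it is acknowledged by majorities of *both* configurations. The check can be sketched as follows; `has_majority` and `joint_commit_ok` are hypothetical helper names standing in for a real replication layer:

```python
# Sketch of joint-consensus quorum checking.
# During C_old,new, agreement requires a majority of BOTH configurations.

def has_majority(acks: set, members: list) -> bool:
    """True if a majority of `members` appears in `acks`."""
    return len(acks & set(members)) > len(members) // 2

def joint_commit_ok(acks: set, c_old: list, c_new: list) -> bool:
    """C_old,new rule: commit only with majorities in both configs."""
    return has_majority(acks, c_old) and has_majority(acks, c_new)

c_old = ["n1", "n2", "n3"]
c_new = ["n2", "n3", "n4", "n5"]

# n2 and n3 ack: a majority of c_old (2 of 3) but not of c_new (2 of 4)
print(joint_commit_ok({"n2", "n3"}, c_old, c_new))        # False
# Adding n4 yields a majority in both configurations
print(joint_commit_ok({"n2", "n3", "n4"}, c_old, c_new))  # True
```

This dual-majority requirement is what prevents the old and new configurations from making independent, conflicting decisions during the transition.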
Performance Characteristics
- Latency: O(1) round trips for a successful operation (leader to a majority of followers)
- Throughput: limited by the leader's capacity and network bandwidth
- Availability: tolerates (N-1)/2 failures in an N-node cluster
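The availability figure follows directly from majority quorums; a quick arithmetic check for common cluster sizes:

```python
# Majority quorum size and fault tolerance for common cluster sizes
for n in (3, 4, 5, 7):
    quorum = n // 2 + 1       # smallest majority
    tolerated = (n - 1) // 2  # failures survivable: (N-1)/2
    print(f"N={n}: quorum={quorum}, tolerates {tolerated} failure(s)")
```

Note that a 4-node cluster tolerates no more failures than a 3-node one, which is why Raft clusters are typically deployed with an odd number of nodes.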
Raft vs Paxos
| Aspect | Raft | Paxos |
|---|---|---|
| Understandability | Designed for clarity | Complex, hard to understand |
| Leader | Strong leader | No designated leader |
| Log Structure | Continuous log | Arbitrary gaps allowed |
| Performance | Similar | Similar |
| Implementation | Many open-source implementations | Fewer implementations |
Production Use Cases
- etcd: Kubernetes' distributed configuration store
- Consul: service discovery and configuration
- CockroachDB: distributed SQL database
- TiKV: distributed key-value store
Summary
Raft provides understandable distributed consensus through leader election and log replication. By decomposing the problem into distinct subproblems and maintaining strong invariants, Raft achieves linearizable operations with high availability. Its clarity makes it the consensus algorithm of choice for many modern distributed systems.
Key takeaways:
- Leader-based design simplifies implementation and reasoning
- Randomized election timeouts prevent split votes
- Majority quorum ensures safety despite failures
- Log replication provides strong consistency guarantees
- Safety properties are mathematically proven and maintained