Distributed Locks — Correct vs Dangerous


Introduction#

Distributed locks are one of the most misunderstood primitives in distributed systems. While they seem conceptually simple, implementing them correctly requires deep understanding of failure modes, timing assumptions, and consistency guarantees. This post explores what makes a distributed lock safe, common pitfalls, and when alternative approaches are better.

Why Distributed Locks Are Hard#

Unlike single-process locks, distributed locks must handle:

  • Network partitions: messages can be delayed, lost, or reordered
  • Process pauses: GC pauses, context switches, paging
  • Clock drift: physical clocks are never perfectly synchronized
  • Partial failures: some nodes may fail while others continue
  • Byzantine faults: nodes may behave maliciously or incorrectly

# Python: Demonstrating why naive distributed locks fail
import time
import threading
import random

class NaiveDistributedLock:
    """
    DANGEROUS: This implementation has multiple serious flaws
    DO NOT USE IN PRODUCTION
    """
    
    def __init__(self):
        self.holder = None
        self.expiry = 0
    
    def acquire(self, client_id: str, ttl_seconds: int = 10) -> bool:
        """Attempt to acquire lock"""
        current_time = time.time()
        
        # Check if lock is free or expired.
        # NOT atomic: two clients can both pass this check and both
        # believe they hold the lock.
        if self.holder is None or current_time > self.expiry:
            self.holder = client_id
            self.expiry = current_time + ttl_seconds
            return True
        
        return False
    
    def release(self, client_id: str):
        """Release lock"""
        if self.holder == client_id:
            self.holder = None
            self.expiry = 0

# Demonstrate race condition
def demonstrate_race_condition():
    lock = NaiveDistributedLock()
    shared_counter = {'value': 0}
    
    def critical_section(client_id):
        if lock.acquire(client_id):
            print(f"[{client_id}] Acquired lock")
            
            # Simulate work in critical section
            current = shared_counter['value']
            time.sleep(0.01)  # Simulate processing
            shared_counter['value'] = current + 1
            
            print(f"[{client_id}] Counter: {shared_counter['value']}")
            lock.release(client_id)
        else:
            print(f"[{client_id}] Failed to acquire lock")
    
    # Multiple threads trying to acquire lock
    threads = []
    for i in range(5):
        t = threading.Thread(target=critical_section, args=(f"client-{i}",))
        threads.append(t)
        t.start()
    
    for t in threads:
        t.join()
    
    print(f"\nFinal counter value: {shared_counter['value']}")
    print("Expected: 5, but might be less due to race conditions!")

demonstrate_race_condition()

Safety Properties of Distributed Locks#

A correct distributed lock must guarantee:

  • Mutual exclusion: at most one client holds the lock at any time
  • Deadlock freedom: the system eventually makes progress
  • Fault tolerance: the lock remains functional despite node failures
  • Bounded wait: lock acquisition doesn't wait indefinitely

The Fencing Token Pattern#

The key to safe distributed locks is using fencing tokens to detect stale lock holders.
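
The essence of the pattern fits in a few lines. Here is a minimal single-process Python sketch (illustrative only; the class names are hypothetical, and a real service would issue tokens from durable storage): the lock service hands out a strictly increasing token with every grant, and the protected resource rejects any write carrying a token older than the newest one it has seen.

```python
import itertools

class FencingLockService:
    """In-memory sketch: every grant carries a strictly increasing token."""
    def __init__(self):
        self._tokens = itertools.count(1)
        self._holder = None

    def acquire(self, client_id):
        if self._holder is None:
            self._holder = client_id
            return next(self._tokens)  # the fencing token
        return None

    def release(self, client_id):
        if self._holder == client_id:
            self._holder = None

class FencedStorage:
    """Resource that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token <= self.highest_token:
            return False  # stale holder: reject the write
        self.highest_token = token
        self.value = value
        return True

svc, storage = FencingLockService(), FencedStorage()
t1 = svc.acquire("a")                   # token 1
svc.release("a")                        # "a" pauses or its lease expires
t2 = svc.acquire("b")                   # token 2
assert storage.write(t2, "from b")      # accepted
assert not storage.write(t1, "from a")  # stale token rejected
```

The storage side does the real safety work: even if "a" resumes believing it still holds the lock, its token is now too old to do damage.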

// Java/Spring Boot: Correct distributed lock with fencing tokens
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.Collections;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.stereotype.Service;

@Service
public class FencedDistributedLock {
    
    private final AtomicLong fencingTokenCounter = new AtomicLong(0);
    private final RedisTemplate<String, String> redis;
    private final String lockKey;
    
    @Autowired
    public FencedDistributedLock(
        RedisTemplate<String, String> redis,
        @Value("${lock.key}") String lockKey
    ) {
        this.redis = redis;
        this.lockKey = lockKey;
    }
    
    /**
     * Acquire lock with fencing token
     * @return LockHandle with fencing token, or null if acquisition failed
     */
    public LockHandle acquire(String clientId, Duration ttl) {
        String lockValue = UUID.randomUUID().toString();
        
        // Use Redis SET NX EX for atomic lock acquisition
        Boolean acquired = redis.opsForValue().setIfAbsent(
            lockKey,
            lockValue,
            ttl
        );
        
        if (Boolean.TRUE.equals(acquired)) {
            // Generate a monotonically increasing fencing token.
            // NOTE: an in-process AtomicLong is only monotonic while a
            // single lock-service instance issues tokens; a clustered
            // deployment must use a shared source such as Redis INCR.
            long fencingToken = fencingTokenCounter.incrementAndGet();
            
            // Store fencing token with lock
            redis.opsForHash().put(
                lockKey + ":fence",
                lockValue,
                String.valueOf(fencingToken)
            );
            
            return new LockHandle(
                clientId,
                lockValue,
                fencingToken,
                Instant.now().plus(ttl)
            );
        }
        
        return null;
    }
    
    /**
     * Release lock only if holder matches
     */
    public boolean release(LockHandle handle) {
        // Lua script for atomic check-and-delete
        String luaScript = 
            "if redis.call('get', KEYS[1]) == ARGV[1] then " +
            "    redis.call('del', KEYS[1]) " +
            "    redis.call('hdel', KEYS[2], ARGV[1]) " +
            "    return 1 " +
            "else " +
            "    return 0 " +
            "end";
        
        Long result = redis.execute(
            new DefaultRedisScript<>(luaScript, Long.class),
            Arrays.asList(lockKey, lockKey + ":fence"),
            handle.lockValue
        );
        
        return result != null && result == 1;
    }
    
    /**
     * Refresh lock TTL
     */
    public boolean refresh(LockHandle handle, Duration newTtl) {
        String luaScript =
            "if redis.call('get', KEYS[1]) == ARGV[1] then " +
            "    redis.call('expire', KEYS[1], ARGV[2]) " +
            "    return 1 " +
            "else " +
            "    return 0 " +
            "end";
        
        Long result = redis.execute(
            new DefaultRedisScript<>(luaScript, Long.class),
            Collections.singletonList(lockKey),
            handle.lockValue,
            String.valueOf(newTtl.getSeconds())
        );
        
        if (result != null && result == 1) {
            handle.expiry = Instant.now().plus(newTtl);
            return true;
        }
        
        return false;
    }
}

public class LockHandle {
    private final String clientId;
    private final String lockValue;
    private final long fencingToken;
    private Instant expiry;
    
    public LockHandle(
        String clientId,
        String lockValue,
        long fencingToken,
        Instant expiry
    ) {
        this.clientId = clientId;
        this.lockValue = lockValue;
        this.fencingToken = fencingToken;
        this.expiry = expiry;
    }
    
    public long getFencingToken() {
        return fencingToken;
    }
    
    public boolean isExpired() {
        return Instant.now().isAfter(expiry);
    }
    
    public String getLockValue() {
        return lockValue;
    }
}

@Service
public class ProtectedResourceService {
    
    private final FencedDistributedLock lock;
    private long lastFencingToken = 0;
    
    /**
     * Access resource with fencing token protection
     */
    public void updateResource(LockHandle lockHandle, String data) {
        long fencingToken = lockHandle.getFencingToken();
        
        // Reject requests with stale fencing tokens
        if (fencingToken <= lastFencingToken) {
            throw new StaleTokenException(
                String.format(
                    "Stale fencing token %d, current is %d",
                    fencingToken,
                    lastFencingToken
                )
            );
        }
        
        // Update resource
        performUpdate(data);
        
        // Update last seen token
        lastFencingToken = fencingToken;
        
        System.out.printf(
            "Resource updated with fencing token %d%n",
            fencingToken
        );
    }
    
    private void performUpdate(String data) {
        // Actual resource update logic
    }
}

Redlock Algorithm Analysis#

Redis’s Redlock algorithm attempts to provide distributed locks using multiple Redis instances. However, it has serious problems.

// C#: Redlock implementation with commentary on its flaws
public class RedlockImplementation
{
    private readonly List<IRedisClient> _redisInstances;
    private readonly int _quorumSize;
    private const int ClockDriftFactor = 100;  // Milliseconds
    
    public RedlockImplementation(List<IRedisClient> redisInstances)
    {
        _redisInstances = redisInstances;
        _quorumSize = (_redisInstances.Count / 2) + 1;
    }
    
    public async Task<RedlockHandle?> AcquireLock(
        string resource,
        TimeSpan ttl,
        CancellationToken cancellationToken = default)
    {
        var lockValue = Guid.NewGuid().ToString();
        var startTime = DateTimeOffset.UtcNow;
        
        // Try to acquire lock on majority of instances
        int successCount = 0;
        var tasks = _redisInstances.Select(async redis =>
        {
            try
            {
                var acquired = await redis.SetIfNotExistsAsync(
                    resource,
                    lockValue,
                    ttl,
                    cancellationToken
                );
                return acquired;
            }
            catch
            {
                return false;
            }
        });
        
        var results = await Task.WhenAll(tasks);
        successCount = results.Count(r => r);
        
        var elapsedTime = DateTimeOffset.UtcNow - startTime;
        
        // Check if we acquired majority before TTL expired
        var validityTime = ttl - elapsedTime - 
                          TimeSpan.FromMilliseconds(ClockDriftFactor);
        
        if (successCount >= _quorumSize && validityTime > TimeSpan.Zero)
        {
            return new RedlockHandle
            {
                Resource = resource,
                LockValue = lockValue,
                ValidUntil = DateTimeOffset.UtcNow.Add(validityTime)
            };
        }
        
        // Failed to acquire, release locks we did get
        await ReleaseLock(resource, lockValue);
        return null;
    }
    
    public async Task ReleaseLock(string resource, string lockValue)
    {
        // Lua script for atomic check-and-delete
        const string luaScript = @"
            if redis.call('get', KEYS[1]) == ARGV[1] then
                return redis.call('del', KEYS[1])
            else
                return 0
            end
        ";
        
        var tasks = _redisInstances.Select(redis =>
            redis.EvalAsync(luaScript, new[] { resource }, new[] { lockValue })
        );
        
        await Task.WhenAll(tasks);
    }
}

/*
 * PROBLEMS WITH REDLOCK:
 * 
 * 1. Timing Assumptions:
 *    - Assumes bounded network delays (not guaranteed)
 *    - Assumes bounded process pauses (GC can pause longer than TTL)
 *    - Assumes synchronized clocks (clock drift causes issues)
 * 
 * 2. Process Pause Problem:
 *    Client A acquires lock
 *    Client A pauses (GC/context switch) for > TTL
 *    Lock expires, Client B acquires lock
 *    Client A resumes, thinks it still has lock
 *    BOTH clients in critical section!
 * 
 * 3. No Fencing Tokens:
 *    - Cannot detect stale lock holders
 *    - Resource access not protected against delayed operations
 * 
 * 4. Split Brain During Partitions:
 *    - Network partition can create two majorities
 *    - Both sides think they have the lock
 * 
 * Martin Kleppmann's critique is apt: Redlock is "neither fish nor
 * fowl" -- too heavyweight and expensive for efficiency-only locking,
 * yet not safe enough when correctness depends on the lock (no fencing).
 */

public class RedlockHandle
{
    public string Resource { get; set; }
    public string LockValue { get; set; }
    public DateTimeOffset ValidUntil { get; set; }
}
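
The process-pause problem listed above is easy to reproduce. In this hedged Python sketch (single process; `time.sleep` stands in for a GC pause or swap-out), a TTL-only lock is handed to a second client while the first is paused, so both believe they hold it:

```python
import time

class TtlLock:
    """TTL-only lock (Redis SET NX EX style) -- no fencing token."""
    def __init__(self):
        self.holder = None
        self.expiry = 0.0

    def acquire(self, client_id, ttl):
        now = time.monotonic()
        if self.holder is None or now > self.expiry:
            self.holder = client_id
            self.expiry = now + ttl
            return True
        return False

lock = TtlLock()
assert lock.acquire("A", ttl=0.05)   # A gets the lock, TTL 50 ms

time.sleep(0.1)                      # A "pauses" past its own TTL

assert lock.acquire("B", ttl=10)     # TTL expired, so B acquires

# A resumes. It received no revocation, so from its point of view it
# still holds the lock -- yet the lock service says B does. Without a
# fencing token, both clients now enter the critical section.
print(lock.holder)  # "B"
```

No amount of quorum fixes this: the pause happens entirely on the client, after acquisition succeeded on every replica.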

Correct Approach: Consensus-Based Locks#

Use consensus systems like Chubby, etcd, or ZooKeeper for correct distributed locks.

// Node.js: ZooKeeper-based distributed lock (correct implementation)
const zookeeper = require('node-zookeeper-client');

class ZooKeeperDistributedLock {
    constructor(zkClient, lockPath) {
        this.zkClient = zkClient;
        this.lockPath = lockPath;
        this.sequentialNode = null;
    }
    
    async acquire(clientId) {
        // Create lock directory if doesn't exist
        await this.ensureLockPath();
        
        // Create ephemeral sequential node
        const nodePath = await this.createSequentialNode(clientId);
        this.sequentialNode = nodePath;
        
        console.log(`[${clientId}] Created node: ${nodePath}`);
        
        // Try to acquire lock
        while (true) {
            const children = await this.getChildren();
            const sortedChildren = children.sort();
            
            const myNode = nodePath.split('/').pop();
            const myIndex = sortedChildren.indexOf(myNode);
            
            if (myIndex === 0) {
                // We have the lock!
                console.log(`[${clientId}] Acquired lock`);
                return {
                    clientId,
                    nodePath,
                    release: () => this.release()
                };
            }
            
            // Watch only the node immediately before ours (this avoids
            // the herd effect of every waiter watching the lock holder)
            const watchNode = sortedChildren[myIndex - 1];
            
            // Register the listener BEFORE setting the watch so the
            // deletion event cannot fire unobserved
            const deleted = new Promise(resolve => {
                this.zkClient.once('nodeDeleted', resolve);
            });
            
            const stat = await this.watchNode(watchNode);
            if (!stat) {
                // Predecessor vanished between getChildren and exists():
                // drop the now-useless listener and re-check immediately
                this.zkClient.removeAllListeners('nodeDeleted');
                continue;
            }
            
            console.log(
                `[${clientId}] Waiting for ${watchNode} to release lock`
            );
            
            // Wait for the deletion notification
            await deleted;
        }
    }
    
    async release() {
        if (this.sequentialNode) {
            await this.deleteNode(this.sequentialNode);
            console.log(`Released lock: ${this.sequentialNode}`);
            this.sequentialNode = null;
        }
    }
    
    async ensureLockPath() {
        return new Promise((resolve, reject) => {
            this.zkClient.create(
                this.lockPath,
                Buffer.from('lock'),
                zookeeper.CreateMode.PERSISTENT,
                (error) => {
                    if (error && error.getCode() !== zookeeper.Exception.NODE_EXISTS) {
                        reject(error);
                    } else {
                        resolve();
                    }
                }
            );
        });
    }
    
    async createSequentialNode(clientId) {
        return new Promise((resolve, reject) => {
            const nodePath = `${this.lockPath}/lock-`;
            this.zkClient.create(
                nodePath,
                Buffer.from(clientId),
                zookeeper.CreateMode.EPHEMERAL_SEQUENTIAL,
                (error, path) => {
                    if (error) {
                        reject(error);
                    } else {
                        resolve(path);
                    }
                }
            );
        });
    }
    
    async getChildren() {
        return new Promise((resolve, reject) => {
            this.zkClient.getChildren(
                this.lockPath,
                (error, children) => {
                    if (error) {
                        reject(error);
                    } else {
                        resolve(children);
                    }
                }
            );
        });
    }
    
    async watchNode(nodeName) {
        return new Promise((resolve, reject) => {
            const fullPath = `${this.lockPath}/${nodeName}`;
            this.zkClient.exists(
                fullPath,
                (event) => {
                    if (event.type === zookeeper.Event.NODE_DELETED) {
                        this.zkClient.emit('nodeDeleted');
                    }
                },
                (error, stat) => {
                    if (error) {
                        reject(error);
                    } else {
                        resolve(stat);
                    }
                }
            );
        });
    }
    
    async deleteNode(path) {
        return new Promise((resolve, reject) => {
            this.zkClient.remove(path, -1, (error) => {
                if (error) {
                    reject(error);
                } else {
                    resolve();
                }
            });
        });
    }
}

/*
 * WHY ZOOKEEPER LOCKS ARE CORRECT:
 * 
 * 1. Consensus-based: Uses Zab protocol (similar to Raft/Paxos)
 * 2. Ephemeral nodes: Automatically released on client disconnect
 * 3. Sequential ordering: Provides total ordering of lock requests
 * 4. Watch mechanism: Efficient waiting without polling
 * 5. No timing assumptions: Correct despite arbitrary delays
 * 6. Linearizable: Guarantees strong consistency
 */

// Usage example
async function example() {
    const client = zookeeper.createClient('localhost:2181');
    
    client.once('connected', async () => {
        console.log('Connected to ZooKeeper');
        
        const lock = new ZooKeeperDistributedLock(
            client,
            '/distributed-locks/my-resource'
        );
        
        try {
            const handle = await lock.acquire('client-1');
            
            // Critical section
            console.log('Executing critical section...');
            await new Promise(resolve => setTimeout(resolve, 2000));
            
            // Release lock
            await handle.release();
        } catch (error) {
            console.error('Lock error:', error);
        } finally {
            client.close();
        }
    });
    
    client.connect();
}

Lease-Based Locks with Heartbeats#

A lease that the holder actively renews sits between a fixed TTL and full consensus: the lock survives long critical sections, yet is reclaimed quickly once the holder crashes and its heartbeats stop.

# Python: Lease-based lock with heartbeat renewal
import time
import threading
from typing import Optional, Callable
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LeaseInfo:
    """Lease information for distributed lock"""
    lock_id: str
    client_id: str
    expires_at: datetime
    fencing_token: int

class LeaseBasedLock:
    """
    Lease-based distributed lock with automatic renewal
    Safer than fixed TTL because it actively maintains the lease
    """
    
    def __init__(self, lock_key: str, lease_duration_seconds: int = 30):
        self.lock_key = lock_key
        self.lease_duration = lease_duration_seconds
        self.current_lease: Optional[LeaseInfo] = None
        self.heartbeat_thread: Optional[threading.Thread] = None
        self.should_renew = False
        self.fencing_token_counter = 0
    
    def acquire(self, client_id: str) -> Optional[int]:
        """
        Acquire lock and return fencing token
        Returns None if lock is held by another client
        """
        now = datetime.now()
        
        # Check if lock is available
        if self.current_lease and self.current_lease.expires_at > now:
            if self.current_lease.client_id != client_id:
                return None  # Lock held by someone else
        
        # Acquire lock
        self.fencing_token_counter += 1
        lock_id = f"{client_id}-{time.time()}"
        
        self.current_lease = LeaseInfo(
            lock_id=lock_id,
            client_id=client_id,
            expires_at=now + timedelta(seconds=self.lease_duration),
            fencing_token=self.fencing_token_counter
        )
        
        # Start heartbeat thread to renew lease
        self.should_renew = True
        self.heartbeat_thread = threading.Thread(
            target=self._heartbeat_loop,
            args=(lock_id,),
            daemon=True
        )
        self.heartbeat_thread.start()
        
        print(f"[{client_id}] Acquired lock with fencing token {self.fencing_token_counter}")
        return self.fencing_token_counter
    
    def release(self, client_id: str):
        """Release lock"""
        if self.current_lease and self.current_lease.client_id == client_id:
            self.should_renew = False
            self.current_lease = None
            print(f"[{client_id}] Released lock")
    
    def _heartbeat_loop(self, lock_id: str):
        """Background thread that renews lease periodically"""
        renewal_interval = self.lease_duration / 3  # Renew at 1/3 of lease duration
        
        while self.should_renew:
            time.sleep(renewal_interval)
            
            if not self.should_renew:
                break
            
            # Renew lease
            if self.current_lease and self.current_lease.lock_id == lock_id:
                self.current_lease.expires_at = (
                    datetime.now() + timedelta(seconds=self.lease_duration)
                )
                print(f"[{self.current_lease.client_id}] Renewed lease")
            else:
                # Our lease was taken over, stop renewing
                break
    
    def is_held_by(self, client_id: str) -> bool:
        """Check if lock is currently held by client"""
        if not self.current_lease:
            return False
        
        if self.current_lease.expires_at < datetime.now():
            return False
        
        return self.current_lease.client_id == client_id

class ProtectedResource:
    """Resource protected by fencing tokens"""
    
    def __init__(self):
        self.last_fencing_token = 0
        self.data = None
    
    def write(self, fencing_token: int, data: str) -> bool:
        """
        Write to resource with fencing token protection
        Rejects writes with stale tokens
        """
        if fencing_token <= self.last_fencing_token:
            print(
                f"Rejected write with stale token {fencing_token} "
                f"(current: {self.last_fencing_token})"
            )
            return False
        
        # Accept write
        self.data = data
        self.last_fencing_token = fencing_token
        print(f"Accepted write with token {fencing_token}: {data}")
        return True

# Demonstrate lease-based lock with fencing
def demonstrate_lease_lock():
    print("=== Lease-Based Lock with Fencing Demo ===\n")
    
    lock = LeaseBasedLock("resource-1", lease_duration_seconds=10)
    resource = ProtectedResource()
    
    # Client 1 acquires lock
    token1 = lock.acquire("client-1")
    if token1:
        resource.write(token1, "Data from client-1")
        time.sleep(2)
    
    # Client 1 still holds the lock (heartbeat keeps renewing it)
    token2_attempt = lock.acquire("client-2")
    print(f"Client-2 acquire attempt: {'Success' if token2_attempt else 'Failed (expected)'}\n")
    
    # Client 1 releases
    lock.release("client-1")
    time.sleep(1)
    
    # Now Client 2 can acquire
    token2 = lock.acquire("client-2")
    if token2:
        resource.write(token2, "Data from client-2")
        
        # Simulate Client 1 trying to write with old token
        resource.write(token1, "Stale data from client-1")  # Rejected!
        
        lock.release("client-2")

demonstrate_lease_lock()

When NOT to Use Distributed Locks#

Distributed locks are often the wrong solution. Consider alternatives:

Use Optimistic Concurrency Control Instead:

@Service
public class OptimisticConcurrencyService {
    
    @Autowired
    private ResourceRepository repository;
    
    /**
     * Update resource using optimistic locking
     * No distributed lock needed!
     */
    // NOTE: the retry loop must NOT sit inside a single @Transactional --
    // the first OptimisticLockException marks that transaction
    // rollback-only and every retry would fail. Let each attempt run in
    // its own transaction (the Spring Data repository methods below do).
    public void updateResource(Long resourceId, String newData) {
        int maxRetries = 3;
        int attempt = 0;
        
        while (attempt < maxRetries) {
            try {
                // Read current version
                Resource resource = repository.findById(resourceId)
                    .orElseThrow(() -> new ResourceNotFoundException());
                
                // Update data
                resource.setData(newData);
                
                // Save with version check; flush immediately so a
                // conflicting concurrent update surfaces here rather
                // than at commit time
                repository.saveAndFlush(resource);
                
                System.out.printf(
                    "Updated resource %d (version %d)%n",
                    resourceId,
                    resource.getVersion()
                );
                
                return;
                
            } catch (OptimisticLockException | ObjectOptimisticLockingFailureException e) {
                attempt++;
                System.out.printf(
                    "Optimistic lock conflict, retry %d/%d%n",
                    attempt,
                    maxRetries
                );
                
                if (attempt >= maxRetries) {
                    throw new ConcurrentModificationException(
                        "Failed to update after " + maxRetries + " attempts"
                    );
                }
                
                // Exponential backoff
                try {
                    Thread.sleep((long) Math.pow(2, attempt) * 100);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(ie);
                }
            }
        }
    }
}

@Entity
@Table(name = "resources")
public class Resource {
    @Id
    private Long id;
    
    private String data;
    
    @Version  // Optimistic locking
    private Long version;
    
    // Getters and setters
}

Use Idempotent Operations:

Instead of locks, make operations idempotent.

// C#: Idempotent operations instead of locks
public class IdempotentPaymentService
{
    private readonly IPaymentRepository _payments;
    
    public async Task<PaymentResult> ProcessPayment(
        string idempotencyKey,
        decimal amount,
        string customerId)
    {
        // Check if payment already processed
        var existing = await _payments.FindByIdempotencyKeyAsync(idempotencyKey);
        if (existing != null)
        {
            // Already processed, return existing result
            return new PaymentResult
            {
                Status = existing.Status,
                TransactionId = existing.TransactionId,
                Message = "Payment already processed (idempotent)"
            };
        }
        
        // Process payment (atomic database operation ensures single processing)
        try
        {
            var payment = new Payment
            {
                IdempotencyKey = idempotencyKey,
                Amount = amount,
                CustomerId = customerId,
                Status = PaymentStatus.Processing,
                CreatedAt = DateTime.UtcNow
            };
            
            // Insert with unique constraint on idempotency_key
            await _payments.InsertAsync(payment);
            
            // Process payment
            var transactionId = await ProcessWithPaymentGateway(amount, customerId);
            
            // Update status
            payment.Status = PaymentStatus.Completed;
            payment.TransactionId = transactionId;
            await _payments.UpdateAsync(payment);
            
            return new PaymentResult
            {
                Status = PaymentStatus.Completed,
                TransactionId = transactionId,
                Message = "Payment processed successfully"
            };
        }
        catch (UniqueConstraintViolationException)
        {
            // Another instance inserted this key first; return its
            // result (reassign the outer `existing` -- redeclaring it
            // in this nested scope would not compile)
            existing = await _payments.FindByIdempotencyKeyAsync(idempotencyKey);
            return new PaymentResult
            {
                Status = existing.Status,
                TransactionId = existing.TransactionId,
                Message = "Payment processed by another instance"
            };
        }
    }
}

Comparison Table#

| Approach           | Correctness  | Performance | Complexity | When to Use                   |
|--------------------|--------------|-------------|------------|-------------------------------|
| Redis SET NX       | Dangerous    | Fast        | Low        | Never (missing fencing)       |
| Redlock            | Questionable | Moderate    | High       | Avoid if possible             |
| ZooKeeper/etcd     | Correct      | Moderate    | Moderate   | Leader election, coordination |
| Lease-based        | Correct      | Moderate    | Moderate   | Longer critical sections      |
| Optimistic locking | Correct      | Fast        | Low        | Database operations           |
| Idempotency        | Correct      | Fast        | Low        | API operations                |

Best Practices#

  • Always use fencing tokens: protect resources against delayed lock holders.
  • Minimize critical section duration: hold locks for the shortest time possible.
  • Implement lock timeouts: prevent indefinite waits.
  • Use consensus systems for correctness: choose ZooKeeper/etcd over Redis for safety-critical locks.
  • Consider lock-free alternatives: optimistic concurrency and idempotency are often better.
  • Monitor lock contention: track wait times and contention metrics.
  • Test failure scenarios: use chaos engineering to validate lock behavior.
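
Several of these practices compose naturally. Here is a hedged Python sketch (`StubLock` is a hypothetical stand-in for any real lock client) of bounded-wait acquisition with jittered exponential backoff, returning the wait time and attempt count so callers can feed contention metrics:

```python
import random
import time

def acquire_with_timeout(lock, client_id, timeout_s, base_backoff_s=0.01):
    """Try to acquire `lock` for at most `timeout_s` seconds.

    Returns (token_or_None, wait_seconds, attempts); the latter two are
    the raw material for contention monitoring.
    """
    start = time.monotonic()
    deadline = start + timeout_s
    attempts = 0
    while time.monotonic() < deadline:
        attempts += 1
        token = lock.acquire(client_id)
        if token is not None:
            return token, time.monotonic() - start, attempts
        # Exponential backoff with jitter to avoid a thundering herd
        sleep = min(base_backoff_s * 2 ** attempts, 0.2) * random.random()
        time.sleep(sleep)
    return None, time.monotonic() - start, attempts

class StubLock:
    """Stand-in for a real lock client; becomes free after N calls."""
    def __init__(self, free_after):
        self.calls, self.free_after, self.token = 0, free_after, 0

    def acquire(self, client_id):
        self.calls += 1
        if self.calls >= self.free_after:
            self.token += 1
            return self.token
        return None

token, waited, attempts = acquire_with_timeout(
    StubLock(free_after=3), "client-1", timeout_s=1.0
)
```

The same wrapper works unchanged over any client exposing a non-blocking `acquire`; on timeout it returns `None` instead of blocking forever.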

Summary#

Distributed locks are difficult to implement correctly and often unnecessary. Redis-based locks without fencing tokens are unsafe. Redlock has fundamental correctness issues. For critical coordination, use consensus-based systems like ZooKeeper or etcd. Better yet, design systems using optimistic concurrency control, idempotent operations, or CRDTs to avoid locks entirely.

Key takeaways:

  • Distributed locks require fencing tokens for correctness
  • Redis SET NX alone is insufficient for safety
  • Redlock has serious problems despite majority quorum
  • ZooKeeper and etcd provide correct distributed locks
  • Optimistic concurrency often better than locks
  • Design for lock-free operation when possible