## Introduction
Distributed locks are one of the most misunderstood primitives in distributed systems. While they seem conceptually simple, implementing them correctly requires deep understanding of failure modes, timing assumptions, and consistency guarantees. This post explores what makes a distributed lock safe, common pitfalls, and when alternative approaches are better.
## Why Distributed Locks Are Hard
Unlike single-process locks, distributed locks must handle:
- **Network Partitions**: Messages can be delayed, lost, or reordered
- **Process Pauses**: GC pauses, context switches, paging
- **Clock Drift**: Physical clocks are never perfectly synchronized
- **Partial Failures**: Some nodes may fail while others continue
- **Byzantine Faults**: Nodes may behave maliciously or incorrectly
```python
# Python: Demonstrating why naive distributed locks fail
import time
import threading


class NaiveDistributedLock:
    """
    DANGEROUS: This implementation has multiple serious flaws.
    DO NOT USE IN PRODUCTION.
    """

    def __init__(self):
        self.holder = None
        self.expiry = 0

    def acquire(self, client_id: str, ttl_seconds: int = 10) -> bool:
        """Attempt to acquire the lock."""
        current_time = time.time()
        # Check-then-act is not atomic: two clients can both pass this
        # check and both believe they hold the lock
        if self.holder is None or current_time > self.expiry:
            self.holder = client_id
            self.expiry = current_time + ttl_seconds
            return True
        return False

    def release(self, client_id: str):
        """Release the lock."""
        if self.holder == client_id:
            self.holder = None
            self.expiry = 0


# Demonstrate the race condition
def demonstrate_race_condition():
    lock = NaiveDistributedLock()
    shared_counter = {'value': 0}

    def critical_section(client_id):
        if lock.acquire(client_id):
            print(f"[{client_id}] Acquired lock")
            # Simulate work in the critical section
            current = shared_counter['value']
            time.sleep(0.01)  # Simulate processing
            shared_counter['value'] = current + 1
            print(f"[{client_id}] Counter: {shared_counter['value']}")
            lock.release(client_id)
        else:
            print(f"[{client_id}] Failed to acquire lock")

    # Multiple threads trying to acquire the lock
    threads = []
    for i in range(5):
        t = threading.Thread(target=critical_section, args=(f"client-{i}",))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

    print(f"\nFinal counter value: {shared_counter['value']}")
    print("Expected: 5, but might be less due to race conditions!")


demonstrate_race_condition()
```
## Safety Properties of Distributed Locks
A correct distributed lock must guarantee:
- **Mutual Exclusion**: At most one client holds the lock at any time
- **Deadlock-Free**: The system eventually makes progress
- **Fault Tolerance**: The lock remains functional despite node failures
- **Bounded Wait**: Lock acquisition doesn't wait indefinitely
## The Fencing Token Pattern
The key to safe distributed locks is using fencing tokens to detect stale lock holders.
```java
// Java/Spring Boot: Distributed lock with fencing tokens (Redis-backed)
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.Collections;
import java.util.UUID;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.stereotype.Service;

@Service
public class FencedDistributedLock {

    private final RedisTemplate<String, String> redis;
    private final String lockKey;

    @Autowired
    public FencedDistributedLock(
            RedisTemplate<String, String> redis,
            @Value("${lock.key}") String lockKey
    ) {
        this.redis = redis;
        this.lockKey = lockKey;
    }

    /**
     * Acquire lock with fencing token.
     * @return LockHandle with fencing token, or null if acquisition failed
     */
    public LockHandle acquire(String clientId, Duration ttl) {
        String lockValue = UUID.randomUUID().toString();

        // Use Redis SET NX EX for atomic lock acquisition
        Boolean acquired = redis.opsForValue().setIfAbsent(lockKey, lockValue, ttl);

        if (Boolean.TRUE.equals(acquired)) {
            // Fencing tokens must increase monotonically across ALL clients,
            // so generate them with a shared Redis counter (an in-process
            // AtomicLong would reset on restart and diverge between instances)
            Long fencingToken = redis.opsForValue().increment(lockKey + ":counter");

            // Store the fencing token alongside the lock
            redis.opsForHash().put(
                    lockKey + ":fence",
                    lockValue,
                    String.valueOf(fencingToken)
            );
            return new LockHandle(clientId, lockValue, fencingToken,
                    Instant.now().plus(ttl));
        }
        return null;
    }

    /**
     * Release the lock only if the caller still holds it.
     */
    public boolean release(LockHandle handle) {
        // Lua script for atomic check-and-delete
        String luaScript =
                "if redis.call('get', KEYS[1]) == ARGV[1] then " +
                "  redis.call('del', KEYS[1]) " +
                "  redis.call('hdel', KEYS[2], ARGV[1]) " +
                "  return 1 " +
                "else " +
                "  return 0 " +
                "end";
        Long result = redis.execute(
                new DefaultRedisScript<>(luaScript, Long.class),
                Arrays.asList(lockKey, lockKey + ":fence"),
                handle.getLockValue()
        );
        return result != null && result == 1;
    }

    /**
     * Refresh the lock TTL if the caller still holds it.
     */
    public boolean refresh(LockHandle handle, Duration newTtl) {
        String luaScript =
                "if redis.call('get', KEYS[1]) == ARGV[1] then " +
                "  redis.call('expire', KEYS[1], ARGV[2]) " +
                "  return 1 " +
                "else " +
                "  return 0 " +
                "end";
        Long result = redis.execute(
                new DefaultRedisScript<>(luaScript, Long.class),
                Collections.singletonList(lockKey),
                handle.getLockValue(),
                String.valueOf(newTtl.getSeconds())
        );
        if (result != null && result == 1) {
            handle.setExpiry(Instant.now().plus(newTtl));
            return true;
        }
        return false;
    }
}

public class LockHandle {
    private final String clientId;
    private final String lockValue;
    private final long fencingToken;
    private Instant expiry;

    public LockHandle(String clientId, String lockValue,
                      long fencingToken, Instant expiry) {
        this.clientId = clientId;
        this.lockValue = lockValue;
        this.fencingToken = fencingToken;
        this.expiry = expiry;
    }

    public long getFencingToken() {
        return fencingToken;
    }

    public String getLockValue() {
        return lockValue;
    }

    public boolean isExpired() {
        return Instant.now().isAfter(expiry);
    }

    void setExpiry(Instant expiry) {
        this.expiry = expiry;
    }
}

@Service
public class ProtectedResourceService {

    private final FencedDistributedLock lock;
    private long lastFencingToken = 0;

    public ProtectedResourceService(FencedDistributedLock lock) {
        this.lock = lock;
    }

    /**
     * Access the resource with fencing token protection. The stale-token
     * check and the update must happen atomically, hence synchronized.
     */
    public synchronized void updateResource(LockHandle lockHandle, String data) {
        long fencingToken = lockHandle.getFencingToken();

        // Reject requests carrying stale fencing tokens
        if (fencingToken <= lastFencingToken) {
            throw new StaleTokenException(String.format(
                    "Stale fencing token %d, current is %d",
                    fencingToken, lastFencingToken));
        }

        // Update the resource, then record the token we accepted
        performUpdate(data);
        lastFencingToken = fencingToken;

        System.out.printf("Resource updated with fencing token %d%n", fencingToken);
    }

    private void performUpdate(String data) {
        // Actual resource update logic
    }
}
```
## Redlock Algorithm Analysis
Redis’s Redlock algorithm attempts to provide distributed locks using multiple Redis instances. However, it has serious problems.
```csharp
// C#: Redlock implementation with commentary on its flaws
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public class RedlockImplementation
{
    private readonly List<IRedisClient> _redisInstances;
    private readonly int _quorumSize;
    private const int ClockDriftFactorMs = 100; // Clock-drift allowance, in milliseconds

    public RedlockImplementation(List<IRedisClient> redisInstances)
    {
        _redisInstances = redisInstances;
        _quorumSize = (_redisInstances.Count / 2) + 1;
    }

    public async Task<RedlockHandle?> AcquireLock(
        string resource,
        TimeSpan ttl,
        CancellationToken cancellationToken = default)
    {
        var lockValue = Guid.NewGuid().ToString();
        var startTime = DateTimeOffset.UtcNow;

        // Try to acquire the lock on every instance concurrently
        var tasks = _redisInstances.Select(async redis =>
        {
            try
            {
                return await redis.SetIfNotExistsAsync(
                    resource, lockValue, ttl, cancellationToken);
            }
            catch
            {
                return false; // Treat unreachable instances as failed acquisitions
            }
        });
        var results = await Task.WhenAll(tasks);
        int successCount = results.Count(r => r);

        var elapsedTime = DateTimeOffset.UtcNow - startTime;

        // The lock is only useful for the time remaining after acquisition
        // overhead and the clock-drift allowance are subtracted
        var validityTime = ttl - elapsedTime -
            TimeSpan.FromMilliseconds(ClockDriftFactorMs);

        if (successCount >= _quorumSize && validityTime > TimeSpan.Zero)
        {
            return new RedlockHandle
            {
                Resource = resource,
                LockValue = lockValue,
                ValidUntil = DateTimeOffset.UtcNow.Add(validityTime)
            };
        }

        // Failed to reach quorum: release the locks we did get
        await ReleaseLock(resource, lockValue);
        return null;
    }

    public async Task ReleaseLock(string resource, string lockValue)
    {
        // Lua script for atomic check-and-delete
        const string luaScript = @"
            if redis.call('get', KEYS[1]) == ARGV[1] then
                return redis.call('del', KEYS[1])
            else
                return 0
            end";

        var tasks = _redisInstances.Select(redis =>
            redis.EvalAsync(luaScript, new[] { resource }, new[] { lockValue }));
        await Task.WhenAll(tasks);
    }
}

/*
 * PROBLEMS WITH REDLOCK:
 *
 * 1. Timing assumptions:
 *    - Assumes bounded network delays (not guaranteed)
 *    - Assumes bounded process pauses (a GC pause can outlast the TTL)
 *    - Assumes bounded clock drift across Redis instances
 *
 * 2. The process-pause problem:
 *    Client A acquires the lock.
 *    Client A pauses (GC/context switch) for longer than the TTL.
 *    The lock expires; client B acquires it.
 *    Client A resumes, still believing it holds the lock.
 *    BOTH clients are in the critical section!
 *
 * 3. No fencing tokens:
 *    - Stale lock holders cannot be detected
 *    - Resource access is not protected against delayed operations
 *
 * 4. Overlapping quorums after crash-restart:
 *    - If a Redis node crashes and restarts without durably persisting the
 *      lock key, a second client can assemble a quorum that overlaps the
 *      first client's, so both believe they hold the lock.
 *
 * Martin Kleppmann's critique: Redlock is "neither fish nor fowl" -- too
 * heavyweight and expensive for efficiency-only locks, yet not safe enough
 * for cases where correctness depends on the lock.
 */

public class RedlockHandle
{
    public string Resource { get; set; }
    public string LockValue { get; set; }
    public DateTimeOffset ValidUntil { get; set; }
}
```
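Problem 2 above, the process pause, is worth seeing concretely. A minimal stdlib-Python simulation — the `TtlLockServer` class, client names, and TTLs are all invented for illustration, and a sleep stands in for a GC pause:

```python
import time

class TtlLockServer:
    """Toy lock server: grants the lock to whoever asks when it is free or expired."""
    def __init__(self):
        self.holder, self.expiry = None, 0.0

    def acquire(self, client, ttl):
        now = time.monotonic()
        if self.holder is None or now > self.expiry:
            self.holder, self.expiry = client, now + ttl
            return True
        return False

server = TtlLockServer()
writes = []

# Client A acquires the lock with a 0.1 s TTL...
assert server.acquire("A", ttl=0.1)
# ...then stalls (GC pause, VM migration, swap) for longer than the TTL.
time.sleep(0.2)

# Meanwhile the TTL has lapsed, so client B acquires the "same" lock.
assert server.acquire("B", ttl=10)
writes.append("B")

# Client A resumes. It never observed the expiry, so it writes anyway.
writes.append("A")  # mutual exclusion violated: both clients wrote

print(writes)  # ['B', 'A'] -- two holders acted on the resource
```

A fencing token checked on the resource side is what turns client A's late write into a detectable rejection instead of silent corruption.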
## Correct Approach: Consensus-Based Locks
Use consensus systems like Chubby, etcd, or ZooKeeper for correct distributed locks.
```javascript
// Node.js: ZooKeeper-based distributed lock (correct implementation)
const zookeeper = require('node-zookeeper-client');

class ZooKeeperDistributedLock {
  constructor(zkClient, lockPath) {
    this.zkClient = zkClient;
    this.lockPath = lockPath;
    this.sequentialNode = null;
  }

  async acquire(clientId) {
    // Create the lock directory if it doesn't exist
    await this.ensureLockPath();

    // Create an ephemeral sequential node; it disappears automatically
    // if this client's session dies
    const nodePath = await this.createSequentialNode(clientId);
    this.sequentialNode = nodePath;
    console.log(`[${clientId}] Created node: ${nodePath}`);

    while (true) {
      const children = await this.getChildren();
      const sortedChildren = children.sort();
      const myNode = nodePath.split('/').pop();
      const myIndex = sortedChildren.indexOf(myNode);

      if (myIndex === 0) {
        // Lowest sequence number holds the lock
        console.log(`[${clientId}] Acquired lock`);
        return {
          clientId,
          nodePath,
          release: () => this.release()
        };
      }

      // Watch only the node immediately before ours (avoids the herd effect)
      const predecessor = sortedChildren[myIndex - 1];
      console.log(`[${clientId}] Waiting for ${predecessor} to release lock`);
      await this.waitForNodeDeleted(predecessor);
      // Loop: re-check the children rather than assuming we now hold the lock
    }
  }

  async release() {
    if (this.sequentialNode) {
      await this.deleteNode(this.sequentialNode);
      console.log(`Released lock: ${this.sequentialNode}`);
      this.sequentialNode = null;
    }
  }

  ensureLockPath() {
    // mkdirp creates intermediate znodes as needed and succeeds
    // if the path already exists
    return new Promise((resolve, reject) => {
      this.zkClient.mkdirp(this.lockPath, (error) =>
        error ? reject(error) : resolve()
      );
    });
  }

  createSequentialNode(clientId) {
    return new Promise((resolve, reject) => {
      this.zkClient.create(
        `${this.lockPath}/lock-`,
        Buffer.from(clientId),
        zookeeper.CreateMode.EPHEMERAL_SEQUENTIAL,
        (error, path) => (error ? reject(error) : resolve(path))
      );
    });
  }

  getChildren() {
    return new Promise((resolve, reject) => {
      this.zkClient.getChildren(
        this.lockPath,
        (error, children) => (error ? reject(error) : resolve(children))
      );
    });
  }

  // Resolves when the given node is deleted, or immediately if it is
  // already gone (otherwise we could wait forever on a vanished node)
  waitForNodeDeleted(nodeName) {
    return new Promise((resolve, reject) => {
      this.zkClient.exists(
        `${this.lockPath}/${nodeName}`,
        (event) => {
          if (event.getType() === zookeeper.Event.NODE_DELETED) {
            resolve();
          }
        },
        (error, stat) => {
          if (error) {
            reject(error);
          } else if (!stat) {
            resolve(); // Node vanished before the watch was set
          }
        }
      );
    });
  }

  deleteNode(path) {
    return new Promise((resolve, reject) => {
      this.zkClient.remove(path, -1, (error) =>
        error ? reject(error) : resolve()
      );
    });
  }
}

/*
 * WHY ZOOKEEPER LOCKS ARE CORRECT:
 *
 * 1. Consensus-based: uses the Zab protocol (similar to Raft/Paxos)
 * 2. Ephemeral nodes: automatically released when the client session ends
 * 3. Sequential ordering: provides a total order over lock requests
 * 4. Watch mechanism: efficient waiting without polling
 * 5. No timing assumptions about the lock service itself; note that a
 *    paused client still needs fencing (e.g. the znode's sequence number
 *    checked by the resource) to be fully safe
 * 6. Linearizable writes: strong consistency for lock operations
 */

// Usage example
async function example() {
  const client = zookeeper.createClient('localhost:2181');

  client.once('connected', async () => {
    console.log('Connected to ZooKeeper');
    const lock = new ZooKeeperDistributedLock(
      client,
      '/distributed-locks/my-resource'
    );

    try {
      const handle = await lock.acquire('client-1');

      // Critical section
      console.log('Executing critical section...');
      await new Promise(resolve => setTimeout(resolve, 2000));

      await handle.release();
    } catch (error) {
      console.error('Lock error:', error);
    } finally {
      client.close();
    }
  });

  client.connect();
}
```
## Lease-Based Locks with Heartbeats
```python
# Python: Lease-based lock with heartbeat renewal
# (single-process simulation of the pattern; a real implementation would
# keep the lease state in a shared store)
import threading
import time
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class LeaseInfo:
    """Lease information for a distributed lock"""
    lock_id: str
    client_id: str
    expires_at: datetime
    fencing_token: int


class LeaseBasedLock:
    """
    Lease-based distributed lock with automatic renewal.
    Safer than a fixed TTL because the holder actively maintains the lease.
    """

    def __init__(self, lock_key: str, lease_duration_seconds: int = 30):
        self.lock_key = lock_key
        self.lease_duration = lease_duration_seconds
        self.current_lease: Optional[LeaseInfo] = None
        self.heartbeat_thread: Optional[threading.Thread] = None
        self.should_renew = False
        self.fencing_token_counter = 0

    def acquire(self, client_id: str) -> Optional[int]:
        """
        Acquire the lock and return a fencing token,
        or None if the lock is held by another client.
        """
        now = datetime.now()

        # Check whether the lock is available
        if self.current_lease and self.current_lease.expires_at > now:
            if self.current_lease.client_id != client_id:
                return None  # Lock held by someone else

        # Acquire the lock with a fresh, monotonically increasing token
        self.fencing_token_counter += 1
        lock_id = f"{client_id}-{time.time()}"
        self.current_lease = LeaseInfo(
            lock_id=lock_id,
            client_id=client_id,
            expires_at=now + timedelta(seconds=self.lease_duration),
            fencing_token=self.fencing_token_counter,
        )

        # Start a heartbeat thread to renew the lease
        self.should_renew = True
        self.heartbeat_thread = threading.Thread(
            target=self._heartbeat_loop,
            args=(lock_id,),
            daemon=True,
        )
        self.heartbeat_thread.start()

        print(f"[{client_id}] Acquired lock with fencing token "
              f"{self.fencing_token_counter}")
        return self.fencing_token_counter

    def release(self, client_id: str):
        """Release the lock"""
        if self.current_lease and self.current_lease.client_id == client_id:
            self.should_renew = False
            self.current_lease = None
            print(f"[{client_id}] Released lock")

    def _heartbeat_loop(self, lock_id: str):
        """Background thread that renews the lease periodically"""
        renewal_interval = self.lease_duration / 3  # Renew well before expiry

        while self.should_renew:
            time.sleep(renewal_interval)
            if not self.should_renew:
                break
            if self.current_lease and self.current_lease.lock_id == lock_id:
                self.current_lease.expires_at = (
                    datetime.now() + timedelta(seconds=self.lease_duration)
                )
                print(f"[{self.current_lease.client_id}] Renewed lease")
            else:
                break  # Our lease was released or taken over; stop renewing

    def is_held_by(self, client_id: str) -> bool:
        """Check whether the lock is currently held by this client"""
        if not self.current_lease:
            return False
        if self.current_lease.expires_at < datetime.now():
            return False
        return self.current_lease.client_id == client_id


class ProtectedResource:
    """Resource protected by fencing tokens"""

    def __init__(self):
        self.last_fencing_token = 0
        self.data = None

    def write(self, fencing_token: int, data: str) -> bool:
        """
        Write to the resource with fencing-token protection.
        Writes carrying stale tokens are rejected.
        """
        if fencing_token <= self.last_fencing_token:
            print(
                f"Rejected write with stale token {fencing_token} "
                f"(current: {self.last_fencing_token})"
            )
            return False

        self.data = data
        self.last_fencing_token = fencing_token
        print(f"Accepted write with token {fencing_token}: {data}")
        return True


# Demonstrate the lease-based lock with fencing
def demonstrate_lease_lock():
    print("=== Lease-Based Lock with Fencing Demo ===\n")
    lock = LeaseBasedLock("resource-1", lease_duration_seconds=10)
    resource = ProtectedResource()

    # Client 1 acquires the lock
    token1 = lock.acquire("client-1")
    if token1:
        resource.write(token1, "Data from client-1")
    time.sleep(2)

    # Client 2 cannot acquire while client 1's heartbeat keeps renewing
    blocked = lock.acquire("client-2")
    print(f"Client-2 acquire attempt: "
          f"{'Success' if blocked else 'Failed (expected)'}\n")

    # Client 1 releases
    lock.release("client-1")
    time.sleep(1)

    # Now client 2 can acquire
    token2 = lock.acquire("client-2")
    if token2:
        resource.write(token2, "Data from client-2")

    # Client 1 tries to write with its old token: rejected by fencing
    resource.write(token1, "Stale data from client-1")

    lock.release("client-2")


demonstrate_lease_lock()
```
## When NOT to Use Distributed Locks
Distributed locks are often the wrong solution. Consider alternatives:
**Use Optimistic Concurrency Control Instead:**
```java
// Java/Spring: Optimistic concurrency control -- no distributed lock needed
import java.util.ConcurrentModificationException;

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;
import javax.persistence.Version;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.dao.OptimisticLockingFailureException;
import org.springframework.stereotype.Service;

@Service
public class OptimisticConcurrencyService {

    @Autowired
    private ResourceRepository repository;

    /**
     * Update a resource using optimistic locking. Each attempt runs in its
     * own transaction; do NOT wrap this retry loop in a single @Transactional
     * method, because a version conflict marks that transaction
     * rollback-only and every retry inside it would fail.
     */
    public void updateResource(Long resourceId, String newData) {
        int maxRetries = 3;
        int attempt = 0;

        while (attempt < maxRetries) {
            try {
                // Read the current version
                Resource resource = repository.findById(resourceId)
                        .orElseThrow(ResourceNotFoundException::new);

                // Update the data
                resource.setData(newData);

                // Save with version check; if the version changed since the
                // read, Spring throws OptimisticLockingFailureException
                repository.save(resource);

                System.out.printf("Updated resource %d (version %d)%n",
                        resourceId, resource.getVersion());
                return;
            } catch (OptimisticLockingFailureException e) {
                attempt++;
                System.out.printf("Optimistic lock conflict, retry %d/%d%n",
                        attempt, maxRetries);

                if (attempt >= maxRetries) {
                    throw new ConcurrentModificationException(
                            "Failed to update after " + maxRetries + " attempts");
                }

                // Exponential backoff before retrying
                try {
                    Thread.sleep((long) Math.pow(2, attempt) * 100);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(ie);
                }
            }
        }
    }
}

@Entity
@Table(name = "resources")
public class Resource {

    @Id
    private Long id;

    private String data;

    @Version // JPA optimistic locking: incremented on every committed update
    private Long version;

    // Getters and setters
}
```
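Framework aside, the retry-on-version-conflict loop is a small amount of logic. A stdlib-Python sketch against an in-memory store — the `VersionedStore` class is illustrative, and its internal mutex stands in for the atomicity a real database provides at commit time:

```python
import threading

class VersionedStore:
    """In-memory store with compare-and-swap on a version number."""
    def __init__(self, data=None):
        self._lock = threading.Lock()  # stands in for the database's atomicity
        self.data, self.version = data, 0

    def read(self):
        with self._lock:
            return self.data, self.version

    def cas_write(self, expected_version, new_data):
        """Write only if nobody else wrote since our read. Returns success."""
        with self._lock:
            if self.version != expected_version:
                return False  # conflict: someone committed first
            self.data, self.version = new_data, self.version + 1
            return True

def update_with_retries(store, transform, max_retries=3):
    """Re-read, re-apply, and re-attempt on conflict, up to max_retries."""
    for _ in range(max_retries):
        data, version = store.read()
        if store.cas_write(version, transform(data)):
            return True
    return False

store = VersionedStore(data="v0")

# Two readers both observe version 0.
data1, ver1 = store.read()
data2, ver2 = store.read()

# Reader 1 commits first; the version moves to 1.
assert store.cas_write(ver1, "from-reader-1")

# Reader 2's write is rejected: its version is stale.
assert not store.cas_write(ver2, "from-reader-2")

# Reader 2 retries: re-read, re-apply, commit.
assert update_with_retries(store, lambda d: d + "+retry")
print(store.read())  # ('from-reader-1+retry', 2)
```

No client ever blocks another; conflicts are detected at write time and resolved by retrying against fresh state.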
**Use Idempotent Operations:**
Instead of locks, make operations idempotent.
```csharp
// C#: Idempotent operations instead of locks
using System;
using System.Threading.Tasks;

public class IdempotentPaymentService
{
    private readonly IPaymentRepository _payments;

    public IdempotentPaymentService(IPaymentRepository payments)
    {
        _payments = payments;
    }

    public async Task<PaymentResult> ProcessPayment(
        string idempotencyKey,
        decimal amount,
        string customerId)
    {
        // Check whether this payment was already processed
        var existing = await _payments.FindByIdempotencyKeyAsync(idempotencyKey);
        if (existing != null)
        {
            // Already processed: return the original result
            return new PaymentResult
            {
                Status = existing.Status,
                TransactionId = existing.TransactionId,
                Message = "Payment already processed (idempotent)"
            };
        }

        try
        {
            var payment = new Payment
            {
                IdempotencyKey = idempotencyKey,
                Amount = amount,
                CustomerId = customerId,
                Status = PaymentStatus.Processing,
                CreatedAt = DateTime.UtcNow
            };

            // Insert with a UNIQUE constraint on idempotency_key; the
            // database, not a lock, guarantees at-most-once processing
            await _payments.InsertAsync(payment);

            // Process the payment
            var transactionId = await ProcessWithPaymentGateway(amount, customerId);

            // Record the outcome
            payment.Status = PaymentStatus.Completed;
            payment.TransactionId = transactionId;
            await _payments.UpdateAsync(payment);

            return new PaymentResult
            {
                Status = PaymentStatus.Completed,
                TransactionId = transactionId,
                Message = "Payment processed successfully"
            };
        }
        catch (UniqueConstraintViolationException)
        {
            // Another instance won the insert race; return its result
            var concurrent = await _payments.FindByIdempotencyKeyAsync(idempotencyKey);
            return new PaymentResult
            {
                Status = concurrent.Status,
                TransactionId = concurrent.TransactionId,
                Message = "Payment processed by another instance"
            };
        }
    }
}
```
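Stripped of the persistence details, the pattern is: look up the idempotency key, and only process on a miss. A stdlib-Python sketch where a dict stands in for a table with a unique constraint — the `IdempotentProcessor` class and key names are illustrative; across processes, the in-memory check must be replaced by the database's constraint:

```python
class IdempotentProcessor:
    """Processes each idempotency key at most once; repeats return the cached result."""

    def __init__(self):
        # idempotency_key -> result (a UNIQUE column in a real database)
        self._results = {}

    def process(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # Replay of a retried request: no double charge
            return self._results[idempotency_key]
        result = {"charged": amount, "status": "completed"}
        self._results[idempotency_key] = result
        return result

processor = IdempotentProcessor()
first = processor.process("order-42", 99.0)
retry = processor.process("order-42", 99.0)  # client retried after a timeout
print(first is retry)  # True -- the retry observed the original result
```

Because retries are harmless, the client needs no coordination at all: it can simply resend on timeout.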
## Comparison Table
| Approach | Correctness | Performance | Complexity | When to Use |
|---|---|---|---|---|
| Redis SET NX | Dangerous | Fast | Low | Never (missing fencing) |
| Redlock | Questionable | Moderate | High | Avoid if possible |
| ZooKeeper/etcd | Correct | Moderate | Moderate | Leader election, coordination |
| Lease-based | Correct | Moderate | Moderate | Longer critical sections |
| Optimistic Locking | Correct | Fast | Low | Database operations |
| Idempotency | Correct | Fast | Low | API operations |
## Best Practices
- **Always Use Fencing Tokens**: Protect resources against delayed lock holders.
- **Minimize Critical Section Duration**: Hold locks for the shortest time possible.
- **Implement Lock Timeouts**: Prevent indefinite waits.
- **Use Consensus Systems for Correctness**: Choose ZooKeeper/etcd over Redis for safety-critical locks.
- **Consider Lock-Free Alternatives**: Optimistic concurrency and idempotency are often better.
- **Monitor Lock Contention**: Track wait times and contention metrics.
- **Test Failure Scenarios**: Use chaos engineering to validate lock behavior.
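The lock-timeout practice amounts to bounding the acquisition loop with a deadline and backoff. A stdlib-Python sketch — `try_acquire` is a placeholder for any single acquisition attempt against the lock clients shown earlier:

```python
import time

def acquire_with_deadline(try_acquire, timeout_s, base_backoff_s=0.01):
    """Poll try_acquire with exponential backoff until it succeeds
    or the deadline passes; never blocks indefinitely."""
    deadline = time.monotonic() + timeout_s
    backoff = base_backoff_s
    while time.monotonic() < deadline:
        if try_acquire():
            return True
        # Sleep with backoff, but never past the deadline
        time.sleep(min(backoff, max(0.0, deadline - time.monotonic())))
        backoff = min(backoff * 2, 0.5)  # cap the backoff
    return False

# Succeeds on the third attempt
attempts = iter([False, False, True])
assert acquire_with_deadline(lambda: next(attempts), timeout_s=1.0)

# A lock that is never free: the call returns False instead of hanging
assert not acquire_with_deadline(lambda: False, timeout_s=0.05)
```

The caller then decides what a `False` means — fail the request, queue it, or fall back to a degraded path — rather than inheriting an unbounded wait.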
## Summary
Distributed locks are difficult to implement correctly and often unnecessary. Redis-based locks without fencing tokens are unsafe. Redlock has fundamental correctness issues. For critical coordination, use consensus-based systems like ZooKeeper or etcd. Better yet, design systems using optimistic concurrency control, idempotent operations, or CRDTs to avoid locks entirely.
Key takeaways:
- Distributed locks require fencing tokens for correctness
- Redis SET NX alone is insufficient for safety
- Redlock has serious problems despite majority quorum
- ZooKeeper and etcd provide correct distributed locks
- Optimistic concurrency often better than locks
- Design for lock-free operation when possible