Handling Hot Partitions in High-Scale Datastores
Introduction
Hot partitions occur when a small subset of keys receives a disproportionate share of traffic. This causes latency spikes and throttling even when aggregate throughput looks healthy, because per-partition limits are exhausted long before table-level capacity is. The fix is almost always in the data model and access patterns, not in raw capacity.
Why Hot Partitions Happen
- Sequential keys such as timestamps
- A small number of tenants dominating traffic
- Skewed access patterns driven by popular entities, such as a viral item or a celebrity account
Strategy 1: Add Write Shards
Spread writes across multiple logical partitions by adding a shard suffix to the partition key.
import random

SHARDS = 10  # shard count; raising it spreads writes wider but widens read fan-out

def build_partition_key(tenant_id: str) -> str:
    # Append a random shard suffix so one tenant's writes land on SHARDS partitions.
    shard = random.randint(0, SHARDS - 1)
    return f"{tenant_id}#{shard}"
Reads can query multiple shards in parallel and merge results.
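A sketch of that scatter-gather read, assuming a `query_shard` callable that fetches one shard's items (hypothetical; against DynamoDB this would be one Query per sharded partition key):

```python
from concurrent.futures import ThreadPoolExecutor

SHARDS = 10  # must match the shard count used on the write path

def read_all_shards(tenant_id: str, query_shard) -> list:
    """Query every shard suffix in parallel and merge the results."""
    keys = [f"{tenant_id}#{s}" for s in range(SHARDS)]
    with ThreadPoolExecutor(max_workers=SHARDS) as pool:
        # Materialize results while the pool is still open.
        per_shard = list(pool.map(query_shard, keys))
    merged = []
    for items in per_shard:
        merged.extend(items)
    return merged
```

Any aggregation (sums, top-N) happens after the merge, so the shard suffix stays invisible to callers.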
Strategy 2: Time Bucket Partitioning
Use time buckets to avoid all writes hitting the same partition.
from datetime import datetime, timezone

def build_time_bucket_key(metric: str) -> str:
    # Hourly bucket: writes within the same hour share a partition,
    # and each new hour rolls over to a fresh one.
    bucket = datetime.now(timezone.utc).strftime("%Y%m%d%H")
    return f"{metric}#{bucket}"
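On the read side, a time-range query has to cover every bucket in the window. A small sketch (the `buckets_for_range` helper is a hypothetical name) that enumerates the hourly bucket keys to query:

```python
from datetime import datetime, timedelta

def buckets_for_range(metric: str, start: datetime, end: datetime) -> list:
    """List the hourly bucket keys covering [start, end]."""
    keys = []
    t = start.replace(minute=0, second=0, microsecond=0)  # snap to the hour
    while t <= end:
        keys.append(f"{metric}#{t.strftime('%Y%m%d%H')}")
        t += timedelta(hours=1)
    return keys
```

Each key is then queried independently (serially or in parallel) and the results are concatenated in time order.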
Strategy 3: Adaptive Capacity Awareness (DynamoDB)
DynamoDB's adaptive capacity automatically shifts throughput toward hot partitions, but it cannot absorb extreme skew on a single key by itself. Track ThrottledRequests and partition-level metrics to detect hotspots early.
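Sustained throttling can be flagged from CloudWatch datapoints, e.g. the `Sum` statistic of `ThrottledRequests` fetched via boto3. The fetch itself is omitted here; the `detect_throttling` helper below is an illustrative sketch, not a library API:

```python
def detect_throttling(datapoints: list, sustained_periods: int = 3) -> bool:
    """Return True if ThrottledRequests was nonzero for N consecutive
    datapoints, suggesting skew that adaptive capacity has not absorbed.

    Each datapoint is a dict like those returned by CloudWatch
    GetMetricStatistics, e.g. {"Sum": 12.0, ...}, in time order.
    """
    streak = 0
    for point in datapoints:
        if point.get("Sum", 0) > 0:
            streak += 1
            if streak >= sustained_periods:
                return True
        else:
            streak = 0  # an isolated blip resets the streak
    return False
```

Requiring consecutive periods filters out one-off throttle events that adaptive capacity typically smooths over on its own.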
Strategy 4: Pre-Aggregation
Instead of writing every event to a single key, aggregate metrics locally and flush in batches.
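A minimal in-process sketch of this pattern (the `MetricAggregator` class and its `flush_fn` callback are hypothetical names, not a library API):

```python
import threading
from collections import defaultdict

class MetricAggregator:
    """Accumulate counter increments in memory and flush them as one batch,
    turning N writes per key into one write per flush interval."""

    def __init__(self, flush_fn):
        self._flush_fn = flush_fn          # called with {key: total} on flush
        self._counts = defaultdict(float)
        self._lock = threading.Lock()

    def record(self, key: str, value: float) -> None:
        with self._lock:
            self._counts[key] += value

    def flush(self) -> None:
        # Swap the buffer under the lock, then write outside it so slow
        # datastore calls never block the hot record() path.
        with self._lock:
            batch, self._counts = dict(self._counts), defaultdict(float)
        if batch:
            self._flush_fn(batch)
```

In practice `flush` runs on a timer (e.g. every few seconds) and `flush_fn` issues a single batched write per hot key, collapsing the per-event write storm.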
Example: DynamoDB Write Sharding
import boto3
from datetime import datetime, timezone

client = boto3.client("dynamodb")

def write_metric(metric: str, value: float, tenant_id: str):
    # build_partition_key is the sharded key builder from Strategy 1,
    # so writes for one tenant fan out across its shard suffixes.
    key = build_partition_key(tenant_id)
    client.put_item(
        TableName="metrics",
        Item={
            "partition_key": {"S": key},
            "sort_key": {"S": datetime.now(timezone.utc).isoformat()},
            "metric": {"S": metric},
            "value": {"N": str(value)},
        },
    )
Operational Monitoring
- Track partition-level throughput and throttling.
- Monitor p95 latency per tenant.
- Alert on sudden traffic concentration.
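The last alert can be approximated by tracking what fraction of recent requests hit the single hottest key; `top_key_share` below is an illustrative helper, not a standard metric:

```python
from collections import Counter

def top_key_share(recent_keys: list) -> float:
    """Fraction of recent traffic hitting the single hottest key.

    Feed it a sliding window of recently accessed partition keys;
    alert when the share crosses a chosen threshold (e.g. 0.3).
    """
    if not recent_keys:
        return 0.0
    hottest_key, hits = Counter(recent_keys).most_common(1)[0]
    return hits / len(recent_keys)
```

A share near `1 / number_of_keys` means traffic is spread evenly; a share well above that signals concentration worth investigating before throttling starts.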
Conclusion
Hot partitions are a data model problem, not an infrastructure problem. Detect skew early, shard keys intentionally, and validate with load tests that simulate real access patterns.