Post

Handling Hot Partitions in High-Scale Datastores

Introduction

Hot partitions occur when a small subset of keys receives a disproportionate share of traffic. This creates latency spikes and throttling even when total throughput looks reasonable. The solution is almost always data modeling and access pattern adjustments.

Why Hot Partitions Happen

  • Sequential keys such as timestamps
  • A small number of tenants dominating traffic
  • Skewed access patterns from caching or popular entities

Strategy 1: Add Write Shards

Spread writes across multiple logical partitions by adding a shard suffix to the partition key.

1
2
3
4
5
6
7
8
import random

SHARDS = 10


def build_partition_key(tenant_id: str) -> str:
    shard = random.randint(0, SHARDS - 1)
    return f"{tenant_id}#{shard}"

Reads can query multiple shards in parallel and merge results.

Strategy 2: Time Bucket Partitioning

Use time buckets to avoid all writes hitting the same partition.

1
2
3
4
5
6
from datetime import datetime


def build_time_bucket_key(metric: str) -> str:
    bucket = datetime.utcnow().strftime("%Y%m%d%H")
    return f"{metric}#{bucket}"

Strategy 3: Adaptive Capacity Awareness (DynamoDB)

DynamoDB supports adaptive capacity, but it cannot fix extreme skew alone. Track ThrottledRequests and partition-level metrics to detect hotspots early.

Strategy 4: Pre-Aggregation

Instead of writing every event to a single key, aggregate metrics locally and flush in batches.

Example: DynamoDB Write Sharding

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import boto3
from datetime import datetime

client = boto3.client("dynamodb")


def write_metric(metric: str, value: float, tenant_id: str):
    key = build_partition_key(tenant_id)
    client.put_item(
        TableName="metrics",
        Item={
            "partition_key": {"S": key},
            "sort_key": {"S": datetime.utcnow().isoformat()},
            "metric": {"S": metric},
            "value": {"N": str(value)}
        }
    )

Operational Monitoring

  • Track partition-level throughput and throttling.
  • Monitor p95 latency per tenant.
  • Alert on sudden traffic concentration.

Conclusion

Hot partitions are a data model problem, not an infrastructure problem. Detect skew early, shard keys intentionally, and validate with load tests that simulate real access patterns.

This post is licensed under CC BY 4.0 by the author.