Designing for High Throughput vs Low Latency

High throughput and low latency are related but often competing goals. Throughput measures total work completed per unit of time, while latency measures how quickly an individual request completes. Which one to optimize for should follow from business requirements.

When Throughput Matters More

High throughput is critical for batch processing, analytics pipelines, and ingestion systems. Techniques include:

  • Batching: amortize overhead across multiple requests (see the sketch after this list).
  • Asynchronous processing: decouple ingestion from processing.
  • Parallelism: maximize CPU and I/O utilization.
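
To make batching concrete, here is a minimal, illustrative sketch (the class name, batch size, and queue capacity are assumptions, not from the original post): a consumer blocks for the first item, drains whatever else is ready, and processes the whole batch in one go, amortizing per-batch overhead such as a database round trip.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative batching consumer: drains up to BATCH_SIZE queued items and
// processes them together, so per-batch overhead (e.g., one database round
// trip) is amortized across many requests.
public class BatchingConsumer implements Runnable {

    private static final int BATCH_SIZE = 100;

    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    public boolean submit(String item) {
        return queue.offer(item); // non-blocking; caller sees false when full
    }

    @Override
    public void run() {
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                batch.add(queue.take());              // block for the first item
                queue.drainTo(batch, BATCH_SIZE - 1); // grab whatever else is ready
                process(batch);                       // one round trip for the batch
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void process(List<String> batch) {
        // Stand-in for real work, e.g., a single multi-row INSERT.
        System.out.println("processed " + batch.size() + " items");
    }
}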

When Low Latency Matters More

Low latency matters for user-facing APIs, trading systems, and interactive search.

Key techniques include:

  • Avoid queues on the critical path.
  • Precompute and cache frequently accessed data (see the sketch following this list).
  • Keep request fan-out minimal.
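
A minimal sketch of the precompute-and-cache technique, assuming an in-memory read-through cache (class and method names are illustrative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative read-through cache: hot reads are served from memory, and the
// expensive load runs at most once per key under concurrent access.
public class ViewCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public String get(String id) {
        // computeIfAbsent keeps the slow path off repeat requests.
        return cache.computeIfAbsent(id, this::loadExpensively);
    }

    private String loadExpensively(String id) {
        return "view-for-" + id; // stand-in for a database or downstream call
    }
}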

Architectural Tradeoffs

Queueing Theory

Queues increase throughput by keeping workers busy, but they add tail latency: every item already waiting adds a full service time before a new request is served, so a queue depth of even a few items can double p99 latency under load.
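
A back-of-the-envelope M/M/1 calculation makes this concrete. Under the simplifying assumptions of Poisson arrivals and exponential service times (real traffic is burstier), mean time in system is W = 1 / (mu - lambda), which grows sharply as utilization approaches 1:

// Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).
// This understates real-world burstiness but shows how latency explodes
// as utilization approaches saturation.
public class QueueMath {

    public static void main(String[] args) {
        double mu = 1000.0; // service rate: 1000 req/s, i.e., 1 ms service time
        for (double utilization : new double[] {0.50, 0.80, 0.90, 0.99}) {
            double lambda = utilization * mu;              // arrival rate
            double meanLatencyMs = 1000.0 / (mu - lambda); // W, in milliseconds
            System.out.printf("utilization %.2f -> mean latency %.1f ms%n",
                    utilization, meanLatencyMs);
        }
    }
}

With a 1 ms service time, mean latency rises from 2 ms at 50% utilization to 100 ms at 99%.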

Load Shedding

To protect low latency, shed load early by rejecting excess requests instead of letting queues grow.
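
One common shedding mechanism is a concurrency limit, sketched below (the limit of 100 permits is an illustrative assumption): requests beyond the limit are rejected immediately, for example with HTTP 429, so latency stays bounded for the requests that are admitted.

import java.util.concurrent.Semaphore;

// Illustrative load shedder: cap in-flight requests with a semaphore and
// reject the excess immediately instead of letting a queue build up.
public class LoadShedder {

    private final Semaphore inFlight = new Semaphore(100); // assumed limit

    public String handle(RequestHandler handler) {
        if (!inFlight.tryAcquire()) {
            return "429 Too Many Requests"; // shed early, keep the path queue-free
        }
        try {
            return handler.run();
        } finally {
            inFlight.release();
        }
    }

    interface RequestHandler {
        String run();
    }
}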

Data Layout

Throughput-oriented systems prefer sequential writes and batch compaction. Latency-oriented systems prefer in-memory access and read-optimized indexes.
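
As a sketch of the throughput-oriented side, buffered sequential appends coalesce many small writes into a few large sequential ones (the class, file path handling, and buffer size are illustrative assumptions):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative append-only writer: a large buffer turns many small writes
// into a few sequential ones, the pattern throughput-oriented storage favors.
public class AppendLog implements AutoCloseable {

    private final OutputStream out;

    public AppendLog(String path) throws IOException {
        // 1 MiB buffer; append mode preserves the sequential write pattern.
        this.out = new BufferedOutputStream(new FileOutputStream(path, true), 1 << 20);
    }

    public void append(String record) throws IOException {
        out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        out.close(); // flushes the buffer
    }
}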

Spring Boot Example: Two Modes in the Same Service

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/reports")
public class ReportController {

    private final ReportService reportService;

    public ReportController(ReportService reportService) {
        this.reportService = reportService;
    }

    // Throughput path: accept the work and return 202 immediately;
    // processing happens off the request path.
    @PostMapping("/async")
    public ResponseEntity<Void> submitBatch(@RequestBody ReportRequest request) {
        reportService.enqueue(request);
        return ResponseEntity.accepted().build();
    }

    // Latency path: serve a precomputed view straight from cache,
    // keeping the critical path free of queues and fan-out.
    @GetMapping("/interactive/{id}")
    public ResponseEntity<ReportView> getReport(@PathVariable String id) {
        return ResponseEntity.ok(reportService.getCachedView(id));
    }
}
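
The controller delegates both paths to a ReportService. A minimal sketch of what that service might look like (queue capacity, cache choice, and error handling are illustrative assumptions; the background consumer that drains the queue, similar to the batching sketch above, is omitted):

import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.stereotype.Service;

@Service
public class ReportService {

    // Throughput path: a bounded queue absorbs bursts; a background consumer
    // (not shown) drains and batches it.
    private final BlockingQueue<ReportRequest> pending = new ArrayBlockingQueue<>(10_000);

    // Latency path: precomputed views served straight from memory.
    private final Map<String, ReportView> viewCache = new ConcurrentHashMap<>();

    public void enqueue(ReportRequest request) {
        // Shed load instead of letting the queue grow without bound.
        if (!pending.offer(request)) {
            throw new IllegalStateException("report queue full");
        }
    }

    public ReportView getCachedView(String id) {
        return viewCache.get(id); // no I/O, no queueing on the hot path
    }
}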

Observability Strategy

Measure (a Micrometer sketch follows the list):

  • p50, p95, and p99 latency for user-facing endpoints.
  • Throughput per node and per shard.
  • Queue length and saturation signals.
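
With Micrometer, which Spring Boot ships with, this might look like the following sketch (the metric names are illustrative assumptions):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class Metrics {

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Latency percentiles for a user-facing endpoint.
        Timer requestTimer = Timer.builder("reports.interactive.latency")
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);

        // Queue length as a saturation signal for the throughput path.
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(10_000);
        registry.gauge("reports.queue.length", queue, BlockingQueue::size);

        requestTimer.record(() -> { /* handle a request */ });
    }
}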

Practical Guidance

  • If latency is the top priority, keep concurrency moderate and use dedicated resources.
  • If throughput is the top priority, embrace batching and parallelism.
  • For mixed workloads, separate low-latency and high-throughput paths, as in the bulkhead sketch below.
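
Separating the paths often means giving each its own resources. A minimal bulkhead sketch, with pool sizes as illustrative assumptions:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative bulkhead: dedicated pools so a backlog of batch work can
// never starve interactive requests. Size the latency pool for moderate
// concurrency and the batch pool for utilization.
public class WorkloadIsolation {

    private final ExecutorService latencyPool = Executors.newFixedThreadPool(8);
    private final ExecutorService throughputPool = Executors.newFixedThreadPool(32);

    public void handleInteractive(Runnable task) {
        latencyPool.execute(task);
    }

    public void handleBatch(Runnable task) {
        throughputPool.execute(task);
    }
}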

Summary

Throughput and latency cannot both be maximized at all times. Mature systems isolate the two workloads and allocate architecture and capacity based on explicit SLOs.

This post is licensed under CC BY 4.0 by the author.