Designing for High Throughput vs Low Latency
High throughput and low latency are related but often competing goals. Throughput measures total work completed per unit time, while latency measures how long an individual request takes to complete. Which one to optimize for must be decided from business requirements.
When Throughput Matters More
High throughput is critical for batch processing, analytics pipelines, and ingestion systems. Techniques include:
- Batching: amortize fixed overhead across many requests (see the sketch after this list).
- Asynchronous processing: decouple ingestion from processing.
- Parallelism: maximize CPU and I/O utilization.
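A minimal batching sketch in Java: producers submit items cheaply, and a single drain loop pays the expensive per-batch cost (a network round trip, an fsync, a bulk insert) once per batch. The batch size of 100 and the flush sink are illustrative assumptions, not any particular library's API.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class BatchingWriter<T> {
        private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
        private final int maxBatchSize = 100; // illustrative; tune per workload

        // Producers pay only the cost of an in-memory enqueue.
        public void submit(T item) {
            queue.add(item);
        }

        // One background thread drains the queue, so the expensive per-batch
        // cost is paid once per batch rather than once per item.
        public void runDrainLoop() throws InterruptedException {
            List<T> batch = new ArrayList<>(maxBatchSize);
            while (true) {
                batch.add(queue.take());                 // block for the first item
                queue.drainTo(batch, maxBatchSize - 1);  // grab whatever else is ready
                flush(batch);
                batch.clear();
            }
        }

        // Hypothetical sink; in practice a bulk insert or bulk API call.
        private void flush(List<T> batch) {
            System.out.println("flushed " + batch.size() + " items");
        }
    }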
When Low Latency Matters More
Low latency matters for user-facing APIs, trading systems, and interactive search.
Key techniques include:
- Avoid queues on the critical path.
- Precompute and cache frequently accessed data (sketched after this list).
- Keep request fan-out minimal.
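A minimal sketch of precompute-and-cache: the expensive rendering happens at write time, so the read path is a single in-memory lookup. The render method is a hypothetical stand-in for real view construction, and a plain ConcurrentHashMap stands in for whatever cache (Caffeine, Redis) a production service would use.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ViewCache {
        private final Map<String, String> views = new ConcurrentHashMap<>();

        // Write path: do the expensive work once, off the critical read path.
        public void onReportUpdated(String id, String rawData) {
            views.put(id, render(rawData)); // precompute the view eagerly
        }

        // Read path: O(1) in-memory lookup, no recomputation, no queueing.
        public String getView(String id) {
            return views.get(id);
        }

        private String render(String rawData) {
            return "<view>" + rawData + "</view>"; // stand-in for real rendering
        }
    }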
Architectural Tradeoffs
Queueing Theory
Queues raise throughput by smoothing bursts and keeping workers busy, but every queued request waits, and that waiting lands in the tail: a queue depth of even a few items can double p99 latency under load.
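To put numbers on that: in the textbook M/M/1 model, mean time in system is W = 1/(μ − λ) for service rate μ and arrival rate λ, and the sojourn time is exponentially distributed, so p99 = ln(100)/(μ − λ). The figures below are illustrative: at 80% utilization, a request with a 1 ms service time averages 5 ms in the system, and its p99 is roughly 23 ms.

    public class QueueingMath {
        public static void main(String[] args) {
            double mu = 1000.0;     // service rate: 1000 req/s, i.e. 1 ms service time
            double lambda = 800.0;  // arrival rate: 80% utilization

            // M/M/1 mean time in system: W = 1 / (mu - lambda)
            double w = 1.0 / (mu - lambda);

            // Sojourn time is exponential, so p99 = ln(100) / (mu - lambda)
            double p99 = Math.log(100) / (mu - lambda);

            System.out.printf("mean time in system: %.1f ms%n", w * 1000);   // 5.0 ms
            System.out.printf("p99 time in system:  %.1f ms%n", p99 * 1000); // ~23 ms
        }
    }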
Load Shedding
To protect low latency, shed load early by rejecting excess requests instead of letting queues grow.
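One concrete form of early shedding is admission control with a semaphore: cap in-flight work and reject at the front door rather than queueing. The cap of 64 and the 429 status are illustrative choices; a real service would size the cap from measured capacity.

    import java.util.concurrent.Semaphore;
    import java.util.function.Supplier;
    import org.springframework.http.HttpStatus;
    import org.springframework.http.ResponseEntity;

    public class AdmissionGuard {
        // Hard cap on in-flight requests; 64 is illustrative, tune to capacity.
        private final Semaphore inFlight = new Semaphore(64);

        public ResponseEntity<String> handle(Supplier<String> work) {
            // Reject immediately instead of letting a queue grow: a cheap
            // rejection now beats a slow success later.
            if (!inFlight.tryAcquire()) {
                return ResponseEntity.status(HttpStatus.TOO_MANY_REQUESTS).build();
            }
            try {
                return ResponseEntity.ok(work.get());
            } finally {
                inFlight.release();
            }
        }
    }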
Data Layout
Throughput-oriented systems prefer sequential writes and batch compaction. Latency-oriented systems prefer in-memory access and read-optimized indexes.
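The write side of that contrast, sketched minimally: a throughput-oriented log appends sequentially through a large buffer so the device sees big, ordered writes, while the latency-oriented read side is an in-memory lookup like the view cache sketched earlier. The 1 MiB buffer size is an illustrative assumption.

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class SequentialLog {
        // Large buffer so the OS sees big sequential writes; 1 MiB is illustrative.
        private final BufferedOutputStream out;

        public SequentialLog(String path) throws IOException {
            this.out = new BufferedOutputStream(new FileOutputStream(path, true), 1 << 20);
        }

        // Append-only: no seeks, so the device sustains maximum write throughput.
        public synchronized void append(byte[] record) throws IOException {
            out.write(record);
        }

        // Flush in batches (e.g., on a timer) rather than per record.
        public synchronized void sync() throws IOException {
            out.flush();
        }
    }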
Spring Boot Example: Two Modes in the Same Service
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.*;

    @RestController
    @RequestMapping("/reports")
    public class ReportController {

        private final ReportService reportService;

        public ReportController(ReportService reportService) {
            this.reportService = reportService;
        }

        // Throughput path: accept and enqueue, return 202 Accepted immediately;
        // the expensive work happens off the request thread.
        @PostMapping("/async")
        public ResponseEntity<Void> submitBatch(@RequestBody ReportRequest request) {
            reportService.enqueue(request);
            return ResponseEntity.accepted().build();
        }

        // Latency path: serve a precomputed, cached view; no recomputation,
        // no queueing on the critical path.
        @GetMapping("/interactive/{id}")
        public ResponseEntity<ReportView> getReport(@PathVariable String id) {
            return ResponseEntity.ok(reportService.getCachedView(id));
        }
    }
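The controller assumes a ReportService whose enqueue sits on a bounded worker pool. A sketch of that wiring, with illustrative pool and queue sizes: the bounded queue plus a rejecting policy is what keeps the throughput path from silently degrading into unbounded latency.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class ReportWorkerPool {
        // Bounded pool and queue; sizes are illustrative, tune to measured capacity.
        private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadPoolExecutor.AbortPolicy()); // reject rather than grow unboundedly

        public boolean enqueue(Runnable reportJob) {
            try {
                pool.execute(reportJob);
                return true;
            } catch (RejectedExecutionException e) {
                return false; // caller can translate this into a 429/503
            }
        }
    }

AbortPolicy makes overload explicit at the service boundary instead of hiding it inside a growing queue, which is exactly the load-shedding behavior described above.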
Observability Strategy
Measure (a Micrometer sketch follows the list):
- p50, p95, and p99 latency for user-facing endpoints.
- Throughput per node and per shard.
- Queue length and saturation signals.
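In a Spring Boot service these signals map naturally onto Micrometer. A sketch, assuming a MeterRegistry is available and that the gauge watches the work queue from the pool sketch above; the metric names are illustrative.

    import java.util.concurrent.BlockingQueue;
    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;

    public class ThroughputLatencyMetrics {
        public ThroughputLatencyMetrics(MeterRegistry registry, BlockingQueue<?> queue) {
            // Latency: publish p50/p95/p99 for the user-facing read path.
            // The same timer's count over time doubles as per-node throughput.
            Timer.builder("reports.interactive.latency")
                    .publishPercentiles(0.5, 0.95, 0.99)
                    .register(registry);

            // Saturation: expose queue depth so shedding thresholds can be tuned.
            Gauge.builder("reports.queue.depth", queue, BlockingQueue::size)
                    .register(registry);
        }
    }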
Practical Guidance
- If latency is the top priority, keep concurrency moderate and use dedicated resources.
- If throughput is the top priority, embrace batching and parallelism.
- For mixed workloads, separate low-latency and high-throughput paths.
Summary
Throughput and latency cannot both be maximized at once. Mature systems isolate the two workloads and allocate architecture and capacity against explicit SLOs.