Designing for High Throughput vs Low Latency

High throughput and low latency are related but often competing goals. Throughput measures total work per unit time, while latency measures how fast individual requests complete. You must pick which to optimize based on business requirements.

When Throughput Matters More

High throughput is critical for batch processing, analytics pipelines, and ingestion systems. Techniques include:

  • Batching: amortize overhead across multiple requests.
  • Asynchronous processing: decouple ingestion from processing.
  • Parallelism: maximize CPU and I/O utilization.
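The batching idea above can be sketched with nothing but `java.util.concurrent`: requests accumulate in a queue, and a worker drains them in groups so per-batch overhead (one network round trip, one fsync) is shared across all items. The batch size of 4 is an illustrative assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BatchingSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(100);
        for (int i = 0; i < 10; i++) queue.put(i);

        // Drain up to 4 items at a time so fixed per-flush overhead is
        // amortized across the whole batch instead of paid per request.
        int batches = 0;
        while (!queue.isEmpty()) {
            List<Integer> batch = new ArrayList<>();
            queue.drainTo(batch, 4);
            batches++;
            System.out.println("flushing batch of " + batch.size());
        }
        System.out.println("batches=" + batches); // 10 items in groups of <=4 -> 3 flushes
    }
}
```

Ten items cost three flushes instead of ten, which is the whole throughput win.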

When Low Latency Matters More

Low latency matters for user-facing APIs, trading systems, and interactive search.

Key techniques include:

  • Avoid queues on the critical path.
  • Precompute and cache frequently accessed data.
  • Keep request fan-out minimal.

Architectural Tradeoffs

Queueing Theory

Queues increase throughput but add tail latency. A queue depth of even a few items can double p99 latency under load.
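The arithmetic behind that claim is simple: a new arrival waits for everything ahead of it, then for its own service. With an assumed 5 ms service time, a queue depth of just 3 turns a 5 ms request into a 20 ms one.

```java
public class QueueWaitSketch {
    public static void main(String[] args) {
        double serviceMs = 5.0; // per-request service time (assumed)
        int queueDepth = 3;     // requests already waiting ahead of us

        // Wait for the queue, then be served: total = depth * service + service.
        double totalMs = queueDepth * serviceMs + serviceMs;
        System.out.println("latency with empty queue: " + serviceMs + " ms");
        System.out.println("latency behind depth-3 queue: " + totalMs + " ms");
    }
}
```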

Load Shedding

To protect low latency, shed load early by rejecting excess requests instead of letting queues grow.
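A minimal way to express this in Java is a bounded `ThreadPoolExecutor` with `AbortPolicy`: once the queue is full, submissions fail immediately instead of piling up. The pool size, queue bound, and task duration below are illustrative assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LoadSheddingSketch {
    public static void main(String[] args) throws InterruptedException {
        // One worker and a queue of at most 2: anything beyond that is
        // rejected up front rather than waiting in a growing queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),
                new ThreadPoolExecutor.AbortPolicy());

        int accepted = 0, shed = 0;
        for (int i = 0; i < 10; i++) {
            try {
                pool.execute(() -> sleep(50));
                accepted++;
            } catch (RejectedExecutionException e) {
                shed++; // fail fast: the caller can back off or retry elsewhere
            }
        }
        System.out.println("accepted=" + accepted + " shed=" + shed);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

The accepted requests keep their latency budget because the queue ahead of them is capped; the shed requests get an immediate, honest rejection.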

Data Layout

Throughput-oriented systems prefer sequential writes and batch compaction. Latency-oriented systems prefer in-memory access and read-optimized indexes.
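Both layouts can live in one store, in the spirit of a log-structured design: an append-only log takes cheap sequential writes, while an in-memory index keeps reads constant-time. This is a deliberately tiny sketch, not a real storage engine; compaction is only hinted at in a comment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DataLayoutSketch {
    // Throughput path: append-only log, writes are always sequential.
    private final List<String> log = new ArrayList<>();
    // Latency path: in-memory index, reads are a single lookup.
    private final Map<String, String> index = new HashMap<>();

    void put(String key, String value) {
        log.add(key + "=" + value); // cheap sequential append
        index.put(key, value);      // keeps the read path fast
    }

    public static void main(String[] args) {
        DataLayoutSketch store = new DataLayoutSketch();
        store.put("a", "1");
        store.put("a", "2"); // the log keeps both versions until batch compaction
        System.out.println("log entries=" + store.log.size());
        System.out.println("read a=" + store.index.get("a"));
    }
}
```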

Spring Boot Example: Two Modes in the Same Service

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/reports")
public class ReportController {
    private final ReportService reportService;

    public ReportController(ReportService reportService) {
        this.reportService = reportService;
    }

    // Throughput path: accept, enqueue, and return 202 immediately;
    // the heavy work happens off the request thread.
    @PostMapping("/async")
    public ResponseEntity<Void> submitBatch(@RequestBody ReportRequest request) {
        reportService.enqueue(request);
        return ResponseEntity.accepted().build();
    }

    // Latency path: serve a precomputed view straight from cache.
    @GetMapping("/interactive/{id}")
    public ResponseEntity<ReportView> getReport(@PathVariable String id) {
        return ResponseEntity.ok(reportService.getCachedView(id));
    }
}

Observability Strategy

Measure:

  • p50, p95, and p99 latency for user-facing endpoints.
  • Throughput per node and per shard.
  • Queue length and saturation signals.
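Percentiles are worth computing at least once by hand to see why the tail matters. The sketch below uses nearest-rank percentiles over a sorted in-memory sample; the latency values are hypothetical, and production systems would use a streaming histogram rather than sorting raw samples.

```java
import java.util.Arrays;

public class PercentileSketch {
    // Nearest-rank percentile over a sorted sample (simplified).
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        // Hypothetical latency samples in milliseconds; one slow outlier.
        long[] samples = {4, 5, 5, 6, 6, 7, 8, 9, 15, 120};
        Arrays.sort(samples);
        System.out.println("p50=" + percentile(samples, 0.50));
        System.out.println("p95=" + percentile(samples, 0.95));
        System.out.println("p99=" + percentile(samples, 0.99));
    }
}
```

A single 120 ms outlier leaves the median at 6 ms but drags p95 and p99 to 120 ms, which is why averages hide exactly the behavior users complain about.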

Practical Guidance

  • If latency is the top priority, keep concurrency moderate and use dedicated resources.
  • If throughput is the top priority, embrace batching and parallelism.
  • For mixed workloads, separate low-latency and high-throughput paths.
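One concrete form of that separation is giving each path its own thread pool, so a burst of batch work can never queue ahead of interactive requests. The pool sizes here are illustrative assumptions, not recommendations.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SplitPathsSketch {
    // Dedicated pools isolate the two workloads from each other.
    private final ExecutorService interactivePool = Executors.newFixedThreadPool(4);
    private final ExecutorService batchPool = Executors.newFixedThreadPool(16);

    public static void main(String[] args) throws InterruptedException {
        SplitPathsSketch s = new SplitPathsSketch();
        // Each request type lands on its own pool, so batch saturation
        // cannot add queueing delay to the interactive path.
        s.interactivePool.submit(() -> System.out.println("interactive request"));
        s.batchPool.submit(() -> System.out.println("batch job"));
        s.interactivePool.shutdown();
        s.batchPool.shutdown();
        s.interactivePool.awaitTermination(5, TimeUnit.SECONDS);
        s.batchPool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```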

Summary

Throughput and latency cannot both be maximized at all times. Mature systems isolate the two workloads and allocate architecture and capacity based on explicit SLOs.
