Latency budgets and load shedding — Migration Playbook — Practical Guide (Mar 17, 2026)

Level: Experienced

As of March 17, 2026

Introduction

Modern distributed systems demand carefully designed latency budgets to meet strict service-level objectives (SLOs). Unexpected load spikes or cascading failures can cause latency to spiral, degrading user experience or even leading to service outages. Load shedding—intentionally dropping or deferring requests—is a core strategy to keep latency within budget and maintain overall system health.

This playbook guides you through migrating from naive or manual load shedding to a structured, latency-budget-driven strategy. It applies to teams operating microservices, serverless architectures, or monolithic services under high concurrency, and assumes familiarity with latency SLIs, monitoring, and distributed tracing.

Prerequisites

Existing observability infrastructure with end-to-end latency tracing and error rate metrics (e.g., OpenTelemetry, Prometheus, Datadog).
Ability to enforce service-level objectives (SLOs) with defined error budgets and latency thresholds.
Control over runtime request handling logic (middleware, API gateway, service mesh) to implement or hook load shedding logic.
Experience with rate limiting, circuit breaking, and message prioritisation.

Hands-on Steps

1. Define and Measure Your Latency Budget

Start by explicitly defining your latency budgets per API endpoint or service component. For example, if your frontend demands a 500ms P95 response time, subtract client-to-edge and network delays and allocate the remaining time among backend services in the call chain.

Use tracing to attribute latency correctly. Identify baseline latency under normal conditions and observe degradation during peak loads or failures.

{
  "service": "payment-api",
  "endpoint": "/charge",
  "latency_budget_ms": 300,
  "target_percentile": 95
}

2. Implement Early Load Shedding Points

Enforce load shedding as early in the request lifecycle as possible to conserve resources. This usually means implementing shedding in the edge proxy, API gateway, or ingress controller before requests reach costly backend services.

Approaches include:

Rejecting requests beyond configured concurrency or rate limits.
Returning synthetic or degraded responses when load exceeds threshold.

// Example: simple concurrent request cap in Go
import (
  "net/http"
  "sync/atomic"
)

const maxConcurrent = 100

var currentConcurrent int32

func Handler(w http.ResponseWriter, r *http.Request) {
  // Increment first, then check. A separate load followed by an add would
  // let concurrent requests slip past the cap between the two operations.
  if atomic.AddInt32(&currentConcurrent, 1) > maxConcurrent {
    atomic.AddInt32(&currentConcurrent, -1)
    http.Error(w, "Service overloaded", http.StatusTooManyRequests)
    return
  }
  defer atomic.AddInt32(&currentConcurrent, -1)
  // process request
}

3. Integrate Latency-Aware Shedding

Static caps can be too rigid. Introduce latency and error rate signals to dynamically adjust shedding thresholds. For example:

If observed latency over a recent window (for example, the windowed P95) exceeds 90% of budget, incrementally shed load.
Use moving windows of latency and request volume to avoid oscillation.

This can be implemented in middleware or in a service mesh with programmable filters (e.g., Envoy's adaptive concurrency filter).

4. Prioritise Requests

Not all requests are equal. Implement prioritisation strategies:

Critical user-facing vs background batch jobs.
Premium customers vs free-tier users.

Lower priority requests can be shed earlier or deferred to maintain budgets for higher priority traffic.
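Shedding lower priorities earlier can be expressed as a per-priority utilisation threshold. The sketch below is illustrative only: the three priority classes and the 0.70, 0.85, and 0.98 thresholds are assumed values, and how a request is classified (header, route, customer tier) is outside its scope.

```go
package main

// Priority classes, highest first. The classification is an assumption.
type Priority int

const (
	Critical   Priority = iota // e.g. user-facing or premium traffic
	Normal                     // default traffic
	Background                 // e.g. batch jobs or free-tier traffic
)

// shedThreshold is the utilisation (in-flight / capacity) at which each
// priority starts being shed. Values are hypothetical and need tuning.
var shedThreshold = map[Priority]float64{
	Background: 0.70,
	Normal:     0.85,
	Critical:   0.98,
}

// ShouldShed reports whether a request of priority p should be rejected at
// the current utilisation. Unknown priorities are treated as Background.
func ShouldShed(p Priority, inFlight, capacity int) bool {
	if capacity <= 0 {
		return true
	}
	threshold, ok := shedThreshold[p]
	if !ok {
		threshold = shedThreshold[Background]
	}
	return float64(inFlight)/float64(capacity) >= threshold
}
```

At 80% utilisation this sheds only Background requests; at 90% it also sheds Normal traffic while Critical requests continue to be admitted.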

5. Feedback and Backpressure

Load shedding should be complemented with backpressure mechanisms. For synchronous clients, return explicit status codes such as 429 Too Many Requests, ideally with a Retry-After header so clients know when to try again. For asynchronous workflows, use queue-length signals to slow producers.

6. Gradual Rollout and Monitoring

Deploy load shedding changes gradually, starting with canary or dark launches. Monitor key metrics:

Latency distributions before and after shedding.
Request failure rate due to shedding.
System resource utilisation (CPU, memory, queue length).

Refine thresholds based on observed impact.
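One possible way to stage such a rollout is a mode switch that first records what would have been shed without rejecting anything, then enforces once the numbers look right. This is a sketch of that idea, not a definitive rollout mechanism: the Rollout type, the three modes, and the counters are hypothetical, and in practice the counters would be exported to your metrics system. It requires Go 1.19 or later for the typed atomics.

```go
package main

import "sync/atomic"

// Mode controls whether shedding decisions are ignored, only counted, or enforced.
type Mode int32

const (
	Off     Mode = iota // shedding disabled
	DryRun              // count decisions, but let every request through
	Enforce             // count decisions and reject
)

// Rollout gates a shedder's decisions behind a mode that can be changed at runtime.
type Rollout struct {
	mode      atomic.Int32
	WouldShed atomic.Int64 // requests that would have been shed in DryRun
	Shed      atomic.Int64 // requests actually shed in Enforce
}

// SetMode switches the rollout stage; safe for concurrent use.
func (r *Rollout) SetMode(m Mode) { r.mode.Store(int32(m)) }

// Apply takes the shedder's decision and returns whether to reject the request.
func (r *Rollout) Apply(decision bool) bool {
	if !decision {
		return false
	}
	switch Mode(r.mode.Load()) {
	case DryRun:
		r.WouldShed.Add(1)
		return false
	case Enforce:
		r.Shed.Add(1)
		return true
	default:
		return false
	}
}
```

Comparing the WouldShed count against total traffic in the dry-run stage gives an estimate of the failure rate that enforcement would introduce, before any user sees a rejection.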

Common Pitfalls

Overly aggressive shedding: may cause excessive error rates and frustrated users.
Under-shedding: may lead to cascading failures and SLO breaches.
Ignoring prioritisation: treating all requests equally may degrade critical workflows.
Insufficient observability: inability to correlate shedding events with latency impact.
Static thresholds: not adapting to evolving load profiles or degraded conditions.

Validation

Perform controlled load tests to validate shedding is effective under stress. For example, using tools like k6 or JMeter, ramp up traffic gradually and observe:

Whether shedding activates near the target latency budget thresholds.
Whether high-priority requests continue to succeed despite overall load.
Whether shed requests receive clear, actionable responses.

Additionally, test failure modes to ensure that partial shedding or fallback paths do not produce cascading failures.

Checklist / TL;DR

Define explicit per-service latency budgets informed by end-to-end traces.
Implement load shedding as early as possible in the request path.
Use dynamic, latency-aware thresholds, not just static caps.
Prioritise requests to protect critical user journeys.
Provide clear feedback to clients on shedding and backpressure.
Monitor shedding impact on latency, error rates, and resources continuously.
Test thoroughly with load and failure injection before full rollout.

When to Choose Latency-Based vs Static Load Shedding

Static load shedding (fixed concurrency or rate limits) is simple, predictable, and suitable for services with stable, known capacity.

Latency-based load shedding is more adaptive and reacts to transient load changes and degraded hardware but requires robust telemetry and more complex control logic.

Start with static limits in early phases; migrate to latency-driven shedding as observability and operational sophistication improve.