SLOs August 28, 2024

When Error Budgets Run Dry: What to Do Before the Pager Goes Off

Error budget exhaustion is a lagging signal. By the time the 30-day window turns red on your SLO dashboard, users have already absorbed those errors — and your deploy pipeline is now frozen while the post-mortem queue fills up. The better play is to treat burn rate as a real-time signal during canary deployments, not a retrospective accounting tool.

This piece is about the window between "canary deployed" and "error budget gone." That's the window where automatic detection saves you from finding out from users.

What Error Budget Burn Rate Actually Measures

An error budget is the allowable failure budget derived from your SLO. If your SLO is 99.9% availability over 30 days, your error budget is 43.2 minutes of allowable downtime. Burn rate expresses how fast you're consuming that budget relative to the rate that would exhaust it exactly at the 30-day window boundary.
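The arithmetic is easy to sanity-check. Here is a minimal sketch; `error_budget_minutes` is an illustrative helper name, not part of any library:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowable downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9995), 1))  # 21.6
```

Tightening the SLO from 99.9% to 99.95% halves the budget, which is why the same raw error rate can be tolerable on one service and ruinous on another.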

A burn rate of 1.0 means you're consuming budget at exactly the sustainable pace. A burn rate of 10 means you'll exhaust the entire 30-day budget in 3 days. Google's SRE workbook recommends paging at burn rates of 14.4 (consuming 2% of budget in 1 hour) and 6 (consuming 5% in 6 hours), and opening a ticket at a burn rate of 1 sustained over 3 days (consuming 10%), so response urgency scales with how fast the budget is disappearing.
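Both ways of reading a burn rate (time until the budget is gone, and share of budget consumed over a given stretch) fall out of the same ratio. The sketch below assumes a 30-day window; the helper names are illustrative only:

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window, expressed in hours


def days_to_exhaustion(burn_rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if this burn rate is sustained."""
    return window_days / burn_rate


def budget_consumed(burn_rate: float, hours: float,
                    window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the full-window budget consumed by burning for `hours`."""
    return burn_rate * hours / window_hours


print(days_to_exhaustion(10))              # 3.0
print(round(budget_consumed(14.4, 1), 4))  # 0.02
print(round(budget_consumed(6, 6), 4))     # 0.05
```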

The key insight for canary deployments: you don't need to wait until you've consumed a meaningful percentage of the budget. You need to detect the rate at which the canary is burning it, and compare that rate to the stable baseline.

Burn Rate as a Canary Gate

Standard canary analysis compares error rate in isolation: if the canary's error rate exceeds 0.5%, fail the rollout. This is correct but incomplete. A canary that runs at a 0.4% error rate on a service with a 99.95% SLO passes that check while still burning budget at a catastrophic rate: 0.4% is 8x the allowable error rate of 0.05% for that SLO.

Burn rate-aware canary analysis does the comparison correctly:

canary_burn_rate = canary_error_rate / (1 - slo_target)
# For SLO = 0.9995 (99.95%):
# Allowable error rate = 0.0005 (0.05%)
# If canary error rate = 0.004 (0.4%):
# canary_burn_rate = 0.004 / 0.0005 = 8.0x

A burn rate of 8x during the canary window means that if you promoted this version to 100% traffic, you'd exhaust a 30-day error budget in about 3.75 days. That's not a decision you want to make implicitly because a 0.4% error rate slipped under a fixed 0.5% threshold on a service with a tight SLO.

In Kubestead's errorBudgetPolicy block, you can express this directly:

errorBudgetPolicy:
  sloTarget: 0.9995
  lookbackWindow: 10m
  maxBurnRate: 2.0
  action: rollback

This fails the canary the moment the 10-minute burn rate exceeds 2x the sustainable pace — before any user-visible SLO breach.
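To make the gate's logic concrete, here is a rough Python sketch of the decision that a policy like this encodes. It is not Kubestead's implementation; `WindowCounts`, `burn_rate`, and `gate_decision` are hypothetical names, and treating an empty window as a burn rate of zero is an assumption.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Canary request totals observed over one lookback window."""
    total: int
    errors: int


def burn_rate(counts: WindowCounts, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if counts.total == 0:
        return 0.0  # assumption: no traffic means nothing was burned
    error_rate = counts.errors / counts.total
    return error_rate / (1.0 - slo_target)


def gate_decision(counts: WindowCounts, slo_target: float,
                  max_burn_rate: float) -> str:
    """Return 'rollback' when the window's burn rate exceeds the limit."""
    if burn_rate(counts, slo_target) > max_burn_rate:
        return "rollback"
    return "advance"


# 10-minute window with 20,000 requests and 80 errors: 0.4% error rate, ~8x burn
print(gate_decision(WindowCounts(20_000, 80), 0.9995, 2.0))  # rollback
# Same traffic with 10 errors: 0.05% error rate, ~1x burn
print(gate_decision(WindowCounts(20_000, 10), 0.9995, 2.0))  # advance
```

A production gate would likely also want a minimum request count before trusting the ratio, since a handful of errors on thin canary traffic can swing the burn rate wildly.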

The Difference Between Short-Window and Long-Window Burn Rate

Multi-window burn rate alerting uses two calculation windows simultaneously: a short window (1 hour) to detect fast burns, and a long window (6 hours or more) to detect slow persistent burns that wouldn't trigger short-window alerts.

Canary analysis benefits from the same dual-window approach. A canary that's burning at 15x for 3 minutes might be a cold-start latency spike that will resolve. A canary burning at 3x consistently for 20 minutes is not a transient spike — it's a regression.

Kubestead evaluates both windows concurrently and requires both to be below threshold before advancing, configurable per step. The typical setup is a 5-minute short window for fast failure detection and a 20-minute long window for slow regressions that stay below the short-window threshold.
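The dual-window rule can be sketched in a few lines. This is an illustration of the idea rather than Kubestead's code: the `Window` type, the function names, and the specific per-window thresholds (10x short, 2x long) are all assumptions chosen for the example.

```python
from typing import NamedTuple


class Window(NamedTuple):
    """Error and request counts for one evaluation window."""
    errors: int
    total: int


def burn_rate(w: Window, slo_target: float) -> float:
    if w.total == 0:
        return 0.0  # assumption: an empty window burns nothing
    return (w.errors / w.total) / (1.0 - slo_target)


def may_advance(short: Window, long: Window, slo_target: float,
                short_max: float, long_max: float) -> bool:
    """Advance only if BOTH windows are below their thresholds."""
    return (burn_rate(short, slo_target) < short_max
            and burn_rate(long, slo_target) < long_max)


# Slow regression: roughly 3x in both the 5-minute and 20-minute windows.
short = Window(errors=15, total=10_000)
long = Window(errors=60, total=40_000)
print(may_advance(short, long, 0.9995, short_max=10.0, long_max=2.0))  # False

# Healthy canary: well under 1x in both windows.
print(may_advance(Window(2, 10_000), Window(10, 40_000), 0.9995,
                  short_max=10.0, long_max=2.0))  # True
```

The first case is the one a single short-window threshold would miss: 3x never trips a limit set high enough to tolerate cold-start spikes, but it does trip the tighter long-window limit.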

What Burn Rate Doesn't Catch

We're not suggesting burn rate replaces all other canary metrics. Burn rate is error-count-based. It won't catch latency regressions that don't produce errors: a canary where P99 latency doubled but error rate stayed flat will show the same burn rate as the stable baseline and sail through a burn-rate-only gate.

The right configuration layers burn rate alongside latency comparison metrics. Burn rate catches the "errors got worse" signal; P99 ratio between canary and stable catches the "latency got worse without erroring" signal. Neither alone is sufficient for production.
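A layered gate might look like the sketch below. The function name and the 1.25x P99 ratio limit are assumptions made for illustration; pick a ratio that reflects your own latency tolerance.

```python
def canary_passes(canary_burn_rate: float,
                  canary_p99_ms: float,
                  stable_p99_ms: float,
                  max_burn_rate: float = 2.0,
                  max_p99_ratio: float = 1.25) -> bool:
    """Pass only if both the error signal and the latency signal are healthy."""
    if stable_p99_ms <= 0:
        raise ValueError("stable_p99_ms must be positive")
    p99_ratio = canary_p99_ms / stable_p99_ms
    errors_ok = canary_burn_rate <= max_burn_rate
    latency_ok = p99_ratio <= max_p99_ratio
    return errors_ok and latency_ok


# Errors look fine, but P99 doubled: the latency check fails the canary.
print(canary_passes(canary_burn_rate=0.8, canary_p99_ms=400, stable_p99_ms=200))  # False
# Both signals healthy.
print(canary_passes(canary_burn_rate=0.8, canary_p99_ms=210, stable_p99_ms=200))  # True
```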

Practical: When Your Budget Is Already Low Before a Deploy

The scenario that burns teams most often: you're already at 40% error budget remaining three weeks into the 30-day window. A deploy kicks off, the canary looks nominally clean (error rate 0.3%, which is below the hard threshold), but the burn rate during the canary window is 4x. You promote to 100% traffic, and at 4x (roughly 13% of the full budget per day) the remaining 40% is gone in about three days. Pipeline freeze, incident review, and now you're about a week from the end of the 30-day window with no budget left to absorb any additional instability.

The right intervention is a pre-deploy budget check that tightens canary thresholds dynamically when budget is constrained. If you have 60%+ budget remaining, allow a burn rate of up to 3x during canary. If you have under 25% remaining, only allow 1.5x. This prevents the "nominally passing canary that depletes a nearly-empty budget" failure mode.
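One way to express that policy is a function from remaining budget to allowed burn rate. The two endpoints come from the numbers above; the linear interpolation between 25% and 60% is an assumption for this sketch, and it is not a description of how Kubestead scales the limit.

```python
def max_burn_rate_for(remaining_budget: float) -> float:
    """Allowed canary burn rate given remaining budget as a fraction in [0, 1]."""
    if not 0.0 <= remaining_budget <= 1.0:
        raise ValueError("remaining_budget must be between 0 and 1")
    if remaining_budget >= 0.60:
        return 3.0
    if remaining_budget < 0.25:
        return 1.5
    # Assumption: scale linearly between (25%, 1.5x) and (60%, 3.0x).
    position = (remaining_budget - 0.25) / (0.60 - 0.25)
    return 1.5 + position * (3.0 - 1.5)


print(max_burn_rate_for(0.80))             # 3.0
print(max_burn_rate_for(0.10))             # 1.5
print(round(max_burn_rate_for(0.425), 2))  # 2.25
```

A step function with a third tier would work just as well; what matters is that the limit is derived from budget state at deploy time instead of being a constant.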

Kubestead supports this via an errorBudgetPolicy.remainingBudgetThreshold parameter that scales the maxBurnRate limit automatically based on the current budget state. You set it once in the analysis template; it adjusts on every deploy without manual intervention.

Reading the Rollback as a Budget Signal

Every rollback Kubestead triggers includes a structured event with the burn rate at trigger time, the current remaining budget, and the estimated budget that would have been consumed if the rollout had been promoted to 100%. This isn't just forensics — it's a feedback loop for tuning your thresholds over time.

If you see a pattern of rollbacks triggered at burn rates of 1.8-2.2x, your threshold of 2.0 is probably about right. If you're seeing rollbacks triggered at 1.1-1.3x with no corresponding user impact, your threshold is too aggressive and you're rolling back deploys that would have been fine. The data to calibrate this is in your rollback telemetry — it just needs to be read.
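As a starting point for that calibration, here is a hedged sketch that summarizes rollback events. The `RollbackEvent` shape is hypothetical: the first two fields mirror what the rollback event is described as carrying, while `user_impact` is an assumed label you would fill in yourself after the fact, for example from incident review.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RollbackEvent:
    burn_rate_at_trigger: float
    remaining_budget: float  # fraction of budget left at trigger time
    user_impact: bool        # assumption: labeled manually after review


def no_impact_share(events: List[RollbackEvent]) -> float:
    """Fraction of rollbacks that fired with no corresponding user impact."""
    if not events:
        return 0.0
    quiet = sum(1 for e in events if not e.user_impact)
    return quiet / len(events)


events = [
    RollbackEvent(1.1, 0.70, user_impact=False),
    RollbackEvent(1.2, 0.65, user_impact=False),
    RollbackEvent(1.3, 0.60, user_impact=False),
    RollbackEvent(2.1, 0.55, user_impact=True),
]
print(no_impact_share(events))  # 0.75
```

A high share of no-impact rollbacks clustered at low burn rates is the signature of a threshold that is too tight.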