SLO Thomas Rivera

Stop alerting on errors. Start alerting on error budget burn rate.

Error-rate alerts during canary analysis cause both false positives and false negatives. Budget burn rate tells you what actually matters: how fast you're spending your reliability capital.

Alert thresholds overlaid on an error budget burn chart

Error-rate alerts feel intuitive. Your service is returning 5xx responses at 2% — that's twice your usual baseline — so you page someone. The logic is clean and the configuration is simple. The problem is that error-rate alerts are optimized for a different problem than the one you actually have. They're optimized to detect that something is wrong right now. What you need during a canary deployment is to detect whether this version will cause you to miss your reliability commitments.

Those are related but distinct signals. Error rate tells you the intensity of a problem at a point in time. Error budget burn rate tells you the trajectory: at this pace, when do you run out of reliability capital?

The fundamental problem with error-rate alerting

Consider a service with a 99.9% availability SLO over a 30-day window. That's 43.2 minutes of allowable downtime per month, or roughly 0.1% of total requests allowed to fail.

Now consider two scenarios at different points in the month:

Scenario A: It's the 3rd of the month. Your budget is fully intact. Your canary pod is running at 0.15% error rate — 50% above baseline. Your alert threshold is 0.2%, so no alert fires. You promote the canary. It runs at 0.15% for the month. You've consumed 150% of your budget. You miss your SLO.

Scenario B: It's the 28th of the month. You've had a rough few weeks and your budget is 95% consumed. Your canary pod is running at 0.08% error rate, below your baseline of 0.10%. Your alert threshold of 0.2% doesn't fire. But 0.08% is still a burn rate of 0.8×, and with about two days left in the window that spends roughly another 5% of the monthly budget, which is all you have left. You're on the edge.

In both cases, the error-rate alert gave you either the wrong signal or no signal at all. Neither scenario involves an alert that actually helps you make the right deployment decision.
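The arithmetic behind both scenarios can be checked with a short sketch. The function name `projected_budget_use` is hypothetical, and the model assumes a fixed 30-day window, constant traffic, a constant error rate, and about two days remaining in Scenario B. Treat it as an illustration, not a forecasting tool.

```python
def projected_budget_use(consumed: float, error_rate: float,
                         budget_rate: float, days_left: float,
                         window_days: float = 30.0) -> float:
    """Fraction of the error budget used by the end of the window.

    Assumes constant traffic and a constant error rate for the rest of
    the window (a simplification for illustration).
    """
    burn_rate = error_rate / budget_rate
    return consumed + burn_rate * (days_left / window_days)


# Scenario A: budget intact; treat the whole month as running at 0.15%.
scenario_a = projected_budget_use(consumed=0.0, error_rate=0.0015,
                                  budget_rate=0.001, days_left=30)

# Scenario B: 95% already spent, about two days left, canary at 0.08%.
scenario_b = projected_budget_use(consumed=0.95, error_rate=0.0008,
                                  budget_rate=0.001, days_left=2)

print(f"A: {scenario_a:.0%} of budget, B: {scenario_b:.0%} of budget")
```

Neither input crosses the 0.2% error-rate threshold, yet both projections end at or above 100% of the budget.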

What burn rate actually measures

Burn rate is a ratio: how fast are you consuming your error budget relative to the pace that would spend exactly the whole budget over the SLO window? A burn rate of 1 is neutral: you'll exhaust exactly your budget by the end of the window. A burn rate of 14.4 means you'll exhaust a full 30-day budget in 50 hours, or 2% of it every hour.

The canonical formulation for a 99.9% SLO is:

# Burn rate over a 1-hour window (call it burn_rate_1h)
# (error_rate / error_rate_budget) tells you the multiplier
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / 0.001   # 0.001 = 1 - SLO target (0.999)

If this evaluates to 2.5, you're burning budget at 2.5× the sustainable rate. Sustain that across a full 30-day window and you'll spend 2.5× your budget. Starting from a full budget, it runs out after 12 days, and you miss your SLO.
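To make the multiplier concrete, here is a minimal sketch that turns an observed error ratio into a burn rate and a time to exhaustion. The helper names `burn_rate` and `hours_to_exhaustion` are illustrative, and the calculation assumes a full, untouched budget and a constant burn rate.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the budgeted error ratio (1 - SLO)."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Hours until a full, untouched budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days * 24.0 / rate


# 0.25% errors against a 99.9% SLO
rate = burn_rate(error_ratio=0.0025, slo_target=0.999)
days = hours_to_exhaustion(rate) / 24.0
print(f"burn rate {rate:.2f}, budget gone in {days:.1f} days")
```

The same helper is a quick sanity check on any alert threshold: plug in the burn rate and see how long a full budget would last.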

The Google SRE Workbook's multi-window burn rate alert pattern pairs a long window with a short one (1h with 5m, or 6h with 30m) and fires only when both exceed the threshold. That pairing distinguishes spikes from sustained burns. A high 5m burn rate that isn't confirmed by a high 1h burn rate is likely transient. A high 6h burn rate is a real problem even when the current instantaneous rate looks modest.
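The core of that pattern is a logical AND across two windows. A minimal sketch, assuming you have already computed a burn rate for a shorter and a longer window (the function name `multiwindow_alert` and the sample values are made up for illustration; 14.4 is the threshold mentioned earlier):

```python
def multiwindow_alert(long_burn: float, short_burn: float,
                      threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return long_burn > threshold and short_burn > threshold


# A brief spike: the short window is hot, the long window is not.
spike = multiwindow_alert(long_burn=3.0, short_burn=20.0, threshold=14.4)

# A sustained burn: both windows agree.
sustained = multiwindow_alert(long_burn=16.0, short_burn=18.0, threshold=14.4)

print(spike, sustained)
```

Requiring agreement trades a little detection speed for far fewer pages on blips that have already resolved themselves.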

Applying burn rate logic to canary analysis

The same principle applies directly to canary deployment analysis. During a canary, you have two populations: canary pods (serving some percentage of traffic) and stable pods (serving the rest). You want to know whether the canary pods are burning error budget at a meaningfully higher rate than the stable pods.

A PromQL expression for canary-vs-stable burn rate comparison:

# Canary pod error budget burn rate
(
  sum(rate(http_requests_total{
    status=~"5..",
    pod=~"payment-api-canary-.*"
  }[10m]))
  /
  sum(rate(http_requests_total{
    pod=~"payment-api-canary-.*"
  }[10m]))
) / 0.001

# Stable pod error budget burn rate (same window)
(
  sum(rate(http_requests_total{
    status=~"5..",
    pod=~"payment-api-[0-9a-z]+-[0-9a-z]+",
    pod!~"payment-api-canary-.*"
  }[10m]))
  /
  sum(rate(http_requests_total{
    pod=~"payment-api-[0-9a-z]+-[0-9a-z]+",
    pod!~"payment-api-canary-.*"
  }[10m]))
) / 0.001

The ratio of canary burn rate to stable burn rate is your decision signal. At 1.0 or below, the canary is performing at least as well as stable. At 2.0, the canary is burning budget twice as fast — concerning but potentially within tolerance if you have budget to spare. At 5.0+, you're looking at a real regression and the canary should not advance.
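Those bands translate into a small classifier. This is a sketch under assumptions: the names `Verdict` and `canary_verdict` are hypothetical, ratios between 1.0 and 2.0 are treated as passing (the bands above leave that range open), and the `min_stable_burn` floor is an added guard so an error-free stable fleet doesn't cause a division by zero.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    CAUTION = "caution"
    FAIL = "fail"


def canary_verdict(canary_burn: float, stable_burn: float,
                   caution_at: float = 2.0, fail_at: float = 5.0,
                   min_stable_burn: float = 0.01) -> Verdict:
    """Classify the canary by its burn rate relative to stable."""
    # Floor the denominator (assumption) so a stable burn of 0 is usable.
    ratio = canary_burn / max(stable_burn, min_stable_burn)
    if ratio >= fail_at:
        return Verdict.FAIL
    if ratio >= caution_at:
        return Verdict.CAUTION
    return Verdict.PASS


print(canary_verdict(canary_burn=1.2, stable_burn=1.0))
print(canary_verdict(canary_burn=6.0, stable_burn=1.0))
```

The floor value is a tuning knob, not a recommendation. Set it too low and a single canary error against a spotless stable fleet looks like a huge regression.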

The multi-window problem for short-lived canaries

The multi-window approach that works well for production alerting has a wrinkle in canary analysis: canary steps are typically 10-20 minutes long. A 6-hour slow window is meaningless when you're trying to make a go/no-go decision in 15 minutes.

For canary analysis, a practical adaptation is to use a 5m and a 15m window rather than 1h and 6h:

  • The 5m window detects acute incidents — a canary that's immediately throwing errors needs to fail fast, not wait 15 minutes.
  • The 15m window filters out transient spikes — a single upstream timeout blip at minute 3 of a 15-minute analysis window shouldn't fail a healthy canary.

Trigger a rollback only when both windows show an elevated burn rate. Trigger an immediate rollback when either window shows an extreme burn rate (10× stable or more). This gives you fast detection on obvious regressions while protecting against the false rollbacks that erode trust in automated deployment systems.
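The two rules combine into one decision function. A minimal sketch: `WindowRatios` and `rollback_decision` are hypothetical names, the inputs are canary-to-stable burn ratios over the 5m and 15m windows, the 10× extreme threshold comes from the text, and the 2× "elevated" threshold is an assumption you should tune.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowRatios:
    fast: float  # canary/stable burn ratio over the 5m window
    slow: float  # canary/stable burn ratio over the 15m window


def rollback_decision(r: WindowRatios, elevated: float = 2.0,
                      extreme: float = 10.0) -> str:
    """Return 'rollback_now', 'rollback', or 'continue'."""
    # Either window alone is enough when the regression is extreme.
    if r.fast >= extreme or r.slow >= extreme:
        return "rollback_now"
    # Otherwise both windows must agree that the burn is elevated.
    if r.fast >= elevated and r.slow >= elevated:
        return "rollback"
    return "continue"


print(rollback_decision(WindowRatios(fast=4.0, slow=1.1)))   # blip in 5m only
print(rollback_decision(WindowRatios(fast=12.0, slow=3.0)))  # extreme in 5m
```

Checking the extreme case first matters: it lets an obviously broken canary fail before the 15m window has had time to fill.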

Where burn rate alerting falls short

We're not saying burn rate replaces all other signals. It doesn't.

Burn rate on availability tells you about HTTP 5xx responses — it says nothing about latency degradation, correctness issues, or business metric regressions. A canary that returns HTTP 200 with 3× the p99 latency will pass an availability burn rate gate cleanly. You need latency SLOs (p99 < Xms over the request distribution) and business metric tracking alongside availability burn rate.

Burn rate also requires accurate SLO definitions. If your 99.9% SLO doesn't reflect your actual user-visible reliability — if your health check passes when your database is degraded, or if you're measuring uptime on a non-critical path — then burn rate is measuring noise. Fixing your SLO definitions is a prerequisite to meaningful burn rate analysis.

Finally, burn rate can be noisy at low request volumes. A service that handles 10 requests per minute has an error rate that jumps between 0% and 10% with single request failures. At that scale, even a 10-minute window gives you poor confidence intervals. For low-volume services, consider minimum request count gates before evaluating burn rate, or use a longer observation window at early canary steps.
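A minimum request count gate can be as simple as refusing to compute a burn rate until the window holds enough samples. In this sketch the function name `gated_burn_rate` and the default of 1,000 requests are arbitrary assumptions, and the right floor depends on how much noise you can tolerate.

```python
from typing import Optional


def gated_burn_rate(errors: int, requests: int,
                    budget_ratio: float = 0.001,
                    min_requests: int = 1000) -> Optional[float]:
    """Burn rate for the window, or None when there is too little traffic."""
    if requests < min_requests:
        return None  # not enough data; extend the window or hold the step
    return (errors / requests) / budget_ratio


# 10 requests/minute over 10 minutes: one failure would read as a 10x burn.
print(gated_burn_rate(errors=1, requests=100))

# Enough traffic to evaluate.
print(gated_burn_rate(errors=3, requests=2000))
```

Returning `None` rather than 0 keeps "no verdict yet" distinct from "healthy", so the caller can hold the canary step instead of promoting on missing data.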

A more useful alerting model

The shift from error-rate alerting to burn-rate alerting isn't primarily about the PromQL query — it's about reorienting your reliability model around what your SLO actually commits to.

Error rate answers: is the service degraded right now?
Burn rate answers: are we on track to meet our reliability commitment?

For production alerting, both matter — burn rate for page-worthy reliability risk, error rate for fast detection of severe incidents. For canary analysis specifically, burn rate is the more relevant frame because you're not just asking "is something wrong?" but "will promoting this version leave us worse off against our commitments than staying on the current version?"

That's a trajectory question. And trajectory requires burn rate, not a static threshold.