Canary Raj Mehta

Canary analysis gates: designing with SLO budgets instead of thresholds

Hard error thresholds (p99 > 200ms = fail) are brittle. SLO budgets give you a dynamic gate that adjusts to your baseline. Here's how to model it in a RolloutPolicy.

[Figure: Dashboard showing canary SLO budget consumption over time]

Most canary analysis implementations work on a simple premise: define a success condition, measure the metric, pass or fail. If your p99 latency stays below 200ms for the observation window, the canary advances. If it exceeds 200ms, it rolls back. The logic is easy to understand and easy to configure — which is probably why it's the default in most orchestration tools.

The problem is that "p99 < 200ms" is asking the wrong question. It's asking whether the canary is acceptable in absolute terms. The question you actually want answered is: "Is this canary consuming reliability capital at a rate that will leave us out of budget before the end of the measurement period?" That's an SLO budget question, not a threshold question.

Why static thresholds break in practice

Static thresholds create two distinct failure modes that are both expensive.

False rollback. Your service has a p99 of 180ms at steady state. You set a canary success condition of 190ms to be conservative. Traffic spikes during a routine rollout window (nothing to do with the canary) and the canary pod p99 temporarily reads 195ms for six minutes. The rollout fails, the on-call engineer gets paged at 11 PM, and two hours later they've confirmed it was traffic load, not a regression. The canary was fine. Your threshold was too tight for real traffic variance.

False pass. Your SLO is 99.9% availability over 30 days. Your error budget is 43.2 minutes per month. On March 22nd, your budget is 85% consumed; you've already had a rough month. You deploy a canary that runs at 0.3% error rate. Your success condition is "error rate < 0.5%." The canary passes. But 0.3% is three times the 0.1% error rate your SLO allows, so once promoted across your full request volume, that release burns the remaining 15% of your budget in about 36 hours (15% of a 720-hour window is 108 hours at the allowed rate, and a third of that at 3×). With nine days left in the month, you've promoted a release that all but guarantees an SLO miss for March.

The threshold model can't see either of these failure modes because it has no concept of baseline, variance, or budget position.

Error budgets as analysis gates

An error budget is the operationalization of an SLO: given a target of 99.9% availability over 30 days, you have 43.2 minutes of allowable downtime. The budget can be expressed in time, in request count (total_requests × 0.001), or as a ratio of the rolling window.
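The three ways of expressing a budget are plain arithmetic. The following is a minimal sketch, not part of any tool's API: the function names are hypothetical, and the 50 million request volume is an assumed figure chosen only to make the numbers round.

```python
def budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowable full-outage minutes over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_requests(slo_target: float, total_requests: int) -> float:
    """Allowable failed requests, given the total requests in the window."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(failed_requests: int, slo_target: float, total_requests: int) -> float:
    """Fraction of the request-count budget already used (can exceed 1.0)."""
    return failed_requests / budget_requests(slo_target, total_requests)


# 99.9% over 30 days, with an assumed 50M requests in the window.
print(round(budget_minutes(0.999, 30), 1))                    # 43.2
print(round(budget_requests(0.999, 50_000_000)))              # 50000
print(round(budget_consumed(42_500, 0.999, 50_000_000), 2))   # 0.85
```

The request-count form is usually the easiest one to feed into a gate, because canary and stable traffic can be counted separately.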

Budget burn rate is a more useful signal than instantaneous error rate. If you have 20% of your monthly budget remaining and a canary is burning that budget at 3× the rate of your stable baseline, a full promotion will exhaust what's left three times sooner than the stable version would; if stable is already burning at the rate the SLO allows, that leaves about two days. The canary should not advance.

The canonical Google SRE formulation (from the Site Reliability Workbook, chapter 5) models burn rate as a multiplier: a burn rate of 1 means you'll exhaust exactly your budget over the measurement period; a burn rate of 14.4 means you consume 2% of a 30-day budget in one hour and exhaust all of it in 50 hours. These are the numbers you want your canary gate to understand, not a raw latency threshold.
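Burn rate as a multiplier can be written down in two lines: the observed error rate divided by the error rate the SLO allows, and the window length divided by that ratio to get time to exhaustion. A small sketch, assuming a constant error rate over the period; the function names are hypothetical.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int, remaining: float = 1.0) -> float:
    """Hours until the remaining budget fraction is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days * 24 * remaining / rate


print(round(burn_rate(0.001, 0.999), 2))                        # 1.0
print(round(burn_rate(0.003, 0.999), 2))                        # 3.0
print(round(hours_to_exhaustion(1.0, 30), 1))                   # 720.0
print(round(hours_to_exhaustion(3.0, 30, remaining=0.2), 1))    # 48.0
```

The `remaining` argument is what a fixed threshold cannot express: the same burn rate is tolerable with a full budget and urgent with a fifth of it left.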

Modeling budget gates in a RolloutPolicy

In Kubestead's RolloutPolicy, you define your SLO target once and Kubestead evaluates budget consumption at each step. A minimal configuration looks like this:

apiVersion: delivery.kubestead.io/v1alpha1
kind: RolloutPolicy
metadata:
  name: payment-api
spec:
  slo:
    target: 0.999          # 99.9% availability
    window: 30d            # rolling 30-day window
    budgetConsumedAlert: 0.80   # tighten the gate when 80% consumed
  canary:
    steps:
      - weight: 5
        analysisWindow: 10m
      - weight: 20
        analysisWindow: 15m
      - weight: 50
        analysisWindow: 20m
      - weight: 100
  analysis:
    maxBurnRateMultiplier: 2.0   # canary burn rate must be < 2x stable

The maxBurnRateMultiplier field is the core gate. It says: "If the canary pods are burning error budget at more than twice the rate of the stable pods over the analysis window, fail the analysis." This is a relative gate — it adjusts automatically to your current error rate without you touching the policy when load patterns change.
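To make the relative comparison concrete, here is a rough sketch of such a gate. It is an illustration, not Kubestead's implementation: the function name and arguments are hypothetical, and both the handling of a zero-error stable baseline and the treatment of a ratio exactly equal to the multiplier are assumptions.

```python
def passes_relative_gate(
    canary_errors: int,
    canary_requests: int,
    stable_errors: int,
    stable_requests: int,
    max_multiplier: float,
) -> bool:
    """True if the canary burns budget at no more than max_multiplier x stable."""
    canary_rate = canary_errors / canary_requests
    stable_rate = stable_errors / stable_requests
    if stable_rate == 0:
        # Assumption: with a perfectly clean baseline, any canary error fails.
        return canary_rate == 0
    # Both burn rates share the same SLO denominator (1 - target), so the
    # ratio of burn rates equals the ratio of raw error rates.
    return canary_rate / stable_rate <= max_multiplier


# Stable: 40 errors in 100,000 requests (0.04%).
print(passes_relative_gate(3, 5_000, 40, 100_000, 2.0))   # True  (ratio 1.5)
print(passes_relative_gate(6, 5_000, 40, 100_000, 2.0))   # False (ratio 3.0)
```

Because the SLO target cancels out of the ratio, the gate needs no retuning when the baseline error rate drifts with load.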

The budgetConsumedAlert field changes behavior when you're close to SLO miss: when 80% of budget is consumed, the multiplier tightens automatically to 1.0, meaning the canary must burn at the same rate or better than stable to advance. If you're already out of budget, rollouts pause entirely until the window resets.
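The tightening rule described here amounts to a small lookup on budget position. A sketch under stated assumptions: the function name is hypothetical, `None` is used here to stand for "rollouts paused", and the thresholds are treated as inclusive.

```python
from typing import Optional


def effective_multiplier(
    budget_consumed: float,
    max_multiplier: float,
    alert_threshold: float,
) -> Optional[float]:
    """Multiplier to apply at this step, or None if rollouts should pause."""
    if budget_consumed >= 1.0:
        return None  # out of budget: pause until the window recovers
    if budget_consumed >= alert_threshold:
        # Close to an SLO miss: canary must be no worse than stable.
        return min(max_multiplier, 1.0)
    return max_multiplier


print(effective_multiplier(0.50, 2.0, 0.80))   # 2.0
print(effective_multiplier(0.85, 2.0, 0.80))   # 1.0
print(effective_multiplier(1.00, 2.0, 0.80))   # None
```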

What the analysis engine evaluates at each step

At each step, Kubestead runs three checks in sequence:

  1. Budget position check. What percentage of the SLO budget is currently consumed? If above the alert threshold, tighten the multiplier.
  2. Canary burn rate. Calculate the error budget burn rate for canary pods only, over the analysis window, and compare to stable pod burn rate for the same window.
  3. Confidence gate. At low traffic weights (typically <10%), confidence intervals on error rates are wide. Kubestead delays advancement if the canary hasn't accumulated sufficient request volume for a statistically meaningful comparison.

If all three pass, the rollout advances. If burn rate exceeds the multiplier, the rollout pauses and emits an event with the computed burn rates for both canary and stable, so you have context without digging into dashboards.
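The three checks can be strung together in a few lines. The sketch below follows the order described above but is not Kubestead's engine: `WindowStats`, `evaluate_step`, and the returned strings are hypothetical names, and the confidence gate is reduced to an assumed minimum request count (1,000) where a real engine would more likely compare confidence intervals.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request and error counts for one pod group over the analysis window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def evaluate_step(
    canary: WindowStats,
    stable: WindowStats,
    budget_consumed: float,
    max_multiplier: float = 2.0,
    alert_threshold: float = 0.80,
    min_canary_requests: int = 1000,  # assumed stand-in for a statistical test
) -> str:
    # 1. Budget position: pause when exhausted, tighten when close.
    if budget_consumed >= 1.0:
        return "pause"
    multiplier = max_multiplier
    if budget_consumed >= alert_threshold:
        multiplier = min(max_multiplier, 1.0)

    # 2. Canary vs. stable burn rate. The SLO target cancels out of the
    #    ratio, so comparing error rates is equivalent.
    if canary.error_rate > multiplier * stable.error_rate:
        return "pause"

    # 3. Confidence gate: hold the step until enough traffic has been seen.
    if canary.requests < min_canary_requests:
        return "wait"
    return "advance"


stable = WindowStats(requests=100_000, errors=40)
print(evaluate_step(WindowStats(5_000, 3), stable, budget_consumed=0.50))  # advance
print(evaluate_step(WindowStats(5_000, 3), stable, budget_consumed=0.85))  # pause
print(evaluate_step(WindowStats(500, 0), stable, budget_consumed=0.50))    # wait
```

The second call shows the budget position at work: the same canary that advances at 50% consumed is held back at 85%, because the multiplier has tightened to 1.0.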

The nuance: SLO budgets are not sufficient on their own

We want to be clear about the limits of this model. SLO budgets work when your SLO is correctly defined and your metrics are actually measuring user-visible behavior. If your availability SLO is based on a health endpoint that doesn't reflect real request success, budget burn rate is measuring the wrong thing. Garbage in, garbage out.

Budget-based gates also don't help you catch correctness regressions — a canary that returns HTTP 200 with subtly wrong data will pass any availability gate. You still need application-level assertions, integration tests, or business metric checks (conversion rate, transaction success rate) in your analysis portfolio. Budget analysis is necessary but not sufficient.

Similarly, if your SLO window is 30 days but you deploy dozens of times per day, the 30-day budget may have a lot of noise from previous incidents that aren't relevant to the current canary's risk profile. Some teams run a tighter rolling window — 7 days or even 24 hours — for their deployment gates specifically, while keeping the 30-day window for SLA reporting. Both are valid; they answer different questions.
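The effect of window length is easy to see with numbers. The sketch below uses made-up daily counts (one million requests and 200 errors per day, plus a single incident 21 days ago) and a hypothetical helper that measures each window against its own proportional budget.

```python
from typing import List, Tuple


def window_budget_consumed(
    daily: List[Tuple[int, int]],  # (requests, errors) per day, oldest first
    slo_target: float,
    window_days: int,
) -> float:
    """Fraction of the trailing window's own error budget that has been used."""
    recent = daily[-window_days:]
    total = sum(requests for requests, _ in recent)
    failed = sum(errors for _, errors in recent)
    if total == 0:
        return 0.0
    return failed / ((1.0 - slo_target) * total)


# Hypothetical history: a quiet baseline with one incident on day 10 of 30.
daily = [(1_000_000, 200)] * 30
daily[9] = (1_000_000, 20_200)

print(round(window_budget_consumed(daily, 0.999, 30), 2))   # 0.87
print(round(window_budget_consumed(daily, 0.999, 7), 2))    # 0.2
```

The 30-day figure is dominated by an incident three weeks old, while the 7-day figure reflects only recent behavior, which is why the two windows suit reporting and deployment gating respectively.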

A practical starting configuration

If you're adopting SLO-based analysis for the first time and don't have tight numbers yet, a conservative starting point:

  • Set maxBurnRateMultiplier: 3.0 — allows the canary to burn at up to 3× the stable rate before flagging. This is loose enough to prevent false rollbacks while still catching real regressions.
  • Set budgetConsumedAlert: 0.70 — tighten when you're within 30% of budget exhaustion.
  • Use a 10-minute analysis window at 5% weight, 15 minutes at 20%, 20 minutes at 50%. Shorter windows at low weight prevent slow rollouts; longer windows at high weight give you more confidence before full promotion.

After running this for a few weeks, look at your rollout event logs. If you see frequent pauses at the budget-consumed threshold, your budget windows may be too tight or your baseline error rate is already concerning for unrelated reasons. If you see no pauses at all, consider whether your multiplier threshold is catching anything real or just rubber-stamping every rollout.

The goal isn't zero rollbacks — it's rollbacks that correlate with real regressions. Budget-based gates give you the context to make that distinction reliably.