SLOs September 2, 2025

Writing Error Budget Policies That Actually Stop Deployments

Most engineering teams have an SLO document. Fewer have an error budget policy. Even fewer have an error budget policy that actually does something — that stops deployments, freezes the release pipeline, or triggers an incident review. The gap between "we have SLOs" and "our SLOs affect our behavior" is exactly this: the policy that translates an SLO measurement into a deploy-time decision.

This post is about closing that gap. Specifically: how to translate the SLO document in your wiki into a Kubestead errorBudgetPolicy block that actually stops a canary rollout when the budget is being burned faster than it should be.

What an Error Budget Policy Is (and Isn't)

An error budget policy is a decision rule: when this condition is true about our error budget state, take this action. It's not an SLO — the SLO is the target. It's not a burn rate alert — that's a notification. An error budget policy is an enforcement mechanism: deploy blocked, pipeline frozen, on-call paged.

Error budget policies in the style of the Google SRE Workbook's example generally call for some mix of three responses: (1) stop deploying, (2) prioritize reliability work over features, (3) freeze all changes. Most teams only need the first one for their deploy pipeline. The other two are organizational decisions that live outside the toolchain.

For Kubestead, the error budget policy applies at canary deploy time: if the current error budget state (remaining budget, current burn rate) doesn't meet the policy threshold, the rollout is blocked before it starts.

Translating Your SLO Into Policy Parameters

Start with your SLO document. Take a concrete service: checkout API, 99.9% availability SLO, 28-day rolling window. The error budget is:

error_budget_minutes = 28 * 24 * 60 * (1 - 0.999)
                     = 40,320 * 0.001
                     = 40.3 minutes per 28-day window
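
The same arithmetic can be sketched as a small Python helper, which is handy when you want to compare budgets across SLO targets. The function name is illustrative and not part of Kubestead.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total 'bad' minutes an availability SLO allows over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


# Checkout API from the example: 99.9% over a 28-day window.
budget = error_budget_minutes(0.999, 28)
print(f"{budget:.1f} minutes")  # 40.3 minutes
```

Tightening the same service to 99.99% cuts the budget by a factor of ten, to roughly 4 minutes per window.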

Your policy parameters need to answer two questions: how much remaining budget must exist before a deploy is allowed, and what burn rate is acceptable during the canary window?

A reasonable starting point for the remaining budget threshold: if less than 20% of the 28-day budget remains (about 8 minutes in this case), block all non-critical deploys. The 20% floor gives you reserve capacity for unexpected instability during the rest of the window.

For the canary burn rate threshold: during a canary window (typically 10-30 minutes), you'll tolerate some burn above baseline. A burn rate of 3x during the canary window means that if you promoted this canary to 100% traffic, you'd consume budget at 3x the sustainable pace. For most services, 2-3x is the right canary threshold — aggressive enough to catch regressions quickly, not so sensitive that normal traffic variance triggers false rollbacks.
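
To make the burn rate multiple concrete, here is a minimal Python sketch. It assumes the usual request-based definition (observed error rate divided by the error rate the SLO allows). The function names are illustrative, and this is not necessarily how Kubestead computes the value internally.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic in the lookback window, nothing burned
    allowed_error_rate = 1.0 - slo_target
    return (errors / total) / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Days for a full budget to be consumed at a constant burn rate."""
    return float("inf") if rate <= 0 else window_days / rate


# 30 failed requests out of 10,000 against a 99.9% SLO is roughly a 3x burn.
rate = burn_rate(errors=30, total=10_000, slo_target=0.999)
print(f"burn rate ~ {rate:.1f}x")
print(f"full budget gone in ~ {days_to_exhaustion(rate):.1f} days")
```

At a sustained 3x, a full 28-day budget lasts a little over 9 days, which is why a multiple in the 2-3x range is a meaningful signal over a short canary window.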

The Kubestead errorBudgetPolicy Block

Translating the above into a Kubestead policy:

errorBudgetPolicy:
  sloTarget: 0.999
  window: 28d
  metricsBackend: prometheus
  remainingBudgetMinFloor: 0.20   # block deploy if < 20% budget remaining
  canaryBurnRateMax: 2.5           # fail canary if burn rate > 2.5x during analysis
  burnRateLookbackWindow: 10m     # evaluation window for burn rate calculation
  action:
    onBudgetFloor: block           # prevent rollout from starting
    onBurnRateExceeded: rollback   # trigger automatic rollback

The metricsBackend: prometheus directive tells Kubestead to query your Prometheus instance for the current error count and total request count over the SLO window, compute the remaining budget, and evaluate the policy thresholds before allowing the canary to start.
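
The pre-deploy check described above can be sketched in Python. This is an illustration of the logic under stated assumptions, not Kubestead's implementation: it assumes a request-based budget (allowed errors = total requests x (1 - SLO)), and the function names are hypothetical. The error and total counts would come from your metrics backend.

```python
def remaining_budget_fraction(errors: int, total: int, slo_target: float) -> float:
    """Fraction of the request-based error budget still unspent.

    Goes negative once the budget is overspent.
    """
    allowed_errors = total * (1.0 - slo_target)
    if allowed_errors == 0:
        return 0.0  # no traffic or a 100% target: treat as no budget
    return 1.0 - errors / allowed_errors


def pre_deploy_decision(errors: int, total: int,
                        slo_target: float = 0.999,
                        floor: float = 0.20) -> str:
    """Mirror of remainingBudgetMinFloor: block below the floor, else allow."""
    remaining = remaining_budget_fraction(errors, total, slo_target)
    return "block" if remaining < floor else "allow"


# 10M requests in the window allows about 10,000 errors at 99.9%.
print(pre_deploy_decision(errors=5_000, total=10_000_000))  # allow
print(pre_deploy_decision(errors=8_500, total=10_000_000))  # block
```

In the second case about 15% of the budget remains, which is under the 20% floor, so the rollout would not start.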

The Budget Floor Problem: When You Can't Deploy

The onBudgetFloor: block action creates a real operational tension. Because the window is rolling, it never resets all at once: once the budget floor is hit, you're blocked from deploying until enough old errors age out of the 28-day window to lift the remaining budget back above the floor. Depending on when those errors occurred, that can take 5-10 days. Engineering teams find this frustrating because critical bug fixes get blocked too, not just feature work.

The standard solution is a policy override path: a designated team member can approve a deploy over the budget floor block, with a required post-deploy review. Kubestead supports this via a requireApproval: true flag on the policy — the block becomes a gate that requires explicit human approval rather than a hard stop. The approval is logged in the rollout event stream for audit purposes.
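
The difference between a hard stop and an approval gate can be sketched as a small decision function. This is a hypothetical model of the behavior described above, not Kubestead's code: the return values and the approved_by parameter are assumptions made for illustration.

```python
from typing import Optional


def floor_gate(remaining: float, floor: float,
               require_approval: bool,
               approved_by: Optional[str] = None) -> str:
    """Decide what happens to a rollout relative to the budget floor."""
    if remaining >= floor:
        return "allow"
    if not require_approval:
        return "block"  # hard stop: no override path
    if approved_by:
        # Override granted; a post-deploy review is expected.
        return "allow_with_review"
    return "pending_approval"


print(floor_gate(0.15, 0.20, require_approval=False))  # block
print(floor_gate(0.15, 0.20, require_approval=True))   # pending_approval
print(floor_gate(0.15, 0.20, require_approval=True, approved_by="oncall-lead"))  # allow_with_review
```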

We're not suggesting every budget floor block should be overridable — that would defeat the purpose. We're saying that "block everything" policies get circumvented entirely when they're too strict, and a structured override path is better than informal bypassing.

Calibrating Thresholds Over Time

The thresholds in the initial policy are starting points, not constants. After running the policy through a full 28-day window, review the rollout event log: how many rollbacks were triggered by the burn rate threshold? Of those, how many correlated with real user-visible regressions, and how many were false positives from traffic spikes?

If you see 4+ false positive rollbacks in a window with no corresponding user impact, your canaryBurnRateMax is too tight. Raise it, for example from 2.5 to 3.5. If you see zero rollbacks but also notice post-deploy incidents that weren't caught during the canary window, your burn rate threshold might be too permissive, or your analysis window might be too short.

This is not set-and-forget configuration. Error budget policies need the same calibration discipline as any alerting threshold. The rollback event schema provides the data to drive that calibration: trigger burn rate, baseline burn rate at trigger time, canary traffic percentage, service name. Run that analysis quarterly and your thresholds will improve steadily over time.
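
A review pass over the rollback events might look like the following Python sketch. The first four fields mirror the schema fields listed above, but their exact names are assumptions. The user_impact flag is also an assumption: it stands for a label a reviewer adds by hand after checking whether the rollback matched a real regression.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RollbackEvent:
    service: str
    trigger_burn_rate: float
    baseline_burn_rate: float
    canary_traffic_pct: float
    user_impact: bool  # ASSUMPTION: set manually during review, not in the schema


def calibration_hint(events: List[RollbackEvent], fp_limit: int = 4) -> str:
    """Apply the '4+ false positives in a window' rule of thumb."""
    false_positives = [e for e in events if not e.user_impact]
    if len(false_positives) >= fp_limit:
        return "raise canaryBurnRateMax"
    if not events:
        return "no rollbacks: check for post-deploy incidents the canary missed"
    return "keep current threshold"


window = [
    RollbackEvent("checkout-api", 2.7, 0.9, 10.0, user_impact=False)
    for _ in range(4)
]
print(calibration_hint(window))  # raise canaryBurnRateMax
```

Comparing trigger_burn_rate against baseline_burn_rate for the false positives also shows how far the threshold needs to move: if the baseline was already elevated at trigger time, the spike probably came from traffic rather than the canary.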

Multi-Service Policy Inheritance

For platform teams managing 50+ microservices, writing a custom error budget policy for each service is not practical. Kubestead supports policy inheritance: a default cluster-level policy applies to all services without an explicit policy, and service-level policies override specific fields while inheriting the rest.

A typical setup: the cluster policy sets canaryBurnRateMax: 3.0 and remainingBudgetMinFloor: 0.15 as defaults. Services with tighter SLOs (99.99% availability, payment processing) override to canaryBurnRateMax: 1.5. Internal tooling services with no SLO inherit the cluster policy but set onBudgetFloor: warn instead of block. The tier structure mirrors the service criticality, not the platform team's preference.
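
The inheritance behavior can be modeled as a recursive dictionary merge, where service-level fields override cluster defaults and everything else is inherited. The Python sketch below illustrates those semantics only. How Kubestead resolves inheritance internally is not shown here, and the service names are hypothetical.

```python
def merge_policy(base: dict, override: dict) -> dict:
    """Return base with override applied; nested blocks merge field by field."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_policy(merged[key], value)
        else:
            merged[key] = value
    return merged


cluster_policy = {
    "canaryBurnRateMax": 3.0,
    "remainingBudgetMinFloor": 0.15,
    "action": {"onBudgetFloor": "block", "onBurnRateExceeded": "rollback"},
}

# Tighter burn rate for a payment service; softer floor action for internal tooling.
payments = merge_policy(cluster_policy, {"canaryBurnRateMax": 1.5})
internal_tool = merge_policy(cluster_policy, {"action": {"onBudgetFloor": "warn"}})

print(payments["canaryBurnRateMax"], payments["remainingBudgetMinFloor"])  # 1.5 0.15
print(internal_tool["action"])
```

The nested merge matters for the action block: overriding only onBudgetFloor should leave onBurnRateExceeded at its inherited value rather than dropping it.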