Canary February 10, 2026

What We Shipped in February 2026: Multi-Cluster Orchestration and Burn Rate v2

Kubestead v0.8.0: Multi-Cluster Rollouts and the New REST API

v0.8.0 is our biggest release since the initial controller launch. The two headline features, multi-cluster rollout orchestration and the redesigned burn rate policy engine, address the two requests we heard most consistently from teams scaling past three or four clusters: "we need coordinated rollouts across regions" and "our burn rate thresholds feel like magic numbers we don't know how to tune." This post covers what shipped, the design rationale behind it, and what we deliberately didn't build yet.

Multi-Cluster Rollout Orchestration

The original Kubestead controller was single-cluster by design. If you ran 4 clusters and wanted to roll out a change across all of them, you ran 4 independent Kubestead controllers and coordinated the sequencing manually — or didn't coordinate, which meant the clusters could be in different states during the rollout window.

v0.8.0 introduces a cluster federation model. One controller acts as the orchestrator; it issues rollout commands to agent controllers running in each member cluster. The orchestrator manages the global rollout state: which clusters are in the canary phase, which have promoted to stable, which triggered rollbacks. Analysis is evaluated per-cluster with per-cluster metrics backends; the promotion decision is global — all clusters advance together, or any cluster can independently trigger rollback while others hold.

The spec change is minimal. You add a clusters block to your rollout spec that lists the member cluster endpoints and specifies whether to roll out serially (cluster by cluster, each validating before the next starts) or in parallel (all clusters start the canary simultaneously):

rollout:
  name: checkout-service
  clusters:
    mode: serial           # or: parallel
    members:
      - name: us-east-1
        endpoint: https://kubestead-agent.us-east-1.internal
      - name: us-west-2
        endpoint: https://kubestead-agent.us-west-2.internal
      - name: eu-west-1
        endpoint: https://kubestead-agent.eu-west-1.internal

Serial mode is the safer default for high-traffic multi-region deployments. You validate the canary in us-east-1 first, promote it to stable, then start the canary in us-west-2, and only then move on to eu-west-1. If us-east-1 triggers a rollback, us-west-2 and eu-west-1 never start. A bad deploy is contained to a single region before it can reach the rest.

Parallel mode is faster but all-or-nothing: all regions run the canary simultaneously, and any single cluster triggering a rollback causes a coordinated rollback across all clusters. It is useful when the change is low-risk and you need the fastest possible global rollout.
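
To make the difference between the two modes concrete, here is a minimal Python sketch of the sequencing semantics described above. It is an illustration only, not Kubestead's controller code: the function name run_rollout, the state labels, and the canary_passes callback are all hypothetical. It also assumes that in serial mode a cluster that was already promoted stays stable when a later cluster rolls back, which this post does not specify.

```python
from typing import Callable, Dict, List


def run_rollout(
    members: List[str],
    mode: str,
    canary_passes: Callable[[str], bool],
) -> Dict[str, str]:
    """Return each cluster's final state: 'stable', 'rolled_back', or 'not_started'.

    Hypothetical model of the serial/parallel semantics, not a real Kubestead API.
    """
    if mode == "serial":
        state = {name: "not_started" for name in members}
        for name in members:
            if canary_passes(name):
                state[name] = "stable"
            else:
                state[name] = "rolled_back"
                # Clusters later in the list never start their canary.
                break
        return state

    if mode == "parallel":
        # Every cluster runs its own analysis at the same time.
        results = {name: canary_passes(name) for name in members}
        # One failing cluster rolls back all of them.
        outcome = "stable" if all(results.values()) else "rolled_back"
        return {name: outcome for name in members}

    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    members = ["us-east-1", "us-west-2", "eu-west-1"]
    # Pretend the canary fails only in us-west-2.
    fails_in_west = lambda name: name != "us-west-2"
    print(run_rollout(members, "serial", fails_in_west))
    print(run_rollout(members, "parallel", fails_in_west))
```

With the three members from the spec above and a canary that fails only in us-west-2, the serial run leaves us-east-1 stable and never starts eu-west-1, while the parallel run rolls back all three clusters.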

The Limitation: Cross-Cluster Metric Comparison

Multi-cluster analysis runs independently in each cluster. The orchestrator can't compare canary error rate in us-east-1 against canary error rate in eu-west-1 as a single composite signal. Each cluster's analysis template evaluates against that cluster's metrics backend. This is a deliberate constraint — forcing cross-cluster metric federation would require a centralized metrics store that we didn't want to mandate as a deployment dependency.

If you need cross-cluster signal aggregation (e.g., roll back all clusters if any single cluster crosses a threshold), configure the promotion threshold in each cluster's analysis template conservatively enough that the global blast radius is bounded. We're working on an optional centralized analysis mode for a future release.

Burn Rate Policy Engine v2

The original errorBudgetPolicy block required you to configure burn rate thresholds manually: set maxBurnRate to a number, pick a lookback window, and hope you calibrated it correctly. Most teams set it to 2.0 and never touched it again.

Burn Rate v2 replaces the static maxBurnRate with a dynamic threshold that adjusts based on remaining budget. The new burnRatePolicy.mode: adaptive configuration does the math for you:

errorBudgetPolicy:
  sloTarget: 0.999
  window: 28d
  burnRatePolicy:
    mode: adaptive
    canaryAnalysisWindow: 15m
    maxBudgetConsumptionPct: 1.0   # don't allow canary to consume > 1% of budget

Instead of specifying a burn rate multiplier, you specify the maximum percentage of the 28-day budget that the canary window is allowed to consume. Kubestead computes the implied burn rate threshold at deploy time based on current remaining budget, the canary traffic percentage, and the analysis window duration. As budget depletes, the allowed burn rate threshold tightens automatically.

The practical effect: early in the window (80%+ budget remaining), canaries can tolerate higher transient error rates. Late in the window (under 30% remaining), the threshold tightens and marginal canaries that would have passed are now blocked. You don't have to think about this — it happens based on the budget math.
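
The exact formula Kubestead uses is not spelled out in this post, so the Python sketch below shows only one plausible way such a threshold could be derived from the three inputs named above. It rests on two assumptions: that the allowance is maxBudgetConsumptionPct percent of the budget still remaining, and that the canary burns budget in proportion to its share of traffic. The function name and parameter names are hypothetical.

```python
def adaptive_burn_rate_threshold(
    max_budget_consumption_pct: float,
    remaining_budget_fraction: float,
    slo_window_minutes: float,
    analysis_window_minutes: float,
    canary_traffic_fraction: float,
) -> float:
    """Largest canary burn rate that keeps the canary within its budget allowance.

    Assumed model (not confirmed by the release notes):
      budget consumed by canary =
          burn_rate * canary_traffic_fraction
                    * (analysis_window / slo_window)
      allowance = (max_budget_consumption_pct / 100) * remaining_budget_fraction
    Setting consumed == allowance and solving for burn_rate gives the result.
    """
    if not 0.0 <= remaining_budget_fraction <= 1.0:
        raise ValueError("remaining_budget_fraction must be in [0, 1]")
    if not 0.0 < canary_traffic_fraction <= 1.0:
        raise ValueError("canary_traffic_fraction must be in (0, 1]")
    if slo_window_minutes <= 0 or analysis_window_minutes <= 0:
        raise ValueError("window durations must be positive")

    allowance = (max_budget_consumption_pct / 100.0) * remaining_budget_fraction
    window_share = analysis_window_minutes / slo_window_minutes
    return allowance / (window_share * canary_traffic_fraction)


if __name__ == "__main__":
    slo_window = 28 * 24 * 60  # 28d expressed in minutes
    for remaining in (1.0, 0.8, 0.3):
        threshold = adaptive_burn_rate_threshold(
            max_budget_consumption_pct=1.0,
            remaining_budget_fraction=remaining,
            slo_window_minutes=slo_window,
            analysis_window_minutes=15,
            canary_traffic_fraction=0.10,  # assumed 10% canary traffic
        )
        print(f"remaining={remaining:.0%} -> max canary burn rate {threshold:.2f}")
```

Under these assumptions the threshold scales linearly with remaining budget, so a canary evaluated with 30% of the budget left is held to a threshold 0.3 times the one it would face with a full budget. The real engine may weight these inputs differently.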

CLI Read Operations Without kubeconfig

A small but frequently requested change: kbs status, kbs rollout list, and kbs rollout inspect now support authentication via API token instead of requiring a kubeconfig with cluster-admin access. This matters in two situations: developers who need rollout status visibility without access to the production kubeconfig, and CI pipelines that query rollout state as part of a deployment gate.

export KBS_TOKEN=your-api-token
kbs rollout list --namespace production

Write operations (start rollout, trigger rollback, approve a blocked rollout) still require a Kubernetes ServiceAccount with appropriate RBAC permissions. The token-auth path is read-only by design.

What We Didn't Ship

Two features were scoped for v0.8.0 and deferred: the web dashboard rollout timeline view and automatic analysis template generation from existing Prometheus recording rules. Both are in active development for the v0.9.0 release. The dashboard timeline view was nearly complete but had a rendering performance issue at high rollout history volumes (1,000+ rollout events) that we didn't have time to fix correctly before the release. We'd rather ship it right than ship it broken.

Automatic analysis template generation from recording rules is a harder problem than we initially estimated — the heuristics for inferring success/failure thresholds from existing alert rules are not reliable enough to ship as a recommendation without significant false-positive risk. We're continuing the research.

v0.8.0 is available on the Scale plan and above. Multi-cluster orchestration requires at least one orchestrator instance at Scale tier and agent instances at Team tier or above in each member cluster. Full upgrade instructions are in the quickstart guide.