How Kubestead works

Kubestead is a Kubernetes operator that runs in your cluster, reads metrics from Prometheus or Datadog, and makes canary advancement and rollback decisions automatically. It is not a proxy, not a sidecar, and not in your request path: it manages replica counts and watches your existing metrics endpoints.

The architecture

[Figure: architecture diagram showing the Kubestead controller reading from Prometheus and Datadog and managing canary pods in a Kubernetes cluster]

1

Install (5 min)

Run kubectl apply -f kubestead-controller.yaml. There is no sidecar and no additional agent to deploy; Kubestead reads existing Kubernetes events and your metrics endpoint.

2

Write a RolloutPolicy

A single CRD in your repo defines canary steps, metric sources, thresholds, and rollback behavior. Everything is code-reviewed before it affects production.

3

Canary traffic splitting

Kubestead manages replica counts to split traffic at configurable percentages, so no service mesh is required. Replica-count splitting can only be as fine-grained as the number of pods allows, so for sub-percentage canary slices Kubestead supports weighted routing via Istio, Linkerd, or Envoy Gateway. It works with any ingress, including nginx, Istio, and Envoy Gateway.
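The arithmetic behind replica-count splitting shows why very small slices need weighted routing. The sketch below is illustrative only and is not Kubestead's actual algorithm: it assumes traffic is spread evenly across all pods behind one Service, and the function name canary_replicas is hypothetical.

```python
import math


def canary_replicas(total_replicas: int, canary_percent: float) -> tuple[int, float]:
    """Return (canary pod count, traffic share actually achieved, in percent).

    Assumes every pod behind the Service receives an equal share of traffic,
    so the canary share is simply canary_pods / total_pods.
    """
    if total_replicas < 1:
        raise ValueError("total_replicas must be at least 1")
    if not 0 < canary_percent <= 100:
        raise ValueError("canary_percent must be in (0, 100]")

    # Round up so a small non-zero target still gets at least one canary pod.
    pods = math.ceil(total_replicas * canary_percent / 100)
    pods = min(pods, total_replicas)
    achieved = 100 * pods / total_replicas
    return pods, achieved
```

Under these assumptions, a 10% target on 20 replicas maps cleanly to 2 canary pods, but a 0.5% target on the same 20 replicas still needs 1 whole pod and therefore sends 5% of traffic to the canary. That gap is what weighted routing closes.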

4

Metric evaluation window

After each step, Kubestead waits a configurable soak time (default 5 min) and queries your metrics. Pass → advance to next step. Fail → instant rollback, rollback report generated automatically.
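The advance-or-rollback loop described above can be sketched as a small state machine. This is a minimal sketch, not Kubestead's controller code: run_rollout, set_canary_percent, and check_metrics are hypothetical names, and the soak wait is injectable so the example can run without actually sleeping.

```python
import time
from typing import Callable, Sequence


def run_rollout(
    steps: Sequence[int],
    set_canary_percent: Callable[[int], None],
    check_metrics: Callable[[], bool],
    soak_seconds: float = 300.0,  # 5 min, matching the default soak time
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Walk through the canary steps; return "promoted" or "rolled_back"."""
    for percent in steps:
        set_canary_percent(percent)
        sleep(soak_seconds)       # let the canary soak before judging it
        if not check_metrics():   # threshold breached: undo and stop
            set_canary_percent(0)
            return "rolled_back"
    return "promoted"
```

Calling run_rollout([5, 25, 100], ...) with a check that fails on the second evaluation sets the canary to 5%, then 25%, then back to 0% and returns "rolled_back"; the 100% step is never reached.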

5

Audit trail

Every canary decision — step advancement, threshold breach, rollback trigger, or manual override — is written to a structured, tamper-evident JSON event log. The log includes the exact metric query result and timestamp at the moment of the decision. Exportable via webhook to your SIEM. Available in all paid plans.
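This page does not say how the log is made tamper-evident. One common technique is a hash chain, in which each entry stores the SHA-256 hash of the previous entry, so editing any past entry invalidates every entry after it. The sketch below illustrates that idea only; the field names (decision, query_result, prev_hash, hash) are assumptions, not Kubestead's log schema.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(entry: dict) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) gives a stable hash.
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(log: list[dict], decision: str, query_result: float) -> dict:
    """Append a decision record linked to the previous entry by its hash."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        "query_result": query_result,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry["hash"] = _digest(entry)  # hash covers every field set above
    log.append(entry)
    return entry


def verify(log: list[dict]) -> bool:
    """Return True only if every entry matches its hash and its predecessor."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

A chain like this detects edits, reordering, and deletions from the middle of the log. It does not detect truncation of the most recent entries on its own, which is one reason to export events to an external system as they are written.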

Common questions

Kubestead does not require a service mesh. It works with standard Kubernetes Deployments and replica-count traffic splitting. Service mesh support (Istio, Linkerd) is available for weighted routing when you need sub-percentage canary slices.
Kubestead treats metric-source unavailability as a safe-to-rollback condition by default. The reasoning: an analysis system that can't read metrics cannot verify the canary is healthy, so continuing is the wrong call. You can override this to pause instead — meaning the canary halts at the current step and waits for metrics to recover — by setting onMetricUnavailable: pause in your RolloutPolicy.
Custom metric queries are supported. RolloutPolicy accepts arbitrary PromQL, Datadog query strings, and New Relic NRQL. You write the query; Kubestead evaluates the threshold.
Argo Rollouts is a mature, flexible controller that handles traffic splitting reliably. Its AnalysisTemplate model is powerful but places the analysis-to-decision wiring on you — you define the template, the metrics provider, and the pass/fail evaluation, then decide what Argo does with the result. Most teams end up with a human watching a dashboard before promoting. Kubestead is more opinionated: the metric evaluation and rollback decision are built into the controller. You define thresholds in a RolloutPolicy; Kubestead acts on them without requiring any external AnalysisTemplate or manual gate. If you want maximum flexibility in how analysis works, Argo Rollouts is the right choice. If you want the decision loop to run itself with predictable behavior, Kubestead is built for that.
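The threshold evaluation and the onMetricUnavailable behavior described in the answers above can be sketched together. The response shape follows the Prometheus HTTP API instant-query format (/api/v1/query). Everything else is a hypothetical illustration rather than Kubestead's API: the decide function, the max_value threshold, the "rollback" name for the default mode, and the choice to treat an empty result as "unavailable".

```python
from typing import Optional


def scalar_from_prometheus(response: Optional[dict]) -> Optional[float]:
    """Extract one sample value from an instant-query response, or None.

    None stands for "metric source unavailable": no response at all, an
    error status, or an empty result vector (the last is an assumption).
    """
    if not response or response.get("status") != "success":
        return None
    result = response.get("data", {}).get("result", [])
    if not result:
        return None
    # Each vector sample is {"metric": {...}, "value": [timestamp, "string"]}.
    return float(result[0]["value"][1])


def decide(response: Optional[dict], max_value: float,
           on_metric_unavailable: str = "rollback") -> str:
    """Return "advance", "rollback", or "pause" for one evaluation."""
    value = scalar_from_prometheus(response)
    if value is None:
        # Canary health cannot be verified: roll back by default, or pause.
        return "pause" if on_metric_unavailable == "pause" else "rollback"
    return "advance" if value <= max_value else "rollback"
```

With a response carrying the value "0.004", decide(response, 0.01) returns "advance" and decide(response, 0.001) returns "rollback". With no response, the result is "rollback" unless on_metric_unavailable is set to "pause".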

Ready to automate your canaries?

Start free for up to 5 services. No credit card required.