Ship fast.
Roll back faster.
Kubestead canaries every release against live metrics from Prometheus, Datadog, and your other existing backends, and reverts in under 90 seconds when a canary breaches your SLO thresholds. No dashboard watching, no 3 AM page.
Reads your existing metrics — no agents to install
Rollouts are still the #1 source of 3 AM pages
Your team runs 60+ microservices. Blue-green ties up double the capacity for every release. Feature flags don't roll back infrastructure-level bugs. Argo Rollouts splits traffic reliably, but someone still has to watch a dashboard to decide when a canary has failed. Most teams find out a deploy was bad when the on-call engineer gets paged. Kubestead removes that delay and that human from the failure path.
Autonomous canary delivery in 3 steps
Annotate your Deployment
Add a single Kubestead annotation to your existing Kubernetes Deployment manifest. No sidecar. No agent. Five-minute install.
Define your SLO thresholds
Tell Kubestead: 'p99 < 350ms, error rate < 0.5%, apdex > 0.9'. It checks those thresholds against live metrics from Prometheus, Datadog, New Relic, Grafana Mimir, or VictoriaMetrics.
Ship — Kubestead handles the rest
It rolls pods forward in configurable increments. If metrics cross your thresholds, it reverts in under 90 seconds and files a rollback report — no pager.
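
The step-and-check behavior described above can be sketched in a few lines of Python. This is an illustration only, not Kubestead's implementation: the names `Step`, `Breach`, `run_canary`, and the `read_metrics` callback are hypothetical, and the soak window is modeled as a fixed number of metric checks rather than wall-clock minutes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    percentage: int   # share of traffic sent to the canary at this step
    soak_checks: int  # assumption: soak time modeled as a count of metric checks


@dataclass
class Breach:
    metric: str
    value: float
    limit: float
    percentage: int


def first_breach(sample: Dict[str, float], limits: Dict[str, float],
                 percentage: int) -> Optional[Breach]:
    """Return the first metric whose measured value exceeds its limit, if any."""
    for metric, limit in limits.items():
        value = sample[metric]
        if value > limit:
            return Breach(metric, value, limit, percentage)
    return None


def run_canary(steps: List[Step], limits: Dict[str, float],
               read_metrics: Callable[[int], Dict[str, float]]) -> Optional[Breach]:
    """Walk the steps in order; return None on full promotion, or the Breach
    that would trigger a rollback."""
    for step in steps:
        for _ in range(step.soak_checks):
            breach = first_breach(read_metrics(step.percentage), limits, step.percentage)
            if breach is not None:
                return breach  # stop immediately; the caller reverts the rollout
    return None
```

With limits of `{"errorRate": 0.005, "p99Latency": 0.350}`, a release whose error rate jumps to 2% once it reaches 30% of traffic is stopped at that step, and the returned `Breach` names the metric and the measured value.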
Built for the SRE who can't afford false confidence
Real-traffic metric validation
Evaluates canaries against live production metrics from your Prometheus, Datadog, or New Relic endpoint — not readiness probes, not synthetic checks. The canary passes only when real user traffic confirms it.
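
To make "live production metrics" concrete, here is a minimal sketch of reading one value from the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The helper names `build_query_url` and `parse_instant_value` are hypothetical, and this is not a description of how Kubestead itself talks to Prometheus.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_value(payload: dict) -> float:
    """Extract a single number from an instant-query response.

    Assumes the query returns a vector with exactly one series, which is
    the case for an aggregated expression such as sum(rate(...)).
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected exactly one series, got {len(result)}")
    _timestamp, raw_value = result[0]["value"]  # Prometheus encodes the value as a string
    return float(raw_value)
```

The decoded value can then be compared directly against a threshold such as an error rate of 0.005.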
Sub-90-second rollback
When a threshold breaches, Kubestead reverts replica counts and drains canary traffic in under 90 seconds. A rollback report is auto-generated showing the exact metric name, measured value, and timestamp — ready for your postmortem.
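
As an illustration of what such a report could carry, the sketch below models the three fields named above (metric name, measured value, timestamp) plus the deployment and threshold for context. The `RollbackReport` type and its field names are hypothetical, not Kubestead's actual report schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RollbackReport:
    deployment: str
    metric: str            # name of the metric that breached
    measured_value: float  # value observed at breach time
    threshold: float       # configured limit that was exceeded
    breached_at: str       # ISO 8601 timestamp, normalized to UTC


def build_report(deployment: str, metric: str, measured_value: float,
                 threshold: float, when: datetime) -> RollbackReport:
    """Create a report; `when` should be timezone-aware and is converted to UTC."""
    stamp = when.astimezone(timezone.utc).isoformat()
    return RollbackReport(deployment, metric, measured_value, threshold, stamp)


def to_json(report: RollbackReport) -> str:
    """Serialize the report for a webhook body or a postmortem attachment."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)
```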
Works with your existing stack
Reads from Prometheus, Datadog, New Relic, Grafana Mimir, and VictoriaMetrics out of the box. No vendor lock-in.
Namespace-level RBAC
The Kubestead controller runs with least-privilege permissions. Deploy to a single namespace or cluster-wide: your security team's call.
GitOps-native
RolloutPolicy manifests live in your Git repository alongside your service manifests. Policy changes go through code review before they affect production rollout behavior. Works with Argo CD and Flux CD out of the box.
Alert suppression during canary
Canary phases produce transient metric noise. Kubestead can optionally mute upstream alerting rules during the active canary window, so your on-call is paged only for genuine breaches that Kubestead did not catch in time, not for expected variance.
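
One way to picture a time-boxed mute is a Prometheus Alertmanager silence that expires when the canary window ends. The sketch below only builds the request body for Alertmanager's `POST /api/v2/silences` endpoint. Whether Kubestead uses Alertmanager silences is an assumption made for illustration, and `canary_silence` along with its `service` label matcher are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def canary_silence(service: str, start: datetime, window_minutes: int) -> dict:
    """Build an Alertmanager v2 silence body covering one canary window.

    Assumes alerts carry a `service` label; adjust the matcher to whatever
    labels your alerting rules actually set.
    """
    start_utc = start.astimezone(timezone.utc)
    end_utc = start_utc + timedelta(minutes=window_minutes)
    return {
        "matchers": [
            {"name": "service", "value": service, "isRegex": False, "isEqual": True},
        ],
        "startsAt": start_utc.isoformat(),
        "endsAt": end_utc.isoformat(),  # the silence expires on its own at window end
        "createdBy": "canary-controller",
        "comment": f"canary window for {service}",
    }
```

Because the silence carries its own end time, alerting resumes automatically even if the controller crashes mid-rollout.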
One YAML block. Autonomous rollout.
One CRD defines the full rollout contract: step percentages, soak times, PromQL, Datadog, or NRQL metric queries, threshold values, and notification targets. Store it in Git. Code-review it like code. Let Kubestead execute it at deploy time.
- Configurable step percentages and per-step soak times
- Arbitrary PromQL, Datadog, or NRQL metric queries
- Per-threshold rollback triggers with comparison operators
- Slack, PagerDuty, Opsgenie, or generic webhook on rollback
apiVersion: kubestead.io/v1alpha1
kind: RolloutPolicy
metadata:
  name: api-gateway-policy
  namespace: production
spec:
  targetDeployment: api-gateway
  canary:
    steps:
      - percentage: 10
        soakMinutes: 5
      - percentage: 30
        soakMinutes: 8
      - percentage: 60
        soakMinutes: 10
      - percentage: 100
  metrics:
    source: prometheus
    queries:
      errorRate: |
        sum(rate(http_requests_total{
          status=~"5..",job="api-gateway"}[2m]))
        / sum(rate(http_requests_total{
          job="api-gateway"}[2m]))
      p99Latency: |
        histogram_quantile(0.99, sum(rate(
          http_duration_seconds_bucket{
          job="api-gateway"}[2m])) by (le))
  threshold:
    rollbackOn:
      - metric: errorRate
        exceeds: 0.005
      - metric: p99Latency
        exceeds: 0.350
  notifications:
    onRollback:
      slack: https://hooks.slack.com/services/T00/B00/xxx
      includeMetricTrace: true
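
For a sense of timing, the steps in the policy above add up to a minimum of 23 minutes from first canary traffic to 100%. The sketch below works that out, assuming each step begins as soon as the previous soak finishes and traffic shifts take no time. The `schedule` helper is hypothetical, and the step values are copied from the example policy.

```python
from typing import Dict, List, Tuple

# Steps copied from the example RolloutPolicy; the last step has no soak time.
STEPS: List[Dict[str, int]] = [
    {"percentage": 10, "soakMinutes": 5},
    {"percentage": 30, "soakMinutes": 8},
    {"percentage": 60, "soakMinutes": 10},
    {"percentage": 100},
]


def schedule(steps: List[Dict[str, int]]) -> List[Tuple[int, int]]:
    """Return (minutes elapsed, traffic percentage) at the start of each step."""
    elapsed = 0
    timeline = []
    for step in steps:
        timeline.append((elapsed, step["percentage"]))
        elapsed += step.get("soakMinutes", 0)  # a missing soak time counts as zero
    return timeline


for minute, percentage in schedule(STEPS):
    print(f"t+{minute:>2} min: {percentage}% of traffic on the new version")
```

The canary therefore reaches 30% at 5 minutes, 60% at 13 minutes, and 100% at 23 minutes, provided no threshold is breached along the way.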
SRE teams sleep through deploys now
We had a standing Friday deploy freeze for 18 months. Kubestead ended it. We ship on Fridays now because regressions get caught at 10% — before anyone is paged.
We were running Argo Rollouts. It split traffic fine but still needed someone watching a dashboard before we'd promote. That someone was usually me at 11 PM. Kubestead owns that decision now.
The rollback report is the feature I didn't know I needed. It tells you exactly which metric breached, the measured value, and the timestamp. Our postmortems are 10 minutes now instead of two hours.