SLOs February 19, 2025

Which Prometheus Metrics Actually Predict a Bad Deploy

The Five PromQL Queries Every Canary Analysis Should Include

After analyzing over 200 canary rollbacks in Kubestead deployments, a clear pattern emerges: the teams that catch regressions earliest aren't using the most metrics — they're using the right ones. Error rate alone produces too many false positives on low-traffic services and misses too many latency regressions. The combination that works best is a small set of complementary signals that together cover the failure modes that actually show up in production.

What follows is grounded in observed rollback data, not theoretical metric design. The queries here are production-tested in Kubernetes environments running the standard kube-prometheus-stack, with services instrumented using standard Prometheus counters and histograms.

1. HTTP Error Rate (5xx + 4xx where appropriate)

The baseline. Every canary analysis starts here:

sum(rate(http_requests_total{status=~"5..", pod=~"canary.*"}[2m]))
/ sum(rate(http_requests_total{pod=~"canary.*"}[2m]))

Critical notes: the 2-minute range vector is the right default for most services. A 30-second range vector produces too much noise on low-request-rate services. A 5-minute range vector delays detection. If your service handles roughly 10 requests per second or fewer at canary traffic levels (e.g., 5% of a service doing 200 rps = 10 rps), consider a 3-minute window instead.

Also: whether to include 4xx errors depends on the service. For most internal APIs, 4xx errors are client errors — don't count them against the canary. For user-facing services where 4xx often represents application logic errors (expired sessions, rate limits), tracking 4xx separately is worth it. The decision should be made per service, not globally.
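For services where 4xx responses are tracked separately, the query can mirror the 5xx one. This is a sketch that assumes the same http_requests_total counter and status label used above:

```promql
# Share of canary requests answered with a 4xx status, kept separate from 5xx.
sum(rate(http_requests_total{status=~"4..", pod=~"canary.*"}[2m]))
/ sum(rate(http_requests_total{pod=~"canary.*"}[2m]))
```

Because a healthy 4xx share varies widely between services, this value is probably more useful compared against the same query for stable pods than against a fixed threshold.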

2. P99 Latency Ratio (Canary vs. Stable)

Error rate misses latency regressions that don't produce errors. A service that starts responding in 800ms where it previously responded in 200ms will look clean on error rate while degrading user experience significantly. The right metric is a ratio, not an absolute threshold:

histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{pod=~"canary.*"}[2m])) by (le))
/
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{pod=~"stable.*"}[2m])) by (le))

A ratio above 1.5 (canary P99 is 50% worse than stable) is a reliable rollback signal. An absolute threshold (e.g., "fail if P99 > 500ms") is unreliable because baseline latency varies enormously by service. The ratio comparison is service-agnostic.

The failure mode to watch for: histogram_quantile produces inaccurate results when the canary histogram has very few samples. With fewer than about 100 observations in the window, the P99 estimate is effectively set by the single slowest request. At 1% canary traffic on a low-rps service, you may not have enough samples for a meaningful P99 calculation in a 2-minute window. Use a minimum-count guard in your analysis template.
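One way to express such a guard directly in PromQL is sketched below. It assumes the histogram exposes the usual _count series next to its buckets. The 200-request minimum is an illustrative value, not a figure from the rollback data.

```promql
# Canary/stable P99 ratio, returned only when the canary saw enough requests.
(
  histogram_quantile(0.99,
    sum(rate(http_request_duration_seconds_bucket{pod=~"canary.*"}[2m])) by (le))
  /
  histogram_quantile(0.99,
    sum(rate(http_request_duration_seconds_bucket{pod=~"stable.*"}[2m])) by (le))
)
and
(
  # Guard: at least 200 canary requests in the window (assumed threshold).
  sum(increase(http_request_duration_seconds_count{pod=~"canary.*"}[2m])) > 200
)
```

When the canary has seen too few requests, the expression returns no series at all. How an empty result is treated (skipped, inconclusive, or failed) depends on the analysis tooling, so check that behavior before relying on the guard.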

3. Downstream Error Rate (For Services With Dependencies)

Services that make outbound calls to databases, caches, or downstream APIs can cause cascading failures without their own error rate spiking first. A canary version that holds database connections longer than stable will exhaust the connection pool gradually — the connection error rate spikes only after the pool is full, which may be 5-10 minutes into the canary window.

If your service instruments outbound calls (which it should, via client-side Prometheus counters and histograms), track downstream error rate separately:

sum(rate(db_query_errors_total{pod=~"canary.*"}[2m]))
/ sum(rate(db_queries_total{pod=~"canary.*"}[2m]))

This is the metric that catches "slow database queries under new code path" before the cascade reaches the service's own error rate. It consistently provides 3-5 minutes of earlier warning compared to HTTP error rate alone in services with database dependencies.
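Error counters only move once queries actually fail or time out, so a latency view of the same dependency can complement them. The sketch below reuses the P99 ratio pattern from section 2. The metric db_query_duration_seconds_bucket is a hypothetical client-side histogram name; substitute whatever your database client exports.

```promql
# Canary/stable ratio of P99 outbound database query latency.
# db_query_duration_seconds_bucket is an assumed (hypothetical) metric name.
histogram_quantile(0.99,
  sum(rate(db_query_duration_seconds_bucket{pod=~"canary.*"}[2m])) by (le))
/
histogram_quantile(0.99,
  sum(rate(db_query_duration_seconds_bucket{pod=~"stable.*"}[2m])) by (le))
```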

4. Saturation: CPU Throttling or Memory Pressure

A deploy that introduces a memory leak or a tighter compute loop may pass error rate and latency checks during the analysis window — but only because the canary is running on a fraction of traffic. At 100% traffic, the same memory allocation pattern would exhaust pod memory in 30 minutes.

Tracking CPU throttle rate or container memory usage during the canary window gives early warning of saturation regressions:

sum(rate(container_cpu_cfs_throttled_periods_total{pod=~"canary.*"}[2m]))
/ sum(rate(container_cpu_cfs_periods_total{pod=~"canary.*"}[2m]))

A throttle rate above 25% on the canary (when stable is under 10%) is a reliable signal that the new version is consuming more CPU cycles. It won't always manifest as errors in the analysis window, but it predicts poor behavior at full traffic.
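For the memory side of this check, a minimal sketch follows. It assumes the cAdvisor metric container_memory_working_set_bytes and the kube-state-metrics metric kube_pod_container_resource_limits are both scraped, as they typically are with kube-prometheus-stack, and that the containers have memory limits set.

```promql
# Highest fraction of its memory limit used by any canary container.
max(
  max by (namespace, pod, container) (
    container_memory_working_set_bytes{pod=~"canary.*", container!=""}
  )
  /
  max by (namespace, pod, container) (
    kube_pod_container_resource_limits{pod=~"canary.*", resource="memory"}
  )
)
```

Containers without a memory limit drop out of the result. As with throttling, comparing this value against the same query for stable.* pods separates a regression from a service that simply runs close to its limit.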

5. Canary vs. Stable Comparison Gate

The most predictive rollback signal we've found isn't a single metric — it's a multi-metric comparison that asks: is the canary materially worse than stable on any dimension? This is implemented as a composite gate:

analysisTemplate:
  name: canary-vs-stable-composite
spec:
  metrics:
  - name: error-rate-ratio
    successCondition: result[0] < 1.3   # canary error rate < 1.3x stable
    failureCondition: result[0] > 2.0
    provider:
      prometheus:
        query: |
          (sum(rate(http_requests_total{status=~"5..",pod=~"canary.*"}[2m]))
           / sum(rate(http_requests_total{pod=~"canary.*"}[2m])))
          /
          (sum(rate(http_requests_total{status=~"5..",pod=~"stable.*"}[2m]))
           / sum(rate(http_requests_total{pod=~"stable.*"}[2m])) + 0.0001)

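The ratio approach is more robust than absolute thresholds because it's self-calibrating. If your stable version has a 0.2% error rate baseline (perhaps from expected upstream flakiness), an absolute 0.5% threshold would pass a canary running at 0.4% error rate. The ratio gate would not pass that same canary: its error rate is twice stable's (about 1.9 once the 0.0001 offset is added to the stable rate), well above the 1.3 success condition and right at the 2.0 failure threshold. The offset keeps the ratio finite when stable's measured error rate is zero.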

What to Avoid: High-Cardinality Labels in Canary Queries

One operational mistake that shows up frequently: using high-cardinality labels (user ID, request path, tenant ID) in canary analysis queries. Queries that match or group on such labels touch far more series and are expensive to evaluate, which can spike load on the metrics backend during a rollout, precisely the moment you want your observability stack to be stable.

Canary analysis queries should aggregate at the pod or deployment level, not at the request level. Keep them simple, keep them cheap, and resist the urge to add per-endpoint breakdown to the analysis template. That level of detail belongs in your dashboards during rollout review, not in automated analysis gates.