Observability Lucia Ferreira

Prometheus metrics that actually tell you your canary is healthy

Not all Prometheus metrics are equally useful for rollout decisions. We walk through the five query patterns that consistently separate "looks fine" from "will fail at scale."

[Image: Prometheus query builder for canary latency metric]

Prometheus gives you a blank canvas: you can scrape almost any metric from a running service, write a PromQL query against it, and use the result as a canary gate. The flexibility is real, but it also means you can build a canary analysis setup that appears rigorous and provides essentially no signal. The query structure matters as much as the metrics you choose. Certain patterns consistently distinguish a genuinely healthy canary from one that's "fine for now but will hurt you later."

This isn't a tutorial on PromQL syntax. It assumes you're already running Prometheus and have basic instrumentation in your services. The focus is on which metric signals carry the most diagnostic weight for rollout decisions, and how to query them in ways that avoid common analysis mistakes.

Query 1: Error rate with request count denominator, not time

The most common mistake in canary error rate queries is measuring errors per unit of time instead of errors per request. Dividing the error count by a fixed time window produces a rate per second, which conflates traffic volume changes with actual error behavior.
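For contrast, the per-second form typically looks like the sketch below. It reuses the metric and label names from this article's examples, which are assumptions about your instrumentation. At a 1% failure proportion it returns 2 at 200 RPS and 20 at 2000 RPS, so a fixed threshold on it tracks traffic rather than health.

```promql
# Anti-pattern: 5xx responses per second, which scales with traffic volume
sum(
  rate(http_requests_total{
    job="payment-api",
    status=~"5..",
    pod=~"payment-api-canary-.*"
  }[10m])
)
```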

Prefer a ratio of error requests to total requests over the analysis window:

# Correct: error ratio (dimensionless)
sum(
  rate(http_requests_total{
    job="payment-api",
    status=~"5..",
    pod=~"payment-api-canary-.*"
  }[10m])
)
/
sum(
  rate(http_requests_total{
    job="payment-api",
    pod=~"payment-api-canary-.*"
  }[10m])
)

This gives you a number between 0 and 1 regardless of traffic volume. A canary serving 200 RPS at 1% error rate and a canary serving 2000 RPS at 1% error rate produce the same output — as they should, because the failure proportion is identical. Rate-per-second queries obscure this by mixing traffic intensity with error intensity.
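One caveat: many client libraries do not export a series for a label combination until it has been observed. If the canary has not yet returned a 5xx, the numerator selects nothing and the whole expression returns no data rather than 0. A minimal sketch of one workaround, assuming an absent error series should count as zero errors:

```promql
# Error ratio that falls back to 0 when no 5xx series exists yet
(
  sum(
    rate(http_requests_total{
      job="payment-api",
      status=~"5..",
      pod=~"payment-api-canary-.*"
    }[10m])
  )
  or vector(0)
)
/
sum(
  rate(http_requests_total{
    job="payment-api",
    pod=~"payment-api-canary-.*"
  }[10m])
)
```

If the canary has received no traffic at all, the result is still empty or NaN, which is best treated as inconclusive rather than as a pass.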

Always split canary pods from stable pods using label selectors. If your orchestrator labels canary pods with rollout-type: canary, use that label; Prometheus relabeling normally exposes it with an underscore (for example rollout_type="canary"), depending on your relabel configuration. If not, use pod name matching against the known canary ReplicaSet suffix. Mixing canary and stable metrics in a single query defeats the purpose of comparative analysis.
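The stable baseline can be computed with the same query and an inverted pod matcher. This sketch assumes every pod under job="payment-api" that does not match the canary name pattern is a stable pod; adjust the selector if other workloads share the job label.

```promql
# Stable error ratio: same shape as the canary query, canary pods excluded
sum(
  rate(http_requests_total{
    job="payment-api",
    status=~"5..",
    pod!~"payment-api-canary-.*"
  }[10m])
)
/
sum(
  rate(http_requests_total{
    job="payment-api",
    pod!~"payment-api-canary-.*"
  }[10m])
)
```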

Query 2: Latency percentiles using histograms, not summaries

Prometheus client libraries expose latency in two ways: summary (pre-computed quantiles calculated in the client) and histogram (bucketed observation counts that let you compute quantiles server-side). For canary analysis, you want histograms.

Summaries compute quantiles per-process and can't be aggregated across pod replicas. If you're averaging summary p99s across three canary pods, you're averaging quantiles — a mathematically incorrect operation. Histogram observations aggregate correctly:

# p99 latency from histogram across all canary pods
histogram_quantile(
  0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{
      job="order-service",
      pod=~"order-service-canary-.*"
    }[10m])
  )
)

The sum by (le) aggregates across pods before passing to histogram_quantile. This is the correct aggregation. Before relying on this query pattern, verify that your services use histogram instrumentation: histograms expose *_bucket series with an le label alongside *_count and *_sum, whereas summaries expose *_count and *_sum but no buckets.
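A quick way to confirm histogram instrumentation is to list the bucket boundaries the canary pods export. The sketch below reuses the metric name from the query above, which is an assumption about your service. An empty result suggests the metric is a summary or is exported under a different name.

```promql
# One row per bucket boundary (le); the value is the number of canary pods exposing it
count by (le) (
  http_request_duration_seconds_bucket{
    job="order-service",
    pod=~"order-service-canary-.*"
  }
)
```

Because histogram_quantile interpolates linearly within a bucket, the boundaries closest to your latency threshold determine how precise the reported p99 can be.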

Query 3: Canary-to-stable latency ratio

A canary p99 of 180ms sounds acceptable if your success condition is 200ms. But if your stable p99 is 120ms, the canary is running 50% slower — a meaningful regression that the absolute threshold missed. Relative comparison is almost always more informative than absolute thresholds for canary gates.

# Canary p99 / stable p99 ratio
# Values above 1.2 suggest latency regression worth investigating

histogram_quantile(
  0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{
      job="order-service",
      pod=~"order-service-canary-.*"
    }[10m])
  )
)
/
histogram_quantile(
  0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{
      job="order-service",
      pod!~"order-service-canary-.*"
    }[10m])
  )
)

A ratio of 1.0 means the canary matches stable. A ratio of 1.3 means the canary is 30% slower at p99. Set your analysis threshold on this ratio (typically 1.1 to 1.3 depending on your latency sensitivity) rather than on the absolute value. This threshold remains valid regardless of daily load patterns that shift your baseline latency.
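When the ratio feeds an automated gate, the threshold can be folded into the expression itself. The sketch below assumes a threshold of 1.2 from the range above and uses the bool modifier so the query returns 1 on breach and 0 otherwise. How your analysis tool consumes that value is an assumption.

```promql
# 1 if canary p99 exceeds stable p99 by more than 20%, otherwise 0
(
  histogram_quantile(0.99, sum by (le) (
    rate(http_request_duration_seconds_bucket{
      job="order-service",
      pod=~"order-service-canary-.*"
    }[10m])
  ))
  /
  histogram_quantile(0.99, sum by (le) (
    rate(http_request_duration_seconds_bucket{
      job="order-service",
      pod!~"order-service-canary-.*"
    }[10m])
  ))
) > bool 1.2
```

A NaN ratio, which occurs when either side recorded no observations in the window, compares as 0 here, so pair this gate with a minimum-traffic check rather than reading 0 as healthy.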

Query 4: Saturation signals — queue depth and connection pool usage

Error rate and latency are lagging indicators. By the time they degrade, you've already hurt users. Saturation metrics — how full your critical resources are — often predict latency and error degradation before it appears.

For services with thread pools or connection pools, pool exhaustion is a leading indicator of timeout errors and latency spikes. If your service exposes pool metrics (many Java runtimes do via JMX, and most custom pools can expose them trivially), compare active connections to the pool maximum:

# Connection pool utilization for canary pods
sum(db_connection_pool_active{pod=~"payment-api-canary-.*"})
/
sum(db_connection_pool_max{pod=~"payment-api-canary-.*"})

If this ratio is significantly higher for canary pods than stable pods, the new version is probably holding connections longer or opening more of them, a pattern that will degrade under higher load even if current latency looks acceptable. Similarly, goroutine count, open file descriptors, and heap utilization in canary pods at 10% traffic can predict resource exhaustion, such as OOM kills, that won't be visible until 100% promotion.
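The canary-versus-stable comparison can be expressed as a single ratio of utilizations. This is a sketch under two assumptions: the pool metrics carry job="payment-api", and stable pods are the ones under that job that do not match the canary name pattern.

```promql
# Canary pool utilization divided by stable pool utilization
# Values well above 1.0 mean the canary uses a larger share of its pool
(
  sum(db_connection_pool_active{job="payment-api", pod=~"payment-api-canary-.*"})
  /
  sum(db_connection_pool_max{job="payment-api", pod=~"payment-api-canary-.*"})
)
/
(
  sum(db_connection_pool_active{job="payment-api", pod!~"payment-api-canary-.*"})
  /
  sum(db_connection_pool_max{job="payment-api", pod!~"payment-api-canary-.*"})
)
```

Pool gauges are instantaneous samples, so this ratio is noisier than the rate-based queries; evaluating it over several scrape intervals before acting on it is a reasonable precaution.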

Query 5: Downstream dependency error rates from the canary's perspective

A healthy canary can trigger downstream failures. If your new version calls a dependency more frequently (due to a caching regression or a new feature flag activating), the downstream service may be fine at 5% canary weight and saturated at 100%. This doesn't show up in canary pod error rates — it shows up in the dependency's metrics.

If your services use a consistent HTTP client instrumentation pattern that records outbound request outcomes, you can query the outbound error ratio directly:

# Outbound error rate from canary pods to payment-processor
sum(
  rate(http_client_requests_total{
    source_pod=~"order-service-canary-.*",
    target="payment-processor",
    status=~"5.."
  }[10m])
)
/
sum(
  rate(http_client_requests_total{
    source_pod=~"order-service-canary-.*",
    target="payment-processor"
  }[10m])
)

Comparing this to the same metric for stable pods tells you whether your canary is stressing its dependencies differently. A canary that calls a downstream service 20% more frequently than stable will hit that service's rate limits faster during full promotion. Catching this at 5% weight is far cheaper than discovering it post-promote.
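The error ratio above does not capture call frequency on its own. One way to measure it is outbound calls per inbound request, computed separately for canary and stable pods. The sketch below assumes order-service also exports an inbound http_requests_total counter with job and pod labels, as in the Query 1 example; that metric is not confirmed for this service.

```promql
# Outbound calls to payment-processor per inbound request, canary pods only
sum(
  rate(http_client_requests_total{
    source_pod=~"order-service-canary-.*",
    target="payment-processor"
  }[10m])
)
/
sum(
  rate(http_requests_total{
    job="order-service",
    pod=~"order-service-canary-.*"
  }[10m])
)
```

Running the same expression with source_pod!~ and pod!~ matchers gives the stable baseline. A canary value about 1.2 times the stable value corresponds to the 20% increase described above.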

What these queries don't cover

Five queries can't substitute for understanding your specific service's failure modes. These patterns are a useful starting set, not a complete analysis portfolio. For most services, you'll want to add at least one business metric — transaction success rate, order completion rate, search result relevance score — that connects service health to user outcomes rather than infrastructure behavior.

Prometheus metrics also can't tell you about logical correctness. A canary that returns HTTP 200 with a subtly wrong JSON response body will pass every query above. Functional correctness requires contract tests, integration test suites, or business metric tracking that goes beyond availability and latency.

That said, these five patterns — request error ratio, histogram-based latency percentiles, canary-to-stable latency ratio, saturation indicators, and downstream error rates — form a defensible baseline for automated canary analysis. Services that pass all five consistently are meaningfully less likely to cause production incidents post-promote than services evaluated on raw error rate alone.