Canary December 18, 2025

How We Instrument Canary Rollouts with OpenTelemetry

Instrumenting Canary Deployments with OpenTelemetry

Prometheus metrics tell you whether the canary is worse than stable. OpenTelemetry traces tell you why. When the error rate spikes on the canary, a distributed trace that covers the full request path — from the load balancer through the canary pod and down to the database — shows you exactly which operation started failing and what the failure looks like. That's a very different forensics posture than staring at a rate counter and guessing.

This post walks through the instrumentation pattern that makes OTel traces useful for canary analysis: propagating the deployment version through trace context, filtering traces by version in the analysis template, and using span error rates as a canary gate.

The Deployment Version Attribute

The foundation is a resource attribute that identifies which version of the service generated a given span. OpenTelemetry's semantic conventions define service.version as a standard resource attribute. Setting it correctly at service startup is the prerequisite for everything else:

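// Go example using the OTel SDK.
// Needs: log, os, go.opentelemetry.io/otel/attribute, go.opentelemetry.io/otel/sdk/resource,
// and semconv "go.opentelemetry.io/otel/semconv/v1.26.0" (use the version matching your SDK).
// The variable is named res so it doesn't shadow the resource package.
res, err := resource.New(ctx,
    resource.WithAttributes(
        semconv.ServiceNameKey.String("checkout-service"),
        semconv.ServiceVersionKey.String(os.Getenv("SERVICE_VERSION")),
        attribute.String("deployment.stage", os.Getenv("DEPLOY_STAGE")), // "canary" or "stable"
    ),
)
if err != nil {
    log.Fatalf("building OTel resource: %v", err)
}
// Pass res to the tracer provider when you construct it.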

The SERVICE_VERSION and DEPLOY_STAGE environment variables are injected by Kubestead at canary pod creation time. Every span emitted by the canary pod carries the new service.version (v2.14.1, say) and deployment.stage=canary. Every span from the stable pods carries deployment.stage=stable. This makes it trivial to split traces by deployment stage in any OTel-compatible backend.

Trace Context Propagation Across Services

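When the canary pod makes a downstream call, the OTel SDK propagates trace context in the W3C Trace Context header (traceparent). The deployment stage travels separately, as a baggage entry (deployment.stage=canary) in the W3C baggage header. Neither half of that happens by itself: the canary service has to put deployment.stage into baggage and have the baggage propagator enabled alongside the trace context one, and each downstream service has to copy the baggage entry onto its own spans as an attribute, typically with a span processor, because the SDK does not attach baggage to spans by default. With both pieces in place, you get a trace where the entire call chain is tagged with the originating deployment stage.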
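To make the wire format concrete, here is a rough, standard-library-only sketch of what a downstream service has to recover from the baggage header before it can tag its spans. The function name stageFromBaggage and the sample header are ours, purely for illustration; in a real service you'd let the OTel SDK's baggage propagator do the parsing rather than doing it by hand.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// stageFromBaggage extracts the deployment.stage entry from a W3C
// "baggage" header value such as "deployment.stage=canary,tenant=acme".
// It returns "" when the key is absent or the value can't be decoded.
func stageFromBaggage(header string) string {
	for _, member := range strings.Split(header, ",") {
		// Drop optional properties after ';' (e.g. "k=v;source=x").
		kv, _, _ := strings.Cut(member, ";")
		key, value, ok := strings.Cut(kv, "=")
		if !ok || strings.TrimSpace(key) != "deployment.stage" {
			continue
		}
		// Baggage values are percent-encoded on the wire.
		decoded, err := url.PathUnescape(strings.TrimSpace(value))
		if err != nil {
			return ""
		}
		return decoded
	}
	return ""
}

func main() {
	header := "tenant=acme, deployment.stage=canary;source=checkout"
	fmt.Println(stageFromBaggage(header)) // prints "canary"
}
```

Whatever value comes back is what the downstream service would set as a span attribute, so that its spans can later be grouped by the stage of the caller.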

This is the property that makes cross-service canary analysis possible. Say a canary checkout pod calls a stable inventory service. The inventory service's span carries deployment.stage=canary as a span attribute copied from baggage, even though the inventory service itself runs the stable version and its own resource attributes still say stable. You can now ask: are inventory service errors higher when called from canary checkout pods than from stable checkout pods? If they are, that's a signal that the checkout canary is causing downstream problems, even if the downstream service's own error rate looks clean in aggregate.
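The question in the previous paragraph boils down to a group-by. As a minimal sketch, assuming spans have already been exported into a simplified record (the Span struct and its field names are hypothetical, not an OTel type), the comparison looks roughly like this:

```go
package main

import "fmt"

// Span is a simplified, hypothetical view of an exported span.
type Span struct {
	Service     string // service that emitted the span
	CallerStage string // deployment.stage value copied from baggage
	IsError     bool
}

// errorRateByCallerStage returns, for one service, the fraction of
// errored spans grouped by the stage of the originating caller.
func errorRateByCallerStage(spans []Span, service string) map[string]float64 {
	total := map[string]int{}
	failed := map[string]int{}
	for _, s := range spans {
		if s.Service != service {
			continue
		}
		total[s.CallerStage]++
		if s.IsError {
			failed[s.CallerStage]++
		}
	}
	rates := make(map[string]float64, len(total))
	for stage, n := range total {
		rates[stage] = float64(failed[stage]) / float64(n)
	}
	return rates
}

func main() {
	spans := []Span{
		{"inventory", "canary", true},
		{"inventory", "canary", false},
		{"inventory", "stable", false},
		{"inventory", "stable", false},
		{"checkout", "canary", true}, // ignored: different service
	}
	rates := errorRateByCallerStage(spans, "inventory")
	fmt.Printf("canary=%.2f stable=%.2f\n", rates["canary"], rates["stable"])
	// prints "canary=0.50 stable=0.00"
}
```

In practice the trace backend does this grouping for you; the sketch only shows which two numbers you end up comparing.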

Span Error Rate as a Canary Gate

Once spans are tagged with deployment stage, you can query span error counts via your OTel backend. Kubestead's analysis templates support a generic HTTP provider that can query any JSON-returning endpoint. For teams using Jaeger or Tempo as their trace backend, a Jaeger gRPC query or a Tempo TraceQL query can return span error counts filterable by deployment.stage:

analysisTemplate:
  name: otel-span-error-rate
spec:
  metrics:
  - name: canary-span-error-rate
    interval: 60s
    successCondition: result[0] < 0.005
    failureCondition: result[0] > 0.02
    provider:
      job:
        spec:
          template:
            spec:
              containers:
              - name: query
                image: kubestead/otel-query:0.3.0
                args:
                  - --backend=tempo
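                  # Quote the whole argument at the YAML level: container args are not shell-parsed,
                  # so inner single quotes would reach the program literally.
                  - '--query={deployment.stage="canary"} | rate() | error()'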
                  - --window=5m

This analysis template computes the error rate of spans tagged with deployment.stage=canary over the last 5 minutes. The kubestead/otel-query image is a lightweight wrapper that handles authentication, TraceQL query construction, and result normalization — returning a simple float that the analysis template threshold can evaluate.
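The two thresholds in the template leave a band between 0.5% and 2% where neither condition holds. How Kubestead treats that band isn't covered here; the sketch below labels it inconclusive, which is our assumption, and the Verdict type and evaluate function are illustrative names rather than anything from Kubestead.

```go
package main

import "fmt"

// Verdict is the outcome of one analysis measurement.
type Verdict string

const (
	Pass         Verdict = "pass"
	Fail         Verdict = "fail"
	Inconclusive Verdict = "inconclusive" // assumed handling of the middle band
)

// evaluate applies the two thresholds from the template:
// below 0.5% passes, above 2% fails, anything else is inconclusive.
func evaluate(errorRate float64) Verdict {
	const successBelow, failureAbove = 0.005, 0.02
	switch {
	case errorRate > failureAbove:
		return Fail
	case errorRate < successBelow:
		return Pass
	default:
		return Inconclusive
	}
}

func main() {
	for _, r := range []float64{0.001, 0.01, 0.05} {
		fmt.Printf("%.3f -> %s\n", r, evaluate(r))
	}
	// prints:
	// 0.001 -> pass
	// 0.010 -> inconclusive
	// 0.050 -> fail
}
```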

Span Duration P99 Comparison

Span-level latency comparison is the other high-value use case. A canary where a specific database query started taking 300ms instead of 50ms will show up in the P99 of that specific operation's spans, even if the overall HTTP response time at the API layer hasn't budged yet (because only 5% of traffic is on the canary).

TraceQL's metrics functions can compute this directly: { resource.deployment.stage = "canary" && name = "db:SELECT users" } | quantile_over_time(duration, .99). Here name is the span name and the resource. prefix scopes the match to the attribute set at service startup. Comparing this to the same query on stable spans gives you a per-operation latency comparison that's more granular than what HTTP-level Prometheus histograms can provide.

This level of granularity isn't required for every service or every deploy. It's most valuable for services with complex call graphs where a latency regression on a specific internal operation might be masked by the aggregate HTTP latency distribution. Database-heavy services, services with multiple downstream calls per request, and services with complex caching logic are the primary candidates.

Sampling Considerations at Low Canary Percentages

At 1-5% canary traffic, sampling significantly affects span analysis results. If your services are configured with head-based sampling at 10%, a 1% canary traffic slice means only about 0.1% of all requests end up as sampled canary traces. On a service doing 500 requests per second, that's roughly 0.5 sampled canary traces per second, or about 30 over a 60-second analysis window. That's marginal for rate calculations, since a single errored request out of 30 already reads as a 3.3% error rate, and too thin for reliable P99 latency estimates, where the 99th percentile of 30 samples is effectively just the slowest one.
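The arithmetic is worth rerunning with your own numbers before deciding which gate to trust at which traffic step. A small sketch follows; the function name expectedSamples is ours, and it assumes one sampled trace per sampled request.

```go
package main

import "fmt"

// expectedSamples estimates how many sampled canary traces land in one
// analysis window, given the total request rate, the canary's share of
// traffic, the head-sampling probability, and the window in seconds.
func expectedSamples(rps, canaryShare, sampleRate, windowSeconds float64) float64 {
	return rps * canaryShare * sampleRate * windowSeconds
}

func main() {
	// Figures from the paragraph above: 500 rps, 1% canary, 10% sampling.
	fmt.Printf("1%% canary, 60s window: %.0f\n", expectedSamples(500, 0.01, 0.10, 60))
	// Same traffic split over a 5-minute window.
	fmt.Printf("1%% canary, 5m window:  %.0f\n", expectedSamples(500, 0.01, 0.10, 300))
	// Same service once the canary takes 10% of traffic.
	fmt.Printf("10%% canary, 60s window: %.0f\n", expectedSamples(500, 0.10, 0.10, 60))
}
```

With these inputs the three cases come out to roughly 30, 150, and 300 sampled traces, which is why the traffic percentage and the window length matter as much as the sampling rate.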

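The options are: raise the canary traffic percentage before span-based analysis becomes the primary gate (use Prometheus metrics at 1-5%, switch to span analysis at 10%+), or configure tail-based sampling that keeps canary traces at a higher rate than stable ones. The OpenTelemetry Collector supports this through the tail_sampling processor in the contrib distribution, where a string_attribute policy matching deployment.stage=canary can sit alongside a probabilistic policy for everything else. (The similarly named probabilistic_sampler processor is not tail-based.) Tail-based sampling can only keep spans that actually reach the Collector, so head-based sampling in the SDKs has to be raised or turned off for this to help.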

We're not suggesting span-based analysis should replace Prometheus metric analysis for every team — Prometheus is simpler, lower latency, and less operationally complex. Span-based analysis is the right additional layer for teams that already run a distributed tracing backend and want the per-operation visibility that aggregate metrics can't provide.