Kubernetes October 10, 2024

Kubernetes Rolling Updates vs Blue-Green vs Canary: A Practical Comparison

A Kubernetes Deployment natively supports two strategies, RollingUpdate and Recreate; blue-green and canary are patterns layered on top with additional tooling. The Kubernetes documentation covers the mechanics but stops short of telling you which one to use for which services. For a team running 50+ microservices with varying traffic profiles, SLOs, and database dependency patterns, that choice has real consequences at 2 AM.

This is an operational comparison, not a documentation summary. The goal is to help you pick a strategy that matches your actual failure risk profile, not the one that feels conceptually cleanest.

RollingUpdate: Kubernetes Native, But Not Risk-Free

RollingUpdate is the default strategy for a Kubernetes Deployment. When you change the pod template, the controller replaces old pods with new ones incrementally. Two parameters control the overlap window: maxSurge caps how many pods may exist above the desired replica count, and maxUnavailable caps how many may be unavailable at any point during the rollout.
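To make the two parameters concrete, here is a minimal sketch of how they bound pod counts during a rollout. It assumes the documented Kubernetes behavior that a percentage maxSurge rounds up, a percentage maxUnavailable rounds down, and both default to 25%. The function names are illustrative, not part of any Kubernetes client.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string like '25%' to a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer arithmetic avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count envelope the controller stays inside during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)               # maxSurge rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # maxUnavailable rounds down
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))
# {'max_total_pods': 13, 'min_available_pods': 8}
```

With 10 replicas and the defaults, the rollout may briefly run 13 pods and never drops below 8 available, which is why a wider maxSurge shortens the rollout at the price of temporary extra capacity.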

The characteristics:

Zero infrastructure cost overhead (no duplicate environment to maintain)
No traffic manipulation required: pods join the Service's endpoints as they become ready and leave as they terminate; the selector itself never changes
Rollback is another rolling update in reverse, which means the rollback window has the same latency as the original deploy
No metric gating: readiness probes gate individual pods, but there's no native mechanism to halt a rollout automatically if the error rate spikes

The failure mode that bites teams most often: a rolling update proceeds while error rate climbs, but because traffic to new pods is proportional to pod count, you're at 60% new pods before anyone triggers a manual rollback. You've already exposed the majority of your traffic to the regression before the signal gets acted on.
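The exposure arithmetic behind that failure mode can be sketched in a few lines. This assumes the Service spreads requests evenly across ready pods and that pods are replaced at a steady pace; both are simplifications, and the function names and rates are hypothetical.

```python
def new_version_share(new_ready, old_ready):
    """Fraction of traffic hitting the new version, assuming even spread across ready pods."""
    total = new_ready + old_ready
    if total == 0:
        raise ValueError("no ready pods")
    return new_ready / total


def exposure_at_detection(replicas, pods_replaced_per_minute, minutes_until_rollback):
    """Traffic share on the new version at the moment someone triggers a rollback."""
    replaced = min(replicas, int(pods_replaced_per_minute * minutes_until_rollback))
    return new_version_share(replaced, replicas - replaced)


# 10 replicas, one pod replaced per minute, a human reacts after 6 minutes.
print(exposure_at_detection(10, 1, 6))
```

Under those assumptions, the share of exposed traffic is set entirely by how long detection and reaction take, not by anything the rollout itself checks.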

RollingUpdate is appropriate for services where the delta between versions is low-risk — internal utilities, batch processors, stateless read-only APIs where any version mismatch is transient. For customer-facing services with tight SLOs, the lack of metric gating is a structural gap.

Blue-Green: Fast Rollback, High Cost

Blue-green keeps two complete environments running simultaneously. "Blue" is production. "Green" is the new version, receiving no traffic. When you're ready, you flip the load balancer (or Service selector, or Ingress backend) to route all traffic to green. If something goes wrong, you flip back — rollback in seconds.
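When the switch is done through a Service selector, the flip amounts to patching one label value. The sketch below only builds the patch body; the Service name `checkout` and the `app` and `version` label keys are hypothetical and would need to match whatever labels your blue and green Deployments actually carry.

```python
import json


def selector_patch(target_color, app="checkout"):
    """Build a patch body that points a Service at one color (labels are assumptions)."""
    if target_color not in ("blue", "green"):
        raise ValueError(f"unknown color: {target_color}")
    return {"spec": {"selector": {"app": app, "version": target_color}}}


cutover = selector_patch("green")   # send all traffic to the new version
rollback = selector_patch("blue")   # flip back if something goes wrong

print(json.dumps(cutover))
print(json.dumps(rollback))
```

Either body could be applied with something like `kubectl patch service checkout -p '<json>'`. Cutover and rollback are the same operation with a different label value, which is where the fast rollback comes from.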

The practical properties:

Rollback is instantaneous — no re-deploy, just a traffic switch
Full environment validation possible before any traffic switch
Database migrations are the hard problem — you need the schema to be compatible with both versions simultaneously during the cutover window
Resource cost is roughly 2x for the duration of the deployment window

For a team with 50 microservices, maintaining a second full environment for every concurrent deployment is operationally and financially expensive. Most teams that start with blue-green for all services end up narrowing it to 3-5 highest-risk services within six months. That's a reasonable outcome — blue-green at scale requires careful infrastructure accounting.

Blue-green also doesn't help with gradual validation. You're validating against zero traffic (pre-cutover) or 100% traffic (post-cutover). There's no exposure window that catches bugs at low blast radius before they hit every user simultaneously.

Canary: Validation Under Real Traffic, With Automation

Canary deployments route a fraction of production traffic to the new version while keeping the majority on stable. The validation happens under real load, with real users, against real data — not a pre-production environment. If the canary exhibits a regression, only a small fraction of users are impacted before automatic rollback.

What canary adds over rolling update and blue-green, when driven by a progressive delivery controller rather than by hand:

Metric-gated advancement — rollout only proceeds if error rate, latency, and custom metrics stay within thresholds
Proportional blast radius — a regression at 5% traffic affects 5% of users, not 100%
Automatic rollback on metric breach — no human in the loop at 3 AM
Configurable traffic shape — progressive steps (1% → 5% → 20% → 100%) reduce risk at each stage
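The gating logic in the list above can be sketched as a simple loop. This is an illustration of the control flow only, not how any particular controller implements it; the threshold values, the metric names, and the `read_metrics` callback are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float = 0.01        # hypothetical: 1% of requests
    max_p99_latency_ms: float = 500.0   # hypothetical latency budget


def within_thresholds(metrics, thresholds):
    return (metrics["error_rate"] <= thresholds.max_error_rate
            and metrics["p99_latency_ms"] <= thresholds.max_p99_latency_ms)


def run_canary(steps, read_metrics, thresholds=Thresholds()):
    """Advance through traffic weights; stop and roll back on the first breach."""
    for weight in steps:
        # In a real system this is where traffic shifts and an analysis window elapses.
        if not within_thresholds(read_metrics(weight), thresholds):
            return ("rolled_back", weight)
    return ("promoted", steps[-1])


def fake_metrics(weight):
    """Stand-in metric source: the regression only shows up under 20% load."""
    error_rate = 0.002 if weight < 20 else 0.04
    return {"error_rate": error_rate, "p99_latency_ms": 180.0}


print(run_canary([1, 5, 20, 100], fake_metrics))
# ('rolled_back', 20)
```

In this run the regression is caught at the 20% step, so the 100% step never executes.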

The tradeoffs are real. Canary requires traffic splitting infrastructure: either a service mesh (Istio, Linkerd) or Ingress-level weighted routing (NGINX, Traefik). Without either, you're splitting traffic by pod count ratio, which loses precision at low percentages. With 10 pods, the smallest possible step is one pod, or 10% of traffic; a 1% canary would need 0.1 pod. And canary is operationally more complex to configure correctly than a rolling update.
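The pod-count limitation is plain arithmetic, sketched below under the assumption that traffic spreads evenly across all ready pods. The function names are illustrative.

```python
import math


def smallest_step_percent(total_pods):
    """Smallest canary share achievable by pod count alone: one pod out of the total."""
    return 100 / total_pods


def min_pods_for_weight(canary_percent):
    """Total pods needed before a single canary pod receives at most canary_percent."""
    return math.ceil(100 / canary_percent)


print(smallest_step_percent(10))   # 10.0
print(min_pods_for_weight(1))      # 100
print(min_pods_for_weight(5))      # 20
```

A 1% canary by pod ratio needs 100 pods in total, which is the practical argument for weighted routing on services that run only a handful of replicas.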

A Decision Framework for Mixed Service Types

No single strategy is right for every service in a 50-service fleet. A workable allocation:

RollingUpdate: internal services with no SLO, low traffic, internal callers only. No customer-visible impact on failure.
Blue-Green: services with large, irreversible state mutations where any user impact is unacceptable and you'd rather validate in full before switching. Use sparingly — only where the 2x cost is justified.
Canary with metric gating: all customer-facing services, especially those with SLOs, high traffic, or database writes.

The classification exercise is worth doing explicitly. Categorize each service by: does it have a user-facing SLO? Does it write to a database? Does it have downstream callers that could cascade a failure? Services that answer yes to any of these belong in the canary category.
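One way to make the classification explicit is to encode the questions as fields and the allocation as a function. The field names and the sample services are hypothetical, and giving blue-green precedence for irreversible state mutations is an assumption about how the two rules combine.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    has_user_facing_slo: bool
    writes_to_database: bool
    has_downstream_callers: bool
    irreversible_state_mutations: bool = False


def pick_strategy(svc):
    # Assumption: the rare blue-green case is checked first, then the canary questions.
    if svc.irreversible_state_mutations:
        return "blue-green"
    if svc.has_user_facing_slo or svc.writes_to_database or svc.has_downstream_callers:
        return "canary"
    return "rolling-update"


fleet = [
    Service("report-cron", False, False, False),
    Service("checkout-api", True, True, True),
    Service("ledger-writer", True, True, True, irreversible_state_mutations=True),
]
for svc in fleet:
    print(svc.name, "->", pick_strategy(svc))
```

Keeping the answers in a reviewable file also makes it obvious when a service's classification should change, for example when an internal tool gains an external caller.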

Canary Is Not a Magic Solution for Schema Changes

One counterpoint worth stating directly: canary does not solve database schema migrations automatically. A canary rollout where stable pods expect schema version N and canary pods require schema version N+1 will break if a backward-incompatible migration runs before the canary validates, because the stable pods can't read the new schema. If the migration instead runs after full promotion, you've promoted without ever validating against the schema change.

The correct pattern is an expand-migrate-contract sequence executed independently of the canary traffic split. This is a solvable problem but it requires explicit sequencing, and tools like Kubestead support schema-aware rollout phases. We'll cover this in detail in a later post on zero-downtime database migrations. The point here is: canary strategy selection and schema migration strategy are two separate decisions that need to be made together.
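The sequencing constraint can be sketched as a compatibility table plus a check that every step of a plan keeps all running versions compatible with the current schema. The state names, version labels, and plan steps are illustrative assumptions, not the behavior of any specific tool.

```python
# Which application versions can run against each schema state (assumed model).
COMPATIBLE = {
    "original": {"N"},            # before any migration
    "expanded": {"N", "N+1"},     # new columns/tables added, old ones kept
    "contracted": {"N+1"},        # old columns/tables dropped
}


def safe_to_run(schema_state, running_versions):
    """True if every running app version can work against the schema state."""
    return set(running_versions) <= COMPATIBLE[schema_state]


# (step, schema state during the step, app versions serving traffic)
PLAN = [
    ("expand schema", "expanded", {"N"}),
    ("backfill / migrate data", "expanded", {"N"}),
    ("canary N+1", "expanded", {"N", "N+1"}),
    ("promote N+1", "expanded", {"N+1"}),
    ("contract schema", "contracted", {"N+1"}),
]

for step, state, versions in PLAN:
    print(f"{step}: {'ok' if safe_to_run(state, versions) else 'UNSAFE'}")
```

The canary only ever runs while the schema is in its expanded state, and the contract step waits until no pod on version N remains.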

On Rollback Semantics Across Strategies

Rollback speed across the three strategies is not symmetric. Blue-green rolls back in under 5 seconds — it's a routing flip. Canary rolls back in 15-60 seconds depending on analysis latency and pod termination speed. Rolling update rollback takes as long as another rolling update — potentially 5-15 minutes depending on maxSurge and pod startup time.

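That rollback asymmetry matters for SLO exposure. A rolling update that takes 8 minutes to promote and 8 minutes to roll back spends 16 minutes in a degraded state if the regression is only caught once promotion completes. Even when the rollback is triggered halfway through, new pods have served traffic for about 8 minutes (4 rolling forward, 4 rolling back). A canary at 5% traffic that's rolled back in 30 seconds exposes 5% of traffic for the detection window plus those 30 seconds, which stays under 2 minutes even with a full minute of analysis latency. The math on SLO burn rate looks very different.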
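A rough way to compare the two is to weight degraded time by the share of traffic exposed. The sketch below assumes a linear ramp for the rolling update, a rollback that proceeds at the same pace as the rollout, a canary weight that holds constant until rollback completes, and a regression that affects every request reaching the new version. All function names are illustrative.

```python
def rolling_exposure(rollout_minutes, trigger_minute):
    """Return (degraded minutes, traffic-weighted minutes) for a rolling update."""
    peak_share = trigger_minute / rollout_minutes   # share on new pods at trigger time
    degraded_minutes = 2 * trigger_minute           # ramp up, then ramp down at the same pace
    # Area of the triangle traced by the exposed share over time.
    weighted = 0.5 * degraded_minutes * peak_share
    return degraded_minutes, weighted


def canary_exposure(weight, detect_minutes, rollback_minutes):
    """Return (degraded minutes, traffic-weighted minutes) for a fixed-weight canary."""
    degraded_minutes = detect_minutes + rollback_minutes
    return degraded_minutes, weight * degraded_minutes


# 8-minute rollout, regression caught only once promotion completes.
print(rolling_exposure(8, 8))
# 5% canary, 1.5 minutes to detect, 0.5 minutes to roll back.
print(canary_exposure(0.05, 1.5, 0.5))
```

Under these assumptions the rolling update accumulates 8 full-traffic minutes of exposure across 16 degraded minutes, while the canary accumulates about 0.1 across 2. Triggering the rolling rollback earlier shrinks both of its numbers, but the gap remains large.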