Schema changes are where zero-downtime deployments fall apart. You can have perfect canary traffic splitting, airtight metric analysis, and sub-60-second rollback — and still cause downtime the moment a database migration runs that's incompatible with the stable pods still running the old code. The traffic split doesn't matter if the schema isn't compatible with both versions simultaneously.
The pattern that works is called expand-migrate-contract (sometimes expand-and-contract). It separates schema evolution into multiple deployments instead of doing everything at once. It's more work up front. It's the only approach that's genuinely zero-downtime for non-trivial schema changes.
Why Single-Step Schema Migration Fails
The common failure scenario: a team deploys a new version that requires renaming a database column. The migration runs before the canary starts receiving traffic. Now the stable pods — still running the old code that references the old column name — start throwing SQL errors. The canary analysis sees this, triggers rollback, but the schema has already changed. The stable pods are now broken by a migration that was intended for the new version.
Even if you migrate after promoting to 100%, you have the mirror-image problem: if the migration fails partway through, you need to roll back the application code while the schema is left in a partially migrated state.
Single-step schema changes and canary deployments are fundamentally incompatible for any change that isn't purely additive.
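To make the failure concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the production database. The query string and helper names are hypothetical; the point is only that a query written against user_name fails as soon as the schema carries display_name alone.

```python
import sqlite3

# The query the stable pods are assumed to run (hypothetical).
OLD_CODE_QUERY = "SELECT user_name FROM users WHERE id = ?"


def make_db(name_column: str) -> sqlite3.Connection:
    """Build an in-memory users table whose name column is `name_column`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE users (id INTEGER PRIMARY KEY, {name_column} TEXT)")
    return conn


def old_code_can_read(conn: sqlite3.Connection) -> bool:
    """Return True if the old code's query still runs against this schema."""
    try:
        conn.execute(OLD_CODE_QUERY, (1,)).fetchone()
        return True
    except sqlite3.OperationalError:
        # SQLite reports a missing column as OperationalError.
        return False
```

Calling old_code_can_read(make_db("user_name")) succeeds, while the same call against make_db("display_name") fails, which is the state the stable pods are left in after a single-step rename.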
The Expand Phase
The expand phase adds the new column, table, or constraint while keeping the old structure intact. Both the old and new columns exist simultaneously. The new version of the code writes to both columns; old pods read and write only the old column.
For a column rename (old: user_name, new: display_name):
-- Expand migration (safe to run against stable pods)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
-- Backfill existing rows (run as a background job, not inside the migration transaction)
UPDATE users SET display_name = user_name WHERE display_name IS NULL;
The canary version of the code writes to both columns but still reads from user_name. Stable pods continue reading and writing only user_name. The schema is now compatible with both versions simultaneously. This is the deploy that goes through canary analysis: if the canary rolls back, no damage is done, because the old column is still present and the stable code keeps working.
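The dual-write behavior might look something like the sketch below. It uses Python's sqlite3 module in place of Postgres, and the function names are illustrative assumptions rather than anything from a real codebase.

```python
import sqlite3


def create_expanded_schema(conn: sqlite3.Connection) -> None:
    """Schema as it looks after the expand migration: both columns exist."""
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " user_name TEXT,"
        " display_name TEXT)"
    )


def write_name_old_code(conn: sqlite3.Connection, user_id: int, name: str) -> None:
    """Stable pods: write only user_name; they don't know display_name exists."""
    cur = conn.execute(
        "UPDATE users SET user_name = ? WHERE id = ?", (name, user_id)
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO users (id, user_name) VALUES (?, ?)", (user_id, name)
        )


def write_name_new_code(conn: sqlite3.Connection, user_id: int, name: str) -> None:
    """Canary pods: set both columns in one statement so they stay in sync."""
    cur = conn.execute(
        "UPDATE users SET user_name = ?, display_name = ? WHERE id = ?",
        (name, name, user_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO users (id, user_name, display_name) VALUES (?, ?, ?)",
            (user_id, name, name),
        )


def read_name(conn: sqlite3.Connection, user_id: int):
    """During expand, every version still reads user_name."""
    row = conn.execute(
        "SELECT user_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None
```

Note that a row touched by write_name_old_code after it was backfilled ends up with a stale display_name. Nothing in this sketch repairs that; it is one reason to re-check the two columns before switching reads.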
The Migrate Phase
Once the canary has been promoted to 100%, all pods are running the new version, and the backfill has finished, you can run the migrate phase. This is the deploy that switches the new code to read from display_name instead of user_name. No schema change happens in this deploy; it's a pure application-level switch.
This step also goes through a canary analysis. The risk here is application-logic errors: the backfill from the expand phase might have missed rows created during a race condition, or the new column might have a different collation that breaks sorting. The canary will catch these before you hit 100% of users.
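One way to reduce the risk of missed rows is to check the two columns against each other before shipping the read switch. The sketch below assumes the users table from the rename example and uses sqlite3 as a stand-in for Postgres; the function names are hypothetical.

```python
import sqlite3


def count_divergent_rows(conn: sqlite3.Connection) -> int:
    """Count rows where display_name is missing or differs from user_name."""
    row = conn.execute(
        "SELECT COUNT(*) FROM users "
        "WHERE user_name IS NOT NULL "
        "AND (display_name IS NULL OR display_name <> user_name)"
    ).fetchone()
    return row[0]


def safe_to_switch_reads(conn: sqlite3.Connection) -> bool:
    """Gate for the migrate deploy: only switch reads when nothing diverges."""
    return count_divergent_rows(conn) == 0
```

On a very large table a full count like this is itself expensive, so in practice it may need to be sampled or run against a replica. It also does not catch collation differences, which only show up in how the application sorts and compares values.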
The Contract Phase
After the migrate phase has been promoted and stable for your confidence window (typically 1-2 deploy cycles, a few days), the contract phase removes the old column:
-- Contract migration (only safe after all pods read from display_name)
ALTER TABLE users DROP COLUMN user_name;
This runs as a separate deploy, again through canary analysis. The canary analysis here is mostly a sanity check — if any old code path still references the dropped column, you'll see SQL errors in the canary window before they hit production at scale.
Wiring This Into Kubestead
Kubestead supports schema-aware rollout sequencing via pre-deploy hooks. In the rollout spec, a preDeployHook can reference a Kubernetes Job that runs the migration before the canary starts receiving traffic:
rollout:
  name: users-service
  preDeployHooks:
    - name: run-expand-migration
      jobTemplate: db-migration-job
      failureAction: abort # abort rollout if migration job fails
  canarySteps:
    - setWeight: 5
    - pause:
        duration: 5m
    - analysis:
        template: http-error-rate-gate
    - setWeight: 100
The failureAction: abort is critical. If the expand migration fails (unexpected schema state, lock timeout, constraint violation), the rollout aborts before any canary traffic is routed to the new pods. The stable version continues running against the unmodified schema.
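For the abort to fire, the migration job has to report failure clearly. The sketch below shows one possible entrypoint for such a job. It assumes the hook follows standard Kubernetes Job semantics, where a non-zero container exit marks the Job as failed, and it uses sqlite3 in place of Postgres. The function names are hypothetical.

```python
import sqlite3
import sys


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """PRAGMA table_info returns one row per column; index 1 is its name."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)


def run_expand_migration(conn: sqlite3.Connection) -> None:
    """Add display_name only if it is missing, so a retried job is a no-op."""
    if column_exists(conn, "users", "display_name"):
        return
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.commit()


def main(db_path: str) -> int:
    """Return a process exit code: 0 on success, 1 on any database error."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            run_expand_migration(conn)
        finally:
            conn.close()
    except sqlite3.Error as exc:
        print(f"expand migration failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

The container command would end with something like sys.exit(main(path)), so the exit code reaches the Job controller. Making the migration idempotent matters because a Job can be retried after a partial failure.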
The Hard Part: Backfills Under Live Traffic
The expand-migrate-contract pattern assumes the backfill (populating the new column for existing rows) completes before the migrate phase deploys. For large tables — hundreds of millions of rows, terabyte-scale Postgres instances — the backfill can't run in a single transaction without locking the table.
The production-safe approach is a batched backfill job that runs asynchronously, in chunks of 1,000-10,000 rows, with a sleep between batches to keep replication lag in check. This can take hours or days for large tables. The expand phase isn't complete, and the migrate deploy can't ship, until the backfill finishes, which creates a real tension with rapid deploy cadences.
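A batched backfill along these lines could be sketched as follows. It uses sqlite3 as a stand-in for Postgres and assumes an integer primary key named id; the default batch size and pause are illustrative values, not tuned recommendations.

```python
import sqlite3
import time


def backfill_display_name(
    conn: sqlite3.Connection,
    batch_size: int = 1000,
    pause_seconds: float = 0.1,
) -> int:
    """Copy user_name into display_name in primary-key order.

    Each batch is committed as its own short transaction, followed by a
    pause. Returns the total number of rows updated.
    """
    last_id = 0
    total = 0
    while True:
        # Find the next batch of rows that still need a value.
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM users "
                "WHERE id > ? AND display_name IS NULL "
                "ORDER BY id LIMIT ?",
                (last_id, batch_size),
            )
        ]
        if not ids:
            return total
        # Re-check IS NULL so rows already dual-written are left alone.
        cur = conn.execute(
            "UPDATE users SET display_name = user_name "
            "WHERE id BETWEEN ? AND ? AND display_name IS NULL",
            (ids[0], ids[-1]),
        )
        conn.commit()
        total += cur.rowcount
        last_id = ids[-1]
        time.sleep(pause_seconds)
```

Walking the table by primary key means each batch is an index range scan, and advancing last_id guarantees the loop terminates even if some rows cannot be filled. A production job would likely also watch replication lag directly and persist last_id so it can resume after a restart.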
We're not suggesting there's a magic solution to this. Backfills on large tables are operationally expensive regardless of your deployment strategy. What the expand-migrate-contract pattern gives you is safe concurrency — the backfill can run while both old and new code are deployed in production, because the expand schema is compatible with both versions. That's the key property that single-step migrations lack.
When Expand-Migrate-Contract Is Overkill
Purely additive schema changes don't require the full three-phase pattern. Examples include adding a new column with a default value, adding a new table, or adding a non-unique index (on Postgres, built with CREATE INDEX CONCURRENTLY so writes aren't blocked during the build). Adding a column is backward-compatible: the old code ignores the new column; the new code uses it. A standard canary rollout handles this without special schema sequencing.
The three-phase pattern is required for: column renames, column type changes, removing a column that existing code references, changing constraints that existing writes might violate, and any migration that changes the behavior of existing rows rather than just adding new schema.