Platform Engineering November 18, 2024

The On-Call Automation Playbook: What Can Actually Be Automated (and What Can't)

How to Automate 80% of Your On-Call Runbook

The on-call rotation exists to catch what automation can't. But most runbooks are written as though humans should handle everything — even the decisions that follow a deterministic flowchart every time. When you audit what actually wakes people up at 3 AM, the majority of pages trace back to a small number of repeatable patterns that could have been handled without a human in the loop.

We're not arguing that everything should be automated. We're arguing that the boundary between "automate this" and "wake a human" is drawn in the wrong place for most teams. Here's a framework for repositioning that line.

The Automatable Decision Test

A decision is safe to automate when it satisfies three conditions simultaneously:

The input signal is reliable and well-understood — not "alert X fired" but "alert X fired because condition Y, which has a known recovery procedure"
The recovery action is reversible with low blast radius (restart a pod vs. run a database migration)
The action has been manually executed successfully at least 10 times without edge cases

If a decision fails any of these conditions, it should stay in the human runbook, at least until you've collected more data on the edge cases and built confidence in the signal.
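The test above can be written down as a simple gate. This is a minimal sketch, not a prescribed implementation: the `RecoveryCandidate` model, its field names, and the `is_safe_to_automate` function are all illustrative assumptions, and the example values are made up. Only the threshold of 10 clean manual runs comes from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryCandidate:
    """A runbook decision being considered for automation (hypothetical model)."""
    name: str
    signal_has_known_cause: bool  # the alert maps to a condition with a known recovery procedure
    action_is_reversible: bool    # the recovery action can be undone
    low_blast_radius: bool        # e.g. restart a pod, not run a database migration
    clean_manual_runs: int        # manual executions that succeeded without edge cases


MIN_CLEAN_MANUAL_RUNS = 10  # taken from the third condition in the test


def is_safe_to_automate(candidate: RecoveryCandidate) -> bool:
    """Return True only when all three conditions hold at the same time."""
    return (
        candidate.signal_has_known_cause
        and candidate.action_is_reversible
        and candidate.low_blast_radius
        and candidate.clean_manual_runs >= MIN_CLEAN_MANUAL_RUNS
    )


# Illustrative values only.
canary_rollback = RecoveryCandidate("canary rollback", True, True, True, 200)
db_migration = RecoveryCandidate("database migration", True, False, False, 15)

print(is_safe_to_automate(canary_rollback))
print(is_safe_to_automate(db_migration))
```

Keeping the gate as a plain conjunction makes it obvious which condition a candidate failed, which is the data you need before revisiting the decision later.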

Deployment-triggered rollbacks satisfy all three, which is why automated canary rollback is one of the highest-confidence automation targets. The input signal is canary error rate — reliable, well-understood, directly caused by the deploy. The action (scale canary pods to zero) is reversible. And most teams have manually executed this hundreds of times before automating it.

High-Confidence Automation: Deployment Rollback

Deployment-related pages (canary error rate spike, latency regression, CrashLoopBackOff on new pods) are almost always resolved by one action: roll back the deployment. What varies is not the action but the detection latency.

A platform engineering team at a mid-size logistics software company tracked their overnight pages over six months. Roughly 65% traced to deployment events within the prior 2 hours. Of those, over 90% were resolved by reverting the deployment — no investigation required, no judgment call. The automation was straightforward: metric-gated canary rollout with automatic rollback on threshold breach. Deployment-related pages dropped to near zero.

The automation wins here because the decision tree is: did metrics degrade after this deploy? Yes → rollback. That's not a judgment call — it's a lookup.
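To make the "lookup" concrete, here is a hedged sketch of the decision. The threshold values, the `CanaryMetrics` type, and the `should_roll_back` function are assumptions for illustration. The 2-hour window is borrowed from the example above and would need tuning for a real service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class CanaryMetrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


# Hypothetical thresholds; real values depend on the service.
MAX_ERROR_RATE = 0.02
MAX_P99_LATENCY_MS = 800.0
DEPLOY_WINDOW = timedelta(hours=2)


def should_roll_back(
    metrics: CanaryMetrics,
    last_deploy_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Did metrics degrade after a recent deploy? Yes means roll back."""
    if last_deploy_at is None:
        return False  # nothing to roll back
    if now - last_deploy_at > DEPLOY_WINDOW:
        return False  # too old to blame on the deploy; leave it to other handling
    return (
        metrics.error_rate > MAX_ERROR_RATE
        or metrics.p99_latency_ms > MAX_P99_LATENCY_MS
    )
```

The function has no branches that require interpretation, which is the point: every input that reaches it has one answer.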

Medium-Confidence Automation: Pod Restart on CrashLoopBackOff

A CrashLoopBackOff on a pod that was running cleanly yesterday and wasn't touched by a deploy is a good candidate for an automated first-line restart, but only with a circuit breaker. If the pod restarts successfully and stays healthy for 10 minutes, resolve the alert automatically. If it is still crashing after two restart attempts, escalate to the on-call engineer.

This is not novel: Kubernetes already restarts crashed containers on its own, with an increasing back-off delay between attempts. The automation layer is in the alerting and escalation: don't page the on-call until the automated recovery has been given a chance to work. Most teams page immediately on CrashLoopBackOff without waiting to see if the pod recovers, which creates false-urgency pages at 3 AM.

The edge case that breaks this: CrashLoopBackOff caused by a bad config change pushed 10 minutes ago. A blind restart won't fix it, and after two restarts you want a human. Your automation needs to check the recent deploy history before classifying this as "restart and monitor" vs. "wake the on-call."
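The circuit breaker and the recent-change check can be combined into one triage function. This is a sketch under stated assumptions: the `Action` names, the function signature, and the 30-minute `RECENT_CHANGE_WINDOW_MINUTES` are hypothetical. The two-restart limit and the 10-minute healthy period come from the description above.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RESTART = "restart"            # attempt an automated restart
    WAIT = "wait"                  # pod is up; keep monitoring
    RESOLVE = "resolve"            # close the alert automatically
    PAGE_ON_CALL = "page_on_call"  # escalate to a human


MAX_RESTART_ATTEMPTS = 2
HEALTHY_MINUTES_TO_RESOLVE = 10
RECENT_CHANGE_WINDOW_MINUTES = 30  # assumption; pick a window that fits your deploy cadence


def triage_crash_loop(
    is_crashing: bool,
    restart_attempts: int,
    healthy_minutes: float,
    minutes_since_last_change: Optional[float],
) -> Action:
    """Decide the next step for a CrashLoopBackOff alert.

    minutes_since_last_change is None when no deploy or config change is on record.
    """
    if is_crashing:
        recently_changed = (
            minutes_since_last_change is not None
            and minutes_since_last_change <= RECENT_CHANGE_WINDOW_MINUTES
        )
        if recently_changed:
            return Action.PAGE_ON_CALL  # a blind restart won't fix a bad change
        if restart_attempts >= MAX_RESTART_ATTEMPTS:
            return Action.PAGE_ON_CALL  # circuit breaker tripped
        return Action.RESTART
    if healthy_minutes >= HEALTHY_MINUTES_TO_RESOLVE:
        return Action.RESOLVE
    return Action.WAIT
```

Checking the change history before counting restarts means a pod broken by a config push is escalated on the first evaluation instead of after two wasted restart cycles.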

Low-Confidence Automation: Capacity-Related Events

Node out-of-memory, persistent volume at 90% capacity, HPA hitting max replicas — these are situations where an automated action might fix the immediate symptom but mask a trend that needs human investigation.

We're not saying capacity events should never be automated. We're saying they require a richer signal model before automation is safe. An HPA at max replicas once during a flash sale is expected. An HPA at max replicas three days in a row at baseline traffic is a signal that the service has outgrown its resource allocation. Automation that adds more nodes every time will keep the lights on but hide the growth trend until the budget is blown.

For capacity events, the better pattern is to automate the immediate relief and simultaneously open a ticket for human review, rather than resolving and closing the alert.
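One way to sketch the relief-plus-ticket pattern for the HPA case is shown below. The `CapacityResponse` type, the function names, the `expected_peak` flag, and the ticket priorities are assumptions for illustration. The three-day streak mirrors the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CapacityResponse:
    apply_relief: bool     # take the automated action now (e.g. add capacity)
    open_ticket: bool      # always leave a trail for human review
    ticket_priority: str   # hypothetical priority labels: "low" or "high"


SUSTAINED_DAYS = 3  # from the "three days in a row" example


def consecutive_days_at_max(daily_hit_max: List[bool]) -> int:
    """Count the trailing streak of days (most recent last) the HPA hit max replicas."""
    streak = 0
    for hit in reversed(daily_hit_max):
        if not hit:
            break
        streak += 1
    return streak


def respond_to_hpa_at_max(daily_hit_max: List[bool], expected_peak: bool) -> CapacityResponse:
    """Relieve the symptom immediately, but never resolve-and-close."""
    streak = consecutive_days_at_max(daily_hit_max)
    sustained = streak >= SUSTAINED_DAYS and not expected_peak
    return CapacityResponse(
        apply_relief=True,
        open_ticket=True,
        ticket_priority="high" if sustained else "low",
    )
```

The ticket is opened on every event, so the growth trend stays visible even when each individual event was handled without a page.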

What Can't Be Automated: Coordination Decisions

Service degradation that requires cross-team coordination — a downstream API that's unavailable, a shared database cluster that's slow — cannot be resolved without human judgment about business priorities. Which traffic do you shed? Do you take the payment service down to protect the read path? Who decides?

These decisions aren't automatable because they involve value tradeoffs that aren't captured in metrics. The automation's job here is to get the right humans notified quickly, with context about what's degraded, the blast radius, and what options exist — not to make the call itself.

The runbook entry for these scenarios should read: "Notify team leads for services X, Y, Z. Provide current error rates and estimated user impact. Wait for human decision before any traffic manipulation." That's not a failure of automation — that's the correct boundary.
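In code, that runbook entry amounts to two things: assemble the context, and refuse to touch traffic until a person has chosen an option. The sketch below assumes a hypothetical `CoordinationIncident` record; the field names, message layout, and approval check are illustrative and not tied to any particular paging tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CoordinationIncident:
    degraded_dependency: str
    affected_services: List[str]
    error_rates: Dict[str, float]        # service name -> current error rate (0.0 to 1.0)
    estimated_users_impacted: int
    options: List[str]                   # choices a human can pick from
    approved_option: Optional[str] = None
    approved_by: Optional[str] = None


def build_notification(incident: CoordinationIncident) -> str:
    """Assemble the context team leads need to make the call."""
    lines = [
        f"Degraded dependency: {incident.degraded_dependency}",
        f"Estimated users impacted: {incident.estimated_users_impacted}",
        "Current error rates:",
    ]
    for service in incident.affected_services:
        rate = incident.error_rates.get(service)
        shown = "unknown" if rate is None else f"{rate:.1%}"
        lines.append(f"  {service}: {shown}")
    lines.append("Options awaiting a human decision:")
    lines.extend(f"  - {option}" for option in incident.options)
    return "\n".join(lines)


def may_manipulate_traffic(incident: CoordinationIncident) -> bool:
    """Traffic changes stay blocked until a named person approves a listed option."""
    return (
        incident.approved_by is not None
        and incident.approved_option in incident.options
    )
```

The automation here produces information and enforces a gate; the decision itself stays with whoever is recorded in `approved_by`.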

Wiring Automated Rollback into the Escalation Chain

One operational question that comes up when teams first introduce automated rollback: if the rollback happens at 3 AM and resolves cleanly, do you still page the engineer?

The answer depends on context. Kubestead's rollback events include a structured payload: which service, which deployment version, which metric triggered the rollback, and the error rate at trigger time. A clean automatic rollback — error rate spiked, rolled back within 60 seconds, traffic clean for 10 minutes post-rollback — probably doesn't need a 3 AM page. It needs a morning Slack notification and a deployment audit the next day.

A rollback where metrics are still elevated 5 minutes after the rollback completes — meaning the issue may not be the deploy — should escalate to the on-call immediately. The automation resolved the most likely cause; a human needs to investigate why the metrics haven't cleared.
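The two cases above can be sketched as a routing function. The `RollbackEvent` class below only models the payload fields described in this section; it is not Kubestead's actual schema, and the `Route` names and `route_rollback` signature are assumptions. The 5-minute and 10-minute values come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    MORNING_SLACK = "morning_slack_notification"
    PAGE_ON_CALL = "page_on_call"
    KEEP_WATCHING = "keep_watching"


@dataclass(frozen=True)
class RollbackEvent:
    """Hypothetical model of the fields described for a rollback event."""
    service: str
    deployment_version: str
    trigger_metric: str
    error_rate_at_trigger: float


ELEVATED_GRACE_MINUTES = 5    # still elevated after this long: escalate
CLEAN_MINUTES_REQUIRED = 10   # clean for this long: no page needed


def route_rollback(
    minutes_since_rollback: float,
    metrics_elevated: bool,
    clean_minutes: float,
) -> Route:
    """Choose where a completed automatic rollback should be reported."""
    if metrics_elevated and minutes_since_rollback >= ELEVATED_GRACE_MINUTES:
        return Route.PAGE_ON_CALL  # the deploy may not have been the cause
    if not metrics_elevated and clean_minutes >= CLEAN_MINUTES_REQUIRED:
        return Route.MORNING_SLACK
    return Route.KEEP_WATCHING     # too early to classify either way


def summarize(event: RollbackEvent) -> str:
    """One-line context for whichever channel receives the event."""
    return (
        f"{event.service} {event.deployment_version} rolled back: "
        f"{event.trigger_metric} at {event.error_rate_at_trigger:.1%}"
    )
```

The explicit `KEEP_WATCHING` state avoids forcing a decision in the first few minutes after a rollback, when neither the clean nor the elevated condition has been established yet.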

Building this tiered escalation into your alert routing is the difference between automation that eliminates pages and automation that just moves them downstream in the incident timeline.