Write Production Incident Runbook & SOP

Production incidents happen at the worst times. Services go down at 3am. Databases fill unexpectedly during peak traffic. Security breaches are discovered on weekends. In these high-stress moments, clear runbooks mean the difference between minutes and hours of downtime. According to Atlassian research, teams with comprehensive runbooks resolve incidents 60% faster than those relying on institutional knowledge. Writing effective runbooks protects your systems, your customers, and your on-call engineers' sanity.

What Problem Do Runbooks Solve?

Runbooks document how to respond to specific incidents or operational tasks. Without runbooks, on-call engineers waste precious time figuring out where logs live, which commands restart services, or whom to escalate to. Panic compounds as downtime lengthens, and engineers make mistakes under pressure. Runbooks provide tested procedures that reduce cognitive load during a crisis. Following documented steps prevents the common mistakes that make a bad situation worse.

Runbooks enable anyone on call to handle incidents, not just expert engineers familiar with every system. When only one person knows how to fix a problem, that person becomes a bottleneck and a burnout risk. Runbooks distribute knowledge. Junior engineers can follow runbooks successfully. Senior engineers save time by not explaining the same procedures repeatedly. Team resilience improves dramatically when knowledge lives in docs, not only in people's heads.

Well-maintained runbooks also reduce toil by exposing automation opportunities. Documenting manual procedures reveals repetitive tasks that are ripe for automation. Many teams automate runbook steps over time, eventually replacing manual runbooks with self-healing systems. But automation starts with understanding the process, which requires documentation. Runbooks are stepping stones toward full automation.

What Must Every Runbook Include?

Every runbook needs six core sections: incident detection and symptoms, severity classification, immediate response steps, detailed troubleshooting procedures, escalation contacts, and post-incident tasks. Missing any section creates confusion during incidents. An incomplete runbook is only marginally better than no runbook. A complete runbook transforms incident response from chaos into organized problem-solving.
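The six-section rule above can be checked mechanically. The following is a minimal sketch, assuming a runbook is stored as a dictionary keyed by section name; the key names and the `HighErrorRate` alert are hypothetical and not part of any standard.

```python
# Hypothetical section keys, mirroring the six core sections listed above.
REQUIRED_SECTIONS = [
    "detection",
    "severity",
    "immediate_response",
    "troubleshooting",
    "escalation",
    "post_incident",
]


def missing_sections(runbook: dict) -> list:
    """Return required sections that are absent or empty, in template order."""
    return [
        name
        for name in REQUIRED_SECTIONS
        if not str(runbook.get(name, "")).strip()
    ]


# Example draft with an empty section and two sections not written yet.
draft = {
    "detection": "Alert: HighErrorRate fires on the API dashboard.",
    "severity": "SEV1 if checkout is down.",
    "immediate_response": "1. Roll back the last deploy.",
    "troubleshooting": "",
}
print(missing_sections(draft))
```

A check like this could run in review or CI so that an incomplete runbook is caught before an incident rather than during one.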

The incident detection section describes how you know this incident is happening. List the monitoring alerts, user-reported symptoms, and system behaviors that indicate the problem. Include alert names, dashboard links, and log search queries. During incidents, engineers need quick confirmation that they are dealing with the suspected problem. Clear detection criteria prevent misdiagnosis and wasted troubleshooting effort.

Severity classification helps prioritize response. Define severity levels with clear criteria: SEV1 for customer-facing outages requiring immediate response, SEV2 for degraded performance, SEV3 for issues that can wait for business hours. Document who must be notified at each severity. Classification prevents over-reaction to minor issues and under-reaction to critical ones. Consistent severity definitions improve communication across incidents.
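One way to make the severity criteria above unambiguous is to write them down as a small function. This is a sketch only: the two input flags and the notification lists are assumptions for illustration, and real criteria will usually involve more signals.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "customer-facing outage, respond immediately"
    SEV2 = "degraded performance"
    SEV3 = "can wait for business hours"


# Hypothetical notification lists; replace with your own roles or channels.
NOTIFY = {
    Severity.SEV1: ["on-call engineer", "incident commander", "support lead"],
    Severity.SEV2: ["on-call engineer", "service owner"],
    Severity.SEV3: ["service owner"],
}


def classify(customer_facing_outage: bool, degraded: bool) -> Severity:
    """Map two observed conditions to a severity level, most severe first."""
    if customer_facing_outage:
        return Severity.SEV1
    if degraded:
        return Severity.SEV2
    return Severity.SEV3


level = classify(customer_facing_outage=False, degraded=True)
print(level.name, "->", NOTIFY[level])
```

Checking the most severe condition first means an outage that also shows degraded performance is still classified as SEV1.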

The immediate response section lists the first actions to take. These steps stabilize the situation and reduce impact: restart unhealthy servers, scale capacity, toggle feature flags, route traffic away from problem components. Immediate steps buy time for deeper investigation. Order steps by priority and number them clearly, because stressed engineers follow checklists better than prose. Make the first actions fast and safe.

How Should You Document Troubleshooting Steps?

The troubleshooting section walks through diagnostic procedures step by step. Use a decision-tree format: if you see this symptom, check this; if the result is X, do Y; if the result is Z, do W. Decision trees guide engineers through systematic investigation rather than random debugging. Include copy-pasteable commands for checking logs, testing connectivity, or querying databases. Commands should work exactly as written, without modification.
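The decision-tree format described above can be sketched as a nested data structure plus a small walker. The questions and actions below are hypothetical placeholders, not a recommended diagnostic flow.

```python
# Each node is either a question with "yes"/"no" branches or a final action.
# The checks and actions are hypothetical placeholders.
TREE = {
    "question": "Is the health endpoint returning errors?",
    "yes": {
        "question": "Did a deployment finish in the last 30 minutes?",
        "yes": {"action": "Roll back the deployment."},
        "no": {"action": "Check database connectivity."},
    },
    "no": {"action": "Check the load balancer and DNS."},
}


def walk(tree: dict, answers: list) -> str:
    """Follow yes/no answers through the tree and return the resulting action."""
    node = tree
    for answer in answers:
        if "action" in node:
            break  # Already at a leaf; extra answers are ignored.
        node = node["yes" if answer else "no"]
    if "action" not in node:
        raise ValueError("Not enough answers to reach an action")
    return node["action"]


print(walk(TREE, [True, False]))
```

Writing the tree as data keeps every branch explicit, which makes gaps (a symptom with no documented next step) easy to spot in review.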

Document common causes and their fixes. If this incident has happened before, explain what caused it and how it was resolved. Include the specific commands or configuration changes that solved the problem. Many incidents recur, and documenting past solutions prevents rediscovering the same fixes repeatedly. Historical context also helps identify patterns that point to systemic issues requiring permanent solutions.

Include diagnostic commands with expected outputs. Show what a healthy system looks like next to an unhealthy one. Engineers can recognize deviations only when they know the baseline. One effective pattern: "Run this command. Healthy output looks like this. If you see this instead, it means X is wrong." Examples eliminate ambiguity about what counts as abnormal.
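The "healthy output versus unhealthy output" pattern above can be expressed as a simple lookup. This sketch assumes the command output has already been captured as a string; the marker strings and their meanings are invented for illustration and do not come from any real tool.

```python
def diagnose(output: str, healthy_marker: str, known_failures: dict) -> str:
    """Compare command output to a healthy baseline and known failure signatures."""
    if healthy_marker in output:
        return "healthy"
    for signature, meaning in known_failures.items():
        if signature in output:
            return meaning
    return "unrecognized output: escalate or investigate further"


# Hypothetical failure signatures and what each one means.
KNOWN_FAILURES = {
    "connection refused": "the service is not listening on its port",
    "timeout": "the service is reachable but overloaded or hung",
}

print(diagnose("status: ok", "status: ok", KNOWN_FAILURES))
print(diagnose("error: connection refused", "status: ok", KNOWN_FAILURES))
```

The final fallback matters as much as the known cases: output that matches nothing documented is itself a signal to escalate.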

Provide rollback procedures in case a change caused the incident. Document how to revert deployments, restore from backups, or undo configuration changes. A rollback should be a safe, tested procedure that engineers can execute confidently. Fear of making things worse prevents action. Clear rollback procedures give engineers the confidence to fix problems quickly.

What Escalation Information Belongs in Runbooks?

The escalation section documents when and how to escalate. Define escalation triggers: the incident is unresolved after X minutes, specific errors appear, or the situation is beyond the on-call engineer's expertise. List escalation contacts with multiple communication methods: Slack handles, phone numbers, PagerDuty schedules. Include backup contacts in case the primary contacts are unavailable. Outdated contact information wastes critical minutes during incidents.
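The escalation triggers above (a time limit and specific errors) can be sketched as one function. The time limit is left as a parameter because the right value depends on the service; the 30-minute figure and the error names in the example are hypothetical.

```python
from datetime import datetime, timedelta


def should_escalate(
    started_at: datetime,
    now: datetime,
    limit_minutes: int,
    seen_errors: list,
    trigger_errors: set,
) -> bool:
    """Escalate when the time limit has passed or a trigger error has appeared."""
    if now - started_at >= timedelta(minutes=limit_minutes):
        return True
    return any(error in trigger_errors for error in seen_errors)


# Hypothetical values for illustration only.
started = datetime(2024, 1, 1, 3, 0)
triggers = {"data corruption"}
print(should_escalate(started, datetime(2024, 1, 1, 3, 45), 30, [], triggers))
print(should_escalate(started, datetime(2024, 1, 1, 3, 10), 30, ["disk full"], triggers))
```

The third trigger in the text, a situation beyond the engineer's expertise, is a judgment call and is deliberately not modeled here.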

Document external escalations: when to contact vendors, cloud providers, or security teams. Include account IDs, support ticket systems, and escalation procedures for third-party services. Many incidents involve external dependencies. Clear vendor escalation paths prevent delays spent figuring out how to reach support.

Explain what information to include in escalation messages. Escalated engineers need context quickly: what is broken, what you tried, and what results you observed. Provide an escalation message template. During incidents, engineers struggle to compose coherent summaries under pressure. Templates ensure that complete, useful information reaches escalated engineers immediately.
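An escalation message template along these lines could be as simple as the following sketch. The field names and the example incident are assumptions; the three required pieces of context (what is broken, what was tried, what was observed) come from the text above.

```python
from string import Template

# Hypothetical layout; the three context fields mirror the guidance above.
ESCALATION_TEMPLATE = Template(
    "ESCALATION: $service ($severity)\n"
    "What is broken: $broken\n"
    "What I tried: $tried\n"
    "What I observed: $observed\n"
)


def build_message(**fields: str) -> str:
    """Fill the template; a missing field raises KeyError instead of sending a partial message."""
    return ESCALATION_TEMPLATE.substitute(**fields)


print(
    build_message(
        service="checkout-api",
        severity="SEV1",
        broken="Checkout requests return HTTP 500.",
        tried="Restarted the service and rolled back the last deploy.",
        observed="Error rate unchanged after both actions.",
    )
)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: an incomplete message fails loudly instead of reaching the escalated engineer with blanks.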

How Should You Handle Post-Incident Tasks?

The post-incident section lists cleanup and follow-up tasks. Document what to do after the incident is resolved: update status pages, notify stakeholders, schedule post-mortems, create bug tickets, and update monitoring. Post-incident tasks often get forgotten in the relief that follows resolution. Documenting them ensures teams close the loop properly.

Include a post-mortem template or a link to one. Post-mortems analyze what happened, why, and how to prevent recurrence. Templates ensure consistent, thorough analysis. Key post-mortem elements: timeline of events, root cause, impact, what went well, what went poorly, and action items. Action items should have owners and deadlines. Post-mortems without follow-through waste the opportunity to improve.
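The rule that every action item needs an owner and a deadline is easy to enforce in code. This is a minimal sketch with a hypothetical `ActionItem` structure; the example items and the owner name are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    deadline: Optional[date] = None


def incomplete_items(items: list) -> list:
    """Return descriptions of action items that lack an owner or a deadline."""
    return [
        item.description
        for item in items
        if not item.owner or item.deadline is None
    ]


# Hypothetical post-mortem action items.
items = [
    ActionItem("Add disk-usage alert", owner="dana", deadline=date(2024, 7, 1)),
    ActionItem("Document failover steps", owner="dana"),
    ActionItem("Review retry settings"),
]
print(incomplete_items(items))
```

A post-mortem could be treated as unfinished until this list is empty.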

Document evidence preservation for serious incidents. Explain what logs, metrics, or system state to save before cleaning up. Some incidents require investigation beyond immediate resolution. Security incidents especially need thorough forensics. Clear evidence preservation procedures prevent accidentally destroying information needed later.

What Makes Runbooks Actually Useful During Incidents?

Write in the imperative mood, not the descriptive. Say "Restart the service," not "The service can be restarted." During incidents, engineers need instructions, not descriptions. Command-style writing reduces cognitive load because engineers can follow instructions without translating prose into actions. Every sentence should tell the engineer what to do.

Use checklists for multi-step procedures. Checkbox lists prevent skipping steps under pressure. Engineers can track progress through complex procedures. Mark dangerous steps with warnings. Highlight irreversible actions requiring extra caution. Visual formatting matters. During stressful incidents, engineers scan rather than read carefully. Make important information jump out.

Include time estimates for each troubleshooting step. Engineers need to judge if approaches are productive or wasting time. "This command takes 30 seconds" versus "this analysis takes 10 minutes" helps prioritize. Time estimates also set expectations. Steps taking longer than expected signal problems, prompting escalation.
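The checklist, warning, and time-estimate advice above can be combined in one small structure. This sketch is illustrative: the `Step` fields, the warning label, and the "twice the estimate" overrun threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    estimate_seconds: int
    irreversible: bool = False


def render(steps: list) -> str:
    """Render steps as a numbered checkbox list with warnings and time estimates."""
    lines = []
    for number, step in enumerate(steps, start=1):
        warning = "WARNING, IRREVERSIBLE: " if step.irreversible else ""
        lines.append(f"[ ] {number}. {warning}{step.text} (~{step.estimate_seconds}s)")
    return "\n".join(lines)


def overrunning(step: Step, elapsed_seconds: float, factor: float = 2.0) -> bool:
    """True when a step has taken more than `factor` times its estimate."""
    return elapsed_seconds > step.estimate_seconds * factor


# Hypothetical steps for illustration.
steps = [
    Step("Confirm the alert on the dashboard.", 30),
    Step("Drop the stale replica.", 600, irreversible=True),
]
print(render(steps))
```

An overrun check like `overrunning(steps[0], 90)` gives the engineer an explicit cue that a step is taking longer than expected and that escalation may be due.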

Keep language simple and direct. Avoid jargon when clear alternatives exist. Junior engineers, and engineers unfamiliar with the system, should be able to understand the runbook. Complexity confuses people in high-stress situations. Simple writing ensures the widest possible audience can follow procedures successfully. Save cleverness for less critical documentation.

How Should You Test and Maintain Runbooks?

Test runbooks regularly through incident drills or game days. Have engineers follow the runbooks to resolve simulated incidents. Testing reveals outdated information, missing steps, and unclear instructions. Update runbooks based on what the tests find. A runbook only works if its procedures actually resolve the problem, and untested runbooks give false confidence.

Update runbooks immediately after real incidents. Fresh from resolution, engineers remember what helped and what confused them. Capture improvements while knowledge is fresh. Many teams update runbooks during post-mortem meetings. Runbooks improve through use. Each incident teaches something. Capture those lessons before they fade.

Assign runbook owners responsible for maintenance. Without ownership, runbooks decay as systems change. Owners review runbooks quarterly, verifying that commands still work and contacts remain current. They also update runbooks whenever systems change. Clear ownership prevents a tragedy of the commons in which everyone assumes someone else maintains the docs.
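A quarterly review cadence can be tracked with a short script. This sketch assumes each runbook records a last-reviewed date; the runbook names are hypothetical, and 90 days is used as an approximation of a quarter.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # Assumption: roughly one quarter.


def stale_runbooks(last_reviewed: dict, today: date) -> list:
    """Return names of runbooks not reviewed within the interval, sorted by name."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > REVIEW_INTERVAL
    )


# Hypothetical runbook names and review dates.
reviews = {
    "database-failover": date(2024, 1, 1),
    "deployment-rollback": date(2024, 5, 1),
    "cache-invalidation": date(2023, 11, 15),
}
print(stale_runbooks(reviews, today=date(2024, 6, 1)))
```

The output could feed a reminder to each runbook's owner, so reviews do not depend on anyone remembering the schedule.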

Version runbooks and track changes. Runbook history helps you understand how procedures evolved. It also supports rollback if an update introduces errors. Some teams use git for runbooks, applying the same version control discipline as they do to code. Others use documentation platforms with built-in versioning. Choose tools that match your team's workflow, but version control is not optional.

What Common Runbook Mistakes Should You Avoid?

Avoid writing runbooks only after disasters. Waiting until major incidents hit means documenting under pressure, likely forgetting important details. Write runbooks proactively. Document procedures as you design systems. Refine runbooks through drills before real incidents test them. Proactive documentation prevents learning through painful failure.

Never assume engineers know where things are. Runbooks should include full paths, complete URLs, exact commands. Do not write "check the logs." Write "SSH to production-server-1, run: tail -f /var/log/application/error.log." Specificity eliminates guesswork. Engineers unfamiliar with systems need complete instructions, not hints.
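A simple lint pass can catch hints like "check the logs" before they reach an on-call engineer. This is a rough sketch: the phrase list is a hypothetical starting point, and plain substring matching will miss many vague instructions.

```python
# Hypothetical phrases that give a hint rather than a complete instruction.
VAGUE_PHRASES = ["check the logs", "look at the dashboard", "contact the team"]


def vague_lines(runbook_text: str) -> list:
    """Return (line number, phrase) pairs for lines containing a vague phrase."""
    findings = []
    for number, line in enumerate(runbook_text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in VAGUE_PHRASES:
            if phrase in lowered:
                findings.append((number, phrase))
    return findings


text = (
    "1. Check the logs.\n"
    "2. SSH to production-server-1, run: tail -f /var/log/application/error.log\n"
)
print(vague_lines(text))
```

The first line is flagged because it names no host, path, or command; the second passes because it spells out all three.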

Do not create one giant runbook for an entire system. Break runbooks into specific scenarios: database failover, deployment rollback, cache invalidation, DNS changes. Specific runbooks are faster to find and follow. Giant documents overwhelm engineers hunting for the relevant procedure. Organize runbooks by service or incident type, keeping each document focused and actionable.

Runbooks transform chaotic incident response into systematic problem-solving. Strong runbooks enable fast resolution, distribute knowledge, and reduce stress for on-call engineers. Invest time writing comprehensive runbooks before incidents strike. Clear procedures written calmly prevent mistakes made under pressure. Use River's documentation tools to create runbooks that save your team when it matters most.

Chandler Supple
