Resilient Operations & Chaos Engineering

Chaos engineering experiments, disaster recovery automation, advanced IaC with policy enforcement, and resilience testing.

Milestone: Acceleration

Level: intermediate

DORA metrics: MTTR, CFR

Job to be done: When an incident occurs, I want a practiced playbook with clear roles and automated recovery steps, so the team restores service quickly without blame and systematically closes gaps discovered in the postmortem.

For engineers

You will establish incident roles, automate runbook steps, create blameless postmortem templates that feed action items into your tracking system, and run drills to rehearse critical failure scenarios.

What you’ll implement

These are the roadmap epic features, organized as a starter backlog.

Chaos Engineering Practices

Automated Disaster Recovery

Circuit Breaker Patterns

Adaptive Rate Limiting

Graceful Degradation Strategies

Execution guide

Practical guidance aligned to the Execution Kit Definition of Done.

Outcome

Incidents are handled predictably with clear roles, comms, runbooks, and learning loops.

Before to After Transformation

BEFORE: Chaotic incident response, no learning loops

War-room chaos, unclear roles, hours to restore, blame-focused postmortems, same incidents repeat

# Before: Ad-hoc incident chaos

Incident response:
- Alert fires, everyone joins #incident channel
- 12 people typing, no clear leader
- "Who owns this service?" (no answer for 20 minutes)
- SSH to production servers to debug
- 3 hours to identify root cause
- 5 hours to restore service

Post-incident:
- Quick meeting: "Let's not do that again"
- No written postmortem
- Finger-pointing about who broke what
- Action items lost in Slack
- Same incident happens 6 weeks later

Problems:
- No practiced runbooks (every incident novel)
- Unclear command structure
- Customer comms delayed/missing
- No systematic learning
- Repeat incidents common

Metrics:
- MTTR: 5+ hours
- % incidents with postmortems: 30%
- Repeat incident rate: 40%
- Postmortem action completion: <20%

AFTER: Structured incident response with continuous learning

Clear roles, automated runbooks, 15-minute MTTR, blameless postmortems with tracked action items

# After: Structured incident management

Incident response:
- Alert fires with runbook link
- Incident Commander (IC) assigned automatically
- IC follows playbook:
  1. Assess severity (SEV1/2/3)
  2. Assemble responders (page the on-call engineers)
  3. Execute runbook steps (mostly automated)
  4. Update status page every 15 minutes
  5. Customer comms via template
- 15 minutes to restore (rehearsed mitigation)

Post-incident:
- Automatic postmortem template created
- Blameless review within 48 hours
- Timeline auto-generated from incident log
- Action items tracked with owners and due dates
- Quarterly review of patterns

Benefits:
- Fast restoration (practiced runbooks)
- Clear accountability (no blame)
- Continuous improvement (actions tracked)
- Customer trust (transparent comms)

Metrics:
- MTTR: 15 minutes (automated runbooks)
- % incidents with postmortems: 100%
- Repeat incident rate: 8%
- Postmortem action completion: 85%
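The first and fourth playbook steps above (assess severity, update the status page every 15 minutes) can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the names `assess_severity` and `status_update_times` and the impact thresholds (50% and 5% of users) are hypothetical placeholders that each team would replace with its own severity definitions.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # assumed meaning: core flow down or most users affected
    SEV2 = 2  # assumed meaning: major degradation
    SEV3 = 3  # assumed meaning: minor or limited impact


def assess_severity(users_affected_pct: float, core_flow_down: bool) -> Severity:
    """Hypothetical triage rule; the thresholds are placeholders."""
    if core_flow_down or users_affected_pct >= 50:
        return Severity.SEV1
    if users_affected_pct >= 5:
        return Severity.SEV2
    return Severity.SEV3


def status_update_times(
    start: datetime, end: datetime, every_minutes: int = 15
) -> list[datetime]:
    """Return the times a status page update is due, from start up to end."""
    step = timedelta(minutes=every_minutes)
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += step
    return times
```

Keeping the triage rule as a pure function makes it easy to rehearse during a drill: responders can feed it past incidents and check whether the assigned severity matches what they would have chosen.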

Symptoms

Ad-hoc incident response

Slow restoration

Blame-heavy postmortems

Prerequisites

Basic monitoring/alerting

On-call ownership

Team commitment to continuous improvement

Implementation steps

Week 1

Adopt incident runbook template and comms cadence
Define severity levels and IC role
Create postmortem process

Week 2

Automate common runbook steps
Run an incident simulation
Create action item tracking

Week 3

Define DR drill cadence
Integrate SLOs into incident reviews
Publish incident metrics dashboard
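The Week 2 item "Automate common runbook steps" can start very small. The sketch below runs a list of steps in order, stops at the first failure, and records a timestamped log that could later seed a postmortem timeline. The classes `RunbookStep` and `RunbookRun` are hypothetical names for illustration, and the step actions are stand-ins for whatever commands a real runbook would call.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class RunbookStep:
    name: str
    action: Callable[[], bool]  # returns True when the step succeeded


@dataclass
class RunbookRun:
    # Each entry is (UTC timestamp, message); this doubles as an incident log.
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def log(self, message: str) -> None:
        self.timeline.append((datetime.now(timezone.utc), message))

    def execute(self, steps: list[RunbookStep]) -> bool:
        """Run steps in order; stop and return False at the first failure."""
        for step in steps:
            self.log(f"start: {step.name}")
            try:
                ok = step.action()
            except Exception as exc:
                self.log(f"error: {step.name}: {exc}")
                return False
            if not ok:
                self.log(f"failed: {step.name}")
                return False
            self.log(f"done: {step.name}")
        return True


if __name__ == "__main__":
    run = RunbookRun()
    # Placeholder actions; real steps would restart a service, fail over, etc.
    succeeded = run.execute([
        RunbookStep("check health endpoint", lambda: True),
        RunbookStep("restart worker pool", lambda: True),
    ])
    for timestamp, message in run.timeline:
        print(timestamp.isoformat(), message)
    print("runbook succeeded:", succeeded)
```

Stopping at the first failed step is a deliberate choice here: it hands control back to the Incident Commander rather than continuing with steps whose preconditions may no longer hold.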

Definition of Done

Runbook template used
Postmortems produce action items
At least one game day executed
Practice integrated into team workflow

Metrics

Leading Indicators

% incidents with postmortems
On-call readiness checks

Lagging Indicators

MTTR
Repeat incident rate
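As a rough illustration, the indicators above can be computed from a simple list of incident records. The `Incident` fields below are assumptions about what a tracking system might export. MTTR is measured here from detection to restoration, which is one common convention; some teams measure from the start of impact instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    detected_at: datetime
    restored_at: datetime
    has_postmortem: bool
    repeat_of: Optional[str] = None  # id of an earlier incident with the same cause


def _require_incidents(incidents: list[Incident]) -> None:
    if not incidents:
        raise ValueError("at least one incident is required")


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from detection to restoration."""
    _require_incidents(incidents)
    total = sum((i.restored_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def postmortem_coverage(incidents: list[Incident]) -> float:
    """Fraction of incidents that have a written postmortem."""
    _require_incidents(incidents)
    return sum(i.has_postmortem for i in incidents) / len(incidents)


def repeat_incident_rate(incidents: list[Incident]) -> float:
    """Fraction of incidents flagged as a repeat of an earlier one."""
    _require_incidents(incidents)
    return sum(i.repeat_of is not None for i in incidents) / len(incidents)
```

Whether an incident counts as a repeat is a human judgment made during the review, so the `repeat_of` field is only as reliable as the postmortem process that fills it in.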

Failure modes

Postmortems become performative

No follow-through on actions

Runbooks not kept current

Ownership

SRE/Operations

Own incident process
Run drills

Teams

Maintain runbooks
Deliver action items

What good looks like (by org scale)

Small Teams

Clear incident roles
Simple comms cadence

Medium Orgs

Regular drills
Action items with SLAs

Enterprise

Automated runbooks
DR KPIs tied to SLOs

References

PagerDuty Incident Response

Resources

Templates and related materials for this kit.

Templates

Copy/paste artifacts that support this kit.

Game Day Plan

A template for running a game day: objectives, scenarios, comms, and learning outcomes.

Incident Review (Blameless Postmortem)

A blameless incident review template that produces actionable follow-ups and learning.

Incident Runbook

A standard incident template: triage, comms, mitigation, and post-incident actions.

On-Call Rotation Setup

A template for establishing and documenting on-call rotations with escalation paths.

SLO / SLI Template

A practical template for defining SLOs, SLIs, and error budgets with rollout guidance.

Related capabilities

Capabilities tracked under this epic in the roadmap.

Chaos Engineering Practices
>= 60% of critical services undergo monthly chaos experiments (pod failures, network latency, resource exhaustion).
Automated Disaster Recovery
>= 80% of critical services have automated DR failover tested quarterly with RTO < 1hr and RPO < 15min.
Circuit Breaker Patterns
>= 75% of service-to-service calls protected by circuit breakers (Istio, Envoy, Resilience4j) preventing cascade failures.
Adaptive Rate Limiting
>= 80% of public APIs have adaptive rate limiting protecting against traffic spikes and abuse.
Graceful Degradation Strategies
>= 70% of services implement degraded mode (serve cached data, disable non-critical features) during dependency failures.
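To make two of the capabilities above concrete, here is a toy circuit breaker that falls back to a degraded response (for example, cached data) while a dependency is failing. It is a single-threaded teaching sketch and does not reproduce the API of Istio, Envoy, or Resilience4j. The class name, the default threshold and timeout, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the timeout has elapsed, calls are let through again as a trial.
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Call primary unless the breaker is open; use fallback on failure."""
        if self.is_open:
            return fallback()  # degraded mode: skip the failing dependency
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the breaker
            return fallback()
        # A success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_live_prices, read_cached_prices)`, where both function names are hypothetical. In production a maintained library or a service mesh is usually a better fit than hand-rolled code, since it also handles concurrency and metrics.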

Related kits

Other kits in the same milestone or with similar DORA impact.

SLO-Driven Observability & Error Budgets
Milestone: Acceleration. DORA metrics: MTTR, CFR.

Advanced Testing & Performance Validation
Milestone: Acceleration. DORA metrics: CFR.

Progressive Delivery & Advanced Deployment
Milestone: Acceleration. DORA metrics: MTTR.

Secure & Performant Build Pipelines
Milestone: Acceleration.
