SLO-Driven Observability & Error Budgets

Production SLOs with error budgets, advanced monitoring, distributed tracing, and proactive alerting with noise reduction.

Milestone: Acceleration

Level: Intermediate

DORA impact: MTTR, CFR

Job to be done: When deciding whether to release a new feature, I want a clear, measured budget for how much reliability we can spend, so I can make data-driven tradeoffs between delivery speed and production stability.

For engineers

You will define Service Level Indicators for critical user journeys, create burn-rate alerts that page only when error budgets deplete, and tie release gates to those budgets so engineering can decide risk tradeoffs explicitly.

What you’ll implement

These are the roadmap epic features, organized as a starter backlog.

Distributed Tracing

SLO Error Budget Management

ML-Based Anomaly Detection

Business KPI Monitoring

Advanced Log Analysis

Execution guide

Practical guidance aligned to the Execution Kit Definition of Done.

Outcome

Reliability targets become explicit and enforceable, and error budget policies let teams ship quickly without sacrificing production stability.

Before to After Transformation

BEFORE: No reliability agreement, reactive incident response

Unclear reliability expectations, alert fatigue from noisy monitors, incidents without priority framework

# Before: Reactive fire-fighting

Reliability posture:
- No agreed reliability targets
- "Five nines!" demanded but not measured
- 847 alerts per week (95% noise)
- Every outage is priority 1
- No data-driven release decisions

Typical incident:
1. Alert fires: "Server CPU high" (could be anything)
2. Engineer pages through dashboards looking for cause
3. No runbook, no clear mitigation
4. 4 hours to restore service
5. No learning, same incident repeats next month

Problems:
- Can't prioritize features vs. reliability work
- No objective measure of "good enough"
- Release fear (any deploy might break things)
- Alert fatigue (engineers ignore pages)

Metrics:
- MTTR: 4+ hours
- Change failure rate: Unknown
- Repeat incidents: 40%
- Alert noise: 95%

AFTER: SLO-driven reliability with error budget policy

Clear reliability targets, actionable alerts, data-driven release decisions based on error budget consumption

# After: SLO-driven engineering

Reliability posture:
- SLO: 99.9% availability (about 43 min of downtime per 30-day month)
- Error budget: Quantified and tracked
- 12 alerts per week (100% actionable)
- Incidents prioritized by SLO impact
- Release gates tied to budget

Typical SLO response:
1. Burn alert fires: "Consuming error budget at 14x rate"
2. Engineer checks SLO dashboard (clear signals)
3. Runbook linked from alert (known mitigation)
4. 12 minutes to restore (rehearsed response)
5. Postmortem tied to SLO impact, action items tracked

Benefits:
- Data-driven reliability vs. features tradeoff
- Release confidence (error budget allows risk)
- Focus on high-impact issues only
- Clear incident priorities

Metrics:
- MTTR: 12 minutes (automated runbooks)
- Change failure rate: 8% (within budget)
- Repeat incidents: 5%
- Alert noise: 0% (SLO-based alerts only)
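
The figures in the "After" picture follow from simple arithmetic, and a short sketch may make them easier to check. It assumes a 30-day window and an availability SLO expressed as a fraction; the function names are illustrative and do not come from any particular tool.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime in the window for an availability SLO such as 0.999."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return window * (1.0 - slo_target)


def time_to_exhaustion(window: timedelta, burn_rate: float) -> timedelta:
    """How long a full budget lasts when burned at `burn_rate` times the sustainable pace."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window / burn_rate


# Assumption: a 30-day SLO window.
window = timedelta(days=30)
budget = error_budget(0.999, window)
exhaustion = time_to_exhaustion(window, 14)

print(f"budget: {budget.total_seconds() / 60:.1f} min")
print(f"exhausted in: {exhaustion.total_seconds() / 3600:.1f} h")
```

Under these assumptions the budget comes to 43.2 minutes, and a sustained 14x burn would use up a full month's budget in roughly 51 hours, which is why a burn alert at that rate justifies a page.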

Symptoms

No agreed definition of reliability

Alert fatigue

Frequent incidents with unclear priorities

Prerequisites

Basic observability (logs/metrics)

Agreement on critical user journeys

Team commitment to continuous improvement

Implementation steps

Week 1

Define one SLI/SLO for the top journey
Create dashboards and burn alerts
Document error budget policy draft
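
As a starting point for the burn alerts in Week 1, the following sketch shows one common pattern: page only when both a long and a short lookback window exceed the burn-rate threshold, so the alert stops firing soon after the problem is mitigated. The window lengths, the `WindowCounts` type, and the default threshold of 14 (taken from the "14x" example above) are assumptions to tune, not prescribed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts observed in one lookback window (hypothetical shape)."""
    total: int
    bad: int


def burn_rate(counts: WindowCounts, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if counts.total == 0:
        return 0.0  # no traffic means no budget was burned
    allowed_ratio = 1.0 - slo_target
    return (counts.bad / counts.total) / allowed_ratio


def should_page(long_window: WindowCounts, short_window: WindowCounts,
                slo_target: float = 0.999, threshold: float = 14.0) -> bool:
    """Page only if BOTH windows burn at or above the threshold."""
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


# Assumed windows: last hour and last five minutes.
last_hour = WindowCounts(total=100_000, bad=2_000)
last_5_min = WindowCounts(total=10_000, bad=200)
print(should_page(last_hour, last_5_min))
```

In practice the counts would come from the metrics backend behind the SLO dashboard; the short window is what keeps a resolved incident from paging again.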

Week 2

Expand to 2–3 services
Add release gates aligned to budget
Train teams on the policy
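
A release gate aligned to budget can be as small as one function called from the deployment pipeline. The sketch below assumes a request-based SLI and borrows the 50% warning and 90% freeze thresholds from the SLO Error Budget Management capability listed under Related capabilities; the names `GateDecision`, `budget_consumed`, and `release_gate` are hypothetical.

```python
from enum import Enum


class GateDecision(Enum):
    PROCEED = "proceed"
    WARN = "warn"      # release allowed, but the team is alerted
    FREEZE = "freeze"  # only reliability fixes should ship


def budget_consumed(total_events: int, bad_events: int, slo_target: float) -> float:
    """Fraction of the window's error budget already spent (may exceed 1.0)."""
    if total_events == 0:
        return 0.0
    allowed_bad = total_events * (1.0 - slo_target)
    return bad_events / allowed_bad


def release_gate(consumed: float, warn_at: float = 0.5,
                 freeze_at: float = 0.9) -> GateDecision:
    """Map budget consumption to a release decision."""
    if consumed >= freeze_at:
        return GateDecision.FREEZE
    if consumed >= warn_at:
        return GateDecision.WARN
    return GateDecision.PROCEED


# Example: 1M requests this window, 600 failed, 99.9% SLO.
consumed = budget_consumed(1_000_000, 600, 0.999)
print(release_gate(consumed).value)
```

Keeping the thresholds as parameters lets the written error budget policy, rather than the code, remain the source of truth.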

Week 3

Introduce incident tie-in (postmortems map to SLO)
Review and tune thresholds
Publish org-wide SLO starter kit

Definition of Done

At least one SLO in production
Burn alerts actionable
Error budget policy used for release decisions
Practice integrated into team workflow

Metrics

Leading Indicators

% services with SLOs
Alert noise (alerts/week)
Error budget burn rate

Lagging Indicators

MTTR
Change failure rate

Failure modes

SLOs defined without ownership

Too many SLOs too early

Alerts without runbooks

Ownership

SRE/Operations

Define SLI sources
Maintain dashboards and alerting

Teams

Own reliability outcomes
Act on burn policies

Leadership

Support freeze windows and prioritization

What good looks like (by org scale)

Small Teams

One SLO for the top journey
Simple dashboards + burn alerts

Medium Orgs

Error budget policy influences releases

Enterprise

SLOs across portfolios with standardized policy

References

Google SRE Book

Resources

Templates and related materials for this kit.

Templates

Copy/paste artifacts that support this kit.

Capacity Planning Template

A template for forecasting and planning infrastructure capacity based on growth projections.

SLO / SLI Template

A practical template for defining SLOs, SLIs, and error budgets with rollout guidance.

Related capabilities

Capabilities tracked under this epic in the roadmap.

Distributed Tracing
>= 85% of services instrumented for distributed tracing (Jaeger, Tempo) with trace sampling >= 10% of requests.
SLO Error Budget Management
>= 80% of services track error budgets monthly, with alerts when 50% of the budget is consumed and deployment freezes at 90%.
ML-Based Anomaly Detection
>= 60% of critical metrics use ML anomaly detection (DeepAR, ARIMA) for dynamic thresholds instead of static alerts.
Business KPI Monitoring
>= 70% of services expose business KPIs (orders/min, revenue, conversions) in observability platform alongside technical metrics.
Advanced Log Analysis
>= 80% of log queries use structured log fields with indexed tags for <3 second query response on 30-day data.
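
For the trace sampling target in the Distributed Tracing capability, one way to reach a fixed ratio is to derive the decision from the trace ID itself, so every service that sees the same trace makes the same choice. This is a minimal sketch of that idea under assumed conventions (32-character hex trace IDs, sampling on the low 64 bits); it is not the API of Jaeger, Tempo, or any specific SDK.

```python
import secrets

SAMPLE_RATIO = 0.10  # assumption: the 10% floor named in the capability


def new_trace_id() -> str:
    """Random 128-bit trace ID as 32 hex characters (assumed format)."""
    return secrets.token_hex(16)


def is_sampled(trace_id: str, ratio: float = SAMPLE_RATIO) -> bool:
    """Keep the trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < int(ratio * 2**64)


# Rough check of the achieved ratio over randomly generated IDs.
ids = [new_trace_id() for _ in range(20_000)]
kept = sum(is_sampled(t) for t in ids)
print(f"sampled {kept / len(ids):.1%} of traces")
```

Because the decision depends only on the ID, a downstream service does not need to be told whether the caller sampled the request.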

Related kits

Other kits in the same milestone or with similar DORA impact.

Resilient Operations & Chaos Engineering

Acceleration

MTTR

CFR

Advanced Testing & Performance Validation

Acceleration

CFR

Progressive Delivery & Advanced Deployment

Acceleration

MTTR

Secure & Performant Build Pipelines

Acceleration
