SRE / Operations

Improve reliability while enabling change: SLOs, incident maturity, and toil reduction.

For engineers

You'll define SLOs and error budgets to ship faster within reliability bounds, measure MTTR and toil percentage, and automate the manual parts: releases, handoffs, instrumentation defaults. The page covers SLO math, incident runbooks and blameless reviews, and foundations like distributed tracing and structured logging.

Reliability as a Feature

SRE balances reliability with velocity using error budgets. If you're within budget, ship faster. If you're burning budget, slow down and fix. Elite teams achieve 99.9%+ availability while deploying multiple times per day.

DORA 2025: AI Changes Incident Response

Key insight from the 2025 State of DevOps report

AI is transforming how SRE teams operate. AI-assisted incident analysis can reduce MTTR by correlating signals across logs, metrics, and traces in seconds. But AI is an amplifier; without solid observability foundations, AI tools have nothing meaningful to analyze.

faster incident resolution when AI assists triage in teams with mature observability

Thrashing teams see little AI benefit; poor data quality in = poor insights out

Invest in SLOs, structured logging, and distributed tracing first; they're prerequisites for AI-assisted operations.
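As one illustration of the structured logging prerequisite, here is a minimal sketch that uses only Python's standard library. The names JsonFormatter and build_logger, and the convention of passing context under a "fields" key, are assumptions made for this example, not part of any particular logging product.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass context via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger
```

A call such as build_logger("checkout").warning("payment failed", extra={"fields": {"trace_id": "abc123"}}) would emit one JSON line carrying the trace ID, which is the kind of field that lets tooling join logs to traces.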

SLOs & Error Budgets

Reliability targets with tradeoffs

Define SLIs that match user experience.

Use error budgets to balance stability vs velocity.

Incident Maturity

Faster learning, less chaos

Runbooks, comms, and blameless reviews.

Measure MTTR and reduce repeat incidents.

Toil Reduction

Automate the boring parts

Instrumentation defaults and self-service workflows.

Kill manual releases and fragile handoffs.

SLO Quick Reference

Error Budget Math

SLO: 99.9% availability

Error Budget: 0.1% = 43.8 min/month (average month of about 30.44 days; 43.2 min over a 30-day window)

At 99.5% SLO: Error Budget 0.5% = 219 min/month

Higher SLO = less room for error = slower iteration
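The arithmetic above can be reproduced with a short sketch. It assumes an average month of 365.25 / 12 days (43,830 minutes), which is what yields the 43.8 and 219 minute figures; the function name is illustrative.

```python
# Average month length in minutes: 365.25 days / 12 months.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12


def error_budget_minutes(slo_percent: float,
                         window_minutes: float = MINUTES_PER_MONTH) -> float:
    """Allowed minutes of unavailability for an availability SLO over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return (1 - slo_percent / 100) * window_minutes


print(f"{error_budget_minutes(99.9):.1f}")  # 43.8
print(f"{error_budget_minutes(99.5):.0f}")  # 219
```

Passing window_minutes=30 * 24 * 60 gives the budget for a rolling 30-day window instead, which is about 43.2 minutes at 99.9%.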

Common SLIs

Availability: successful requests / total

Latency: p99 < threshold

Throughput: requests/sec sustained

Correctness: valid responses / total
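A rough sketch of how the availability and latency SLIs might be computed from raw request records. The Request type, the choice to count any status below 500 as successful, and the nearest-rank percentile method are all assumptions for illustration; real systems usually derive these from metrics backends rather than in-memory lists.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to serve the request


def availability(requests: list[Request]) -> float:
    """Successful requests / total, treating status < 500 as success (assumption)."""
    if not requests:
        raise ValueError("no requests in window")
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values in window")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_sli_met(requests: list[Request], threshold_ms: float, pct: float = 99) -> bool:
    """True when the chosen latency percentile is below the threshold."""
    return percentile([r.latency_ms for r in requests], pct) < threshold_ms
```

Whether 4xx responses count as failures is a product decision; the point is that the SLI definition should match what users experience.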

Common Anti-Patterns

Avoid

SLO of 100% (impossible, blocks releases)

Alerting on every metric spike

Blame-focused incident reviews

Instead

Set SLO based on user tolerance (99.9% is often fine)

Alert on error budget burn rate

Blameless postmortems focused on systems
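To make "alert on error budget burn rate" concrete, here is a minimal sketch. Burn rate is the observed error ratio divided by the budgeted error ratio (1 - SLO), so a burn rate of 1 spends the budget exactly over the SLO window. The two-window check and the default threshold of 14.4 follow the commonly cited multi-window pattern from the Google SRE Workbook; treat both as starting assumptions to tune, and the function names as hypothetical.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    budget = 1 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    The long window (for example 1 hour) shows the burn is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20, so should_page(0.02, 0.03, 0.999) would fire, while a spike that has already subsided in the short window would not.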

SRE Metrics That Matter

MTTR

Mean time to restore

Target: <1 hour

Toil %

Manual operational work

Target: <50%

SLO Attainment

% of time meeting SLO

Target: >95%

Error Budget

Remaining budget %

Target: >25%
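Two of these metrics can be sketched in a few lines. The example assumes each incident is recorded as a (detected, restored) pair of datetimes and that "bad minutes" against the SLO are already known; both function names are illustrative.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - detected) across incidents."""
    if not incidents:
        raise ValueError("no incidents in period")
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


def budget_remaining_percent(slo_percent: float,
                             bad_minutes: float,
                             window_minutes: float) -> float:
    """Share of the error budget still unspent; negative when overspent."""
    budget = (1 - slo_percent / 100) * window_minutes
    if budget <= 0:
        raise ValueError("SLO leaves no error budget")
    return 100 * (1 - bad_minutes / budget)
```

For example, two incidents lasting 30 and 90 minutes give an MTTR of one hour, and 10.8 bad minutes against a 99.9% SLO over a 30-day window (43,200 minutes) leaves about 75% of the budget.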

Relevant Resources

SLOs & Error Budgets

Define and track reliability targets

Incident Management

Response and postmortem patterns

Operability & Monitoring

Metrics, logs, traces setup

Ready to Improve Reliability?

Assess your current SRE maturity and identify high-impact improvements.