Observability & Monitoring Foundations

Instrumentation for logs, metrics, and traces, plus golden signals dashboards, health checks, SLO drafts, and incident response runbooks.

Milestone: Foundation

foundational

DORA impact: MTTR, CFR

Job to be done: When a service starts degrading, I want a live dashboard showing latency, errors, and resource use so I can diagnose the problem and find the runbook before customer impact spreads.

For engineers

You will instrument golden signals (latency, traffic, errors, saturation) across services, create live dashboards linked from alerts, wire health checks into your deployment pipeline, and standardize actionable alerting backed by runbooks.
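As a rough illustration of what the four golden signals look like once computed, the sketch below summarizes one time window of request samples. The `Request` record, the nearest-rank percentile, and the use of CPU utilization as the saturation signal are assumptions made for this example, not part of the kit; in practice a metrics library and backend would do this aggregation.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code returned


def percentile(values, pct):
    """Nearest-rank percentile; returns 0.0 for an empty list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize one time window into the four golden signals."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile([r.duration_ms for r in requests], 95),
        "traffic_rps": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        # Assumption: CPU utilization (0.0 to 1.0) stands in for saturation.
        "saturation": cpu_utilization,
    }
```

Each key maps to one panel on a golden signals dashboard; which resource best represents saturation (CPU, memory, queue depth, connection pool) depends on the service.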

What you’ll implement

These are the roadmap epic features, organized as a starter backlog.

Centralized Logging

Application Metrics

Health Check Endpoints

Alerting Rules

Service Level Objectives

Observability Dashboards

Execution guide

Practical guidance aligned to the Execution Kit Definition of Done.

Outcome

Teams can detect, diagnose, and recover quickly using golden signals dashboards, health checks, and operational readiness.

Before to After Transformation

BEFORE: Customers discover incidents, blind to system health

No dashboards, noisy alerts, unknown service owners, incidents found via support tickets

# Before: Reactive detection via customers

Monitoring posture:
- No service dashboards
- 500+ alerts configured (95% noise)
- Alerts page engineers with no context
- "Check the logs" is the only runbook
- No health checks (deployments are YOLO)

Typical incident discovery:
1. Support ticket: "Checkout is broken"
2. Engineer checks... where? (no dashboard)
3. SSH to random server, tail logs
4. 90 minutes to find root cause
5. Service owner unknown (takes 30 min to find)

Problems:
- Incidents are discovered by customers, not by monitoring
- No visibility into system health
- Alert fatigue (engineers ignore pages)
- No operational readiness checks
- Every incident is detective work

Metrics:
- Time to detect: 45+ minutes (via customers)
- MTTR: 3+ hours
- Alert noise: 95%
- % services with dashboards: 10%

AFTER: Proactive detection with golden signals and health checks

Real-time dashboards, actionable alerts, health checks gate deployments, sub-5-minute detection

# After: Proactive golden signals monitoring

Monitoring posture:
- Every service has golden signals dashboard
- 15 alerts configured (100% actionable)
- Alerts include runbook links
- Health checks on all services
- Pre-deploy operational readiness checks

Typical incident detection:
1. Alert fires: "Error rate 5% (threshold: 1%)"
2. Engineer clicks dashboard link (golden signals)
3. Sees spike in 500 errors from payment API
4. Clicks runbook link: "Payment API 500s"
5. Executes runbook step 1: Check circuit breaker
6. 4 minutes to restore (automated remediation)

Benefits:
- Sub-5-minute incident detection
- Clear service ownership
- Actionable alerts with runbooks
- Deployment confidence (health checks)
- No detective work (dashboards show everything)

Metrics:
- Time to detect: <5 minutes
- MTTR: 12 minutes
- Alert noise: 0%
- % services with dashboards: 100%

Symptoms

Incidents discovered by customers

No dashboards or poor alerts

Unknown service owners

Prerequisites

Logging/metrics basics

Service ownership

Team commitment to continuous improvement

Implementation steps

Week 1

Define golden signals per service
Create dashboards and baseline alerts
Add basic health checks
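A basic health check can be as small as two HTTP endpoints. The sketch below uses only the Python standard library and assumes the `/health` (liveness) and `/ready` (readiness) paths; the `STATE` flag, the response bodies, and port 8080 are hypothetical choices for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag; a real service would check its dependencies here.
STATE = {"ready": False}


def check(path):
    """Return (status_code, body) for a probe path."""
    if path == "/health":
        # Liveness: the process is up and able to answer.
        return 200, {"status": "ok"}
    if path == "/ready":
        # Readiness: accept traffic only once dependencies are available.
        if STATE["ready"]:
            return 200, {"status": "ready"}
        return 503, {"status": "not ready"}
    return 404, {"error": "not found"}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = check(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking call; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Keeping liveness and readiness separate lets a deployment pipeline hold traffic back from an instance that is running but not yet able to serve.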

Week 2

Add runbook links from alerts
Instrument key journeys
Establish on-call readiness
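One way to keep runbook links from drifting is to audit alert definitions for missing context. The following sketch assumes a hypothetical `AlertRule` record with `owner`, `dashboard_url`, and `runbook_url` fields; real alerting tools store this metadata in their own formats (labels, annotations, tags).

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    service: str
    owner: str = ""
    dashboard_url: str = ""
    runbook_url: str = ""


def missing_context(rule):
    """List the context fields an on-call engineer needs but the rule lacks."""
    required = ("owner", "dashboard_url", "runbook_url")
    return [field for field in required if not getattr(rule, field).strip()]


def audit(rules):
    """Map each rule that lacks context to its missing fields."""
    return {r.name: gaps for r in rules if (gaps := missing_context(r))}
```

Running `audit` over all rules in CI, for example, would surface alerts that page someone without telling them where to look or what to do.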

Week 3

Tune alerts to reduce noise
Add tracing for critical paths
Adopt SLOs for top journeys
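When drafting an SLO for a journey, it helps to translate the target into an error budget. The sketch below assumes a request-based availability SLO (for example, a 99% target); the function name and return shape are illustrative only.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute how much of a request-based error budget has been used.

    slo_target is a fraction strictly between 0 and 1, e.g. 0.99 for 99%.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed = (1 - slo_target) * total_requests
    if allowed == 0:
        # No traffic in the window: nothing has been consumed.
        consumed = 0.0
    else:
        consumed = failed_requests / allowed
    return {
        "allowed_failures": allowed,
        "budget_consumed": consumed,
        "budget_remaining": max(0.0, 1 - consumed),
    }
```

With a 99% target and 1,000,000 requests in the window, roughly 10,000 failures are allowed, so 2,500 failed requests would consume about a quarter of the budget.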

Definition of Done

Golden signals dashboard exists
Alerts are actionable
Runbooks linked
Health check standard implemented
Practice integrated into team workflow

Metrics

Leading Indicators

% services with dashboards
Alert noise
Time to detect

Lagging Indicators

MTTR
Change failure rate
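These indicators are straightforward to compute once incidents and alerts are recorded consistently. The sketch below assumes a hypothetical `Incident` record with three timestamps, measures MTTR from the start of impact to restoration, and treats alert noise as the share of alerts that were not actionable; teams that define these terms differently should adjust accordingly.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the degradation began
    detected: datetime  # when an alert or a customer report surfaced it
    resolved: datetime  # when service was restored


def time_to_detect_minutes(incidents):
    """Mean minutes between start of impact and detection."""
    return mean((i.detected - i.started).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents):
    """Mean minutes to restore, measured from start of impact (an assumption)."""
    return mean((i.resolved - i.started).total_seconds() for i in incidents) / 60


def alert_noise(total_alerts, actionable_alerts):
    """Fraction of alerts that were not actionable."""
    if total_alerts == 0:
        return 0.0
    return (total_alerts - actionable_alerts) / total_alerts


def dashboard_coverage(services_with_dashboards, total_services):
    """Fraction of services that have a dashboard."""
    if total_services == 0:
        return 0.0
    return services_with_dashboards / total_services
```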

Failure modes

Too many alerts

Dashboards without ownership

Missing runbooks

Ownership

SRE/Platform

Provide observability tooling
Standardize dashboards

Teams

Instrument services
Own runbooks

What good looks like (by org scale)

Small Teams

Dashboards for top services
Basic alerts

Medium Orgs

Runbooks linked
Tracing for critical paths

Enterprise

Standard observability platform
SLO-driven alerting

References

Four Golden Signals

Resources

Templates and related materials for this kit.

Templates

Copy/paste artifacts that support this kit.

Capacity Planning Template

A template for forecasting and planning infrastructure capacity based on growth projections.

Game Day Plan

A template for running a game day: objectives, scenarios, comms, and learning outcomes.

Incident Review (Blameless Postmortem)

A blameless incident review template that produces actionable follow-ups and learning.

Incident Runbook

A standard incident template: triage, comms, mitigation, and post-incident actions.

On-Call Rotation Setup

A template for establishing and documenting on-call rotations with escalation paths.

Related capabilities

Capabilities tracked under this epic in the roadmap.

Centralized Logging
>= 90% of services send structured logs to a centralized platform (ELK, Loki, CloudWatch) with retention >= 30 days.
Application Metrics
>= 80% of services expose RED metrics (Rate, Errors, Duration) in Prometheus/StatsD format.
Health Check Endpoints
100% of services expose /health and /ready endpoints for liveness and readiness probes.
Alerting Rules
>= 80% of services have alerting for high error rate (>= 5% 5xx), high latency (p95 >= 1s), and down status.
Service Level Objectives
>= 60% of user-facing services have defined SLOs with >= 99% availability target and <= 500ms latency target.
Observability Dashboards
>= 80% of services have Grafana/Datadog dashboards showing RED metrics, resource usage, and business KPIs.
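To make the alerting thresholds above concrete, the following sketch evaluates one service snapshot against them: 5xx error rate at or above 5%, p95 latency at or above 1 second, and down status. The function name, its inputs, and the message format are assumptions for illustration; in practice these rules would live in the alerting tool's own rule language.

```python
# Thresholds taken from the Alerting Rules capability above.
ERROR_RATE_THRESHOLD = 0.05     # >= 5% of responses are 5xx
P95_LATENCY_THRESHOLD_S = 1.0   # p95 latency >= 1 second


def evaluate(service, error_rate, p95_latency_s, is_up):
    """Return the list of alert messages that should fire for this snapshot."""
    alerts = []
    if not is_up:
        alerts.append(f"{service}: down")
    if error_rate >= ERROR_RATE_THRESHOLD:
        alerts.append(
            f"{service}: error rate {error_rate:.1%} "
            f"(threshold: {ERROR_RATE_THRESHOLD:.0%})"
        )
    if p95_latency_s >= P95_LATENCY_THRESHOLD_S:
        alerts.append(
            f"{service}: p95 latency {p95_latency_s:.2f}s "
            f"(threshold: {P95_LATENCY_THRESHOLD_S:.0f}s)"
        )
    return alerts
```

Including the threshold in the message gives the on-call engineer the same context as the "Error rate 5% (threshold: 1%)" style of alert.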

Related kits

Other kits in the same milestone or with similar DORA impact.

Code Quality & Review Standards

Foundation

CFR

Deployment Automation Foundations

Foundation

MTTR

Infrastructure & Operations Baseline

Foundation

MTTR

Testing Strategy & Quality Gates

Foundation

CFR
