Observability & Monitoring Foundations
Instrumentation for logs, metrics, and traces, plus golden signals dashboards, health checks, SLO drafts, and incident response runbooks.
Job to be done: When a service starts degrading, I want a live dashboard showing latency, errors, and resource use so I can diagnose the problem and find the runbook before customer impact spreads.
You will instrument golden signals (latency, traffic, errors, saturation) across services, create live dashboards linked from alerts, wire health checks into your deployment pipeline, and standardize actionable alerting backed by runbooks.
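As a minimal sketch of what the four golden signals look like when computed from raw request data, the Python below summarizes one time window. The `Request` record, the `golden_signals` function, the choice of CPU as the saturation resource, and the sample numbers are all illustrative assumptions, not part of this kit; a real service would normally emit these through its metrics library.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code returned


def golden_signals(requests, window_s, cpu_used, cpu_limit):
    """Summarize one time window of requests into the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95 using integer math: ceil(0.95 * n).
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],       # latency
        "traffic_rps": len(requests) / window_s,     # traffic
        "error_rate": errors / len(requests),        # errors (5xx share)
        "saturation": cpu_used / cpu_limit,          # saturation (assumed: CPU)
    }


# Hypothetical 10-second window: 18 fast successes and 2 slow server errors.
window = [Request(120, 200)] * 18 + [Request(900, 500), Request(1500, 503)]
signals = golden_signals(window, window_s=10, cpu_used=1.5, cpu_limit=2.0)
print(signals)
```

Each key maps to one panel on a golden signals dashboard, which keeps the dashboard layout identical from service to service.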
What you’ll implement
These are the roadmap epic features, organized as a starter backlog.
Execution guide
Practical guidance aligned to the Execution Kit Definition of Done.
Outcome
Teams can detect, diagnose, and recover from incidents quickly using golden signals dashboards, health checks, and operational readiness checks.
Before to After Transformation
No dashboards, noisy alerts, unknown service owners, incidents found via support tickets
# Before: Reactive detection via customers
Monitoring posture:
- No service dashboards
- 500+ alerts configured (95% noise)
- Alerts page engineers with no context
- "Check the logs" is the only runbook
- No health checks (deployments are YOLO)
Typical incident discovery:
1. Support ticket: "Checkout is broken"
2. Engineer checks... where? (no dashboard)
3. SSH to random server, tail logs
4. 90 minutes to find root cause
5. Service owner unknown (takes 30 min to find)
Problems:
- Incidents are discovered by customers, not by monitoring
- No visibility into system health
- Alert fatigue (engineers ignore pages)
- No operational readiness checks
- Every incident is detective work
Metrics:
- Time to detect: 45+ minutes (via customers)
- MTTR: 3+ hours
- Alert noise: 95%
- % services with dashboards: 10%
Real-time dashboards, actionable alerts, health checks gate deployments, sub-5-minute detection
# After: Proactive golden signals monitoring
Monitoring posture:
- Every service has golden signals dashboard
- 15 alerts configured (100% actionable)
- Alerts include runbook links
- Health checks on all services
- Pre-deploy operational readiness checks
Typical incident detection:
1. Alert fires: "Error rate 5% (threshold: 1%)"
2. Engineer clicks dashboard link (golden signals)
3. Sees spike in 500 errors from payment API
4. Clicks runbook link: "Payment API 500s"
5. Executes runbook step 1: Check circuit breaker
6. 4 minutes to restore (automated remediation)
Benefits:
- Sub-5-minute incident detection
- Clear service ownership
- Actionable alerts with runbooks
- Deployment confidence (health checks)
- No detective work (dashboards show everything)
Metrics:
- Time to detect: <5 minutes
- MTTR: 12 minutes
- Alert noise: 0%
- % services with dashboards: 100%
Symptoms
Prerequisites
Implementation steps
- Define golden signals per service
- Create dashboards and baseline alerts
- Add basic health checks
- Add runbook links from alerts
- Instrument key journeys
- Establish on-call readiness
- Tune alerts to reduce noise
- Add tracing for critical paths
- Adopt SLOs for top journeys
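For the final step, adopting SLOs, a first draft usually comes down to error budget arithmetic. The sketch below assumes a request-based availability SLO; the function name `error_budget_report` and the traffic numbers are hypothetical, and the 99% target is borrowed from the Service Level Objectives capability listed later on this page.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Report how much of a journey's error budget was used in one window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The error budget is the share of requests allowed to fail.
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "availability": (total_requests - failed_requests) / total_requests,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": failed_requests <= allowed_failures,
    }


# Hypothetical checkout journey: 200,000 requests in the window, 500 failed.
report = error_budget_report(0.99, total_requests=200_000, failed_requests=500)
print(f"availability:    {report['availability']:.4f}")
print(f"budget consumed: {report['budget_consumed']:.0%}")
print(f"SLO met:         {report['slo_met']}")
```

Tracking budget consumed rather than raw error counts gives a team one number to decide whether to keep shipping or to pause and stabilize.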
Definition of Done
- Golden signals dashboard exists
- Alerts are actionable
- Runbooks linked
- Health check standard implemented
- Practice integrated into team workflow
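One way to make "alerts are actionable" and "runbooks linked" checkable is to lint alert definitions for the context an on-call engineer needs. This is a sketch under assumptions: the `AlertRule` shape, the required fields, and the example URLs are hypothetical and do not correspond to any specific alerting tool's schema.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    owner: str = ""          # team that gets paged
    dashboard_url: str = ""  # golden signals dashboard for the service
    runbook_url: str = ""    # first-response steps for this alert


# Assumed minimum context for an alert to count as actionable.
REQUIRED_FIELDS = ("owner", "dashboard_url", "runbook_url")


def missing_context(rule):
    """Return the required fields that are empty on an alert rule."""
    return [f for f in REQUIRED_FIELDS if not getattr(rule, f).strip()]


rules = [
    AlertRule(
        name="payment-api-5xx",
        owner="payments-team",
        dashboard_url="https://dashboards.example.com/payment-api",
        runbook_url="https://runbooks.example.com/payment-api-500s",
    ),
    AlertRule(name="checkout-latency", owner="checkout-team"),
]

for rule in rules:
    gaps = missing_context(rule)
    print(rule.name, "OK" if not gaps else "missing: " + ", ".join(gaps))
```

Run as a CI step over the alert definitions, a check like this stops new alerts from shipping without an owner, dashboard, and runbook.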
Metrics
- % services with dashboards
- Alert noise
- Time to detect
- MTTR
- Change failure rate
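The metrics above can be derived from incident, alert, and deployment records. The sketch below is illustrative only: the record fields and sample values are made up, MTTR is assumed to run from the start of impact to resolution, and alert noise is assumed to mean the share of alerts that needed no action.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    """Elapsed minutes between two datetimes."""
    return (end - start).total_seconds() / 60


def ratio(part, whole):
    """Safe fraction that returns 0.0 when there is no data."""
    return part / whole if whole else 0.0


# Hypothetical incident records.
incidents = [
    {"started": datetime(2024, 5, 1, 10, 0),
     "detected": datetime(2024, 5, 1, 10, 4),
     "resolved": datetime(2024, 5, 1, 10, 12)},
    {"started": datetime(2024, 5, 7, 14, 0),
     "detected": datetime(2024, 5, 7, 14, 2),
     "resolved": datetime(2024, 5, 7, 14, 20)},
]

time_to_detect = mean([minutes_between(i["started"], i["detected"]) for i in incidents])
mttr = mean([minutes_between(i["started"], i["resolved"]) for i in incidents])

# Hypothetical counts for the same period.
alert_noise = ratio(40 - 34, 40)        # 40 alerts fired, 34 led to action
change_failure_rate = ratio(4, 50)      # 50 deployments, 4 caused an incident
dashboard_coverage = ratio(18, 20)      # 18 of 20 services have a dashboard

print(f"time to detect: {time_to_detect:.1f} min")
print(f"MTTR: {mttr:.1f} min")
print(f"alert noise: {alert_noise:.0%}")
print(f"change failure rate: {change_failure_rate:.0%}")
print(f"% services with dashboards: {dashboard_coverage:.0%}")
```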
Failure modes
Ownership
- Provide observability tooling
- Standardize dashboards
- Instrument services
- Own runbooks
What good looks like (by org scale)
- Dashboards for top services
- Basic alerts
- Runbooks linked
- Tracing for critical paths
- Standard observability platform
- SLO-driven alerting
References
Resources
Templates and related materials for this kit.
Related capabilities
Capabilities tracked under this epic in the roadmap.
- Centralized Logging: >= 90% of services send structured logs to centralized platform (ELK, Loki, CloudWatch) with retention >= 30 days.
- Application Metrics: >= 80% of services expose RED metrics (Rate, Errors, Duration) in Prometheus/StatsD format.
- Health Check Endpoints: 100% of services expose /health and /ready endpoints for liveness and readiness probes.
- Alerting Rules: >= 80% of services have alerting for high error rate (>= 5% 5xx), high latency (p95 >= 1s), and down status.
- Service Level Objectives: >= 60% of user-facing services have defined SLOs with >= 99% availability target and <= 500ms latency target.
- Observability Dashboards: >= 80% of services have Grafana/Datadog dashboards showing RED metrics, resource usage, and business KPIs.
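To illustrate the Health Check Endpoints capability, here is a minimal sketch of /health and /ready handlers using only the Python standard library. The `dependencies_ready` check, the response bodies, and the port are assumptions for illustration; most services would add these routes to their existing web framework instead.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def dependencies_ready():
    """Hypothetical readiness check; replace with real dependency checks
    (database, cache, downstream APIs)."""
    return True


def probe_response(path, ready):
    """Map a probe path to an (HTTP status, JSON body) pair."""
    path = path.split("?", 1)[0]  # ignore any query string
    if path == "/health":
        # Liveness: the process is up and able to answer requests.
        return 200, {"status": "ok"}
    if path == "/ready":
        # Readiness: accept traffic only once dependencies are reachable.
        if ready:
            return 200, {"status": "ready"}
        return 503, {"status": "not ready"}
    return 404, {"error": "not found"}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, dependencies_ready())
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Serve the probe endpoints until interrupted."""
    server = ThreadingHTTPServer(("0.0.0.0", port), ProbeHandler)
    server.serve_forever()
```

Calling `serve()` starts the endpoints. Keeping /health free of dependency checks is a deliberate choice in this sketch: a failing database then takes the instance out of rotation through /ready without triggering restarts through the liveness probe.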
Related kits
Other kits in the same milestone or with similar DORA impact.