    DevOps
    Way of Working

    Observability & Monitoring Foundations

    Instrumentation for logs, metrics, and traces, plus golden signals dashboards, health checks, SLO drafts, and incident response runbooks.

    Milestone: Foundation
    DORA impact: MTTR, CFR

    Job to be done: When a service starts degrading, I want a live dashboard showing latency, errors, and resource use so I can diagnose the problem and find the runbook before customer impact spreads.

    For engineers

    You will instrument golden signals (latency, traffic, errors, saturation) across services, create live dashboards linked from alerts, wire health checks into your deployment pipeline, and standardize actionable alerting backed by runbooks.
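
As a rough illustration of the four golden signals named above, the sketch below summarizes them for one time window from a list of request samples. The `Request` record, the `golden_signals` function, and the use of CPU utilization as the saturation signal are assumptions made for this example, not part of the kit.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    status: int


def golden_signals(requests, window_s, cpu_utilization):
    """Summarize latency, traffic, errors, and saturation for one window.

    cpu_utilization (0.0 to 1.0) stands in for saturation here; a real
    service might use queue depth, memory, or connection pool usage.
    """
    if not requests:
        return {"latency_p95_ms": None, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_utilization}
    durations = [r.duration_ms for r in requests]
    # quantiles() needs at least two points; fall back to the single value.
    p95 = quantiles(durations, n=100)[94] if len(durations) > 1 else durations[0]
    # Only 5xx responses count as errors in this sketch.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": cpu_utilization,
    }
```

In practice these values would come from a metrics backend rather than an in-memory list; the point is only that each signal reduces to a small, well-defined calculation per service.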

    What you’ll implement

    These are the roadmap epic features, organized as a starter backlog.

    1. Centralized Logging
    2. Application Metrics
    3. Health Check Endpoints
    4. Alerting Rules
    5. Service Level Objectives
    6. Observability Dashboards

    Execution guide

    Practical guidance aligned to the Execution Kit Definition of Done.

    Outcome

    Teams can detect, diagnose, and recover quickly using golden signals dashboards, health checks, and operational readiness.

    Before to After Transformation

    BEFORE: Customers discover incidents, blind to system health

    No dashboards, noisy alerts, unknown service owners, incidents found via support tickets

    # Before: Reactive detection via customers
    
    Monitoring posture:
    - No service dashboards
    - 500+ alerts configured (95% noise)
    - Alerts page engineers with no context
    - "Check the logs" is the only runbook
    - No health checks (deployments are YOLO)
    
    Typical incident discovery:
    1. Support ticket: "Checkout is broken"
    2. Engineer checks... where? (no dashboard)
    3. SSH to random server, tail logs
    4. 90 minutes to find root cause
    5. Service owner unknown (takes 30 min to find)
    
    Problems:
    - Incidents discovered by customers, not by monitoring
    - No visibility into system health
    - Alert fatigue (engineers ignore pages)
    - No operational readiness checks
    - Every incident is detective work
    
    Metrics:
    - Time to detect: 45+ minutes (via customers)
    - MTTR: 3+ hours
    - Alert noise: 95%
    - % services with dashboards: 10%
    AFTER: Proactive detection with golden signals and health checks

    Real-time dashboards, actionable alerts, health checks gate deployments, sub-5-minute detection

    # After: Proactive golden signals monitoring
    
    Monitoring posture:
    - Every service has a golden signals dashboard
    - 15 alerts configured (100% actionable)
    - Alerts include runbook links
    - Health checks on all services
    - Pre-deploy operational readiness checks
    
    Typical incident detection:
    1. Alert fires: "Error rate 5% (threshold: 1%)"
    2. Engineer clicks dashboard link (golden signals)
    3. Sees spike in 500 errors from payment API
    4. Clicks runbook link: "Payment API 500s"
    5. Executes runbook step 1: Check circuit breaker
    6. 4 minutes to restore (automated remediation)
    
    Benefits:
    - Sub-5-minute incident detection
    - Clear service ownership
    - Actionable alerts with runbooks
    - Deployment confidence (health checks)
    - No detective work (dashboards show everything)
    
    Metrics:
    - Time to detect: <5 minutes
    - MTTR: 12 minutes
    - Alert noise: 0%
    - % services with dashboards: 100%
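
A minimal sketch of what "actionable" can mean in code, assuming an alert is only raised above a threshold and always carries a dashboard link and a runbook link. The `Alert` record, both functions, and the `example.com` URLs are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    value: float        # observed error rate as a fraction (0.05 == 5%)
    threshold: float    # alerting threshold as a fraction
    dashboard_url: str
    runbook_url: str


def evaluate_error_rate(service, errors, total, threshold, dashboard_url, runbook_url):
    """Return an Alert when the error rate exceeds the threshold, else None."""
    if total == 0:
        return None  # no traffic in the window, nothing to evaluate
    rate = errors / total
    if rate <= threshold:
        return None
    return Alert(service, rate, threshold, dashboard_url, runbook_url)


def format_alert(alert):
    """Render the page text so the engineer gets context and next steps."""
    return (
        f"[{alert.service}] Error rate {alert.value:.0%} "
        f"(threshold: {alert.threshold:.0%})\n"
        f"Dashboard: {alert.dashboard_url}\n"
        f"Runbook: {alert.runbook_url}"
    )


if __name__ == "__main__":
    alert = evaluate_error_rate(
        "payment-api", errors=50, total=1000, threshold=0.01,
        dashboard_url="https://example.com/dashboards/payment-api",   # placeholder
        runbook_url="https://example.com/runbooks/payment-api-500s",  # placeholder
    )
    if alert:
        print(format_alert(alert))
```

Requiring both URLs in the constructor is a deliberate choice in this sketch: an alert without a dashboard or runbook cannot be created at all.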

    Symptoms

    • Incidents discovered by customers
    • No dashboards or poor alerts
    • Unknown service owners

    Prerequisites

    • Logging/metrics basics
    • Service ownership
    • Team commitment to continuous improvement
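
For the logging and ownership prerequisites, one possible baseline is structured JSON logs that carry the service name and owning team on every line. This is a sketch using Python's standard `logging` module; the `service` and `owner` field names and the `team-payments` value are assumptions, not a prescribed schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical fields, supplied through the `extra` argument.
            "service": getattr(record, "service", None),
            "owner": getattr(record, "owner", None),
        }
        return json.dumps(payload)


def build_logger(name):
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("payment declined",
             extra={"service": "checkout", "owner": "team-payments"})
```

Carrying the owner in the log line means the question "who owns this service?" is answered by the first log entry an engineer reads.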

    Implementation steps

    Week 1
    • Define golden signals per service
    • Create dashboards and baseline alerts
    • Add basic health checks
    Week 2
    • Add runbook links from alerts
    • Instrument key journeys
    • Establish on-call readiness
    Week 3
    • Tune alerts to reduce noise
    • Add tracing for critical paths
    • Adopt SLOs for top journeys
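
When drafting SLOs in Week 3, it can help to translate an availability target into an error budget. The sketch below assumes a request-based SLO; the function names and the 30-day window are illustrative choices, and the 99% figure in the usage note is taken from the availability target listed under Related capabilities.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compare observed failures with the failures an SLO target allows.

    slo_target is a fraction, e.g. 0.99 for a 99% availability SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.99")
    if total_requests == 0:
        # No traffic: nothing consumed, the whole budget remains.
        return {"allowed_failures": 0.0, "budget_consumed": 0.0,
                "budget_remaining": 1.0, "slo_met": True}
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed
    return {
        "allowed_failures": allowed,
        "budget_consumed": consumed,          # 1.0 means the budget is spent
        "budget_remaining": 1.0 - consumed,   # negative once the SLO is missed
        "slo_met": failed_requests <= allowed,
    }


def allowed_downtime_minutes(slo_target, window_days=30):
    """Time-based view of the same budget over a rolling window."""
    return (1.0 - slo_target) * window_days * 24 * 60
```

For example, `error_budget(0.99, 1_000_000, 2_500)` reports that about a quarter of the budget is consumed, and `allowed_downtime_minutes(0.99)` works out to roughly 432 minutes over 30 days.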

    Definition of Done

    • Golden signals dashboard exists
    • Alerts are actionable
    • Runbooks linked from alerts
    • Health check standard implemented
    • Practice integrated into team workflow
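
One way to read "health check standard" is a pair of probe endpoints with fixed semantics. The sketch below uses only Python's standard library and the `/health` and `/ready` paths listed under Related capabilities; the `dependencies_ready` check, the response bodies, and port 8080 are assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def dependencies_ready():
    """Hypothetical readiness check; replace with real dependency probes."""
    return True


def health_response(path):
    """Map a request path to (status_code, body) for the probe endpoints."""
    if path == "/health":
        # Liveness: the process is up and able to answer.
        return 200, {"status": "ok"}
    if path == "/ready":
        # Readiness: dependencies are reachable, so traffic can be routed here.
        if dependencies_ready():
            return 200, {"status": "ready"}
        return 503, {"status": "not ready"}
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the probe server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the server. A deployment pipeline can then gate a rollout on `/ready` returning 200, while an orchestrator restarts the process when `/health` stops responding.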

    Metrics

    Leading Indicators
    • % services with dashboards
    • Alert noise
    • Time to detect
    Lagging Indicators
    • MTTR
    • Change failure rate
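
The indicators above are simple ratios and averages once incident and alert records exist. The following sketch shows one way to compute them; the `Incident` record shape is hypothetical, and MTTR is measured here from the start of impact to restoration, which is an assumption (some teams measure from detection).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    started: datetime    # when customer impact began
    detected: datetime   # when an alert fired or a human noticed
    resolved: datetime   # when service was restored


def _mean_minutes(deltas):
    deltas = list(deltas)
    if not deltas:
        return 0.0
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60.0


def time_to_detect(incidents):
    """Mean minutes from start of impact to detection."""
    return _mean_minutes(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean minutes from start of impact to restoration."""
    return _mean_minutes(i.resolved - i.started for i in incidents)


def alert_noise(total_alerts, actionable_alerts):
    """Fraction of alerts that required no action."""
    if total_alerts == 0:
        return 0.0
    return 1.0 - actionable_alerts / total_alerts


def dashboard_coverage(services_with_dashboards, total_services):
    """Fraction of services that have a dashboard."""
    return services_with_dashboards / total_services if total_services else 0.0


def change_failure_rate(failed_deployments, total_deployments):
    """Fraction of deployments that caused a failure in production."""
    return failed_deployments / total_deployments if total_deployments else 0.0
```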

    Failure modes

    • Too many alerts
    • Dashboards without ownership
    • Missing runbooks

    Ownership

    SRE/Platform
    • Provide observability tooling
    • Standardize dashboards
    Teams
    • Instrument services
    • Own runbooks

    What good looks like (by org scale)

    Small Teams
    • Dashboards for top services
    • Basic alerts
    Medium Orgs
    • Runbooks linked
    • Tracing for critical paths
    Enterprise
    • Standard observability platform
    • SLO-driven alerting

    References

    Four Golden Signals

    Resources

    Templates and related materials for this kit.

    Templates
    Copy/paste artifacts that support this kit.
    Capacity Planning Template
    A template for forecasting and planning infrastructure capacity based on growth projections.
    Game Day Plan
    A template for running a game day: objectives, scenarios, comms, and learning outcomes.
    Incident Review (Blameless Postmortem)
    A blameless incident review template that produces actionable follow-ups and learning.
    Incident Runbook
    A standard incident template: triage, comms, mitigation, and post-incident actions.
    On-Call Rotation Setup
    A template for establishing and documenting on-call rotations with escalation paths.

    Related capabilities

    Capabilities tracked under this epic in the roadmap.

    • Centralized Logging
      >= 90% of services send structured logs to centralized platform (ELK, Loki, CloudWatch) with retention >= 30 days.
    • Application Metrics
      >= 80% of services expose RED metrics (Rate, Errors, Duration) in Prometheus/StatsD format.
    • Health Check Endpoints
      100% of services expose /health and /ready endpoints for liveness and readiness probes.
    • Alerting Rules
      >= 80% of services have alerting for high error rate (>= 5% 5xx), high latency (p95 >= 1s), and down status.
    • Service Level Objectives
      >= 60% of user-facing services have defined SLOs with >= 99% availability target and <= 500ms latency target.
    • Observability Dashboards
      >= 80% of services have Grafana/Datadog dashboards showing RED metrics, resource usage, and business KPIs.
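
To make the Application Metrics capability concrete, here is a standard-library sketch that records RED data and renders it in the Prometheus text exposition format. The metric names, the `route` and `status` labels, and the `RedMetrics` class are assumptions for this example; a real service would more likely use an official Prometheus client library than hand-rendered text.

```python
from collections import defaultdict


class RedMetrics:
    """In-memory Rate, Errors, Duration data for one service."""

    def __init__(self):
        self.requests = defaultdict(int)        # keyed by (route, status)
        self.duration_sum = defaultdict(float)  # seconds, keyed by route
        self.duration_count = defaultdict(int)  # observations, keyed by route

    def observe(self, route, status, duration_s):
        """Record one finished request."""
        self.requests[(route, str(status))] += 1
        self.duration_sum[route] += duration_s
        self.duration_count[route] += 1

    def render(self):
        """Return the metrics as Prometheus text exposition format.

        Label values are not escaped here, so routes must not contain
        quotes, backslashes, or newlines.
        """
        lines = [
            "# HELP http_requests_total Total HTTP requests by route and status.",
            "# TYPE http_requests_total counter",
        ]
        for (route, status), count in sorted(self.requests.items()):
            lines.append(
                f'http_requests_total{{route="{route}",status="{status}"}} {count}'
            )
        lines += [
            "# HELP http_request_duration_seconds Request duration in seconds.",
            "# TYPE http_request_duration_seconds summary",
        ]
        for route in sorted(self.duration_sum):
            lines.append(
                f'http_request_duration_seconds_sum{{route="{route}"}} '
                f"{self.duration_sum[route]}"
            )
            lines.append(
                f'http_request_duration_seconds_count{{route="{route}"}} '
                f"{self.duration_count[route]}"
            )
        return "\n".join(lines) + "\n"
```

Rate and error rate are not stored directly in this sketch: both are derived at query time from the `http_requests_total` counter, filtered by the `status` label for errors.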

    Related kits

    Other kits in the same milestone or with similar DORA impact.

    • Code Quality & Review Standards (Foundation; LT, CFR)
    • Deployment Automation Foundations (Foundation; DF, MTTR)
    • Infrastructure & Operations Baseline (Foundation; DF, MTTR)
    • Testing Strategy & Quality Gates (Foundation; CFR, LT)

    © 2019-2026 devopswow.com. Created by Burhan Öcüt
