Infrastructure & Operations Baseline

Infrastructure as Code for all infrastructure, runbook standards, and operational readiness practices.

Milestone: Foundation

foundational

DORA impact: MTTR

Job to be done: When infrastructure is created manually via portal with no version control or audit trail, I want to define everything as code in git with automated provisioning, so I can rebuild environments consistently and recover from failures predictably.

For engineers

You will capture your current infrastructure as code using Terraform, CloudFormation, or Pulumi; set up an automated IaC pipeline with drift detection; implement tested backup and restore procedures; and establish disaster recovery runbooks for rebuilding your entire environment in minutes.

What you’ll implement

These are the roadmap epic features, organized as a starter backlog.

Infrastructure as Code

Operational Runbooks

On-Call Rotation

Autoscaling Configuration

Backup and Recovery

Execution guide

Practical guidance aligned to the Execution Kit Definition of Done.

Outcome

Infrastructure is defined as code, version-controlled, and provisioned through automated pipelines with basic monitoring and backup strategies.

Before to After Transformation

BEFORE: ClickOps infrastructure

Infrastructure is created manually through the cloud portal, with no version control, unchecked configuration drift, and a disaster recovery plan that relies on documentation.

# Infrastructure management:
- Create resources via Azure/AWS portal
- Document steps in Confluence (maybe)
- Configuration drift across environments
- DR plan: "We think we know how to rebuild it"
- Backup strategy: Manual snapshots

Pain points:
- Env parity violations
- 3+ hours to provision new environment
- No audit trail
- Bus factor: 1-2 people

AFTER: Infrastructure as Code

All infrastructure is version-controlled in Git and provisioned through automated pipelines, with drift detection and automated backups.

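# IaC with Terraform:
terraform plan                      # preview changes before applying
terraform apply                     # provision in about 5-10 minutes
git log                             # full audit trail of infrastructure changes
terraform plan -detailed-exitcode   # drift check for a daily scheduled job; exit code 2 means changes are pending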

Benefits:
- Env parity: 100% identical configs
- Provisioning: 5-10 minutes (automated)
- Disaster recovery: Tested quarterly
- Compliance: Policy-as-code enforcement
- Team knowledge: Codified, shareable
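
The daily drift scan listed above could be built on `terraform plan -detailed-exitcode`, which exits with 0 when nothing would change, 1 on error, and 2 when the plan contains changes. The following is a minimal sketch, not the kit's own tooling: the function names and status labels are hypothetical, and it assumes the `terraform` binary is on `PATH` and `terraform init` has already been run in the working directory.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`:
# 0 = no changes, 1 = error, 2 = plan succeeded and changes are pending.
# The status labels below are illustrative names, not Terraform terminology.
DRIFT_STATUS = {0: "in_sync", 1: "error", 2: "drift_detected"}


def classify_plan_exit_code(exit_code: int) -> str:
    """Map a plan exit code to a drift status; unknown codes count as errors."""
    return DRIFT_STATUS.get(exit_code, "error")


def check_drift(workdir: str) -> str:
    """Run a read-only plan in `workdir` and report the drift status.

    Assumes `terraform` is installed and the directory is already initialized.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduled job could call `check_drift` once per environment and raise an alert whenever the result is anything other than `in_sync`.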

Symptoms

Infrastructure provisioned manually through the cloud console

Configuration changes not tracked or auditable

Inconsistent infrastructure across environments

No disaster recovery plan or tested backups

Infrastructure changes cause unexpected outages

Prerequisites

Cloud account or infrastructure platform access

Version control system (Git)

Basic understanding of infrastructure requirements

CI/CD pipeline capability

Implementation steps

Week 1

Audit current infrastructure and document as code (Terraform, CloudFormation, Pulumi)
Set up IaC repository with version control
Implement basic infrastructure modules (network, compute, storage)
Create infrastructure CI/CD pipeline for validation
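
One possible shape for the Week 1 validation pipeline is a script that runs Terraform's formatting and validation checks and stops at the first failure. This is a sketch under assumptions: the kit does not prescribe these exact checks, the function names are hypothetical, and Terraform is assumed to be the chosen tool and available on `PATH`.

```python
import subprocess
from typing import List


def validation_commands() -> List[List[str]]:
    """Return the checks to run, in order, against an IaC working directory."""
    return [
        # Fail if any file is not in canonical format.
        ["terraform", "fmt", "-check", "-recursive"],
        # Install providers and modules without touching the remote state backend.
        ["terraform", "init", "-backend=false", "-input=false"],
        # Check that the configuration is syntactically valid and internally consistent.
        ["terraform", "validate"],
    ]


def run_validation(workdir: str) -> bool:
    """Run each check in `workdir`; return False as soon as one fails."""
    for command in validation_commands():
        if subprocess.run(command, cwd=workdir).returncode != 0:
            print("FAILED:", " ".join(command))
            return False
    return True
```

Running the checks in this order keeps feedback fast: formatting problems surface before any providers are downloaded.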

Week 2

Apply IaC to non-production environment
Implement automated backup strategy for critical data
Add infrastructure monitoring (CPU, memory, disk)
Document disaster recovery procedures

Week 3

Establish infrastructure change approval process
Test disaster recovery and backup restoration
Apply IaC to production with change management
Set up cost monitoring and optimization alerts
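
Testing backup restoration means proving that restored data matches the original, not just that the backup job exited cleanly. The sketch below illustrates the idea with a checksum comparison. It is a simulation under stated assumptions: plain file copies stand in for the real backup and restore steps, and all file and function names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, restored: Path) -> bool:
    """Return True when the restored file is byte-identical to the source."""
    return sha256_of(source) == sha256_of(restored)


def run_restore_drill() -> bool:
    """Simulate a drill: back up a file, restore it, and verify the result."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "data.db"
        source.write_bytes(b"critical data")
        backup = root / "data.db.bak"
        shutil.copy2(source, backup)    # stand-in for the real backup job
        restored = root / "restored.db"
        shutil.copy2(backup, restored)  # stand-in for the real restore step
        return restore_matches_source(source, restored)
```

In a real drill the two copy calls would be replaced by the platform's snapshot and restore operations, while the verification step stays the same.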

Definition of Done

90%+ of infrastructure defined in version-controlled IaC
Infrastructure changes deployed through CI/CD pipeline
Automated backups for all critical data with tested restoration
Infrastructure monitoring in place for all resources
Disaster recovery plan documented and tested
Infrastructure provisioning is repeatable and consistent

Metrics

Leading Indicators

Infrastructure change frequency
IaC coverage (%)
Backup success rate

Lagging Indicators

Infrastructure-related incidents
Mean time to restore infrastructure
Configuration drift incidents
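
The indicators above reduce to simple ratios and averages. The following sketch shows one way to compute three of them. The function names and input shapes are assumptions for illustration; the kit does not define how the underlying counts are collected.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def iac_coverage_pct(managed_resources: int, total_resources: int) -> float:
    """Share of resources managed through IaC, as a percentage."""
    if total_resources == 0:
        return 0.0
    return 100.0 * managed_resources / total_resources


def backup_success_rate_pct(succeeded: int, attempted: int) -> float:
    """Share of backup jobs that completed successfully, as a percentage."""
    if attempted == 0:
        return 0.0
    return 100.0 * succeeded / attempted


def mean_time_to_restore(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - started) over a list of (started, restored) pairs."""
    if not incidents:
        return timedelta(0)
    total = sum((restored - started for started, restored in incidents), timedelta(0))
    return total / len(incidents)
```

For example, 45 IaC-managed resources out of 50 gives a coverage of 90%, which sits exactly at the Definition of Done threshold.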

Failure modes

IaC without state management (lost track of infrastructure)

No testing of disaster recovery (backup fails when needed)

Credentials hardcoded in IaC (security vulnerability)

Infrastructure changes bypass IaC (manual drift)
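
The hardcoded-credentials failure mode can be caught early with a scan in the pipeline. The sketch below is deliberately minimal and only illustrates the idea: the two patterns and the function name are assumptions for this example, and a dedicated secret scanner would cover far more cases with fewer false positives.

```python
import re
from typing import List, Tuple

# Illustrative patterns only.
SECRET_PATTERNS = {
    # AWS access key IDs start with "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted literal assigned to a name containing password/secret/token.
    # Values that start with "$" (interpolations such as "${var.x}") are skipped.
    "literal_secret_assignment": re.compile(
        r'(?i)(password|secret|token)\w*\s*=\s*"[^"$]+"'
    ),
}


def find_hardcoded_secrets(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line in `text`."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings
```

A line such as `password = var.db_password` passes the scan because the value is a variable reference, while a quoted literal assigned to the same attribute is flagged.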

Ownership

Platform/DevOps

Develop and maintain IaC codebase
Manage infrastructure CI/CD pipelines
Implement backup and monitoring strategies

SRE/Operations

Define infrastructure requirements and standards
Test disaster recovery procedures
Monitor infrastructure health and costs

Security

Review infrastructure security configurations
Manage secrets and credentials for IaC
Audit infrastructure changes for compliance

What good looks like (by org scale)

Small Teams

Basic Terraform/Bicep for core infrastructure
Version-controlled IaC in git
Manual terraform apply with peer review

Medium Orgs

Automated IaC pipelines with drift detection
Modular IaC with reusable components
Automated backup/restore procedures

Enterprise

Self-service infrastructure via IaC catalog
Policy-as-code enforcement (OPA/Sentinel)
Automated compliance scanning and remediation

Resources

Templates and related materials for this kit.

Templates

Copy/paste artifacts that support this kit.

Architecture Decision Record (ADR)

A short ADR template for recording decisions and keeping architecture aligned over time.

Service Onboarding Checklist (Golden Path)

A checklist for onboarding a new service into the platform: ownership, CI/CD, observability, and security.

Related capabilities

Capabilities tracked under this epic in the roadmap.

Infrastructure as Code: >= 70% of infrastructure managed via IaC (Terraform, Pulumi, CloudFormation) in version control.
Operational Runbooks: >= 80% of critical services have runbooks for deployment, incident response, and disaster recovery.
On-Call Rotation: >= 90% of production services have a defined on-call rotation with a < 15 min incident response SLA.
Autoscaling Configuration: >= 70% of stateless services have horizontal autoscaling based on CPU/memory or custom metrics.
Backup and Recovery: >= 90% of stateful services (databases, volumes) have automated backups with tested recovery procedures.
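
The capability thresholds above can be checked mechanically once each percentage is measured. The sketch below encodes the five thresholds from this list and reports which capabilities fall short. The function name and the shape of the input are assumptions for illustration, and a capability with no measurement is treated as 0%.

```python
from typing import Dict, List

# Minimum percentages taken from the capability list above.
CAPABILITY_THRESHOLDS_PCT: Dict[str, float] = {
    "Infrastructure as Code": 70.0,
    "Operational Runbooks": 80.0,
    "On-Call Rotation": 90.0,
    "Autoscaling Configuration": 70.0,
    "Backup and Recovery": 90.0,
}


def unmet_capabilities(measured_pct: Dict[str, float]) -> List[str]:
    """Return the capabilities whose measured percentage is below the threshold."""
    return [
        name
        for name, threshold in CAPABILITY_THRESHOLDS_PCT.items()
        if measured_pct.get(name, 0.0) < threshold
    ]
```

Calling `unmet_capabilities` with the latest measurements yields a short gap list that can feed the starter backlog directly.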

Related kits

Other kits in the same milestone or with similar DORA impact.

Deployment Automation Foundations (Milestone: Foundation; DORA impact: MTTR)

Backlog Quality & Planning Enablement (Milestone: Foundation)

CI/CD & Build Automation (Milestone: Foundation)

Observability & Monitoring Foundations (Milestone: Foundation; DORA impact: MTTR, CFR)
