Infrastructure & Operations Baseline
Infrastructure as Code for all infrastructure, runbook standards, and operational readiness practices.
Job to be done: When infrastructure is created manually via portal with no version control or audit trail, I want to define everything as code in git with automated provisioning, so I can rebuild environments consistently and recover from failures predictably.
You will document your current infrastructure in code using Terraform or CloudFormation, set up an automated IaC pipeline with drift detection, implement tested backup and restore procedures, and establish disaster recovery runbooks that can rebuild your entire environment in minutes.
Execution guide
Practical guidance aligned to the Execution Kit Definition of Done.
Outcome
Infrastructure is defined as code, version-controlled, and provisioned through automated pipelines with basic monitoring and backup strategies.
Before to After Transformation
Before: infrastructure is created manually via the portal with no version control, configuration drifts across environments, and disaster recovery relies on documentation.
# Infrastructure management:
- Create resources via Azure/AWS portal
- Document steps in Confluence (maybe)
- Configuration drift across environments
- DR plan: "We think we know how to rebuild it"
- Backup strategy: Manual snapshots
Pain points:
- Env parity violations
- 3+ hours to provision new environment
- No audit trail
- Bus factor: 1-2 people
After: all infrastructure is version-controlled in git, with automated provisioning via pipelines, drift detection, and automated backups.
# IaC with Terraform:
terraform plan # preview changes
terraform apply # 5-10 minutes to provision
git log # full audit trail
terraform plan -detailed-exitcode # drift detection, run as a daily scheduled scan
Benefits:
- Env parity: 100% identical configs
- Provisioning: 5-10 minutes (automated)
- Disaster recovery: Tested quarterly
- Compliance: Policy-as-code enforcement
- Team knowledge: Codified, shareable
Symptoms
Prerequisites
Implementation steps
- Audit current infrastructure and document as code (Terraform, CloudFormation, Pulumi)
- Set up IaC repository with version control
- Implement basic infrastructure modules (network, compute, storage)
- Create infrastructure CI/CD pipeline for validation
- Apply IaC to non-production environment
- Implement automated backup strategy for critical data
- Add infrastructure monitoring (CPU, memory, disk)
- Document disaster recovery procedures
- Establish infrastructure change approval process
- Test disaster recovery and backup restoration
- Apply IaC to production with change management
- Set up cost monitoring and optimization alerts
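As a rough illustration of the drift detection step, the sketch below wraps `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The function names, the status strings, and the `workdir` argument are assumptions made for this example, not part of the kit. A non-empty plan can also mean unapplied code changes, so a scheduled scan would typically run against the branch that was last applied.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
EXIT_NO_CHANGES = 0
EXIT_ERROR = 1
EXIT_CHANGES_PRESENT = 2


def classify_plan_exit_code(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a drift status."""
    if code == EXIT_NO_CHANGES:
        return "in_sync"
    if code == EXIT_CHANGES_PRESENT:
        return "drift_detected"
    # Exit code 1 (or anything unexpected) is treated as a failed scan.
    return "error"


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` (a hypothetical path to an initialized
    Terraform configuration) and report the drift status."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A daily pipeline job could call `check_drift` for each environment directory and raise an alert on any status other than `in_sync`.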
Definition of Done
- 90%+ of infrastructure defined in version-controlled IaC
- Infrastructure changes deployed through CI/CD pipeline
- Automated backups for all critical data with tested restoration
- Infrastructure monitoring in place for all resources
- Disaster recovery plan documented and tested
- Infrastructure provisioning is repeatable and consistent
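The 90% coverage criterion above can be checked mechanically once there is an inventory of resources. The following is a minimal sketch, assuming a hypothetical inventory in which each resource is flagged as managed by IaC or not. The `Resource` type and the way the inventory is produced are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    resource_id: str       # hypothetical identifier from a cloud inventory export
    managed_by_iac: bool   # True if the resource is tracked in IaC state


def iac_coverage(resources: list[Resource]) -> float:
    """Return the percentage of resources managed by IaC (0.0 for an empty inventory)."""
    if not resources:
        return 0.0
    managed = sum(1 for r in resources if r.managed_by_iac)
    return 100.0 * managed / len(resources)


def meets_coverage_target(resources: list[Resource], target_percent: float = 90.0) -> bool:
    """Check the inventory against the Definition of Done coverage threshold."""
    return iac_coverage(resources) >= target_percent


# Example inventory: nine managed resources and one created by hand.
inventory = [Resource(f"res-{i}", managed_by_iac=True) for i in range(9)]
inventory.append(Resource("res-manual", managed_by_iac=False))
```

Counting by resource is one possible definition of coverage. Weighting by cost or criticality would be an equally reasonable choice.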
Metrics
- Infrastructure change frequency
- IaC coverage (%)
- Backup success rate
- Infrastructure-related incidents
- Mean time to restore infrastructure
- Configuration drift incidents
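Two of the metrics above, backup success rate and mean time to restore, reduce to simple arithmetic over job and incident records. The sketch below shows one way to compute them. The record shapes (a list of booleans and a list of start/restore timestamp pairs) are assumptions for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean


def backup_success_rate(outcomes: list[bool]) -> float:
    """Percentage of backup jobs that succeeded; each entry is one job's result."""
    if not outcomes:
        raise ValueError("no backup jobs recorded")
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average duration between incident start and restoration."""
    if not incidents:
        raise ValueError("no incidents recorded")
    durations = [(restored - started).total_seconds() for started, restored in incidents]
    return timedelta(seconds=mean(durations))


# Example records: four backup jobs and two infrastructure incidents.
jobs = [True, True, True, False]
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),   # 30 minutes
    (datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 15, 30)),   # 90 minutes
]
```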
Failure modes
Ownership
- Develop and maintain IaC codebase
- Manage infrastructure CI/CD pipelines
- Implement backup and monitoring strategies
- Define infrastructure requirements and standards
- Test disaster recovery procedures
- Monitor infrastructure health and costs
- Review infrastructure security configurations
- Manage secrets and credentials for IaC
- Audit infrastructure changes for compliance
What good looks like (by org scale)
- Basic Terraform/Bicep for core infrastructure
- Version-controlled IaC in git
- Manual terraform apply with peer review
- Automated IaC pipelines with drift detection
- Modular IaC with reusable components
- Automated backup/restore procedures
- Self-service infrastructure via IaC catalog
- Policy-as-code enforcement (OPA/Sentinel)
- Automated compliance scanning and remediation
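Policy-as-code is normally written in a dedicated tool such as OPA or Sentinel. To show the idea without either, the sketch below uses plain Python to inspect the JSON form of a plan (as produced by `terraform show -json` on a saved plan file), whose `resource_changes` entries carry an `address`, a `type`, and a `change.actions` list. The rule itself, blocking deletion of certain resource types, and the `PROTECTED_TYPES` set are hypothetical policy choices for this example.

```python
import json

# Hypothetical policy: these resource types must never be deleted by a pipeline run.
PROTECTED_TYPES = {"aws_db_instance", "aws_s3_bucket"}


def load_plan(path: str) -> dict:
    """Load a plan previously exported as JSON to `path` (a hypothetical file)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_violations(plan: dict) -> list[str]:
    """Return addresses of protected resources that the plan would delete."""
    violations = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement appears as both "delete" and "create", so it is caught too.
        if change.get("type") in PROTECTED_TYPES and "delete" in actions:
            violations.append(change["address"])
    return violations


# Example plan fragment with one update and one replacement of a protected resource.
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main", "type": "aws_db_instance",
         "change": {"actions": ["delete", "create"]}},
    ]
}
```

A pipeline could fail the run when `find_violations` returns a non-empty list, leaving the change to go through manual approval.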
Related capabilities
Capabilities tracked under this epic in the roadmap.
- Infrastructure as Code: >= 70% of infrastructure managed via IaC (Terraform, Pulumi, CloudFormation) in version control.
- Operational Runbooks: >= 80% of critical services have runbooks for deployment, incident response, and disaster recovery.
- On-Call Rotation: >= 90% of production services have a defined on-call rotation with a < 15 min incident response SLA.
- Autoscaling Configuration: >= 70% of stateless services have horizontal autoscaling based on CPU/memory or custom metrics.
- Backup and Recovery: >= 90% of stateful services (databases, volumes) have automated backups with tested recovery procedures.
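The capability thresholds listed above lend themselves to a simple gap report. The sketch below encodes the five targets from the list and compares them with measured percentages. The `measured` values and the function name are assumptions for illustration, and how each percentage is measured is left open.

```python
# Target percentages taken from the capability list above.
CAPABILITY_TARGETS = {
    "Infrastructure as Code": 70.0,
    "Operational Runbooks": 80.0,
    "On-Call Rotation": 90.0,
    "Autoscaling Configuration": 70.0,
    "Backup and Recovery": 90.0,
}


def capability_gaps(measured: dict[str, float]) -> dict[str, float]:
    """Return the shortfall in percentage points for each capability below target.

    Capabilities missing from `measured` are treated as 0% covered.
    """
    gaps = {}
    for name, target in CAPABILITY_TARGETS.items():
        value = measured.get(name, 0.0)
        if value < target:
            gaps[name] = target - value
    return gaps


# Hypothetical measurements for one team.
measured = {
    "Infrastructure as Code": 75.0,
    "Operational Runbooks": 60.0,
    "On-Call Rotation": 90.0,
    "Autoscaling Configuration": 70.0,
    "Backup and Recovery": 85.0,
}
```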