Glossary

209 DevOps terms and definitions to build a shared vocabulary.

A

Agentic DevOps

advanced · fundamentals

DevOps augmented and optimized by AI agents, with human-in-the-loop governance.

The evolution of DevOps in which autonomous AI agents handle toil, code review, testing, and operations while humans provide guardrails, approvals, and strategic oversight. Key pillars: AI governance, agent auditability, human-in-the-loop metrics, and prompt governance. Enables self-healing infrastructure and an AI-assisted SDLC.

When to reach for it: When AI tools feel chaotic without guardrails, understand Agentic DevOps so you can direct autonomous agents for toil while humans keep strategic control.

Agentic Workflows

advanced · practices

AI-powered automation where agents perform multi-step tasks with human oversight.

Agents can plan, execute, and iterate on complex tasks. Requires guardrails, approval gates, and auditability. Examples: automated code review, self-healing infrastructure.

When to reach for it: When tedious multi-step tasks pile up, understand agentic workflows so you can let AI handle the toil while humans focus on judgment calls.

AGENTS.md

foundational · practices

A file providing AI coding agents with project-specific instructions and context.

Contains architecture decisions, coding conventions, and workflow guidance. Helps AI assistants understand repo structure and make appropriate changes.

When to reach for it: When AI assistants drift from your project's norms, understand AGENTS.md so you can encode architecture and conventions in a file agents actually read.

AI Amplifier Effect

intermediate · fundamentals

DORA 2025 finding that AI tools magnify both existing strengths AND weaknesses.

Teams with strong foundations see 2-3x AI benefit vs. struggling teams. AI accelerates good practices but also accelerates tech debt, poor quality, and security vulnerabilities if foundations are weak.

When to reach for it: When you adopt AI tools in your DevOps practice, understand the amplifier effect so you can recognize that AI accelerates both good habits and accumulated debt.

AI Capabilities (DORA 2025)

intermediate · fundamentals

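Seven capabilities that DORA's 2025 AI Capabilities Model associates with getting more value from AI: a clear and communicated AI stance, healthy data ecosystems, AI-accessible internal data, strong version control practices, working in small batches, a user-centric focus, and quality internal platforms.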

Elite teams adopt more AI capabilities with greater effect. Key: AI capability adoption correlates with platform adoption (90% of elite teams have platforms).

When to reach for it: When you evaluate which AI capabilities to adopt in development, understand the seven measured capabilities so you can target investments toward practices that elite teams use.

AIOps

intermediate · fundamentals

Using AI/ML to enhance IT operations through automated insights and actions.

Applies machine learning to analyze operational data, predict issues, automate responses, and reduce MTTR. Includes anomaly detection, intelligent alerting, and root cause analysis.

When to reach for it: When operations drowns in alerts and logs, understand AIOps so you can let machine learning detect anomalies and suggest fixes faster than humans can read dashboards.

Apache Kafka

advanced · tools

A distributed event streaming platform for high-throughput data pipelines.

Publish-subscribe messaging, stream processing, durable storage. Foundation for event-driven architectures.

When to reach for it: When you build event-driven systems with high throughput, understand Apache Kafka so you can stream events durably across services and scale data pipelines.

Sources:

Apache Kafka Official Documentation

API Gateway

intermediate · tools

A single entry point that routes requests to backend services.

Handles auth, rate limiting, request routing. Examples: Kong, AWS API Gateway, Envoy.

When to reach for it: When clients need to access multiple backend services, understand an API gateway so you can provide a single entry point that handles routing, authentication, rate limiting, and request transformation.

Argo Rollouts

advanced · tools

A Kubernetes controller for advanced deployment strategies.

Blue-green, canary, and progressive delivery with analysis. Integrates with service meshes.

When to reach for it: When you need fine-grained control over deployment strategies in Kubernetes, understand progressive delivery patterns so you can safely deploy with automated analysis and instant rollback.

Sources:

Argo Rollouts Project

ArgoCD

advanced · tools

A declarative GitOps continuous delivery tool for Kubernetes.

Syncs Kubernetes cluster state with Git repositories. Supports multi-cluster, RBAC, SSO.

When to reach for it: When you manage Kubernetes cluster state, understand declarative GitOps synchronization so you can keep deployed state aligned with Git as the source of truth.

Sources:

ArgoCD Official Documentation

Artifact Repository

foundational · tools

A storage system for build outputs, packages, and container images.

Examples: Artifactory, Nexus, GitHub Packages, AWS ECR. Central to reproducible builds and supply chain.

When to reach for it: When you establish build infrastructure, understand artifact repositories so you can store and distribute immutable build outputs that enable reproducible builds and supply chain integrity.

AWS CDK

intermediate · tools

Cloud Development Kit for defining AWS infrastructure using familiar programming languages.

Synthesizes to CloudFormation. Supports TypeScript, Python, Java, C#, Go. High-level constructs for common patterns.

When to reach for it: When you define AWS infrastructure, understand high-level programming constructs so you can write infrastructure faster with language familiarity and type safety.

Sources:

AWS CDK Official Documentation

Azure DevOps

intermediate · tools

Microsoft's integrated DevOps platform with boards, repos, pipelines, and artifacts.

YAML or classic pipelines. Integrates with Azure, GitHub, and third-party tools.

When to reach for it: When you manage projects in Microsoft's ecosystem, understand integrated DevOps so you can coordinate planning, code, pipelines, and artifacts in one platform.

B

Backstage

advanced · tools

An open-source platform for building developer portals.

Service catalog, software templates, TechDocs. Created by Spotify, now CNCF incubating.

When to reach for it: When you need a unified interface for your engineering teams, understand developer portals so you can provide a software catalog and golden paths that reduce cognitive load.

BFF (Backend for Frontend)

intermediate · practices

A backend service tailored to the needs of a specific frontend.

Optimizes API responses for each client type (web, mobile, etc.).

When to reach for it: When different frontend clients (web, mobile, IoT) have different API requirements, understand BFF so you can build tailored backend services that optimize responses for each client type.

Bicep

intermediate · tools

A domain-specific language for deploying Azure resources declaratively.

Compiles to ARM templates. Cleaner syntax than JSON. First-class Azure tooling support.

When to reach for it: When you define Azure infrastructure as code, understand that Bicep provides a cleaner language alternative to ARM templates so you can write IaC faster and collaborate more effectively.

Sources:

Bicep - Azure Resource Manager

Blameless Postmortem

foundational · culture

An incident review focused on system improvement rather than individual blame.

Assumes people made reasonable decisions with available information. Focus on "how" not "who".

When to reach for it: When you conduct incident reviews, understand blameless postmortems as a structured process that investigates system failures without attributing fault to individuals so you can learn from incidents and improve systems.

Blast Radius

foundational · practices

The scope of impact when something goes wrong.

Smaller blast radius = fewer users affected. Techniques: canaries, feature flags, circuit breakers.

When to reach for it: When something fails in production, understand the scope of impact so you can use canaries, feature flags, and circuit breakers to keep blast radius small and minimize user exposure.

Blue-Green Deployment

intermediate · practices

Running two identical production environments, switching traffic between them for releases.

Enables zero-downtime deployments and instant rollback by switching the router to the previous environment.

When to reach for it: When zero-downtime deployments are critical, understand how to maintain two identical environments so you can switch traffic instantly and roll back without user impact.
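
The traffic switch can be pictured with a toy router. This is only a sketch: `BlueGreenRouter` and its methods are hypothetical names, and a real setup would flip a load balancer, DNS record, or service selector instead of an in-memory attribute.

```python
class BlueGreenRouter:
    """Toy router that sends all traffic to one of two environments."""

    def __init__(self, live: str = "blue") -> None:
        self.live = live

    @property
    def idle(self) -> str:
        # The environment that is not currently serving traffic.
        return "green" if self.live == "blue" else "blue"

    def switch(self) -> str:
        """Point traffic at the idle environment; calling again rolls back."""
        self.live = self.idle
        return self.live

    def route(self, request: str) -> str:
        return f"{self.live} handled {request}"


router = BlueGreenRouter(live="blue")
router.switch()  # release: green now serves traffic
router.switch()  # rollback: blue serves traffic again
```

Because the previous environment stays up, release and rollback are the same operation pointed in opposite directions.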

Branch Protection

foundational · practices

Rules that enforce policies on specific branches.

Require PR reviews, status checks, signed commits. Prevents direct pushes to the main branch.

When to reach for it: When someone pushes directly to main, understand branch protection rules so you can enforce policy and prevent accidents.

Buildpacks

intermediate · tools

A tool for transforming source code into container images without Dockerfiles.

Detects language, applies best practices. CNCF project. Reproducible builds.

When to reach for it: When you need to create container images without managing Dockerfiles, understand buildpacks so you can automatically detect language and build best-practice images with a single command.

Sources:

Cloud Native Buildpacks Project

Bulkhead Pattern

intermediate · practices

Isolating system components to prevent failures from spreading.

Like ship bulkheads. Separate thread pools, connection pools, or deployments.

When to reach for it: When a failure in one component could crash your entire system, understand the bulkhead pattern so you can isolate resources and prevent failures from spreading across your architecture.
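
A minimal sketch of the idea, assuming each dependency gets its own bounded compartment. The `Bulkhead` class and the `search` and `payments` names are illustrative, not taken from any particular library.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls into one dependency (hypothetical helper)."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args):
        # Fail fast instead of queueing, so a slow dependency cannot
        # tie up callers that belong to other compartments.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slots in this compartment")
        try:
            return fn(*args)
        finally:
            self._slots.release()


# Separate compartments: saturating one leaves the other usable.
search = Bulkhead(max_concurrent=1)
payments = Bulkhead(max_concurrent=1)


def while_search_is_busy():
    try:
        search.call(lambda: "second search call")
        rejected = False
    except BulkheadFull:
        rejected = True
    return rejected, payments.call(lambda: "payment ok")


outcome = search.call(while_search_is_busy)  # (True, "payment ok")
```

The second search call is rejected while the only search slot is held, yet the payments call still succeeds, which is the isolation the pattern is after.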

Burndown Chart

foundational · metrics

A graph showing remaining work over time.

Common in Scrum. Helps visualize sprint progress and predict completion.

When to reach for it: When you track sprint progress, understand burndown charts so you can visualize remaining work and predict whether your team will meet sprint goals.

C

Canary Release

intermediate · practices

Gradually rolling out changes to a small subset of users before full deployment.

Reduces blast radius by detecting issues early; typically 1-5% of traffic initially.

When to reach for it: When deploying risky changes, understand how to gradually roll out to a small user subset so you can detect issues early and limit the blast radius before affecting everyone.
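
One common way to pick the initial subset is to hash a stable identifier into buckets. The sketch below assumes a string user ID and a hypothetical `in_canary` helper; real rollouts are usually driven by a load balancer, service mesh, or rollout controller.

```python
import hashlib


def in_canary(user_id: str, percent: float) -> bool:
    """Deterministically place a user in the canary cohort."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # 5 percent -> buckets 0..499


users = [f"user-{i}" for i in range(10_000)]
share = sum(in_canary(u, 5) for u in users) / len(users)  # close to 0.05
```

Hashing keeps each user in the same cohort across requests, so raising `percent` only adds users to the canary and never reshuffles the ones already in it.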

Cattle vs Pets

foundational · culture

A metaphor for treating servers as disposable (cattle) vs unique and irreplaceable (pets).

Cloud-native mindset: servers are numbered, replaceable. Pets have names and are hand-maintained.

When to reach for it: When you scale infrastructure, understand the cattle-vs-pets metaphor so you can shift from hand-maintaining servers to treating them as replaceable resources.

CD (Continuous Delivery)

intermediate · fundamentals

Keep software always releasable with automated quality gates.

The pipeline can produce a deployable artifact at any time; releases become low-risk and repeatable.

When to reach for it: When you want to ship changes without manual ceremony, understand the difference between continuous delivery and continuous deployment so you know when you can release on demand with confidence.

Sources:

Continuous Delivery

CFR (Change Failure Rate)

foundational · metrics

Percentage of deployments that cause a failure in production.

One of the four DORA metrics. Elite performers: 0-5%. Improved by testing, progressive delivery, and feature flags.

When to reach for it: When you track deployment quality, understand Change Failure Rate so you can quantify the impact of testing and quality gates on production stability.
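
The percentage itself is simple arithmetic. A minimal sketch, assuming you already count deployments and the ones that caused a production failure (the function name and the sample counts are illustrative):

```python
def change_failure_rate(deployments: int, failed: int) -> float:
    """Percentage of deployments that caused a failure in production."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    if not 0 <= failed <= deployments:
        raise ValueError("failed must be between 0 and deployments")
    return 100.0 * failed / deployments


rate = change_failure_rate(deployments=40, failed=2)  # 5.0
```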

Sources:

DORA - State of DevOps Research

Change Failure Rate

foundational · metrics

Percentage of deployments that cause a failure in production.

One of the four DORA metrics. Elite performers: 0-5%. Improved by testing, progressive delivery, and feature flags.

When to reach for it: When you measure deployment stability, understand Change Failure Rate so you can track the percentage of releases that cause production problems and identify where quality gates need strengthening.

Sources:

DORA - State of DevOps Research

Chaos Engineering

intermediate · practices

Intentionally injecting failures to test system resilience.

Tools: Chaos Monkey, LitmusChaos, Gremlin. Goal: build confidence in system behavior under failure.

When to reach for it: When you need to know how your system behaves under failure, understand chaos engineering so you can run controlled experiments that reveal weaknesses before customers do.

Sources:

Principles of Chaos Engineering

Checkov

intermediate · tools

A static analysis tool for infrastructure as code security.

Scans Terraform, CloudFormation, Kubernetes, and more for misconfigurations and compliance.

When to reach for it: When you deploy infrastructure-as-code to cloud platforms, understand IaC security scanning so you can catch misconfigurations before they reach production.

Sources:

Checkov GitHub Repository

CI (Continuous Integration)

foundational · fundamentals

Frequently merge small changes and validate them automatically.

Typical signals: fast build + unit tests on every change; trunk-based development; reproducible builds.

When to reach for it: When you need to move fast without breaking things, understand CI's merge-small-changes-often pattern so you can catch integration problems early and keep your main branch deployable.

Cilium

advanced · tools

eBPF-based networking, security, and observability for Kubernetes.

High-performance CNI. Service mesh without sidecars. Network policies and Hubble observability.

When to reach for it: When you need network security, observability, and service mesh capabilities in Kubernetes, understand eBPF-based networking so you can achieve high performance without sidecar injection.

Sources:

Cilium Project

CircleCI

intermediate · tools

A cloud-native CI/CD platform with powerful caching and parallelism.

YAML config, orbs for reuse, insights dashboard, and flexible compute options.

When to reach for it: When you run CI/CD pipelines on a hosted platform, understand cloud-native CI so you can parallelize builds and cache dependencies to speed up feedback.

Circuit Breaker

intermediate · practices

A pattern that prevents cascading failures by stopping calls to failing services.

States: closed, open, half-open. Gives failing services time to recover.

When to reach for it: When one service is failing and causing cascading failures in dependent services, understand circuit breaker so you can stop requests to the failing service and allow it time to recover.
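
The three states can be sketched as a small state machine. This is an illustrative outline, not a production implementation: the `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            self.state = "half-open"  # timeout elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures, (re)opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is already struggling, which is what gives the failing service room to recover.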

Cloud-Native

intermediate · fundamentals

Designing applications to fully exploit cloud computing advantages.

Characteristics: containerized, dynamically orchestrated, microservices-oriented. CNCF definition.

When to reach for it: When you design new applications for cloud platforms, understand cloud-native principles so you can build systems that scale automatically and recover from failures.

Sources:

CNCF Cloud Native Definition v1.1

CNAPP (Cloud-Native Application Protection Platform)

advanced · security

Unified security platform combining CSPM, CWPP, and application security.

Consolidates cloud security tools. Covers infrastructure, workloads, and code.

When to reach for it: When you need unified cloud security coverage across infrastructure and code, understand CNAPP as a consolidated platform that spans posture management, workload protection, and application security so you can reduce tool sprawl.

Code Review

foundational · practices

The practice of having peers examine code changes before merging.

Catches bugs, improves code quality, shares knowledge. Can be human, AI-assisted, or both.

When to reach for it: When bugs slip past testing, understand code review so you can catch issues early and spread knowledge across the team.

CODEOWNERS

foundational · practices

A file defining who must review changes to specific parts of a codebase.

Automatically requests reviews from designated owners. Ensures domain experts review relevant changes.

When to reach for it: When critical code changes surprise domain experts, understand CODEOWNERS so you can automatically route reviews to people who care most.

Cognitive Load

intermediate · culture

The mental effort required to operate or understand a system.

Team Topologies concept. Reducing cognitive load improves flow and reduces errors.

When to reach for it: When you design systems and teams, understand cognitive load as the mental effort required to operate or understand something so you can reduce unnecessary complexity and improve decision-making.

Complicated Subsystem Team

intermediate · culture

A team responsible for a component requiring specialist knowledge.

Examples: ML models, video codecs, cryptography. Shields complexity from other teams.

When to reach for it: When you work in an organization with specialized domains (ML models, video codecs, cryptography), understand team structure so you can allocate specialists to reduce cognitive load on stream-aligned teams.

Sources:

Team Topologies: Organizing Business and Technology Teams for Fast Flow

Container

foundational · tools

A lightweight, isolated environment for running applications.

Shares the host OS kernel, so it starts faster and uses fewer resources than a VM. Standard packaging format for cloud-native apps.

When to reach for it: When you need consistent runtime environments across development, testing, and production, understand containers so you can package applications with their dependencies for fast, portable deployment.

Sources:

OCI Image Specification

Container Registry

foundational · tools

A repository for storing and distributing container images.

Examples: Docker Hub, GitHub Container Registry, AWS ECR, Azure ACR. Supports tags, scanning, signing.

When to reach for it: When you need to store and distribute container images across your infrastructure, understand container registries so you can ensure reproducible deployments and enable artifact scanning and signing.

Sources:

Open Container Initiative

containerd

intermediate · tools

An industry-standard container runtime.

Used by Docker and Kubernetes. Handles container lifecycle, storage, networking.

When to reach for it: When you need a container runtime for Kubernetes or standalone use, understand containerd so you can manage the full container lifecycle from image download to network and storage setup.

Sources:

containerd Project

Context Engineering

advanced · practices

The practice of optimizing what information is provided to AI models for better outcomes.

Critical for agentic workflows. Includes structuring prompts, managing context windows, and providing relevant project knowledge via AGENTS.md files or MCP servers.

When to reach for it: When AI agents hallucinate or miss context, understand context engineering so you can structure information to steer model behavior toward accuracy.

Continuous Deployment

intermediate · fundamentals

Automatically deploy to production after passing quality checks.

Every change that passes automated tests is deployed to production without manual intervention.

When to reach for it: When every deployment requires courage, understand continuous deployment so you can turn deployments into a boring daily routine backed by passing tests.

Contract Testing

intermediate · practices

Testing that services honor their API contracts with consumers.

Tools: Pact, Spring Cloud Contract. Catches integration issues without full E2E tests.

When to reach for it: When services evolve independently, understand contract testing so you can catch breaking API changes early without running full E2E tests for every integration.
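
The core check can be illustrated without any framework. The sketch below is not the Pact or Spring Cloud Contract API: the `CONTRACT` mapping and the `violations` helper are hypothetical, standing in for what a consumer expects from a provider response.

```python
# Fields the consumer relies on, with the types it expects (illustrative).
CONTRACT = {"id": int, "email": str, "active": bool}


def violations(response: dict, contract: dict) -> list[str]:
    """List the ways a provider response breaks the consumer's contract."""
    problems = []
    for field, expected in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            actual = type(response[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


ok = violations({"id": 1, "email": "a@example.com", "active": True, "plan": "pro"}, CONTRACT)
broken = violations({"id": "1", "email": "a@example.com"}, CONTRACT)
```

Extra fields such as `plan` are tolerated, so the provider can add to its API freely; only removing or retyping a field the consumer depends on counts as a break.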

Sources:

Pact Documentation

Conway's Law

foundational · culture

Organizations design systems that mirror their communication structure.

"Inverse Conway Maneuver" deliberately shapes org structure to achieve desired architecture.

When to reach for it: When you design system architecture, understand how organization structure influences design so you can deliberately structure your teams to achieve desired architecture.

Sources:

Team Topologies: Organizing Business and Technology Teams for Fast Flow

Cortex

intermediate · tools

An internal developer portal focused on service quality and ownership.

Service catalog, scorecards, CQL queries, and integrations with engineering tools.

When to reach for it: When you track service ownership and quality metrics, understand service quality platforms so you can measure reliability and drive accountability across engineering teams.

Sources:

Cortex - Official Site

CQRS (Command Query Responsibility Segregation)

advanced · practices

Separating read and write operations into different models.

Optimizes each side independently. Often combined with event sourcing.

When to reach for it: When you have read-heavy or write-heavy workloads with different optimization requirements, understand CQRS so you can separate read and write models and scale them independently.

CRD (Custom Resource Definition)

advanced · tools

Extending the Kubernetes API with custom resource types.

Foundation for operators and GitOps tools. Declarative management of anything.

When to reach for it: When you extend Kubernetes functionality, understand CRDs so you can define custom resource types that enable declarative management of anything through the Kubernetes API.

Sources:

Kubernetes Official Documentation - Custom Resources

Crossplane

advanced · tools

A Kubernetes-native infrastructure control plane for managing cloud resources.

Extends the Kubernetes API to provision and manage any infrastructure. Enables platform teams to offer self-service infrastructure.

When to reach for it: When you build internal developer platforms, understand Kubernetes-native infrastructure composition so you can enable self-service infrastructure provisioning.

Sources:

Crossplane Official Documentation

CSPM (Cloud Security Posture Management)

intermediate · security

Continuous monitoring of cloud infrastructure for misconfigurations and compliance violations.

Tools: Prisma Cloud, Wiz, AWS Security Hub. Automated remediation and drift detection.

When to reach for it: When you operate cloud infrastructure at scale, understand CSPM as continuous scanning for configuration errors so you can detect misconfigurations before they become breaches.

Cumulative Flow Diagram (CFD)

intermediate · metrics

A chart showing work items in different states over time.

Reveals bottlenecks, WIP-limit violations, and flow problems. Key Kanban metric.

When to reach for it: When you analyze workflow metrics, understand cumulative flow diagrams so you can detect bottlenecks and visualize how work moves through your system.

CWPP (Cloud Workload Protection Platform)

intermediate · security

Security for cloud workloads including VMs, containers, and serverless.

Runtime protection, vulnerability management, and compliance for cloud workloads.

When to reach for it: When you protect containers and virtual machines in cloud environments, understand CWPP as runtime security that monitors and blocks threats at the workload layer so you can defend against container breakouts and lateral movement.

Sources:

Gartner Cloud Workload Protection Platform

Cycle Time

intermediate · metrics

Time from starting work on a task to completing it.

Different from lead time (which includes queue time). Measures active work duration.

When to reach for it: When you measure team productivity, understand cycle time so you can distinguish active work duration from total lead time and identify where work stalls.
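
The distinction shows up clearly with timestamps. A small sketch, assuming each work item records when it was created, started, and completed (the function name and the dates are made up for illustration):

```python
from datetime import datetime


def lead_and_cycle_time(created: datetime, started: datetime, completed: datetime):
    """Return (lead time, cycle time) for one work item."""
    lead = completed - created   # includes time spent waiting in the queue
    cycle = completed - started  # active work only
    return lead, cycle


created = datetime(2025, 3, 3, 9, 0)
started = datetime(2025, 3, 5, 9, 0)
completed = datetime(2025, 3, 6, 15, 0)

lead, cycle = lead_and_cycle_time(created, started, completed)
queue_time = lead - cycle  # 2 days spent waiting before work began
```

Here the lead time is 3 days 6 hours while the cycle time is 1 day 6 hours, so most of the elapsed time was queueing rather than work.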

Sources:

DORA Metrics Guide

D

Dagger

intermediate · tools

A programmable CI/CD engine that runs pipelines in containers.

Write pipelines in your language of choice. Run locally or in any CI system. Created by Docker's founder.

When to reach for it: When you write CI/CD pipelines that must run identically locally and in CI systems, understand that Dagger containerizes pipeline execution so you can eliminate CI-specific quirks and test workflows on your machine.

Sources:

Dagger - A Programmable CI/CD Engine

DAST (Dynamic Application Security Testing)

intermediate · security

Testing a running application for security vulnerabilities.

Tools: OWASP ZAP, Burp Suite. Finds issues that only appear at runtime.

When to reach for it: When you test running applications, understand DAST so you can identify security vulnerabilities that only appear during execution against live systems.

Sources:

OWASP - Dynamic Application Security Testing

Datadog

intermediate · tools

A SaaS monitoring and analytics platform for cloud-scale applications.

Unified metrics, traces, and logs. APM, infrastructure monitoring, and security. Wide integrations.

When to reach for it: When you operate large-scale cloud infrastructure, understand that Datadog provides unified metrics, traces, and logs so you can correlate signals from different layers to diagnose issues quickly.

Sources:

Datadog - Cloud Monitoring as a Service

Deployment Frequency

intermediate · metrics

How often code is deployed to production.

One of the four DORA metrics. Elite performers: multiple times per day. Higher frequency = smaller changes = lower risk.

When to reach for it: When reducing risk in production, measure how often you deploy so you can correlate deployment frequency with failure rates and understand whether smaller, more frequent changes improve stability.

Sources:

DORA Metrics: Four Keys to Measuring Software Delivery Performance

Developer Experience (DevEx)

foundational · culture

The overall experience developers have while building, testing, and shipping software.

Includes tooling, documentation, feedback loops, and cognitive load. High DevEx = higher productivity.

When to reach for it: When you evaluate your engineering organization, understand Developer Experience so you can identify friction points in tooling, feedback loops, and cognitive load that impact productivity.

DevOps

foundational · fundamentals

Union of people, process, and technology to enable continuous delivery of value.

Not a tool or team name. It's a culture and set of practices that breaks down silos between development and operations.

When to reach for it: When development and operations clash, understand DevOps as a culture of breaking silos so you can align teams around shared ownership of reliability.

DevSecOps

intermediate · fundamentals

DevOps with security integrated as a shared responsibility throughout the lifecycle.

Security checks are automated in pipelines; threat modeling happens during design; everyone owns security.

When to reach for it: When security feels like a checklist at the end, understand DevSecOps so you can shift security left and make it everyone's responsibility from day one.

Sources:

OWASP Top 10

DF (Deployment Frequency)

foundational · metrics

How often code is deployed to production.

One of the four DORA metrics. Elite performers: multiple times per day. Higher frequency = smaller changes = lower risk.

When to reach for it: When you assess delivery velocity, understand Deployment Frequency so you can measure how often you deploy and correlate it with team performance and incident risk.
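
As a rough sketch, the metric can be computed from a list of production deploy dates. The function name, the seven-day window, and the sample dates below are assumptions for illustration; real pipelines would pull these from deployment logs.

```python
from collections import Counter
from datetime import date


def deployments_per_day(deploy_dates: list[date], window_days: int) -> float:
    """Average number of production deployments per day over a window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(deploy_dates) / window_days


deploys = [
    date(2025, 3, 3), date(2025, 3, 3), date(2025, 3, 4), date(2025, 3, 6),
    date(2025, 3, 7), date(2025, 3, 7), date(2025, 3, 7),
]

frequency = deployments_per_day(deploys, window_days=7)   # 1.0
busiest_day, count = Counter(deploys).most_common(1)[0]   # 2025-03-07, 3
```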

Sources:

DORA - State of DevOps Research

Distroless Images

intermediate · security

Container images containing only the application and its runtime dependencies.

No shell, package manager, or OS utilities. Smaller attack surface.

When to reach for it: When you need to reduce container image size and attack surface, understand distroless images so you can deploy applications with only their runtime dependencies and no shell or package manager.

Sources:

Distroless Container Images

Docker

foundational · tools

A platform for developing, shipping, and running applications in containers.

De facto standard for container images. Dockerfile, Docker Compose, Docker Hub.

When to reach for it: When you need to build and ship containerized applications, understand Docker so you can create reproducible container images and manage container workflows.

Sources:

Docker Official Documentation

Domain-Driven Design (DDD)

advanced · practices

A software design approach focused on modeling domains based on business reality.

Bounded contexts, ubiquitous language, aggregates. Helps define service boundaries.

When to reach for it: When designing architecture or defining service boundaries, understand Domain-Driven Design so you can align your system structure with business domains and use a shared vocabulary across teams.

Sources:

Domain-Driven Design by Eric Evans

DORA Metrics

intermediate · metrics

Four key metrics for delivery performance: lead time, deployment frequency, change fail rate, MTTR.

Research-backed metrics that correlate with organizational performance. Used to benchmark and improve.

When to reach for it: When you measure delivery performance, understand DORA metrics so you can track the four research-backed indicators that correlate with organizational success.

Sources:

DORA Metrics - Four Keys

DORA Team Profiles (2025)

advanced · metrics

Seven distinct team performance clusters identified in DORA 2025 research.

Profiles: Elite (7.5%), High Balanced (16%), Mid-level Balanced (11%), Mid-level Starting (15%), Low Throughput (17%), Low Stability (23%), Thrashing (10%). Different profiles need different improvement strategies.

When to reach for it: When you analyze your team's performance across velocity and stability, understand the seven profiles so you can identify which improvement strategies apply to your situation.

Dynatrace

advanced · tools

An AI-powered software intelligence platform for observability and security.

Automatic discovery, AI root cause analysis (Davis AI), and full-stack monitoring.

When to reach for it: When you need automatic discovery and monitoring of complex applications, understand that Dynatrace uses AI for root cause analysis so you can detect anomalies without manually configuring thresholds.

Sources:

Dynatrace - Application Performance Monitoring

E

E2E Testing (End-to-End)

intermediate · practices

Testing the entire application flow from the user's perspective.

Tools: Playwright, Cypress, Selenium. Slower but catches real user journey issues. Top of testing pyramid.

When to reach for it: When you need to verify the entire user journey works as expected, understand E2E testing so you can catch issues that only appear when components interact in production-like conditions.

Sources:

Playwright Documentation

Enabling Team

intermediate · culture

A team that helps stream-aligned teams acquire new capabilities.

Focuses on research, guidance, and enablement. Temporary engagement, not long-term dependency.

When to reach for it: When you coach other teams through capability gaps, understand enabling teams as temporary partners that transfer knowledge so you can build internal expertise and then move on.

Ephemeral Environments

foundational · practices

Short-lived environments created on-demand for testing or review.

Spun up per PR or feature branch, torn down after merge. Enables isolated testing.

When to reach for it: When you need to test pull requests in isolation, understand temporary environment patterns so you can enable preview testing and review without impacting shared infrastructure.

Error Budget

intermediate · metrics

The allowed unreliability before you must prioritize stability over change.

Calculated as (1 - SLO). If 99.9% SLO, you have 0.1% error budget (~43 min/month downtime allowed).

When to reach for it: When stability and velocity conflict, understand error budgets so you can quantify risk tolerance and align release decisions with reliability targets.
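
The arithmetic behind the 43-minute figure can be worked through directly. A minimal sketch, assuming a 30-day window; the function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for the window: (1 - SLO) * window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return (1 - slo_percent / 100) * window_days * 24 * 60


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left after the downtime already incurred."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


budget = error_budget_minutes(99.9)                 # about 43.2 minutes
left = budget_remaining(99.9, downtime_minutes=30)  # about 13.2 minutes
```

A negative remainder means the budget is spent, which is the usual signal to favor stability work over new changes.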

Sources:

Google SRE Books

Event Sourcing

advancedpractices

Storing state as a sequence of events rather than current state.

Provides audit trail, enables temporal queries, supports event replay.

When to reach for it: When you need to track state changes and provide audit trails, understand event sourcing so you can store state as immutable events and replay them to reconstruct any historical state.
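Replay can be illustrated with a small sketch. It assumes a hypothetical account whose state is a balance and whose event kinds are `deposited` and `withdrawn`; none of these names come from a specific framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events are immutable once recorded
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event kinds)
    amount: int


def apply(balance: int, event: Event) -> int:
    """Fold a single event into the current state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, upto=None):
    """Rebuild state from the first `upto` events (all events when None)."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance


log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
print(replay(log))     # 75
print(replay(log, 2))  # 70
```

Because the log is append-only, the second call answers a temporal query ("what was the balance after two events?") without any separately stored history.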

Event-Driven Architecture

intermediatepractices

A pattern where services communicate through asynchronous events.

Loose coupling, scalability, eventual consistency. Tools: Kafka, EventBridge, NATS.

When to reach for it: When you need to decouple services in a system, understand event-driven architecture so you can build scalable, loosely-coupled systems where services communicate through events rather than direct calls.
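The decoupling can be sketched in-process, with a queue standing in for a broker such as Kafka or NATS. The `EventBus` class and topic name below are hypothetical and only illustrate the shape of the pattern: publishers enqueue and return, and handlers run later.

```python
from collections import defaultdict, deque


class EventBus:
    """Toy in-memory bus; a real system would use a durable broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._queue = deque()

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # The publisher only enqueues; it never calls a consumer directly.
        self._queue.append((topic, payload))

    def drain(self):
        """Deliver queued events to subscribers; returns deliveries made."""
        delivered = 0
        while self._queue:
            topic, payload = self._queue.popleft()
            for handler in self._subscribers[topic]:
                handler(payload)
                delivered += 1
        return delivered


bus = EventBus()
seen = []
bus.subscribe("order.created", seen.append)
bus.publish("order.created", {"id": 1})
```

Until `drain()` runs, `seen` stays empty, which is the eventual-consistency trade-off in miniature.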

External Secrets Operator

intermediatetools

Kubernetes operator that synchronizes secrets from external providers.

Pulls secrets from Vault, AWS Secrets Manager, etc. into Kubernetes secrets automatically.

When to reach for it: When you pull secrets from external systems into Kubernetes, understand secret synchronization so you can maintain a single source of truth for credentials across your cluster.

F

FaaS (Functions as a Service)

intermediatepractices

Running individual functions without managing servers.

Event-driven execution. Short-lived, stateless. Good for glue code and webhooks.

When to reach for it: When you have short-lived, event-driven workloads, understand Functions as a Service so you can execute individual functions without managing servers.

Sources:

Cloud Native Computing Foundation - Serverless Whitepaper

Feature Flags

intermediatepractices

Runtime configuration that allows enabling or disabling features without deploying new code.

Enables progressive rollout, A/B testing, kill switches, and decoupling deployment from release.

When to reach for it: When deployment and release must be decoupled, understand feature flags so you can deploy confidently without exposing incomplete work.
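A percentage rollout with a kill switch might look like the sketch below. It assumes users are bucketed by hashing the flag name and user ID; `is_enabled` and its parameters are hypothetical and do not reflect any particular vendor's SDK.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int,
               kill_switch: bool = False) -> bool:
    """Return True if the flag is on for this user."""
    if kill_switch:
        return False  # the kill switch overrides any rollout percentage
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_percent
```

Hashing keeps each user in the same bucket across requests, so raising `rollout_percent` from 10 to 50 only adds users and never flips existing ones off.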

FinOps

intermediatefundamentals

Financial operations practice for managing cloud costs with engineering, finance, and business collaboration.

Brings financial accountability to cloud spending. Key practices: tagging, showback/chargeback, rightsizing, reserved capacity, spot instances, and FinOps-informed architecture.

When to reach for it: When cloud bills surprise you, understand FinOps so you can bring financial accountability to engineering without killing innovation.

Flagger

advancedtools

A progressive delivery operator for Kubernetes.

Automates canary releases with Istio, Linkerd, or other service meshes. Prometheus-based analysis.

When to reach for it: When you deploy to Kubernetes and want to automate canary releases with traffic shifting, understand automated progressive delivery so you can reduce deployment risk through metric-driven rollouts.

Sources:

Flagger Project

Flow Efficiency

intermediatemetrics

Ratio of active work time to total lead time.

Most organizations have 5-15% flow efficiency. Rest is wait time. Reveals improvement opportunities.

When to reach for it: When you measure process health, understand flow efficiency as the ratio of active work to total lead time so you can quantify waste and validate improvement efforts.
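As a worked example of the ratio, the sketch below assumes both durations are measured in hours for a single work item; `flow_efficiency` is a hypothetical helper name.

```python
def flow_efficiency(active_hours: float, lead_time_hours: float) -> float:
    """Share of total lead time spent on active work."""
    if lead_time_hours <= 0:
        raise ValueError("lead time must be positive")
    if not 0 <= active_hours <= lead_time_hours:
        raise ValueError("active time must be between 0 and the lead time")
    return active_hours / lead_time_hours


# 6 hours of hands-on work inside a 60-hour lead time
print(f"{flow_efficiency(6, 60):.0%}")  # 10%
```

The remaining 54 hours in this example are wait time: queues, handoffs, and pending reviews.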

Sources:

The Phoenix Project - Gene Kim, et al

Flux

advancedtools

A GitOps toolkit for Kubernetes, part of the CNCF.

Modular approach to GitOps. Integrates with Helm, Kustomize, and OCI registries.

When to reach for it: When you implement GitOps for Kubernetes, understand modular reconciliation tooling so you can manage infrastructure configuration through Git pull requests.

Sources:

Flux Official Documentation

G

Game Day

intermediateculture

A planned exercise to test incident response and system resilience.

Simulates failures in controlled conditions. Validates runbooks and team readiness.

When to reach for it: When you need to validate incident response procedures and team readiness, understand controlled failure simulation so you can identify gaps before production incidents occur.

Sources:

Site Reliability Engineering: How Google Runs Production Systems

Git Hooks

foundationalpractices

Scripts that run automatically at specific points in the Git workflow.

Pre-commit hooks for linting, commit-msg for format validation. Local enforcement of standards.

When to reach for it: When bad commits reach CI, understand git hooks so you can enforce standards locally before they leave a developer's machine.
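A commit-msg hook can be sketched in Python as follows. Git passes the path of the file holding the commit message as the hook's first argument, and a non-zero exit status aborts the commit. The `<type>: <summary>` convention and the allowed types are assumptions chosen for illustration.

```python
import re
import sys

# Hypothetical convention: "<type>: <summary>", e.g. "fix: handle empty config"
PATTERN = re.compile(r"^(feat|fix|docs|chore|refactor|test): \S.{0,70}$")


def is_valid(message: str) -> bool:
    """Check only the first line (the subject) of the commit message."""
    lines = message.splitlines()
    return bool(lines) and PATTERN.match(lines[0]) is not None


def main(argv) -> int:
    # argv[1] is the path to the commit message file supplied by Git.
    with open(argv[1], encoding="utf-8") as handle:
        message = handle.read()
    if is_valid(message):
        return 0
    print("commit message must look like '<type>: <summary>'", file=sys.stderr)
    return 1  # non-zero exit status makes Git abort the commit
```

To use it, the script would be saved as an executable `.git/hooks/commit-msg` file that ends by calling `sys.exit(main(sys.argv))`.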

GitHub Actions

foundationaltools

GitHub's built-in CI/CD and automation platform.

YAML-based workflows triggered by events. Large marketplace of reusable actions.

When to reach for it: When you automate your CI/CD pipeline on GitHub, understand workflow as code so you can trigger deployments and tests from version control events.

Sources:

GitHub Actions - Official Documentation

GitHub Copilot

foundationaltools

AI pair programmer that suggests code completions and entire functions.

Integrates with IDEs. Supports chat, code generation, and agent-based workflows. Powers Agentic DevOps.

When to reach for it: When you write code with AI assistance, understand pair programming with machine learning so you can generate boilerplate code and learn patterns from your codebase.

GitLab CI

intermediatetools

Integrated CI/CD within the GitLab DevOps platform.

YAML pipelines, Auto DevOps, container registry, and security scanning built-in.

When to reach for it: When you build pipelines integrated with GitLab, understand native CI/CD so you can combine version control, code review, and automation in a single system.

GitOps

intermediatepractices

Using Git as the single source of truth for declarative infrastructure and applications.

Changes are made via pull requests; reconciliation loops ensure actual state matches desired state in Git.

When to reach for it: When manual deployments creep in, understand GitOps so you can use Git as your single source of truth and reconciliation loops to keep production aligned with intent.
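The reconciliation loop can be sketched with plain dictionaries standing in for the desired state in Git and the actual state of a cluster. The `plan` and `reconcile` functions are hypothetical simplifications, not how Flux or Argo CD are implemented.

```python
def plan(desired: dict, actual: dict) -> list:
    """List the actions needed to move actual state to desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired: dict, actual: dict) -> dict:
    """Apply the plan in place so that actual converges on desired."""
    for action, name in plan(desired, actual):
        if action == "delete":
            del actual[name]
        else:
            actual[name] = desired[name]
    return actual


desired = {"web": {"replicas": 3}, "api": {"replicas": 2}}
actual = {"web": {"replicas": 1}, "old": {"replicas": 1}}
print(plan(desired, actual))
# [('update', 'web'), ('create', 'api'), ('delete', 'old')]
```

Running `reconcile` repeatedly is safe: once the two states match, `plan` returns an empty list and nothing changes.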

Golden Path

intermediatepractices

An opinionated, well-supported way to build and ship software within an organization.

Reduces cognitive load, ensures consistency, and accelerates onboarding. Core to platform engineering.

When to reach for it: When you build internal developer platforms, understand the golden path concept so you can provide an opinionated, well-supported way that reduces cognitive load and accelerates shipping.

Sources:

Spotify - Backstage Documentation on Golden Paths

Golden Signals

intermediatemetrics

Four key metrics for monitoring: latency, traffic, errors, and saturation.

From the Google SRE book. Provides a baseline for understanding service health.

When to reach for it: When you design monitoring for a service, understand golden signals so you can focus on the four metrics that matter most: latency, traffic, errors, and saturation.

Sources:

Google SRE Books - Monitoring and Alerting

Grafana

foundationaltools

An open-source platform for monitoring and observability visualization.

Supports multiple data sources (Prometheus, Loki, etc.). Dashboards, alerts, and annotations.

When to reach for it: When you visualize operational metrics and logs from multiple sources, understand that Grafana provides dashboards and alert management so you can build shared observability across your infrastructure.

Sources:

Grafana - The Open Observability Platform

GreenOps

intermediatefundamentals

Sustainable IT operations focused on reducing environmental impact of technology.

Optimizing for carbon footprint, energy efficiency, and sustainable architecture. Includes carbon-aware scheduling, rightsizing, and choosing green cloud regions.

When to reach for it: When carbon footprint becomes a business metric, understand GreenOps so you can optimize for environmental impact alongside cost and performance.

Gremlin

intermediatetools

An enterprise chaos engineering platform.

Controlled failure injection, game days, and reliability scoring. SaaS with agents.

When to reach for it: When you need to run reliability tests and game days across your infrastructure, understand controlled failure injection so you can build confidence in incident response and system resilience.

Sources:

Gremlin Platform

H

HashiCorp Vault

advancedtools

A tool for secrets management, encryption, and identity-based access.

Dynamic secrets, encryption as a service, PKI, and database credential rotation.

When to reach for it: When you manage sensitive credentials and encryption keys, understand centralized secret management so you can rotate credentials dynamically and enforce access policies across applications.

Helm

intermediatetools

A package manager for Kubernetes using templated charts.

Enables reusable, versioned Kubernetes deployments. Supports values overrides and dependencies.

When to reach for it: When you manage multiple Kubernetes deployments with overlapping configuration, understand that Helm charts enable reusable, versioned packages so you can reduce duplication and simplify dependency management.

Sources:

Helm - The Package Manager for Kubernetes

Human-in-the-Loop (HITL)

intermediatepractices

Design pattern requiring human approval or oversight at critical decision points.

Essential for Agentic DevOps. Humans approve significant changes while AI handles routine tasks. Balances automation speed with risk management.

When to reach for it: When you want to trust automation without abandoning control, understand HITL so you can let agents act fast on routine tasks while humans approve risky ones.
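An approval gate might be sketched like this. The risk labels and the `approve` callback are assumptions; in practice the callback could be a chat prompt, a pull request review, or a ticket transition.

```python
from typing import Callable


def run_with_gate(action: str, risk: str, approve: Callable[[str], bool]) -> str:
    """Run low-risk actions directly; route everything else through a human."""
    if risk == "low":
        return f"executed {action} automatically"
    if approve(action):
        return f"executed {action} after approval"
    return f"blocked {action}"


# Hypothetical reviewer policy: approve anything except deletions.
def reviewer(action: str) -> bool:
    return "delete" not in action


print(run_with_gate("restart pod", "low", reviewer))
print(run_with_gate("delete database", "high", reviewer))
```

Treating every risk level other than `low` as gated is a deliberate default: an unrecognized label fails safe by asking a human.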

Hybrid Cloud

advancedpractices

Combining on-premises infrastructure with public cloud services.

Common for regulated industries, data sovereignty, or gradual migration. Requires consistent tooling.

When to reach for it: When you integrate on-premises systems with cloud services, understand hybrid cloud architecture so you can meet data residency requirements while gaining cloud benefits.

I

IaC (Infrastructure as Code)

intermediatepractices

Manage infrastructure using versioned, reviewable code.

Enables reproducibility, audit trails, and treating infrastructure changes like application changes.

When to reach for it: When infrastructure changes are snowflakes, understand IaC so you can version, review, and audit infrastructure changes like you do code.

IAST (Interactive Application Security Testing)

advancedsecurity

Real-time security testing using instrumentation within the running application.

Combines SAST and DAST benefits. Lower false positives. Tools: Contrast Security.

When to reach for it: When you need lower false positives in security scanning, understand IAST so you can use real-time instrumentation to distinguish actual vulnerabilities from test artifacts.

Sources:

OWASP - Testing Guide

Immutable Artifacts

intermediatepractices

Build outputs that cannot be modified after creation.

Ensures reproducibility and auditability. Same artifact flows from dev to prod. Never overwrite tags.

When to reach for it: When you deploy code to production, understand immutable artifacts so you can guarantee the same binary flows from development through production without modification.
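One way to check that the same artifact flows through every stage is to compare content digests at promotion time. The sketch below assumes artifacts are available as bytes; `digest` and `promote` are hypothetical helpers, not a registry API.

```python
import hashlib


def digest(artifact: bytes) -> str:
    """Content digest recorded once, when the artifact is built."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


def promote(artifact: bytes, expected_digest: str, environment: str) -> str:
    """Refuse to deploy anything that differs from what was built."""
    actual = digest(artifact)
    if actual != expected_digest:
        raise ValueError(f"digest mismatch for {environment}: got {actual}")
    return f"{environment}: deployed {actual}"


built = b"app-v1.4.2"      # stand-in for a build output
recorded = digest(built)   # stored alongside the build
promote(built, recorded, "staging")
promote(built, recorded, "prod")
```

Referring to artifacts by digest rather than by a mutable tag is what makes "never overwrite tags" enforceable.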

Immutable Infrastructure

intermediatepractices

Infrastructure that is replaced rather than modified in place.

Servers are never patched in place; they are replaced with new images. Ensures consistency and reproducibility.

When to reach for it: When you manage production servers, understand immutable infrastructure so you can eliminate configuration drift and ensure reproducible deployments.

Incident Commander

intermediateculture

The person responsible for coordinating response during an incident.

Single point of coordination. Delegates tasks, communicates status, makes decisions.

When to reach for it: When managing incident response, understand the single command authority pattern so you can prevent conflicting actions and ensure coordinated resolution.

Sources:

Site Reliability Engineering: How Google Runs Production Systems

Inner Source

intermediatepractices

Applying open source development practices within an organization.

Enables cross-team collaboration, shared ownership, and reuse of internal code and standards.

When to reach for it: When knowledge silos form, understand inner source so you can apply open source practices internally and increase code reuse across teams.

Integration Testing

intermediatepractices

Testing how components work together.

Verifies interactions between modules, services, or external dependencies. Middle of testing pyramid.

When to reach for it: When your code depends on other modules or services, understand integration testing so you can verify those interactions work correctly before releasing to production.

Istio

advancedtools

A service mesh platform providing traffic management, security, and observability.

Sidecar-based architecture. mTLS, traffic splitting, circuit breaking. Complex but powerful.

When to reach for it: When you deploy microservices across Kubernetes and need traffic control, security policies, and tracing across services, understand service mesh architecture so you can manage cross-service communication without modifying application code.

Sources:

Istio Project

J

Jaeger

intermediatetools

An open-source distributed tracing system for monitoring microservices.

Helps with root cause analysis, service dependency analysis, and performance optimization.

When to reach for it: When you debug performance issues in microservices, understand that Jaeger traces requests across service boundaries so you can identify bottlenecks and understand latency distribution.

Sources:

Jaeger - Open Source End-to-End Distributed Tracing

Jenkins

intermediatetools

An open-source automation server for building, testing, and deploying.

Highly extensible with plugins. Jenkinsfile for pipeline-as-code. Mature and widely deployed; consider modern alternatives for greenfield projects.

When to reach for it: When you need a highly customizable CI/CD system, understand self-hosted automation so you can extend with plugins and integrate with legacy infrastructure.

K

Knative

advancedtools

A Kubernetes-based platform for deploying serverless workloads.

Serving (request-driven scale-to-zero) and Eventing (event-driven architecture).

When to reach for it: When you deploy serverless workloads on Kubernetes, understand Knative so you can run request-driven or event-driven services that scale to zero.

Sources:

Knative Official Documentation

Kubeflow

advancedtools

A machine learning toolkit for Kubernetes.

ML pipelines, model training, serving, and experiment tracking on Kubernetes. CNCF project.

When to reach for it: When you need to operationalize machine learning pipelines on Kubernetes, understand ML infrastructure patterns so you can manage experiment tracking, training, and model serving at scale.

Sources:

Kubeflow Project

Kubernetes

advancedtools

An open-source container orchestration platform for automating deployment and scaling.

De facto standard for container orchestration. Provides declarative config, self-healing, scaling.

When to reach for it: When you need to orchestrate containerized workloads across machines, understand how Kubernetes provides declarative resource management and self-healing so you can focus on application logic instead of infrastructure operations.

Sources:

Kubernetes Official Documentation

Kueue

intermediatetools

A Kubernetes-native job queueing system for batch and AI workloads.

Manages quotas, priorities, and fair-sharing for compute-intensive jobs. Key for AI infrastructure.

When to reach for it: When you run batch or AI jobs on Kubernetes with competing resource demands, understand job queueing and scheduling so you can optimize cluster resource utilization across workloads.

Sources:

Kueue Documentation

Kustomize

intermediatetools

A Kubernetes-native configuration management tool.

Patch-based customization without templates. Built into kubectl. Good for environment-specific overlays.

When to reach for it: When you need to customize Kubernetes manifests for different environments without templating, understand that Kustomize uses overlay-based patching so you can keep base configs clean and maintain clarity.

Sources:

Kustomize - Kubernetes native configuration management

Kyverno

intermediatetools

A Kubernetes-native policy engine using YAML policies.

Validates, mutates, and generates Kubernetes resources. No new language to learn (unlike OPA/Rego).

When to reach for it: When you need to enforce policies across Kubernetes clusters, understand how policy-as-code works so you can validate and mutate resources declaratively without learning a new policy language.

L

LaunchDarkly

intermediatetools

A feature management platform for controlling feature rollouts.

Feature flags, targeting, experimentation, and release management. Enterprise-grade.

When to reach for it: When you decouple deployments from feature releases, understand feature flag management so you can control rollouts independently and experiment safely in production.

Lead Time for Changes

intermediatemetrics

Time from code commit to code running in production.

One of the four DORA metrics. Elite performers: under one day. Includes code review, CI, and deployment time.

When to reach for it: When optimizing deployment speed, measure lead time from code commit to production so you can identify bottlenecks in code review, CI, and deployment and reduce cycle time.
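Measuring the metric can be sketched as below, assuming each change is recorded as a (commit time, production deploy time) pair. Using the median is an assumption made here to dampen outliers; `lead_time_hours` is a hypothetical helper.

```python
from datetime import datetime
from statistics import median


def lead_time_hours(changes) -> float:
    """Median hours from commit to production over (committed, deployed) pairs."""
    return median(
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    )


changes = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 15, 0)),   # 6 hours
    (datetime(2025, 1, 7, 10, 0), datetime(2025, 1, 8, 10, 0)),  # 24 hours
    (datetime(2025, 1, 8, 9, 0), datetime(2025, 1, 8, 11, 0)),   # 2 hours
]
print(lead_time_hours(changes))  # 6.0
```

Splitting each interval further (review wait, CI duration, deploy queue) would show which stage contributes most to the total.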

Sources:

DORA Metrics: Four Keys to Measuring Software Delivery Performance

Linkerd

intermediatetools

A lightweight, security-focused service mesh for Kubernetes.

Simpler than Istio. Automatic mTLS, traffic metrics, and multi-cluster support.

When to reach for it: When you want a service mesh for Kubernetes but need something operationally simpler than Istio, understand lightweight mesh design so you can secure service-to-service traffic with minimal complexity.

Sources:

Linkerd Project

LitmusChaos

intermediatetools

A Kubernetes-native chaos engineering platform.

Chaos experiments as Kubernetes CRDs. Hub of pre-built experiments. CNCF incubating.

When to reach for it: When you need to test Kubernetes system reliability, understand chaos engineering patterns so you can inject controlled failures and validate that applications tolerate disruption.

Sources:

LitmusChaos Project

Loki

intermediatetools

A horizontally-scalable log aggregation system from Grafana Labs.

Like Prometheus, but for logs. Labels-based indexing, integrates with Grafana. Cost-effective at scale.

When to reach for it: When you aggregate logs at scale without the cost of traditional indexing, understand that Loki uses label-based indexing similar to Prometheus so you can reduce storage costs while maintaining queryability.

Sources:

Grafana Loki - Like Prometheus, but for logs

Low Stability Team

intermediateculture

A DORA 2025 team profile that ships fast but with high failure rates.

About 23% of teams. Good throughput (deployment frequency, lead time) but poor stability (change failure rate). Need to focus on testing, quality gates, and progressive delivery.

When to reach for it: When a team ships frequently but has high failure rates, recognize the low stability pattern so you can focus on testing, quality gates, and progressive delivery to reduce deployment risk.

Sources:

DORA 2025 Report

Low Throughput Team

intermediateculture

A DORA 2025 team profile that is stable but slow to deliver.

About 17% of teams. Good stability (low failure rate) but slow delivery (infrequent deploys, long lead times). Need to focus on automation, CI/CD, and small batch sizes.

When to reach for it: When a team has low failure rates but slow deployment cycles, identify low throughput as the constraint so you can focus on automation, small batch sizes, and CI/CD to increase delivery frequency.

Sources:

DORA 2025 Report

LT (Lead Time for Changes)

foundationalmetrics

Time from code commit to code running in production.

One of the four DORA metrics. Elite performers: under one day. Includes code review, CI, and deployment time.

When to reach for it: When you evaluate pipeline speed, understand Lead Time for Changes so you can measure from commit to production and identify bottlenecks in your delivery process.

Sources:

DORA - State of DevOps Research

M

MCP (Model Context Protocol)

intermediatepractices

An open protocol for connecting AI models to external tools, data sources, and services.

Created by Anthropic, MCP enables AI agents to access real-time context, execute actions, and integrate with existing systems. Becoming the standard integration protocol for agentic workflows.

When to reach for it: When AI agents need to access your tools and services, understand MCP so you can connect models to real-time data without rebuilding integrations.

Microservices

advancedpractices

An architecture style where applications are composed of small, independent services.

Each service is deployable independently, owns its data, and communicates via APIs.

When to reach for it: When your team grows and single services develop conflicting requirements, understand microservices so you can decompose the system in ways that allow teams to move independently.

Sources:

Building Microservices by Sam Newman

MLflow

intermediatetools

An open-source platform for managing the ML lifecycle.

Experiment tracking, model registry, deployment, and reproducibility. Language-agnostic.

When to reach for it: When you run multiple ML experiments and need to track results reproducibly, understand ML lifecycle management so you can version models, compare experiments, and deploy with confidence.

Sources:

MLflow Project

MLOps

advancedfundamentals

DevOps practices applied to machine learning model lifecycle management.

Includes model versioning, experiment tracking, automated training pipelines, model serving, monitoring for drift, and A/B testing. Tools: MLflow, Kubeflow, Weights & Biases.

When to reach for it: When ML models become tech debt, understand MLOps so you can treat model lifecycles like code lifecycles with versioning, testing, and deployment discipline.

Modular Monolith

intermediatepractices

A monolith with clear module boundaries that could be split into services later.

Best of both worlds: single deployment with clear separation. Good stepping stone.

When to reach for it: When you want the simplicity of a monolith but need a clear upgrade path, understand the modular monolith so you can build with strong module boundaries that can become services if needed.

Sources:

Modular Monolith by Sam Newman

Monolith

foundationalpractices

An application architecture where all components are part of a single deployable unit.

Not inherently bad. Simpler to develop and deploy initially. "Monolith-first" is a valid strategy.

When to reach for it: When building a new service, understand monolith so you can make an informed choice about whether a single deployable unit matches your team's constraints and growth trajectory.

MTBF (Mean Time Between Failures)

intermediatemetrics

Average time between system failures.

Higher MTBF = more reliable system. Improved by chaos engineering, testing, and resilience patterns.

When to reach for it: When you assess system reliability, understand MTBF so you can measure stability improvements from resilience patterns and testing.

mTLS (Mutual TLS)

intermediatesecurity

Both client and server authenticate each other using certificates.

Standard in service meshes. Ensures both parties are who they claim to be.

When to reach for it: When you secure communication between services in a mesh, understand mTLS as bidirectional certificate authentication so you can ensure both endpoints verify each other's identity before exchanging data.

MTTR (Mean Time to Restore)

intermediatemetrics

Average time to restore service after an incident.

One of the four DORA metrics. Elite performers: under 1 hour. Key driver: detection + runbooks.

When to reach for it: When you measure incident response capability, understand MTTR so you can benchmark speed to recovery and track improvements in detection and runbook quality.
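MTTR and the related MTBF can both be derived from an incident log. The sketch below assumes incidents are recorded as (start, end) offsets in hours within an observation period, and uses one common convention for MTBF: total uptime divided by the number of failures. The function names are hypothetical.

```python
def mttr_hours(incidents) -> float:
    """Mean time to restore: average incident duration in hours."""
    return sum(end - start for start, end in incidents) / len(incidents)


def mtbf_hours(incidents, period_hours: float) -> float:
    """Mean time between failures: uptime divided by number of failures."""
    downtime = sum(end - start for start, end in incidents)
    return (period_hours - downtime) / len(incidents)


# Two incidents in a 720-hour (30-day) period: one lasting 1 hour, one 3 hours
incidents = [(100, 101), (400, 403)]
print(mttr_hours(incidents))       # 2.0
print(mtbf_hours(incidents, 720))  # 358.0
```

Both functions raise `ZeroDivisionError` on an empty incident list, so a period with no incidents needs separate handling.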

Sources:

DORA Metrics - MTTR

Multi-Cloud

advancedpractices

Using services from multiple cloud providers.

Avoids vendor lock-in, leverages best-of-breed services. Increases complexity and requires abstraction.

When to reach for it: When you evaluate cloud vendors, understand multi-cloud strategy so you can avoid lock-in while managing added complexity from multiple platforms.

Multi-Stage Builds

intermediatepractices

Docker builds that use multiple FROM statements to create smaller final images.

Build in one stage, copy artifacts to minimal runtime stage. Reduces image size.

When to reach for it: When you build container images, understand Docker's multi-stage pattern so you can reduce final image size by copying only necessary artifacts to a minimal runtime stage.

Sources:

Docker Documentation - Multi-stage Builds

N

Namespace (Kubernetes)

intermediatetools

A mechanism for isolating groups of resources within a Kubernetes cluster.

Enables multi-tenancy, resource quotas, and RBAC scoping. Not a security boundary.

When to reach for it: When you need to isolate groups of resources in a cluster, understand Kubernetes namespaces so you can enable multi-tenancy, apply resource quotas, and scope RBAC permissions.

Sources:

Kubernetes Official Documentation - Namespaces

NATS

intermediatetools

A lightweight, high-performance messaging system for cloud-native applications.

Simple pub/sub, request/reply, and streaming (JetStream). Low latency, easy to operate.

When to reach for it: When you need lightweight, low-latency messaging for cloud-native systems, understand NATS so you can implement pub/sub, request/reply, and streaming patterns simply.

Sources:

NATS Official Documentation

New Relic

intermediatetools

An observability platform for monitoring application and infrastructure performance.

Full-stack observability, AI-powered insights, and broad language/framework support.

When to reach for it: When you need full-stack observability across applications and infrastructure, understand that New Relic aggregates traces, metrics, and logs with AI-powered analysis so you can reduce mean-time-to-detection.

Sources:

New Relic - Observability Platform

O

Observability

intermediatepractices

The ability to understand system state from its external outputs (logs, metrics, traces).

Beyond monitoring: enables debugging unknown-unknowns. Three pillars: logs, metrics, traces.

When to reach for it: When something unexpected breaks in production, understand observability so you can ask arbitrary questions about system behavior and find root cause without relying on predefined metrics.

Sources:

Observability Engineering by Charity Majors, Liz Fong-Jones, George Miranda

OCI (Open Container Initiative)

foundationaltools

Industry standards for container image format and runtime.

Ensures container portability across different runtimes and registries.

When to reach for it: When you need container portability across different runtimes and platforms, understand OCI so you can use standardized specifications for image format and runtime behavior.

Sources:

Open Container Initiative

On-Call Rotation

foundationalculture

A schedule where team members take turns being available for urgent issues.

Typically 24/7 coverage. Key: fair distribution, good runbooks, and escalation paths.

When to reach for it: When you maintain 24/7 service coverage, understand rotation scheduling so you can distribute operational burden fairly and sustain team health.

Sources:

Site Reliability Engineering: How Google Runs Production Systems

OPA (Open Policy Agent)

advancedtools

A general-purpose policy engine for unified policy enforcement.

Uses Rego policy language. Can enforce policies on Kubernetes, APIs, Terraform, and more.

When to reach for it: When you need to enforce organizational policies across multiple systems, understand that OPA uses a general-purpose policy language so you can define rules once and apply them to Kubernetes, APIs, and infrastructure.

Sources:

Open Policy Agent - Policy-as-Code Framework

OpenTelemetry

intermediatetools

A vendor-neutral standard for collecting telemetry data (traces, metrics, logs).

CNCF project. Provides SDKs, collectors, and exporters. Becoming the industry standard.

When to reach for it: When you instrument applications for observability, understand that OpenTelemetry provides vendor-neutral SDKs and protocols so you can avoid lock-in and switch backends without code changes.

Sources:

OpenTelemetry - Observability for software

OpenTofu

intermediatetools

An open-source fork of Terraform maintained by the Linux Foundation.

Drop-in replacement for Terraform with community governance. Emerged after HashiCorp license change.

When to reach for it: When you need open-source infrastructure as code with community governance, understand the Terraform-compatible alternative so you can avoid proprietary licensing concerns.

Sources:

OpenTofu Official Documentation

Operator Pattern

advancedpractices

A Kubernetes pattern for automating the management of complex applications.

Custom controllers that encode operational knowledge. Examples: database operators.

When to reach for it: When you manage complex applications on Kubernetes, understand the operator pattern so you can encode operational knowledge into custom controllers that automate management and scaling.

Sources:

Kubernetes Official Documentation - Operator Pattern

P

Paved Road

intermediatepractices

Another term for golden path: a well-maintained default way to accomplish common tasks.

Teams can go off-road but must accept additional maintenance burden and risk.

When to reach for it: When you standardize development practices, understand the paved road concept so you can define well-maintained defaults that most teams follow while allowing off-road decisions.

Platform as a Product

intermediateculture

Treating the internal developer platform as a product with users, roadmap, and feedback loops.

Platform team acts as product team; engineers are customers. Drives adoption and satisfaction.

When to reach for it: When you organize platform engineering teams, understand platform-as-product so you can treat infrastructure tooling as a user-focused product with roadmaps and feedback loops.

Sources:

Team Topologies - Platform as a Product

Platform Engineering

intermediatefundamentals

Building and maintaining internal developer platforms to improve developer experience and productivity.

Provides golden paths, self-service tooling, and abstractions so teams can ship faster without reinventing infrastructure.

When to reach for it: When teams reinvent infrastructure for every project, understand platform engineering so you can build golden paths that let teams move faster without becoming platform experts.

Platform Team

intermediateculture

A team that provides internal platforms to reduce cognitive load for stream-aligned teams.

Treats platform as a product. Self-service, well-documented, with clear SLAs.

When to reach for it: When you invest in developer experience, understand platform teams as product teams that serve internal customers so you can provide self-service capabilities and reduce cognitive load across the organization.

Pod

foundationaltools

The smallest deployable unit in Kubernetes, containing one or more containers.

Containers in a pod share network and storage. Typically one app container per pod.

When to reach for it: When you deploy applications to Kubernetes, understand the pod abstraction so you can manage the smallest deployable unit that contains one or more containers sharing network and storage.

Sources:

Kubernetes Official Documentation - Pods

Podman

intermediatetools

A daemonless container engine compatible with Docker.

Rootless containers by default. Drop-in Docker replacement. Red Hat project.

When to reach for it: When you want rootless containers without a daemon dependency, understand Podman so you can use a drop-in Docker replacement with improved security and compatibility.

Sources:

Podman Official Documentation

Policy-as-Code

intermediatesecurity

Defining and enforcing organizational policies using code.

Tools: OPA/Rego, Kyverno, Sentinel. Enables automated compliance checking in pipelines.

When to reach for it: When you enforce organizational standards, understand Policy-as-Code so you can automate compliance checks in your pipelines and infrastructure.
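
The idea of a policy expressed as code can be sketched in a few lines. The tools named above use their own languages (Rego for OPA, YAML for Kyverno); the Python function below is only a stand-in, and the two rules it checks (no `latest` image tag, no privileged containers) are illustrative assumptions rather than rules from any real policy library.

```python
def check_container_policy(pod_spec: dict) -> list[str]:
    """Return human-readable violations for a simplified pod spec."""
    violations = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        # Look only at the last path segment so a registry port is not mistaken for a tag.
        image_ref = container.get("image", "").rsplit("/", 1)[-1]
        if ":" not in image_ref or image_ref.endswith(":latest"):
            violations.append(f"{name}: image must pin a non-latest tag")
        if container.get("securityContext", {}).get("privileged", False):
            violations.append(f"{name}: privileged mode is not allowed")
    return violations
```

A pipeline step could fail the build whenever the returned list is non-empty, which is the "automated compliance check" in miniature.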

Sources:

Open Policy Agent

Port

advancedtools

An internal developer portal platform for building self-service experiences.

Software catalog, scorecards, self-service actions, and workflow automation. SaaS-based.

When to reach for it: When you build self-service capabilities for your platform, understand developer portal platforms so you can automate workflows and provide a catalog of reusable services.

Sources:

Port - Official Site

Preview Environments

foundationalpractices

Temporary environments that allow stakeholders to review changes before merge.

Often tied to pull requests. Enables early feedback from product, design, and QA.

When to reach for it: When you want feedback from stakeholders on code changes, understand preview environments so you can ship with confidence knowing product, design, and QA have already reviewed the feature.

Progressive Delivery

intermediatepractices

Gradually releasing features using techniques like canaries, feature flags, and traffic shifting.

Combines deployment strategies with observability to minimize risk and maximize feedback.

When to reach for it: When minimizing release risk matters, understand how to combine deployment strategies with observability so you can gradually increase exposure while catching issues early.
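
A minimal sketch of the gating logic, assuming a hypothetical `error_rate_at` callback that returns the observed error rate at a given traffic percentage. The step sizes and the 1% threshold are illustrative, not recommendations.

```python
from typing import Callable, Sequence


def progressive_rollout(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> tuple[str, int]:
    """Increase exposure step by step; stop at the first unhealthy step.

    Returns ("promoted", last_weight) or ("rolled_back", failing_weight).
    """
    if not steps:
        raise ValueError("at least one traffic step is required")
    for weight in steps:
        observed = error_rate_at(weight)  # hypothetical metrics lookup
        if observed > max_error_rate:
            return ("rolled_back", weight)
    return ("promoted", steps[-1])
```

In practice a controller such as Argo Rollouts or a feature-flag service plays this role; the sketch only shows how observability feeds the decision to continue or roll back.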

Prometheus

intermediatetools

An open-source monitoring and alerting toolkit optimized for reliability.

Pull-based metrics collection, PromQL query language, integrates with Grafana for visualization.

When to reach for it: When you need metrics from applications and infrastructure, understand that Prometheus scrapes time-series metrics using a pull model so you can perform real-time alerting and query-driven troubleshooting.

Sources:

Prometheus - Monitoring system and time series database

Psychological Safety

foundationalculture

Team environment where members feel safe to take risks and be vulnerable.

Google's Project Aristotle found it to be the #1 predictor of high-performing teams. Enables learning from failures.

When to reach for it: When you build high-performing teams, understand psychological safety as an environment where members can speak up without fear so you can accelerate learning and catch more problems early.

Pull Request (PR)

foundationalpractices

A request to merge code changes from one branch into another.

Enables code review, discussion, and automated checks before merging. Also called Merge Request (MR).

When to reach for it: When code review happens ad-hoc, understand PRs so you can formalize change discussion and automated checks before anything merges to main.

Pulumi

advancedtools

Infrastructure as code using general-purpose programming languages.

Write IaC in TypeScript, Python, Go, C#, etc. Full IDE support, testing frameworks, and type safety.

When to reach for it: When you write infrastructure as code, understand programming-language-native approaches so you can reuse development patterns and tooling.

Sources:

Pulumi Official Documentation

R

RabbitMQ

intermediatetools

An open-source message broker supporting multiple messaging protocols.

AMQP, MQTT, STOMP. Good for task queues and traditional messaging patterns.

When to reach for it: When you need reliable asynchronous communication between services, understand RabbitMQ so you can implement task queues and message routing with multiple protocols.

Sources:

RabbitMQ Official Documentation

RASP (Runtime Application Self-Protection)

advancedsecurity

Security technology that runs inside an application to detect and block attacks.

Protects in real-time without code changes. Can block SQL injection, XSS at runtime.

When to reach for it: When you defend applications in production, understand RASP so you can detect and block attacks from inside the running process without code changes.

Sources:

OWASP - Runtime Application Self-Protection

Rolling Update

intermediatepractices

Incrementally updating instances of an application without downtime.

Old instances are replaced one at a time or in small batches; traffic shifts gradually to the new version.

When to reach for it: When updating live services, understand how to replace instances incrementally so you can shift traffic gradually to new versions without downtime.
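
The replacement loop can be sketched as a small simulation. The function name and the `batch_size` parameter are assumptions for illustration; real orchestrators also wait for health checks between batches, which is omitted here.

```python
def rolling_update(instances: list[str], new_version: str, batch_size: int = 1) -> list[list[str]]:
    """Return a snapshot of the fleet after each batch is replaced."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fleet = list(instances)
    history = []
    for start in range(0, len(fleet), batch_size):
        for index in range(start, min(start + batch_size, len(fleet))):
            fleet[index] = new_version  # replace one old instance
        history.append(list(fleet))  # old and new versions coexist until the last batch
    return history
```

Every intermediate snapshot contains a mix of versions, which is why both versions must be able to serve traffic side by side during the update.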

Runbook

foundationalpractices

Documentation with step-by-step instructions for handling common scenarios.

Reduces MTTR by codifying knowledge. Should be tested and kept up-to-date.

When to reach for it: When responding to operational incidents, understand documented procedures so you can reduce mean time to resolution and prevent knowledge silos.

Sources:

Site Reliability Engineering: How Google Runs Production Systems

S

Saga Pattern

advancedpractices

Managing distributed transactions across multiple services.

Choreography (events) or orchestration (coordinator). Each step has a compensating action.

When to reach for it: When you need to coordinate transactions across multiple independent services, understand the saga pattern so you can manage distributed transactions with compensating actions instead of traditional ACID.
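
A minimal sketch of the orchestration variant, assuming each step is a hypothetical `(name, action, compensation)` triple. When a step fails, the steps that already completed are compensated in reverse order.

```python
from typing import Callable

# (name, action, compensation); all three are illustrative placeholders.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, undo completed steps newest-first."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log


def decline_payment() -> None:
    """Stand-in for a payment service that rejects the charge."""
    raise RuntimeError("payment declined")
```

Running `run_saga` with a stock reservation followed by `decline_payment` releases the reservation again. A production saga would also persist its progress so it can resume after a coordinator crash.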

SAST (Static Application Security Testing)

intermediatesecurity

Analyzing source code for security vulnerabilities without executing it.

Tools: SonarQube, CodeQL, Semgrep. Runs in IDE or CI pipeline.

When to reach for it: When you integrate security into development, understand SAST so you can scan source code for vulnerabilities before compilation or runtime execution.

Sources:

OWASP - Code Analysis

SBOM (Software Bill of Materials)

foundationalsecurity

A complete inventory of components in a software artifact.

Required for supply chain transparency. Formats: SPDX, CycloneDX. Enables vulnerability tracking.

When to reach for it: When you track software artifacts, understand SBOM so you can maintain an inventory of components and quickly identify which products are affected by vulnerabilities.

Sources:

OWASP - Software Bill of Materials

SCA (Software Composition Analysis)

intermediatesecurity

Identifying vulnerabilities and license issues in third-party dependencies.

Tools: Snyk, Dependabot, Trivy. Critical for supply chain security.

When to reach for it: When you manage third-party dependencies, understand SCA so you can identify known vulnerabilities and license compliance issues in your supply chain.

Sources:

OWASP - Component Analysis

Secret Scanning

foundationalsecurity

Detecting credentials, API keys, and other secrets in code or configuration.

Tools: GitLeaks, TruffleHog, GitHub Secret Scanning. Critical to prevent credential leaks.

When to reach for it: When you protect credentials, understand Secret Scanning so you can detect and prevent API keys and secrets from being committed to version control.
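
The core of a pattern-based scanner can be sketched with two regular expressions. The rule set below is an illustrative assumption; the tools named above ship far more patterns and add entropy checks and verification.

```python
import re

# Illustrative rules only, not the rule set of any real scanner.
RULES = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings
```

Running a check like this in a pre-commit hook catches a secret before it reaches history, which is cheaper than rotating it afterwards.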

Sources:

OWASP - Secrets Management

Semantic Versioning (SemVer)

foundationalpractices

A versioning scheme using MAJOR.MINOR.PATCH format.

MAJOR = breaking changes, MINOR = backward-compatible new features, PATCH = backward-compatible bug fixes. Communicates compatibility.

When to reach for it: When you publish libraries or manage dependencies, understand semantic versioning so you can communicate compatibility guarantees and automate safe upgrades.
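
A short sketch of how tooling might classify an upgrade, assuming plain `MAJOR.MINOR.PATCH` strings. Pre-release and build-metadata suffixes are deliberately ignored here, and the function names are illustrative.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def bump_kind(old: str, new: str) -> str:
    """Classify the change between two versions."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    if new_parts <= old_parts:
        return "not-an-upgrade"
    if new_parts[0] != old_parts[0]:
        return "major"  # may contain breaking changes
    if new_parts[1] != old_parts[1]:
        return "minor"  # new, backward-compatible features
    return "patch"  # backward-compatible fixes
```

Comparing integer tuples rather than strings matters: as text, `"1.10.0"` sorts before `"1.9.9"`, while the parsed tuples order correctly.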

Sources:

Semantic Versioning 2.0.0

Semgrep

intermediatetools

A fast, open-source static analysis tool for finding bugs and vulnerabilities.

Pattern-based rules in YAML. Supports many languages. Good for custom security rules.

When to reach for it: When you need custom static analysis rules for your codebase, understand pattern-based code analysis so you can detect application-specific bugs and security patterns efficiently.

Sources:

Semgrep Project

Serverless

intermediatepractices

A cloud execution model where the provider manages server infrastructure.

Pay per use, auto-scaling to zero. Examples: AWS Lambda, Azure Functions, Google Cloud Run.

When to reach for it: When you want to reduce infrastructure management overhead, understand serverless architecture so you can deploy code that auto-scales to zero, runs on-demand, and charges per use.

Sources:

Cloud Native Computing Foundation - Serverless Whitepaper

Service Mesh

advancedtools

Infrastructure layer for handling service-to-service communication.

Provides observability, traffic control, security (mTLS). Examples: Istio, Linkerd, Cilium.

When to reach for it: When your microservices grow beyond a handful and you need consistent traffic control, security, and observability across service calls, understand service mesh so you can enforce policies without modifying application code.

Sources:

Istio Documentation

Shift Left

foundationalsecurity

Move feedback and controls earlier in the lifecycle (testing, security, validation).

Finding issues earlier is cheaper. Examples: SAST in IDE, security review at design.

When to reach for it: When you design quality processes, understand Shift Left so you can catch defects and security issues earlier in development where they cost less to fix.

Sources:

OWASP - Shift Left Testing

Shift Right

intermediatesecurity

Extending testing and monitoring into production environments.

Chaos engineering, canary analysis, production observability. Complements shift-left.

When to reach for it: When you build resilient systems, understand Shift Right so you can extend testing and monitoring into production to catch issues that only appear at scale.

Sources:

Google SRE - Release Engineering

Sidecar Pattern

intermediatepractices

Deploying a helper container alongside the main application container.

Common in Kubernetes. Used for logging, proxying, secrets injection, and service mesh.

When to reach for it: When you need to add cross-cutting concerns like logging, proxying, or security to containers without modifying the application, understand the sidecar pattern so you can keep your main application focused and simple.

Sigstore

intermediatesecurity

A project for signing, verifying, and protecting software supply chains.

Keyless signing with Cosign, transparency logs with Rekor. Makes signing easy and auditable.

When to reach for it: When you sign and verify software artifacts, understand Sigstore so you can add cryptographic signatures without managing private keys manually.

Sources:

Sigstore Project

SLA (Service Level Agreement)

foundationalmetrics

A contractual commitment about service reliability, often with financial penalties.

SLAs are typically less stringent than internal SLOs to provide a buffer.

When to reach for it: When customer expectations about uptime are unclear, understand SLAs so you can set contractual terms backed by penalties and internal buffers.

SLI (Service Level Indicator)

intermediatemetrics

A quantitative measure of service behavior (e.g., latency, error rate, throughput).

The raw metric used to calculate whether an SLO is being met.

When to reach for it: When you're unsure if you're meeting your SLO, understand SLIs so you can measure the actual behavior driving reliability decisions.
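
Two common SLIs can be computed as simple ratios. The function names and the 200 ms threshold in the usage note are illustrative assumptions.

```python
def availability_sli(good_events: int, valid_events: int) -> float:
    """Fraction of valid requests that succeeded."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return good_events / valid_events


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests served at or under the latency threshold."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)
```

For example, `latency_sli([120, 80, 300, 95], 200)` yields 0.75, meaning three of four requests met the threshold.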

Sources:

Google SRE Books

SLO (Service Level Objective)

intermediatemetrics

A measurable reliability target for a service (e.g., 99.9% success rate).

Defines "good enough" reliability; enables data-driven prioritization between features and stability.

When to reach for it: When reliability feels like an infinite goal with no trade-offs, understand SLOs so you can define acceptable reliability and allocate your error budget.
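
The error budget follows directly from the target: it is the share of time or requests that the SLO allows to fail. A minimal sketch, with illustrative function names and a 30-day window as an assumption:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates within the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget. With one million requests and 250 failures, about three quarters of the budget remains.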

Sources:

Google SRE Books

SLSA (Supply-chain Levels for Software Artifacts)

advancedsecurity

A framework for ensuring the integrity of software artifacts throughout the supply chain.

Levels define increasing assurance: v0.1 used levels 1-4, while v1.0 defines Build levels 1-3. Covers provenance and build integrity; v0.1 also included source requirements.

When to reach for it: When you strengthen supply chain integrity, understand SLSA so you can implement a maturity framework that prevents artifact tampering and ensures build provenance.

Sources:

SLSA Framework

Snyk

intermediatetools

A developer-first security platform for finding and fixing vulnerabilities.

SCA, container scanning, IaC security, and code analysis. Integrates with developer workflows.

When to reach for it: When you need to integrate security scanning into your development workflow, understand developer-first vulnerability management so you can fix supply chain and code vulnerabilities before release.

Sources:

Snyk Platform

SonarQube

intermediatetools

A platform for continuous inspection of code quality and security.

Static analysis, code coverage, security hotspots, and quality gates. Self-hosted or cloud.

When to reach for it: When you need to enforce code quality standards and security practices across your team, understand continuous code inspection so you can maintain consistent quality gates and reduce technical debt.

Sources:

SonarQube Product

SOPS

intermediatetools

Secrets OPerationS, a tool for encrypting secrets in files.

Supports AWS KMS, GCP KMS, Azure Key Vault, and age. Enables GitOps with encrypted secrets.

When to reach for it: When you version encrypted secrets in Git, understand file-level encryption so you can integrate secret management into GitOps workflows while keeping secrets encrypted at rest.

Spec-Driven Development

intermediatepractices

Writing detailed specifications before implementation, often to guide AI code generation.

AI tools can generate code from specs. Risk: reverting to waterfall anti-patterns with big-bang releases. Best used with incremental delivery.

When to reach for it: When using AI to generate code, understand how detailed specifications guide generation so you can get better results while avoiding waterfall anti-patterns and maintaining incremental delivery.

SPIFFE/SPIRE

advancedsecurity

Standards and tools for workload identity in distributed systems.

SPIFFE defines the identity format; SPIRE is the reference implementation. Zero-trust foundations.

When to reach for it: When you assign cryptographic identity to workloads across distributed systems, understand SPIFFE and SPIRE so you can authenticate workloads without static shared secrets, automate certificate lifecycle management, and enable zero-trust networks.

Split

intermediatetools

A feature delivery platform combining feature flags with observability.

Feature flags, A/B testing, and feature impact metrics in one platform.

When to reach for it: When you want to measure feature impact alongside experimentation, understand integrated feature delivery so you can track how releases affect user behavior and system performance.

SRE (Site Reliability Engineering)

intermediatefundamentals

A discipline that applies software engineering to operations problems.

Pioneered by Google. Focuses on reliability through SLOs, error budgets, toil reduction, and treating operations as a software problem. SREs are engineers, not ops.

When to reach for it: When ops work feels endless, understand SRE so you can use error budgets and toil reduction to treat reliability as an engineering problem with engineering solutions.

Sources:

Google SRE Books

Strangler Fig Pattern

intermediatepractices

Gradually replacing a legacy system by routing traffic to new components.

Named after strangler fig trees. Low-risk migration strategy.

When to reach for it: When you need to migrate from a legacy system without risky big-bang rewrites, understand the strangler fig pattern so you can gradually replace the old system by routing traffic to new components.
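
The routing layer at the heart of the pattern can be sketched as a prefix check. The path prefixes and backend names below are hypothetical; in practice this logic usually lives in a reverse proxy or API gateway.

```python
# Hypothetical list of routes already migrated to the new system.
MIGRATED_PREFIXES = ("/invoices", "/customers")


def route(path: str, migrated_prefixes: tuple[str, ...] = MIGRATED_PREFIXES) -> str:
    """Send migrated routes to the new service and everything else to the legacy system."""
    for prefix in migrated_prefixes:
        # Match the prefix itself or anything beneath it, but not look-alike paths.
        if path == prefix or path.startswith(prefix + "/"):
            return "new-service"
    return "legacy-monolith"
```

Migration then proceeds by adding one prefix at a time to the list, and rolling back a step means removing it again.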

Stream-Aligned Team

intermediateculture

A team aligned to a flow of work from a segment of the business domain.

Primary team type in Team Topologies. Delivers value end-to-end with minimal dependencies.

When to reach for it: When you organize teams around delivery, understand stream-aligned teams as squads responsible for a complete value stream so you can reduce handoffs and accelerate feedback.

Synthetic Monitoring

intermediatepractices

Simulating user interactions to proactively detect issues.

Runs scripted transactions against production. Detects issues before real users.

When to reach for it: When you need to detect user-facing outages before your customers notice, understand synthetic monitoring so you can continuously run scripted transactions and catch degradation early.

T

Team Topologies

intermediateculture

A model for organizing teams based on flow and cognitive load.

Four team types: stream-aligned, enabling, complicated-subsystem, platform. Three interaction modes.

When to reach for it: When you structure teams for high velocity, understand Team Topologies as a framework that identifies team types and interaction patterns so you can align teams to value streams and minimize dependencies.

Tekton

advancedtools

A Kubernetes-native CI/CD framework for building pipelines.

Cloud-native, declarative pipelines as Kubernetes resources. Part of CD Foundation.

When to reach for it: When you build CI/CD pipelines for Kubernetes-based systems, understand that Tekton defines pipelines as Kubernetes resources so you can use familiar kubectl tooling and run complex workflows declaratively.

Sources:

Tekton - Cloud Native CI/CD

Tempo

advancedtools

A high-scale distributed tracing backend from Grafana Labs.

Object-storage based, integrates with Grafana. Supports Jaeger, Zipkin, and OpenTelemetry.

When to reach for it: When you need to store and query distributed traces at high volume, understand that Tempo uses object storage instead of custom databases so you can reduce operational complexity and reuse existing infrastructure.

Sources:

Grafana Tempo - Distributed tracing backend

Terraform

intermediatetools

An infrastructure as code tool using declarative configuration files.

Multi-cloud support, state management, module ecosystem. Industry standard for IaC.

When to reach for it: When you provision cloud infrastructure, understand declarative infrastructure as code so you can version, review, and reproducibly manage resources.

Sources:

Terraform Official Documentation

Testing Pyramid

foundationalpractices

A model for balancing test types: many unit tests, fewer integration tests, fewest E2E tests.

Ensures fast feedback. Unit tests catch most issues; E2E tests catch integration issues.

When to reach for it: When you allocate resources to testing, understand the testing pyramid so you can balance fast feedback with comprehensive coverage and keep test suites quick and maintainable.

Sources:

Google Testing on the Toilet: The Testing Pyramid

Thrashing Team

advancedculture

A DORA 2025 team profile characterized by poor performance across all metrics.

About 10% of teams. High failure rates, slow delivery, slow recovery. Need fundamental improvements to foundations before AI can help. Opposite of Elite profile.

When to reach for it: When you diagnose a struggling team's problems, understand Thrashing as a profile with poor metrics across all dimensions so you can address root causes before chasing quick fixes.

Throughput

foundationalmetrics

The number of work items completed in a given time period.

Measures team delivery rate. Track trends rather than absolute numbers.

When to reach for it: When you evaluate delivery performance, understand throughput so you can track the rate at which your team completes work over time.
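
A minimal sketch of counting completed items per ISO week, assuming the completion dates have already been exported from a tracker. The function name is illustrative.

```python
from collections import Counter
from datetime import date


def weekly_throughput(completed_on: list[date]) -> dict[tuple[int, int], int]:
    """Count completed work items per (ISO year, ISO week), in chronological order."""
    counts: Counter = Counter()
    for day in completed_on:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))
```

Plotting these weekly counts over a quarter shows the trend, which is more informative than any single week's number.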

Sources:

DORA Metrics Guide

Toil

foundationalculture

Manual, repetitive, automatable work that scales with service size.

SRE goal: keep toil below 50% of time. Automate or eliminate toil to focus on engineering.

When to reach for it: When you assess operations work, understand toil so you can identify repetitive manual tasks and prioritize automation to free time for engineering.

Sources:

Google SRE Books - Toil

Tracing

intermediatepractices

Tracking requests as they flow through distributed systems.

Enables root cause analysis across services. Standards: OpenTelemetry, W3C Trace Context.

When to reach for it: When requests span multiple services, understand tracing so you can see the path each request takes and identify which service introduced latency or errors.
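
Trace context is carried between services in a header. The sketch below builds and propagates a W3C Trace Context `traceparent` value (version, 16-byte trace ID, 8-byte parent span ID, and flags, all lowercase hex). The function names are illustrative; real services would normally rely on an OpenTelemetry SDK for this.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with a random trace ID and root span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Keep the trace ID and flags, but issue a fresh span ID for the downstream call."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError("malformed traceparent header")
    version, trace_id, _parent_span_id, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Because every hop keeps the same trace ID, a tracing backend can stitch the spans from all services into one request path.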

Sources:

OpenTelemetry Documentation

Trivy

foundationaltools

A comprehensive security scanner for containers, filesystems, and infrastructure.

Scans for vulnerabilities, misconfigurations, secrets, and license issues. Fast and easy to use.

When to reach for it: When you need to scan container images and infrastructure code for vulnerabilities in your CI/CD pipeline, understand vulnerability scanning tools so you can detect and remediate security issues early.

Sources:

Trivy GitHub Repository

Trunk-Based Development

intermediatepractices

A branching strategy where developers merge small changes frequently to a single main branch.

Reduces merge conflicts, enables continuous integration, and supports high-velocity delivery.

When to reach for it: When long-lived branches become merge nightmares, understand trunk-based development so you can merge small changes frequently and keep CI flowing.

Sources:

Trunk-Based Development

Twelve-Factor App

foundationalpractices

A methodology for building modern, scalable, cloud-native applications.

Twelve principles including config in env vars, stateless processes, dev/prod parity. Foundation for cloud-native.

When to reach for it: When you build software-as-a-service applications, understand the twelve-factor methodology so you can ensure consistent behavior across development, staging, and production.
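
The "config in env vars" principle can be sketched as follows. The variable names `DATABASE_URL` and `DEBUG` are assumptions for illustration, not names prescribed by the methodology.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    debug: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from the environment instead of from files baked into the build."""
    source = os.environ if env is None else env
    return Settings(
        database_url=source["DATABASE_URL"],  # required: fail fast if it is missing
        debug=source.get("DEBUG", "false").lower() == "true",
    )
```

The same build artifact can then run in development, staging, and production, with only the environment differing between them.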

Sources:

The Twelve-Factor App

Two-Pizza Teams

foundationalculture

Teams small enough to be fed by two pizzas (typically 6-10 people).

Amazon concept. Smaller teams have faster communication and decision-making.

When to reach for it: When you organize engineering teams, understand the relationship between team size and decision velocity so you can maintain communication efficiency and autonomy.

U

Unit Testing

foundationalpractices

Testing individual components or functions in isolation.

Fast, focused, numerous. Foundation of the testing pyramid. Run on every commit.

When to reach for it: When you write code, understand unit testing so you can catch bugs fast with feedback in milliseconds and document the intended behavior of individual functions.
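
A small example using Python's built-in `unittest` module. The `slugify` function is a made-up unit under test, chosen because it has no external dependencies.

```python
import unittest


def slugify(title: str) -> str:
    """Lowercase a title and join its words with hyphens."""
    return "-".join(title.lower().split())


class SlugifyTest(unittest.TestCase):
    def test_lowercases_and_joins_words(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  Paved   Road "), "paved-road")
```

Saved as a module, these tests run with `python -m unittest`. Each test exercises one behavior in isolation, so a failure points directly at the broken expectation.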

Unleash

intermediatetools

An open-source feature flag management system.

Self-hosted or cloud. Supports gradual rollouts, A/B testing, and kill switches.

When to reach for it: When you need an open-source feature flag system, understand self-hosted feature management so you can control release strategies without vendor lock-in.

V

Value Stream

foundationalpractices

The end-to-end flow from idea to value delivered to users.

Mapping value streams reveals waste, handoffs, and bottlenecks. Core to lean/DevOps improvement.

When to reach for it: When you analyze how value flows through your organization, understand value streams as the complete path from idea to value delivered to users so you can identify and eliminate waste.

Sources:

The Phoenix Project - Gene Kim, et al

Value Stream Mapping (VSM)

intermediatepractices

A lean technique for visualizing and analyzing the flow of work.

Identifies wait times, handoffs, and waste. Calculates flow efficiency. Starting point for improvement.

When to reach for it: When you diagnose bottlenecks in your delivery process, understand VSM as a visualization technique that distinguishes active work from wait time so you can calculate flow efficiency and prioritize improvements.
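
Flow efficiency is active work time divided by total elapsed time. A minimal sketch, assuming each stage of the map is recorded as a hypothetical `(name, active_hours, wait_hours)` tuple:

```python
def flow_efficiency(stages: list[tuple[str, float, float]]) -> float:
    """Share of total lead time spent on active work rather than waiting."""
    active = sum(active_hours for _, active_hours, _ in stages)
    waiting = sum(wait_hours for _, _, wait_hours in stages)
    if active + waiting == 0:
        raise ValueError("stages must contain some elapsed time")
    return active / (active + waiting)


def longest_wait(stages: list[tuple[str, float, float]]) -> str:
    """Name of the stage with the most wait time, a first candidate for improvement."""
    return max(stages, key=lambda stage: stage[2])[0]
```

With stages of 8 active hours coding, 1 active plus 23 waiting hours in review, and 1 active plus 7 waiting hours in deploy, flow efficiency is 10 / 40 = 25%, and review is the longest wait.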

Sources:

The Phoenix Project - Gene Kim, et al

Vibe Coding

intermediatepractices

Rapid prototyping with AI assistance based on natural language descriptions.

Enables quick exploration and iteration. Requires human review and refinement. Useful for MVPs and proof-of-concepts.

When to reach for it: When exploring ideas quickly with AI, understand rapid prototyping based on natural language so you can validate concepts fast while knowing human review and refinement are essential.

W

War Room

foundationalculture

A dedicated space (physical or virtual) for incident response coordination.

Centralizes communication during major incidents. Clear roles and real-time updates.

When to reach for it: When a major incident occurs, understand centralized coordination structures so you can enable real-time communication and decision-making under pressure.

Sources:

Site Reliability Engineering: How Google Runs Production Systems

Work in Progress (WIP)

foundationalmetrics

The number of tasks currently being worked on.

Limiting WIP improves flow and reduces context switching. Core kanban concept.

When to reach for it: When you manage team capacity, understand WIP limits so you can reduce context switching and reveal bottlenecks in your workflow.
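
One way to see why limiting WIP improves flow is Little's Law, which is brought in here as background and is not part of the entry above. For a stable system, average cycle time equals average WIP divided by average throughput. The function name and figures are illustrative.

```python
def average_cycle_time(avg_wip: float, throughput_per_week: float) -> float:
    """Little's Law: average weeks an item spends in progress."""
    if throughput_per_week <= 0:
        raise ValueError("throughput_per_week must be positive")
    return avg_wip / throughput_per_week
```

A team that finishes 6 items per week with 12 in progress averages 2 weeks per item. Halving WIP to 6 at the same throughput halves the average cycle time to 1 week.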

Y

You Build It, You Run It

foundationalculture

Teams are responsible for operating and supporting what they build.

Aligns incentives for reliability. Teams feel the pain of poor design or operations.

When to reach for it: When you allocate operational responsibility, understand accountability alignment so you can drive reliability focus from the development side and reduce handoff friction.

Sources:

Amazon Leadership Principles

Z

Zero Trust

intermediatesecurity

Security model that requires verification for every access request, regardless of location.

Never trust, always verify. Applies to networks, identities, and workloads.

When to reach for it: When you design security architecture for distributed systems, understand that zero trust replaces network perimeter defense with continuous verification so you can prevent lateral movement and reduce breach impact.
