Glossary
209 DevOps terms and definitions to build shared vocabulary.
Glossary Terms
A
Agentic DevOps
Agent-augmented and optimized DevOps using AI agents with human-in-the-loop governance.
The evolution of DevOps where autonomous AI agents handle toil, code review, testing, and operations while humans provide guardrails, approvals, and strategic oversight. Key pillars: AI governance, agent auditability, human-in-the-loop metrics, and prompt governance. Enables self-healing infrastructure and AI-assisted SDLC.
When to reach for it: When AI tools feel chaotic without guardrails, understand Agentic DevOps so you can direct autonomous agents for toil while humans keep strategic control.
Agentic Workflows
AI-powered automation where agents perform multi-step tasks with human oversight.
Agents can plan, execute, and iterate on complex tasks. Requires guardrails, approval gates, and auditability. Examples: automated code review, self-healing infrastructure.
When to reach for it: When tedious multi-step tasks pile up, understand agentic workflows so you can let AI handle the toil while humans focus on judgment calls.
AGENTS.md
A file providing AI coding agents with project-specific instructions and context.
Contains architecture decisions, coding conventions, and workflow guidance. Helps AI assistants understand repo structure and make appropriate changes.
When to reach for it: When AI assistants drift from your project's norms, understand AGENTS.md so you can encode architecture and conventions in a file agents actually read.
AI Amplifier Effect
DORA 2025 finding that AI tools magnify both existing strengths AND weaknesses.
Teams with strong foundations see 2-3x AI benefit vs. struggling teams. AI accelerates good practices but also accelerates tech debt, poor quality, and security vulnerabilities if foundations are weak.
When to reach for it: When you adopt AI tools in your DevOps practice, understand the amplifier effect so you can recognize that AI accelerates both good habits and accumulated debt.
AI Capabilities (DORA 2025)
Seven AI development capabilities measured by DORA: code completion, generation, explanation, test data, tests, docs, optimization.
Elite teams adopt more AI capabilities with greater effect. Key: AI capability adoption correlates with platform adoption (90% of elite teams have platforms).
When to reach for it: When you evaluate which AI capabilities to adopt in development, understand the seven measured capabilities so you can target investments toward practices that elite teams use.
AIOps
Using AI/ML to enhance IT operations through automated insights and actions.
Applies machine learning to analyze operational data, predict issues, automate responses, and reduce MTTR. Includes anomaly detection, intelligent alerting, and root cause analysis.
When to reach for it: When operations drowns in alerts and logs, understand AIOps so you can let machine learning detect anomalies and suggest fixes faster than humans can read dashboards.
Apache Kafka
A distributed event streaming platform for high-throughput data pipelines.
Publish-subscribe messaging, stream processing, durable storage. Foundation for event-driven architectures.
When to reach for it: When you build event-driven systems with high throughput, understand Apache Kafka so you can stream events durably across services and scale data pipelines.
API Gateway
A single entry point that routes requests to backend services.
Handles auth, rate limiting, request routing. Examples: Kong, AWS API Gateway, Envoy.
When to reach for it: When clients need to access multiple backend services, understand an API gateway so you can provide a single entry point that handles routing, authentication, rate limiting, and request transformation.
Argo Rollouts
A Kubernetes controller for advanced deployment strategies.
Blue-green, canary, and progressive delivery with analysis. Integrates with service meshes.
When to reach for it: When you need fine-grained control over deployment strategies in Kubernetes, understand progressive delivery patterns so you can safely deploy with automated analysis and instant rollback.
ArgoCD
A declarative GitOps continuous delivery tool for Kubernetes.
Syncs Kubernetes cluster state with Git repositories. Supports multi-cluster, RBAC, SSO.
When to reach for it: When you manage Kubernetes cluster state, understand declarative GitOps synchronization so you can keep deployed state aligned with Git source of truth.
Artifact Repository
A storage system for build outputs, packages, and container images.
Examples: Artifactory, Nexus, GitHub Packages, AWS ECR. Central to reproducible builds and supply chain.
When to reach for it: When you establish build infrastructure, understand artifact repositories so you can store and distribute immutable build outputs that enable reproducible builds and supply chain integrity.
AWS CDK
Cloud Development Kit for defining AWS infrastructure using familiar programming languages.
Synthesizes to CloudFormation. Supports TypeScript, Python, Java, C#, Go. High-level constructs for common patterns.
When to reach for it: When you define AWS infrastructure, understand high-level programming constructs so you can write infrastructure faster with language familiarity and type safety.
Azure DevOps
Microsoft's integrated DevOps platform with boards, repos, pipelines, and artifacts.
YAML or classic pipelines. Integrates with Azure, GitHub, and third-party tools.
When to reach for it: When you manage projects in Microsoft's ecosystem, understand integrated DevOps so you can coordinate planning, code, pipelines, and artifacts in one platform.
B
Backstage
An open-source platform for building developer portals.
Service catalog, software templates, TechDocs. Created by Spotify, now CNCF incubating.
When to reach for it: When you need a unified interface for your engineering teams, understand developer portals so you can provide a software catalog and golden paths that reduce cognitive load.
BFF (Backend for Frontend)
A backend service tailored to the needs of a specific frontend.
Optimizes API responses for each client type (web, mobile, etc.).
When to reach for it: When different frontend clients (web, mobile, IoT) have different API requirements, understand BFF so you can build tailored backend services that optimize responses for each client type.
Bicep
A domain-specific language for deploying Azure resources declaratively.
Compiles to ARM templates. Cleaner syntax than JSON. First-class Azure tooling support.
When to reach for it: When you define Azure infrastructure as code, understand that Bicep provides a cleaner language alternative to ARM templates so you can write IaC faster and collaborate more effectively.
Blameless Postmortem
An incident review focused on system improvement rather than individual blame.
Assumes people made reasonable decisions with available information. Focus on "how" not "who".
When to reach for it: When you conduct incident reviews, understand blameless postmortems as a structured process that investigates system failures without attributing fault to individuals so you can learn from incidents and improve systems.
Blast Radius
The scope of impact when something goes wrong.
Smaller blast radius = fewer users affected. Techniques: canaries, feature flags, circuit breakers.
When to reach for it: When something fails in production, understand the scope of impact so you can use canaries, feature flags, and circuit breakers to keep blast radius small and minimize user exposure.
Blue-Green Deployment
Running two identical production environments, switching traffic between them for releases.
Enables zero-downtime deployments and instant rollback by switching the router to the previous environment.
When to reach for it: When zero-downtime deployments are critical, understand how to maintain two identical environments so you can switch traffic instantly and roll back without user impact.
Branch Protection
Rules that enforce policies on specific branches.
Require PR reviews, status checks, signed commits. Prevents direct pushes to main branch.
When to reach for it: When someone pushes directly to main, understand branch protection rules so you can enforce policy and prevent accidents.
Buildpacks
A tool for transforming source code into container images without Dockerfiles.
Detects language, applies best practices. CNCF project. Reproducible builds.
When to reach for it: When you need to create container images without managing Dockerfiles, understand buildpacks so you can automatically detect language and build best-practice images with a single command.
Bulkhead Pattern
Isolating system components to prevent failures from spreading.
Like ship bulkheads. Separate thread pools, connection pools, or deployments.
When to reach for it: When a failure in one component could crash your entire system, understand the bulkhead pattern so you can isolate resources and prevent failures from spreading across your architecture.
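A minimal sketch of the idea, assuming each dependency gets its own fixed number of concurrency slots. The `Bulkhead` class and the `payments` and `search` names are hypothetical and used only for illustration; real systems often use separate thread or connection pools instead.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust shared capacity."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args):
        # Reject immediately instead of queueing when the compartment is full.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: call rejected")
        try:
            return func(*args)
        finally:
            self._slots.release()


# One compartment per dependency (hypothetical names).
payments = Bulkhead(max_concurrent=1)
search = Bulkhead(max_concurrent=1)
```

Because each compartment has its own slots, a saturated `payments` bulkhead rejects further payment calls while `search` keeps serving.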
Burndown Chart
A graph showing remaining work over time.
Common in Scrum. Helps visualize sprint progress and predict completion.
When to reach for it: When you track sprint progress, understand burndown charts so you can visualize remaining work and predict whether your team will meet sprint goals.
C
Canary Release
Gradually rolling out changes to a small subset of users before full deployment.
Reduces blast radius by detecting issues early; typically 1-5% of traffic initially.
When to reach for it: When deploying risky changes, understand how to gradually roll out to a small user subset so you can detect issues early and limit the blast radius before affecting everyone.
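One common way to pick the small subset is to hash a stable user identifier into a bucket, so the same user always lands on the same side. The sketch below assumes a hypothetical `routes_to_canary` helper and made-up user IDs; production routing is usually done by a load balancer or service mesh rather than application code.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: float) -> bool:
    """Deterministically place a user in the canary cohort."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100


# Roughly 5% of a sample population should land on the canary.
users = [f"user-{i}" for i in range(10_000)]
share = sum(routes_to_canary(u, 5.0) for u in users) / len(users)
print(f"canary share: {share:.1%}")
```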
Cattle vs Pets
A metaphor for treating servers as disposable (cattle) vs unique and irreplaceable (pets).
Cloud-native mindset: servers are numbered, replaceable. Pets have names and are hand-maintained.
When to reach for it: When you scale infrastructure, understand the cattle-vs-pets metaphor so you can shift from hand-maintaining servers to treating them as replaceable resources.
CD (Continuous Delivery)
Keep software always releasable with automated quality gates.
The pipeline can produce a deployable artifact at any time; releases become low-risk and repeatable.
When to reach for it: When you want to ship changes without manual ceremony, understand the difference between CD and deployment so you know when you can release on-demand with confidence.
CFR (Change Failure Rate)
Percentage of deployments that cause a failure in production.
One of the four DORA metrics. Elite performers: 0-5%. Improved by testing, progressive delivery, and feature flags.
When to reach for it: When you track deployment quality, understand Change Failure Rate so you can quantify the impact of testing and quality gates on production stability.
Change Failure Rate
Percentage of deployments that cause a failure in production.
One of the four DORA metrics. Elite performers: 0-5%. Improved by testing, progressive delivery, and feature flags.
When to reach for it: When you measure deployment stability, understand Change Failure Rate so you can track the percentage of releases that cause production problems and identify where quality gates need strengthening.
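As a rough illustration of the arithmetic, the metric is failed deployments divided by total deployments over the same period. The function name and the sample counts below are assumptions for the example, not part of any DORA tooling.

```python
def change_failure_rate(deployments: int, failed: int) -> float:
    """Return the percentage of deployments that caused a production failure."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    if not 0 <= failed <= deployments:
        raise ValueError("failed must be between 0 and deployments")
    return 100.0 * failed / deployments


# Example: 2 failed deployments out of 40 in the period.
print(f"{change_failure_rate(40, 2):.1f}%")  # 5.0%
```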
Chaos Engineering
Intentionally injecting failures to test system resilience.
Tools: Chaos Monkey, LitmusChaos, Gremlin. Goal: build confidence in system behavior under failure.
When to reach for it: When you need to know how your system behaves under failure, understand chaos engineering so you can run controlled experiments that reveal weaknesses before customers do.
Checkov
A static analysis tool for infrastructure as code security.
Scans Terraform, CloudFormation, Kubernetes, and more for misconfigurations and compliance.
When to reach for it: When you deploy infrastructure-as-code to cloud platforms, understand IaC security scanning so you can catch misconfigurations before they reach production.
CI (Continuous Integration)
Frequently merge small changes and validate them automatically.
Typical signals: fast build + unit tests on every change; trunk-based development; reproducible builds.
When to reach for it: When you need to move fast without breaking things, understand CI's merge-small-changes-often pattern so you can catch integration problems early and keep your main branch deployable.
Cilium
eBPF-based networking, security, and observability for Kubernetes.
High-performance CNI. Service mesh without sidecars. Network policies and Hubble observability.
When to reach for it: When you need network security, observability, and service mesh capabilities in Kubernetes, understand eBPF-based networking so you can achieve high performance without sidecar injection.
CircleCI
A cloud-native CI/CD platform with powerful caching and parallelism.
YAML config, orbs for reuse, insights dashboard, and flexible compute options.
When to reach for it: When you run CI/CD pipelines on a hosted platform, understand cloud-native CI so you can parallelize builds and cache dependencies to speed up feedback.
Circuit Breaker
A pattern that prevents cascading failures by stopping calls to failing services.
States: closed, open, half-open. Gives failing services time to recover.
When to reach for it: When one service is failing and causing cascading failures in dependent services, understand circuit breaker so you can stop requests to the failing service and allow it time to recover.
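The three states can be sketched as a small state machine. This is a simplified illustration, not a production library: the `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Closed: calls pass. Open: calls are rejected. Half-open: one trial call."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call rejected")
            self.state = "half-open"  # timeout elapsed, allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Any success resets the breaker.
        self.failures = 0
        self.state = "closed"
        return result
```

While the circuit is open, callers fail fast with `RuntimeError` instead of waiting on the unhealthy service, which is what gives it time to recover.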
Cloud-Native
Designing applications to fully exploit cloud computing advantages.
Characteristics: containerized, dynamically orchestrated, microservices-oriented. CNCF definition.
When to reach for it: When you design new applications for cloud platforms, understand cloud-native principles so you can build systems that scale automatically and recover from failures.
CNAPP (Cloud-Native Application Protection Platform)
Unified security platform combining CSPM, CWPP, and application security.
Consolidates cloud security tools. Covers infrastructure, workloads, and code.
When to reach for it: When you need unified cloud security coverage across infrastructure and code, understand CNAPP as a consolidated platform that spans posture management, workload protection, and application security so you can reduce tool sprawl.
Code Review
The practice of having peers examine code changes before merging.
Catches bugs, improves code quality, shares knowledge. Can be human, AI-assisted, or both.
When to reach for it: When bugs slip past testing, understand code review so you can catch issues early and spread knowledge across the team.
CODEOWNERS
A file defining who must review changes to specific parts of a codebase.
Automatically requests reviews from designated owners. Ensures domain experts review relevant changes.
When to reach for it: When critical code changes surprise domain experts, understand CODEOWNERS so you can automatically route reviews to people who care most.
Cognitive Load
The mental effort required to operate or understand a system.
Team Topologies concept. Reducing cognitive load improves flow and reduces errors.
When to reach for it: When you design systems and teams, understand cognitive load as the mental effort required to operate or understand something so you can reduce unnecessary complexity and improve decision-making.
Complicated Subsystem Team
A team responsible for a component requiring specialist knowledge.
Examples: ML models, video codecs, cryptography. Shields complexity from other teams.
When to reach for it: When you work in an organization with specialized domains (ML models, video codecs, cryptography), understand team structure so you can allocate specialists to reduce cognitive load on stream-aligned teams.
Container
A lightweight, isolated environment for running applications.
Shares OS kernel, faster than VMs. Standard packaging format for cloud-native apps.
When to reach for it: When you need consistent runtime environments across development, testing, and production, understand containers so you can package applications with their dependencies for fast, portable deployment.
Container Registry
A repository for storing and distributing container images.
Examples: Docker Hub, GitHub Container Registry, AWS ECR, Azure ACR. Supports tags, scanning, signing.
When to reach for it: When you need to store and distribute container images across your infrastructure, understand container registries so you can ensure reproducible deployments and enable artifact scanning and signing.
containerd
An industry-standard container runtime.
Used by Docker and Kubernetes. Handles container lifecycle, storage, networking.
When to reach for it: When you need a container runtime for Kubernetes or standalone use, understand containerd so you can manage the full container lifecycle from image download to network and storage setup.
Context Engineering
The practice of optimizing what information is provided to AI models for better outcomes.
Critical for agentic workflows. Includes structuring prompts, managing context windows, and providing relevant project knowledge via AGENTS.md files or MCP servers.
When to reach for it: When AI agents hallucinate or miss context, understand context engineering so you can structure information to steer model behavior toward accuracy.
Continuous Deployment
Automatically deploy to production after passing quality checks.
Every change that passes automated tests is deployed to production without manual intervention.
When to reach for it: When every deployment requires courage, understand continuous deployment so you can turn deployments into a boring daily routine backed by passing tests.
Contract Testing
Testing that services honor their API contracts with consumers.
Tools: Pact, Spring Cloud Contract. Catches integration issues without full E2E tests.
When to reach for it: When services evolve independently, understand contract testing so you can catch breaking API changes early without running full E2E tests for every integration.
Conway's Law
Organizations design systems that mirror their communication structure.
"Inverse Conway Maneuver" deliberately shapes org structure to achieve desired architecture.
When to reach for it: When you design system architecture, understand how organization structure influences design so you can deliberately structure your teams to achieve desired architecture.
Cortex
An internal developer portal focused on service quality and ownership.
Service catalog, scorecards, CQL queries, and integrations with engineering tools.
When to reach for it: When you track service ownership and quality metrics, understand service quality platforms so you can measure reliability and drive accountability across engineering teams.
CQRS (Command Query Responsibility Segregation)
Separating read and write operations into different models.
Optimizes each side independently. Often combined with event sourcing.
When to reach for it: When you have read-heavy or write-heavy workloads with different optimization requirements, understand CQRS so you can separate read and write models and scale them independently.
CRD (Custom Resource Definition)
Extending the Kubernetes API with custom resource types.
Foundation for operators and GitOps tools. Declarative management of anything.
When to reach for it: When you extend Kubernetes functionality, understand CRDs so you can define custom resource types that enable declarative management of anything through the Kubernetes API.
Crossplane
A Kubernetes-native infrastructure control plane for managing cloud resources.
Extends the Kubernetes API to provision and manage any infrastructure. Enables platform teams to offer self-service infrastructure.
When to reach for it: When you build internal developer platforms, understand Kubernetes-native infrastructure composition so you can enable self-service infrastructure provisioning.
CSPM (Cloud Security Posture Management)
Continuous monitoring of cloud infrastructure for misconfigurations and compliance violations.
Tools: Prisma Cloud, Wiz, AWS Security Hub. Automated remediation and drift detection.
When to reach for it: When you operate cloud infrastructure at scale, understand CSPM as continuous scanning for configuration errors so you can detect misconfigurations before they become breaches.
Cumulative Flow Diagram (CFD)
A chart showing work items in different states over time.
Reveals bottlenecks, WIP limit violations, and flow problems. Key Kanban metric.
When to reach for it: When you analyze workflow metrics, understand cumulative flow diagrams so you can detect bottlenecks and visualize how work moves through your system.
CWPP (Cloud Workload Protection Platform)
Security for cloud workloads including VMs, containers, and serverless.
Runtime protection, vulnerability management, and compliance for cloud workloads.
When to reach for it: When you protect containers and virtual machines in cloud environments, understand CWPP as runtime security that monitors and blocks threats at the workload layer so you can defend against container breakouts and lateral movement.
Cycle Time
Time from starting work on a task to completing it.
Different from lead time (which includes queue time). Measures active work duration.
When to reach for it: When you measure team productivity, understand cycle time so you can distinguish active work duration from total lead time and identify where work stalls.
D
Dagger
A programmable CI/CD engine that runs pipelines in containers.
Write pipelines in your language of choice. Run locally or in any CI system. Created by Docker's founder.
When to reach for it: When you write CI/CD pipelines that must run identically locally and in CI systems, understand that Dagger containerizes pipeline execution so you can eliminate CI-specific quirks and test workflows on your machine.
DAST (Dynamic Application Security Testing)
Testing a running application for security vulnerabilities.
Tools: OWASP ZAP, Burp Suite. Finds issues that only appear at runtime.
When to reach for it: When you test running applications, understand DAST so you can identify security vulnerabilities that only appear during execution against live systems.
Datadog
A SaaS monitoring and analytics platform for cloud-scale applications.
Unified metrics, traces, and logs. APM, infrastructure monitoring, and security. Wide integrations.
When to reach for it: When you operate large-scale cloud infrastructure, understand that Datadog provides unified metrics, traces, and logs so you can correlate signals from different layers to diagnose issues quickly.
Deployment Frequency
How often code is deployed to production.
One of the four DORA metrics. Elite performers: multiple times per day. Higher frequency = smaller changes = lower risk.
When to reach for it: When reducing risk in production, measure how often you deploy so you can correlate deployment frequency with failure rates and understand whether smaller, more frequent changes improve stability.
Developer Experience (DevEx)
The overall experience developers have while building, testing, and shipping software.
Includes tooling, documentation, feedback loops, and cognitive load. High DevEx = higher productivity.
When to reach for it: When you evaluate your engineering organization, understand Developer Experience so you can identify friction points in tooling, feedback loops, and cognitive load that impact productivity.
DevOps
Union of people, process, and technology to enable continuous delivery of value.
Not a tool or team name. It's a culture and set of practices that breaks down silos between development and operations.
When to reach for it: When development and operations clash, understand DevOps as a culture of breaking silos so you can align teams around shared ownership of reliability.
DevSecOps
DevOps with security integrated as a shared responsibility throughout the lifecycle.
Security checks are automated in pipelines; threat modeling happens during design; everyone owns security.
When to reach for it: When security feels like a checklist at the end, understand DevSecOps so you can shift security left and make it everyone's responsibility from day one.
DF (Deployment Frequency)
How often code is deployed to production.
One of the four DORA metrics. Elite performers: multiple times per day. Higher frequency = smaller changes = lower risk.
When to reach for it: When you assess delivery velocity, understand Deployment Frequency so you can measure how often you deploy and correlate it with team performance and incident risk.
Distroless Images
Container images containing only the application and its runtime dependencies.
No shell, package manager, or OS utilities. Smaller attack surface.
When to reach for it: When you need to reduce container image size and attack surface, understand distroless images so you can deploy applications with only their runtime dependencies and no shell or package manager.
Docker
A platform for developing, shipping, and running applications in containers.
De facto standard for container images. Dockerfile, Docker Compose, Docker Hub.
When to reach for it: When you need to build and ship containerized applications, understand Docker so you can create reproducible container images and manage container workflows.
Domain-Driven Design (DDD)
A software design approach focused on modeling domains based on business reality.
Bounded contexts, ubiquitous language, aggregates. Helps define service boundaries.
When to reach for it: When designing architecture or defining service boundaries, understand Domain-Driven Design so you can align your system structure with business domains and use a shared vocabulary across teams.
DORA Metrics
Four key metrics for delivery performance: lead time, deployment frequency, change fail rate, MTTR.
Research-backed metrics that correlate with organizational performance. Used to benchmark and improve.
When to reach for it: When you measure delivery performance, understand DORA metrics so you can track the four research-backed indicators that correlate with organizational success.
DORA Team Profiles (2025)
Seven distinct team performance clusters identified in DORA 2025 research.
Profiles: Elite (7.5%), High Balanced (16%), Mid-level Balanced (11%), Mid-level Starting (15%), Low Throughput (17%), Low Stability (23%), Thrashing (10%). Different profiles need different improvement strategies.
When to reach for it: When you analyze your team's performance across velocity and stability, understand the seven profiles so you can identify which improvement strategies apply to your situation.
Dynatrace
An AI-powered software intelligence platform for observability and security.
Automatic discovery, AI root cause analysis (Davis AI), and full-stack monitoring.
When to reach for it: When you need automatic discovery and monitoring of complex applications, understand that Dynatrace uses AI for root cause analysis so you can detect anomalies without manually configuring thresholds.
E
E2E Testing (End-to-End)
Testing the entire application flow from the user's perspective.
Tools: Playwright, Cypress, Selenium. Slower but catches real user journey issues. Top of the testing pyramid.
When to reach for it: When you need to verify the entire user journey works as expected, understand E2E testing so you can catch issues that only appear when components interact in production-like conditions.
Enabling Team
A team that helps stream-aligned teams acquire new capabilities.
Focuses on research, guidance, and enablement. Temporary engagement, not long-term dependency.
When to reach for it: When you coach other teams through capability gaps, understand enabling teams as temporary partners that transfer knowledge so you can build internal expertise and then move on.
Ephemeral Environments
Short-lived environments created on-demand for testing or review.
Spun up per PR or feature branch, torn down after merge. Enables isolated testing.
When to reach for it: When you need to test pull requests in isolation, understand temporary environment patterns so you can enable preview testing and review without impacting shared infrastructure.
Error Budget
The allowed unreliability before you must prioritize stability over change.
Calculated as (1 - SLO). With a 99.9% SLO, you have a 0.1% error budget (~43 min of downtime allowed per 30-day month).
When to reach for it: When stability and velocity conflict, understand error budgets so you can quantify risk tolerance and align release decisions with reliability targets.
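The (1 - SLO) calculation can be worked through in a few lines. This sketch assumes a 30-day month and a hypothetical `error_budget_minutes` helper; adjust the period to match your own SLO window.

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability SLO."""
    budget_fraction = 1 - slo_percent / 100   # e.g. 99.9% SLO leaves 0.1%
    return budget_fraction * period_days * 24 * 60


print(f"{error_budget_minutes(99.9):.1f} min")  # about 43.2 minutes per 30 days
```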
Event Sourcing
Storing state as a sequence of events rather than current state.
Provides audit trail, enables temporal queries, supports event replay.
When to reach for it: When you need to track state changes and provide audit trails, understand event sourcing so you can store state as immutable events and replay them to reconstruct any historical state.
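A minimal sketch of replay, using a made-up bank-account example: the `Event` type, the event kinds, and the `replay` function are illustrative assumptions, and a real event store would also persist and version the events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn"
    amount: int


def replay(events, upto=None):
    """Rebuild the balance from the first `upto` events (all events if None)."""
    balance = 0
    for event in events[:upto]:
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrawn":
            balance -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return balance


log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
print(replay(log))          # current state: 75
print(replay(log, upto=2))  # historical state after two events: 70
```

The event list doubles as the audit trail, and replaying a prefix answers "what was the state at that point?" without storing snapshots.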
Event-Driven Architecture
A pattern where services communicate through asynchronous events.
Loose coupling, scalability, eventual consistency. Tools: Kafka, EventBridge, NATS.
When to reach for it: When you need to decouple services in a system, understand event-driven architecture so you can build scalable, loosely-coupled systems where services communicate through events rather than direct calls.
External Secrets Operator
Kubernetes operator that synchronizes secrets from external providers.
Pulls secrets from Vault, AWS Secrets Manager, etc. into Kubernetes secrets automatically.
When to reach for it: When you pull secrets from external systems into Kubernetes, understand secret synchronization so you can maintain a single source of truth for credentials across your cluster.
F
FaaS (Functions as a Service)
Running individual functions without managing servers.
Event-driven execution. Short-lived, stateless. Good for glue code and webhooks.
When to reach for it: When you have short-lived, event-driven workloads, understand Functions as a Service so you can execute individual functions without managing servers.
Feature Flags
Runtime configuration that allows enabling or disabling features without deploying new code.
Enables progressive rollout, A/B testing, kill switches, and decoupling deployment from release.
When to reach for it: When deployment and release must be decoupled, understand feature flags so you can deploy confidently without exposing incomplete work.
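As a sketch of decoupling deployment from release, both code paths ship together and a runtime lookup decides which one runs. The in-memory `FLAGS` dictionary and the `new_checkout` flag are hypothetical; real setups typically read flags from a config service or flag provider.

```python
# Hypothetical in-memory flag store; unknown flags default to off.
FLAGS = {"new_checkout": False}


def is_enabled(name: str) -> bool:
    return FLAGS.get(name, False)


def checkout(cart_total: float) -> str:
    # Both paths are deployed; the flag decides which one is released.
    if is_enabled("new_checkout"):
        return f"new flow: {cart_total:.2f}"
    return f"legacy flow: {cart_total:.2f}"


print(checkout(10))            # legacy flow: 10.00
FLAGS["new_checkout"] = True   # release (or kill) without a new deployment
print(checkout(10))            # new flow: 10.00
```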
FinOps
Financial operations practice for managing cloud costs with engineering, finance, and business collaboration.
Brings financial accountability to cloud spending. Key practices: tagging, showback/chargeback, rightsizing, reserved capacity, spot instances, and FinOps-informed architecture.
When to reach for it: When cloud bills surprise you, understand FinOps so you can bring financial accountability to engineering without killing innovation.
Flagger
A progressive delivery operator for Kubernetes.
Automates canary releases with Istio, Linkerd, or other service meshes. Prometheus-based analysis.
When to reach for it: When you deploy to Kubernetes and want to automate canary releases with traffic shifting, understand automated progressive delivery so you can reduce deployment risk through metric-driven rollouts.
Flow Efficiency
Ratio of active work time to total lead time.
Most organizations have 5-15% flow efficiency. Rest is wait time. Reveals improvement opportunities.
When to reach for it: When you measure process health, understand flow efficiency as the ratio of active work to total lead time so you can quantify waste and validate improvement efforts.
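The ratio is simple to compute once active time and lead time are tracked separately. The helper name and the sample hours below are assumptions for illustration.

```python
def flow_efficiency(active_hours: float, lead_time_hours: float) -> float:
    """Fraction of total lead time spent on active work."""
    if lead_time_hours <= 0:
        raise ValueError("lead time must be positive")
    if not 0 <= active_hours <= lead_time_hours:
        raise ValueError("active time must be between 0 and lead time")
    return active_hours / lead_time_hours


# Example: 6 hours of hands-on work in a 60-hour lead time; the rest is waiting.
print(f"{flow_efficiency(6, 60):.0%}")  # 10%
```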
Flux
A GitOps toolkit for Kubernetes, part of the CNCF.
Modular approach to GitOps. Integrates with Helm, Kustomize, and OCI registries.
When to reach for it: When you implement GitOps for Kubernetes, understand modular reconciliation tooling so you can manage infrastructure configuration through Git pull requests.
G
Game Day
A planned exercise to test incident response and system resilience.
Simulates failures in controlled conditions. Validates runbooks and team readiness.
When to reach for it: When you need to validate incident response procedures and team readiness, understand controlled failure simulation so you can identify gaps before production incidents occur.
Git Hooks
Scripts that run automatically at specific points in the Git workflow.
Pre-commit hooks for linting, commit-msg for format validation. Local enforcement of standards.
When to reach for it: When bad commits reach CI, understand git hooks so you can enforce standards locally before they leave a developer's machine.
GitHub Actions
GitHub's built-in CI/CD and automation platform.
YAML-based workflows triggered by events. Large marketplace of reusable actions.
When to reach for it: When you automate your CI/CD pipeline on GitHub, understand workflow as code so you can trigger deployments and tests from version control events.
GitHub Copilot
AI pair programmer that suggests code completions and entire functions.
Integrates with IDEs. Supports chat, code generation, and agent-based workflows. Powers Agentic DevOps.
When to reach for it: When you write code with AI assistance, understand pair programming with machine learning so you can generate boilerplate code and learn patterns from your codebase.
GitLab CI
Integrated CI/CD within the GitLab DevOps platform.
YAML pipelines, Auto DevOps, container registry, and security scanning built-in.
When to reach for it: When you build pipelines integrated with GitLab, understand native CI/CD so you can combine version control, code review, and automation in a single system.
GitOps
Using Git as the single source of truth for declarative infrastructure and applications.
Changes are made via pull requests; reconciliation loops ensure actual state matches desired state in Git.
When to reach for it: When manual deployments creep in, understand GitOps so you can use Git as your single source of truth and reconciliation loops to keep production aligned with intent.
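The reconciliation idea can be sketched in a few lines. This is an illustration only: `desired` stands in for state declared in Git, `actual` for what is running, and the `reconcile` and `apply` helpers are hypothetical names, not the API of any GitOps tool:

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Compare desired state with actual state and list the actions needed."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))  # drift: not declared in Git
    return actions


def apply(actions: list, desired: dict, actual: dict) -> None:
    """Mutate actual state so it matches what the actions describe."""
    for verb, name in actions:
        if verb == "delete":
            actual.pop(name)
        else:
            actual[name] = dict(desired[name])


desired = {"web": {"replicas": 3}, "api": {"replicas": 2}}
actual = {"web": {"replicas": 2}, "worker": {"replicas": 1}}
apply(reconcile(desired, actual), desired, actual)
```

A real controller runs this comparison continuously, so a manual change to the running system shows up as drift on the next pass and is reverted.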
Golden Path
An opinionated, well-supported way to build and ship software within an organization.
Reduces cognitive load, ensures consistency, and accelerates onboarding. Core to platform engineering.
When to reach for it: When you build internal developer platforms, understand the golden path concept so you can provide an opinionated, well-supported way that reduces cognitive load and accelerates shipping.
Golden Signals
Four key metrics for monitoring: latency, traffic, errors, and saturation.
From the Google SRE book. Provides a baseline for understanding service health.
When to reach for it: When you design monitoring for a service, understand golden signals so you can focus on the four metrics that matter most: latency, traffic, errors, and saturation.
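To make the four signals concrete, the sketch below derives them from a window of request records. The record shape (`duration_ms`, `status`), the use of CPU as the saturation resource, and the choice of a nearest-rank 95th percentile are all assumptions made for illustration:

```python
def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r["duration_ms"] for r in requests)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = max(1, (95 * len(durations) + 99) // 100)
    server_errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
        "saturation": cpu_used / cpu_capacity,
    }
```

In practice these values usually come from a metrics backend rather than raw records, but the definitions stay the same.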
Grafana
An open-source platform for monitoring and observability visualization.
Supports multiple data sources (Prometheus, Loki, etc.). Dashboards, alerts, and annotations.
When to reach for it: When you visualize operational metrics and logs from multiple sources, understand that Grafana provides dashboards and alert management so you can build shared observability across your infrastructure.
GreenOps
Sustainable IT operations focused on reducing environmental impact of technology.
Optimizing for carbon footprint, energy efficiency, and sustainable architecture. Includes carbon-aware scheduling, rightsizing, and choosing green cloud regions.
When to reach for it: When carbon footprint becomes a business metric, understand GreenOps so you can optimize for environmental impact alongside cost and performance.
Gremlin
An enterprise chaos engineering platform.
Controlled failure injection, game days, and reliability scoring. SaaS with agents.
When to reach for it: When you need to run reliability tests and game days across your infrastructure, understand controlled failure injection so you can build confidence in incident response and system resilience.
H
HashiCorp Vault
A tool for secrets management, encryption, and identity-based access.
Dynamic secrets, encryption as a service, PKI, and database credential rotation.
When to reach for it: When you manage sensitive credentials and encryption keys, understand centralized secret management so you can rotate credentials dynamically and enforce access policies across applications.
Helm
A package manager for Kubernetes using templated charts.
Enables reusable, versioned Kubernetes deployments. Supports values overrides and dependencies.
When to reach for it: When you manage multiple Kubernetes deployments with overlapping configuration, understand that Helm charts enable reusable, versioned packages so you can reduce duplication and simplify dependency management.
Human-in-the-Loop (HITL)
Design pattern requiring human approval or oversight at critical decision points.
Essential for Agentic DevOps. Humans approve significant changes while AI handles routine tasks. Balances automation speed with risk management.
When to reach for it: When you want to trust automation without abandoning control, understand HITL so you can let agents act fast on routine tasks while humans approve risky ones.
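One way to picture the pattern is an approval gate that routes changes by risk. The sketch below is hypothetical: the `Change` type, the two-level risk label, and the `approver` callback are assumptions chosen to keep the example small:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Change:
    description: str
    risk: str  # "low" or "high"; a hypothetical classification


def route(change: Change, approver: Optional[Callable[[Change], bool]] = None) -> str:
    """Apply low-risk changes automatically; hold high-risk ones for a human."""
    if change.risk == "low":
        return "auto-applied"
    if approver is None:
        return "pending-approval"  # nothing happens until a human decides
    return "applied" if approver(change) else "rejected"
```

The important property is the default: a high-risk change with no human decision stays pending rather than proceeding.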
Hybrid Cloud
Combining on-premises infrastructure with public cloud services.
Common for regulated industries, data sovereignty, or gradual migration. Requires consistent tooling.
When to reach for it: When you integrate on-premises systems with cloud services, understand hybrid cloud architecture so you can meet data residency requirements while gaining cloud benefits.
I
IaC (Infrastructure as Code)
Manage infrastructure using versioned, reviewable code.
Enables reproducibility, audit trails, and treating infrastructure changes like application changes.
When to reach for it: When infrastructure changes are snowflakes, understand IaC so you can version, review, and audit infrastructure changes like you do code.
IAST (Interactive Application Security Testing)
Real-time security testing using instrumentation within the running application.
Combines SAST and DAST benefits. Lower false positives. Tools: Contrast Security.
When to reach for it: When you need lower false positives in security scanning, understand IAST so you can use real-time instrumentation to distinguish actual vulnerabilities from test artifacts.
Immutable Artifacts
Build outputs that cannot be modified after creation.
Ensures reproducibility and auditability. Same artifact flows from dev to prod. Never overwrite tags.
When to reach for it: When you deploy code to production, understand immutable artifacts so you can guarantee the same binary flows from development through production without modification.
Immutable Infrastructure
Infrastructure that is replaced rather than modified in place.
Servers are never patched in place; they are replaced with new images. Ensures consistency and reproducibility.
When to reach for it: When you manage production servers, understand immutable infrastructure so you can eliminate configuration drift and ensure reproducible deployments.
Incident Commander
The person responsible for coordinating response during an incident.
Single point of coordination. Delegates tasks, communicates status, makes decisions.
When to reach for it: When managing incident response, understand the single command authority pattern so you can prevent conflicting actions and ensure coordinated resolution.
Inner Source
Applying open source development practices within an organization.
Enables cross-team collaboration, shared ownership, and reuse of internal code and standards.
When to reach for it: When knowledge silos form, understand inner source so you can apply open source practices internally and increase code reuse across teams.
Integration Testing
Testing how components work together.
Verifies interactions between modules, services, or external dependencies. Middle of testing pyramid.
When to reach for it: When your code depends on other modules or services, understand integration testing so you can verify those interactions work correctly before releasing to production.
Istio
A service mesh platform providing traffic management, security, and observability.
Sidecar-based architecture. mTLS, traffic splitting, circuit breaking. Complex but powerful.
When to reach for it: When you deploy microservices across Kubernetes and need traffic control, security policies, and tracing across services, understand service mesh architecture so you can manage cross-service communication without modifying application code.
J
Jaeger
An open-source distributed tracing system for monitoring microservices.
Helps with root cause analysis, service dependency analysis, and performance optimization.
When to reach for it: When you debug performance issues in microservices, understand that Jaeger traces requests across service boundaries so you can identify bottlenecks and understand latency distribution.
Jenkins
An open-source automation server for building, testing, and deploying.
Highly extensible with plugins. Jenkinsfile for pipeline-as-code. Mature and widely deployed; consider modern alternatives for greenfield projects.
When to reach for it: When you need a highly customizable CI/CD system, understand self-hosted automation so you can extend with plugins and integrate with legacy infrastructure.
K
Knative
A Kubernetes-based platform for deploying serverless workloads.
Serving (request-driven scale-to-zero) and Eventing (event-driven architecture).
When to reach for it: When you deploy serverless workloads on Kubernetes, understand Knative so you can run request-driven or event-driven services that scale to zero.
Kubeflow
A machine learning toolkit for Kubernetes.
ML pipelines, model training, serving, and experiment tracking on Kubernetes. CNCF project.
When to reach for it: When you need to operationalize machine learning pipelines on Kubernetes, understand ML infrastructure patterns so you can manage experiment tracking, training, and model serving at scale.
Kubernetes
An open-source container orchestration platform for automating deployment and scaling.
De facto standard for container orchestration. Provides declarative config, self-healing, scaling.
When to reach for it: When you need to orchestrate containerized workloads across machines, understand how Kubernetes provides declarative resource management and self-healing so you can focus on application logic instead of infrastructure operations.
Kueue
A Kubernetes-native job queueing system for batch and AI workloads.
Manages quotas, priorities, and fair-sharing for compute-intensive jobs. Key for AI infrastructure.
When to reach for it: When you run batch or AI jobs on Kubernetes with competing resource demands, understand job queueing and scheduling so you can optimize cluster resource utilization across workloads.
Kustomize
A Kubernetes-native configuration management tool.
Patch-based customization without templates. Built into kubectl. Good for environment-specific overlays.
When to reach for it: When you need to customize Kubernetes manifests for different environments without templating, understand that Kustomize uses overlay-based patching so you can keep base configs clean and maintain clarity.
Kyverno
A Kubernetes-native policy engine using YAML policies.
Validates, mutates, and generates Kubernetes resources. No new language to learn (unlike OPA/Rego).
When to reach for it: When you need to enforce policies across Kubernetes clusters, understand how policy-as-code works so you can validate and mutate resources declaratively without learning a new policy language.
L
LaunchDarkly
A feature management platform for controlling feature rollouts.
Feature flags, targeting, experimentation, and release management. Enterprise-grade.
When to reach for it: When you decouple deployments from feature releases, understand feature flag management so you can control rollouts independently and experiment safely in production.
Lead Time for Changes
Time from code commit to code running in production.
One of the four DORA metrics. Elite performers: under one day. Includes code review, CI, and deployment time.
When to reach for it: When optimizing deployment speed, measure lead time from code commit to production so you can identify bottlenecks in code review, CI, and deployment and reduce cycle time.
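A small sketch of the measurement, assuming you can export a (commit time, deploy time) pair for each change; the function name and the choice of the median as the summary statistic are assumptions for this example:

```python
from datetime import datetime
from statistics import median


def median_lead_time_hours(changes):
    """Median hours between commit and production deploy for a set of changes."""
    hours = [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    ]
    return median(hours)


# Hypothetical data: (commit time, production deploy time) per change.
changes = [
    (datetime(2025, 1, 1, 9), datetime(2025, 1, 1, 15)),  # 6 hours
    (datetime(2025, 1, 2, 9), datetime(2025, 1, 3, 9)),   # 24 hours
    (datetime(2025, 1, 3, 10), datetime(2025, 1, 3, 12)), # 2 hours
]
print(median_lead_time_hours(changes))  # prints 6.0
```

The median is less sensitive than the mean to a single change that sat in review for a week, which makes trends easier to read.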
Linkerd
A lightweight, security-focused service mesh for Kubernetes.
Simpler than Istio. Automatic mTLS, traffic metrics, and multi-cluster support.
When to reach for it: When you want a service mesh for Kubernetes but need something operationally simpler than Istio, understand lightweight mesh design so you can secure service-to-service traffic with minimal complexity.
LitmusChaos
A Kubernetes-native chaos engineering platform.
Chaos experiments as Kubernetes CRDs. Hub of pre-built experiments. CNCF incubating.
When to reach for it: When you need to test Kubernetes system reliability, understand chaos engineering patterns so you can inject controlled failures and validate that applications tolerate disruption.
Loki
A horizontally-scalable log aggregation system from Grafana Labs.
Like Prometheus, but for logs. Labels-based indexing, integrates with Grafana. Cost-effective at scale.
When to reach for it: When you aggregate logs at scale without the cost of traditional indexing, understand that Loki uses label-based indexing similar to Prometheus so you can reduce storage costs while maintaining queryability.
Low Stability Team
A DORA 2025 team profile that ships fast but with high failure rates.
About 23% of teams. Good throughput (deployment frequency, lead time) but poor stability (change failure rate). Need to focus on testing, quality gates, and progressive delivery.
When to reach for it: When a team ships frequently but has high failure rates, recognize the low stability pattern so you can focus on testing, quality gates, and progressive delivery to reduce deployment risk.
Low Throughput Team
A DORA 2025 team profile that is stable but slow to deliver.
About 17% of teams. Good stability (low failure rate) but slow delivery (infrequent deploys, long lead times). Need to focus on automation, CI/CD, and small batch sizes.
When to reach for it: When a team has low failure rates but slow deployment cycles, identify low throughput as the constraint so you can focus on automation, small batch sizes, and CI/CD to increase delivery frequency.
LT (Lead Time for Changes)
Time from code commit to code running in production.
One of the four DORA metrics. Elite performers: under one day. Includes code review, CI, and deployment time.
When to reach for it: When you evaluate pipeline speed, understand Lead Time for Changes so you can measure from commit to production and identify bottlenecks in your delivery process.
M
MCP (Model Context Protocol)
An open protocol for connecting AI models to external tools, data sources, and services.
Created by Anthropic, MCP enables AI agents to access real-time context, execute actions, and integrate with existing systems. Becoming the standard integration protocol for agentic workflows.
When to reach for it: When AI agents need to access your tools and services, understand MCP so you can connect models to real-time data without rebuilding integrations.
Microservices
An architecture style where applications are composed of small, independent services.
Each service is deployable independently, owns its data, and communicates via APIs.
When to reach for it: When your team grows and single services develop conflicting requirements, understand microservices so you can decompose the system in ways that allow teams to move independently.
MLflow
An open-source platform for managing the ML lifecycle.
Experiment tracking, model registry, deployment, and reproducibility. Language-agnostic.
When to reach for it: When you run multiple ML experiments and need to track results reproducibly, understand ML lifecycle management so you can version models, compare experiments, and deploy with confidence.
MLOps
DevOps practices applied to machine learning model lifecycle management.
Includes model versioning, experiment tracking, automated training pipelines, model serving, monitoring for drift, and A/B testing. Tools: MLflow, Kubeflow, Weights & Biases.
When to reach for it: When ML models become tech debt, understand MLOps so you can treat model lifecycles like code lifecycles with versioning, testing, and deployment discipline.
Modular Monolith
A monolith with clear module boundaries that could be split into services later.
Best of both worlds: single deployment with clear separation. Good stepping stone.
When to reach for it: When you want the simplicity of a monolith but need a clear upgrade path, understand modular monolith so you can build with strong module boundaries that can become services if needed.
Monolith
An application architecture where all components are part of a single deployable unit.
Not inherently bad. Simpler to develop and deploy initially. "Monolith-first" is a valid strategy.
When to reach for it: When building a new service, understand monolith so you can make an informed choice about whether a single deployable unit matches your team's constraints and growth trajectory.
MTBF (Mean Time Between Failures)
Average time between system failures.
Higher MTBF = more reliable system. Improved by chaos engineering, testing, and resilience patterns.
When to reach for it: When you assess system reliability, understand MTBF so you can measure stability improvements from resilience patterns and testing.
mTLS (Mutual TLS)
Both client and server authenticate each other using certificates.
Standard in service meshes. Ensures both parties are who they claim to be.
When to reach for it: When you secure communication between services in a mesh, understand mTLS as bidirectional certificate authentication so you can ensure both endpoints verify each other's identity before exchanging data.
MTTR (Mean Time to Restore)
Average time to restore service after an incident.
One of the four DORA metrics. Elite performers: under 1 hour. Key driver: detection + runbooks.
When to reach for it: When you measure incident response capability, understand MTTR so you can benchmark speed to recovery and track improvements in detection and runbook quality.
Multi-Cloud
Using services from multiple cloud providers.
Avoids vendor lock-in, leverages best-of-breed services. Increases complexity and requires abstraction.
When to reach for it: When you evaluate cloud vendors, understand multi-cloud strategy so you can avoid lock-in while managing added complexity from multiple platforms.
Multi-Stage Builds
Docker builds that use multiple FROM statements to create smaller final images.
Build in one stage, copy artifacts to minimal runtime stage. Reduces image size.
When to reach for it: When you build container images, understand Docker's multi-stage pattern so you can reduce final image size by copying only necessary artifacts to a minimal runtime stage.
N
Namespace (Kubernetes)
A mechanism for isolating groups of resources within a Kubernetes cluster.
Enables multi-tenancy, resource quotas, and RBAC scoping. Not a security boundary.
When to reach for it: When you need to isolate groups of resources in a cluster, understand Kubernetes namespaces so you can enable multi-tenancy, apply resource quotas, and scope RBAC permissions.
NATS
A lightweight, high-performance messaging system for cloud-native applications.
Simple pub/sub, request/reply, and streaming (JetStream). Low latency, easy to operate.
When to reach for it: When you need lightweight, low-latency messaging for cloud-native systems, understand NATS so you can implement pub/sub, request/reply, and streaming patterns simply.
New Relic
An observability platform for monitoring application and infrastructure performance.
Full-stack observability, AI-powered insights, and broad language/framework support.
When to reach for it: When you need full-stack observability across applications and infrastructure, understand that New Relic aggregates traces, metrics, and logs with AI-powered analysis so you can reduce mean-time-to-detection.
O
Observability
The ability to understand system state from its external outputs (logs, metrics, traces).
Beyond monitoring: enables debugging unknown-unknowns. Three pillars: logs, metrics, traces.
When to reach for it: When something unexpected breaks in production, understand observability so you can ask arbitrary questions about system behavior and find root cause without relying on predefined metrics.
OCI (Open Container Initiative)
Industry standards for container image format and runtime.
Ensures container portability across different runtimes and registries.
When to reach for it: When you need container portability across different runtimes and platforms, understand OCI so you can use standardized specifications for image format and runtime behavior.
On-Call Rotation
A schedule where team members take turns being available for urgent issues.
Typically 24/7 coverage. Key: fair distribution, good runbooks, and escalation paths.
When to reach for it: When you maintain 24/7 service coverage, understand rotation scheduling so you can distribute operational burden fairly and sustain team health.
OPA (Open Policy Agent)
A general-purpose policy engine for unified policy enforcement.
Uses Rego policy language. Can enforce policies on Kubernetes, APIs, Terraform, and more.
When to reach for it: When you need to enforce organizational policies across multiple systems, understand that OPA uses a general-purpose policy language so you can define rules once and apply them to Kubernetes, APIs, and infrastructure.
OpenTelemetry
A vendor-neutral standard for collecting telemetry data (traces, metrics, logs).
CNCF project. Provides SDKs, collectors, and exporters. Becoming the industry standard.
When to reach for it: When you instrument applications for observability, understand that OpenTelemetry provides vendor-neutral SDKs and protocols so you can avoid lock-in and switch backends without code changes.
OpenTofu
An open-source fork of Terraform maintained by the Linux Foundation.
Drop-in replacement for Terraform with community governance. Emerged after HashiCorp license change.
When to reach for it: When you need open-source infrastructure as code with community governance, understand the Terraform-compatible alternative so you can avoid proprietary licensing concerns.
Operator Pattern
A Kubernetes pattern for automating the management of complex applications.
Custom controllers that encode operational knowledge. Examples: database operators.
When to reach for it: When you manage complex applications on Kubernetes, understand the operator pattern so you can encode operational knowledge into custom controllers that automate management and scale.
P
Paved Road
Another term for golden path, a well-maintained default way to accomplish common tasks.
Teams can go off-road but must accept additional maintenance burden and risk.
When to reach for it: When you standardize development practices, understand the paved road concept so you can define well-maintained defaults that most teams follow while allowing off-road decisions.
Platform as a Product
Treating the internal developer platform as a product with users, roadmap, and feedback loops.
Platform team acts as product team; engineers are customers. Drives adoption and satisfaction.
When to reach for it: When you organize platform engineering teams, understand platform-as-product so you can treat infrastructure tooling as a user-focused product with roadmaps and feedback loops.
Platform Engineering
Building and maintaining internal developer platforms to improve developer experience and productivity.
Provides golden paths, self-service tooling, and abstractions so teams can ship faster without reinventing infrastructure.
When to reach for it: When teams reinvent infrastructure for every project, understand platform engineering so you can build golden paths that let teams move faster without becoming platform experts.
Platform Team
A team that provides internal platforms to reduce cognitive load for stream-aligned teams.
Treats platform as a product. Self-service, well-documented, with clear SLAs.
When to reach for it: When you invest in developer experience, understand platform teams as product teams that serve internal customers so you can provide self-service capabilities and reduce cognitive load across the organization.
Pod
The smallest deployable unit in Kubernetes, containing one or more containers.
Containers in a pod share network and storage. Typically one app container per pod.
When to reach for it: When you deploy applications to Kubernetes, understand the pod abstraction so you can manage the smallest deployable unit that contains one or more containers sharing network and storage.
Podman
A daemonless container engine compatible with Docker.
Rootless containers by default. Drop-in Docker replacement. Red Hat project.
When to reach for it: When you want rootless containers without a daemon dependency, understand Podman so you can use a drop-in Docker replacement with improved security and compatibility.
Policy-as-Code
Defining and enforcing organizational policies using code.
Tools: OPA/Rego, Kyverno, Sentinel. Enables automated compliance checking in pipelines.
When to reach for it: When you enforce organizational standards, understand Policy-as-Code so you can automate compliance checks in your pipelines and infrastructure.
Port
An internal developer portal platform for building self-service experiences.
Software catalog, scorecards, self-service actions, and workflow automation. SaaS-based.
When to reach for it: When you build self-service capabilities for your platform, understand developer portal platforms so you can automate workflows and provide a catalog of reusable services.
Preview Environments
Temporary environments that allow stakeholders to review changes before merge.
Often tied to pull requests. Enables early feedback from product, design, and QA.
When to reach for it: When you want feedback from stakeholders on code changes, understand preview environments so you can ship with confidence knowing product, design, and QA have already reviewed the feature.
Progressive Delivery
Gradually releasing features using techniques like canaries, feature flags, and traffic shifting.
Combines deployment strategies with observability to minimize risk and maximize feedback.
When to reach for it: When minimizing release risk matters, understand how to combine deployment strategies with observability so you can gradually increase exposure while catching issues early.
Prometheus
An open-source monitoring and alerting toolkit optimized for reliability.
Pull-based metrics collection, PromQL query language, integrates with Grafana for visualization.
When to reach for it: When you need metrics from applications and infrastructure, understand that Prometheus scrapes time-series metrics using a pull model so you can perform real-time alerting and query-driven troubleshooting.
Psychological Safety
Team environment where members feel safe to take risks and be vulnerable.
Google's Project Aristotle found it to be the #1 predictor of high-performing teams. Enables learning from failures.
When to reach for it: When you build high-performing teams, understand psychological safety as an environment where members can speak up without fear so you can accelerate learning and catch more problems early.
Pull Request (PR)
A request to merge code changes from one branch into another.
Enables code review, discussion, and automated checks before merging. Also called Merge Request (MR).
When to reach for it: When code review happens ad-hoc, understand PRs so you can formalize change discussion and automated checks before anything merges to main.
Pulumi
Infrastructure as code using general-purpose programming languages.
Write IaC in TypeScript, Python, Go, C#, etc. Full IDE support, testing frameworks, and type safety.
When to reach for it: When you write infrastructure as code, understand programming-language-native approaches so you can reuse development patterns and tooling.
R
RabbitMQ
An open-source message broker supporting multiple messaging protocols.
AMQP, MQTT, STOMP. Good for task queues and traditional messaging patterns.
When to reach for it: When you need reliable asynchronous communication between services, understand RabbitMQ so you can implement task queues and message routing with multiple protocols.
RASP (Runtime Application Self-Protection)
Security technology that runs inside an application to detect and block attacks.
Protects in real time without code changes. Can block SQL injection and XSS at runtime.
When to reach for it: When you defend applications in production, understand RASP so you can detect and block attacks from inside the running process without code changes.
Rolling Update
Incrementally updating instances of an application without downtime.
Old instances are replaced one by one or in small batches; traffic shifts gradually to the new version.
When to reach for it: When updating live services, understand how to replace instances incrementally so you can shift traffic gradually to new versions without downtime.
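The incremental replacement can be sketched as a simple simulation. Here the fleet is a list of version labels, and `batch_size` is an assumed parameter standing in for how many instances an orchestrator replaces at once:

```python
def rolling_update(instances, new_version, batch_size=1):
    """Replace instances batch by batch, recording the fleet after each step."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fleet = list(instances)
    history = []
    for start in range(0, len(fleet), batch_size):
        for i in range(start, min(start + batch_size, len(fleet))):
            fleet[i] = new_version  # old instance replaced by the new version
        history.append(list(fleet))
    return history


for step in rolling_update(["v1", "v1", "v1"], "v2"):
    print(step)
# prints:
# ['v2', 'v1', 'v1']
# ['v2', 'v2', 'v1']
# ['v2', 'v2', 'v2']
```

At every intermediate step some instances still serve the old version, which is why both versions must be able to run side by side during the update.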
Runbook
Documentation with step-by-step instructions for handling common scenarios.
Reduces MTTR by codifying knowledge. Should be tested and kept up-to-date.
When to reach for it: When responding to operational incidents, understand documented procedures so you can reduce mean time to resolution and prevent knowledge silos.
S
Saga Pattern
Managing distributed transactions across multiple services.
Choreography (events) or orchestration (coordinator). Each step has a compensating action.
When to reach for it: When you need to coordinate transactions across multiple independent services, understand the saga pattern so you can manage distributed transactions with compensating actions instead of traditional ACID.
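A minimal sketch of the orchestration variant, assuming each step is a hypothetical `(name, action, compensate)` triple of plain callables. When a step fails, the coordinator runs the compensating actions of the completed steps in reverse order:

```python
def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log
```

This sketch assumes compensations themselves succeed; a production coordinator would also need to persist its progress and retry compensations that fail.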
SAST (Static Application Security Testing)
Analyzing source code for security vulnerabilities without executing it.
Tools: SonarQube, CodeQL, Semgrep. Runs in IDE or CI pipeline.
When to reach for it: When you integrate security into development, understand SAST so you can scan source code for vulnerabilities before compilation or runtime execution.
SBOM (Software Bill of Materials)
A complete inventory of components in a software artifact.
Required for supply chain transparency. Formats: SPDX, CycloneDX. Enables vulnerability tracking.
When to reach for it: When you track software artifacts, understand SBOM so you can maintain an inventory of components and quickly identify which products are affected by vulnerabilities.
SCA (Software Composition Analysis)
Identifying vulnerabilities and license issues in third-party dependencies.
Tools: Snyk, Dependabot, Trivy. Critical for supply chain security.
When to reach for it: When you manage third-party dependencies, understand SCA so you can identify known vulnerabilities and license compliance issues in your supply chain.
Secret Scanning
Detecting credentials, API keys, and other secrets in code or configuration.
Tools: GitLeaks, TruffleHog, GitHub Secret Scanning. Critical to prevent credential leaks.
When to reach for it: When you protect credentials, understand Secret Scanning so you can detect and prevent API keys and secrets from being committed to version control.
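To illustrate the basic mechanism, here is a sketch of a pattern-based scanner. The two rules are illustrative only: one matches the well-known `AKIA` prefix format of AWS access key IDs, the other a PEM private key header. The `RULES` table and `scan` helper are assumptions for this example, and real scanners ship far larger rule sets plus entropy checks:

```python
import re

# Illustrative rules only; not a complete or authoritative rule set.
RULES = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan(text: str):
    """Return (line number, rule name) for every line matching a rule."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((line_no, rule_name))
    return findings
```

Run as a pre-commit hook or CI step, a check like this can fail the build when `scan` returns any findings.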
Semantic Versioning (SemVer)
A versioning scheme using MAJOR.MINOR.PATCH format.
MAJOR = breaking changes, MINOR = backward-compatible new features, PATCH = backward-compatible bug fixes. Communicates compatibility.
When to reach for it: When you publish libraries or manage dependencies, understand semantic versioning so you can communicate compatibility guarantees and automate safe upgrades.
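The scheme can be sketched as parse, compare, and bump operations. This example deliberately handles only plain MAJOR.MINOR.PATCH strings and ignores pre-release and build metadata; the `parse` and `bump` helpers are illustrative names:

```python
import re

# Plain MAJOR.MINOR.PATCH only; pre-release and build metadata are out of scope.
_SEMVER = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse(version: str) -> tuple:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints that compares correctly."""
    match = _SEMVER.fullmatch(version)
    if match is None:
        raise ValueError(f"not a plain MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(part) for part in match.groups())


def bump(version: str, part: str) -> str:
    """Increment one component and reset the lower-order components to zero."""
    major, minor, patch = parse(version)
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(parse("1.10.0") > parse("1.9.3"))  # prints True
print(bump("1.4.2", "minor"))            # prints 1.5.0
```

Comparing parsed tuples rather than raw strings matters: as strings, "1.10.0" sorts before "1.9.3".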
Semgrep
A fast, open-source static analysis tool for finding bugs and vulnerabilities.
Pattern-based rules in YAML. Supports many languages. Good for custom security rules.
When to reach for it: When you need custom static analysis rules for your codebase, understand pattern-based code analysis so you can detect application-specific bugs and security patterns efficiently.
Serverless
A cloud execution model where the provider manages server infrastructure.
Pay per use, auto-scaling to zero. Examples: AWS Lambda, Azure Functions, Google Cloud Run.
When to reach for it: When you want to reduce infrastructure management overhead, understand serverless architecture so you can deploy code that auto-scales to zero, runs on-demand, and charges per use.
Service Mesh
Infrastructure layer for handling service-to-service communication.
Provides observability, traffic control, security (mTLS). Examples: Istio, Linkerd, Cilium.
When to reach for it: When your microservices grow beyond a handful and you need consistent traffic control, security, and observability across service calls, understand service mesh so you can enforce policies without modifying application code.
Shift Left
Moving feedback and controls (testing, security, validation) earlier in the lifecycle.
Finding issues earlier is cheaper. Examples: SAST in IDE, security review at design.
When to reach for it: When you design quality processes, understand Shift Left so you can catch defects and security issues earlier in development where they cost less to fix.
Shift Right
Extending testing and monitoring into production environments.
Chaos engineering, canary analysis, production observability. Complements shift-left.
When to reach for it: When you build resilient systems, understand Shift Right so you can extend testing and monitoring into production to catch issues that only appear at scale.
Sidecar Pattern
Deploying a helper container alongside the main application container.
Common in Kubernetes. Used for logging, proxying, secrets injection, and service mesh.
When to reach for it: When you need to add cross-cutting concerns like logging, proxying, or security to containers without modifying the application, understand the sidecar pattern so you can keep your main application focused and simple.
Sigstore
A project for signing, verifying, and protecting software supply chains.
Keyless signing with Cosign, transparency logs with Rekor. Makes signing easy and auditable.
When to reach for it: When you sign and verify software artifacts, understand Sigstore so you can add cryptographic signatures without managing private keys manually.
SLA (Service Level Agreement)
A contractual commitment about service reliability, often with financial penalties.
SLAs are typically less stringent than internal SLOs to provide a buffer.
When to reach for it: When customer expectations about uptime are unclear, understand SLAs so you can set contractual terms backed by penalties and internal buffers.
SLI (Service Level Indicator)
A quantitative measure of service behavior (e.g., latency, error rate, throughput).
The raw metric used to calculate whether an SLO is being met.
When to reach for it: When you're unsure if you're meeting your SLO, understand SLIs so you can measure the actual behavior driving reliability decisions.
SLO (Service Level Objective)
A measurable reliability target for a service (e.g., 99.9% success rate).
Defines "good enough" reliability; enables data-driven prioritization between features and stability.
When to reach for it: When reliability feels like an infinite goal with no trade-offs, understand SLOs so you can define acceptable reliability and allocate your error budget.
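The arithmetic linking an SLI, an SLO, and an error budget can be sketched as follows. The event counts are hypothetical, and the SLI is assumed to be a simple ratio of good events to total events:

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(
    slo_target: float, good_events: int, total_events: int
) -> float:
    """Fraction of the error budget left; negative means the budget is overspent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_failures = (1 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1 - actual_failures / allowed_failures
```

With a 99.9% target over 1,000,000 requests, the budget is about 1,000 failed requests; 250 failures leave roughly 75% of it unspent.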
SLSA (Supply-chain Levels for Software Artifacts)
A framework for ensuring the integrity of software artifacts throughout the supply chain.
SLSA v1.0 defines Build levels 0-3 with increasing assurance; the earlier v0.1 draft used levels 1-4. Includes provenance, build integrity, and source requirements.
When to reach for it: When you strengthen supply chain integrity, understand SLSA so you can implement a maturity framework that prevents artifact tampering and ensures build provenance.
Snyk
A developer-first security platform for finding and fixing vulnerabilities.
SCA, container scanning, IaC security, and code analysis. Integrates with developer workflows.
When to reach for it: When you need to integrate security scanning into your development workflow, understand developer-first vulnerability management so you can fix supply chain and code vulnerabilities before release.
SonarQube
A platform for continuous inspection of code quality and security.
Static analysis, code coverage, security hotspots, and quality gates. Self-hosted or cloud.
When to reach for it: When you need to enforce code quality standards and security practices across your team, understand continuous code inspection so you can maintain consistent quality gates and reduce technical debt.
SOPS
Secrets OPerationS, a tool for encrypting secrets in files.
Supports AWS KMS, GCP KMS, Azure Key Vault, and age. Enables GitOps with encrypted secrets.
When to reach for it: When you version encrypted secrets in Git, understand file-level encryption so you can integrate secret management into GitOps workflows while keeping secrets encrypted at rest.
Spec-Driven Development
Writing detailed specifications before implementation, often to guide AI code generation.
AI tools can generate code from specs. Risk: reverting to waterfall anti-patterns with big-bang releases. Best used with incremental delivery.
When to reach for it: When using AI to generate code, understand how detailed specifications guide generation so you can get better results while avoiding waterfall anti-patterns and maintaining incremental delivery.
SPIFFE/SPIRE
Standards and tools for workload identity in distributed systems.
SPIFFE defines the identity format; SPIRE is the reference implementation. Zero-trust foundations.
When to reach for it: When you assign cryptographic identity to workloads across distributed systems, understand SPIFFE and SPIRE so you can authenticate workloads without static credentials, automate certificate lifecycle management, and enable zero-trust networks.
Split
A feature delivery platform combining feature flags with observability.
Feature flags, A/B testing, and feature impact metrics in one platform.
When to reach for it: When you want to measure feature impact alongside experimentation, understand integrated feature delivery so you can track how releases affect user behavior and system performance.
SRE (Site Reliability Engineering)
A discipline that applies software engineering to operations problems.
Pioneered by Google. Focuses on reliability through SLOs, error budgets, toil reduction, and treating operations as a software problem. SREs are engineers, not ops.
When to reach for it: When ops work feels endless, understand SRE so you can use error budgets and toil reduction to treat reliability as an engineering problem with engineering solutions.
Strangler Fig Pattern
Gradually replacing a legacy system by routing traffic to new components.
Named after strangler fig trees. Low-risk migration strategy.
When to reach for it: When you need to migrate from a legacy system without risky big-bang rewrites, understand the strangler fig pattern so you can gradually replace the old system by routing traffic to new components.
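A minimal routing sketch of the idea, assuming migration proceeds one URL path prefix at a time (the prefixes and backend names are hypothetical). In practice this decision usually lives in a proxy or API gateway rather than in application code:

```python
# Prefixes already served by the new system; the list grows as migration proceeds.
MIGRATED_PREFIXES = ("/orders", "/invoices")


def route(path: str) -> str:
    """Send migrated paths to the new system and everything else to the legacy one."""
    for prefix in MIGRATED_PREFIXES:
        if path == prefix or path.startswith(prefix + "/"):
            return "new"
    return "legacy"
```

Once every prefix is listed, no traffic reaches the legacy system and it can be retired.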
Stream-Aligned Team
A team aligned to a flow of work from a segment of the business domain.
Primary team type in Team Topologies. Delivers value end-to-end with minimal dependencies.
When to reach for it: When you organize teams around delivery, understand stream-aligned teams as squads responsible for a complete value stream so you can reduce handoffs and accelerate feedback.
Synthetic Monitoring
Simulating user interactions to proactively detect issues.
Runs scripted transactions against production. Detects issues before real users encounter them.
When to reach for it: When you need to detect user-facing outages before your customers notice, understand synthetic monitoring so you can continuously run scripted transactions and catch degradation early.
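A sketch of a single synthetic check, assuming the scripted transaction is wrapped in a callable that returns an HTTP status code (the `probe` parameter and the result fields are hypothetical). Keeping the probe injectable lets the check logic be exercised without a network:

```python
import time
from typing import Callable, Dict, Optional


def run_check(name: str, probe: Callable[[], int], max_latency_s: float) -> Dict:
    """Run one scripted probe and classify the result as healthy or not."""
    start = time.monotonic()
    status: Optional[int] = None
    error: Optional[str] = None
    try:
        status = probe()
    except Exception as exc:  # a failed probe is a finding, not a crash
        error = str(exc)
    latency_s = time.monotonic() - start
    healthy = error is None and status == 200 and latency_s <= max_latency_s
    return {
        "name": name,
        "status": status,
        "latency_s": latency_s,
        "error": error,
        "healthy": healthy,
    }
```

A scheduler would call `run_check` at a fixed interval and alert after consecutive unhealthy results.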
T
Team Topologies
A model for organizing teams based on flow and cognitive load.
Four team types: stream-aligned, enabling, complicated-subsystem, platform. Three interaction modes.
When to reach for it: When you structure teams for high velocity, understand Team Topologies as a framework that identifies team types and interaction patterns so you can align teams to value streams and minimize dependencies.
Tekton
A Kubernetes-native CI/CD framework for building pipelines.
Cloud-native, declarative pipelines as Kubernetes resources. Part of the Continuous Delivery Foundation (CDF).
When to reach for it: When you build CI/CD pipelines for Kubernetes-based systems, understand that Tekton defines pipelines as Kubernetes resources so you can use familiar kubectl tooling and run complex workflows declaratively.
Tempo
A high-scale distributed tracing backend from Grafana Labs.
Object-storage based, integrates with Grafana. Supports Jaeger, Zipkin, and OpenTelemetry.
When to reach for it: When you need to store and query distributed traces at high volume, understand that Tempo uses object storage instead of custom databases so you can reduce operational complexity and reuse existing infrastructure.
Terraform
An infrastructure as code tool using declarative configuration files.
Multi-cloud support, state management, module ecosystem. Industry standard for IaC.
When to reach for it: When you provision cloud infrastructure, understand declarative infrastructure as code so you can version, review, and reproducibly manage resources.
Testing Pyramid
A model for balancing test types: many unit tests, fewer integration tests, fewest E2E tests.
Ensures fast feedback. Unit tests catch most defects quickly and cheaply; E2E tests catch the problems that only appear when components work together.
When to reach for it: When you allocate resources to testing, understand the testing pyramid so you can balance fast feedback with comprehensive coverage and keep test suites quick and maintainable.
Thrashing Team
A DORA 2025 team profile characterized by poor performance across all metrics.
About 10% of teams. High failure rates, slow delivery, slow recovery. Need fundamental improvements to foundations before AI can help. Opposite of Elite profile.
When to reach for it: When you diagnose a struggling team's problems, understand Thrashing as a profile with poor metrics across all dimensions so you can address root causes before chasing quick fixes.
Throughput
The number of work items completed in a given time period.
Measures team delivery rate. Track trends rather than absolute numbers.
When to reach for it: When you evaluate delivery performance, understand throughput so you can track the rate at which your team completes work over time.
Toil
Manual, repetitive, automatable work that scales with service size.
SRE goal: keep toil below 50% of each engineer's time. Automate or eliminate toil to focus on engineering.
When to reach for it: When you assess operations work, understand toil so you can identify repetitive manual tasks and prioritize automation to free time for engineering.
Tracing
Tracking requests as they flow through distributed systems.
Enables root cause analysis across services. Standards: OpenTelemetry, W3C Trace Context.
When to reach for it: When requests span multiple services, understand tracing so you can see the path each request takes and identify which service introduced latency or errors.
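To make the propagation mechanism concrete, here is a sketch of the W3C Trace Context `traceparent` header (version `00`: a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2-hex-digit flags). The helper names are hypothetical; in practice an OpenTelemetry SDK would usually manage this header for you.

```python
import re
import secrets

# version 00 - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Keep the trace ID, mint a new span ID for the outgoing call."""
    match = TRACEPARENT.match(incoming)
    if match is None:
        raise ValueError(f"invalid traceparent: {incoming!r}")
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Because every hop reuses the same trace ID, a tracing backend can stitch the spans from all services into one request path.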
Trivy
A comprehensive security scanner for containers, filesystems, and infrastructure.
Scans for vulnerabilities, misconfigurations, secrets, and license issues. Fast and easy to use.
When to reach for it: When you need to scan container images and infrastructure code for vulnerabilities in your CI/CD pipeline, understand vulnerability scanning tools so you can detect and remediate security issues early.
Trunk-Based Development
A branching strategy where developers merge small changes frequently to a single main branch.
Reduces merge conflicts, enables continuous integration, and supports high-velocity delivery.
When to reach for it: When long-lived branches become merge nightmares, understand trunk-based development so you can merge small changes frequently and keep CI flowing.
Twelve-Factor App
A methodology for building modern, scalable, cloud-native applications.
Twelve principles including config in env vars, stateless processes, dev/prod parity. Foundation for cloud-native.
When to reach for it: When you build software-as-a-service applications, understand the twelve-factor methodology so you can ensure consistent behavior across development, staging, and production.
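The "config in env vars" principle can be sketched as below. The variable names `DATABASE_URL` and `DEBUG` are assumptions chosen for illustration:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    database_url: str
    debug: bool


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Read all settings from the environment so the same build runs everywhere."""
    return Config(
        database_url=env["DATABASE_URL"],  # required: fail fast if missing
        debug=env.get("DEBUG", "false").lower() in ("1", "true"),
    )
```

The same artifact then behaves differently in development, staging, and production purely through its environment, which supports dev/prod parity.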
Two-Pizza Teams
Teams small enough to be fed by two pizzas (typically 6-10 people).
Amazon concept. Smaller teams have faster communication and decision-making.
When to reach for it: When you organize engineering teams, understand the relationship between team size and decision velocity so you can maintain communication efficiency and autonomy.
U
Unit Testing
Testing individual components or functions in isolation.
Fast, focused, numerous. Foundation of the testing pyramid. Run on every commit.
When to reach for it: When you write code, understand unit testing so you can catch bugs fast with feedback in milliseconds and document the intended behavior of individual functions.
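A small example using Python's standard `unittest` module. The `slugify` function is a hypothetical unit under test, included only so the example is self-contained:

```python
import unittest


def slugify(title: str) -> str:
    """Lowercase a title and join its words with hyphens."""
    return "-".join(title.lower().split())


class SlugifyTest(unittest.TestCase):
    def test_joins_words_with_hyphens(self):
        self.assertEqual(slugify("Trunk Based Development"), "trunk-based-development")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  Shift   Left "), "shift-left")

    def test_empty_title_gives_empty_slug(self):
        self.assertEqual(slugify(""), "")
```

Saved to a file, these tests can be run with `python -m unittest`. Each test doubles as documentation of one intended behavior.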
Unleash
An open-source feature flag management system.
Self-hosted or cloud. Supports gradual rollouts, A/B testing, and kill switches.
When to reach for it: When you need an open-source feature flag system, understand self-hosted feature management so you can control release strategies without vendor lock-in.
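A generic sketch of how a gradual rollout can be decided deterministically per user. This illustrates percentage bucketing in general and is not Unleash's actual algorithm or API; the function name and the SHA-256 hashing scheme are assumptions.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percentage: int) -> bool:
    """Place each user in a stable bucket 0-99 and enable the flag below the cutoff."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percentage
```

Because the bucket depends only on the flag and user ID, raising the percentage keeps earlier users enabled, and setting it to 0 acts as a kill switch.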
V
Value Stream
The end-to-end flow from idea to value delivered to users.
Mapping value streams reveals waste, handoffs, and bottlenecks. Core to lean/DevOps improvement.
When to reach for it: When you analyze how value flows through your organization, understand value streams as the complete path from idea to deployed software so you can identify and eliminate waste.
Value Stream Mapping (VSM)
A lean technique for visualizing and analyzing the flow of work.
Identifies wait times, handoffs, and waste. Calculates flow efficiency. Starting point for improvement.
When to reach for it: When you diagnose bottlenecks in your delivery process, understand VSM as a visualization technique that distinguishes active work from wait time so you can calculate flow efficiency and prioritize improvements.
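Flow efficiency is commonly computed as active time divided by total lead time (active plus wait). A sketch with hypothetical step data, in hours:

```python
from typing import List, Tuple

# (step name, active hours, wait hours before the next step) - hypothetical values
Step = Tuple[str, float, float]


def flow_efficiency(steps: List[Step]) -> float:
    """Share of total lead time spent on active work."""
    active = sum(active_h for _, active_h, _ in steps)
    wait = sum(wait_h for _, _, wait_h in steps)
    total = active + wait
    if total <= 0:
        raise ValueError("steps must contain some elapsed time")
    return active / total


value_stream: List[Step] = [
    ("develop", 2.0, 10.0),
    ("review", 1.0, 5.0),
    ("deploy", 1.0, 1.0),
]
```

Here 4 active hours out of 20 total give a flow efficiency of 20%, which points at the waits, not the work, as the first thing to improve.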
Vibe Coding
Rapid prototyping with AI assistance based on natural language descriptions.
Enables quick exploration and iteration. Requires human review and refinement. Useful for MVPs and proof-of-concepts.
When to reach for it: When exploring ideas quickly with AI, understand rapid prototyping based on natural language so you can validate concepts fast while knowing human review and refinement are essential.
W
War Room
A dedicated space (physical or virtual) for incident response coordination.
Centralizes communication during major incidents. Clear roles and real-time updates.
When to reach for it: When a major incident occurs, understand centralized coordination structures so you can enable real-time communication and decision-making under pressure.
Work in Progress (WIP)
The number of tasks currently being worked on.
Limiting WIP improves flow and reduces context switching. Core kanban concept.
When to reach for it: When you manage team capacity, understand WIP limits so you can reduce context switching and reveal bottlenecks in your workflow.
Y
You Build It, You Run It
Teams are responsible for operating and supporting what they build.
Aligns incentives for reliability. Teams feel the pain of poor design or operations.
When to reach for it: When you allocate operational responsibility, understand accountability alignment so you can drive reliability focus from the development side and reduce handoff friction.
Z
Zero Trust
Security model that requires verification for every access request, regardless of location.
Never trust, always verify. Applies to networks, identities, and workloads.
When to reach for it: When you design security architecture for distributed systems, understand that zero trust replaces network perimeter defense with continuous verification so you can prevent lateral movement and reduce breach impact.