Infrastructure as Code (IaC) has become essential. Managing infrastructure manually doesn’t scale, isn’t reproducible, and is error-prone. But poorly done IaC creates its own problems: sprawling codebases, drift, and deployment nightmares.
Here’s how to do IaC well.
Foundational Principles
Declarative Over Imperative
Describe what you want, not how to get there:
# Declarative (Terraform)
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name = "web-server"
  }
}
# Imperative (scripts) - avoid: every run launches another instance
aws ec2 run-instances --image-id ami-0c55b159... --instance-type t3.micro
Declarative IaC:
- Describes desired state
- Handles creation, updates, and deletion
- Idempotent by design
- Self-documenting
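As a sketch of what "desired state" means in practice: to run three web servers instead of one, you change the declaration itself rather than writing a script that first counts what already exists. The `count` argument and per-instance tag below are illustrative additions to the example above.

```hcl
resource "aws_instance" "web" {
  # Desired state: three instances. Terraform compares this with what
  # already exists and plans only the difference; applying again with
  # no edits (and no drift) changes nothing.
  count         = 3
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name = "web-server-${count.index}"
  }
}
```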
Immutable Infrastructure
Replace, don’t modify:
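# Instance replacement on AMI change
resource "aws_launch_template" "web" {
  image_id = var.ami_id
  # Changing the AMI publishes a new template version; running instances are untouched
}

resource "aws_autoscaling_group" "web" {
  min_size            = 2
  max_size            = 4
  vpc_zone_identifier = var.subnet_ids # assumed input variable
  launch_template {
    id      = aws_launch_template.web.id
    version = aws_launch_template.web.latest_version
  }

  # Rolling replacement: a new template version triggers an instance refresh
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
    }
  }
}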
Benefits:
- Reproducible deployments
- Easy rollback (deploy previous version)
- No configuration drift
- Simpler troubleshooting
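The same replace-don't-modify rule applies to a standalone instance outside an Auto Scaling group. A minimal sketch, assuming an input variable `var.ami_id`: because changing an instance's AMI forces Terraform to replace it, `create_before_destroy` brings up the new instance before the old one is destroyed.

```hcl
resource "aws_instance" "web" {
  ami           = var.ami_id # assumed input; a new AMI forces replacement
  instance_type = "t3.micro"

  lifecycle {
    # Create the replacement first, then destroy the old instance
    create_before_destroy = true
  }
}
```

Rolling back is then just another apply with the previous AMI ID.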
Version Control Everything
All infrastructure code lives in Git:
infrastructure/
├── terraform/
│ ├── modules/
│ ├── environments/
│ └── global/
├── kubernetes/
│ ├── base/
│ └── overlays/
└── scripts/
Benefits:
- Change history
- Code review for infrastructure changes
- Rollback capability
- Audit trail
Repository Organization
Monorepo vs. Multi-Repo
Monorepo:
infrastructure/
├── modules/
│ ├── vpc/
│ ├── eks/
│ └── rds/
└── environments/
    ├── dev/
    ├── staging/
    └── production/
Pros: Easier refactoring, atomic changes across modules
Cons: Larger blast radius, complex CI/CD
Multi-Repo:
terraform-vpc/
terraform-eks/
terraform-rds/
terraform-env-dev/
terraform-env-prod/
Pros: Independent deployments, smaller blast radius
Cons: Harder to coordinate changes, version management overhead
Recommendation: Start with a monorepo and split it when the pain outweighs the benefit.
State Organization
Split state to limit blast radius:
terraform/
├── global/ # Account-wide resources (IAM, Route53)
│ └── main.tf
├── network/ # VPC, subnets
│ └── main.tf
├── data/ # Databases, caches
│ └── main.tf
└── services/
    ├── api/         # Per-service state
    └── web/
Each of these directories is its own root module with a separate state file:
- Changes are isolated
- Faster plan and apply times
- Easier to reason about
- Team ownership boundaries
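Separate states still need to share values. One common approach is the `terraform_remote_state` data source, which reads another state's root-module outputs. A sketch for `services/api`, assuming the network configuration keeps its state in the S3 bucket `terraform-state` under `network/terraform.tfstate` and declares an output named `vpc_id` (all three names are assumptions):

```hcl
# services/api/main.tf: read outputs published by the network state
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "terraform-state"           # assumed bucket name
    key    = "network/terraform.tfstate" # assumed key for the network state
    region = "us-east-1"
  }
}

resource "aws_security_group" "api" {
  name_prefix = "api-"
  vpc_id      = data.terraform_remote_state.network.outputs.vpc_id
}
```

Note that this gives the API configuration read access to the whole network state file, not just its outputs.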
Module Structure
modules/
├── vpc/
│ ├── main.tf # Resources
│ ├── variables.tf # Input variables
│ ├── outputs.tf # Output values
│ ├── versions.tf # Provider versions
│ └── README.md # Documentation
Standard structure makes modules predictable.
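A sketch of what `versions.tf` might contain. For reusable modules, Terraform's documentation recommends declaring only minimum versions and leaving tighter pins to the root module that calls them; the version numbers below are placeholders.

```hcl
# modules/vpc/versions.tf
terraform {
  required_version = ">= 1.3.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0"
    }
  }
}
```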
Module Design
Single Responsibility
Modules do one thing:
# Good - focused module
module "vpc" {
  source = "./modules/vpc"
  cidr   = "10.0.0.0/16"
  azs    = ["us-east-1a", "us-east-1b"]
}

module "eks" {
  source     = "./modules/eks"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
}

# Bad - kitchen sink module
module "infrastructure" {
  source = "./modules/everything"
  # Creates VPC, EKS, RDS, Redis, S3...
}
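Focused modules work because each exposes a small interface. The `eks` module above consumes `module.vpc.vpc_id` and `module.vpc.private_subnet_ids`, so the `vpc` module's `outputs.tf` might look like this, assuming it creates `aws_vpc.this` and a counted `aws_subnet.private` (both resource names are hypothetical):

```hcl
# modules/vpc/outputs.tf
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.this.id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets"
  value       = aws_subnet.private[*].id # splat over the counted subnets
}
```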
Sensible Defaults
Provide good defaults, allow override:
variable "instance_type" {
  type        = string
  default     = "t3.micro"
  description = "EC2 instance type"
}

variable "enable_monitoring" {
  type        = bool
  default     = true
  description = "Enable detailed monitoring"
}
Most callers need no customization, and power users can still override any value.
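Overriding is just a matter of setting the argument when calling the module. A sketch, assuming a hypothetical `./modules/web` module that declares the two variables above:

```hcl
module "web" {
  source = "./modules/web" # hypothetical module path

  instance_type = "m5.large" # override the default
  # enable_monitoring is omitted, so its default (true) applies
}
```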
Composition Over Inheritance
Build complex infrastructure from simple modules:
# Root module composes simple modules
module "vpc" {
  source = "./modules/vpc"
}

module "security_groups" {
  source = "./modules/security-groups"
  vpc_id = module.vpc.vpc_id
}

module "rds" {
  source            = "./modules/rds"
  subnet_ids        = module.vpc.database_subnet_ids
  security_group_id = module.security_groups.rds_sg_id
}
Version Your Modules
# Pin module versions
module "vpc" {
  source = "git::https://github.com/org/terraform-modules.git//vpc?ref=v1.2.0"
}

# Or with the Terraform Registry
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.0.0"
}
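Unpinned modules resolve to whatever is newest each time terraform init runs on a fresh checkout, so an upstream change can break your deployments unexpectedly.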
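For registry modules, the `version` argument also accepts constraints instead of an exact pin. A sketch using the pessimistic constraint operator, which here allows any 3.x release from 3.0 upward but never 4.0, so minor updates can arrive while a breaking major release cannot. The `version` argument is only supported for registry sources; Git sources are pinned with `?ref=` as above.

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0" # >= 3.0.0 and < 4.0.0
}
```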
Environment Management
Directory Per Environment
environments/
├── dev/
│ ├── main.tf
│ └── terraform.tfvars
├── staging/
│ ├── main.tf
│ └── terraform.tfvars
└── production/
    ├── main.tf
    └── terraform.tfvars
Each environment has its own state.
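Giving each environment its own state usually means a distinct backend key per directory. A sketch for `environments/production/main.tf`, assuming the S3 backend and a bucket named `terraform-state`; dev and staging would use their own keys, such as `dev/terraform.tfstate` (the key naming is an assumption):

```hcl
# environments/production/main.tf
terraform {
  backend "s3" {
    bucket = "terraform-state"
    key    = "production/terraform.tfstate" # unique key per environment
    region = "us-east-1"
  }
}
```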
Workspaces for Simple Cases
terraform workspace new staging
terraform workspace new production
terraform workspace select staging
terraform apply
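Workspaces work for simple cases, but every workspace shares the same code and backend configuration, and it is easy to lose track of which one is selected and apply to the wrong environment.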
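With workspaces, the only thing that differs between environments is the workspace name, exposed as `terraform.workspace`. A sketch that sizes resources per workspace, assuming an input variable `var.ami_id`; `lookup` falls back to one instance for any other workspace, including `default`:

```hcl
locals {
  # Per-workspace sizing; keys match the workspace names created above
  instance_counts = {
    staging    = 1
    production = 3
  }
  instance_count = lookup(local.instance_counts, terraform.workspace, 1)
}

resource "aws_instance" "web" {
  count         = local.instance_count
  ami           = var.ami_id # assumed input variable
  instance_type = "t3.micro"

  tags = {
    Name        = "web-${terraform.workspace}-${count.index}"
    Environment = terraform.workspace
  }
}
```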
Environment-Specific Configuration
# terraform.tfvars - environment-specific
environment    = "production"
instance_count = 3
instance_type  = "t3.large"

# main.tf - common infrastructure
module "app" {
  source         = "../../modules/app" # relative to environments/<env>/
  environment    = var.environment
  instance_count = var.instance_count
  instance_type  = var.instance_type
}
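Values in `terraform.tfvars` only bind to variables the configuration declares, and Terraform loads that file automatically from the working directory. The matching declarations in a `variables.tf` next to `main.tf` might look like this (the defaults are illustrative):

```hcl
# variables.tf - declarations that terraform.tfvars assigns values to
variable "environment" {
  type        = string
  description = "Environment name, e.g. dev, staging, production"
}

variable "instance_count" {
  type        = number
  default     = 1
  description = "Number of application instances"
}

variable "instance_type" {
  type        = string
  default     = "t3.micro"
  description = "EC2 instance type"
}
```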
Promoting Between Environments
# Dev → Staging → Production
# Same code, different variables
terraform -chdir=environments/dev apply
# Test...
terraform -chdir=environments/staging apply
# Test...
terraform -chdir=environments/production apply
CI/CD for Infrastructure
Plan on PR
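# .github/workflows/terraform.yml
on:
  pull_request:
    paths:
      - 'terraform/**'
permissions:
  contents: read
  pull-requests: write  # allow the plan comment
jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform  # assumed root module location
    steps:
      - uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2  # its wrapper exposes stdout as a step output
      # Cloud credentials (e.g., via OIDC) are omitted here
      - name: Terraform Init
        run: terraform init
      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color
        continue-on-error: true  # still post the comment if the plan fails
      - name: Comment Plan
        uses: actions/github-script@v6
        env:
          PLAN: ${{ steps.plan.outputs.stdout }}
        with:
          script: |
            // Post plan output as PR comment
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: 'Terraform plan:\n\n' + process.env.PLAN,
            });
      - name: Fail on plan error
        if: steps.plan.outcome == 'failure'
        run: exit 1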
Every PR now shows the infrastructure changes it would make before anyone merges it.
Apply on Merge
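on:
  push:
    branches: [main]
    paths:
      - 'terraform/**'
jobs:
  apply:
    runs-on: ubuntu-latest
    environment: production  # Waits for approval if the environment has required reviewers
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - name: Terraform Init
        run: terraform init
      - name: Terraform Apply
        run: terraform apply -auto-approve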
Drift Detection
Schedule regular plans to detect drift:
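on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
jobs:
  drift:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_wrapper: false  # let the shell see Terraform's own exit code
      - name: Terraform Init
        run: terraform init
      - name: Check for drift
        run: |
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present (drift)
          set +e  # the default bash shell runs with -e and would stop at exit code 2
          terraform plan -detailed-exitcode -no-color
          code=$?
          if [ "$code" -eq 2 ]; then
            echo "Drift detected!"
            # Alert
            exit 1
          fi
          exit "$code"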
Security Practices
Secrets Management
Never commit secrets:
# Bad
resource "aws_db_instance" "main" {
  password = "hardcoded-password" # Never!
}

# Good - use variables
resource "aws_db_instance" "main" {
  password = var.db_password
}

# Better - use a secrets manager (the value still ends up in state)
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "db-password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
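When you pass the password as a variable, declare it sensitive so Terraform redacts it in plan and apply output. A sketch, with the value supplied outside the code, for example through the `TF_VAR_db_password` environment variable:

```hcl
variable "db_password" {
  type        = string
  sensitive   = true # redacted in CLI output; still stored in state
  description = "Master password for the database"
}
```

`sensitive` does not keep the value out of the state file, which is why state security matters as much as keeping secrets out of Git.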
State Security
State files contain sensitive data:
# Remote state with encryption
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
- Encrypt state at rest
- Restrict state access
- Enable state locking
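The state bucket and lock table must exist before `terraform init` can use them, so they are often created by a small, separate bootstrap configuration. A sketch matching the backend settings above (AWS provider v4 or later): versioning keeps older state snapshots for recovery, public access is blocked outright, and the S3 backend expects the lock table's partition key to be a string attribute named `LockID`. Bucket names are globally unique, so `terraform-state` is a placeholder.

```hcl
resource "aws_s3_bucket" "state" {
  bucket = "terraform-state" # placeholder; must be globally unique
}

# Keep previous state versions so a bad write can be rolled back
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state objects at rest by default
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table used by the backend's dynamodb_table setting
resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```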
Least Privilege
IaC execution should have minimal permissions:
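# CI/CD role with limited permissions
resource "aws_iam_role" "terraform" {
  name               = "terraform-ci"
  assume_role_policy = var.ci_trust_policy # hypothetical: JSON trust policy for your CI identity
  # Grant only the permissions needed for the managed resources
}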
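Least privilege can start with the pipeline's own backend access. A sketch that grants the `terraform-ci` role only what the S3 backend above needs: read and write one state object, list its bucket, and manage the lock in `terraform-locks`. The ARNs mirror the backend example; the role still needs separately scoped permissions for the resources it manages, plus KMS key access if the bucket encrypts with a customer-managed key.

```hcl
data "aws_iam_policy_document" "state_access" {
  # Read and write only this configuration's state object
  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::terraform-state/prod/terraform.tfstate"]
  }

  # The S3 backend lists the bucket
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::terraform-state"]
  }

  # Acquire and release the state lock
  statement {
    actions = [
      "dynamodb:DescribeTable",
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:DeleteItem",
    ]
    resources = ["arn:aws:dynamodb:us-east-1:*:table/terraform-locks"]
  }
}

resource "aws_iam_role_policy" "terraform_state" {
  name   = "terraform-state-access"
  role   = aws_iam_role.terraform.id
  policy = data.aws_iam_policy_document.state_access.json
}
```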
Key Takeaways
- Use declarative IaC; describe desired state, not steps
- Prefer immutable infrastructure; replace instead of modify
- Version control all infrastructure code
- Split state to limit blast radius and enable team ownership
- Design focused, single-responsibility modules with sensible defaults
- Pin module versions to prevent unexpected changes
- Run terraform plan on PRs, apply on merge
- Schedule drift detection
- Never commit secrets; use secrets managers
- Encrypt state and restrict access
Infrastructure as Code is essential but requires discipline. Invest in good practices early; they’re much harder to adopt later.