Terraform is the de facto standard for infrastructure as code. For small projects, it’s straightforward. At scale, with multiple teams, several environments, and hundreds of resources, plans slow down, changes collide, and a single mistake can reach far more infrastructure than intended. State management, module design, and workflow patterns become critical.
Here’s how to make Terraform work at scale.
The Scale Challenge
What Goes Wrong
terraform_scaling_problems:
  monolithic_state:
    - One state file for everything
    - Slow plans and applies
    - High blast radius
    - Team coordination issues
  copy_paste_config:
    - Duplicate code everywhere
    - Inconsistent configurations
    - Difficult to update
    - Bug propagation
  no_standards:
    - Every team does it differently
    - Naming inconsistencies
    - No review process
    - Security gaps
  manual_operations:
    - terraform apply from laptops
    - No audit trail
    - State conflicts
    - Credential exposure
State Management
State Isolation
# Split by environment and component
terraform/
├── environments/
│   ├── dev/
│   │   ├── networking/
│   │   ├── compute/
│   │   └── database/
│   ├── staging/
│   │   ├── networking/
│   │   ├── compute/
│   │   └── database/
│   └── prod/
│       ├── networking/
│       ├── compute/
│       └── database/
# Each component has its own state
# environments/prod/networking/backend.tf
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "prod/networking/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
Remote State Data Sources
# Reference outputs from other state files
data "terraform_remote_state" "networking" {
backend = "s3"
config = {
bucket = "company-terraform-state"
key = "prod/networking/terraform.tfstate"
region = "us-east-1"
}
}
# Use outputs
resource "aws_instance" "app" {
subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_id
vpc_security_group_ids = [
data.terraform_remote_state.networking.outputs.app_security_group_id
]
}
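The data source above can only read values that the networking configuration declares as root-level outputs. A minimal sketch of that side follows. The resource names `aws_subnet.private` and `aws_security_group.app` are assumptions for illustration, not taken from the original configuration.

```hcl
# environments/prod/networking/outputs.tf
# Assumption: aws_subnet.private and aws_security_group.app are hypothetical
# resource names; substitute whatever the networking component defines.
output "private_subnet_id" {
  description = "ID of the private subnet used by application instances"
  value       = aws_subnet.private.id
}

output "app_security_group_id" {
  description = "ID of the security group attached to application instances"
  value       = aws_security_group.app.id
}
```

Only declared outputs are exposed through `outputs`, but the consumer still needs read access to the whole state object, so treat read permission on the networking state as sensitive.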
State Locking
# DynamoDB for state locking (AWS)
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
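The lock table is only half of the backend; the state bucket itself also has to exist before any `terraform init` can use it. One possible sketch of that bucket, assuming AWS provider v4 or later (where versioning, encryption, and public access settings are separate resources) and a small bootstrap configuration that is applied once with local state:

```hcl
# Bootstrap configuration for the state bucket referenced in backend.tf.
# Assumes AWS provider v4+; the bucket name matches the backend examples.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"

  # Guard against accidental deletion of every team's state
  lifecycle {
    prevent_destroy = true
  }
}

# Keep old state versions so a bad apply can be rolled back
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest; without a key ID this uses the AWS managed KMS key
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State files often contain secrets; block every form of public access
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```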
Module Design
Module Structure
# modules/aws-ecs-service/
├── main.tf
├── variables.tf
├── outputs.tf
├── versions.tf
├── README.md
└── examples/
    ├── basic/
    └── complete/
# variables.tf - Clear interface
variable "name" {
description = "Name of the ECS service"
type = string
}
variable "container_image" {
description = "Docker image for the container"
type = string
}
variable "container_port" {
description = "Port exposed by the container"
type = number
default = 8080
}
variable "cpu" {
description = "CPU units for the task"
type = number
default = 256
}
variable "memory" {
description = "Memory in MiB for the task"
type = number
default = 512
}
variable "desired_count" {
description = "Number of tasks to run"
type = number
default = 2
}
variable "environment" {
description = "Environment variables for the container"
type = map(string)
default = {}
}
variable "secrets" {
description = "Secrets from SSM/Secrets Manager"
type = list(object({
name = string
valueFrom = string
}))
default = []
}
Module Versioning
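# Pin module versions (a module block takes exactly one source; use one option)
# Option 1: Git source, pinned to a tag with ?ref=
module "app_service" {
  source = "git::https://github.com/company/terraform-modules.git//aws-ecs-service?ref=v1.2.3"

  name            = "api"
  container_image = "api:${var.image_tag}"
  desired_count   = var.environment == "prod" ? 3 : 1
}

# Option 2: registry source, pinned with a version constraint
# (the version argument is only valid for registry sources)
module "app_service_registry" {
  source  = "company/ecs-service/aws"
  version = "~> 1.2"

  name            = "api"
  container_image = "api:${var.image_tag}"
  desired_count   = var.environment == "prod" ? 3 : 1
}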
Composable Modules
# Layer modules for common patterns
# modules/web-application/
# Composes: ECS service + ALB + CloudWatch + Autoscaling
module "web_application" {
source = "./modules/web-application"
name = "frontend"
container_image = "frontend:latest"
domain_name = "app.example.com"
# Scaling
min_capacity = 2
max_capacity = 10
# Resources
cpu = 512
memory = 1024
}
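Inside the composed module, the layering might look roughly like the sketch below. It covers only the service and autoscaling pieces and omits the ALB and CloudWatch resources. `var.cluster_name` is a hypothetical input naming an existing ECS cluster, the remaining `var.*` inputs are assumed to be declared in the module's `variables.tf`, and the sketch assumes the `aws-ecs-service` module exports a `service_name` output.

```hcl
# modules/web-application/main.tf (partial sketch)
# Assumptions: var.cluster_name is a hypothetical input, and the
# aws-ecs-service module exposes a service_name output.
module "service" {
  source = "../aws-ecs-service"

  name            = var.name
  container_image = var.container_image
  cpu             = var.cpu
  memory          = var.memory
  desired_count   = var.min_capacity # starting point; autoscaling adjusts it
}

# Register the service's desired count as a scalable target
resource "aws_appautoscaling_target" "service" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.cluster_name}/${module.service.service_name}"
  min_capacity       = var.min_capacity
  max_capacity       = var.max_capacity
}

# Scale between min and max to hold average CPU near the target
resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.name}-cpu-target"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.service.service_namespace
  scalable_dimension = aws_appautoscaling_target.service.scalable_dimension
  resource_id        = aws_appautoscaling_target.service.resource_id

  target_tracking_scaling_policy_configuration {
    target_value = 60 # illustrative target: average CPU utilization in percent

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```

Once autoscaling owns the task count, the inner service resource usually needs to ignore changes to its desired count, otherwise each apply resets whatever the scaler chose.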
Code Organization
Workspace vs. Directory Structure
approaches:
  directory_per_environment:
    structure:
      - environments/dev/
      - environments/staging/
      - environments/prod/
    pros:
      - Clear separation
      - Different configs per env
      - Independent state
    cons:
      - Duplication
      - Drift between environments
  workspaces:
    command: terraform workspace select prod
    pros:
      - Single codebase
      - Less duplication
    cons:
      - Harder to review differences between environments
      - One configuration for all environments; per-environment values need var files or terraform.workspace lookups
      - All workspaces share one backend and its access controls
recommendation: Directories for isolation, workspaces sparingly
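To make the workspace trade-off concrete, here is a minimal sketch of how per-environment values are typically selected when a single configuration serves every workspace. It assumes workspaces named `dev`, `staging`, and `prod`; the sizing values and the local module path are illustrative.

```hcl
variable "image_tag" {
  description = "Image tag to deploy"
  type        = string
}

# Assumption: workspaces are named "dev", "staging", and "prod".
locals {
  settings = {
    dev     = { desired_count = 1, cpu = 256 }
    staging = { desired_count = 2, cpu = 256 }
    prod    = { desired_count = 3, cpu = 512 }
  }

  # Errors out if the selected workspace has no entry in the map
  env = local.settings[terraform.workspace]
}

module "app_service" {
  source = "./modules/aws-ecs-service"

  name            = "api-${terraform.workspace}"
  container_image = "api:${var.image_tag}"
  desired_count   = local.env.desired_count
  cpu             = local.env.cpu
}
```

Every environment's values sit in one file, which keeps them easy to compare, but a change to the map affects whichever workspace happens to be selected at apply time. That is the main reason directories tend to be safer for production isolation.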
Terragrunt for DRY
# terragrunt.hcl - Parent config
remote_state {
backend = "s3"
config = {
bucket = "company-terraform-state"
key = "${path_relative_to_include()}/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
# environments/prod/app/terragrunt.hcl
include {
path = find_in_parent_folders()
}
terraform {
source = "../../../modules//web-application"
}
inputs = {
name = "api"
environment = "prod"
container_image = "api:v1.2.3"
desired_count = 3
}
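Terragrunt can also pass outputs between components, which removes the need for hand-written `terraform_remote_state` blocks. A sketch of a variant of the child configuration above, assuming a sibling unit at `environments/prod/networking` that exposes a `private_subnet_id` output, and a hypothetical `subnet_id` input on the module:

```hcl
# environments/prod/app/terragrunt.hcl (variant with a dependency)
include {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules//web-application"
}

# Assumption: environments/prod/networking is a sibling Terragrunt unit
# that exposes a private_subnet_id output.
dependency "networking" {
  config_path = "../networking"
}

inputs = {
  name            = "api"
  container_image = "api:v1.2.3"

  # subnet_id is a hypothetical module input used for illustration
  subnet_id = dependency.networking.outputs.private_subnet_id
}
```

Declaring the dependency also lets Terragrunt work out ordering when several components are applied together.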
CI/CD Integration
GitHub Actions Workflow
name: Terraform
on:
  pull_request:
    paths:
      - 'terraform/**'
  push:
    branches:
      - main
    paths:
      - 'terraform/**'
# Credentials are needed by init (S3 backend) as well as plan and apply
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
# Applies to every run step in both jobs; point it at the root module to run
defaults:
  run:
    working-directory: terraform/environments/prod
jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write # required to post the plan comment
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.3.0
      - name: Terraform Init
        run: terraform init -input=false
      - name: Terraform Validate
        run: terraform validate
      - name: Terraform Plan
        id: plan
        # No -out here: the apply job runs on a fresh runner and re-plans
        run: terraform plan -no-color -input=false
      - name: Comment Plan
        uses: actions/github-script@v6
        if: github.event_name == 'pull_request'
        env:
          # Pass the plan through the environment so backticks or ${...}
          # in the output cannot break the script
          PLAN: ${{ steps.plan.outputs.stdout }}
        with:
          script: |
            const output = `#### Terraform Plan
            \`\`\`
            ${process.env.PLAN}
            \`\`\`
            `;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            })
  apply:
    needs: plan
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.3.0 # same version as the plan job
      - name: Terraform Init
        run: terraform init -input=false
      - name: Terraform Apply
        run: terraform apply -auto-approve -input=false
Terraform Cloud
terraform {
cloud {
organization = "company"
workspaces {
name = "prod-infrastructure"
}
}
}
# Benefits:
# - Remote state management
# - Locking built-in
# - Run history and audit
# - Policy enforcement (Sentinel)
# - Variable management
Security and Compliance
Policy Enforcement
# Sentinel policy (Terraform Cloud/Enterprise)
import "tfplan/v2" as tfplan
# Require encryption on S3 buckets
s3_buckets = filter tfplan.resource_changes as _, rc {
rc.type is "aws_s3_bucket" and
rc.mode is "managed" and
rc.change.actions contains "create"
}
encryption_required = rule {
all s3_buckets as _, bucket {
bucket.change.after.server_side_encryption_configuration is not null
}
}
main = rule {
encryption_required
}
# OPA/Conftest alternative
# policy/s3.rego
package main
deny[msg] {
resource := input.resource_changes[_]
resource.type == "aws_s3_bucket"
resource.change.actions[_] == "create"
not resource.change.after.server_side_encryption_configuration
msg := sprintf("S3 bucket %s must have encryption enabled", [resource.address])
}
Secrets Management
# Don't hardcode secrets
# Bad
resource "aws_db_instance" "main" {
password = "supersecret123" # NO!
}
# Good - Use variables from CI/CD or Terraform Cloud
variable "db_password" {
type = string
sensitive = true
}
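# Better - Read the secret from Secrets Manager
# (the value is still written to state, so keep state encrypted and access-restricted)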
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "prod/database/password"
}
resource "aws_db_instance" "main" {
password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
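If the goal is to keep the password out of Terraform entirely, one option is to let RDS generate and store it. The sketch below assumes an AWS provider version recent enough to support `manage_master_user_password`; the identifier, engine, sizing, and username are illustrative values, not taken from the original.

```hcl
# Assumes an AWS provider version that supports manage_master_user_password.
# All names and sizes below are illustrative.
resource "aws_db_instance" "managed" {
  identifier        = "prod-main"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 50
  storage_encrypted = true

  username = "app_admin"

  # RDS creates the password and stores it in Secrets Manager;
  # the password argument must not be set alongside this flag.
  manage_master_user_password = true

  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-main-final"
}

# Applications look up the password at runtime using this ARN
output "db_master_secret_arn" {
  description = "ARN of the Secrets Manager secret holding the master password"
  value       = aws_db_instance.managed.master_user_secret[0].secret_arn
}
```

With this approach only the secret's ARN appears in configuration and state, not the password value.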
Key Takeaways
- Split state by environment and component to reduce blast radius
- Use remote state data sources to reference across state files
- Design modules with clear interfaces and version them
- Choose directory structure over workspaces for environment isolation
- Terragrunt can reduce duplication while maintaining separation
- Run Terraform in CI/CD, not from laptops
- Post plan output as a PR comment so reviewers see the exact changes
- Use Terraform Cloud or similar for state, locking, and audit
- Enforce policies with Sentinel or OPA
- Never hardcode secrets; use secrets management
Terraform at scale requires discipline. The patterns here prevent the chaos that comes from unstructured growth.