Terraform at Scale: Patterns and Practices

Terraform is the de facto standard for infrastructure as code. For small projects, it’s straightforward. At scale, with multiple teams, several environments, and hundreds of resources, it becomes complex: state management, module design, and workflow patterns all become critical.

Here’s how to make Terraform work at scale.

The Scale Challenge

What Goes Wrong

terraform_scaling_problems:
  monolithic_state:
    - One state file for everything
    - Slow plans and applies
    - High blast radius
    - Team coordination issues

  copy_paste_config:
    - Duplicate code everywhere
    - Inconsistent configurations
    - Difficult to update
    - Bug propagation

  no_standards:
    - Every team does it differently
    - Naming inconsistencies
    - No review process
    - Security gaps

  manual_operations:
    - terraform apply from laptops
    - No audit trail
    - State conflicts
    - Credential exposure

State Management

State Isolation

# Split by environment and component
terraform/
├── environments/
│   ├── dev/
│   │   ├── networking/
│   │   ├── compute/
│   │   └── database/
│   ├── staging/
│   │   ├── networking/
│   │   ├── compute/
│   │   └── database/
│   └── prod/
│       ├── networking/
│       ├── compute/
│       └── database/

# Each component has its own state
# environments/prod/networking/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

Remote State Data Sources

# Reference outputs from other state files
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# Use outputs
resource "aws_instance" "app" {
  # Required arguments such as ami and instance_type are omitted for brevity
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_id
  vpc_security_group_ids = [
    data.terraform_remote_state.networking.outputs.app_security_group_id
  ]
}
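
The remote state lookup above only works if the networking component exports those values as root-level outputs. A minimal sketch of what that could look like follows; the resource names aws_subnet.private and aws_security_group.app are hypothetical and stand in for whatever the networking configuration actually defines.

```hcl
# environments/prod/networking/outputs.tf
# Assumption: aws_subnet.private and aws_security_group.app are hypothetical
# resources defined elsewhere in the networking component.

output "private_subnet_id" {
  description = "ID of the private subnet used by application instances"
  value       = aws_subnet.private.id
}

output "app_security_group_id" {
  description = "ID of the security group attached to application instances"
  value       = aws_security_group.app.id
}
```

Only outputs of the root module are visible through terraform_remote_state, so values produced inside child modules have to be re-exported at the root before other components can read them.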

State Locking

# DynamoDB for state locking (AWS)
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
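
The lock table is usually created alongside the state bucket itself. The following is a sketch of a bucket definition, assuming AWS provider 4.0 or later (where versioning and encryption are separate resources). The bucket name matches the backend examples; the resource labels and the choice of KMS encryption are assumptions.

```hcl
# State bucket referenced by the backend blocks.
# Assumption: AWS provider >= 4.0; resource labels are illustrative.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"

  # Guard against accidental deletion of the bucket that holds all state
  lifecycle {
    prevent_destroy = true
  }
}

# Versioning keeps earlier state files recoverable after a bad apply
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# State can contain secrets, so encrypt it at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Block every form of public access to the bucket
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

These bootstrap resources typically live in their own small configuration, since the backend they create cannot store its own state until it exists.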

Module Design

Module Structure

# modules/aws-ecs-service/
├── main.tf
├── variables.tf
├── outputs.tf
├── versions.tf
├── README.md
└── examples/
    ├── basic/
    └── complete/

# variables.tf - Clear interface
variable "name" {
  description = "Name of the ECS service"
  type        = string
}

variable "container_image" {
  description = "Docker image for the container"
  type        = string
}

variable "container_port" {
  description = "Port exposed by the container"
  type        = number
  default     = 8080
}

variable "cpu" {
  description = "CPU units for the task"
  type        = number
  default     = 256
}

variable "memory" {
  description = "Memory in MiB for the task"
  type        = number
  default     = 512
}

variable "desired_count" {
  description = "Number of tasks to run"
  type        = number
  default     = 2
}

variable "environment" {
  description = "Environment variables for the container"
  type        = map(string)
  default     = {}
}

variable "secrets" {
  description = "Secrets from SSM/Secrets Manager"
  type = list(object({
    name      = string
    valueFrom = string
  }))
  default = []
}
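
The module layout lists a versions.tf file that is not shown. A minimal sketch could look like the following; the version bounds are assumptions, not requirements of the module. The second block illustrates how a declaration such as desired_count might be tightened with a validation rule, where the rule itself is only an example.

```hcl
# modules/aws-ecs-service/versions.tf
# Assumption: the version bounds below are illustrative.
terraform {
  required_version = ">= 1.3.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0"
    }
  }
}

# variables.tf - would replace the plain desired_count declaration
# Assumption: the lower bound of 1 is an example rule, not a module requirement.
variable "desired_count" {
  description = "Number of tasks to run"
  type        = number
  default     = 2

  validation {
    condition     = var.desired_count >= 1
    error_message = "The desired_count value must be at least 1."
  }
}
```

Declaring lower bounds rather than exact versions in a shared module leaves the final pin to the calling configuration, and validation rules surface bad inputs at plan time instead of during apply.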

Module Versioning

# Pin module versions
module "app_service" {
  # From Git: pin with ?ref (the version argument is not supported for Git sources)
  source = "git::https://github.com/company/terraform-modules.git//aws-ecs-service?ref=v1.2.3"

  # Or from a registry: use these two lines instead of the Git source
  # source  = "company/ecs-service/aws"
  # version = "~> 1.2"

  name            = "api"
  container_image = "api:${var.image_tag}"
  desired_count   = var.environment == "prod" ? 3 : 1
}

Composable Modules

# Layer modules for common patterns
# modules/web-application/
# Composes: ECS service + ALB + CloudWatch + Autoscaling

module "web_application" {
  source = "./modules/web-application"

  name            = "frontend"
  container_image = "frontend:latest"
  domain_name     = "app.example.com"

  # Scaling
  min_capacity = 2
  max_capacity = 10

  # Resources
  cpu    = 512
  memory = 1024
}
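
Inside such a wrapper module, composition mostly means passing inputs through to lower-level modules and adding the glue resources. The sketch below is one hypothetical way to wire the ECS service module to autoscaling. It assumes variables.tf declares name, container_image, cpu, memory, min_capacity, and max_capacity, plus a hypothetical cluster_name variable, and that the ECS service is named after var.name. The ALB and CloudWatch pieces are left out.

```hcl
# modules/web-application/main.tf (hypothetical sketch)

# Reuse the lower-level service module instead of redefining ECS resources
module "service" {
  source = "../aws-ecs-service"

  name            = var.name
  container_image = var.container_image
  cpu             = var.cpu
  memory          = var.memory
  desired_count   = var.min_capacity
}

# Register the service as a scalable target.
# Assumption: var.cluster_name is a hypothetical input and the ECS service
# created by the module is named var.name.
resource "aws_appautoscaling_target" "this" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.cluster_name}/${var.name}"
  min_capacity       = var.min_capacity
  max_capacity       = var.max_capacity

  depends_on = [module.service]
}

# Scale on average CPU; the 60 percent target is an illustrative value
resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.name}-cpu"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.this.service_namespace
  scalable_dimension = aws_appautoscaling_target.this.scalable_dimension
  resource_id        = aws_appautoscaling_target.this.resource_id

  target_tracking_scaling_policy_configuration {
    target_value = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```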

Code Organization

Workspace vs. Directory Structure

approaches:
  directory_per_environment:
    structure:
      - environments/dev/
      - environments/staging/
      - environments/prod/
    pros:
      - Clear separation
      - Different configs per env
      - Independent state
    cons:
      - Duplication
      - Drift between environments

  workspaces:
    command: terraform workspace select prod
    pros:
      - Single codebase
      - Less duplication
    cons:
      - Harder to review differences
      - Same code path and variable definitions for every environment
      - All environments share one backend configuration

  recommendation: Directories for isolation, workspaces sparingly
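
To make the trade-off concrete, here is a sketch of the same service configured both ways. The two halves belong to different files and would not coexist in one configuration; the per-environment counts are illustrative values.

```hcl
# Directory style: environments/prod/main.tf
# Each environment directory holds its own explicit values.
module "app_service" {
  source  = "company/ecs-service/aws"
  version = "~> 1.2"

  name            = "api"
  container_image = "api:v1.2.3"
  desired_count   = 3
}

# Workspace style: a single shared main.tf
# Values are looked up by the name of the currently selected workspace.
locals {
  desired_count_by_workspace = {
    dev     = 1
    staging = 2
    prod    = 3
  }
}

module "app_service" {
  source  = "company/ecs-service/aws"
  version = "~> 1.2"

  name            = "api"
  container_image = "api:v1.2.3"
  desired_count   = local.desired_count_by_workspace[terraform.workspace]
}
```

In the directory style, a reviewer sees exactly what changes for prod in the diff. In the workspace style, the effect of a change depends on which workspace happens to be selected when the plan runs.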

Terragrunt for DRY

# terragrunt.hcl - Parent config
remote_state {
  backend = "s3"
  config = {
    bucket         = "company-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

# environments/prod/app/terragrunt.hcl
include {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules//web-application"
}

inputs = {
  name            = "api"
  environment     = "prod"
  container_image = "api:v1.2.3"
  desired_count   = 3
}
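
Terragrunt can also pass outputs between components, which covers the same need as the terraform_remote_state data source. The following sketch uses a dependency block; the input names subnet_id and security_group_ids are hypothetical and would have to match variables the module actually declares.

```hcl
# environments/prod/app/terragrunt.hcl (additional block)
dependency "networking" {
  # Relative path to the component whose outputs are needed
  config_path = "../networking"
}

# These entries would be merged into the existing inputs map,
# since a terragrunt.hcl file can define inputs only once.
# Assumption: subnet_id and security_group_ids are hypothetical variable names.
inputs = {
  subnet_id          = dependency.networking.outputs.private_subnet_id
  security_group_ids = [dependency.networking.outputs.app_security_group_id]
}
```

Declaring the dependency also lets Terragrunt work out the order in which to apply components when several are run together.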

CI/CD Integration

GitHub Actions Workflow

name: Terraform

on:
  pull_request:
    paths:
      - 'terraform/**'
  push:
    branches:
      - main
    paths:
      - 'terraform/**'

jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # needed to post the plan comment
    defaults:
      run:
        working-directory: terraform/environments/prod
    env:
      # Job-level so that terraform init can also reach the S3 backend
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v3

      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.3.0

      - name: Terraform Init
        run: terraform init

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        id: plan
        run: terraform plan -out=tfplan -no-color

      - name: Comment Plan
        uses: actions/github-script@v6
        if: github.event_name == 'pull_request'
        env:
          # Pass the plan through the environment so that backticks or ${...}
          # in the plan output cannot break the script
          PLAN: ${{ steps.plan.outputs.stdout }}
        with:
          script: |
            const output = `#### Terraform Plan
            \`\`\`
            ${process.env.PLAN}
            \`\`\`
            `;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

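  apply:
    needs: plan
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production
    defaults:
      run:
        working-directory: terraform/environments/prod
    env:
      # Job-level so that terraform init can also reach the S3 backend
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v3

      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.3.0  # same version as the plan job

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        # Plans again at apply time; the tfplan file from the plan job is not carried over
        run: terraform apply -auto-approve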

Terraform Cloud

terraform {
  cloud {
    organization = "company"

    workspaces {
      name = "prod-infrastructure"
    }
  }
}

# Benefits:
# - Remote state management
# - Locking built-in
# - Run history and audit
# - Policy enforcement (Sentinel)
# - Variable management

Security and Compliance

Policy Enforcement

# Sentinel policy (Terraform Cloud/Enterprise)
import "tfplan/v2" as tfplan

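# Require encryption on S3 buckets
# Assumes encryption is configured inline on aws_s3_bucket; AWS provider v4+
# moves it to the separate aws_s3_bucket_server_side_encryption_configuration resource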
s3_buckets = filter tfplan.resource_changes as _, rc {
    rc.type is "aws_s3_bucket" and
    rc.mode is "managed" and
    rc.change.actions contains "create"
}

encryption_required = rule {
    all s3_buckets as _, bucket {
        bucket.change.after.server_side_encryption_configuration is not null
    }
}

main = rule {
    encryption_required
}

# OPA/Conftest alternative
# policy/s3.rego
package main

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    resource.change.actions[_] == "create"
    not resource.change.after.server_side_encryption_configuration
    msg := sprintf("S3 bucket %s must have encryption enabled", [resource.address])
}

Secrets Management

# Don't hardcode secrets
# Bad
resource "aws_db_instance" "main" {
  password = "supersecret123"  # NO!
}

# Good - Use variables from CI/CD or Terraform Cloud
variable "db_password" {
  type      = string
  sensitive = true
}

# Better - Use secrets manager
# Note: the value is still written to state, so keep state encrypted and access-restricted
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/database/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
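
Secrets are often stored as a JSON document rather than a bare string. The sketch below shows one way to read such a secret, assuming it holds username and password keys. The secret name prod/database/credentials and the JSON layout are hypothetical.

```hcl
# Assumption: the secret name and its JSON structure are hypothetical,
# e.g. {"username": "app", "password": "..."}
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/database/credentials"
}

locals {
  # Decode the JSON payload once; the result inherits the sensitive marking
  db_credentials = jsondecode(data.aws_secretsmanager_secret_version.db.secret_string)
}

resource "aws_db_instance" "main" {
  # Other required arguments are omitted for brevity
  username = local.db_credentials["username"]
  password = local.db_credentials["password"]
}
```

Terraform redacts these values in plan output, but they are still recorded in the state file, which is one more reason to encrypt state and limit who can read it.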

Key Takeaways

Split state by environment and component to reduce blast radius
Use remote state data sources to reference across state files
Design modules with clear interfaces and version them
Choose directory structure over workspaces for environment isolation
Terragrunt can reduce duplication while maintaining separation
Run Terraform in CI/CD, not from laptops
Comment plans on PRs for review
Use Terraform Cloud or similar for state, locking, and audit
Enforce policies with Sentinel or OPA
Never hardcode secrets; use secrets management

Terraform at scale requires discipline. The patterns here prevent the chaos that comes from unstructured growth.