Infrastructure as Code Patterns for Scale

November 21, 2022

Infrastructure as Code (IaC) has moved from nice-to-have to essential. But scaling IaC from a few resources to enterprise-grade infrastructure introduces real complexity: how you structure, test, and manage it determines whether IaC solves problems or becomes a new source of them.

Here are patterns that work at scale.

Structuring IaC

Module Design

module_principles:
  single_responsibility:
    - One module, one purpose
    - Example: VPC module, EKS module, RDS module
    - Avoid monolithic "kitchen sink" modules

  composability:
    - Modules can be combined
    - Clear inputs and outputs
    - Minimal interdependencies

  versioning:
    - Semantic versioning
    - Pin versions in consumers
    - Changelog for breaking changes

# Well-structured module
# modules/eks-cluster/main.tf
resource "aws_eks_cluster" "main" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids              = var.subnet_ids
    endpoint_private_access = var.endpoint_private_access
    endpoint_public_access  = var.endpoint_public_access
  }

  depends_on = [
    aws_iam_role_policy_attachment.cluster_policy,
  ]
}

# modules/eks-cluster/variables.tf
variable "cluster_name" {
  type        = string
  description = "Name of the EKS cluster"
}

variable "kubernetes_version" {
  type        = string
  default     = "1.24"
  description = "Kubernetes version"
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnet IDs for the cluster"
}

# modules/eks-cluster/outputs.tf
output "cluster_endpoint" {
  value       = aws_eks_cluster.main.endpoint
  description = "EKS cluster endpoint"
}

output "cluster_ca_certificate" {
  value       = aws_eks_cluster.main.certificate_authority[0].data
  description = "Cluster CA certificate"
}
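
Consumers then pin a released module version, so upgrades are deliberate rather than implicit. A sketch, assuming the modules live in a shared Git repository (the repo URL, tag, and module.vpc reference are illustrative):

# Consuming the module with a pinned version
module "eks" {
  source = "git::https://github.com/example-org/terraform-modules.git//eks-cluster?ref=v1.4.0"

  cluster_name       = "production-cluster"
  kubernetes_version = "1.24"
  subnet_ids         = module.vpc.private_subnet_ids
}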

Repository Structure

repository_patterns:
  monorepo:
    structure:
      - modules/        # Reusable modules
      - environments/   # Environment configs
        - production/
        - staging/
        - development/
      - policies/       # Compliance policies

    benefits:
      - Everything in one place
      - Easy cross-cutting changes
      - Single CI/CD pipeline

  polyrepo:
    structure:
      - terraform-modules/  # Shared modules repo
      - infra-production/   # Production configs
      - infra-staging/      # Staging configs

    benefits:
      - Team ownership
      - Independent lifecycles
      - Access control

State Management

state_management:
  backends:
    s3_dynamodb:
      state: S3 bucket
      locking: DynamoDB table
      encryption: SSE-S3 or KMS

    terraform_cloud:
      state: Terraform Cloud
      locking: Built-in
      collaboration: Built-in

  state_structure:
    per_environment:
      - production.tfstate
      - staging.tfstate
      - development.tfstate

    per_component:
      - production/networking.tfstate
      - production/compute.tfstate
      - production/database.tfstate

# Backend configuration
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "production/networking.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
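
Splitting state per component means one stack often needs another's outputs. A remote state data source covers this (a sketch; the subnet_ids output name is illustrative):

# In the compute stack: read the networking component's outputs
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "production/networking.tfstate"
    region = "us-east-1"
  }
}

# Used as: data.terraform_remote_state.networking.outputs.subnet_ids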

Environment Patterns

DRY Environments

dry_environments:
  problem:
    - Copy-paste between environments
    - Drift over time
    - Maintenance burden

  solution:
    - Shared modules
    - Environment-specific variables
    - Terragrunt or workspaces

# Using Terragrunt for DRY configs
# environments/production/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../modules//eks-cluster"
}

inputs = {
  cluster_name       = "production-cluster"
  kubernetes_version = "1.24"
  node_instance_type = "m5.xlarge"
  node_min_count     = 3
  node_max_count     = 10
}

# environments/staging/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../modules//eks-cluster"
}

inputs = {
  cluster_name       = "staging-cluster"
  kubernetes_version = "1.24"
  node_instance_type = "m5.large"  # Smaller
  node_min_count     = 1           # Fewer nodes
  node_max_count     = 3
}
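
The root terragrunt.hcl that both environments include can generate the backend configuration, deriving a unique state key per environment. A sketch, reusing the bucket and lock table names from earlier:

# environments/terragrunt.hcl (root)
remote_state {
  backend = "s3"

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }

  config = {
    bucket         = "company-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}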

Environment Promotion

promotion_pattern:
  workflow:
    1. Change merged to main
    2. Auto-deploy to development
    3. Manual promotion to staging
    4. Approval and deploy to production

  implementation:
    - Git branches per environment (anti-pattern)
    - Same code, different variables (better)
    - Promotion workflow in CI/CD
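
In GitHub Actions, these gates map naturally onto deployment environments with required reviewers: approving the environment is the promotion. A skeleton (job bodies and environment names are illustrative):

# Promotion via protected environments
on:
  push:
    branches: [main]

jobs:
  deploy-development:
    runs-on: ubuntu-latest
    environment: development   # deploys automatically on merge
    steps:
      - run: echo "terraform apply (development)"

  deploy-staging:
    runs-on: ubuntu-latest
    needs: deploy-development
    environment: staging       # required reviewer gates promotion
    steps:
      - run: echo "terraform apply (staging)"

  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production    # stricter approval rules
    steps:
      - run: echo "terraform apply (production)"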

Testing IaC

Validation Layers

testing_pyramid:
  static_analysis:
    - Syntax validation (terraform validate)
    - Linting (tflint)
    - Security scanning (checkov, tfsec)
    - Policy checks (OPA, Sentinel)

  unit_tests:
    - Module logic tests
    - Terraform test framework
    - Mock providers

  integration_tests:
    - Deploy to test environment
    - Verify resources created
    - Destroy after test

  end_to_end:
    - Full environment deployment
    - Application functionality
    - Rarely automated

# Example CI pipeline
name: Terraform CI

on:
  pull_request:
    paths:
      - 'terraform/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2

      - name: Format Check
        run: terraform fmt -check -recursive

      - name: Validate
        run: |
          cd terraform
          terraform init -backend=false
          terraform validate

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: tfsec
        uses: aquasecurity/tfsec-action@v1

      - name: checkov
        uses: bridgecrewio/checkov-action@v12

  plan:
    runs-on: ubuntu-latest
    needs: [validate, security]
    steps:
      - uses: actions/checkout@v3

      - name: Terraform Plan
        run: |
          terraform init
          terraform plan -out=plan.out

      - name: Upload Plan
        uses: actions/upload-artifact@v3
        with:
          name: terraform-plan
          path: plan.out

Policy as Code

# OPA policy: Require encryption
package terraform.aws.s3

deny[msg] {
  resource := input.resource.aws_s3_bucket[name]
  not resource.server_side_encryption_configuration
  msg := sprintf("S3 bucket '%s' must have encryption enabled", [name])
}

deny[msg] {
  resource := input.resource.aws_s3_bucket[name]
  resource.acl == "public-read"
  msg := sprintf("S3 bucket '%s' must not be public", [name])
}
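
These deny rules can run in CI with conftest, which parses HCL directly and evaluates OPA policies against it (a sketch; directory names are illustrative, and the policy's package must be selected via --namespace):

# Evaluate the S3 rules against Terraform files
conftest test terraform/ --policy policies/ --namespace terraform.aws.s3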

Handling Drift

Drift Detection

drift_management:
  detection:
    - Scheduled terraform plan
    - Compare plan output to expected
    - Alert on unexpected changes

  causes:
    - Manual changes (console, CLI)
    - External automation
    - Resource auto-scaling
    - Provider updates

  prevention:
    - Lock down console access
    - All changes through IaC
    - Education and culture

# GitHub Action for drift detection
name: Drift Detection

on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 08:00 UTC

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Terraform Plan
        id: plan
        run: |
          terraform init
          set +e
          terraform plan -detailed-exitcode -out=plan.out
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"

      - name: Check for Drift
        if: steps.plan.outputs.exitcode == '2'  # 2 = plan contains changes
        run: |
          echo "Drift detected!"
          terraform show plan.out
          # Send Slack notification

Secrets Management

secrets_patterns:
  avoid:
    - Secrets in terraform files
    - Secrets in state file
    - Hardcoded values

  approaches:
    external_secrets:
      - Reference from Vault/AWS Secrets Manager
      - Data source lookup
      - Injected at runtime

    sensitive_variables:
      - Mark as sensitive
      - Pass via environment
      - CI/CD secrets management
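
Marking a variable sensitive keeps it out of plan output (though not out of state), and the TF_VAR_ naming convention lets CI inject it from its own secret store. A minimal sketch:

# Set in CI via the TF_VAR_db_password environment variable
variable "db_password" {
  type      = string
  sensitive = true
}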

# Reference secrets from AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "production/database/password"
}

resource "aws_db_instance" "main" {
  identifier     = "production-db"
  engine         = "postgres"
  instance_class = "db.r5.large"

  # Note: values read via data sources are still written to the state
  # file -- encrypt state at rest and restrict who can read it.
  password = data.aws_secretsmanager_secret_version.db_password.secret_string

  lifecycle {
    ignore_changes = [password]  # Rotation handled outside Terraform
  }
}

Key Takeaways

Infrastructure as Code enables scale; these patterns keep it maintainable. Design small, composable, versioned modules; keep environments DRY with shared code and per-environment variables; split remote state by environment and component; layer static analysis, policy checks, and integration tests in CI; detect drift on a schedule; and keep secrets out of Terraform files by referencing an external secrets manager.