Serverless computing has matured significantly. AWS Lambda is now a proven production platform, and competitors like Azure Functions and Google Cloud Functions are viable options. But serverless has specific patterns that work well and anti-patterns that cause problems.
Here’s what we’ve learned from running serverless in production.
Patterns That Work
Event Processing
Serverless excels at event-driven workloads:
```python
# S3 trigger - process uploaded files
import boto3

s3 = boto3.client('s3')

def process_upload(event, context):
    # S3 put events carry the bucket and key in the Records list
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    # Fetch and process the uploaded file
    obj = s3.get_object(Bucket=bucket, Key=key)
    process(obj['Body'].read())
```
Good fits:
- File processing (upload → transform → store)
- Message queue processing
- Webhook handlers
- IoT data ingestion
- Log processing
Why it works:
- Natural event-driven model
- Scale automatically with event volume
- Pay only for processing time
- No idle resources
Scheduled Tasks
Cron jobs without managing servers:
```yaml
# CloudWatch Events trigger
functions:
  dailyReport:
    handler: reports.daily
    events:
      - schedule: cron(0 9 * * ? *)  # 9 AM UTC daily
```
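A minimal handler to pair with the dailyReport schedule might look like the following sketch. The report shape is illustrative; a real handler would write the report to S3 or send it by email rather than just returning it:

```python
import datetime

def daily(event, context=None):
    # Invoked once a day by the cron schedule; the event is the
    # EventBridge scheduled-event envelope and can usually be ignored
    today = datetime.date.today().isoformat()
    report = {'date': today, 'status': 'generated'}
    # A real handler would persist or send the report here
    return report
```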
Good fits:
- Report generation
- Data cleanup
- Batch processing
- Sync jobs
- Health checks
API Backends with Variable Load
APIs with unpredictable traffic:
```yaml
functions:
  api:
    handler: api.handler
    events:
      - http:
          path: /users
          method: get
      - http:
          path: /users/{id}
          method: get
```
Good fits:
- Internal tools with sporadic usage
- Startups with unpredictable growth
- APIs with spiky traffic patterns
- Prototypes and MVPs
Glue Logic
Small pieces connecting systems:
```python
# DynamoDB stream → Elasticsearch sync
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def sync_to_elasticsearch(event, context):
    for record in event['Records']:
        if record['eventName'] in ('INSERT', 'MODIFY'):
            # Stream images arrive in DynamoDB attribute format; deserialize first
            image = record['dynamodb']['NewImage']
            doc = {k: deserializer.deserialize(v) for k, v in image.items()}
            es.index(index='products', id=doc['id'], body=doc)
        elif record['eventName'] == 'REMOVE':
            doc_id = deserializer.deserialize(record['dynamodb']['Keys']['id'])
            es.delete(index='products', id=doc_id)
```
Good fits:
- Data sync between systems
- Notification fanout
- Format transformation
- Integration webhooks
Anti-Patterns to Avoid
Long-Running Processes
Lambda has execution time limits (15 minutes max):
```python
# Bad - might time out
def process_large_dataset(event, context):
    for record in get_all_records():  # millions of records
        process(record)
```
Solutions:
- Step Functions for orchestration
- SQS for work distribution
- EC2/ECS for long-running tasks
- Break into smaller chunks
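As a sketch of the chunking approach, assuming a hypothetical paginated data source, a handler can process one page per invocation and enqueue a continuation message carrying a cursor. An in-memory deque stands in for SQS here; production code would use boto3 to send and receive messages:

```python
from collections import deque

BATCH_SIZE = 3
queue = deque()   # stand-in for an SQS queue in this sketch
processed = []

def get_records_page(cursor, limit):
    # Hypothetical paginated source: returns (records, next_cursor)
    data = list(range(10))  # stand-in dataset
    start = cursor or 0
    page = data[start:start + limit]
    next_cursor = start + limit if start + limit < len(data) else None
    return page, next_cursor

def handler(event, context=None):
    records, next_cursor = get_records_page(event.get('cursor'), BATCH_SIZE)
    processed.extend(records)
    if next_cursor is not None:
        # Hand remaining work to a fresh invocation instead of risking a timeout
        queue.append({'cursor': next_cursor})

# Each dequeued message triggers a fresh, short-lived invocation
queue.append({})
while queue:
    handler(queue.popleft())
```

Each invocation stays well under the time limit regardless of total dataset size.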
Monolithic Functions
Single function doing everything:
```python
# Bad - monolithic handler
def handler(event, context):
    if event['path'] == '/users':
        if event['method'] == 'GET':
            return get_users()
        elif event['method'] == 'POST':
            return create_user()
    elif event['path'] == '/orders':
        ...  # hundreds more lines
```
Better:
- One function per operation
- Organized by domain
- Shared code in layers
VPC for Everything
VPC adds cold start latency (can add seconds):
```yaml
# Only use VPC when necessary
functions:
  publicApi:
    handler: public.handler
    # No VPC - faster cold starts
  databaseAccess:
    handler: private.handler
    vpc:
      securityGroupIds:
        - sg-123
      subnetIds:
        - subnet-456
```
Only use VPC when accessing private resources. For public APIs and services, avoid it.
Ignoring Cold Starts
Cold starts affect latency:
- Cold start: 500ms–5s (depends on runtime, VPC, etc.)
- Warm invocation: 10–50ms
Mitigation strategies:
```yaml
# Provisioned concurrency
functions:
  api:
    handler: api.handler
    provisionedConcurrency: 10

# Or keep containers warm with scheduled pings
functions:
  warmer:
    handler: warmer.handler
    events:
      - schedule: rate(5 minutes)
```
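On the function side, the target handler can detect a scheduled ping and return early so warm-up invocations skip the real work. This sketch assumes the ping arrives as a standard scheduled event (whose source field is 'aws.events'); the process function is a stand-in for the real work:

```python
def process(event):
    # Stand-in for the function's real work
    return {'handled': event.get('action')}

def handler(event, context=None):
    # Scheduled EventBridge pings carry source == 'aws.events';
    # return early so warm-up invocations don't run real work
    if event.get('source') == 'aws.events':
        return {'warmed': True}
    return process(event)
```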
Synchronous Call Chains
Lambda calling Lambda calling Lambda:
```python
# Bad - synchronous chain
def order_handler(event, context):
    user = lambda_client.invoke(FunctionName='get-user')
    inventory = lambda_client.invoke(FunctionName='check-inventory')
    payment = lambda_client.invoke(FunctionName='process-payment')
    # Each hop adds latency and a failure point
```
Better patterns:
- Async via SNS/SQS
- Step Functions for workflows
- Single function for simple logic
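A sketch of the async alternative: the order handler validates the request, emits a single event, and returns immediately, letting subscribers do the rest. The publish function here stands in for boto3's sns.publish; the topic and field names are illustrative:

```python
import json

published = []  # stand-in for an SNS topic; subscribers handle the rest

def publish(topic, message):
    published.append((topic, json.dumps(message)))

def order_handler(event, context=None):
    # Accept the order and emit one event instead of synchronously
    # invoking get-user, check-inventory and process-payment
    order = {'order_id': event['order_id'], 'customer_id': event['customer_id']}
    publish('order-created', order)
    return {'statusCode': 202, 'body': json.dumps(order)}
```

The handler's latency no longer depends on downstream functions, and a downstream failure becomes a retryable message rather than a failed request.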
Stateful Functions
Lambda functions are stateless:
```python
# Bad - state won't reliably persist
connection = None

def handler(event, context):
    global connection
    if connection is None:
        connection = create_connection()  # container may not be reused
```
Reality:
- Execution context may be reused (warm start)
- But you can’t rely on it
- Don’t store state that must persist
Use external state (DynamoDB, ElastiCache, S3) for anything important.
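One common shape for externalized state is an idempotency record keyed by request ID. A dict models the external store in this sketch; production code would use a DynamoDB conditional write (or ElastiCache) so the record survives across containers:

```python
STORE = {}  # models DynamoDB/ElastiCache; a real impl uses a conditional put

def handler(event, context=None):
    key = event['request_id']
    if key in STORE:
        # Duplicate delivery: return the recorded result, do no new work
        return {'status': 'duplicate', 'result': STORE[key]}
    result = {'payload': event['payload']}
    STORE[key] = result  # persist before acknowledging
    return {'status': 'ok', 'result': result}
```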
Cost Patterns
Understand the Pricing Model
Lambda charges two ways:
- Per request ($0.20 per million requests)
- Per GB-second of compute ($0.0000166667 per GB-second)

Cost = requests × $0.20/million + duration (seconds) × memory (GB) × $0.0000166667/GB-second
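The formula can be wrapped in a quick estimator (prices as listed above; free tier and data transfer ignored):

```python
REQUEST_PRICE = 0.20 / 1_000_000     # $ per request
GB_SECOND_PRICE = 0.0000166667       # $ per GB-second

def monthly_cost(requests, avg_duration_ms, memory_mb):
    # Cost = requests × request price + GB-seconds × compute price
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return requests * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

# 30M requests/month at 500ms and 512MB ≈ $131
print(round(monthly_cost(30_000_000, 500, 512), 2))
```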
Right-Size Memory
More memory = faster execution = might be cheaper:
```
128MB  × 1000ms = 0.125 GB-seconds ≈ $0.0000020833
512MB  × 300ms  = 0.15 GB-seconds  ≈ $0.0000025
1024MB × 150ms  = 0.15 GB-seconds  ≈ $0.0000025
```
Test different memory sizes. Sometimes more memory is more cost-effective.
High-Volume Can Be Expensive
At high volumes, serverless can cost more than traditional:
```
1 million requests/day × 500ms × 512MB
  = 250,000 GB-seconds/day
  ≈ $4.17/day
  ≈ $125/month (Lambda compute only)

vs. t3.medium: ~$30/month (always on)
```
Calculate breakeven. High-volume, consistent load often cheaper on containers.
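A rough breakeven check, using the per-invocation cost implied by the pricing model above (the $30/month instance price is the comparison figure from this section):

```python
GB_SECOND_PRICE = 0.0000166667   # $ per GB-second
REQUEST_PRICE = 0.20 / 1_000_000  # $ per request

def cost_per_invocation(duration_ms, memory_mb):
    return REQUEST_PRICE + (duration_ms / 1000) * (memory_mb / 1024) * GB_SECOND_PRICE

def breakeven_daily_requests(instance_monthly, duration_ms, memory_mb, days=30):
    # Daily volume above which the always-on instance is cheaper
    return instance_monthly / (cost_per_invocation(duration_ms, memory_mb) * days)

# 500ms at 512MB vs a ~$30/month t3.medium: roughly 229k requests/day
print(int(breakeven_daily_requests(30.0, 500, 512)))
```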
Operational Patterns
Structured Logging
CloudWatch Logs work best with structured logs:
```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    logger.info(json.dumps({
        'request_id': context.aws_request_id,
        'event': 'order_created',
        'order_id': order.id,
        'customer_id': customer.id,
        'total': order.total,
    }))
```
Then query with CloudWatch Insights:
```
fields @timestamp, order_id, total
| filter event = 'order_created'
| stats sum(total) by bin(1h)
```
Correlation IDs
Track requests across functions:
```python
import uuid

def handler(event, context):
    correlation_id = event.get('correlation_id') or str(uuid.uuid4())
    # Include the ID in every log line
    log = logger.bind(correlation_id=correlation_id)
    # Propagate to downstream calls
    invoke_downstream(correlation_id=correlation_id)
```
Error Handling
Lambda retries on failure (for async invocations):
```python
def handler(event, context):
    try:
        process(event)
    except RetryableError:
        raise  # Lambda will retry
    except PermanentError as e:
        # Don't retry; send to the DLQ and return success to stop retries
        send_to_dlq(event, e)
        return
```
Configure dead-letter queues for failed invocations:
```yaml
functions:
  processor:
    handler: process.handler
    onError: arn:aws:sqs:region:account:dlq
```
Deployment Strategies
```yaml
# Canary deployment
functions:
  api:
    handler: api.handler
    deploymentSettings:
      type: Canary10Percent5Minutes
```
Or use aliases and weighted routing:
```bash
# Shift 10% of traffic to version 2
aws lambda update-alias --function-name my-function --name prod \
  --routing-config 'AdditionalVersionWeights={"2"=0.1}'
```
When Not to Use Serverless
Latency-Sensitive Applications
- Cold starts add latency
- Provisioned concurrency helps but adds cost
- Consider containers if sub-100ms P99 required
Long-Running Workloads
- 15-minute limit is hard
- Step Functions add complexity
- Containers or EC2 better for batch processing
Heavy Compute
- Lambda CPU scales with memory
- Maximum 10GB memory
- GPU not available
- Use EC2/ECS for compute-intensive work
Predictable High Load
- If you know you need 1000 concurrent always
- Servers/containers likely cheaper
- Serverless shines for variable, unpredictable load
Key Takeaways
- Serverless excels at event processing, scheduled tasks, APIs with variable load, and glue logic
- Avoid long-running processes, monolithic functions, and synchronous Lambda-to-Lambda chains
- VPC adds cold start latency; avoid when not needed
- Right-size memory—more memory can be cheaper if it reduces duration
- High-volume, consistent workloads may be cheaper on containers
- Use structured logging and correlation IDs for observability
- Handle errors appropriately; use DLQs for async failures
- Don’t use serverless for latency-sensitive, long-running, or compute-intensive workloads
Serverless is a powerful tool for the right problems. Understanding patterns and anti-patterns helps you use it effectively.