Docker in Production: Lessons from Running Containers at Scale

February 8, 2016

Docker crossed from development curiosity to production reality sometime in 2014. Two years later, the ecosystem has matured significantly, but running containers at scale still requires careful planning and operational discipline. Here’s what we’ve learned from deploying Docker across multiple production environments.

The Promise and Reality

Docker’s value proposition is compelling: consistent environments from development to production, rapid deployment, efficient resource utilization, and simplified dependency management. These benefits are real, but they come with operational complexity that isn’t immediately obvious.

The marketing materials show containers spinning up in milliseconds and developers shipping code with confidence. The reality involves wrestling with networking, debugging opaque failures, managing image sprawl, and building operational tooling that the Docker ecosystem hasn’t yet standardized.

None of this means Docker isn’t worth adopting. It means approaching it with realistic expectations and investing in the operational foundations that make it work.

Image Management

Keep Images Small

Every megabyte in your image is a megabyte that must be transferred on every deployment to every host. Large images slow deployments, consume bandwidth, and waste storage.

Start with minimal base images. Alpine Linux provides a functional base in roughly 5MB. Compare that to Ubuntu’s 188MB or the default Debian image at 125MB. For most applications, Alpine’s musl libc and BusyBox utilities are sufficient.

Use multi-stage builds to separate build dependencies from runtime. Your Node.js application doesn’t need webpack in production; your Go service doesn’t need the compiler. Build in one stage, copy artifacts to a minimal runtime stage.

# Build stage: full Go toolchain; CGO disabled yields a static binary
FROM golang:1.6 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o service .

# Runtime stage: ships only the compiled binary
FROM alpine:3.4
COPY --from=builder /app/service /service
CMD ["/service"]

Layer Caching Matters

Docker builds images in layers, caching each layer for reuse. Order your Dockerfile instructions from least to most frequently changing. Dependencies change less often than application code, so install dependencies before copying source files.

# Dependencies first (changes rarely)
COPY package.json .
RUN npm install

# Application code last (changes frequently)
COPY src/ src/

Tag Immutably

The latest tag is convenient and dangerous. In development, it ensures you’re always running current code. In production, it makes rollbacks impossible and introduces deployment non-determinism.

Tag images with commit SHAs, build numbers, or semantic versions. When something breaks, you need to deploy exactly the previous version, not whatever latest happens to point to now.
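As a minimal sketch, tagging with the current commit SHA (the registry host and image name are placeholders):

# Tag the image with the current commit SHA instead of latest
SHA=$(git rev-parse --short HEAD)
docker build -t registry.example.com/myapp:$SHA .
docker push registry.example.com/myapp:$SHA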

Networking

Container networking is where most production complexity lives. Docker’s default bridge networking works for simple cases but breaks down in multi-host environments.

Service Discovery

Containers get IP addresses dynamically, so hard-coding addresses doesn’t work; you need service discovery. Options include:

DNS-based discovery. Docker’s built-in DNS resolves container names to IP addresses within a network. Simple and sufficient for single-host deployments.

External service discovery. Consul, etcd, or ZooKeeper provide distributed service registries. Containers register on startup, deregister on shutdown, and query the registry to find dependencies.

Orchestrator-provided discovery. Kubernetes, Docker Swarm, and Mesos provide built-in service discovery. If you’re using an orchestrator, leverage its native capabilities.
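As a minimal sketch of the DNS-based option, containers on the same user-defined network resolve each other by name (names here are placeholders):

# Containers on a user-defined network resolve each other by name
docker network create app-net
docker run -d --net=app-net --name api myapi:abc123
docker run -d --net=app-net --name web myweb:abc123
# From inside the web container, the hostname "api" now resolves
# to the api container's current IP address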

Overlay Networks

Multi-host networking requires overlay networks that span physical hosts. Docker’s overlay driver creates VXLAN tunnels between hosts, presenting a flat network to containers regardless of physical topology.
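Creating one is a single command, though the overlay driver assumes the Docker daemons already share a key-value store such as Consul, etcd, or ZooKeeper (the subnet and names below are placeholders):

# Create a network spanning every host attached to the same KV store
docker network create --driver overlay --subnet 10.0.9.0/24 app-net

# Containers started on any host in the cluster can join it
docker run -d --net=app-net --name worker myworker:abc123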

Overlay networks add latency—typically 1-2ms per hop. For most applications, this is negligible. For latency-sensitive workloads, measure carefully.

Host Networking

For maximum performance, containers can share the host’s network namespace directly. This eliminates NAT overhead and network virtualization but sacrifices isolation. Use sparingly for specific performance-critical services.
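Enabling it is one flag; the container then binds directly to host ports (the image name is a placeholder):

# Share the host's network namespace: no NAT, no port mapping
docker run -d --net=host myproxy:abc123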

Storage

Containers are ephemeral by design. Data written inside a container disappears when the container stops. Persistent data requires explicit storage configuration.

Volumes for Persistent Data

Docker volumes exist outside container lifecycles. Mount volumes for databases, uploaded files, and any data that must survive container restarts.

Name your volumes explicitly rather than letting Docker generate random names. postgres-data is easier to manage than a3f2b9c4d8e1.
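A sketch using a named volume for PostgreSQL data (names and version tag are illustrative):

# Create a named volume and mount it at PostgreSQL's data directory
docker volume create --name postgres-data
docker run -d --name postgres \
  -v postgres-data:/var/lib/postgresql/data \
  postgres:9.4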

Volume Drivers

For multi-host deployments, local volumes don’t suffice—containers need access to the same data regardless of which host they run on. Volume drivers like Flocker, REX-Ray, and vendor-specific plugins integrate with networked storage.
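The CLI stays the same regardless of backend. A sketch assuming the REX-Ray plugin is installed on each host (driver name and options vary by plugin and Docker version):

# Create a volume backed by networked storage via the rexray driver
docker volume create --driver rexray --name pg-data

# Any host running the plugin can then mount the same data
docker run -d -v pg-data:/var/lib/postgresql/data postgres:9.4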

Choose storage backends based on performance requirements. Block storage (EBS, Ceph RBD) provides good performance for databases. Object storage (S3, Swift) suits static assets and backups. NFS and distributed filesystems work for shared configuration.

Database Considerations

Running databases in containers is controversial. The arguments against: databases need careful resource management, persistent storage, and operational attention that containers complicate. The arguments for: consistency, portability, and simplified provisioning.

Our experience: stateless applications benefit most from containerization. Databases can run in containers, but require careful volume configuration, resource limits, and operational procedures that account for container orchestration behaviors.

Logging and Monitoring

Centralized Logging

Containers write logs to stdout and stderr by default, where Docker’s logging drivers capture them. In production, you need these logs aggregated centrally.

Configure Docker to forward logs to a log aggregator: ELK stack, Graylog, Splunk, or cloud logging services. Include container metadata—image name, container ID, host—in structured log entries.
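As one sketch, forwarding to Graylog via the GELF driver, which attaches container name and image metadata to each message (the endpoint address is a placeholder):

# Send container logs to a Graylog endpoint instead of local files
docker run -d \
  --log-driver=gelf \
  --log-opt gelf-address=udp://graylog.example.com:12201 \
  myapp:abc123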

Avoid logging to files inside containers. It wastes ephemeral storage, complicates log rotation, and loses data when containers restart.

Metrics Collection

Monitor both host metrics (CPU, memory, disk, network at the host level) and container metrics (resource usage per container). Docker’s stats API exposes per-container metrics; tools like cAdvisor, Prometheus, and Datadog collect and aggregate them.

Set resource limits and alert when containers approach them. A container without memory limits can consume all host memory, affecting other containers. A container that exceeds its memory limit will be OOM-killed, which is preferable to impacting its neighbors.
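Limits are flags at run time; the values here are illustrative:

# Cap memory at 512MB (no extra swap) and set relative CPU weight
docker run -d -m 512m --memory-swap 512m --cpu-shares 512 myapp:abc123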

Health Checks

Docker 1.12 introduced native health checks. Define checks in your Dockerfile:

HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost/health || exit 1

Orchestrators use health check status for service discovery and automatic recovery. An unhealthy container gets removed from load balancer pools; a persistently unhealthy container gets restarted.

Security

Run as Non-Root

By default, processes inside containers run as root. If a container is compromised, the attacker has root access within the container’s namespaces. While container isolation limits the blast radius, running as non-root adds defense in depth.

Create application users in your Dockerfile:

# BusyBox adduser: -D creates the user with no password set
RUN adduser -D appuser
USER appuser

Read-Only Filesystems

If your application doesn’t need to write to the filesystem, run with --read-only. This prevents attackers from modifying container contents, even if they gain code execution.

For applications that need to write to specific locations (temp files, caches), mount writable volumes only where necessary.
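A sketch combining a read-only root filesystem with an in-memory /tmp (the --tmpfs flag assumes Docker 1.10 or later):

# Immutable root filesystem; only /tmp is writable, and only in memory
docker run -d --read-only --tmpfs /tmp myapp:abc123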

Image Scanning

Container images inherit vulnerabilities from their base images and installed packages. Regularly scan images for known CVEs using tools like CoreOS’s Clair or commercial scanning services.

Automate scanning in your CI pipeline. Block deployments of images with critical vulnerabilities. Rebuild and redeploy when base images publish security updates.

Registry Security

Container registries are high-value targets—compromise the registry, compromise every deployment. Run private registries with TLS, authentication, and access controls. Prefer managed registries (Docker Hub, ECR, GCR) that handle security operations.

Sign images cryptographically to verify provenance. Docker Content Trust, built on Notary, provides image signing and verification.
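Content Trust is an environment variable; once set, pushes are signed and pulls verify signatures (registry and tag are placeholders):

# Sign on push, verify on pull; unsigned images are rejected
export DOCKER_CONTENT_TRUST=1
docker push registry.example.com/myapp:1.2.0
docker pull registry.example.com/myapp:1.2.0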

Orchestration

Running a few containers on a single host requires minimal tooling. Running hundreds of containers across dozens of hosts requires orchestration.

The Orchestration Landscape

The container orchestration space is fragmented in 2016. Major options include:

Docker Swarm. Native Docker clustering. Simpler than alternatives, less feature-rich. Good for Docker-native workflows and teams that want minimal operational overhead.

Kubernetes. Google’s container orchestrator, donated to the CNCF. Most feature-rich and complex. Strong community momentum. Steeper learning curve but more capability.

Mesos with Marathon. Data center operating system with container scheduling. Proven at large scale (Twitter, Airbnb). More complex to operate but handles mixed workloads well.

Amazon ECS. AWS-native container orchestration. Deep AWS integration, less portable. Good choice if you’re AWS-committed and want managed infrastructure.

Choosing an Orchestrator

For teams starting with containers, Docker Swarm’s simplicity is appealing. For teams planning significant scale, Kubernetes’ feature set justifies the learning investment. For teams with existing Mesos infrastructure, Marathon adds containers without additional operational overhead.

We’ll likely see consolidation in this space. The market is moving toward Kubernetes as the standard, but it’s early to declare winners.

Deployment Strategies

Blue-Green Deployments

Run two identical production environments: blue (current) and green (new). Deploy to green, verify health, switch traffic from blue to green. If problems emerge, switch back instantly.

With containers, blue-green is straightforward: deploy new containers, verify health checks pass, update load balancer configuration, drain old containers.
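A simplified sketch of those mechanics, with hypothetical check-health and update-lb scripts standing in for environment-specific tooling:

# Start the green environment alongside blue
docker run -d --name app-green myapp:def456

# Verify health before switching any traffic (placeholder script)
./check-health app-green || exit 1

# Repoint the load balancer at green, then retire blue
./update-lb --target app-green
docker stop app-blue && docker rm app-blue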

Rolling Updates

Update containers incrementally: start new containers, verify health, stop old containers, repeat until complete. Maintains capacity throughout deployment but extends the deployment window.

Orchestrators automate rolling updates. Configure the parallelism (how many containers update simultaneously) and health check thresholds.
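With Docker 1.12’s Swarm mode as one example, both knobs are flags on the service (service name and image are placeholders):

# Replace two containers at a time, waiting 10s between batches
docker service update \
  --image myapp:def456 \
  --update-parallelism 2 \
  --update-delay 10s \
  web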

Canary Releases

Route a small percentage of traffic to new containers while the majority continues hitting the current version. Monitor error rates and latency; if the new version performs well, gradually increase its traffic share.

Canary releases require sophisticated traffic management—typically a service mesh or programmable load balancer.

Lessons Learned

After two years, here’s what we know:

Start simple. Run containers on a single host before attempting multi-host orchestration. Master image building before optimizing layer caching. Understand Docker networking before adding overlay complexity.

Invest in observability. Container environments are dynamic. Hosts change, containers move, IP addresses rotate. Without strong logging, metrics, and tracing, debugging production issues becomes impossible.

Treat images as artifacts. Build once, deploy everywhere. The same image runs in development, staging, and production. Configuration varies through environment variables, not image modifications.

Plan for failure. Containers crash. Hosts fail. Networks partition. Design applications to handle container restarts, implement health checks, and let orchestrators handle recovery.

Security isn’t optional. Container isolation isn’t perfect. Defense in depth—minimal images, non-root users, read-only filesystems, network policies—limits the impact of compromises.

Docker has transformed how we build and deploy software. The transformation isn’t free—it requires new skills, new tooling, and new operational practices. But for teams willing to invest, the benefits compound over time.

Key Takeaways