VPN at Scale: Lessons from Sudden Remote Work

May 4, 2020

When everyone went remote, VPNs became critical infrastructure overnight. Systems designed for 20% remote usage suddenly needed to handle 100%. Many organizations learned hard lessons about VPN scalability.

Here’s what we learned and how to do better.

What Broke

Capacity Assumptions

before:
  design_assumption: 20% concurrent users
  typical_load: 500 connections
  peak_load: 1000 connections
  provisioned_capacity: 1500 connections

after:
  actual_requirement: 100% workforce
  needed_connections: 5000
  available: 1500
  result: Degraded performance, failed connections
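
The gap between the two blocks above is simple arithmetic, and it is worth having as a function so it can be rerun whenever the assumptions change. The sketch below is illustrative only: the function name and the 50% headroom figure are assumptions (the headroom is inferred from 1500 provisioned against a 1000-connection peak), and a workforce of 5000 is taken from needed_connections.

```python
def required_connections(workforce: int, concurrent_pct: int, headroom_pct: int = 0) -> int:
    """Connections to provision for a given share of the workforce online at once.

    Percentages are integers so the result is exact (no float rounding).
    """
    if not 0 <= concurrent_pct <= 100:
        raise ValueError("concurrent_pct must be between 0 and 100")
    numerator = workforce * concurrent_pct * (100 + headroom_pct)
    return -(-numerator // 10_000)  # ceiling division on integers


WORKFORCE = 5000      # needed_connections at 100% remote
PROVISIONED = 1500    # provisioned_capacity

old_peak = required_connections(WORKFORCE, 20)          # 1000, the old peak_load
old_plan = required_connections(WORKFORCE, 20, 50)      # 1500 with assumed 50% headroom
new_need = required_connections(WORKFORCE, 100)         # 5000
shortfall = max(0, new_need - PROVISIONED)              # 3500

print(f"old_peak={old_peak} old_plan={old_plan} new_need={new_need} shortfall={shortfall}")
```

Under these numbers, the original headroom covered a 50% overshoot of the design peak, while the real overshoot was 400%.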

Bandwidth Bottlenecks

Traditional VPN architecture:
                                    ┌─────────────────────┐
                                    │   Corporate         │
Remote Users ───► VPN Gateway ──────│   Network           │
                    │               │   (All traffic)     │
                    │               └─────────────────────┘
                    │                         │
                    │                         ▼
                    └───────────────────► Internet
                        (Backhauled)         (Cloud services)

Problem: All traffic, including traffic bound for cloud services, is routed through the corporate network
Result: Massive bandwidth consumption at the VPN gateway and on the data center's Internet link
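
To put rough numbers on the backhaul problem, the sketch below estimates gateway load under a full tunnel. The per-user bandwidth figures (0.5 Mbps corporate, 1.5 Mbps cloud) are made-up assumptions for illustration, not measurements; substitute your own.

```python
def full_tunnel_load_mbps(users: int, corporate_mbps_per_user: float,
                          cloud_mbps_per_user: float) -> dict:
    """Estimate traffic at the VPN gateway when every flow is tunneled."""
    corporate = users * corporate_mbps_per_user
    cloud = users * cloud_mbps_per_user
    total = corporate + cloud
    return {
        "corporate_mbps": corporate,
        "cloud_mbps": cloud,
        "total_mbps": total,
        # Share of gateway traffic that only passes through on its way to the Internet.
        "backhauled_share": cloud / total if total else 0.0,
    }


# Assumed inputs: 5000 users, 0.5 Mbps corporate and 1.5 Mbps cloud traffic each.
load = full_tunnel_load_mbps(5000, 0.5, 1.5)
print(load)
```

With these assumed inputs, three quarters of the traffic at the gateway is cloud traffic that never needed to enter the corporate network.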

Single Points of Failure

failures_observed:
  - VPN gateway hardware failure
  - License server unreachable
  - Authentication service overloaded
  - Certificate expiration
  - ISP issues at data center

Emergency Scaling

Quick Wins

What organizations did immediately:

capacity:
  - Added VPN licenses
  - Deployed additional gateways
  - Upgraded hardware
  - Added bandwidth

optimization:
  - Enabled split tunneling (carefully)
  - Moved cloud services off VPN
  - Staggered work hours
  - Prioritized critical users

redundancy:
  - Added backup gateways
  - Multi-ISP connectivity
  - Geographic distribution

Split Tunneling

Route only corporate traffic through the VPN and send everything else directly to the Internet:

# Before: All traffic through VPN
full_tunnel:
  bandwidth_per_user: High (all traffic)
  cloud_latency: High (backhauled)
  vpn_load: Very high

# After: Corporate only through VPN
split_tunnel:
  vpn_traffic:
    - Internal applications
    - Corporate databases
    - Admin interfaces

  direct_internet:
    - Microsoft 365
    - Salesforce
    - Zoom
    - General browsing

  result:
    bandwidth_reduction: 60-80% (varies with traffic mix)
    user_experience: Improved
    security_tradeoff: Requires endpoint protection
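
The vpn_traffic / direct_internet split above comes down to a per-destination routing decision. In practice the VPN client enforces this through routes pushed by the gateway. The Python sketch below only illustrates the decision logic, and the corporate address ranges in it are assumptions (common private ranges), not taken from any real deployment.

```python
import ipaddress

# Assumed corporate ranges; replace with the prefixes your internal systems use.
CORPORATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
]


def route_for(destination: str) -> str:
    """Return "vpn" for corporate destinations and "direct" for everything else."""
    addr = ipaddress.ip_address(destination)
    if any(addr in net for net in CORPORATE_NETWORKS):
        return "vpn"
    return "direct"


for ip in ("10.1.2.3", "172.20.0.5", "8.8.8.8"):
    print(ip, "->", route_for(ip))
```

Anything not explicitly listed as corporate goes direct, which is why the security tradeoff falls on endpoint protection: that traffic is no longer inspected at the corporate edge.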

Cloud VPN Options

aws_client_vpn:
  - Managed service
  - Scales automatically
  - Billed hourly per endpoint association and per active connection
  - Integrates with VPC

azure_vpn_gateway:
  - Native Azure integration
  - Multiple SKUs for scaling
  - Site-to-site and point-to-site

third_party_cloud:
  - Zscaler Private Access
  - Cloudflare Access
  - Prisma Access (Palo Alto)

Better Architecture

Regional Distribution

                    ┌─────────────────┐
                    │  Corporate DC   │
                    └────────┬────────┘
                             │
      ┌──────────────────────┼──────────────────────┐
      │                      │                      │
      ▼                      ▼                      ▼
┌──────────┐          ┌──────────┐          ┌──────────┐
│ Gateway  │          │ Gateway  │          │ Gateway  │
│  US-West │          │  US-East │          │  EU      │
└────┬─────┘          └────┬─────┘          └────┬─────┘
     │                     │                     │
 US-West               US-East                EU Users
 Users                 Users
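
One way to steer each user to the nearest gateway in the diagram above is to probe all of them and pick the lowest round-trip time. This is a minimal sketch: the gateway names and latency values are hypothetical, and how the latencies are measured (ICMP, TCP connect time, and so on) is left out.

```python
from typing import Mapping


def closest_gateway(latency_ms: Mapping[str, float]) -> str:
    """Return the gateway name with the lowest measured latency."""
    if not latency_ms:
        raise ValueError("no gateway measurements available")
    return min(latency_ms, key=latency_ms.get)


# Hypothetical measurements from a user on the US East Coast.
measurements = {"us-west": 72.0, "us-east": 11.5, "eu": 95.0}
print(closest_gateway(measurements))  # us-east
```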

High Availability

ha_configuration:
  load_balancing:
    - DNS round-robin
    - Global load balancer
    - Anycast routing

  redundancy:
    - Active-active gateways
    - Automatic failover
    - Health monitoring

  geographic:
    - Multiple regions
    - User-closest routing
    - Data sovereignty compliance
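
As a rough illustration of how active-active gateways, health monitoring, and automatic failover fit together, the sketch below picks the least-utilized healthy gateway and skips any that are down or full. The Gateway fields and sample values are assumptions for the example; a real load balancer would get them from health checks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gateway:
    name: str
    healthy: bool   # result of the latest health check
    active: int     # current connections
    limit: int      # licensed or tested connection limit


def pick_gateway(gateways: List[Gateway]) -> Optional[Gateway]:
    """Choose the healthy gateway with the lowest utilization, or None if none can take load."""
    candidates = [g for g in gateways if g.healthy and g.active < g.limit]
    if not candidates:
        return None
    return min(candidates, key=lambda g: g.active / g.limit)


pool = [
    Gateway("gw-a", healthy=True, active=900, limit=1000),
    Gateway("gw-b", healthy=False, active=100, limit=1000),  # failed health check
    Gateway("gw-c", healthy=True, active=400, limit=1000),
]
chosen = pick_gateway(pool)
print(chosen.name if chosen else "no gateway available")  # gw-c
```

Failover here is implicit: an unhealthy gateway simply drops out of the candidate list on the next selection.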

Monitoring and Alerting

metrics_to_watch:
  capacity:
    - Active connections vs. limit
    - Bandwidth utilization
    - CPU/memory on gateways

  performance:
    - Connection time
    - Throughput per user
    - Latency

  availability:
    - Gateway health
    - Authentication success rate
    - Connection drops

alerts:
  - capacity > 70%: Warning
  - capacity > 85%: Critical
  - gateway_down: Page immediately
  - auth_failure_rate > 5%: Investigate
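
The alert thresholds above can be expressed as a small evaluation function. This is a sketch, not a monitoring integration: the function names and message strings are illustrative, and integer arithmetic is used so that values sitting exactly on a threshold compare predictably.

```python
from typing import List


def capacity_alert(active: int, limit: int) -> str:
    """Map connection utilization to "ok", "warning" (>70%), or "critical" (>85%)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    used_x100 = active * 100
    if used_x100 > 85 * limit:
        return "critical"
    if used_x100 > 70 * limit:
        return "warning"
    return "ok"


def evaluate(active: int, limit: int, gateway_up: bool,
             auth_failures: int, auth_attempts: int) -> List[str]:
    """Return the alerts that should fire for one gateway sample."""
    alerts = []
    if not gateway_up:
        alerts.append("gateway_down: page immediately")
    level = capacity_alert(active, limit)
    if level != "ok":
        alerts.append(f"capacity: {level}")
    if auth_attempts > 0 and auth_failures * 100 > 5 * auth_attempts:
        alerts.append("auth_failure_rate: investigate")
    return alerts


print(evaluate(active=1400, limit=1500, gateway_up=True, auth_failures=3, auth_attempts=100))
```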

What Comes Next

Zero Trust Transition

VPN is a bridge, not a destination:

migration_path:
  phase_1_immediate:
    - Scale existing VPN
    - Split tunneling
    - Add capacity

  phase_2_short_term:
    - Cloud apps via zero trust proxy
    - VPN for legacy only
    - Improved monitoring

  phase_3_long_term:
    - Identity-based access everywhere
    - VPN deprecated
    - Zero trust architecture

Application-Level Access

Move away from network-level trust:

current_model:
  VPN → Full network access → Applications

target_model:
  Identity → Policy → Specific application access

benefits:
  - No broad network access
  - Per-app authorization
  - Better visibility
  - Easier to scale
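
The target model (Identity → Policy → Specific application access) can be sketched as a default-deny policy check. Everything here is hypothetical: the application names, group names, and the MFA requirement are illustrative assumptions, and a real deployment would delegate this to an identity provider or access proxy.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Identity:
    user: str
    groups: FrozenSet[str]
    mfa_passed: bool


# Hypothetical policy: application name -> groups allowed to reach it.
POLICY: Dict[str, Set[str]] = {
    "payroll": {"finance"},
    "wiki": {"finance", "engineering"},
}


def is_allowed(identity: Identity, app: str, policy: Dict[str, Set[str]] = POLICY) -> bool:
    """Grant access to one application only; unknown apps are denied by default."""
    allowed_groups = policy.get(app)
    if allowed_groups is None:
        return False                      # default deny: app not in policy
    if not identity.mfa_passed:
        return False                      # assumed requirement: MFA for every app
    return bool(identity.groups & allowed_groups)


dev = Identity("dana", frozenset({"engineering"}), mfa_passed=True)
print(is_allowed(dev, "wiki"), is_allowed(dev, "payroll"))  # True False
```

Unlike a VPN session, a passing check here grants nothing beyond the single application named in the request.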

Cloud-First Architecture

Design for remote access from the start:

principles:
  - Applications accessible from anywhere
  - Identity is the perimeter
  - Assume hostile network
  - Encrypt everything

implementation:
  - SaaS where possible
  - Cloud-native applications
  - Zero trust access
  - Strong authentication everywhere

Lessons Learned

Planning

capacity_planning:
  - Plan for 100% remote, not 20%
  - Test at expected scale
  - Have headroom for growth
  - Regular capacity reviews

redundancy:
  - No single points of failure
  - Test failover regularly
  - Geographic distribution
  - Multiple ISPs

Operations

monitoring:
  - Real-time capacity visibility
  - Performance metrics
  - User experience tracking
  - Proactive alerting

response:
  - Runbooks for common issues
  - Escalation procedures
  - Vendor support contracts
  - Capacity addition process

Architecture

design_principles:
  - Split tunnel where possible
  - Cloud services direct
  - Regional distribution
  - Scalable infrastructure

future_direction:
  - Zero trust over VPN
  - Application-level access
  - Identity-based security
  - Cloud-native architecture

Key Takeaways

The VPN stress test of 2020 revealed architectural weaknesses: undersized capacity, backhauled cloud traffic, and single points of failure. Treat it as an opportunity to build more resilient, scalable remote access, ideally by moving toward zero trust rather than doubling down on network perimeters.