Video communication has gone from nice-to-have to essential infrastructure practically overnight. The engineering behind video at scale is fascinating—real-time, bidirectional, sensitive to latency, and bandwidth-intensive.
Here’s how video infrastructure handles millions of concurrent streams.
The Challenge
Video Is Hard
Traditional web request:
Client → Request → Server → Response (100-500ms acceptable)
Video call:
Client ↔ Audio/Video stream ↔ Client
(< 150ms latency required for natural conversation)
Requirements:
- Ultra-low latency: < 150ms end-to-end
- Adaptive quality: Adjust to network conditions
- Reliable: Audio must work even when video can’t
- Scalable: Millions of concurrent calls
Architecture Patterns
Peer-to-Peer (Small Calls)
Direct connection between participants:
┌────────┐ ┌────────┐
│Client A│◄────────►│Client B│
└────────┘ └────────┘
▲ ▲
│ │
└───────┬───────────┘
│
STUN/TURN Server
(connection help)
WebRTC peer-to-peer:
- Direct media flow
- Lowest possible latency
- No server media processing
- Limited to 4-6 participants
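To see why mesh P2P tops out around a handful of participants, count the streams: every client uploads a copy of its media to every other client. A quick back-of-the-envelope sketch (the per-stream bitrate is an assumed figure):

# Mesh P2P upstream cost per client. STREAM_BITRATE_MBPS is an assumption.
STREAM_BITRATE_MBPS = 2.5    # rough cost of one high-quality video stream

def mesh_upstream_mbps(participants: int) -> float:
    # Each client uploads its stream to every other participant
    return (participants - 1) * STREAM_BITRATE_MBPS

for n in (2, 4, 6, 10):
    print(f"{n} participants: {mesh_upstream_mbps(n):.1f} Mbps upstream per client")
# 2 -> 2.5, 4 -> 7.5, 6 -> 12.5, 10 -> 22.5 Mbps

Past five or six participants, the upstream cost alone exceeds most consumer uplinks, which is exactly where an SFU takes over.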
Selective Forwarding Unit (SFU)
Server forwards streams without processing:
┌─────────────┐
┌───►│ SFU │◄───┐
│ │ (forward) │ │
│ └─────────────┘ │
│ │ │
┌────┴───┐ ┌────┴───┐ ┌─────┴──┐
│Client A│ │Client B│ │Client C│
└────────┘ └────────┘ └────────┘
A sends 1 stream up
SFU forwards it to B and C
Each client receives N-1 streams
Characteristics:
- Server doesn’t transcode
- Each client receives multiple streams
- Client bandwidth scales with participants
- Lower server CPU, higher client bandwidth
Multipoint Control Unit (MCU)
Server mixes all streams:
┌─────────────┐
┌───►│ MCU │◄───┐
│ │ (mix) │ │
│ └─────────────┘ │
│ │ │
┌────┴───┐ ┌────┴───┐ ┌─────┴──┐
│Client A│ │Client B│ │Client C│
└────────┘ └────────┘ └────────┘
A, B, C send streams up
MCU composites into single stream
Each client receives 1 mixed stream
Characteristics:
- Heavy server processing (transcoding, mixing)
- Constant client bandwidth regardless of participants
- Works with limited-capability clients
- Higher latency from processing
Hybrid Architecture
Real systems combine approaches:
Small calls (2-4): P2P
Medium calls (5-20): SFU
Large calls (20+): SFU with simulcast
Webinars: SFU to viewers, MCU for hosts
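A minimal sketch of how a service might pick the architecture at call-setup time; the thresholds mirror the tiers above, and the function name is hypothetical:

def pick_architecture(participants: int, is_webinar: bool = False) -> str:
    # Thresholds from the tiers above; production systems tune these empirically
    if is_webinar:
        return "sfu_broadcast"    # SFU fan-out to viewers, mixing for hosts
    if participants <= 4:
        return "p2p"              # full mesh, lowest latency
    if participants <= 20:
        return "sfu"              # server forwards, no transcoding
    return "sfu_simulcast"        # SFU plus layered sending for large rooms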
Scaling the SFU
Horizontal Scaling
┌─────────────────────┐
│ Load Balancer │
│ (session sticky) │
└─────────────────────┘
│
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ SFU 1 │ │ SFU 2 │ │ SFU 3 │
│ (rooms │ │ (rooms │ │ (rooms │
│ 1-100) │ │ 101-200) │ │ 201-300) │
└───────────┘ └───────────┘ └───────────┘
Challenges:
- Session affinity (all participants in a room must hit the same SFU)
- Cascading for large meetings (multiple SFUs serving one room)
- Geographic distribution
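Session affinity is usually handled by routing on the room ID rather than the user. A minimal sketch, assuming a static pool (hostnames hypothetical); a production system would layer consistent hashing on top so that resizing the pool only remaps a fraction of rooms:

import hashlib

SFU_POOL = ["sfu-1.example.com", "sfu-2.example.com", "sfu-3.example.com"]

def sfu_for_room(room_id: str) -> str:
    # Hash the room ID so every participant in a room lands on the same SFU
    digest = hashlib.sha256(room_id.encode()).digest()
    return SFU_POOL[int.from_bytes(digest[:8], "big") % len(SFU_POOL)]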
Cascading SFUs
For large meetings spanning regions:
US Region EU Region
┌───────────┐ ┌───────────┐
│ SFU US │◄─────────────►│ SFU EU │
│ (US users)│ Backbone │(EU users) │
└───────────┘ └───────────┘
▲ ▲
│ │
┌────┴────┐ ┌────┴────┐
│Users US │ │Users EU │
└─────────┘ └─────────┘
Simulcast
Clients send multiple quality levels:
Client sends:
├── High quality (1080p, 2.5 Mbps)
├── Medium quality (720p, 1 Mbps)
└── Low quality (360p, 300 Kbps)
SFU selects appropriate quality for each receiver
based on their bandwidth and screen size
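On the receiving side, layer selection reduces to picking the best layer that fits each viewer's budget. A minimal sketch using the three layers above (the viewport heuristic is an assumption):

# Simulcast layers as advertised above: (name, height_px, bitrate_kbps)
LAYERS = [("high", 1080, 2500), ("medium", 720, 1000), ("low", 360, 300)]

def pick_layer(available_kbps: int, tile_height_px: int) -> str:
    # Take the highest layer that fits the bandwidth budget and isn't
    # much larger than the tile the video is actually rendered in
    for name, height, kbps in LAYERS:
        if kbps <= available_kbps and height <= tile_height_px * 2:
            return name
    return "low"    # floor: a low-res stream beats no stream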
Bandwidth Optimization
Adaptive Bitrate
Adjust to network conditions:
# Simplified bandwidth estimation (single-flow sketch; real stacks use
# estimator-driven control like GCC, described below)
def adjust_bitrate(network_stats, current_bitrate, max_bitrate):
    packet_loss = network_stats.packet_loss    # fraction: 0.05 means 5%
    rtt = network_stats.round_trip_time        # milliseconds
    # network_stats.estimated_bandwidth can drive finer-grained control

    if packet_loss > 0.05 or rtt > 300:
        # Network congested: back off quickly
        return current_bitrate * 0.7
    elif packet_loss < 0.01 and rtt < 100:
        # Network healthy: probe upward slowly
        return min(current_bitrate * 1.1, max_bitrate)
    else:
        # In between: hold steady
        return current_bitrate
Congestion Control
WebRTC uses sophisticated congestion control:
GCC (Google Congestion Control):
- Monitors packet delays
- Estimates available bandwidth
- Adjusts send rate proactively
Measure → Estimate → Adjust → Repeat
Key signals:
- Round-trip time
- Packet loss
- Inter-arrival jitter
- Receiver feedback (RTCP)
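The jitter signal, for example, is typically maintained the way RTP receivers do it (RFC 3550): for each packet, compare the spacing of arrival times against the spacing of the media timestamps, then smooth:

# RFC 3550 inter-arrival jitter: exponentially smoothed difference between
# packet spacing on the wire and packet spacing in the media timeline
def update_jitter(jitter, recv_prev, recv_now, ts_prev, ts_now):
    d = abs((recv_now - recv_prev) - (ts_now - ts_prev))
    return jitter + (d - jitter) / 16.0    # move 1/16 toward each new sample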
Selective Stream Forwarding
Only forward what’s needed:
# Which video streams to forward to each participant
def select_streams_for_viewer(viewer, participants):
    streams = []

    # Always include active speaker at high quality
    speaker = get_active_speaker(participants)
    streams.append((speaker, 'high'))

    # Others at low quality, or audio-only if the viewer is constrained
    for p in participants:
        if p != speaker and p != viewer:
            if viewer.bandwidth_constrained:
                streams.append((p, 'audio_only'))
            else:
                streams.append((p, 'low'))
    return streams
Media Processing
Audio Processing
Critical for quality:
Input Audio
│
▼
┌───────────────────┐
│ Echo Cancellation │ (remove speaker output from mic input)
└───────────────────┘
│
▼
┌───────────────────┐
│ Noise Suppression │ (filter background noise)
└───────────────────┘
│
▼
┌───────────────────┐
│ Auto Gain Control │ (normalize volume)
└───────────────────┘
│
▼
┌───────────────────┐
│ Opus Encoder │ (variable bitrate, 6-510 kbps)
└───────────────────┘
│
▼
Network
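The chain above maps naturally onto composed stages. In this sketch the DSP internals are stubs; real stacks use dedicated implementations such as WebRTC's audio processing module:

# Stub stages standing in for real DSP
def cancel_echo(frame):    return frame    # subtract speaker signal from mic
def suppress_noise(frame): return frame    # attenuate non-speech energy
def auto_gain(frame):      return frame    # normalize level toward a target

def process_frame(frame):
    for stage in (cancel_echo, suppress_noise, auto_gain):
        frame = stage(frame)
    return frame    # then handed to the Opus encoder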
Video Encoding
Balance quality and bandwidth:
Codec choices:
- VP8/VP9: Open, good compression, widely supported
- H.264: Hardware acceleration everywhere
- AV1: Best compression, still emerging
Encoding settings:
target_settings:
720p:
resolution: 1280x720
framerate: 30
bitrate: 1500kbps
keyframe_interval: 3s
360p:
resolution: 640x360
framerate: 15
bitrate: 400kbps
keyframe_interval: 3s
Infrastructure
Global Distribution
Latency matters; servers must be close:
┌─────────────────────────────────────────────────────────────────┐
│ Global Anycast Network │
├──────────┬──────────┬──────────┬──────────┬──────────┬─────────┤
│ US-W │ US-E │ EU │ APAC │ LATAM │ India │
│ SFUs │ SFUs │ SFUs │ SFUs │ SFUs │ SFUs │
└──────────┴──────────┴──────────┴──────────┴──────────┴─────────┘
Network Path Optimization
Better than the public internet:
Public Internet:
Client → ISP → Transit → Transit → ISP → Server
(variable latency, congestion)
Optimized:
Client → ISP → Edge PoP → Private backbone → Edge PoP → ISP → Server
(predictable, uncongested)
TURN Servers
For when P2P fails (firewalls, NATs):
Client A TURN Server Client B
│ │ │
├──────────────────►│ │
│ Media │ │
│ ├───────────────────►│
│ │ Media │
│◄──────────────────┤ │
│ Media │◄───────────────────┤
│ │ Media │
TURN is bandwidth-expensive; minimize usage.
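The expense is easy to see once you count bytes: a relayed stream crosses your server twice, in and out. A rough sketch, with all figures assumed:

# Rough TURN traffic estimate; every input here is an illustrative assumption
def turn_gb(relayed_minutes: float, bitrate_kbps: float = 1000) -> float:
    # The relay receives and re-sends the stream, so count the bytes twice
    bits = relayed_minutes * 60 * bitrate_kbps * 1000 * 2
    return bits / 8 / 1e9    # bits -> gigabytes

print(turn_gb(1_000_000))   # 1M relayed minutes at ~1 Mbps = 15,000 GB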
Monitoring
Key Metrics
connection_quality:
- packet_loss_percentage
- round_trip_time_ms
- jitter_ms
- bitrate_kbps
user_experience:
- time_to_media (first audio/video)
- call_setup_success_rate
- call_quality_score (MOS, Mean Opinion Score)
infrastructure:
- sfu_cpu_utilization
- sfu_bandwidth_utilization
- turn_relay_usage
- calls_per_server
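Scores like MOS are usually derived from the network metrics above rather than measured directly. A simplified E-model-style sketch (the full method is ITU-T G.107; the coefficients below follow a common simplification and should be treated as approximate):

def estimate_mos(rtt_ms: float, jitter_ms: float, loss_pct: float) -> float:
    # Effective one-way latency, with jitter weighted double
    effective_latency = rtt_ms / 2 + 2 * jitter_ms + 10
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40
    else:
        r = 93.2 - (effective_latency - 120) / 10
    r = max(0.0, min(100.0, r - 2.5 * loss_pct))   # packet-loss penalty
    # Map the R-factor onto the 1-5 MOS scale
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

print(round(estimate_mos(rtt_ms=80, jitter_ms=10, loss_pct=0.5), 2))  # ~4.3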
Real-Time Monitoring
Every call sends telemetry:
- Audio/video quality metrics
- Network conditions
- Client-side issues
- Error events
Aggregate in real-time:
- Per-call quality
- Per-region quality
- Global health dashboard
Key Takeaways
- Video needs ultra-low latency (< 150ms) for natural conversation
- SFU architecture (selective forwarding) scales best for most use cases
- Simulcast lets clients send multiple qualities; SFU selects appropriate one
- Adaptive bitrate responds to network conditions in real-time
- Audio processing (echo cancellation, noise suppression) is critical for quality
- Global server distribution minimizes latency; users connect to nearest SFU
- TURN servers relay traffic when direct connection fails (firewall/NAT)
- Monitor packet loss, RTT, and jitter for quality; MOS for user experience
- Cascading SFUs handle large meetings spanning geographic regions
Video infrastructure is complex but fascinating. The surge in demand has accelerated innovation across the industry.