Video communication has gone from nice-to-have to essential infrastructure practically overnight. The engineering behind video at scale is fascinating—real-time, bidirectional, sensitive to latency, and bandwidth-intensive.
Here’s how video infrastructure handles millions of concurrent streams.
The Challenge
Video Is Hard
Traditional web request:
Client → Request → Server → Response (100-500ms acceptable)
Video call:
Client ↔ Audio/Video stream ↔ Client
(< 150ms latency required for natural conversation)
Requirements:
- Ultra-low latency: < 150ms end-to-end
- Adaptive quality: Adjust to network conditions
- Reliable: Audio must work even when video can’t
- Scalable: Millions of concurrent calls
Architecture Patterns
Peer-to-Peer (Small Calls)
Direct connection between participants:
┌────────┐ ┌────────┐
│Client A│◄────────►│Client B│
└────────┘ └────────┘
▲ ▲
│ │
└───────┬───────────┘
│
STUN/TURN Server
(connection help)
WebRTC peer-to-peer:
- Direct media flow
- Lowest possible latency
- No server media processing
- Limited to 4-6 participants
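To see why mesh P2P tops out around a handful of participants, count the streams: every client uploads a copy of its media to every other client. A quick back-of-the-envelope sketch (the per-stream bitrate is an assumed figure):

# Mesh P2P upstream cost per client. STREAM_BITRATE_MBPS is an assumption.
STREAM_BITRATE_MBPS = 2.5    # rough cost of one high-quality video stream

def mesh_upstream_mbps(participants: int) -> float:
    # Each client uploads its stream to every other participant
    return (participants - 1) * STREAM_BITRATE_MBPS

for n in (2, 4, 6, 10):
    print(f"{n} participants: {mesh_upstream_mbps(n):.1f} Mbps upstream per client")
# 2 -> 2.5, 4 -> 7.5, 6 -> 12.5, 10 -> 22.5 Mbps

Past five or six participants, the upstream cost alone exceeds most consumer uplinks, which is exactly where an SFU takes over.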
Selective Forwarding Unit (SFU)
Server forwards streams without processing:
┌─────────────┐
┌───►│ SFU │◄───┐
│ │ (forward) │ │
│ └─────────────┘ │
│ │ │
┌────┴───┐ ┌────┴───┐ ┌─────┴──┐
│Client A│ │Client B│ │Client C│
└────────┘ └────────┘ └────────┘
A sends 1 stream up
SFU forwards it to B and C
Each client receives N-1 streams
Characteristics:
- Server doesn’t transcode
- Each client receives multiple streams
- Client bandwidth scales with participants
- Lower server CPU, higher client bandwidth
Multipoint Control Unit (MCU)
Server mixes all streams:
┌─────────────┐
┌───►│ MCU │◄───┐
│ │ (mix) │ │
│ └─────────────┘ │
│ │ │
┌────┴───┐ ┌────┴───┐ ┌─────┴──┐
│Client A│ │Client B│ │Client C│
└────────┘ └────────┘ └────────┘
A, B, C send streams up
MCU composites into single stream
Each client receives 1 mixed stream
Characteristics:
- Heavy server processing (transcoding, mixing)
- Constant client bandwidth regardless of participants
- Works with limited-capability clients
- Higher latency from processing
Hybrid Architecture
Real systems combine approaches:
Small calls (2-4): P2P
Medium calls (5-20): SFU
Large calls (20+): SFU with simulcast
Webinars: SFU to viewers, MCU for hosts
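A minimal sketch of how a service might pick the architecture at call-setup time; the thresholds mirror the tiers above, and the function name is hypothetical:

def pick_architecture(participants: int, is_webinar: bool = False) -> str:
    # Thresholds from the tiers above; production systems tune these empirically
    if is_webinar:
        return "sfu_broadcast"    # SFU fan-out to viewers, mixing for hosts
    if participants <= 4:
        return "p2p"              # full mesh, lowest latency
    if participants <= 20:
        return "sfu"              # server forwards, no transcoding
    return "sfu_simulcast"        # SFU plus layered sending for large rooms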
Scaling the SFU
Horizontal Scaling
┌─────────────────────┐
│ Load Balancer │
│ (session sticky) │
└─────────────────────┘
│
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ SFU 1 │ │ SFU 2 │ │ SFU 3 │
│ (rooms │ │ (rooms │ │ (rooms │
│ 1-100) │ │ 101-200) │ │ 201-300) │
└───────────┘ └───────────┘ └───────────┘
Challenges:
- Session affinity (all participants in a room must hit the same SFU)
- Cascading for large meetings (multiple SFUs serving one room)
- Geographic distribution
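Session affinity is usually handled by routing on the room ID rather than the user. A minimal sketch, assuming a static pool (hostnames hypothetical); a production system would layer consistent hashing on top so that resizing the pool only remaps a fraction of rooms:

import hashlib

SFU_POOL = ["sfu-1.example.com", "sfu-2.example.com", "sfu-3.example.com"]

def sfu_for_room(room_id: str) -> str:
    # Hash the room ID so every participant in a room lands on the same SFU
    digest = hashlib.sha256(room_id.encode()).digest()
    return SFU_POOL[int.from_bytes(digest[:8], "big") % len(SFU_POOL)]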
Cascading SFUs
For large meetings spanning regions:
US Region EU Region
┌───────────┐ ┌───────────┐
│ SFU US │◄─────────────►│ SFU EU │
│ (US users)│ Backbone │(EU users) │
└───────────┘ └───────────┘
▲ ▲
│ │
┌────┴────┐ ┌────┴────┐
│Users US │ │Users EU │
└─────────┘ └─────────┘
Simulcast
Clients send multiple quality levels:
Client sends:
├── High quality (1080p, 2.5 Mbps)
├── Medium quality (720p, 1 Mbps)
└── Low quality (360p, 300 Kbps)
SFU selects appropriate quality for each receiver
based on their bandwidth and screen size
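On the receiving side, layer selection reduces to picking the best layer that fits each viewer's budget. A minimal sketch using the three layers above (the viewport heuristic is an assumption):

# Simulcast layers as advertised above: (name, height_px, bitrate_kbps)
LAYERS = [("high", 1080, 2500), ("medium", 720, 1000), ("low", 360, 300)]

def pick_layer(available_kbps: int, tile_height_px: int) -> str:
    # Take the highest layer that fits the bandwidth budget and isn't
    # much larger than the tile the video is actually rendered in
    for name, height, kbps in LAYERS:
        if kbps <= available_kbps and height <= tile_height_px * 2:
            return name
    return "low"    # floor: a low-res stream beats no stream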
Bandwidth Optimization
Adaptive Bitrate
Adjust to network conditions:
# Simplified bandwidth estimation (single-flow sketch; real stacks use
# estimator-driven control like GCC, described below)
def adjust_bitrate(network_stats, current_bitrate, max_bitrate):
    packet_loss = network_stats.packet_loss    # fraction: 0.05 means 5%
    rtt = network_stats.round_trip_time        # milliseconds
    # network_stats.estimated_bandwidth can drive finer-grained control

    if packet_loss > 0.05 or rtt > 300:
        # Network congested: back off quickly
        return current_bitrate * 0.7
    elif packet_loss < 0.01 and rtt < 100:
        # Network healthy: probe upward slowly
        return min(current_bitrate * 1.1, max_bitrate)
    else:
        # In between: hold steady
        return current_bitrate
Congestion Control
WebRTC uses sophisticated congestion control:
GCC (Google Congestion Control):
- Monitors packet delays
- Estimates available bandwidth
- Adjusts send rate proactively
Measure → Estimate → Adjust → Repeat
Key signals:
- Round-trip time
- Packet loss
- Inter-arrival jitter
- Receiver feedback (RTCP)
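The jitter signal, for example, is typically maintained the way RTP receivers do it (RFC 3550): for each packet, compare the spacing of arrival times against the spacing of the media timestamps, then smooth:

# RFC 3550 inter-arrival jitter: exponentially smoothed difference between
# packet spacing on the wire and packet spacing in the media timeline
def update_jitter(jitter, recv_prev, recv_now, ts_prev, ts_now):
    d = abs((recv_now - recv_prev) - (ts_now - ts_prev))
    return jitter + (d - jitter) / 16.0    # move 1/16 toward each new sample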
Selective Stream Forwarding
Only forward what’s needed:
# Which video streams to forward to each participant
def select_streams_for_viewer(viewer, participants):
    streams = []

    # Always include active speaker at high quality
    speaker = get_active_speaker(participants)
    streams.append((speaker, 'high'))

    # Others at low quality, or audio-only if the viewer is constrained
    for p in participants:
        if p != speaker and p != viewer:
            if viewer.bandwidth_constrained:
                streams.append((p, 'audio_only'))
            else:
                streams.append((p, 'low'))
    return streams
Media Processing
Audio Processing
Critical for quality:
Input Audio
│
▼
┌───────────────────┐
│ Echo Cancellation │ (remove speaker output from mic input)
└───────────────────┘
│
▼
┌───────────────────┐
│ Noise Suppression │ (filter background noise)
└───────────────────┘
│
▼
┌───────────────────┐
│ Auto Gain Control │ (normalize volume)
└───────────────────┘
│
▼
┌───────────────────┐
│ Opus Encoder │ (variable bitrate, 6-510 kbps)
└───────────────────┘
│
▼
Network
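The chain above maps naturally onto composed stages. In this sketch the DSP internals are stubs; real stacks use dedicated implementations such as WebRTC's audio processing module:

# Stub stages standing in for real DSP
def cancel_echo(frame):    return frame    # subtract speaker signal from mic
def suppress_noise(frame): return frame    # attenuate non-speech energy
def auto_gain(frame):      return frame    # normalize level toward a target

def process_frame(frame):
    for stage in (cancel_echo, suppress_noise, auto_gain):
        frame = stage(frame)
    return frame    # then handed to the Opus encoder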
Video Encoding
Balance quality and bandwidth:
Codec choices:
- VP8/VP9: Open, good compression, widely supported
- H.264: Hardware acceleration everywhere
- AV1: Best compression, still emerging
Encoding settings:
target_settings:
720p:
resolution: 1280x720
framerate: 30
bitrate: 1500kbps
keyframe_interval: 3s
360p:
resolution: 640x360
framerate: 15
bitrate: 400kbps
keyframe_interval: 3s
Infrastructure
Global Distribution
Latency matters; servers must be close:
┌─────────────────────────────────────────────────────────────────┐
│ Global Anycast Network │
├──────────┬──────────┬──────────┬──────────┬──────────┬─────────┤
│ US-W │ US-E │ EU │ APAC │ LATAM │ India │
│ SFUs │ SFUs │ SFUs │ SFUs │ SFUs │ SFUs │
└──────────┴──────────┴──────────┴──────────┴──────────┴─────────┘
Network Path Optimization
Better than the public internet:
Public Internet:
Client → ISP → Transit → Transit → ISP → Server
(variable latency, congestion)
Optimized:
Client → ISP → Edge PoP → Private backbone → Edge PoP → ISP → Server
(predictable, uncongested)
TURN Servers
For when P2P fails (firewalls, NATs):
Client A TURN Server Client B
│ │ │
├──────────────────►│ │
│ Media │ │
│ ├───────────────────►│
│ │ Media │
│◄──────────────────┤ │
│ Media │◄───────────────────┤
│ │ Media │
TURN is bandwidth-expensive; minimize usage.
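The expense is easy to see once you count bytes: a relayed stream crosses your server twice, in and out. A rough sketch, with all figures assumed:

# Rough TURN traffic estimate; every input here is an illustrative assumption
def turn_gb(relayed_minutes: float, bitrate_kbps: float = 1000) -> float:
    # The relay receives and re-sends the stream, so count the bytes twice
    bits = relayed_minutes * 60 * bitrate_kbps * 1000 * 2
    return bits / 8 / 1e9    # bits -> gigabytes

print(turn_gb(1_000_000))   # 1M relayed minutes at ~1 Mbps = 15,000 GB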
Monitoring
Key Metrics
connection_quality:
- packet_loss_percentage
- round_trip_time_ms
- jitter_ms
- bitrate_kbps
user_experience:
- time_to_media (first audio/video)
- call_setup_success_rate
- call_quality_score (MOS, Mean Opinion Score)
infrastructure:
- sfu_cpu_utilization
- sfu_bandwidth_utilization
- turn_relay_usage
- calls_per_server
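Scores like MOS are usually derived from the network metrics above rather than measured directly. A simplified E-model-style sketch (the full method is ITU-T G.107; the coefficients below follow a common simplification and should be treated as approximate):

def estimate_mos(rtt_ms: float, jitter_ms: float, loss_pct: float) -> float:
    # Effective one-way latency, with jitter weighted double
    effective_latency = rtt_ms / 2 + 2 * jitter_ms + 10
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40
    else:
        r = 93.2 - (effective_latency - 120) / 10
    r = max(0.0, min(100.0, r - 2.5 * loss_pct))   # packet-loss penalty
    # Map the R-factor onto the 1-5 MOS scale
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

print(round(estimate_mos(rtt_ms=80, jitter_ms=10, loss_pct=0.5), 2))  # ~4.3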
Real-Time Monitoring
Every call sends telemetry:
- Audio/video quality metrics
- Network conditions
- Client-side issues
- Error events
Aggregate in real-time:
- Per-call quality
- Per-region quality
- Global health dashboard
Key Takeaways
- Video needs ultra-low latency (< 150ms) for natural conversation
- SFU architecture (selective forwarding) scales best for most use cases
- Simulcast lets clients send multiple qualities; SFU selects appropriate one
- Adaptive bitrate responds to network conditions in real-time
- Audio processing (echo cancellation, noise suppression) is critical for quality
- Global server distribution minimizes latency; users connect to nearest SFU
- TURN servers relay traffic when direct connection fails (firewall/NAT)
- Monitor packet loss, RTT, and jitter for quality; MOS for user experience
- Cascading SFUs handle large meetings spanning geographic regions
Video infrastructure is complex but fascinating. The surge in demand has accelerated innovation across the industry.