APIs without rate limiting are vulnerable. A single client can monopolize resources, intentionally or accidentally. Denial of service attacks become trivial. Misbehaving integrations can take down your entire service.
Rate limiting protects your API and ensures fair resource allocation. Here’s how to implement it effectively.
Why Rate Limit
Protection from Abuse
Without limits:
- Scrapers can overwhelm your service
- DDoS attacks are more effective
- Bugs in client code cause outages
- Malicious actors have free rein
Fair Resource Allocation
Limited capacity should be shared fairly:
- No single client should dominate
- Paying customers get priority
- Critical functionality stays available
Cost Management
API calls cost money:
- Compute resources
- External API calls
- Database queries
- Bandwidth
Rate limits prevent unexpected cost spikes.
Rate Limiting Algorithms
Fixed Window
Count requests in fixed time windows:
import time

class FixedWindowRateLimiter:
    def __init__(self, redis, limit, window_seconds):
        self.redis = redis
        self.limit = limit
        self.window_seconds = window_seconds

    def is_allowed(self, client_id):
        # Identify the current window by integer division of the clock
        window = int(time.time() / self.window_seconds)
        key = f"rate:{client_id}:{window}"
        current = self.redis.incr(key)
        if current == 1:
            # First request in this window: expire the key when the window ends
            self.redis.expire(key, self.window_seconds)
        return current <= self.limit
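A quick usage sketch (the connection settings and the 100-requests-per-minute limit are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
limiter = FixedWindowRateLimiter(r, limit=100, window_seconds=60)

if limiter.is_allowed("user:42"):
    ...  # handle the request
else:
    ...  # reject with HTTP 429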
Pros:
- Simple to understand and implement
- Low memory usage
Cons:
- Bursts at window boundaries: a client can send the full limit just before a boundary and again just after, allowing up to 2x the intended rate
- Unfair to clients starting mid-window
Sliding Window Log
Track timestamp of each request:
import time
import uuid

class SlidingWindowLogRateLimiter:
    def __init__(self, redis, limit, window_seconds):
        self.redis = redis
        self.limit = limit
        self.window_seconds = window_seconds

    def is_allowed(self, client_id):
        now = time.time()
        window_start = now - self.window_seconds
        key = f"rate:{client_id}"
        # Remove entries that have aged out of the window
        self.redis.zremrangebyscore(key, 0, window_start)
        # Count requests still inside the window
        count = self.redis.zcard(key)
        if count < self.limit:
            # Record this request with its timestamp as the score
            self.redis.zadd(key, {str(uuid.uuid4()): now})
            self.redis.expire(key, self.window_seconds)
            return True
        return False
Pros:
- Accurate rate limiting
- No boundary burst issues
Cons:
- High memory usage (stores each request)
- More expensive operations
Sliding Window Counter
Weighted combination of current and previous windows:
import time

class SlidingWindowCounterRateLimiter:
    def __init__(self, redis, limit, window_seconds):
        self.redis = redis
        self.limit = limit
        self.window_seconds = window_seconds

    def is_allowed(self, client_id):
        now = time.time()
        current_window = int(now / self.window_seconds)
        previous_window = current_window - 1
        # How far we are into the current window, from 0.0 to 1.0
        window_progress = (now % self.window_seconds) / self.window_seconds
        current_key = f"rate:{client_id}:{current_window}"
        previous_key = f"rate:{client_id}:{previous_window}"
        current_count = int(self.redis.get(current_key) or 0)
        previous_count = int(self.redis.get(previous_key) or 0)
        # Weight the previous window by how much of it still overlaps the sliding window
        effective_count = previous_count * (1 - window_progress) + current_count
        if effective_count < self.limit:
            self.redis.incr(current_key)
            # Keep the counter long enough to serve as "previous" in the next window
            self.redis.expire(current_key, self.window_seconds * 2)
            return True
        return False
Pros:
- Smooth rate limiting
- Low memory usage
- Good accuracy
Cons:
- Slightly more complex
- Approximate (but close enough)
Token Bucket
Tokens accumulate over time, consumed by requests:
import time

class TokenBucketRateLimiter:
    # Assumes a redis client created with decode_responses=True,
    # so hgetall returns str keys and values
    def __init__(self, redis, capacity, refill_rate):
        self.redis = redis
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second

    def is_allowed(self, client_id, tokens=1):
        key = f"bucket:{client_id}"
        now = time.time()
        bucket = self.redis.hgetall(key)
        last_update = float(bucket.get('last_update', now))
        available = float(bucket.get('tokens', self.capacity))
        # Refill tokens for the time elapsed since the last request
        elapsed = now - last_update
        available = min(self.capacity, available + elapsed * self.refill_rate)
        if available >= tokens:
            available -= tokens
            self.redis.hset(key, mapping={
                'tokens': available,
                'last_update': now
            })
            # Expire once a full refill plus some slack has passed
            self.redis.expire(key, int(self.capacity / self.refill_rate) + 60)
            return True
        return False
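Because is_allowed accepts a token count, a heavier operation can be charged more than one token. A quick sketch reusing the client r from the earlier example (values illustrative):

bucket = TokenBucketRateLimiter(r, capacity=100, refill_rate=10)  # refills 10 tokens/sec, bursts to 100

if bucket.is_allowed("user:42", tokens=5):  # charge 5 tokens for an expensive call
    ...  # handle the request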
Pros:
- Allows bursts up to capacity
- Smooth average rate
- Flexible for varying request costs
Cons:
- More state per client
- More complex implementation
Leaky Bucket
Requests processed at constant rate; excess queued or rejected:
import time

class LeakyBucketRateLimiter:
    # Assumes a redis client created with decode_responses=True
    def __init__(self, redis, capacity, drain_rate):
        self.redis = redis
        self.capacity = capacity
        self.drain_rate = drain_rate  # requests per second

    def is_allowed(self, client_id):
        key = f"leaky:{client_id}"
        now = time.time()
        bucket = self.redis.hgetall(key)
        last_update = float(bucket.get('last_update', now))
        water_level = float(bucket.get('water_level', 0))
        # Drain water for the time elapsed since the last request
        elapsed = now - last_update
        water_level = max(0, water_level - elapsed * self.drain_rate)
        if water_level < self.capacity:
            water_level += 1
            self.redis.hset(key, mapping={
                'water_level': water_level,
                'last_update': now
            })
            # Expire once the bucket would have fully drained, plus slack
            self.redis.expire(key, int(self.capacity / self.drain_rate) + 60)
            return True
        return False
Pros:
- Constant output rate
- Smooths bursts
Cons:
- No burst allowance (a drawback when clients legitimately send bursts)
- Similar complexity to token bucket
Implementation Strategies
Where to Implement
API Gateway:
- Centralized enforcement
- Before request reaches application
- Good for cross-cutting concerns
Application Layer:
- Per-endpoint customization
- Access to user context
- More flexible
Both:
- Gateway for global protection
- Application for business logic
What to Limit By
def get_rate_limit_key(request):
    # Pick the most specific identity available; adjust the priority to taste
    if request.api_key:
        return f"key:{request.api_key}"     # By API key
    if request.user:
        # Alternative: f"org:{request.user.organization_id}" to limit per organization
        return f"user:{request.user.id}"    # By user (authenticated)
    return f"ip:{request.remote_addr}"      # By IP (unauthenticated)
Consider:
- Unauthenticated: IP-based (but NAT can put many clients behind one IP)
- Authenticated: User or API key based
- Tiered: Different limits per subscription level
Response Headers
Communicate limits to clients:
def add_rate_limit_headers(response, limiter, client_id):
    limit_info = limiter.get_info(client_id)
    # Header values must be strings
    response.headers['X-RateLimit-Limit'] = str(limit_info.limit)
    response.headers['X-RateLimit-Remaining'] = str(limit_info.remaining)
    response.headers['X-RateLimit-Reset'] = str(limit_info.reset_timestamp)
    if limit_info.remaining <= 0:
        response.headers['Retry-After'] = str(limit_info.retry_after)
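To attach these headers to every response, a Flask after_request hook works; a sketch, with rate_limiter and get_rate_limit_key as defined elsewhere in this section:

@app.after_request
def attach_rate_limit_headers(response):
    add_rate_limit_headers(response, rate_limiter, get_rate_limit_key(request))
    return response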
Handling Limit Exceeded
@app.before_request
def check_rate_limit():
    client_id = get_rate_limit_key(request)
    if not rate_limiter.is_allowed(client_id):
        response = jsonify({
            'error': 'rate_limit_exceeded',
            'message': 'Too many requests. Please slow down.',
            'retry_after': rate_limiter.get_retry_after(client_id)
        })
        response.status_code = 429
        add_rate_limit_headers(response, rate_limiter, client_id)
        return response
Always return:
- HTTP 429 Too Many Requests
- Retry-After header
- Clear error message
Tiered Limits
Different limits for different plans:
RATE_LIMITS = {
    'free': {'requests_per_minute': 60, 'requests_per_day': 1000},
    'pro': {'requests_per_minute': 600, 'requests_per_day': 50000},
    'enterprise': {'requests_per_minute': 6000, 'requests_per_day': None},  # None = unlimited
}

def get_rate_limit(user):
    plan = user.subscription_plan
    return RATE_LIMITS.get(plan, RATE_LIMITS['free'])
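Enforcement can then reuse any of the limiters above; a minimal sketch with the fixed-window limiter (redis_client is an assumed shared connection):

def check_tiered_limit(user, redis_client):
    limits = get_rate_limit(user)
    per_minute = FixedWindowRateLimiter(redis_client, limits['requests_per_minute'], 60)
    return per_minute.is_allowed(f"user:{user.id}")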
Endpoint-Specific Limits
Some endpoints need different limits:
ENDPOINT_LIMITS = {
    '/api/search': {'per_minute': 30},    # Expensive
    '/api/users': {'per_minute': 100},    # Standard
    '/api/health': {'per_minute': 1000},  # High limit
}

@app.before_request
def check_endpoint_rate_limit():
    endpoint_limit = ENDPOINT_LIMITS.get(request.path)
    if endpoint_limit:
        # Scope the key by path so each endpoint's budget is tracked separately
        key = f"{get_rate_limit_key(request)}:{request.path}"
        limiter = FixedWindowRateLimiter(redis_client, endpoint_limit['per_minute'], 60)
        if not limiter.is_allowed(key):
            abort(429)
Cost-Based Limiting
Weight requests by cost:
ENDPOINT_COSTS = {
    '/api/simple': 1,
    '/api/search': 10,
    '/api/report': 100,
}

def check_rate_limit(request):
    client_id = get_rate_limit_key(request)
    cost = ENDPOINT_COSTS.get(request.path, 1)
    # Expensive endpoints drain more tokens from the same bucket
    return token_bucket.is_allowed(client_id, tokens=cost)
Distributed Rate Limiting
Centralized with Redis
Redis provides atomic operations for rate limiting:
# Lua script for an atomic token bucket check-and-update
TOKEN_BUCKET_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local tokens_requested = tonumber(ARGV[4])
local tokens = tonumber(redis.call('HGET', key, 'tokens') or capacity)
local last_update = tonumber(redis.call('HGET', key, 'last_update') or now)
-- Refill for the elapsed time, capped at capacity
tokens = math.min(capacity, tokens + (now - last_update) * refill_rate)
local allowed = 0
if tokens >= tokens_requested then
    tokens = tokens - tokens_requested
    allowed = 1
end
redis.call('HSET', key, 'tokens', tostring(tokens), 'last_update', tostring(now))
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 60)
return allowed
"""
Local + Sync
For ultra-low latency:
- Local rate limiter (no network call)
- Periodic sync to shared state
- Accept some inaccuracy (see the sketch after this list)
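A minimal sketch of this pattern, assuming an in-process token bucket and a background thread that publishes consumption counts to a shared Redis key (all names are illustrative):

import threading
import time

class LocalSyncRateLimiter:
    def __init__(self, redis, client_id, capacity, refill_rate, sync_interval=1.0):
        self.redis = redis
        self.key = f"usage:{client_id}"
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.used_since_sync = 0
        self.last_update = time.time()
        self.lock = threading.Lock()
        threading.Thread(target=self._sync_loop, args=(sync_interval,), daemon=True).start()

    def is_allowed(self):
        # Pure in-memory check: no network call on the request path
        with self.lock:
            now = time.time()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_update) * self.refill_rate)
            self.last_update = now
            if self.tokens >= 1:
                self.tokens -= 1
                self.used_since_sync += 1
                return True
            return False

    def _sync_loop(self, interval):
        # Periodically publish local consumption so global usage stays observable
        while True:
            time.sleep(interval)
            with self.lock:
                used, self.used_since_sync = self.used_since_sync, 0
            if used:
                self.redis.incrby(self.key, used)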
Approximate Distributed
Each instance enforces limit / num_instances (sketched after this list):
- Simple implementation
- Works for many cases
- Less accurate with uneven load distribution
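A minimal sketch of the division, assuming the fleet size is known at deploy time through an environment variable (the variable name is illustrative):

import os

GLOBAL_LIMIT = 6000  # fleet-wide requests per minute (illustrative)

# Each instance enforces its share; no cross-instance coordination on the hot path
num_instances = int(os.environ.get("NUM_INSTANCES", "1"))
per_instance_limit = GLOBAL_LIMIT // num_instances
# Feed per_instance_limit into any of the in-process limiters above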
Key Takeaways
- Rate limiting protects against abuse, ensures fairness, and manages costs
- Token bucket is most flexible; sliding window counter is simple and effective
- Choose limit key based on context: IP, user, API key, or organization
- Return proper headers (X-RateLimit-*, Retry-After) and 429 status
- Implement tiered limits for different subscription levels
- Use cost-based limits for expensive endpoints
- Redis provides distributed rate limiting with atomic operations
- Consider local rate limiting with sync for lowest latency
Rate limiting is essential infrastructure. Implement it before you need it, not during an incident.