Feature Flags at Scale

September 6, 2021

Feature flags decouple deployment from release: you ship code to production but keep it dark until it is ready. This enables trunk-based development, gradual rollouts, and instant rollbacks. Poorly managed flags, however, turn into technical debt.

Here’s how to use feature flags effectively at scale.

Why Feature Flags?

The Deployment/Release Separation

without_feature_flags:
  - Code merged = code released
  - Long-lived branches for big features
  - Releases are high-risk events
  - Rollback means redeployment

with_feature_flags:
  - Code merged ≠ code released
  - Trunk-based development possible
  - Gradual rollout reduces risk
  - Instant rollback via flag toggle

Use Cases

feature_flag_uses:
  release_management:
    - Dark launches (deploy but don't release)
    - Gradual rollout (1% → 10% → 100%)
    - Kill switch for problematic features

  testing:
    - A/B testing
    - Beta user programs
    - Internal dogfooding

  operations:
    - Circuit breakers
    - Load shedding
    - Graceful degradation

  business:
    - Customer-specific features
    - Plan/tier gating
    - Time-limited promotions
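
To make the operations use cases more concrete, here is a minimal sketch of a kill switch with graceful degradation. `FlagFunc`, `Recommendations`, `ErrUpstream`, and the `recommendations_enabled` flag are hypothetical names chosen for illustration, not part of any particular SDK.

```go
package main

import "errors"

// FlagFunc is a hypothetical stand-in for a flag SDK lookup.
type FlagFunc func(flagName string) bool

// ErrUpstream simulates a failing dependency in usage examples.
var ErrUpstream = errors.New("recommendation service unavailable")

// Recommendations returns personalized items while the ops flag is on and
// degrades to a static list when the kill switch is flipped or the call fails.
func Recommendations(isEnabled FlagFunc, personalized func() ([]string, error)) []string {
	fallback := []string{"bestsellers"}
	if !isEnabled("recommendations_enabled") {
		return fallback // Kill switch off: skip the expensive call entirely
	}
	items, err := personalized()
	if err != nil {
		return fallback // Degrade gracefully instead of failing the page
	}
	return items
}
```

Turning the flag off removes load from the dependency without a deploy, which is what makes ops flags useful for load shedding.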

Flag Types

Categorization

flag_types:
  release_flags:
    purpose: Control feature rollout
    lifecycle: Short-lived (weeks)
    removal: After full rollout
    example: new_checkout_flow

  experiment_flags:
    purpose: A/B testing
    lifecycle: Medium (weeks to months)
    removal: After experiment concludes
    example: pricing_page_variant

  ops_flags:
    purpose: Operational control
    lifecycle: Long-lived
    removal: Rarely
    example: enable_new_payment_provider

  permission_flags:
    purpose: Access control
    lifecycle: Permanent
    removal: When feature deprecated
    example: enable_enterprise_sso
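
One way to make these lifecycles enforceable is to encode them next to the flag type. The sketch below assumes review thresholds of 30 days for release flags and 90 days for experiment flags; those numbers, and the names `FlagType`, `Days`, and `NeedsReview`, are illustrative assumptions rather than recommendations from any tool.

```go
package main

import "time"

// FlagType mirrors the four categories listed above.
type FlagType string

const (
	Release    FlagType = "release"
	Experiment FlagType = "experiment"
	Ops        FlagType = "ops"
	Permission FlagType = "permission"
)

// Days converts a day count into a time.Duration.
func Days(n int) time.Duration { return time.Duration(n) * 24 * time.Hour }

// reviewAfter holds assumed review thresholds for short-lived types.
// Ops and permission flags are long-lived, so they have no entry.
var reviewAfter = map[FlagType]time.Duration{
	Release:    Days(30),
	Experiment: Days(90),
}

// NeedsReview reports whether a flag of the given type has outlived
// the threshold assumed for its category.
func NeedsReview(t FlagType, age time.Duration) bool {
	limit, ok := reviewAfter[t]
	return ok && age > limit
}
```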

Implementation Patterns

Basic Flag Check

// Simple boolean flag
func (s *CheckoutService) ProcessOrder(ctx context.Context, order Order) error {
    if s.flags.IsEnabled(ctx, "new_checkout_flow") {
        return s.newCheckoutFlow(ctx, order)
    }
    return s.legacyCheckoutFlow(ctx, order)
}

Percentage Rollout

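// Gradual rollout based on user
// (assumes getFlag returns *Flag and RolloutPercentage is an int)
func (c *FlagClient) IsEnabled(ctx context.Context, flagName string) bool {
    flag := c.getFlag(flagName)
    if flag == nil {
        return false // Unknown flag: fail closed
    }

    // Get user for consistent bucketing
    user := getUserFromContext(ctx)
    if user == nil {
        return flag.DefaultValue
    }

    // Hash user ID and flag name together (FNV-1a from hash/fnv) so each
    // user gets a stable bucket that differs from flag to flag
    h := fnv.New32a()
    h.Write([]byte(user.ID + ":" + flagName))
    bucket := int(h.Sum32() % 100)

    // Users in buckets [0, RolloutPercentage) get the feature
    return bucket < flag.RolloutPercentage
}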

// Usage: gradually increase rollout
// Day 1: rollout_percentage: 1
// Day 2: rollout_percentage: 5
// Day 3: rollout_percentage: 25
// Day 4: rollout_percentage: 100
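
The property that makes this schedule safe is that buckets are stable: a user enabled at 5% is still enabled at 25%. The following sketch shows that in isolation, assuming FNV-1a as the hash (any well-distributed hash would do). `Bucket`, `InRollout`, and `SyntheticUsers` are names invented for this example.

```go
package main

import (
	"hash/fnv"
	"strconv"
)

// Bucket maps a (user, flag) pair to a stable value in [0, 100).
func Bucket(userID, flagName string) int {
	h := fnv.New32a()
	h.Write([]byte(userID + ":" + flagName))
	return int(h.Sum32() % 100)
}

// InRollout reports whether the user falls inside the rollout percentage.
func InRollout(userID, flagName string, percentage int) bool {
	return Bucket(userID, flagName) < percentage
}

// SyntheticUsers generates IDs "user-0" .. "user-(n-1)" for experiments.
func SyntheticUsers(n int) []string {
	users := make([]string, n)
	for i := range users {
		users[i] = "user-" + strconv.Itoa(i)
	}
	return users
}
```

Because the comparison is `bucket < percentage`, raising the percentage only ever adds users. Including the flag name in the hash keeps the same users from always landing in the first 1% of every rollout.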

Targeting Rules

# LaunchDarkly-style targeting
flag: new_search_algorithm
description: "Improved search relevance"
variations:
  - value: true
    name: "New algorithm"
  - value: false
    name: "Legacy algorithm"

targeting:
  # Internal users (matched by company email domain)
  - variation: true
    clauses:
      - attribute: email
        op: endsWith
        value: "@company.com"

  # Beta users
  - variation: true
    clauses:
      - attribute: beta_user
        op: equals
        value: true

  # Percentage rollout for everyone else
  - variation: true
    percentage: 25

  # Default
  - variation: false
    percentage: 75
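
For intuition about how rules like these are applied, here is a minimal first-match-wins evaluator. It is a sketch under simplifying assumptions: attributes are plain strings, only the `endsWith` and `equals` operators are supported, and percentage rules are left out. `Clause`, `Rule`, and `Evaluate` are hypothetical names and do not reflect LaunchDarkly's actual data model.

```go
package main

import "strings"

// Clause is a single attribute test inside a rule.
type Clause struct {
	Attribute string
	Op        string // "endsWith" or "equals"
	Value     string
}

// Rule serves Variation when all of its clauses match.
type Rule struct {
	Clauses   []Clause
	Variation bool
}

// matches reports whether every clause holds for the given attributes.
func (r Rule) matches(attrs map[string]string) bool {
	for _, c := range r.Clauses {
		got, ok := attrs[c.Attribute]
		if !ok {
			return false // Missing attribute never matches
		}
		switch c.Op {
		case "endsWith":
			if !strings.HasSuffix(got, c.Value) {
				return false
			}
		case "equals":
			if got != c.Value {
				return false
			}
		default:
			return false // Unknown operators never match
		}
	}
	return true
}

// Evaluate returns the variation of the first matching rule,
// or the fallback when no rule matches.
func Evaluate(rules []Rule, attrs map[string]string, fallback bool) bool {
	for _, r := range rules {
		if r.matches(attrs) {
			return r.Variation
		}
	}
	return fallback
}
```

Rule order matters with first-match-wins evaluation, which is one reason to keep the rule list short.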

Implementation

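# Python SDK example (launchdarkly-server-sdk, imported as ldclient)
import ldclient
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
ld_client = ldclient.get()

def get_search_results(user, query):
    # Create user context (dict form used by older SDK versions;
    # newer versions build this with Context.builder)
    context = {
        "key": str(user.id),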
        "email": user.email,
        "custom": {
            "plan": user.plan,
            "beta_user": user.is_beta
        }
    }

    # Evaluate flag
    use_new_algorithm = ld_client.variation(
        "new_search_algorithm",
        context,
        False  # default
    )

    if use_new_algorithm:
        return new_search(query)
    else:
        return legacy_search(query)

Architecture

Flag Evaluation

evaluation_options:
  server_side:
    description: Evaluate on backend
    pros:
      - Secure (rules not exposed)
      - Works for any client
    cons:
      - Network latency
      - Requires SDK integration

  client_side:
    description: Evaluate in browser/app
    pros:
      - Lower latency
      - Works offline
    cons:
      - Rules visible to users
      - Bundle size

  edge:
    description: Evaluate at CDN/edge
    pros:
      - Very low latency
      - Can affect static content
    cons:
      - Limited context
      - More complex setup
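
The "rules not exposed" advantage of server-side evaluation can be sketched as follows: the backend evaluates every flag for the current user and sends the client only the resulting booleans. `ServerFlag`, `AllowedPlans`, and `ClientPayload` are hypothetical names, and plan-based targeting stands in for whatever rules a real system would use.

```go
package main

import "encoding/json"

// ServerFlag carries a targeting rule that should never leave the backend.
type ServerFlag struct {
	Key          string
	AllowedPlans []string
}

// enabledFor evaluates the rule for a single plan.
func (f ServerFlag) enabledFor(plan string) bool {
	for _, p := range f.AllowedPlans {
		if p == plan {
			return true
		}
	}
	return false
}

// ClientPayload evaluates all flags for the user's plan and returns JSON
// containing only flag keys and their evaluated values, not the rules.
func ClientPayload(flags []ServerFlag, plan string) ([]byte, error) {
	out := make(map[string]bool, len(flags))
	for _, f := range flags {
		out[f.Key] = f.enabledFor(plan)
	}
	return json.Marshal(out)
}
```

The client still learns which flag keys exist, so sensitive names may need filtering as well.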

Caching and Performance

// Flag caching pattern
type FlagCache struct {
    flags     map[string]*Flag
    lastSync  time.Time
    syncMu    sync.RWMutex
    client    *RemoteFlagClient
}

func (c *FlagCache) IsEnabled(ctx context.Context, flagName string) bool {
    c.syncMu.RLock()
    flag, ok := c.flags[flagName]
    c.syncMu.RUnlock()

    if !ok {
        // Flag doesn't exist, return safe default
        return false
    }

    return flag.Evaluate(ctx)
}

func (c *FlagCache) StartSync(interval time.Duration) {
    go func() {
        ticker := time.NewTicker(interval)
        for range ticker.C {
            c.sync()
        }
    }()
}

func (c *FlagCache) sync() {
    flags, err := c.client.FetchFlags()
    if err != nil {
        log.Warn("failed to sync flags", "error", err)
        return  // Keep using cached flags
    }

    c.syncMu.Lock()
    c.flags = flags
    c.lastSync = time.Now()
    c.syncMu.Unlock()
}
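
An alternative to the read-write mutex is to publish an immutable snapshot and swap it atomically on each sync, so reads on the hot path never take a lock. This is a sketch of that variation, not a replacement for the pattern above. `Snapshot` and `SnapshotCache` are hypothetical names, and flags are reduced to plain booleans to keep it short.

```go
package main

import "sync/atomic"

// Snapshot is an immutable view of all flag values at one point in time.
type Snapshot map[string]bool

// SnapshotCache serves reads from the most recently published snapshot.
type SnapshotCache struct {
	current atomic.Value // always holds a Snapshot
}

// NewSnapshotCache starts with an empty snapshot so reads never see nil.
func NewSnapshotCache() *SnapshotCache {
	c := &SnapshotCache{}
	c.current.Store(Snapshot{})
	return c
}

// IsEnabled reads from the latest snapshot; unknown flags are off.
func (c *SnapshotCache) IsEnabled(name string) bool {
	return c.current.Load().(Snapshot)[name]
}

// Replace copies the fetched flags and publishes the copy, so later
// changes to the caller's map cannot affect concurrent readers.
func (c *SnapshotCache) Replace(flags map[string]bool) {
	next := make(Snapshot, len(flags))
	for k, v := range flags {
		next[k] = v
	}
	c.current.Store(next)
}
```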

Flag Lifecycle Management

Creation

flag_creation_process:
  required_fields:
    - name: Unique identifier (snake_case)
    - description: What does this flag control?
    - owner: Who is responsible?
    - type: release | experiment | ops | permission
    - expected_removal_date: When should this be cleaned up?

  review:
    - Is a flag necessary?
    - Is scope appropriate?
    - Are targeting rules correct?
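
The required fields can be checked automatically before a flag is created. The sketch below is one possible validator. `FlagSpec` and `Validate` are hypothetical names, and treating the removal date as optional for ops and permission flags is an assumption of this sketch, since those types are long-lived.

```go
package main

import (
	"errors"
	"regexp"
)

// FlagSpec holds the metadata required when creating a flag.
type FlagSpec struct {
	Name, Description, Owner, Type string
	ExpectedRemoval               string // e.g. "2021-12-01"
}

// snakeCase accepts names like "new_checkout_flow".
var snakeCase = regexp.MustCompile(`^[a-z][a-z0-9]*(_[a-z0-9]+)*$`)

var validTypes = map[string]bool{
	"release": true, "experiment": true, "ops": true, "permission": true,
}

// Validate returns the first problem found in the spec, or nil.
func Validate(s FlagSpec) error {
	shortLived := s.Type == "release" || s.Type == "experiment"
	switch {
	case !snakeCase.MatchString(s.Name):
		return errors.New("name must be snake_case")
	case s.Description == "":
		return errors.New("description is required")
	case s.Owner == "":
		return errors.New("owner is required")
	case !validTypes[s.Type]:
		return errors.New("type must be release, experiment, ops, or permission")
	case shortLived && s.ExpectedRemoval == "":
		return errors.New("short-lived flags need an expected removal date")
	}
	return nil
}
```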

Monitoring

flag_metrics:
  evaluation_count:
    by_variation: true
    alert_on: Unexpected distribution shift

  feature_performance:
    new_vs_old: true
    metrics:
      - latency
      - error_rate
      - conversion (if relevant)

  flag_staleness:
    alert_when: Flag unchanged for > expected_lifetime

// Metrics during evaluation
func (c *FlagClient) IsEnabled(ctx context.Context, flagName string) bool {
    result := c.evaluate(ctx, flagName)

    // Record evaluation
    c.metrics.Inc("flag_evaluations_total",
        "flag", flagName,
        "variation", strconv.FormatBool(result),
    )

    return result
}
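
With per-variation counts recorded, the "unexpected distribution shift" alert can be approximated by comparing the observed share of `true` evaluations against the configured rollout percentage. This is a rough sketch: `ShiftDetected` and its tolerance parameter are assumptions, and evaluation counts are not the same as unique users, so heavy users can skew the ratio.

```go
package main

import "math"

// ShiftDetected reports whether the observed percentage of "true"
// evaluations differs from the expected rollout percentage by more
// than tolerancePct percentage points.
func ShiftDetected(trueCount, total int, expectedPct, tolerancePct float64) bool {
	if total == 0 {
		return false // No traffic yet, nothing to compare
	}
	observed := 100 * float64(trueCount) / float64(total)
	return math.Abs(observed-expectedPct) > tolerancePct
}
```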

Cleanup

flag_cleanup:
  when_to_remove:
    - Release fully rolled out (100%)
    - Experiment concluded
    - Feature deprecated

  removal_process:
    1. Verify flag is at 100% (or the experiment decision is made)
    2. Remove flag checks from code
    3. Deploy code changes
    4. Archive flag in management system
    5. Delete flag

  technical_debt:
    - Track flags by age
    - Review stale flags monthly
    - Set flag limits per service

Anti-Patterns

What to Avoid

flag_anti_patterns:
  nested_flags:
    bad: |
      if flag_a:
        if flag_b:
          if flag_c:
            # Complex state
    better: Combine into single flag or refactor

  flags_in_flags:
    bad: Using flags to control other flags
    better: Simple, independent flags

  permanent_release_flags:
    bad: Release flags that never get removed
    better: Set expiration, enforce cleanup

  flag_proliferation:
    bad: Hundreds of flags with unclear ownership
    better: Limit flags, require justification

  complex_targeting:
    bad: 15 rules with complex logic
    better: Simple targeting, multiple flags if needed
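
One possible refactor for nested flag checks is to resolve the flags once into a single mode and let call sites switch on that mode. This is a sketch, not the only option. `CheckoutMode`, `ResolveCheckoutMode`, and the `checkout_wallet` flag are hypothetical.

```go
package main

// CheckoutMode is the single decision derived from all related flags.
type CheckoutMode int

const (
	LegacyCheckout CheckoutMode = iota
	NewCheckout
	NewCheckoutWithWallet
)

// ResolveCheckoutMode evaluates the related flags in one place so the
// rest of the code never nests flag checks.
func ResolveCheckoutMode(isEnabled func(flagName string) bool) CheckoutMode {
	if !isEnabled("new_checkout_flow") {
		return LegacyCheckout
	}
	if isEnabled("checkout_wallet") {
		return NewCheckoutWithWallet
	}
	return NewCheckout
}
```

The valid combinations are now enumerated explicitly, so each mode can be tested on its own, and removing a flag later touches one function instead of every call site.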

Testing

Testing with Flags

// Test both variations
func TestCheckout(t *testing.T) {
    testCases := []struct {
        name       string
        flagValue  bool
        expected   string
    }{
        {"new flow enabled", true, "new_checkout_result"},
        {"new flow disabled", false, "legacy_checkout_result"},
    }

    for _, tc := range testCases {
        t.Run(tc.name, func(t *testing.T) {
            // Mock flag client
            flags := &MockFlagClient{
                values: map[string]bool{
                    "new_checkout_flow": tc.flagValue,
                },
            }

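            service := NewCheckoutService(flags)
            // Assumes a ProcessOrder variant that also returns an identifier
            // for the flow that handled the order; testOrder is a shared fixture
            result, err := service.ProcessOrder(context.Background(), testOrder)

            assert.NoError(t, err)
            assert.Equal(t, tc.expected, result)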
        })
    }
}
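
The test above relies on a `MockFlagClient` that is not shown. A minimal version might look like the sketch below, assuming the service depends on a small interface (here called `FlagProvider`, a hypothetical name) rather than on a concrete SDK client. The `Checked` field is an addition that lets tests assert which flags were consulted.

```go
package main

import "context"

// FlagProvider is the assumed interface the service depends on.
type FlagProvider interface {
	IsEnabled(ctx context.Context, flagName string) bool
}

// MockFlagClient returns fixed values and records every flag lookup.
type MockFlagClient struct {
	values  map[string]bool
	Checked []string
}

// IsEnabled returns the configured value; unknown flags are off.
func (m *MockFlagClient) IsEnabled(_ context.Context, flagName string) bool {
	m.Checked = append(m.Checked, flagName)
	return m.values[flagName]
}
```

Depending on an interface keeps tests free of network calls and makes it easy to cover both variations of every flag.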

Key Takeaways

Feature flags are powerful but require discipline. Without lifecycle management, they become a burden instead of a benefit.