Building High-Performance Go Services

June 25, 2018

Go was designed for building network services. Its concurrency primitives, garbage collector, and runtime make it excellent for high-throughput, low-latency applications. But writing performant Go requires understanding how the runtime works and following specific patterns.

Here’s how to build Go services that perform at scale.

Understanding Go’s Runtime

Goroutines and Scheduling

Goroutines are lightweight threads managed by Go’s runtime: each one starts with a stack of a few kilobytes, and the scheduler multiplexes many thousands of them onto a small pool of OS threads.

The scheduler is cooperative: goroutines yield at specific points, such as channel operations, blocking system calls, time.Sleep, and ordinary function calls.

Long-running computations that never hit one of these points can monopolize an OS thread and starve other goroutines.
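
For example (a sketch; checksum is a made-up hot loop), a pure computation loop with no function calls or channel operations gives the scheduler nothing to work with, while an explicit runtime.Gosched hands control back:

import "runtime"

// No function calls, channel operations, or allocations inside the
// loop, so the cooperative scheduler cannot preempt it.
func checksum(data []byte) uint64 {
    var sum uint64
    for _, b := range data {
        sum += uint64(b)
    }
    return sum
}

// Yields periodically so other goroutines on this thread get a turn.
func checksumCooperative(data []byte) uint64 {
    var sum uint64
    for i, b := range data {
        sum += uint64(b)
        if i%65536 == 0 {
            runtime.Gosched()
        }
    }
    return sum
}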

Memory Allocation

The Go allocator is optimized for small allocations: objects are grouped into size classes, and most small allocations are served from per-CPU caches without taking a global lock.

But allocation isn’t free: every allocation costs CPU on the hot path and adds work for the garbage collector later. Reducing allocations improves performance.
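
A quick way to check whether a hot function allocates is testing.AllocsPerRun; buildResponse here is a hypothetical stand-in for your own hot path:

import "testing"

func TestBuildResponseAllocs(t *testing.T) {
    allocs := testing.AllocsPerRun(1000, func() {
        buildResponse()  // hypothetical hot-path function
    })
    if allocs > 2 {
        t.Errorf("buildResponse averaged %.0f allocs per call, want <= 2", allocs)
    }
}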

Garbage Collection

Go’s GC is concurrent and low-latency: a concurrent mark-and-sweep collector runs alongside your goroutines, with only brief stop-the-world pauses at the start and end of each cycle.

GC time correlates with live heap size and allocation rate. Reduce allocations, reduce GC pressure.
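
To watch this in practice, run the service with GC tracing enabled; the runtime prints one summary line per collection showing pause times and heap sizes:

GODEBUG=gctrace=1 ./myservice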

Reducing Allocations

Profiling First

Before optimizing, profile:

import (
    "net/http"
    _ "net/http/pprof"  // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    go func() {
        http.ListenAndServe("localhost:6060", nil)
    }()
    // ...
}

Then pull profiles from the running service:

go tool pprof http://localhost:6060/debug/pprof/heap
go tool pprof http://localhost:6060/debug/pprof/allocs

Focus on hot paths: in most services, the bulk of allocations comes from a handful of call sites.
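
Inside an interactive pprof session, top lists the heaviest call sites and list zooms into a single function (handleRequest is just an example name):

(pprof) top 10
(pprof) list handleRequest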

Stack vs Heap

Variables escape to the heap when they must outlive their stack frame: when a pointer to them is returned or stored elsewhere, when a closure that outlives the call captures them, when they are too large for the stack, or when their size isn’t known at compile time.

Check escape analysis:

go build -gcflags="-m" ./...

// Escapes - returned pointer
func newUser() *User {
    return &User{}  // Allocated on heap
}

// Doesn't escape
func processUser() {
    user := User{}  // Stack allocated
    doSomething(&user)
}

sync.Pool

Reuse allocations with sync.Pool:

var bufferPool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 4096)
    },
}

func handleRequest(r *Request) {
    buf := bufferPool.Get().([]byte)
    defer bufferPool.Put(buf)

    // Use buffer
    n := copy(buf, r.Body)
    process(buf[:n])
}

sync.Pool reduces allocations for objects that are created and discarded at high rates, such as per-request buffers.

Avoid Interface Allocation

Storing values in interfaces can allocate:

// Can allocate - the value is boxed to fit in the interface
var x interface{} = 42

// Better - use concrete types when possible
var x int = 42

Preallocate Slices

// Grows multiple times
result := []Item{}
for _, v := range input {
    result = append(result, transform(v))
}

// Preallocated - no growth
result := make([]Item, 0, len(input))
for _, v := range input {
    result = append(result, transform(v))
}

Strings and Bytes

String operations often allocate:

// Many allocations
s := ""
for _, part := range parts {
    s += part
}

// Far fewer allocations - Builder reuses a growing buffer
var b strings.Builder
for _, part := range parts {
    b.WriteString(part)
}
s := b.String()
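
If the total length is known up front, Builder.Grow reserves it once, so the whole concatenation really is a single allocation:

var b strings.Builder
total := 0
for _, part := range parts {
    total += len(part)
}
b.Grow(total)  // reserve the full result size once
for _, part := range parts {
    b.WriteString(part)
}
s := b.String()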

Concurrency Patterns

Worker Pools

Process work with bounded concurrency:

func workerPool(jobs <-chan Job, results chan<- Result, workers int) {
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs {
                results <- process(job)
            }
        }()
    }
    // Close results once every worker has finished, without
    // blocking the caller.
    go func() {
        wg.Wait()
        close(results)
    }()
}

Workers limit concurrency and reduce goroutine creation overhead.
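
A typical call site feeds jobs, closes the channel so workers can exit, and drains results (pending and handle are placeholders):

jobs := make(chan Job, 100)
results := make(chan Result, 100)
workerPool(jobs, results, 8)

go func() {
    for _, job := range pending {
        jobs <- job
    }
    close(jobs)  // lets the workers' range loops finish
}()

for result := range results {
    handle(result)
}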

Bounded Channels

Unbounded work queues cause memory issues:

// Dangerous - spawning a goroutine per job queues unbounded work
for job := range incoming {
    go process(job)
}

// Better - a fixed-capacity channel provides backpressure
jobs := make(chan Job, 1000)  // sends block when full

Backpressure prevents memory exhaustion under load.

Context for Cancellation

Propagate cancellation through call chain:

func handleRequest(ctx context.Context) error {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()

    resultCh := make(chan Result, 1)  // buffered so the send succeeds even after a timeout
    go func() {
        resultCh <- slowOperation()
    }()

    select {
    case result := <-resultCh:
        return processResult(result)
    case <-ctx.Done():
        return ctx.Err()  // Timeout or cancellation
    }
}

Always respect context cancellation.

Avoid Goroutine Leaks

Goroutines that never exit leak memory:

// Leak - nothing ever stops this goroutine
go func() {
    for {
        select {
        case v := <-input:
            process(v)
        }
    }
}()

// Fixed - exits on done
go func() {
    for {
        select {
        case v := <-input:
            process(v)
        case <-done:
            return
        }
    }
}()

HTTP Server Optimization

Connection Handling

Default settings may not be optimal:

server := &http.Server{
    Addr:           ":8080",
    Handler:        handler,
    ReadTimeout:    5 * time.Second,
    WriteTimeout:   10 * time.Second,
    IdleTimeout:    120 * time.Second,
    MaxHeaderBytes: 1 << 20,
}

Tune timeouts for your workload.

Connection Pooling for Clients

Reuse connections:

var client = &http.Client{
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 100,
        IdleConnTimeout:     90 * time.Second,
    },
    Timeout: 10 * time.Second,
}

Create one client, reuse it.

JSON Performance

The standard library’s encoding/json is solid, but it isn’t the fastest option:

// Standard library
json.Marshal(v)
json.Unmarshal(data, &v)

// Faster alternatives
// jsoniter - drop-in replacement
import jsoniter "github.com/json-iterator/go"

var json = jsoniter.ConfigCompatibleWithStandardLibrary
json.Marshal(v)

// easyjson - code generation
// Requires generating marshaling code

For hot paths, consider alternatives.

Response Compression

Compress responses:

import "github.com/NYTimes/gziphandler"

func compressHandler(next http.Handler) http.Handler {
    return gziphandler.GzipHandler(next)
}

Reduces bandwidth, improves perceived latency.
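
Wiring it up is one line at server construction; listItems is a placeholder handler, and compression only kicks in when the client sends Accept-Encoding: gzip:

mux := http.NewServeMux()
mux.HandleFunc("/api/items", listItems)

http.ListenAndServe(":8080", compressHandler(mux))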

Database Access

Connection Pooling

Configure pool appropriately:

db.SetMaxOpenConns(25)
db.SetMaxIdleConns(25)
db.SetConnMaxLifetime(5 * time.Minute)

Tune based on database capacity and workload.

Prepared Statements

Reuse prepared statements:

stmt, err := db.Prepare("SELECT * FROM users WHERE id = ?")
if err != nil {
    return err
}
defer stmt.Close()

// Reuse for many queries
for _, id := range userIds {
    rows, err := stmt.Query(id)
    // ... iterate rows, check err, then rows.Close()
}

Batch Operations

Cut per-statement overhead by batching inserts in a transaction:

// Slow - autocommit: each insert is its own transaction
for _, item := range items {
    db.Exec("INSERT INTO items VALUES (?)", item)
}

// Faster - one prepared statement, one transaction for the whole batch
tx, _ := db.Begin()
stmt, _ := tx.Prepare("INSERT INTO items VALUES (?)")
for _, item := range items {
    stmt.Exec(item)
}
stmt.Close()
tx.Commit()

Benchmarking

Write Benchmarks

func BenchmarkProcess(b *testing.B) {
    input := generateInput()
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        process(input)
    }
}

func BenchmarkProcessParallel(b *testing.B) {
    input := generateInput()
    b.ResetTimer()

    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            process(input)
        }
    })
}

Run Benchmarks

go test -bench=. -benchmem -count=5 ./...

Compare before and after:

go test -bench=. -count=10 > old.txt
# Make changes
go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt

Production Considerations

GOMAXPROCS

GOMAXPROCS defaults to the number of CPUs the machine reports. Inside a container with a CPU quota, that can be far more than you’re actually allowed to use:

import _ "go.uber.org/automaxprocs"  // Respects container limits

In containers, detect actual CPU limits.
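
It also pays to log what the runtime actually settled on at startup; passing 0 to runtime.GOMAXPROCS reads the current value without changing it:

import (
    "log"
    "runtime"
)

func init() {
    log.Printf("GOMAXPROCS=%d NumCPU=%d", runtime.GOMAXPROCS(0), runtime.NumCPU())
}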

Memory Limits

Set soft memory limit (Go 1.19+):

import "runtime/debug"

func init() {
    debug.SetMemoryLimit(500 * 1024 * 1024)  // 500MB
}

Or via environment:

GOMEMLIMIT=500MiB ./myservice

Observability

Expose runtime metrics:

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

http.Handle("/metrics", promhttp.Handler())

Monitor goroutine count, heap size, GC pause times.
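
If you want to see the same numbers without Prometheus, a rough sketch is to poll the runtime directly (the 30-second interval is arbitrary):

import (
    "log"
    "runtime"
    "time"
)

func logRuntimeStats() {
    var m runtime.MemStats
    for range time.Tick(30 * time.Second) {
        runtime.ReadMemStats(&m)  // briefly stops the world
        log.Printf("goroutines=%d heap=%dMiB last_gc_pause=%v",
            runtime.NumGoroutine(),
            m.HeapAlloc>>20,
            time.Duration(m.PauseNs[(m.NumGC+255)%256]))
    }
}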

Key Takeaways

Go performs well by default. For demanding workloads, the gains come from understanding the runtime: profile first, cut allocations on hot paths, bound your concurrency, reuse connections, and measure every change with benchmarks.