Kubernetes Operators: Extending the Platform

Kubernetes manages containerized applications well, but complex stateful applications need more than container orchestration. Databases require backups, schema migrations, and replica management. Message queues need partition rebalancing. Monitoring systems need configuration across clusters.

Operators extend Kubernetes to manage these complex applications automatically. They encode human operational knowledge into software.

What Operators Are

The Concept

An operator is a controller that:

Watches for custom resources (your application-specific objects)
Compares desired state to actual state
Takes actions to reconcile differences

User → Custom Resource → Operator → Kubernetes Resources
        (desired state)             (actual state)
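
The compare-and-act steps above can be sketched without any Kubernetes dependency. This is a minimal illustration under stated assumptions, not a real controller: the cluster type and the reconcile function are hypothetical stand-ins, the watch step is reduced to a plain loop, and "acting" only changes a counter.

```go
package main

import "fmt"

// cluster is a hypothetical stand-in for the real system: it only tracks
// how many replicas are currently running.
type cluster struct {
	running int
}

// reconcile compares the desired replica count with the actual one and takes
// a single step toward it. It returns true once nothing is left to do.
func reconcile(desired int, c *cluster) bool {
	switch {
	case c.running < desired:
		c.running++ // act: start one replica
		return false
	case c.running > desired:
		c.running-- // act: stop one replica
		return false
	default:
		return true // actual state already matches desired state
	}
}

func main() {
	c := &cluster{running: 1}
	desired := 3 // what the user asked for in the custom resource

	// Keep reconciling until the two states converge.
	steps := 0
	for !reconcile(desired, c) {
		steps++
	}
	fmt.Printf("converged to %d replicas after %d steps\n", c.running, steps)
}
```

Because each call looks only at the current state, the same function handles scale-up, scale-down, and the no-op case.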

Custom Resources

Custom Resource Definitions (CRDs) extend the Kubernetes API:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  names:
    kind: Database
    plural: databases
    singular: database
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
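    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              engine:
                type: string
                enum: ["postgres", "mysql"]
              version:
                type: string
              replicas:
                type: integer
          status:
            type: object
            properties:
              readyReplicas:
                type: integer
    # Enable the /status subresource so the controller can update status separately
    subresources:
      status: {}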

Users create instances:

apiVersion: example.com/v1
kind: Database
metadata:
  name: orders-db
spec:
  engine: postgres
  version: "14"
  replicas: 3

Control Loop

Operators implement the reconciliation loop:

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. Fetch the Database custom resource
    var db examplev1.Database
    if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

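    // 2. Check current state
    currentReplicas := r.getCurrentReplicaCount(ctx, &db)

    // 3. Reconcile to desired state (scaleUp and scaleDown are assumed to
    //    return an error; returning it makes controller-runtime retry with backoff)
    if currentReplicas < db.Spec.Replicas {
        if err := r.scaleUp(ctx, &db); err != nil {
            return ctrl.Result{}, err
        }
    } else if currentReplicas > db.Spec.Replicas {
        if err := r.scaleDown(ctx, &db); err != nil {
            return ctrl.Result{}, err
        }
    }

    // 4. Update status through the status subresource
    db.Status.ReadyReplicas = r.getReadyReplicas(ctx, &db)
    if err := r.Status().Update(ctx, &db); err != nil {
        return ctrl.Result{}, err
    }

    // Re-check periodically even if no watch event arrives
    return ctrl.Result{RequeueAfter: time.Minute}, nil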
}

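Reconcile is called whenever the Database resource or another resource the controller watches changes. RequeueAfter adds a periodic re-check, so drift is corrected even when no event arrives.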

Why Operators Matter

Encoded Operational Knowledge

Consider PostgreSQL:

Primary needs specific configuration
Replicas need streaming replication setup
Failover requires promoting replica and reconfiguring clients
Backups need scheduling and retention
Upgrades require specific sequencing

A PostgreSQL operator encodes this knowledge:

apiVersion: postgres.example.com/v1
kind: PostgresCluster
metadata:
  name: my-cluster
spec:
  version: "14"
  instances: 3
  backup:
    schedule: "0 * * * *"  # Hourly
    retention: 7d

The operator handles replication, failover, backups, and upgrades automatically.

Day 2 Operations

Day 1 (initial deployment) is often easy. Day 2 (ongoing operations) is hard:

Scaling up and down
Rolling upgrades
Backup and restore
Failure recovery
Configuration changes

Operators automate Day 2 operations.

Self-Healing

Operators continuously reconcile:

Database replica crashes →
Operator detects missing replica →
Operator creates replacement →
Operator configures replication →
Cluster healthy again

Recovery happens automatically without human intervention.
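
The recovery sequence above can be modelled in a few lines. This is a toy sketch rather than a real operator: the replica type, the heal function, and the naming scheme (name-0, name-1, ...) are assumptions made for illustration, and every instance is treated as a replica.

```go
package main

import "fmt"

// replica is a hypothetical model of one database instance.
type replica struct {
	name        string
	replicating bool // true once streaming from the primary is configured
}

// heal recreates any missing replica among name-0..name-(desired-1) and
// configures replication on it. It returns the names it had to recreate.
func heal(name string, desired int, running map[string]*replica) []string {
	var recreated []string
	for i := 0; i < desired; i++ {
		id := fmt.Sprintf("%s-%d", name, i)
		if _, ok := running[id]; ok {
			continue // replica is present, nothing to do
		}
		// Missing replica detected: create a replacement and configure replication.
		running[id] = &replica{name: id, replicating: true}
		recreated = append(recreated, id)
	}
	return recreated
}

func main() {
	running := map[string]*replica{}
	heal("orders-db", 3, running) // initial creation

	delete(running, "orders-db-1") // simulate a crashed replica

	fmt.Println("recreated:", heal("orders-db", 3, running))
	fmt.Println("healthy replicas:", len(running))
}
```

The same function covers initial creation and recovery, which is why operators do not need a separate "repair" code path.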

Building Operators

Operator SDK

The Operator SDK simplifies operator development:

# Create new operator project
operator-sdk init --domain=example.com --repo=github.com/example/db-operator

# Create API and controller
operator-sdk create api --group=database --version=v1 --kind=Database

This scaffolds:

Custom Resource Definition
Controller skeleton
RBAC configuration
Build and deployment manifests

Kubebuilder

Kubebuilder, a Kubernetes SIG project, provides similar scaffolding; Operator SDK's Go scaffolding is itself built on Kubebuilder:

kubebuilder init --domain=example.com
kubebuilder create api --group=database --version=v1 --kind=Database

Both generate Go-based operators; Operator SDK can also build operators from Helm charts or Ansible playbooks. For other languages, consider Kopf (Python) or Java Operator SDK.

Implementation Pattern

A typical operator:

// types.go - Define the custom resource
type DatabaseSpec struct {
    Engine   string `json:"engine"`
    Version  string `json:"version"`
    Replicas int32  `json:"replicas"`
}

type DatabaseStatus struct {
    Phase         string             `json:"phase"`
    ReadyReplicas int32              `json:"readyReplicas"`
    Conditions    []metav1.Condition `json:"conditions,omitempty"`
}

// controller.go - Implement reconciliation
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
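    // Logger carried in the context by controller-runtime
    // (log is sigs.k8s.io/controller-runtime/pkg/log)
    logger := log.FromContext(ctx)
    logger.Info("Reconciling Database")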

    // Fetch the resource
    var db databasev1.Database
    if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Create or update StatefulSet
    if err := r.reconcileStatefulSet(ctx, &db); err != nil {
        return ctrl.Result{}, err
    }

    // Create or update Service
    if err := r.reconcileService(ctx, &db); err != nil {
        return ctrl.Result{}, err
    }

    // Update status
    return r.updateStatus(ctx, &db)
}

Best Practices

Idempotency: Reconciliation must be safe to run multiple times.
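
One way to get idempotency is to write every step as "ensure X exists in this shape" rather than "create X". The sketch below shows the idea on an in-memory map; the store type and the ensure function are hypothetical and stand in for calls to the Kubernetes API.

```go
package main

import "fmt"

// store is a hypothetical stand-in for the cluster: object name -> replica count.
type store map[string]int32

// ensure makes sure the named object exists with the desired replica count.
// It reports whether it had to change anything, so calling it again with the
// same input is a no-op.
func ensure(s store, name string, replicas int32) (changed bool) {
	current, exists := s[name]
	if exists && current == replicas {
		return false // already in the desired state
	}
	s[name] = replicas // create or update, never create blindly
	return true
}

func main() {
	s := store{}
	fmt.Println(ensure(s, "orders-db", 3)) // first run creates the object
	fmt.Println(ensure(s, "orders-db", 3)) // second run changes nothing
	fmt.Println(ensure(s, "orders-db", 5)) // spec changed, so it updates
}
```

A reconciler built from steps like this can be interrupted and retried at any point without creating duplicates.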

Owned Resources: Set owner references so dependent resources are garbage collected:

if err := ctrl.SetControllerReference(&db, &statefulSet, r.Scheme); err != nil {
    return ctrl.Result{}, err
}

Status Conditions: Use standard condition patterns for status:

meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
    Type:    "Ready",
    Status:  metav1.ConditionTrue,
    Reason:  "AllReplicasReady",
    Message: "All replicas are running and ready",
})

Finalizers: Clean up external resources before deletion:

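const finalizerName = "database.example.com/finalizer"

if !db.DeletionTimestamp.IsZero() {
    if containsString(db.Finalizers, finalizerName) {
        // Clean up external resources. On failure, return the error so the
        // request is retried and the finalizer stays in place.
        // (cleanupExternalResources is assumed to return an error.)
        if err := r.cleanupExternalResources(ctx, &db); err != nil {
            return ctrl.Result{}, err
        }
        // Remove the finalizer so Kubernetes can finish deleting the object
        db.Finalizers = removeString(db.Finalizers, finalizerName)
        if err := r.Update(ctx, &db); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil
}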
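
The snippet above relies on two helpers, containsString and removeString, that are not shown. One possible implementation is sketched below; the bodies are an assumption, not the author's code. The controllerutil package in controller-runtime also provides ContainsFinalizer, AddFinalizer, and RemoveFinalizer, which can replace these helpers. Either way, the finalizer has to be added to the object while it is not being deleted, or the cleanup branch never runs.

```go
package main

import "fmt"

// containsString reports whether s is present in slice.
func containsString(slice []string, s string) bool {
	for _, item := range slice {
		if item == s {
			return true
		}
	}
	return false
}

// removeString returns a new slice with every occurrence of s removed.
// The input slice is left unmodified.
func removeString(slice []string, s string) []string {
	result := make([]string, 0, len(slice))
	for _, item := range slice {
		if item != s {
			result = append(result, item)
		}
	}
	return result
}

func main() {
	finalizers := []string{
		"database.example.com/finalizer",
		"other.example.com/finalizer", // hypothetical finalizer owned by another controller
	}
	fmt.Println(containsString(finalizers, "database.example.com/finalizer"))

	// Only our own finalizer is removed; others are preserved.
	finalizers = removeString(finalizers, "database.example.com/finalizer")
	fmt.Println(finalizers)
}
```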

Operator Maturity Levels

The Operator Framework defines five capability levels, which OperatorHub.io shows for each listed operator:

Level 1: Basic Install

Automated installation
Basic configuration via spec

Level 2: Seamless Upgrades

Automated patch and minor version upgrades
Upgrade without downtime

Level 3: Full Lifecycle

Backup and restore
Failure recovery
Scaling operations

Level 4: Deep Insights

Metrics integration
Alerting rules
Log aggregation

Level 5: Auto Pilot

Automatic scaling
Auto-healing
Performance tuning

Most operators start at Level 1 or 2. Reaching the higher levels requires significant engineering investment.

When to Build vs. Use

Use Existing Operators

Many mature operators exist:

PostgreSQL: Crunchy, Zalando, CloudNativePG
MySQL: Oracle, Percona, Vitess
Kafka: Strimzi
Elasticsearch: Elastic Cloud on Kubernetes (ECK)
Redis: Spotahome, Redis Enterprise
Prometheus: Prometheus Operator

Prefer mature operators unless you have specific requirements.

Build When

No existing operator meets your needs
You have deep operational expertise to encode
Your internal systems need custom automation
Learning is a goal (operators teach Kubernetes internals)

Buy When

Commercial operators provide needed features
Support and guarantees matter
Time-to-value is critical

Operational Considerations

Operator Deployment

Run the operator as a Deployment with its own service account and resource limits:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: database-operator
spec:
  replicas: 1  # Usually 1 (leader election if more)
  selector:
    matchLabels:
      app: database-operator
  template:
    metadata:
      labels:
        app: database-operator  # Must match spec.selector.matchLabels
    spec:
      serviceAccountName: database-operator
      containers:
      - name: operator
        image: example/database-operator:v1.0.0
        resources:
          limits:
            memory: 256Mi
            cpu: 500m

RBAC

Operators need appropriate permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: database-operator
rules:
- apiGroups: ["database.example.com"]
  resources: ["databases"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
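- apiGroups: ["database.example.com"]
  resources: ["databases/status"]  # Needed for Status().Update()
  verbs: ["get", "update", "patch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["services"]  # The controller also manages a Service
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]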

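Follow least privilege: grant only the resources and verbs the operator actually uses. If the operator manages a single namespace, a namespaced Role is enough.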

Monitoring

Monitor operators themselves:

Reconciliation latency
Error rates
Queue depth
Resource usage

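Operators scaffolded with Operator SDK or Kubebuilder expose Prometheus metrics by default through controller-runtime, including controller_runtime_reconcile_total, controller_runtime_reconcile_errors_total, controller_runtime_reconcile_time_seconds, and workqueue_depth.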
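
To make the first two signals in the list above concrete, the sketch below wraps a reconcile function and records latency and failures in memory. It is an illustration only: reconcileStats, observe, and failingReconcile are hypothetical names, and a real operator would normally rely on the Prometheus metrics that controller-runtime already exports instead of a hand-rolled collector.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// reconcileStats is a hypothetical in-memory collector.
type reconcileStats struct {
	total    int
	failed   int
	duration time.Duration // cumulative time spent reconciling
}

// observe runs one reconciliation and records its outcome and latency.
func (s *reconcileStats) observe(reconcile func() error) error {
	start := time.Now()
	err := reconcile()
	s.duration += time.Since(start)
	s.total++
	if err != nil {
		s.failed++
	}
	return err
}

// errorRate returns the fraction of reconciliations that failed.
func (s *reconcileStats) errorRate() float64 {
	if s.total == 0 {
		return 0
	}
	return float64(s.failed) / float64(s.total)
}

// failingReconcile simulates a reconciliation that cannot reach the API server.
func failingReconcile() error {
	return errors.New("API server unavailable")
}

func main() {
	stats := &reconcileStats{}
	_ = stats.observe(func() error { return nil })
	_ = stats.observe(failingReconcile)
	fmt.Printf("runs=%d failed=%d error rate=%.2f\n", stats.total, stats.failed, stats.errorRate())
}
```

A rising error rate or average latency is usually the first sign that the operator can no longer keep resources converged.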

Testing

Test operators thoroughly:

Unit tests for reconciliation logic
Integration tests against real clusters
End-to-end tests for complete workflows
Chaos testing for failure scenarios

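Consider envtest for controller testing. It starts a local kube-apiserver and etcd, so controllers run against a real API without a full cluster:

testEnv = &envtest.Environment{
    CRDDirectoryPaths:     []string{filepath.Join("..", "config", "crd", "bases")},
    ErrorIfCRDPathMissing: true, // Fail fast if the CRD manifests are missing
}

// Start returns a *rest.Config for building a client against the test API server
cfg, err = testEnv.Start()
// Check err here, and call testEnv.Stop() when the suite finishes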

Key Takeaways

Operators extend Kubernetes with application-specific controllers
They encode operational knowledge into automated reconciliation loops
Custom Resource Definitions create application-specific APIs
Operators excel at Day 2 operations: scaling, upgrades, backups, recovery
Use Operator SDK or Kubebuilder to scaffold new operators
Prefer existing mature operators when they meet requirements
Build operators to encode unique operational expertise
Implement idempotency, owner references, conditions, and finalizers
Test thoroughly including failure scenarios
Monitor operator health as critical infrastructure

Operators represent a powerful pattern for extending Kubernetes. They’re especially valuable for stateful applications that need more than basic container management.