Kubernetes manages containerized applications well, but complex stateful applications need more than container orchestration. Databases require backups, schema migrations, and replica management. Message queues need partition rebalancing. Monitoring systems need configuration across clusters.
Operators extend Kubernetes to manage these complex applications automatically. They encode human operational knowledge into software.
What Operators Are
The Concept
An operator is a controller that:
- Watches for custom resources (your application-specific objects)
- Compares desired state to actual state
- Takes actions to reconcile differences
User → Custom Resource → Operator → Kubernetes Resources
       (desired state)              (actual state)
Custom Resources
Custom Resource Definitions (CRDs) extend the Kubernetes API:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  names:
    kind: Database
    plural: databases
    singular: database
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        status: {}  # lets the controller update status separately from spec
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                engine:
                  type: string
                  enum: ["postgres", "mysql"]
                version:
                  type: string
                replicas:
                  type: integer
            status:
              type: object
              x-kubernetes-preserve-unknown-fields: true
Users create instances:
apiVersion: example.com/v1
kind: Database
metadata:
  name: orders-db
spec:
  engine: postgres
  version: "14"
  replicas: 3
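The API server enforces the schema when a Database object is submitted. As a rough illustration of what that check amounts to, here is a minimal Go sketch that applies the same enum rule to the spec fields. The names `DatabaseSpec` and `validateSpec` are hypothetical, and the non-negative replica check is an extra assumption that the schema above does not declare.

```go
package main

import (
	"errors"
	"fmt"
)

// DatabaseSpec mirrors the spec fields declared in the CRD schema.
type DatabaseSpec struct {
	Engine   string
	Version  string
	Replicas int32
}

// validateSpec applies the schema's enum rule for engine, plus a
// non-negative replica check that is an assumption of this sketch.
func validateSpec(s DatabaseSpec) error {
	switch s.Engine {
	case "postgres", "mysql":
		// allowed by the enum
	default:
		return fmt.Errorf("unsupported engine %q", s.Engine)
	}
	if s.Replicas < 0 {
		return errors.New("replicas must not be negative")
	}
	return nil
}

func main() {
	ok := DatabaseSpec{Engine: "postgres", Version: "14", Replicas: 3}
	bad := DatabaseSpec{Engine: "oracle", Version: "19", Replicas: 1}
	fmt.Println("orders-db:", validateSpec(ok))
	fmt.Println("bad spec:", validateSpec(bad))
}
```

In a real cluster this logic lives in the CRD schema (or an admission webhook), so invalid objects are rejected before the operator ever sees them.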
Control Loop
Operators implement the reconciliation loop:
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the Database custom resource; NotFound means it was deleted
	var db examplev1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Check current state (helper methods are not shown here)
	currentReplicas := r.getCurrentReplicaCount(ctx, &db)

	// 3. Reconcile to desired state
	if currentReplicas < db.Spec.Replicas {
		r.scaleUp(ctx, &db)
	} else if currentReplicas > db.Spec.Replicas {
		r.scaleDown(ctx, &db)
	}

	// 4. Update status; return the error so the request is retried on failure
	db.Status.ReadyReplicas = r.getReadyReplicas(ctx, &db)
	if err := r.Status().Update(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}

	// Re-check in a minute even if no watched resource changes
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}
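The loop is not a busy loop: controller-runtime calls Reconcile whenever a watched resource changes, and RequeueAfter schedules a periodic re-check. Together these keep driving actual state toward desired state.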
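The comparison at the heart of the loop is level-triggered: it depends only on the current and desired replica counts, not on which event caused the reconcile. A minimal sketch of that decision as a pure function follows; the `Action` type and the `decide` function are hypothetical names used for illustration.

```go
package main

import "fmt"

// Action is what the reconciler decides to do on one pass.
type Action string

const (
	ScaleUp   Action = "scale-up"
	ScaleDown Action = "scale-down"
	NoOp      Action = "no-op"
)

// decide compares the observed replica count with the desired one and
// returns the action to take and how many replicas to add or remove.
func decide(current, desired int32) (Action, int32) {
	switch {
	case current < desired:
		return ScaleUp, desired - current
	case current > desired:
		return ScaleDown, current - desired
	default:
		return NoOp, 0
	}
}

func main() {
	for _, current := range []int32{1, 3, 5} {
		action, n := decide(current, 3)
		fmt.Printf("current=%d desired=3 -> %s (%d)\n", current, action, n)
	}
}
```

Keeping the decision free of side effects makes it easy to unit test without a cluster.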
Why Operators Matter
Encoded Operational Knowledge
Consider PostgreSQL:
- Primary needs specific configuration
- Replicas need streaming replication setup
- Failover requires promoting replica and reconfiguring clients
- Backups need scheduling and retention
- Upgrades require specific sequencing
A PostgreSQL operator encodes this knowledge:
apiVersion: postgres.example.com/v1
kind: PostgresCluster
metadata:
  name: my-cluster
spec:
  version: "14"
  instances: 3
  backup:
    schedule: "0 * * * *"  # Hourly
    retention: 7d
The operator handles replication, failover, backups, and upgrades automatically.
Day 2 Operations
Day 1 (initial deployment) is often easy. Day 2 (ongoing operations) is hard:
- Scaling up and down
- Rolling upgrades
- Backup and restore
- Failure recovery
- Configuration changes
Operators automate Day 2 operations.
Self-Healing
Operators continuously reconcile:
Database replica crashes →
Operator detects missing replica →
Operator creates replacement →
Operator configures replication →
Cluster healthy again
Recovery happens automatically without human intervention.
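The recovery sequence can be illustrated with a toy model that makes no Kubernetes API calls. In this sketch the `cluster` type, its `reconcile` method, and the `db-N` pod names are all assumptions made for illustration; a real operator would create pods through a StatefulSet and then configure replication.

```go
package main

import "fmt"

// cluster is a toy stand-in for the pods backing one database.
type cluster struct {
	desired int
	pods    map[string]bool // pod name -> running
}

// reconcile recreates any missing pod so that db-0 .. db-(desired-1)
// all exist, and returns the names it had to create.
func (c *cluster) reconcile() []string {
	var created []string
	for i := 0; i < c.desired; i++ {
		name := fmt.Sprintf("db-%d", i)
		if !c.pods[name] {
			c.pods[name] = true
			created = append(created, name)
		}
	}
	return created
}

func main() {
	c := &cluster{desired: 3, pods: map[string]bool{}}
	fmt.Println("initial:", c.reconcile())

	delete(c.pods, "db-1") // simulate a replica crash
	fmt.Println("after crash:", c.reconcile())

	fmt.Println("steady state:", c.reconcile())
}
```

The third call creates nothing, which is the point: once actual state matches desired state, reconciling again is a no-op.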
Building Operators
Operator SDK
The Operator SDK simplifies operator development:
# Create new operator project
operator-sdk init --domain=example.com --repo=github.com/example/db-operator
# Create API and controller
operator-sdk create api --group=database --version=v1 --kind=Database
This scaffolds:
- Custom Resource Definition
- Controller skeleton
- RBAC configuration
- Build and deployment manifests
Kubebuilder
Kubebuilder provides similar scaffolding with a focus on Kubernetes SIG standards:
kubebuilder init --domain=example.com
kubebuilder create api --group=database --version=v1 --kind=Database
Both scaffold Go-based operators (the Operator SDK can also scaffold Ansible- and Helm-based ones). For other languages, consider Kopf (Python) or Java Operator SDK.
Implementation Pattern
A typical operator:
// types.go - Define the custom resource
type DatabaseSpec struct {
	Engine   string `json:"engine"`
	Version  string `json:"version"`
	Replicas int32  `json:"replicas"`
}

type DatabaseStatus struct {
	Phase         string             `json:"phase"`
	ReadyReplicas int32              `json:"readyReplicas"`
	Conditions    []metav1.Condition `json:"conditions,omitempty"`
}

// controller.go - Implement reconciliation
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("database", req.NamespacedName)
	log.Info("Reconciling Database") // log must be used, or the code will not compile

	// Fetch the resource
	var db databasev1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Create or update StatefulSet
	if err := r.reconcileStatefulSet(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}

	// Create or update Service
	if err := r.reconcileService(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}

	// Update status
	return r.updateStatus(ctx, &db)
}
Best Practices
Idempotency: Reconciliation must be safe to run multiple times.
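One common way to get idempotency is a create-or-update step that inspects what exists before acting. The following sketch models that with an in-memory map instead of the API server; the `store` type and `ensure` method are hypothetical names, and the string results are chosen here for illustration.

```go
package main

import "fmt"

// store is a toy stand-in for the API server: object name -> replica count.
type store map[string]int32

// ensure makes the stored object match the desired replica count and
// reports what it had to do. Calling it again with the same arguments
// changes nothing.
func (s store) ensure(name string, replicas int32) string {
	current, exists := s[name]
	switch {
	case !exists:
		s[name] = replicas
		return "created"
	case current != replicas:
		s[name] = replicas
		return "updated"
	default:
		return "unchanged"
	}
}

func main() {
	s := store{}
	fmt.Println(s.ensure("orders-db", 3)) // first pass creates the object
	fmt.Println(s.ensure("orders-db", 3)) // repeating it is a no-op
	fmt.Println(s.ensure("orders-db", 5)) // a spec change triggers an update
}
```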
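Owned Resources: Set owner references so dependent resources are garbage collected when the custom resource is deleted:
// Inside a helper such as reconcileStatefulSet, which returns an error
if err := ctrl.SetControllerReference(&db, &statefulSet, r.Scheme); err != nil {
	return err
}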
Status Conditions: Use standard condition patterns for status:
meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
	Type:    "Ready",
	Status:  metav1.ConditionTrue,
	Reason:  "AllReplicasReady",
	Message: "All replicas are running and ready",
})
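The condition pattern keeps one entry per Type and records a new transition time only when Status changes. The sketch below is a simplified stdlib-only model of that behavior, not the apimachinery implementation; `Condition`, `setCondition`, and the `ts` helper are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// ts returns a fixed timestamp a given number of minutes after a base time.
func ts(minutes int) time.Time {
	return time.Date(2024, 1, 1, 0, minutes, 0, 0, time.UTC)
}

// setCondition inserts the condition, or updates the entry with the same
// Type. LastTransitionTime moves only when Status actually changes.
func setCondition(conds *[]Condition, c Condition, now time.Time) {
	for i := range *conds {
		existing := &(*conds)[i]
		if existing.Type != c.Type {
			continue
		}
		if existing.Status != c.Status {
			existing.Status = c.Status
			existing.LastTransitionTime = now
		}
		existing.Reason = c.Reason
		existing.Message = c.Message
		return
	}
	c.LastTransitionTime = now
	*conds = append(*conds, c)
}

func main() {
	var conds []Condition
	setCondition(&conds, Condition{Type: "Ready", Status: "False", Reason: "Scaling"}, ts(0))
	setCondition(&conds, Condition{Type: "Ready", Status: "False", Reason: "StillScaling"}, ts(1))
	setCondition(&conds, Condition{Type: "Ready", Status: "True", Reason: "AllReplicasReady"}, ts(2))

	c := conds[0]
	fmt.Println(len(conds), c.Status, c.Reason, c.LastTransitionTime.Format(time.RFC3339))
}
```

Because the transition time is stable while the status is unchanged, tools can tell how long a resource has been in its current state.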
Finalizers: Clean up external resources before deletion:
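const finalizerName = "database.example.com/finalizer"

if !db.DeletionTimestamp.IsZero() {
	if containsString(db.Finalizers, finalizerName) {
		// Clean up external resources (helper assumed to return an error).
		// On failure, return so the request is retried with the finalizer intact.
		if err := r.cleanupExternalResources(&db); err != nil {
			return ctrl.Result{}, err
		}
		// Remove the finalizer so Kubernetes can complete the deletion
		db.Finalizers = removeString(db.Finalizers, finalizerName)
		if err := r.Update(ctx, &db); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}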
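The finalizer snippet calls containsString and removeString without showing them. A plausible implementation, written here as an assumption rather than taken from any library, might look like this:

```go
package main

import "fmt"

// containsString reports whether s is present in list.
func containsString(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// removeString returns a new slice with every occurrence of s removed.
// The input slice is left unmodified.
func removeString(list []string, s string) []string {
	out := make([]string, 0, len(list))
	for _, item := range list {
		if item != s {
			out = append(out, item)
		}
	}
	return out
}

func main() {
	const finalizerName = "database.example.com/finalizer"
	finalizers := []string{"example.com/other", finalizerName}

	fmt.Println(containsString(finalizers, finalizerName))
	finalizers = removeString(finalizers, finalizerName)
	fmt.Println(finalizers, containsString(finalizers, finalizerName))
}
```

In projects built on controller-runtime, the controllerutil package offers ContainsFinalizer, AddFinalizer, and RemoveFinalizer, which can replace hand-written helpers like these.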
Operator Maturity Levels
The Operator Framework defines five capability levels, which OperatorHub displays for each listed operator:
Level 1: Basic Install
- Automated installation
- Basic configuration via spec
Level 2: Seamless Upgrades
- Automated patch and minor version upgrades
- Upgrade without downtime
Level 3: Full Lifecycle
- Backup and restore
- Failure recovery
- Scaling operations
Level 4: Deep Insights
- Metrics integration
- Alerting rules
- Log aggregation
Level 5: Auto Pilot
- Automatic scaling
- Auto-healing
- Performance tuning
Most operators start at Level 1 or 2. Higher levels require significant investment.
When to Build vs. Use
Use Existing Operators
Many mature operators exist:
- PostgreSQL: Crunchy, Zalando, CloudNativePG
- MySQL: Oracle, Percona, Vitess
- Kafka: Strimzi
- Elasticsearch: Elastic Cloud on Kubernetes (ECK)
- Redis: Spotahome, Redis Enterprise
- Prometheus: Prometheus Operator
Prefer mature operators unless you have specific requirements.
Build When
- No existing operator meets your needs
- You have deep operational expertise to encode
- Your internal systems need custom automation
- Learning is a goal (operators teach Kubernetes internals)
Buy When
- Commercial operators provide needed features
- Support and guarantees matter
- Time-to-value is critical
Operational Considerations
Operator Deployment
Deploy operators with care:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database-operator
spec:
  replicas: 1  # Usually 1; enable leader election if running more
  selector:
    matchLabels:
      app: database-operator
  template:
    metadata:
      labels:
        app: database-operator  # must match spec.selector.matchLabels
    spec:
      serviceAccountName: database-operator
      containers:
        - name: operator
          image: example/database-operator:v1.0.0
          resources:
            limits:
              memory: 256Mi
              cpu: 500m
RBAC
Operators need appropriate permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: database-operator
rules:
  - apiGroups: ["database.example.com"]
    resources: ["databases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Needed to write status when the status subresource is enabled
  - apiGroups: ["database.example.com"]
    resources: ["databases/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
Follow least privilege: grant only the permissions the operator actually needs.
Monitoring
Monitor operators themselves:
- Reconciliation latency
- Error rates
- Queue depth
- Resource usage
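Projects scaffolded with the Operator SDK expose Prometheus metrics by default through controller-runtime, including reconcile counts, errors, durations, and work queue depth.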
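To make the first two signals concrete, here is a stdlib-only sketch that times each reconcile call and tracks the error rate. It is an illustration of what is being measured, not how controller-runtime records its metrics; `reconcileStats`, `observe`, `errorRate`, and `errBoom` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBoom is a placeholder failure used by the demo below.
var errBoom = errors.New("reconcile failed")

// reconcileStats accumulates basic health signals for one controller.
type reconcileStats struct {
	total    int
	failed   int
	duration time.Duration // cumulative time spent reconciling
}

// observe runs one reconcile function, timing it and counting failures.
func (s *reconcileStats) observe(reconcile func() error) error {
	start := time.Now()
	err := reconcile()
	s.duration += time.Since(start)
	s.total++
	if err != nil {
		s.failed++
	}
	return err
}

// errorRate returns failed/total, or 0 when nothing has run yet.
func (s *reconcileStats) errorRate() float64 {
	if s.total == 0 {
		return 0
	}
	return float64(s.failed) / float64(s.total)
}

func main() {
	var stats reconcileStats
	for i := 0; i < 4; i++ {
		fail := i == 3 // make the last pass fail
		_ = stats.observe(func() error {
			if fail {
				return errBoom
			}
			return nil
		})
	}
	fmt.Printf("total=%d failed=%d errorRate=%.2f\n",
		stats.total, stats.failed, stats.errorRate())
}
```

In production these values would be exported as Prometheus counters and histograms so that alerting rules can watch them.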
Testing
Test operators thoroughly:
- Unit tests for reconciliation logic
- Integration tests against real clusters
- End-to-end tests for complete workflows
- Chaos testing for failure scenarios
Consider envtest for controller testing:
testEnv = &envtest.Environment{
	CRDDirectoryPaths: []string{filepath.Join("..", "config", "crd", "bases")},
}
cfg, err = testEnv.Start()
Key Takeaways
- Operators extend Kubernetes with application-specific controllers
- They encode operational knowledge into automated reconciliation loops
- Custom Resource Definitions create application-specific APIs
- Operators excel at Day 2 operations: scaling, upgrades, backups, recovery
- Use Operator SDK or Kubebuilder to scaffold new operators
- Prefer existing mature operators when they meet requirements
- Build operators to encode unique operational expertise
- Implement idempotency, owner references, conditions, and finalizers
- Test thoroughly including failure scenarios
- Monitor operator health as critical infrastructure
Operators represent a powerful pattern for extending Kubernetes. They’re especially valuable for stateful applications that need more than basic container management.