Operators & CRDs
Understanding Kubernetes Operators, Custom Resources, and extending Kubernetes
Kubernetes Operators
Operators are software extensions to Kubernetes that use custom resources to manage applications and their components. They implement domain-specific knowledge to automate the entire lifecycle of the software they manage, from deployment and configuration to updates, backups, and failure handling.
Operators follow the Kubernetes principle of reconciliation loops - continuously comparing the desired state with the actual state and taking actions to align them. This makes them powerful tools for automating complex operational tasks that would otherwise require manual intervention.
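As a minimal sketch of the reconciliation idea described above, the following Go program compares a desired state with an actual state and lists the actions needed to align them. The `State` type, its fields, and the action strings are hypothetical and chosen only for illustration; a real operator would act on Kubernetes objects rather than return strings.

```go
package main

import "fmt"

// State is a hypothetical, simplified view of an application:
// how many replicas exist (or should exist) and which version they run.
type State struct {
	Replicas int
	Version  string
}

// reconcile compares desired and actual state and returns the actions
// needed to align them. An empty result means the states already match.
func reconcile(desired, actual State) []string {
	var actions []string
	if actual.Replicas < desired.Replicas {
		actions = append(actions, fmt.Sprintf("scale up by %d", desired.Replicas-actual.Replicas))
	} else if actual.Replicas > desired.Replicas {
		actions = append(actions, fmt.Sprintf("scale down by %d", actual.Replicas-desired.Replicas))
	}
	if actual.Version != desired.Version {
		actions = append(actions, "upgrade to "+desired.Version)
	}
	return actions
}

func main() {
	desired := State{Replicas: 3, Version: "1.2.0"}
	actual := State{Replicas: 1, Version: "1.1.0"}
	for _, action := range reconcile(desired, actual) {
		fmt.Println(action)
	}
	// Once actual matches desired, no further actions are produced.
	fmt.Println("actions when converged:", len(reconcile(desired, desired)))
}
```

Because the function only looks at the two states and not at what happened in between, it can be run repeatedly and will settle once nothing differs.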
Understanding Custom Resources
Custom Resource Definitions (CRDs)
- Extend Kubernetes API: Add new endpoints to the Kubernetes API server
- Define new resource types: Create schema and validation for custom objects
- Domain-specific objects: Represent application-specific concepts as Kubernetes resources
- Declarative management: Apply, update, and delete with standard kubectl commands
- Kubernetes-native interfaces: Integrate with existing tools and workflows
- Versioning support: Enable API evolution with multiple versions
- Namespace or cluster scoped: Control resource visibility and isolation
Custom Controllers
- Watch custom resources: Monitor the Kubernetes API for changes to custom objects
- Implement business logic: Encode domain knowledge and operational procedures
- Reconcile desired state: Continuously work to make actual state match specification
- Manage application lifecycle: Handle creation, updates, scaling, and deletion
- Automate operational tasks: Perform backups, upgrades, failovers, and more
- Handle edge cases: Implement retry logic and error handling
- Report status: Update status subresource with current conditions
The combination of CRDs and controllers is what makes Operators powerful. CRDs define the "what" (the desired state) while controllers implement the "how" (the reconciliation logic).
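To illustrate the "report status" responsibility, here is a small sketch of how a controller might maintain a list of status conditions. The `Condition` type is a simplified stand-in for the conditions found on many Kubernetes resources, and `setCondition` and `demo` are hypothetical helpers written for this example, not part of any library.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified, hypothetical version of a Kubernetes status condition.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	Reason             string
	LastTransitionTime time.Time
}

// setCondition adds or updates a condition by Type. The transition time
// changes only when Status changes, so repeated reconciles do not churn it.
func setCondition(conds []Condition, c Condition, now time.Time) []Condition {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status != c.Status {
			conds[i].Status = c.Status
			conds[i].LastTransitionTime = now
		}
		conds[i].Reason = c.Reason
		return conds
	}
	c.LastTransitionTime = now
	return append(conds, c)
}

// demo walks a resource from "not ready" to "ready".
func demo() []Condition {
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	var status []Condition
	status = setCondition(status, Condition{Type: "Ready", Status: "False", Reason: "Provisioning"}, t0)
	status = setCondition(status, Condition{Type: "Ready", Status: "True", Reason: "AllReplicasUp"}, t0.Add(time.Minute))
	return status
}

func main() {
	for _, c := range demo() {
		fmt.Printf("%s=%s reason=%s since=%s\n",
			c.Type, c.Status, c.Reason, c.LastTransitionTime.Format(time.RFC3339))
	}
}
```

Keeping the transition time stable across identical updates is a design choice that makes the status easier to read: the timestamp then answers "since when has this been true?".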
Creating Custom Resources
A CustomResourceDefinition (CRD) is a Kubernetes resource that defines a new type of custom resource. Here's a detailed example of a CRD for a simple CronTab resource:
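As a rough sketch of such a CRD, the Go program below builds a CronTab definition and prints it as JSON, which kubectl accepts in the same way as YAML. The API group `stable.example.com`, the field names `cronSpec`, `image`, and `replicas`, the printer columns, and the deliberately loose cron pattern are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// obj is shorthand for a JSON object.
type obj = map[string]interface{}

// cronPattern is an assumed, loose pattern: five whitespace-separated fields.
const cronPattern = `^(\S+\s+){4}\S+$`

// cronTabCRD builds a CustomResourceDefinition manifest for a CronTab resource.
func cronTabCRD() obj {
	schema := obj{
		"type": "object",
		"properties": obj{
			"spec": obj{
				"type":     "object",
				"required": []string{"cronSpec", "image"},
				"properties": obj{
					"cronSpec": obj{"type": "string", "pattern": cronPattern},
					"image":    obj{"type": "string"},
					"replicas": obj{"type": "integer", "minimum": 1, "maximum": 10},
				},
			},
			"status": obj{
				"type":       "object",
				"properties": obj{"active": obj{"type": "integer"}},
			},
		},
	}
	version := obj{
		"name":         "v1",
		"served":       true, // this version is exposed by the API server
		"storage":      true, // this version is used for persistence
		"schema":       obj{"openAPIV3Schema": schema},
		"subresources": obj{"status": obj{}},
		"additionalPrinterColumns": []obj{
			{"name": "Schedule", "type": "string", "jsonPath": ".spec.cronSpec"},
			{"name": "Age", "type": "date", "jsonPath": ".metadata.creationTimestamp"},
		},
	}
	return obj{
		"apiVersion": "apiextensions.k8s.io/v1",
		"kind":       "CustomResourceDefinition",
		// The CRD name must be <plural>.<group>.
		"metadata": obj{"name": "crontabs.stable.example.com"},
		"spec": obj{
			"group": "stable.example.com",
			"scope": "Namespaced",
			"names": obj{
				"plural":   "crontabs",
				"singular": "crontab",
				"kind":     "CronTab",
			},
			"versions": []obj{version},
		},
	}
}

// manifestJSON renders the CRD as indented JSON.
func manifestJSON() string {
	out, err := json.MarshalIndent(cronTabCRD(), "", "  ")
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	fmt.Println(manifestJSON())
}
```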
This CRD includes:
- Proper versioning support for API evolution
- OpenAPI schema validation to ensure correctness
- Subresource support for clean separation of status updates
- Custom printer columns for better kubectl output
- Pattern validation for the cron expression
Using Custom Resources
Once a CRD is defined, you can create instances of your custom resource and work with them through kubectl just like built-in Kubernetes resources.
You can manage these resources with familiar kubectl commands such as kubectl apply, kubectl get, kubectl describe, and kubectl delete.
The controller for this CRD would watch for these resources and create the necessary Kubernetes objects (like Jobs or CronJobs) based on the specification.
Operator Pattern
The Operator pattern consists of:
- Custom Resource Definition (CRD): Defines the schema for your custom resource
- Controller watching for CR instances: Detects when resources are created, updated, or deleted
- Domain-specific knowledge coded into controller: Embeds operational expertise as code
- Continuous reconciliation loop: Constantly works to ensure actual state matches desired state
- Kubernetes-native management experience: Provides kubectl integration and familiar workflows
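The watch-then-reconcile part of the pattern can be sketched with a small deduplicating work queue. The `Event`, `Queue`, and `drain` names are hypothetical; the idea loosely resembles the work queues used by Kubernetes controllers, where several changes to the same object collapse into one reconcile.

```go
package main

import "fmt"

// Event is a hypothetical change notification for a custom resource.
type Event struct {
	Kind string // "ADDED", "MODIFIED", or "DELETED"
	Key  string // namespace/name of the affected resource
}

// Queue holds keys awaiting reconciliation; a key already queued is not added twice.
type Queue struct {
	order   []string
	pending map[string]bool
}

func NewQueue() *Queue { return &Queue{pending: map[string]bool{}} }

// Add enqueues a key unless it is already waiting.
func (q *Queue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Pop removes and returns the oldest key, or false if the queue is empty.
func (q *Queue) Pop() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// drain queues every event, then reconciles each distinct key once and
// returns the keys in the order they were processed.
func drain(events []Event, reconcile func(key string)) []string {
	q := NewQueue()
	for _, e := range events {
		// The event kind is ignored on purpose: reconcile looks at the
		// current state of the object, not at what kind of change occurred.
		q.Add(e.Key)
	}
	var done []string
	for {
		key, ok := q.Pop()
		if !ok {
			return done
		}
		reconcile(key)
		done = append(done, key)
	}
}

func main() {
	events := []Event{
		{Kind: "ADDED", Key: "default/db-a"},
		{Kind: "MODIFIED", Key: "default/db-a"},
		{Kind: "ADDED", Key: "default/db-b"},
		{Kind: "MODIFIED", Key: "default/db-a"},
	}
	order := drain(events, func(key string) { fmt.Println("reconciling", key) })
	fmt.Println(len(order), "reconciles for", len(events), "events")
}
```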
This pattern enables complex applications to be managed declaratively, which means:
- Infrastructure as code principles can be applied
- Git-based workflows can be used for application management
- Audit trails exist for all configuration changes
- Rollbacks are possible by reapplying a previously versioned manifest
The Operator pattern is particularly valuable for stateful applications that require specific domain knowledge to operate correctly. Traditional deployment tools may struggle with databases, message queues, and other stateful systems, but Operators can encode the necessary operational procedures directly.
An Operator can handle tasks such as:
- Automated backup and restore procedures
- Data replication configuration
- Leader election in clustered applications
- Rolling updates with zero downtime
- Scaling with proper data rebalancing
- Disaster recovery processes
Popular Operators
Prometheus Operator
- Manages Prometheus instances: Deploys and configures Prometheus servers
- Configures monitoring targets: Automatically discovers and scrapes services
- Handles alerting: Manages Alertmanager instances and alerting rules
- Works alongside Grafana: Dashboards and data sources are typically provisioned by the companion kube-prometheus stack rather than by the operator itself
- Simplifies monitoring setup: Uses ServiceMonitor and PodMonitor resources for target discovery
- Handles high availability: Runs multiple Prometheus replicas for reliability
- Implements sharding: Scales monitoring across multiple instances
PostgreSQL Operator
- Deploys PostgreSQL clusters: Creates primary and replica instances
- Handles high availability: Manages automatic failover mechanisms
- Manages backups: Schedules and restores point-in-time backups
- Implements scaling: Adjusts resources and replica count based on demand
- Automates upgrades: Performs zero-downtime version upgrades
- Manages connection pooling: Configures PgBouncer for optimal performance
- Monitors database health: Collects metrics and alerts on issues
- Manages users and permissions: Handles role-based access control
Elasticsearch Operator
- Creates Elasticsearch clusters: Provisions multi-node Elasticsearch deployments
- Manages Kibana instances: Deploys and configures visualization frontend
- Handles data nodes: Distributes data for performance and redundancy
- Implements security: Configures authentication, authorization, and TLS
- Automates operations: Handles index management and shard allocation
- Manages topology: Places nodes across availability zones
- Handles snapshots: Configures automated backup schedules
- Upgrades safely: Performs rolling updates of cluster components
Other notable operators include:
- Strimzi Kafka Operator: Manages Apache Kafka clusters, topics, users, and more
- MongoDB Operator: Automates MongoDB replica sets and sharded clusters
- Redis Operator: Manages Redis Sentinel and Redis Cluster deployments
- Jaeger Operator: Manages distributed tracing infrastructure
- Vault Operator: Automates HashiCorp Vault deployment and secret management
- Istio Operator: Simplifies service mesh installation and upgrades
Operator Frameworks
Several frameworks exist to simplify operator development, with Operator SDK among the most widely used. These frameworks provide scaffolding, utilities, and best practices to accelerate development.
Operator SDK
The Operator SDK, part of the Operator Framework, supports multiple options for implementing operators:
- Go: Native language for Kubernetes with direct client-go integration
- Ansible: For teams with existing Ansible expertise
- Helm: For converting existing Helm charts into operators
Other Operator Development Options
KOPF (Kubernetes Operator Pythonic Framework):
- Python-based framework for writing operators
- Event-driven programming model
- Simpler learning curve for Python developers
Kubebuilder:
- Foundation for Operator SDK's Go support
- Focused specifically on Go-based operators
- Uses controller-runtime library
KUDO (Kubernetes Universal Declarative Operator):
- Declarative approach to building operators
- YAML-based operator definitions
- No programming required for many use cases
Operator Lifecycle Manager (OLM)
Operator Lifecycle Manager (OLM) helps cluster administrators manage the lifecycle of operators in a Kubernetes cluster, from installation to updates to removal.
Benefits
- Operator discoverability: Catalog of available operators in a cluster
- Dependency resolution: Automatically installs operator dependencies
- Cluster stability: Ensures compatibility between operators
- Update management: Handles operator upgrades safely
- Version tracking: Manages multiple versions of operators
- Channel-based updates: Supports concepts like stable/beta/alpha channels
- Namespace tenancy: Controls operator visibility and access
- Seamless upgrades: Updates operators without service interruption
Installation
OLM introduces several custom resources for operator management:
- ClusterServiceVersion (CSV): Represents a specific version of an operator
- InstallPlan: Calculated list of resources to be created for an operator
- Subscription: Keeps operators updated by tracking a channel in a package
- CatalogSource: Repository of operator metadata that OLM can query
- OperatorGroup: Selects the target namespaces in which member operators are granted RBAC access
These resources work together to provide a comprehensive operator management solution for cluster administrators.
Building an Operator
Key steps in operator development:
- Define the API (CRD)
  - Design your resource schema carefully
  - Consider versioning from the beginning
  - Add validation to prevent invalid configurations
  - Include status fields for reporting conditions
- Implement the controller
  - Follow the controller-runtime patterns
  - Implement idempotent reconciliation logic
  - Handle all edge cases and error conditions
  - Use owner references for garbage collection
  - Add proper logging and error reporting
- Test the reconciliation logic
  - Write unit tests for controller logic
  - Use envtest for integration testing
  - Test error handling and recovery
  - Verify reconciliation convergence
  - Simulate various failure scenarios
- Package as a container image
  - Use multi-stage builds for smaller images
  - Include only necessary runtime dependencies
  - Configure appropriate security contexts
  - Set resource requests and limits
  - Tag with semantic versioning
- Deploy to Kubernetes
  - Create proper RBAC permissions
  - Use Kustomize or Helm for deployment
  - Consider namespace isolation
  - Set up monitoring and alerting
  - Configure proper liveness/readiness probes
- Continuous iteration
  - Monitor operator logs and performance
  - Gather user feedback
  - Implement new features
  - Fix bugs and improve error handling
  - Plan for API evolution and upgrades
Building an effective operator requires deep understanding of both Kubernetes internals and the application domain. The best operators combine Kubernetes expertise with application-specific operational knowledge.
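Two of the controller guidelines above, idempotent reconciliation and owner references, can be sketched against an in-memory stand-in for the API server. The `Object`, `Cluster`, `ensureChild`, and `deleteWithDependents` names are hypothetical, and the owner reference is reduced to a single UID field; in a real cluster the garbage collector, not the operator, removes dependents.

```go
package main

import "fmt"

// Object is a hypothetical, minimal stand-in for a Kubernetes object.
type Object struct {
	Name     string
	UID      string
	OwnerUID string // simplified owner reference: UID of the owning object
}

// Cluster is an in-memory stand-in for the API server.
type Cluster struct{ objects map[string]Object }

func NewCluster() *Cluster { return &Cluster{objects: map[string]Object{}} }

// ensureChild creates the child only if it does not exist yet and reports
// whether anything was created. Calling it repeatedly is harmless.
func (c *Cluster) ensureChild(owner Object, childName string) bool {
	if _, exists := c.objects[childName]; exists {
		return false
	}
	c.objects[childName] = Object{
		Name:     childName,
		UID:      "uid-" + childName,
		OwnerUID: owner.UID,
	}
	return true
}

// deleteWithDependents removes an object and its direct dependents, mimicking
// what garbage collection does for objects that carry an owner reference.
// It returns the number of objects removed.
func (c *Cluster) deleteWithDependents(uid string) int {
	removed := 0
	for name, o := range c.objects {
		if o.UID == uid || o.OwnerUID == uid {
			delete(c.objects, name)
			removed++
		}
	}
	return removed
}

func main() {
	c := NewCluster()
	owner := Object{Name: "my-db", UID: "uid-my-db"}
	c.objects[owner.Name] = owner

	fmt.Println("created:", c.ensureChild(owner, "my-db-config"))
	fmt.Println("created again:", c.ensureChild(owner, "my-db-config"))
	fmt.Println("removed:", c.deleteWithDependents(owner.UID))
}
```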
Example: Simple Operator
Below is a more detailed example of a Go-based controller implementation for a Memcached operator:
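What follows is a dependency-free sketch of the reconcile logic such a controller typically contains, not a controller-runtime implementation. The `Memcached`, `Deployment`, `Cluster`, and `Result` types are hypothetical stand-ins for the real API types and client, and the in-memory maps replace calls to the API server.

```go
package main

import "fmt"

// Memcached is a hypothetical, simplified custom resource.
type Memcached struct {
	Size          int // spec: desired number of replicas
	ReadyReplicas int // status: replicas last observed by the controller
}

// Deployment stands in for the Deployment owned by a Memcached resource.
type Deployment struct{ Replicas int }

// Cluster is an in-memory stand-in for the API server.
type Cluster struct {
	Memcacheds  map[string]*Memcached
	Deployments map[string]*Deployment
}

// Result tells the caller whether Reconcile should run again soon.
type Result struct{ Requeue bool }

// Reconcile moves the cluster one step closer to the desired state.
func (c *Cluster) Reconcile(name string) Result {
	m, found := c.Memcacheds[name]
	if !found {
		// The resource was deleted; owned objects are left to garbage collection.
		return Result{}
	}
	d, found := c.Deployments[name]
	if !found {
		// No Deployment yet: create one and check again on the next run.
		c.Deployments[name] = &Deployment{Replicas: m.Size}
		return Result{Requeue: true}
	}
	if d.Replicas != m.Size {
		// Size drifted from the spec: correct it and check again.
		d.Replicas = m.Size
		return Result{Requeue: true}
	}
	// Everything matches: record the observed state in the status.
	m.ReadyReplicas = d.Replicas
	return Result{}
}

// converge calls Reconcile until no requeue is requested and returns the
// number of runs it took (capped at maxRuns).
func converge(c *Cluster, name string, maxRuns int) int {
	for runs := 1; runs <= maxRuns; runs++ {
		if !c.Reconcile(name).Requeue {
			return runs
		}
	}
	return maxRuns
}

func main() {
	c := &Cluster{
		Memcacheds:  map[string]*Memcached{"cache": {Size: 3}},
		Deployments: map[string]*Deployment{},
	}
	fmt.Println("runs to converge:", converge(c, "cache", 10))
	c.Memcacheds["cache"].Size = 5 // the user edits the spec
	fmt.Println("runs to converge:", converge(c, "cache", 10))
	fmt.Println("replicas:", c.Deployments["cache"].Replicas)
}
```

Each run makes at most one change and then asks to be called again, which keeps every step small and easy to retry.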
Advanced CRD features enhance your operator in multiple ways:
- Subresources separate concerns and enable standard Kubernetes features:
  - Status subresource: Updates status without modifying spec
  - Scale subresource: Enables HPA integration
- Printer columns improve the CLI experience:
  - Customized kubectl get output
  - Relevant information at a glance
  - Better operational visibility
- Validation schema ensures data integrity:
  - Type checking and format validation
  - Required fields enforcement
  - Enumerated value restrictions
  - Numeric range constraints
  - Regex pattern validation
- Conversion webhooks enable API evolution:
  - Seamless version upgrades
  - Data transformation between versions
  - Backward compatibility support
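The data transformation a conversion webhook performs can be sketched as a pair of plain functions. Both versions shown here are hypothetical: an older `v1alpha1` with a `Schedule` field and a newer `v1` that renames it to `CronSpec` and adds `Replicas`. A real webhook would wrap logic like this in an HTTP handler that answers conversion requests from the API server.

```go
package main

import "fmt"

// CronTabV1alpha1 is a hypothetical older version of the resource.
type CronTabV1alpha1 struct {
	Schedule string
	Image    string
}

// CronTabV1 is a hypothetical newer version: Schedule was renamed to
// CronSpec and a Replicas field was added.
type CronTabV1 struct {
	CronSpec string
	Image    string
	Replicas int
}

// toV1 upgrades an old object, filling the new field with a default.
func toV1(old CronTabV1alpha1) CronTabV1 {
	return CronTabV1{CronSpec: old.Schedule, Image: old.Image, Replicas: 1}
}

// toV1alpha1 downgrades a new object. Replicas has no place in the old
// version and is dropped, so this direction loses information.
func toV1alpha1(cur CronTabV1) CronTabV1alpha1 {
	return CronTabV1alpha1{Schedule: cur.CronSpec, Image: cur.Image}
}

func main() {
	old := CronTabV1alpha1{Schedule: "*/5 * * * *", Image: "my-cron-image"}
	upgraded := toV1(old)
	fmt.Printf("%+v\n", upgraded)
	fmt.Printf("%+v\n", toV1alpha1(upgraded))
}
```

Because the downgrade is lossy, conversions between versions need care whenever a newer version carries fields the older one cannot express.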
Troubleshooting Operators
When operators aren't behaving as expected, systematic troubleshooting approaches can help identify and resolve issues.
Common Issues
- Controller not reconciling:
  - Operator pod might be failing or restarting
  - Watch might not be set up correctly
  - Reconcile function might have errors
  - Resource might be ignored due to owner reference filtering
  - Webhook might be rejecting changes
- Status not updating:
  - Missing RBAC permissions for status subresource
  - Errors in status update code
  - Status updates being overwritten by another controller
  - Custom resource not defining status fields properly
  - Status updates being throttled by API server
- CRD validation errors:
  - Schema validation rejecting valid resources
  - Required fields missing in custom resources
  - Data type mismatches
  - Pattern validation too strict
  - Enum values not covering all cases
- Permissions problems:
  - Insufficient RBAC for accessing resources
  - Missing service account configuration
  - Namespace restrictions
  - Security contexts preventing operations
  - Missing API groups in ClusterRole
- Resource conflicts:
  - Multiple controllers managing same resources
  - ResourceVersion conflicts during updates
  - Finalizers preventing deletion
  - Ownership conflicts
  - Race conditions in concurrent reconciliations
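ResourceVersion conflicts are usually handled by re-reading the object and retrying the update. The sketch below models this with an in-memory `Store`; the `Store` type, its `interfere` counter (which simulates another writer), and `updateWithRetry` are all hypothetical. In real Go operators, client-go ships a comparable helper, `retry.RetryOnConflict`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict signals that the object changed since it was read.
var ErrConflict = errors.New("conflict: object was modified")

// Store is a hypothetical single-object store with optimistic concurrency.
type Store struct {
	value           string
	resourceVersion int
	interfere       int // how many times a simulated second writer sneaks in
}

// Get returns the current value and version. To simulate a concurrent
// writer, it may bump the version right after the read.
func (s *Store) Get() (string, int) {
	v, rv := s.value, s.resourceVersion
	if s.interfere > 0 {
		s.interfere--
		s.resourceVersion++
	}
	return v, rv
}

// Update succeeds only if the caller read the latest version.
func (s *Store) Update(value string, rv int) error {
	if rv != s.resourceVersion {
		return ErrConflict
	}
	s.value = value
	s.resourceVersion++
	return nil
}

// updateWithRetry re-reads and re-applies the change after each conflict,
// giving up once the attempts are used up.
func updateWithRetry(s *Store, attempts int, mutate func(string) string) error {
	for i := 0; i < attempts; i++ {
		current, rv := s.Get()
		err := s.Update(mutate(current), rv)
		if !errors.Is(err, ErrConflict) {
			return err // nil on success, or a non-conflict error
		}
	}
	return ErrConflict
}

func main() {
	s := &Store{value: "a", interfere: 2}
	err := updateWithRetry(s, 5, func(v string) string { return v + "b" })
	fmt.Println(s.value, s.resourceVersion, err)
}
```

The important detail is that the change is recomputed from a fresh read on every attempt, rather than resubmitting a stale copy.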
Debugging
Advanced Debugging Techniques
- Use tools like delve for remote debugging Go operators
- Add debug endpoints to your operator for on-demand diagnostics
- Create a debug build with additional instrumentation
- Implement trace ID propagation for distributed tracing
- Use tools like Telepresence for local development against a remote cluster
Security Considerations
Key security aspects:
- Limit RBAC permissions
  - Follow principle of least privilege
  - Use separate service accounts for different components
  - Regularly audit permissions and remove unused ones
  - Consider namespace-scoped operators instead of cluster-wide
  - Use RoleBindings instead of ClusterRoleBindings when possible
- Validate user inputs
  - Implement comprehensive CRD schema validation
  - Use admission webhooks for complex validation
  - Sanitize all user inputs before use
  - Validate all environment variables and configuration
  - Implement strict type checking and bounds validation
- Secure sensitive data
  - Never store credentials in CRDs or ConfigMaps
  - Use Kubernetes Secrets for sensitive information
  - Consider external secret management (Vault, AWS Secrets Manager)
  - Implement encryption for data at rest and in transit
  - Rotate credentials regularly
- Consider multi-tenancy
  - Isolate operators by namespace
  - Implement tenant isolation within operators
  - Use NetworkPolicies to restrict communication
  - Consider security implications of shared operators
  - Implement resource quotas to prevent DoS
- Implement auditing
  - Enable Kubernetes audit logging
  - Record all significant operator actions
  - Implement detailed event recording
  - Consider audit trail for sensitive operations
  - Use structured logging with appropriate metadata
- Handle upgrades securely
  - Validate all changes before applying
  - Implement gradual rollout strategies
  - Have rollback procedures ready
  - Test upgrades thoroughly in staging
  - Monitor closely during and after upgrades
- Container security
  - Use minimal base images
  - Run containers as non-root users
  - Implement read-only file systems where possible
  - Scan images for vulnerabilities
  - Set appropriate security contexts
- API security
  - Use mTLS for all communications
  - Implement rate limiting to prevent abuse
  - Consider API request validation
  - Secure webhook endpoints
  - Monitor for unusual API requests
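The input validation points above can be sketched as a function that checks a resource spec before the operator acts on it. The `Spec` fields, the bounds, and the loose cron pattern are assumptions chosen for illustration; in practice much of this would also be expressed in the CRD schema so that the API server rejects bad objects early.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Spec is a hypothetical custom resource spec supplied by a user.
type Spec struct {
	CronSpec string
	Image    string
	Replicas int
}

// cronFields loosely matches five whitespace-separated fields.
var cronFields = regexp.MustCompile(`^(\S+\s+){4}\S+$`)

// validate returns a list of problems; an empty list means the spec passed.
func validate(s Spec) []string {
	var problems []string
	if !cronFields.MatchString(s.CronSpec) {
		problems = append(problems, "cronSpec must have five whitespace-separated fields")
	}
	if s.Image == "" || strings.ContainsAny(s.Image, " \t\n") {
		problems = append(problems, "image must be non-empty and contain no whitespace")
	}
	if s.Replicas < 1 || s.Replicas > 10 {
		problems = append(problems, "replicas must be between 1 and 10")
	}
	return problems
}

func main() {
	good := Spec{CronSpec: "*/5 * * * *", Image: "my-cron-image", Replicas: 3}
	bad := Spec{CronSpec: "every day", Image: "", Replicas: 50}
	fmt.Println("good spec problems:", len(validate(good)))
	for _, p := range validate(bad) {
		fmt.Println("bad spec:", p)
	}
}
```

Collecting every problem instead of stopping at the first one gives users a complete picture in a single round trip.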
Production Readiness
Before deploying an operator to production, ensure it meets high standards for reliability, maintainability, and operability.
Checklist
- Comprehensive tests
  - Unit tests for controller logic
  - Integration tests with real Kubernetes API
  - End-to-end tests for full workflows
  - Chaos testing for resilience
  - Upgrade tests between versions
  - Performance tests under load
  - Security scans and penetration testing
- Version strategy
  - Semantic versioning for operator releases
  - CRD versioning plan
  - Clear deprecation policies
  - Conversion strategy between versions
  - Compatibility matrix documentation
  - Container image tagging strategy
  - Rollback procedures defined
- Documentation
  - Installation and configuration guide
  - API reference for all CRDs
  - Operational procedures for common tasks
  - Troubleshooting guides
  - Architectural overview
  - Example use cases and configurations
  - Known limitations and workarounds
- Monitoring integration
  - Prometheus metrics exposed
  - Default Grafana dashboards
  - Alert definitions for critical conditions
  - Health and readiness endpoints
  - Detailed logging strategy
  - Tracing integration
  - Event recording for significant actions
- Backup/restore procedures
  - Clear backup methodology
  - Documented restore process
  - Disaster recovery testing
  - Point-in-time recovery options
  - Data consistency guarantees
  - Cross-cluster migration procedures
  - Backup validation mechanisms
- Update strategy
  - In-place upgrade support
  - Canary deployment options
  - Progressive rollout capabilities
  - Feature flags for gradual enabling
  - A/B testing support
  - Blue/green deployment procedures
  - Automatic or manual approval workflows
- Error handling
  - Graceful degradation under pressure
  - Comprehensive error logging
  - Self-healing mechanisms
  - Circuit breaking for external dependencies
  - Retry strategies with backoff
  - Failure domain isolation
  - Deterministic error reporting
- Resource constraints
  - Appropriate resource requests and limits
  - Horizontal scaling capability
  - Vertical scaling considerations
  - Performance under resource pressure
  - Graceful handling of resource exhaustion
  - Quality of service guarantees
  - Prioritization of critical operations
A production-ready operator should be treated like any mission-critical application, with proper CI/CD pipelines, change management procedures, and operational runbooks. The ultimate goal is to make the operator itself as reliable and maintainable as the applications it manages.
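One checklist item that is easy to show in code is retrying with backoff. The sketch below doubles the delay after each failed attempt up to a cap; `backoffDelay`, `retry`, `noSleep`, and `errTransient` are hypothetical names, and the delays are arbitrary. Production code commonly adds random jitter as well, which is left out here to keep the example deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a placeholder for a temporary failure.
var errTransient = errors.New("transient failure")

// backoffDelay returns base doubled once per previous attempt, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

// retry runs op until it succeeds or attempts are used up, sleeping with
// exponential backoff in between. It returns the number of attempts made
// and the last error (nil on success).
func retry(attempts int, base, max time.Duration, sleep func(time.Duration), op func() error) (int, error) {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return i + 1, nil
		}
		if i < attempts-1 {
			sleep(backoffDelay(i, base, max))
		}
	}
	return attempts, err
}

// noSleep can replace time.Sleep when delays should be skipped, e.g. in tests.
func noSleep(time.Duration) {}

func main() {
	calls := 0
	n, err := retry(5, 100*time.Millisecond, 2*time.Second, time.Sleep, func() error {
		calls++
		if calls < 3 {
			return errTransient // fail the first two attempts
		}
		return nil
	})
	fmt.Println("attempts:", n, "error:", err)
}
```

Passing the sleep function as a parameter is a small design choice that keeps the retry logic testable without real waiting.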
Operator Hub
OperatorHub.io is a central repository for Kubernetes Operators:
- Discover and share operators
  - Public registry of community operators
  - Searchable catalog of production-ready operators
  - Filterable by capability level and category
  - Ratings and reviews from community
  - Vendor-backed and community operators
- Browse by category
  - Database management
  - Monitoring and observability
  - Security
  - Storage and backup
  - Big data and analytics
  - Cloud providers
  - Developer tools
  - Networking
- Installation instructions
  - Step-by-step deployment guides
  - YAML manifests and Helm charts
  - OLM integration for dependency management
  - Version compatibility information
  - Resource requirements and prerequisites
  - Custom configuration options
- Community-contributed
  - Open submission process
  - Community review and feedback
  - Collaborative improvement
  - Issue tracking and feature requests
  - Use case examples and best practices
  - Regular updates and maintenance
- Operator SDK integration
  - Scaffolding for operator development
  - Publishing guidelines and tools
  - Validation tests for quality assurance
  - Bundle format standardization
  - Scoring and capability level assessment
  - Lifecycle management integration
Featured Operators from OperatorHub
Some popular operators available on OperatorHub include:
- Prometheus Operator - Automated deployment and management of Prometheus monitoring stacks
- Elasticsearch Operator - Manages Elasticsearch, Kibana, and APM Server on Kubernetes
- etcd Operator - Manages etcd clusters deployed on Kubernetes
- MongoDB Community Kubernetes Operator - Automates and manages MongoDB deployments
- Strimzi Kafka Operator - Simplifies running Apache Kafka on Kubernetes
- Redis Operator - Creates and maintains Redis clusters
- PostgreSQL Operator - Manages PostgreSQL clusters
- Jaeger Operator - Simplifies deployment of Jaeger tracing infrastructure
Publishing Your Operator
To publish your operator to OperatorHub:
- Package your operator using the Operator Framework bundle format
- Ensure it meets the required criteria
- Submit a pull request to the community-operators repository
- Respond to community review feedback
- Maintain your operator with regular updates and improvements
By publishing to OperatorHub, you make your operator discoverable to the wider Kubernetes community and benefit from community feedback and contributions.