Kubernetes Topology Spread Constraints
Advanced pod scheduling across topology domains using Topology Spread Constraints for high availability and balanced workload distribution
Understanding Topology Spread Constraints
Kubernetes Topology Spread Constraints provide a sophisticated mechanism for controlling how pods are distributed across your cluster with respect to topology domains such as regions, zones, nodes, and other user-defined topology domains. This feature enables advanced high-availability configurations, improved resource utilization, and protection against topology-related failures.
Introduced as beta in Kubernetes 1.18 and graduating to stable in 1.19, topology spread constraints give cluster administrators and application developers fine-grained control over workload distribution beyond what is possible with pod affinity/anti-affinity rules alone. This capability becomes increasingly important as clusters span multiple availability zones, regions, or even cloud providers.
The core principle behind topology spread constraints is simple yet powerful: ensure that pods are distributed according to specific rules across different topology domains to minimize the impact of domain-level failures. Unlike traditional scheduling approaches that focus primarily on resource availability, topology spread constraints prioritize the relative distribution of pods, ensuring balanced deployment across the infrastructure landscape.
When to Use Topology Spread Constraints
Topology spread constraints are particularly valuable in the following scenarios:
- High Availability Applications
- Distribute application instances across failure domains
- Ensure service continuity during zone or rack failures
- Maintain application availability during node maintenance
- Implement regional resiliency for globally distributed applications
- Protect against cascading failures within a single domain
- Balanced Resource Utilization
- Prevent resource hotspots in specific topology domains
- Distribute workload evenly across the cluster
- Optimize resource utilization across zones or racks
- Balance network traffic across different network segments
- Prevent over-subscription of limited resources in specific domains
- Cost Optimization
- Balance workloads across zones to reduce cross-zone traffic costs
- Prevent overutilization of resources in premium pricing zones
- Optimize resource allocation in hybrid or multi-cloud environments
- Distribute computational load to maximize resource efficiency
- Leverage spot instances while maintaining reliability
- Performance Optimization
- Distribute latency-sensitive workloads across topology domains
- Minimize network latency by strategic pod placement
- Balance computational load across nodes with specialized hardware
- Optimize data locality for data-intensive applications
- Reduce contention for shared resources
The flexibility of topology spread constraints allows them to be adapted to various operational requirements, from strict regulatory compliance that mandates geographic distribution to performance-critical applications that require careful balancing of latency and throughput considerations.
Core Concepts
Topology spread constraints rely on several key concepts that form the foundation of this powerful scheduling capability:
Topology Domain
- A division of your infrastructure based on specific characteristics
- Common domains include zone, region, node, rack, switch
- Can use any node label as a topology domain
- Defined by the `topologyKey` field in the constraint
- Examples: `topology.kubernetes.io/zone`, `kubernetes.io/hostname`, `node.kubernetes.io/instance-type`
- Well-known topology keys are automatically added by cloud providers
- Custom topology domains can be created with node labels
- Hierarchical domains (e.g., region -> zone -> node) can be used together
- Domain selection impacts failure isolation and performance characteristics
- Proper domain labeling is crucial for effective constraint implementation
Spread Constraints
- Rules that define how pods should be distributed
- Specified in the pod specification under `topologySpreadConstraints`
- Can include multiple constraints for different topology keys
- Evaluated during the scheduling process to determine pod placement
- When multiple constraints are specified, they are evaluated together and all of them must be satisfied (logical AND)
- Can be applied to individual pods or pod templates in workload resources
- Act as hard or soft requirements depending on configuration
- Can include pod label selectors to count only specific pods
- More specific than pod affinity/anti-affinity for distribution purposes
- Each constraint includes:
  - `maxSkew`: The degree to which pods may be unevenly distributed
  - `topologyKey`: The node label used to identify the topology domain
  - `whenUnsatisfiable`: The behavior when the constraint cannot be satisfied
  - `labelSelector`: Which pods to count when determining the spread
  - `matchLabelKeys` (optional): Pod labels to include in the selection criteria
  - `nodeAffinityPolicy` (optional): How node affinity should be considered
  - `nodeTaintsPolicy` (optional): How node taints should be considered
MaxSkew
- Defines the maximum permitted difference between domains
- Value of 1 means domains can differ by at most 1 pod
- Higher values allow more imbalance between domains
- If domains have 5, 5, and 4 matching pods, the skew is 1
- If domains have 6, 3, and 2 matching pods, the skew is 4
- Calculated using the formula: max(domain_count) - min(domain_count)
- Represents the tolerance for imbalance in the system
- Smaller values enforce stricter distribution
- Should be tuned based on the total pod count and domain count
- Important consideration: maxSkew of 1 with a small number of pods can prevent scheduling
- For new deployments, consider starting with higher skew and reducing over time
- Different workloads may require different skew values based on criticality
- Only domains that have eligible nodes are counted; an eligible domain with zero matching pods still counts toward the global minimum
- The first pod of a deployment can be placed in any eligible domain, since all domains start at a count of zero
WhenUnsatisfiable
- Determines behavior when constraints cannot be met
- `DoNotSchedule`: Pod remains pending until the constraint can be satisfied (the default)
- `ScheduleAnyway`: Best-effort distribution; the pod will be scheduled even if the constraint is violated
- With `DoNotSchedule`, pods may remain pending indefinitely if constraints cannot be satisfied
- With `ScheduleAnyway`, the scheduler will still attempt to minimize the skew even when it cannot be fully satisfied
- `DoNotSchedule` provides stronger guarantees but may impact availability
- `ScheduleAnyway` prioritizes availability over perfect distribution
- Can be used to implement hard and soft constraints in the same pod spec
- Critical for balancing strict distribution requirements against scheduling success
- Pending pods with `DoNotSchedule` will be reconsidered when the cluster state changes
- Pods with `ScheduleAnyway` still respect other scheduling constraints such as resource requirements
These concepts work together to form a comprehensive system for controlling pod distribution. The flexibility of this system allows for simple configurations for basic use cases while supporting sophisticated distribution patterns for complex environments.
Topology Labels and Domains
The effectiveness of topology spread constraints depends heavily on proper node labeling. Kubernetes provides several standard topology labels:
- Cloud Provider Topology Labels
  - `topology.kubernetes.io/zone`: The availability zone where the node is running
  - `topology.kubernetes.io/region`: The geographic region where the node is running
  - Automatically added by cloud providers like AWS, GCP, and Azure
  - Example values: `us-east-1a` (zone), `us-east-1` (region)
- Node Identity Labels
  - `kubernetes.io/hostname`: The node's hostname
  - Unique per node in the cluster
  - Useful for spreading pods across physical machines
- Custom Topology Labels
  - `topology.kubernetes.io/rack`: Physical rack location (must be manually labeled)
  - `node.kubernetes.io/instance-type`: The type of VM or hardware
  - Any custom label that represents a meaningful topology domain
Example of manually adding topology labels:
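For example, a rack label can be attached by hand (the node name and rack value are illustrative):

```bash
# Add a custom rack topology label to a node
kubectl label node worker-node-1 topology.kubernetes.io/rack=rack-1

# Confirm the label is present
kubectl get node worker-node-1 --show-labels
```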
When designing your topology domains, consider the failure modes you want to protect against:
- Zone-level failures: Use `topology.kubernetes.io/zone`
- Node-level failures: Use `kubernetes.io/hostname`
- Rack-level failures: Use `topology.kubernetes.io/rack`
- Network segment failures: Use custom network topology labels
- Power distribution failures: Use custom power domain labels
The hierarchy of domains is also important—typically, you want to spread across larger domains first (regions), then medium-sized domains (zones), and finally smaller domains (nodes).
Basic Implementation
To implement topology spread constraints, you need to add the topologySpreadConstraints field to your pod specification:
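A minimal sketch of such a pod spec, assuming the application is labeled `app: example-app` and nodes carry the standard zone label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  labels:
    app: example-app
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: example-app
  containers:
    - name: example-app
      image: registry.example.com/example-app:1.0   # illustrative image
```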
This configuration ensures that the difference between the number of pods with the label app=example-app across any two zones is at most 1. If this cannot be achieved, the pod will remain in the Pending state.
Let's break down how this works in practice:
Assume you have a cluster with three zones (zone-a, zone-b, and zone-c) and the following pod distribution:
- zone-a: 2 pods with label `app=example-app`
- zone-b: 1 pod with label `app=example-app`
- zone-c: 1 pod with label `app=example-app`
The current skew is 1 (maximum 2 - minimum 1 = 1), which equals the maxSkew specified.
If you create a new pod with the above constraint:
- It can be scheduled in zone-b or zone-c (bringing either to 2 pods)
- It cannot be scheduled in zone-a (would increase skew to 2)
If all zones already had 2 pods each (skew = 0), the new pod could be scheduled in any zone, as it would result in a skew of 1, which is acceptable.
Advanced Configuration Patterns
Topology spread constraints can be configured in various ways to achieve different distribution patterns:
- Multi-level Topology Spreading
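A sketch of such a two-level constraint, as it might appear under a pod template's `spec` (the `app: web-app` label is illustrative):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web-app
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web-app
```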
This configuration distributes pods evenly across zones (with strict enforcement) and then attempts to distribute them across nodes with more flexibility. The combination ensures that:
- Pods are strictly distributed across zones for high availability
- Within each zone, pods are spread across nodes but with some flexibility
- Zone-level distribution takes precedence over node-level distribution
- The system remains resilient to zone failures while optimizing node-level resource usage
- Different Selector for Counting
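A sketch using `matchExpressions` to count two related components together (label values are illustrative):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
        - key: app
          operator: In
          values:
            - frontend
            - middleware
```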
This counts pods with either `app=frontend` or `app=middleware` labels when determining the spread. This approach is useful when:
- Multiple components need to be considered together for distribution purposes
- Components share resources or failure domains
- You want to balance related services across topology domains
- Different pod types serve the same logical function
- Custom Topology Domains
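A sketch spreading over a manually maintained rack label (the `app: database` label is illustrative):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/rack   # custom label applied to nodes by operators
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: database
```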
This uses a custom topology domain (rack) that you've labeled on your nodes. Custom domains enable:
- Distribution based on physical infrastructure characteristics
- Accounting for power distribution units, network switches, or cooling systems
- Creating domain-specific high availability patterns
- Implementing compliance requirements for physical separation
- Graceful Fallback Configuration
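A sketch of the soft variant, with the same shape as the basic example but using `ScheduleAnyway`:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web-app
```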
This attempts to distribute pods evenly across zones but will still schedule the pod even if perfect distribution cannot be achieved. The scheduler will:
- Try to minimize the skew as much as possible
- Use the skew value to calculate a score for each node
- Balance distribution against other scheduling factors
- Ensure the pod is scheduled somewhere, prioritizing availability
These configuration patterns can be combined in various ways to address complex distribution requirements. The flexibility of the constraint system allows for tailored solutions that balance high availability, resource utilization, and performance considerations.
Implementation with Deployments and StatefulSets
Topology spread constraints are more commonly used with Deployments, StatefulSets, and other workload controllers rather than with individual pods:
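A Deployment sketch along these lines (names, image, and resource values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 9
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-app
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-app
      containers:
        - name: web-app
          image: registry.example.com/web-app:1.0   # illustrative image
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
```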
This Deployment creates 9 replicas that will be evenly distributed across both zones and nodes. With 3 zones, you would have 3 pods per zone, and these 3 pods in each zone would be distributed across different nodes.
The scheduling process for this deployment would work as follows:
- Initial scheduling: The first pod can be scheduled anywhere since there's no skew yet.
- Zone balancing: Subsequent pods will be distributed to maintain balanced zones.
- Node balancing: Within each zone, pods will be distributed across nodes.
- Scaling up: When adding pods, they'll be placed to maintain the minimum skew.
- Scaling down: When removing pods, the relative distribution will be maintained.
StatefulSet Example
StatefulSets have unique considerations due to their ordered creation and stable identities:
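A StatefulSet sketch with a zone-level constraint (names and sizes are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-db
spec:
  serviceName: example-db
  replicas: 6
  podManagementPolicy: Parallel   # see the notes on creation order below
  selector:
    matchLabels:
      app: example-db
  template:
    metadata:
      labels:
        app: example-db
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: example-db
      containers:
        - name: db
          image: registry.example.com/example-db:1.0   # illustrative image
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```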
For StatefulSets, consider the following special considerations:
- Pod creation order: StatefulSets create pods in order (0, 1, 2, ...), which may temporarily violate constraints.
- PersistentVolumes: Volume availability in specific zones may impact distribution.
- Pod deletion order: StatefulSets delete in reverse order, which may temporarily affect distribution.
- Pod Management Policy: Using the `Parallel` pod management policy can help achieve faster balanced distribution.
Handling Different Cluster Topologies
Different cluster topologies require different constraint configurations:
- Multi-zone Clusters
A single zone-level constraint (`maxSkew: 1` on `topology.kubernetes.io/zone` with `DoNotSchedule`) works well for clusters spanning multiple availability zones within a single region. It ensures that pods are distributed evenly across zones, minimizing the impact of zone-level failures.
In a three-zone cluster with 9 replicas, this would result in 3 pods per zone. If a zone fails, two-thirds of the application capacity would remain available.
- Single-zone, Multi-rack Clusters
For on-premises deployments or single-zone clusters with multiple racks, spreading on a custom `topology.kubernetes.io/rack` label distributes pods across physical racks. This protects against rack-level failures such as power or network issues.
With 4 racks and 12 replicas, this would place 3 pods per rack. A rack failure would result in a 25% capacity reduction.
- Multi-region, Multi-zone Clusters
A hierarchical approach (a region-level constraint with a relaxed maxSkew and `ScheduleAnyway`, plus a strict zone-level constraint) first distributes pods across regions with some flexibility and then ensures even distribution across zones within each region. This provides protection against both region and zone failures.
In a cluster spanning 2 regions with 3 zones each (6 zones total) and 18 replicas, this would ideally place 9 pods per region and 3 pods per zone.
- Custom Topology Hierarchy
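A sketch of such a three-level hierarchy (the network-segment label key is an assumption; substitute whatever custom label your nodes carry):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web-app
  - maxSkew: 1
    topologyKey: example.com/network-segment   # custom node label (illustrative)
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web-app
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web-app
```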
This three-level hierarchy distributes pods across zones, then network segments within zones, and finally across nodes within network segments. The first two constraints use `DoNotSchedule` for strict enforcement, while the node-level constraint uses `ScheduleAnyway` for flexibility.
This complex topology awareness provides resilience against multiple failure types simultaneously.
Advanced Use Cases
Topology spread constraints enable several advanced use cases:
Zone-aware StatefulSet Distribution
- Distribute database instances across zones
- Maintain even distribution of stateful workloads
- Ensure data availability during zone failures
- Balance read replicas across multiple zones
- Minimize cross-zone data transfer costs
- Provide consistent performance across zones
- Enable zone-local reads with distributed writes
- Support disaster recovery scenarios
- Implement quorum-based consensus systems
- Example StatefulSet configuration: see the highly available database cluster example later in this document
Hardware-aware Workload Distribution
- Distribute GPU workloads across available GPU nodes
- Balance CPU-intensive tasks across CPU types
- Optimize for specialized hardware utilization
- Avoid resource contention on accelerator cards
- Balance network-intensive workloads across NICs
- Ensure high-performance storage access
- Distribute workloads based on CPU architecture
- Optimize for NUMA topology
- Balance memory-intensive workloads
- Example for GPU workload distribution:
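A sketch for spreading GPU pods across GPU nodes (the `accelerator` node label and image are assumptions; `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: gpu-inference
  template:
    metadata:
      labels:
        app: gpu-inference
    spec:
      nodeSelector:
        accelerator: nvidia-gpu          # illustrative node label for GPU nodes
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: gpu-inference
      containers:
        - name: inference
          image: registry.example.com/gpu-inference:1.0   # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 1
```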
Cost-optimized Multi-cloud Distribution
- Balance workloads across cloud providers
- Optimize for spot instance availability
- Distribute load based on regional pricing
- Balance between on-premises and cloud resources
- Implement cloud bursting patterns
- Optimize for data transfer costs
- Use constraints alongside node taints for cost zones
- Balance reserved instances usage
- Implement follow-the-sun deployment patterns
- Example multi-cloud constraint:
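A sketch spreading over a custom provider label (the label key and values are assumptions; such a label would need to be applied to nodes by your provisioning tooling):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: example.com/cloud-provider   # e.g. aws, gcp, on-prem (illustrative)
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: batch-worker
```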
Application Component Co-location
- Keep related components in the same zone for performance
- Maintain component ratios across topology domains
- Optimize for communication latency
- Balance frontend and backend components
- Distribute cache instances alongside application servers
- Optimize data locality for analytics workloads
- Implement sharded architectures with balanced distribution
- Ensure messaging systems are properly distributed
- Balance data producers and consumers
- Example for component distribution:
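A sketch combining a zone-level spread for the frontend with a preferred zone-level affinity to the backend (labels and weight are illustrative; this block would live under the pod template's `spec`):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: frontend
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: backend
```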
These advanced use cases demonstrate the versatility of topology spread constraints beyond simple high-availability scenarios. By combining constraints with other Kubernetes features like affinity rules, resource requirements, and taints/tolerations, you can implement sophisticated workload placement strategies.
Real-world Example: Highly Available Database Cluster
Here's a comprehensive example of a highly available database cluster using topology spread constraints:
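A sketch of what such a StatefulSet might look like, assuming the official `postgres` image, a pre-existing `postgres-credentials` Secret, and manually applied rack labels. The init container is a simplified illustration of primary/replica role selection, not a full replication setup:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      topologySpreadConstraints:
        # Strictly spread instances across zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: postgres
        # Best-effort spread across racks within each zone
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/rack
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: postgres
      affinity:
        # Never place two instances on the same node
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: postgres
      initContainers:
        # Simplified role selection: ordinal 0 acts as primary, the rest as replicas
        - name: init-role
          image: postgres:16
          command: ["sh", "-c"]
          args:
            - |
              if [ "$(hostname)" = "postgres-0" ]; then
                echo primary > /etc/podinfo/role
              else
                echo replica > /etc/podinfo/role
              fi
          volumeMounts:
            - name: podinfo
              mountPath: /etc/podinfo
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # assumed to exist
                  key: password
          ports:
            - containerPort: 5432
              name: postgres
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 10
            periodSeconds: 10
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 4Gi
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
            - name: podinfo
              mountPath: /etc/podinfo
              readOnly: true
      volumes:
        - name: podinfo
          emptyDir: {}
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```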
This StatefulSet configuration ensures that:
- PostgreSQL instances are evenly distributed across zones (strictly enforced)
- Within zones, instances are distributed across racks (best effort)
- No two instances run on the same node (required anti-affinity)
- Each instance has stable storage and identity
- Proper initialization happens based on whether the instance is primary or replica
- Appropriate health checks are configured
- Resources are properly allocated
This example demonstrates how topology spread constraints can be combined with other Kubernetes features to create a comprehensive high-availability solution.
Monitoring and Troubleshooting
To effectively use topology spread constraints, you need to monitor and troubleshoot their behavior:
- Checking Pod Distribution
- Diagnosing Pending Pods
- Visualizing Topology Distribution
- Analyzing Scheduler Decisions
- Enable scheduler logging with higher verbosity
- Look for "TopologySpreadConstraint" entries in scheduler logs
- Check scheduler metrics related to topology spread constraints
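A few kubectl commands that can help with the checks listed above (the `app=example-app` label is an assumption; adjust to your workload):

```bash
# Show which node each matching pod landed on
kubectl get pods -l app=example-app -o wide

# Count pods per node
kubectl get pods -l app=example-app \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c

# Map nodes to zones to visualize the distribution
kubectl get nodes -L topology.kubernetes.io/zone

# Inspect a pending pod; unsatisfied constraints appear under Events
kubectl describe pod <pending-pod-name>
```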
Integration with Pod Affinity and Anti-Affinity
Topology spread constraints work alongside pod affinity and anti-affinity rules, creating powerful combinations for advanced placement control:
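A sketch of such a combination for a web tier that should sit near a cache (labels and weight are illustrative; this block would live under the pod template's `spec`):

```yaml
topologySpreadConstraints:
  # Hard requirement: even spread across zones
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web-app
  # Soft preference: spread across nodes within each zone
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web-app
affinity:
  podAffinity:
    # Prefer nodes that already run a cache pod, for data locality
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: cache
```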
This configuration implements a sophisticated placement strategy:
- Primary High Availability: Strictly enforces even distribution across zones, ensuring the application remains available during zone-level failures
- Secondary Efficiency: Preferentially spreads pods across different nodes within each zone, minimizing the impact of node failures
- Performance Optimization: Tries to place web-app pods on the same nodes as cache pods when possible, improving data locality and reducing network latency
- Balanced Priorities: Uses required constraints for critical availability guarantees and preferred constraints for optimization
The different mechanisms complement each other:
- Topology spread constraints handle the quantitative distribution across domains
- Pod anti-affinity handles qualitative separation between specific pods
- Pod affinity handles co-location for performance optimization
This combination addresses multiple concerns simultaneously:
- High availability through proper distribution
- Performance optimization through strategic co-location
- Resource efficiency through balanced placement
Performance Considerations
Topology spread constraints can impact scheduling performance and behavior in ways that require careful consideration:
- Scheduler Performance
- Complex constraints increase scheduling computation time
- Multiple constraints multiply the scheduler's workload
- Large clusters with many constraints may experience scheduling latency
- Consider using the `ScheduleAnyway` mode for non-critical constraints
- In large clusters (100+ nodes), monitor scheduler CPU usage
- Benchmark scheduling latency with and without constraints
- Example scheduler CPU impact:
- Schedule Time vs. Runtime Trade-offs
- Strict constraints (`DoNotSchedule`) can leave pods pending for extended periods
- Consider using soft constraints (`ScheduleAnyway`) for better scheduling success rates
- Balance distribution requirements against scheduling speed
- Implement progressive backoff for retrying constrained scheduling
- Consider implementing custom controllers for complex distribution requirements
- In dynamic environments, prefer `ScheduleAnyway` with appropriate metrics and alerts
- Example metric to monitor: the kube-scheduler's `scheduler_pending_pods` gauge (watch the unschedulable queue)
- Scaling Behavior
- Rapid scaling operations may temporarily violate constraints as the scheduler works
- Consider implementing rate-limited scaling for large deployments with constraints
- Use PodDisruptionBudgets alongside constraints to maintain distribution during scaling
- For StatefulSets, consider the `Parallel` pod management policy during scale-up
- Monitor skew metrics during scaling operations
- Example HorizontalPodAutoscaler with behavior limits:
- Node Additions and Removals
- Adding nodes does not rebalance existing pods by itself; new capacity is only used by newly scheduled pods (for example, when cluster-proportional autoscaling adds replicas) or when pods are recreated
- Node draining may temporarily violate constraints
- Plan maintenance operations with constraint satisfaction in mind
- Use node cordoning and draining strategically across topology domains
- Implement domain-aware maintenance procedures
- Consider domain impact when adding or removing nodes
- Example node drain command that respects topology:
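A sketch of a zone-by-zone maintenance flow (node names are placeholders; the drain flags shown are standard kubectl options):

```bash
# Work through one zone at a time to avoid violating zone-level constraints
kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=120

# Re-enable scheduling before moving on to the next node or zone
kubectl uncordon <node-name>
```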
Advanced Configuration: MatchLabelKeys and NodeAffinityPolicy
Kubernetes 1.25+ introduced additional configuration options for topology spread constraints that provide even more flexibility:
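A sketch showing the newer fields together (labels are illustrative; `nodeAffinityPolicy` defaults to Honor and `nodeTaintsPolicy` defaults to Ignore):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
        - key: app
          operator: In
          values:
            - web
    matchLabelKeys:          # values are taken from the incoming pod's labels
      - environment
      - team
    nodeAffinityPolicy: Honor
    nodeTaintsPolicy: Honor
```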
These advanced features provide powerful capabilities:
- matchLabelKeys
- Automatically includes pod labels with these keys in the label selector
- Simplifies configuration by dynamically incorporating pod metadata
- Enables consistent constraints across multiple deployments
- In the example, pods will be counted for skew if they match `app=web` (from the explicit matchExpressions) AND `environment=production` AND `team=alpha` (both taken from the current pod's labels via matchLabelKeys)
- Changes to pod labels will automatically affect constraint matching
- Useful for multi-dimensional topologies (app + environment + team)
- Reduces duplication in constraint specifications
- nodeAffinityPolicy
- Controls whether node affinity/anti-affinity is respected when calculating pod topology spread skew
- Values: `Honor` (default) or `Ignore`
- With `Honor`, only nodes that satisfy the pod's node affinity are considered
- With `Ignore`, all nodes are considered regardless of node affinity
- Impacts skew calculation by changing the set of eligible nodes
- Example use case: When using node affinity for specialized hardware but wanting even zone distribution
- nodeTaintsPolicy
- Controls whether node taints are respected when calculating pod topology spread skew
- Values: `Honor` or `Ignore` (the default is `Ignore`)
- With `Honor`, nodes with taints that the pod doesn't tolerate are not considered
- With `Ignore`, all nodes are considered regardless of taints
- Useful when certain nodes are reserved but should still factor into distribution calculations
- Example use case: When using taints for special workloads but wanting to maintain zone balance
These options allow for more sophisticated constraint policies that:
- Adapt dynamically to pod metadata
- Interact intelligently with other scheduling mechanisms
- Provide more accurate skew calculations in complex environments
- Reduce configuration overhead for multi-dimensional constraints
Best Practices
To effectively use topology spread constraints, follow these best practices:
Start with Relaxed Constraints
- Begin with higher maxSkew values and relax as needed
- Use `ScheduleAnyway` during initial implementation
- Tighten constraints gradually based on observed behavior
- Monitor scheduling success rates and adjust accordingly
- Consider workload characteristics when setting initial constraints
- Test with realistic pod counts and cluster configurations
- Example progressive implementation:
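A sketch of the progression (start soft and relaxed, then tighten once behavior is understood):

```yaml
# Phase 1: relaxed, soft constraint while observing real scheduling behavior
topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web-app

# Phase 2 (after validation): tighten to maxSkew: 1 and whenUnsatisfiable: DoNotSchedule
```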
Combine with Pod Disruption Budgets
- Use PDBs to maintain distribution during maintenance
- Ensure budget aligns with topology constraints
- Prevent disruptions that would violate constraints
- Set PDB parameters based on topology domain counts
- Account for expected maintenance scenarios in PDB settings
- Consider zone-level impacts in multi-zone deployments
- Example PDB configuration:
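A sketch of a PDB sized for 9 replicas spread across 3 zones (the numbers are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 6   # allows at most one zone's worth of pods (3 of 9) to be disrupted at a time
  selector:
    matchLabels:
      app: web-app
```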
Consider Startup Order for StatefulSets
- StatefulSets create pods in sequential order
- This may temporarily violate topology constraints
- Design for graceful startup with higher initial maxSkew
- Consider custom controllers for complex stateful workloads
- Use init containers to handle bootstrap coordination
- Implement readiness gates for topology-aware initialization
- Example StatefulSet with ordinal-based priority:
Label Nodes Appropriately
- Ensure all nodes have required topology labels
- Add custom topology domains as needed
- Standardize label naming conventions
- Validate node labels regularly
- Document topology labeling scheme
- Consider automation for label management
- Implement label verification in CI/CD pipelines
- Example node labeling:
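For example (node names and values are illustrative):

```bash
# Apply rack labels to groups of nodes
kubectl label node worker-1 worker-2 topology.kubernetes.io/rack=rack-a
kubectl label node worker-3 worker-4 topology.kubernetes.io/rack=rack-b

# Verify that every node carries the expected topology labels
kubectl get nodes -L topology.kubernetes.io/zone -L topology.kubernetes.io/rack
```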
Implement Hierarchical Constraints
- Use multiple constraints with different topology keys
- Order constraints from largest to smallest domains
- Consider the impact of each level on the others
- Test complex hierarchies thoroughly before production
- Document the hierarchy and its purpose
- Monitor distribution at each level
- Example hierarchical implementation: see the multi-level and custom topology hierarchy examples earlier in this document
Conclusion
Kubernetes Topology Spread Constraints provide a powerful mechanism for controlling pod distribution across infrastructure topology domains. By leveraging this feature, you can:
- Improve application availability by distributing workloads across failure domains
- Optimize resource utilization by preventing concentration in specific regions or zones
- Enhance performance by controlling workload placement with respect to network topology
- Implement complex high-availability patterns for stateful and stateless workloads
- Achieve cost optimization by balancing workloads across different pricing domains
- Create sophisticated multi-dimensional distribution strategies
- Implement regulatory compliance requirements for geographic distribution
- Protect against various failure scenarios from hardware to regional outages
When combined with other Kubernetes scheduling features like affinity rules, taints, tolerations, and priority classes, topology spread constraints enable sophisticated workload placement strategies that can significantly improve the resilience and efficiency of your applications.
As Kubernetes environments grow in complexity and scale across multiple regions, cloud providers, and infrastructure types, topology spread constraints become an essential tool for managing workload distribution and ensuring consistent application performance and availability. They represent a key capability for organizations implementing global, highly available Kubernetes platforms for mission-critical applications.
