
Kubernetes Topology Spread Constraints

Advanced pod scheduling across topology domains using Topology Spread Constraints for high availability and balanced workload distribution

Understanding Topology Spread Constraints

Kubernetes Topology Spread Constraints provide a sophisticated mechanism for controlling how pods are distributed across your cluster with respect to topology domains such as regions, zones, nodes, and other user-defined topology domains. This feature enables advanced high-availability configurations, improved resource utilization, and protection against topology-related failures.

Introduced as a beta feature in Kubernetes 1.18 and graduated to stable in 1.19, topology spread constraints give cluster administrators and application developers fine-grained control over workload distribution beyond what is possible with pod affinity/anti-affinity rules alone. This capability becomes increasingly important as clusters span multiple availability zones, regions, or even cloud providers.

The core principle behind topology spread constraints is simple yet powerful: ensure that pods are distributed according to specific rules across different topology domains to minimize the impact of domain-level failures. Unlike traditional scheduling approaches that focus primarily on resource availability, topology spread constraints prioritize the relative distribution of pods, ensuring balanced deployment across the infrastructure landscape.

When to Use Topology Spread Constraints

Topology spread constraints are particularly valuable when a workload must survive zone, rack, or node failures, when replicas should be balanced across domains for even resource utilization, and when distribution itself is a requirement rather than a side effect of scheduling.

The flexibility of topology spread constraints allows them to be adapted to various operational requirements, from strict regulatory compliance that mandates geographic distribution to performance-critical applications that require careful balancing of latency and throughput considerations.

Core Concepts

Topology spread constraints rely on several key concepts that form the foundation of this powerful scheduling capability:

Topology Domain

  • A division of your infrastructure based on specific characteristics
  • Common domains include zone, region, node, rack, switch
  • Can use any node label as a topology domain
  • Defined by the topologyKey field in the constraint
  • Examples: topology.kubernetes.io/zone, kubernetes.io/hostname, node.kubernetes.io/instance-type
  • Well-known topology keys are automatically added by cloud providers
  • Custom topology domains can be created with node labels
  • Hierarchical domains (e.g., region -> zone -> node) can be used together
  • Domain selection impacts failure isolation and performance characteristics
  • Proper domain labeling is crucial for effective constraint implementation
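
As a quick check before writing constraints, list the topology labels your nodes actually carry. A minimal example, assuming the well-known zone, region, and hostname keys (your cluster may add custom keys on top of these):

# Show each node together with its zone, region, and hostname labels
kubectl get nodes -L topology.kubernetes.io/zone,topology.kubernetes.io/region,kubernetes.io/hostname

# List all labels on one node to discover custom topology keys
kubectl get node <node-name> --show-labels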

Spread Constraints

  • Rules that define how pods should be distributed
  • Specified in the pod specification under topologySpreadConstraints
  • Can include multiple constraints for different topology keys
  • Evaluated during the scheduling process to determine pod placement
  • Multiple constraints are combined with a logical AND: a pod is only placed on a node that satisfies every constraint
  • Can be applied to individual pods or pod templates in workload resources
  • Act as hard or soft requirements depending on configuration
  • Can include pod label selectors to count only specific pods
  • More specific than pod affinity/anti-affinity for distribution purposes
  • Each constraint includes:
    • maxSkew: The degree to which pods may be unevenly distributed
    • topologyKey: The node label used to identify the topology domain
    • whenUnsatisfiable: The behavior when the constraint cannot be satisfied
    • labelSelector: Which pods to count for determining the spread
    • matchLabelKeys (optional): Pod labels to include in the selection criteria
    • nodeAffinityPolicy (optional): How node affinity should be considered
    • nodeTaintsPolicy (optional): How node taints should be considered
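
Putting these fields together, a single fully annotated constraint might look like the sketch below (label values are illustrative; matchLabelKeys and the two policy fields require a recent Kubernetes release):

topologySpreadConstraints:
- maxSkew: 1                                 # at most 1 pod of difference between domains
  topologyKey: topology.kubernetes.io/zone   # spread across zones
  whenUnsatisfiable: DoNotSchedule           # hard requirement; otherwise the pod stays Pending
  labelSelector:
    matchLabels:
      app: example-app                       # which pods to count
  matchLabelKeys:
    - version                                # also match this label taken from the incoming pod
  nodeAffinityPolicy: Honor                  # only count nodes the pod's node affinity allows
  nodeTaintsPolicy: Ignore                   # count tainted nodes too (the default behavior)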

MaxSkew

  • Defines the maximum permitted difference between domains
  • Value of 1 means domains can differ by at most 1 pod
  • Higher values allow more imbalance between domains
  • If domains have 5, 5, and 4 matching pods, the skew is 1
  • If domains have 6, 3, and 2 matching pods, the skew is 4
  • Calculated using the formula: max(domain_count) - min(domain_count)
  • Represents the tolerance for imbalance in the system
  • Smaller values enforce stricter distribution
  • Should be tuned based on the total pod count and domain count
  • Important consideration: maxSkew of 1 combined with DoNotSchedule can block scheduling when the lagging domain has no remaining capacity, even if other domains do
  • For new deployments, consider starting with higher skew and reducing over time
  • Different workloads may require different skew values based on criticality
  • Only domains that contain eligible nodes are considered; an eligible domain with zero matching pods counts as 0 and pulls the global minimum down
  • The first pod of a workload can be placed in any eligible domain, since every domain starts at a count of zero

WhenUnsatisfiable

  • Determines behavior when constraints cannot be met
  • DoNotSchedule: Pod remains pending until constraints are satisfiable (default)
  • ScheduleAnyway: Best-effort distribution; pod will be scheduled even if constraints are violated
  • With DoNotSchedule, pods may remain pending indefinitely if constraints cannot be satisfied
  • With ScheduleAnyway, the scheduler will attempt to minimize the skew even when it cannot be fully satisfied
  • DoNotSchedule provides stronger guarantees but may impact availability
  • ScheduleAnyway prioritizes availability over perfect distribution
  • Can be used to implement hard and soft constraints in the same pod spec (see the sketch after this list)
  • Critical for balancing strict distribution requirements against scheduling success
  • Pending pods with DoNotSchedule will be reconsidered when cluster state changes
  • Pods with ScheduleAnyway still respect other scheduling constraints like resource requirements
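
For example, a pod spec can enforce zone balance strictly while only preferring node balance; a minimal sketch of that hard-and-soft combination (label values illustrative):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule    # hard: never violate zone balance
  labelSelector:
    matchLabels:
      app: example-app
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway   # soft: prefer node balance, but schedule regardless
  labelSelector:
    matchLabels:
      app: example-app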

These concepts work together to form a comprehensive system for controlling pod distribution. The flexibility of this system allows for simple configurations for basic use cases while supporting sophisticated distribution patterns for complex environments.

Topology Labels and Domains

The effectiveness of topology spread constraints depends heavily on proper node labeling. Kubernetes provides several standard topology labels:

  1. Cloud Provider Topology Labels
    • topology.kubernetes.io/zone: The availability zone where the node is running
    • topology.kubernetes.io/region: The geographic region where the node is running
    • Automatically added by cloud providers like AWS, GCP, and Azure
    • Example values: us-east-1a (zone), us-east-1 (region)
  2. Node Identity Labels
    • kubernetes.io/hostname: The node's hostname
    • Unique per node in the cluster
    • Useful for spreading pods across physical machines
  3. Custom Topology Labels
    • topology.kubernetes.io/rack: Physical rack location (must be manually labeled)
    • node.kubernetes.io/instance-type: The type of VM or hardware
    • Any custom label that represents a meaningful topology domain

Example of manually adding topology labels:

# Label nodes with rack information
kubectl label node worker-1 topology.kubernetes.io/rack=rack1
kubectl label node worker-2 topology.kubernetes.io/rack=rack1
kubectl label node worker-3 topology.kubernetes.io/rack=rack2
kubectl label node worker-4 topology.kubernetes.io/rack=rack2

# Label nodes with custom power domain information
kubectl label node worker-1 example.com/power-supply=ups1
kubectl label node worker-2 example.com/power-supply=ups1
kubectl label node worker-3 example.com/power-supply=ups2
kubectl label node worker-4 example.com/power-supply=ups2

When designing your topology domains, consider the failure modes you want to protect against:

  • Zone-level failures: Use topology.kubernetes.io/zone
  • Node-level failures: Use kubernetes.io/hostname
  • Rack-level failures: Use topology.kubernetes.io/rack
  • Network segment failures: Use custom network topology labels
  • Power distribution failures: Use custom power domain labels

The hierarchy of domains is also important—typically, you want to spread across larger domains first (regions), then medium-sized domains (zones), and finally smaller domains (nodes).

Basic Implementation

To implement topology spread constraints, you need to add the topologySpreadConstraints field to your pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: example-app
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: example-app
  containers:
  - name: nginx
    image: nginx:1.20

This configuration ensures that the difference between the number of pods with the label app=example-app across any two zones is at most 1. If this cannot be achieved, the pod will remain in the Pending state.

Let's break down how this works in practice:

Assume you have a cluster with three zones (zone-a, zone-b, and zone-c) and the following pod distribution:

  • zone-a: 2 pods with label app=example-app
  • zone-b: 1 pod with label app=example-app
  • zone-c: 1 pod with label app=example-app

The current skew is 1 (maximum 2 - minimum 1 = 1), which equals the maxSkew specified.

If you create a new pod with the above constraint:

  • It can be scheduled in zone-b or zone-c (bringing either to 2 pods)
  • It cannot be scheduled in zone-a (would increase skew to 2)

If all zones already had 2 pods each (skew = 0), the new pod could be scheduled in any zone, as it would result in a skew of 1, which is acceptable.

Advanced Configuration Patterns

Topology spread constraints can be configured in many ways to achieve different distribution patterns, from a single strict constraint to layered combinations of hard and soft rules across several topology keys.

These configuration patterns can be combined in various ways to address complex distribution requirements. The flexibility of the constraint system allows for tailored solutions that balance high availability, resource utilization, and performance considerations.
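
One pattern worth highlighting, assuming a cluster where the matchLabelKeys field is available (beta since Kubernetes 1.27): include the Deployment-managed pod-template-hash label in the constraint so that each revision is spread independently during rolling updates instead of being counted against pods from the previous revision.

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app
  matchLabelKeys:
    - pod-template-hash   # set automatically on Deployment pods; scopes skew to one revision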

Implementation with Deployments and StatefulSets

Topology spread constraints are more commonly used with Deployments, StatefulSets, and other workload controllers rather than with individual pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 9
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-server
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-server
      containers:
      - name: nginx
        image: nginx:1.20

This Deployment creates 9 replicas that will be evenly distributed across both zones and nodes. With 3 zones, you would have 3 pods per zone, and these 3 pods in each zone would be distributed across different nodes.

The scheduling process for this deployment would work as follows:

  1. Initial scheduling: The first pod can be scheduled anywhere since there's no skew yet.
  2. Zone balancing: Subsequent pods will be distributed to maintain balanced zones.
  3. Node balancing: Within each zone, pods will be distributed across nodes.
  4. Scaling up: When adding pods, they'll be placed to maintain the minimum skew.
  5. Scaling down: constraints are only enforced at scheduling time, so removing pods can leave the distribution uneven; a descheduler can rebalance it if needed.
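
To watch the zone balancing while scaling up, you can change the replica count and follow where new pods land; with 3 zones, 12 replicas should settle at 4 pods per zone (label and Deployment names are taken from the example above):

# Scale up and watch placement as pods are scheduled
kubectl scale deployment web-server --replicas=12
kubectl get pods -l app=web-server -o wide --watch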

StatefulSet Example

StatefulSets have unique considerations due to their ordered creation and stable identities:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: distributed-database
spec:
  serviceName: "database"
  replicas: 6
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: database
      containers:
      - name: postgres
        image: postgres:13
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

For StatefulSets, consider the following special considerations:

  1. Pod creation order: StatefulSets create pods in order (0, 1, 2, ...), which may temporarily violate constraints.
  2. PersistentVolumes: Volume availability in specific zones may impact distribution (see the storage check after this list).
  3. Pod deletion order: StatefulSets delete in reverse order, which may temporarily affect distribution.
  4. Pod Management Policy: Using Parallel pod management policy can help achieve faster balanced distribution.
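
For the PersistentVolume consideration above, topology-aware provisioning helps: a StorageClass with volumeBindingMode: WaitForFirstConsumer delays volume creation until the pod is scheduled, so each volume lands in whatever zone the spread constraints choose. A quick check (substitute the class used by your volumeClaimTemplates):

# Inspect the binding mode of the StorageClass
kubectl get storageclass <storage-class-name> -o jsonpath='{.volumeBindingMode}{"\n"}'

If this prints Immediate, volumes may be provisioned in a zone before scheduling happens, which can conflict with the zone constraint.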

Handling Different Cluster Topologies

Different cluster topologies require different constraint configurations:

  1. Multi-zone Clusters
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    

    This configuration works well for clusters spanning multiple availability zones within a single region. The constraint ensures that pods are distributed evenly across zones, minimizing the impact of zone-level failures.
    In a three-zone cluster with 9 replicas, this would result in 3 pods per zone. If a zone fails, two-thirds of the application capacity would remain available.
  2. Single-zone, Multi-rack Clusters
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/rack
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    

    For on-premises deployments or single-zone clusters with multiple racks, this configuration distributes pods across physical racks. This protects against rack-level failures such as power or network issues.
    With 4 racks and 12 replicas, this would place 3 pods per rack. A rack failure would result in a 25% capacity reduction.
  3. Multi-region, Multi-zone Clusters
    topologySpreadConstraints:
    - maxSkew: 2
      topologyKey: topology.kubernetes.io/region
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    

    This hierarchical approach first distributes pods across regions (with some flexibility) and then ensures even distribution across zones within each region. This configuration provides protection against both region and zone failures.
    In a cluster spanning 2 regions with 3 zones each (6 zones total) and 18 replicas, this would ideally place 9 pods per region and 3 pods per zone.
  4. Custom Topology Hierarchy
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    - maxSkew: 1
      topologyKey: custom.topology/network-segment
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
    

    This three-level hierarchy distributes pods across zones, then network segments within zones, and finally across nodes within network segments. The first two constraints use DoNotSchedule for strict enforcement, while the node-level constraint uses ScheduleAnyway for flexibility.
    This complex topology awareness provides resilience against multiple failure types simultaneously.

Advanced Use Cases

Topology spread constraints enable several advanced use cases:

Zone-aware StatefulSet Distribution

  • Distribute database instances across zones
  • Maintain even distribution of stateful workloads
  • Ensure data availability during zone failures
  • Balance read replicas across multiple zones
  • Minimize cross-zone data transfer costs
  • Provide consistent performance across zones
  • Enable zone-local reads with distributed writes
  • Support disaster recovery scenarios
  • Implement quorum-based consensus systems
  • Example StatefulSet configuration:
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: zone-aware-db
    spec:
      replicas: 6
      podManagementPolicy: Parallel
      selector:
        matchLabels:
          app: database
      template:
        metadata:
          labels:
            app: database
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: database
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - database
                topologyKey: kubernetes.io/hostname
          containers:
          - name: db
            image: postgres:13
            env:
            # Note: the Downward API exposes pod fields, not node labels, so the zone
            # cannot be injected directly; expose the node name and resolve the zone
            # from node labels at startup if needed
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
    

Hardware-aware Workload Distribution

  • Distribute GPU workloads across available GPU nodes
  • Balance CPU-intensive tasks across CPU types
  • Optimize for specialized hardware utilization
  • Avoid resource contention on accelerator cards
  • Balance network-intensive workloads across NICs
  • Ensure high-performance storage access
  • Distribute workloads based on CPU architecture
  • Optimize for NUMA topology
  • Balance memory-intensive workloads
  • Example for GPU workload distribution:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-workload
    spec:
      replicas: 12
      selector:
        matchLabels:
          app: ml-training
      template:
        metadata:
          labels:
            app: ml-training
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: nvidia.com/gpu.product
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: ml-training
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: ml-training
          containers:
          - name: training
            image: ml-training:latest
            resources:
              limits:
                nvidia.com/gpu: 1
    

Cost-optimized Multi-cloud Distribution

  • Balance workloads across cloud providers
  • Optimize for spot instance availability
  • Distribute load based on regional pricing
  • Balance between on-premises and cloud resources
  • Implement cloud bursting patterns
  • Optimize for data transfer costs
  • Use constraints alongside node taints for cost zones
  • Balance reserved instances usage
  • Implement follow-the-sun deployment patterns
  • Example multi-cloud constraint:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cost-optimized-app
    spec:
      replicas: 20
      selector:
        matchLabels:
          app: web-service
      template:
        metadata:
          labels:
            app: web-service
        spec:
          topologySpreadConstraints:
          - maxSkew: 5  # Allows some imbalance for cost optimization
            topologyKey: cloud.kubernetes.io/provider
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: web-service
          - maxSkew: 2
            topologyKey: node.kubernetes.io/instance-type
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: web-service
          - maxSkew: 1
            topologyKey: cloud.kubernetes.io/instance-lifecycle
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: web-service
          containers:
          - name: web
            image: web-service:latest
    

Application Component Co-location

  • Keep related components in the same zone for performance
  • Maintain component ratios across topology domains
  • Optimize for communication latency
  • Balance frontend and backend components
  • Distribute cache instances alongside application servers
  • Optimize data locality for analytics workloads
  • Implement sharded architectures with balanced distribution
  • Ensure messaging systems are properly distributed
  • Balance data producers and consumers
  • Example for component distribution:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: frontend
    spec:
      replicas: 9
      selector:
        matchLabels:
          app: frontend
      template:
        metadata:
          labels:
            app: frontend
            component: web
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                component: web
          containers:
          - name: frontend
            image: frontend:latest
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: backend
    spec:
      replicas: 9
      selector:
        matchLabels:
          app: backend
      template:
        metadata:
          labels:
            app: backend
            component: api
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                component: api
          affinity:
            podAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      component: web
                  topologyKey: topology.kubernetes.io/zone
          containers:
          - name: backend
            image: backend:latest
    

These advanced use cases demonstrate the versatility of topology spread constraints beyond simple high-availability scenarios. By combining constraints with other Kubernetes features like affinity rules, resource requirements, and taints/tolerations, you can implement sophisticated workload placement strategies.

Real-world Example: Highly Available Database Cluster

Here's a comprehensive example of a highly available database cluster using topology spread constraints:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ha-postgresql
spec:
  serviceName: "postgresql"
  replicas: 5  # Primary + 4 replicas
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      terminationGracePeriodSeconds: 60
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: postgresql
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/rack
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: postgresql
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: postgresql
            topologyKey: kubernetes.io/hostname
      initContainers:
      - name: init-postgresql
        image: postgres:13-alpine
        command: ['/bin/sh', '-c']
        args:
        - |
          # Determine if this is primary (index 0) or replica
          HOSTNAME=$(hostname)
          ORDINAL=${HOSTNAME##*-}
          if [ "$ORDINAL" = "0" ]; then
            echo "Initializing as primary"
            # Primary-specific initialization
          else
            echo "Initializing as replica"
            # Replica-specific initialization
          fi
      containers:
      - name: postgresql
        image: postgres:13
        ports:
        - containerPort: 5432
          name: postgresql
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgresql-secret
              key: password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: NODE_NAME   # the zone can be resolved from this node's labels at startup
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        - name: postgresql-config
          mountPath: /etc/postgresql
        resources:
          requests:
            cpu: 1
            memory: 2Gi
          limits:
            cpu: 2
            memory: 4Gi
        livenessProbe:
          exec:
            command:
            - pg_isready
            - -U
            - postgres
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - pg_isready
            - -U
            - postgres
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
      volumes:
      - name: postgresql-config
        configMap:
          name: postgresql-config
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "premium-storage"
      resources:
        requests:
          storage: 100Gi

This StatefulSet configuration ensures that:

  1. PostgreSQL instances are evenly distributed across zones (strictly enforced)
  2. Within zones, instances are distributed across racks (best effort)
  3. No two instances run on the same node (required anti-affinity)
  4. Each instance has stable storage and identity
  5. Proper initialization happens based on whether the instance is primary or replica
  6. Appropriate health checks are configured
  7. Resources are properly allocated

This example demonstrates how topology spread constraints can be combined with other Kubernetes features to create a comprehensive high-availability solution.

Monitoring and Troubleshooting

To effectively use topology spread constraints, you need to monitor and troubleshoot their behavior:

  1. Checking Pod Distribution
    # View pod placement by node (NAME and NODE columns from -o wide)
    kubectl get pods -l app=my-app -o wide --sort-by=".spec.nodeName" | \
      awk '{print $1, $7}' | \
      column -t
    
    # Count pods per zone
    kubectl get pods -l app=my-app -o jsonpath='{.items[*].spec.nodeName}' | \
      tr ' ' '\n' | \
      xargs -I{} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' | \
      sort | \
      uniq -c
    
  2. Diagnosing Pending Pods
    # Check pod events
    kubectl describe pod <pending-pod-name>
    
    # Look for scheduling events with messages about topology spread constraints
    # Example output:
    # Warning  FailedScheduling  8s (x22 over 5m)  default-scheduler  0/6 nodes are available: 
    # 1 node(s) didn't match Pod's node affinity/selector, 5 node(s) didn't match pod topology spread constraints.
    
  3. Visualizing Topology Distribution
    # Using kubectl-topology plugin (if installed)
    kubectl topology pods -l app=my-app --by-node
    
    # Custom script to visualize distribution
    # (the inspector pod needs a ServiceAccount allowed to list pods and nodes; the quoted
    #  'EOF' keeps your local shell from expanding the script before it is applied)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: topology-inspector
    spec:
      containers:
      - name: inspector
        image: bitnami/kubectl
        command: 
        - "bash"
        - "-c"
        - |
          echo "Topology Distribution Analysis"
          echo "=============================="
          echo "Zone distribution:"
          kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' | sort -k2 | awk '{print $2}' | uniq -c
          
          echo -e "\nPod distribution by zone:"
          for zone in $(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' | sort -u); do
            echo -e "\nZone: $zone"
            # List the nodes in this zone, then count pods running on any of them
            kubectl get nodes -l topology.kubernetes.io/zone=$zone -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' > /tmp/zone-nodes
            kubectl get pods -l app=my-app -o wide --no-headers | grep -c -w -F -f /tmp/zone-nodes
          done
          
          echo -e "\nDetailed pod placement:"
          kubectl get pods -l app=my-app -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' | sort -k2
        
          # Stay alive for debugging
          sleep 3600
    EOF
    
    # Retrieve the visualization results
    kubectl logs -f topology-inspector
    
  4. Analyzing Scheduler Decisions
    • Enable scheduler logging with higher verbosity
    • Look for "TopologySpreadConstraint" entries in scheduler logs
    • Check scheduler metrics related to topology spread constraints
    # Enable verbose scheduler logging (with kubeadm the scheduler runs as a static pod,
    # so edit its manifest on the control-plane node rather than a Deployment)
    sudo vi /etc/kubernetes/manifests/kube-scheduler.yaml
    
    # Add to spec.containers[0].command:
    # - --v=4
    
    # Check logs for topology spread constraint decisions
    kubectl -n kube-system logs -l component=kube-scheduler | grep -i topologyspread
    
    # Check metrics for scheduling attempts and failures (the scheduler exposes metrics
    # on its secure port 10259; the endpoint may require an authenticated request)
    kubectl -n kube-system port-forward pod/kube-scheduler-<control-plane-node> 10259:10259
    curl -k https://localhost:10259/metrics | grep -i schedule_attempts
    

Integration with Pod Affinity and Anti-Affinity

Topology spread constraints work alongside pod affinity and anti-affinity rules, creating powerful combinations for advanced placement control:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: combined-placement-controls
spec:
  replicas: 12
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-app
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web-app
              topologyKey: kubernetes.io/hostname
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cache
              topologyKey: kubernetes.io/hostname
      containers:
      - name: web-app
        image: web-app:latest

This configuration implements a sophisticated placement strategy:

  1. Primary High Availability: Strictly enforces even distribution across zones, ensuring the application remains available during zone-level failures
  2. Secondary Efficiency: Preferentially spreads pods across different nodes within each zone, minimizing the impact of node failures
  3. Performance Optimization: Tries to place web-app pods on the same nodes as cache pods when possible, improving data locality and reducing network latency
  4. Balanced Priorities: Uses required constraints for critical availability guarantees and preferred constraints for optimization

The different mechanisms complement each other:

  • Topology spread constraints handle the quantitative distribution across domains
  • Pod anti-affinity handles qualitative separation between specific pods
  • Pod affinity handles co-location for performance optimization

This combination addresses multiple concerns simultaneously:

  • High availability through proper distribution
  • Performance optimization through strategic co-location
  • Resource efficiency through balanced placement

Performance Considerations

Topology spread constraints add work to every scheduling cycle and can change scheduling outcomes in ways that require careful consideration. Each constraint is evaluated for every candidate node, so clusters with many constrained pods and many domains see higher scheduler latency; DoNotSchedule constraints can leave pods Pending when domains fill unevenly; and because constraints are only enforced at scheduling time, skew that develops later (after scale-downs, node failures, or rolling updates) is not corrected automatically unless a rebalancing tool such as the descheduler is used.

Advanced Configuration: MatchLabelKeys and NodeAffinityPolicy

Kubernetes 1.25 introduced additional configuration options for topology spread constraints (matchLabelKeys, nodeAffinityPolicy, and nodeTaintsPolicy, initially behind feature gates) that provide even more flexibility:

apiVersion: v1
kind: Pod
metadata:
  name: advanced-constraints-pod
  labels:
    app: web
    environment: production
    team: alpha
    version: v2
    criticality: high
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - web
    matchLabelKeys:
      - environment
      - team
    nodeAffinityPolicy: Honor
    nodeTaintsPolicy: Honor
  containers:
  - name: web
    image: web-app:2.3

These advanced features provide powerful capabilities:

  1. matchLabelKeys
    • Automatically includes pod labels with these keys in the label selector
    • Simplifies configuration by dynamically incorporating pod metadata
    • Enables consistent constraints across multiple deployments
    • In the example, pods will be counted for skew if they match:
      • app=web (from explicit matchExpressions) AND
      • environment=production (from current pod's label via matchLabelKeys) AND
      • team=alpha (from current pod's label via matchLabelKeys)
    • Changes to pod labels will automatically affect constraint matching
    • Useful for multi-dimensional topologies (app + environment + team)
    • Reduces duplication in constraint specifications
  2. nodeAffinityPolicy
    • Controls whether node affinity/anti-affinity is respected when calculating pod topology spread skew
    • Values: Honor (default) or Ignore
    • With Honor, only nodes that satisfy the pod's node affinity are considered
    • With Ignore, all nodes are considered regardless of node affinity
    • Impacts skew calculation by changing the set of eligible nodes
    • Example use case: When using node affinity for specialized hardware but wanting even zone distribution
  3. nodeTaintsPolicy
    • Controls whether node taints are respected when calculating pod topology spread skew
    • Values: Honor or Ignore (unlike nodeAffinityPolicy, the default when unset is Ignore)
    • With Honor, nodes with taints that the pod doesn't tolerate are not considered
    • With Ignore, all nodes are considered regardless of taints
    • Useful when certain nodes are reserved but should still factor into distribution calculations
    • Example use case: When using taints for special workloads but wanting to maintain zone balance

These options allow for more sophisticated constraint policies that:

  • Adapt dynamically to pod metadata
  • Interact intelligently with other scheduling mechanisms
  • Provide more accurate skew calculations in complex environments
  • Reduce configuration overhead for multi-dimensional constraints
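
To make the matchLabelKeys behavior concrete: for the advanced-constraints-pod example above, the scheduler effectively evaluates the constraint as if the selector had been expanded like this (a sketch only; the expansion happens inside the scheduler, not in your manifest):

labelSelector:
  matchExpressions:
  - key: app
    operator: In
    values:
    - web
  - key: environment        # value taken from the incoming pod via matchLabelKeys
    operator: In
    values:
    - production
  - key: team               # value taken from the incoming pod via matchLabelKeys
    operator: In
    values:
    - alpha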

Best Practices

To effectively use topology spread constraints, follow these best practices:

Start with Relaxed Constraints

  • Begin with higher maxSkew values and tighten them as needed
  • Use ScheduleAnyway during initial implementation
  • Tighten constraints gradually based on observed behavior
  • Monitor scheduling success rates and adjust accordingly
  • Consider workload characteristics when setting initial constraints
  • Test with realistic pod counts and cluster configurations
  • Example progressive implementation:
    # Initial implementation
    topologySpreadConstraints:
    - maxSkew: 3
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      
    # After validation
    topologySpreadConstraints:
    - maxSkew: 2
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      
    # Final configuration
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
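
  • To confirm that tightening a constraint is not causing scheduling failures, watch for Pending pods and FailedScheduling events after each change; a quick check (label value illustrative):
    # Pods of the app that are stuck Pending
    kubectl get pods -l app=my-app --field-selector=status.phase=Pending

    # Recent scheduling failures and their reasons
    kubectl get events --field-selector reason=FailedScheduling --sort-by=.lastTimestamp | tail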
    

Combine with Pod Disruption Budgets

  • Use PDBs to maintain distribution during maintenance
  • Ensure budget aligns with topology constraints
  • Prevent disruptions that would violate constraints
  • Set PDB parameters based on topology domain counts
  • Account for expected maintenance scenarios in PDB settings
  • Consider zone-level impacts in multi-zone deployments
  • Example PDB configuration:
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: zonal-pdb
    spec:
      minAvailable: "33%"  # Ensures at least one zone worth of pods remains
      selector:
        matchLabels:
          app: my-app
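
  • A quick way to confirm the budget is active (the ALLOWED DISRUPTIONS column should become non-zero once enough pods are Ready):
    kubectl get pdb zonal-pdb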
    

Consider Startup Order for StatefulSets

  • StatefulSets create pods in sequential order
  • This may temporarily violate topology constraints
  • Design for graceful startup with higher initial maxSkew
  • Consider custom controllers for complex stateful workloads
  • Use init containers to handle bootstrap coordination
  • Implement readiness gates for topology-aware initialization
  • Example StatefulSet with ordinal-based priority:
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: distributed-db
    spec:
      podManagementPolicy: Parallel  # Use parallel for better distribution during startup
      replicas: 9
      template:
        # ... other configuration
        spec:
          initContainers:
          - name: topology-bootstrap
            image: bitnami/kubectl
            command: ["bash", "-c"]
            args:
              - |
                # Get pod ordinal
                ORDINAL=$(echo $HOSTNAME | rev | cut -d'-' -f1 | rev)
                # Get zone
                ZONE=$(kubectl get node $NODE_NAME -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
                # Topology-aware initialization logic
                if [ "$ORDINAL" -lt "3" ]; then
                  # Primary zone initialization
                  echo "Initializing as primary in zone $ZONE"
                else
                  # Secondary zone initialization
                  echo "Initializing as replica in zone $ZONE"
                fi
            env:
              - name: NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
    

Label Nodes Appropriately

  • Ensure all nodes have required topology labels
  • Add custom topology domains as needed
  • Standardize label naming conventions
  • Validate node labels regularly
  • Document topology labeling scheme
  • Consider automation for label management
  • Implement label verification in CI/CD pipelines
  • Example node labeling:
    # Add rack label to nodes
    kubectl label nodes node1 node2 node3 topology.kubernetes.io/rack=rack1
    kubectl label nodes node4 node5 node6 topology.kubernetes.io/rack=rack2
    
    # Add custom topology domain
    kubectl label nodes node1 node4 custom.topology/power-supply=ups1
    kubectl label nodes node2 node5 custom.topology/power-supply=ups2
    kubectl label nodes node3 node6 custom.topology/power-supply=ups3
    
    # Add network segment information
    kubectl label nodes node1 node2 custom.topology/network-segment=segment1
    kubectl label nodes node3 node4 custom.topology/network-segment=segment2
    kubectl label nodes node5 node6 custom.topology/network-segment=segment3
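
  • To catch nodes that are still missing a required topology label (useful for the regular validation mentioned above), select on label absence; the rack key is taken from the example:
    # Nodes that do not yet carry the rack label
    kubectl get nodes -l '!topology.kubernetes.io/rack'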
    

Implement Hierarchical Constraints

  • Use multiple constraints with different topology keys
  • Order constraints from largest to smallest domains
  • Consider the impact of each level on the others
  • Test complex hierarchies thoroughly before production
  • Document the hierarchy and its purpose
  • Monitor distribution at each level
  • Example hierarchical implementation:
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/region
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: critical-app
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: critical-app
    - maxSkew: 2
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: critical-app
    

Conclusion

Kubernetes Topology Spread Constraints provide a powerful mechanism for controlling pod distribution across infrastructure topology domains. By leveraging this feature, you can:

  1. Improve application availability by distributing workloads across failure domains
  2. Optimize resource utilization by preventing concentration in specific regions or zones
  3. Enhance performance by controlling workload placement with respect to network topology
  4. Implement complex high-availability patterns for stateful and stateless workloads
  5. Achieve cost optimization by balancing workloads across different pricing domains
  6. Create sophisticated multi-dimensional distribution strategies
  7. Implement regulatory compliance requirements for geographic distribution
  8. Protect against various failure scenarios from hardware to regional outages

When combined with other Kubernetes scheduling features like affinity rules, taints, tolerations, and priority classes, topology spread constraints enable sophisticated workload placement strategies that can significantly improve the resilience and efficiency of your applications.

As Kubernetes environments grow in complexity and scale across multiple regions, cloud providers, and infrastructure types, topology spread constraints become an essential tool for managing workload distribution and ensuring consistent application performance and availability. They represent a key capability for organizations implementing global, highly available Kubernetes platforms for mission-critical applications.