
Kubernetes Topology Spread Constraints

Advanced pod scheduling across topology domains using Topology Spread Constraints for high availability and balanced workload distribution

Understanding Topology Spread Constraints

Kubernetes Topology Spread Constraints provide a sophisticated mechanism for controlling how pods are distributed across your cluster with respect to topology domains such as regions, zones, nodes, and other user-defined topology domains. This feature enables advanced high-availability configurations, improved resource utilization, and protection against topology-related failures.

Introduced as a beta feature in Kubernetes 1.18 and graduated to stable in 1.19, topology spread constraints give cluster administrators and application developers fine-grained control over workload distribution beyond what is possible with pod affinity/anti-affinity rules alone. This capability becomes increasingly important as clusters span multiple availability zones, regions, or even cloud providers.

The core principle behind topology spread constraints is simple yet powerful: ensure that pods are distributed according to specific rules across different topology domains to minimize the impact of domain-level failures. Unlike traditional scheduling approaches that focus primarily on resource availability, topology spread constraints prioritize the relative distribution of pods, ensuring balanced deployment across the infrastructure landscape.

When to Use Topology Spread Constraints

Topology spread constraints are particularly valuable when a workload must survive zone, rack, or node failures, when replicas should be balanced across domains for even resource utilization, and when distribution itself is a requirement rather than a side effect of scheduling.

The flexibility of topology spread constraints allows them to be adapted to various operational requirements, from strict regulatory compliance that mandates geographic distribution to performance-critical applications that require careful balancing of latency and throughput considerations.

Core Concepts

Topology spread constraints rely on several key concepts that form the foundation of this powerful scheduling capability:

Topology Domain

  • A division of your infrastructure based on specific characteristics
  • Common domains include zone, region, node, rack, switch
  • Can use any node label as a topology domain
  • Defined by the topologyKey field in the constraint
  • Examples: topology.kubernetes.io/zone, kubernetes.io/hostname, node.kubernetes.io/instance-type
  • Well-known topology keys are automatically added by cloud providers
  • Custom topology domains can be created with node labels
  • Hierarchical domains (e.g., region -> zone -> node) can be used together
  • Domain selection impacts failure isolation and performance characteristics
  • Proper domain labeling is crucial for effective constraint implementation
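
As a quick check before writing constraints, list the topology labels your nodes actually carry. A minimal example, assuming the well-known zone, region, and hostname keys (your cluster may add custom keys on top of these):

# Show each node together with its zone, region, and hostname labels
kubectl get nodes -L topology.kubernetes.io/zone,topology.kubernetes.io/region,kubernetes.io/hostname

# List all labels on one node to discover custom topology keys
kubectl get node <node-name> --show-labels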

Spread Constraints

  • Rules that define how pods should be distributed
  • Specified in the pod specification under topologySpreadConstraints
  • Can include multiple constraints for different topology keys
  • Evaluated during the scheduling process to determine pod placement
  • Multiple constraints are combined with a logical AND: a pod is only placed on a node that satisfies every constraint
  • Can be applied to individual pods or pod templates in workload resources
  • Act as hard or soft requirements depending on configuration
  • Can include pod label selectors to count only specific pods
  • More specific than pod affinity/anti-affinity for distribution purposes
  • Each constraint includes:
    • maxSkew: The degree to which pods may be unevenly distributed
    • topologyKey: The node label used to identify the topology domain
    • whenUnsatisfiable: The behavior when the constraint cannot be satisfied
    • labelSelector: Which pods to count for determining the spread
    • matchLabelKeys (optional): Pod labels to include in the selection criteria
    • nodeAffinityPolicy (optional): How node affinity should be considered
    • nodeTaintsPolicy (optional): How node taints should be considered
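
Putting these fields together, a single fully annotated constraint might look like the sketch below (label values are illustrative; matchLabelKeys and the two policy fields require a recent Kubernetes release):

topologySpreadConstraints:
- maxSkew: 1                                 # at most 1 pod of difference between domains
  topologyKey: topology.kubernetes.io/zone   # spread across zones
  whenUnsatisfiable: DoNotSchedule           # hard requirement; otherwise the pod stays Pending
  labelSelector:
    matchLabels:
      app: example-app                       # which pods to count
  matchLabelKeys:
    - version                                # also match this label taken from the incoming pod
  nodeAffinityPolicy: Honor                  # only count nodes the pod's node affinity allows
  nodeTaintsPolicy: Ignore                   # count tainted nodes too (the default behavior)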

MaxSkew

  • Defines the maximum permitted difference between domains
  • Value of 1 means domains can differ by at most 1 pod
  • Higher values allow more imbalance between domains
  • If domains have 5, 5, and 4 matching pods, the skew is 1
  • If domains have 6, 3, and 2 matching pods, the skew is 4
  • Calculated using the formula: max(domain_count) - min(domain_count)
  • Represents the tolerance for imbalance in the system
  • Smaller values enforce stricter distribution
  • Should be tuned based on the total pod count and domain count
  • Important consideration: maxSkew of 1 combined with DoNotSchedule can block scheduling when the lagging domain has no remaining capacity, even if other domains do
  • For new deployments, consider starting with higher skew and reducing over time
  • Different workloads may require different skew values based on criticality
  • Only domains that contain eligible nodes are considered; an eligible domain with zero matching pods counts as 0 and pulls the global minimum down
  • The first pod of a workload can be placed in any eligible domain, since every domain starts at a count of zero

WhenUnsatisfiable

  • Determines behavior when constraints cannot be met
  • DoNotSchedule: Pod remains pending until constraints are satisfiable (default)
  • ScheduleAnyway: Best-effort distribution; pod will be scheduled even if constraints are violated
  • With DoNotSchedule, pods may remain pending indefinitely if constraints cannot be satisfied
  • With ScheduleAnyway, the scheduler will attempt to minimize the skew even when it cannot be fully satisfied
  • DoNotSchedule provides stronger guarantees but may impact availability
  • ScheduleAnyway prioritizes availability over perfect distribution
  • Can be used to implement hard and soft constraints in the same pod spec (see the sketch after this list)
  • Critical for balancing strict distribution requirements against scheduling success
  • Pending pods with DoNotSchedule will be reconsidered when cluster state changes
  • Pods with ScheduleAnyway still respect other scheduling constraints like resource requirements
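
For example, a pod spec can enforce zone balance strictly while only preferring node balance; a minimal sketch of that hard-and-soft combination (label values illustrative):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule    # hard: never violate zone balance
  labelSelector:
    matchLabels:
      app: example-app
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway   # soft: prefer node balance, but schedule regardless
  labelSelector:
    matchLabels:
      app: example-app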

These concepts work together to form a comprehensive system for controlling pod distribution. The flexibility of this system allows for simple configurations for basic use cases while supporting sophisticated distribution patterns for complex environments.

Topology Labels and Domains

The effectiveness of topology spread constraints depends heavily on proper node labeling. Kubernetes provides several standard topology labels:

  1. Cloud Provider Topology Labels
    • topology.kubernetes.io/zone: The availability zone where the node is running
    • topology.kubernetes.io/region: The geographic region where the node is running
    • Automatically added by cloud providers like AWS, GCP, and Azure
    • Example values: us-east-1a (zone), us-east-1 (region)
  2. Node Identity Labels
    • kubernetes.io/hostname: The node's hostname
    • Unique per node in the cluster
    • Useful for spreading pods across physical machines
  3. Custom Topology Labels
    • topology.kubernetes.io/rack: Physical rack location (must be manually labeled)
    • node.kubernetes.io/instance-type: The type of VM or hardware
    • Any custom label that represents a meaningful topology domain

Example of manually adding topology labels:

# Label nodes with rack information
kubectl label node worker-1 topology.kubernetes.io/rack=rack1
kubectl label node worker-2 topology.kubernetes.io/rack=rack1
kubectl label node worker-3 topology.kubernetes.io/rack=rack2
kubectl label node worker-4 topology.kubernetes.io/rack=rack2

# Label nodes with custom power domain information
kubectl label node worker-1 example.com/power-supply=ups1
kubectl label node worker-2 example.com/power-supply=ups1
kubectl label node worker-3 example.com/power-supply=ups2
kubectl label node worker-4 example.com/power-supply=ups2

When designing your topology domains, consider the failure modes you want to protect against:

  • Zone-level failures: Use topology.kubernetes.io/zone
  • Node-level failures: Use kubernetes.io/hostname
  • Rack-level failures: Use topology.kubernetes.io/rack
  • Network segment failures: Use custom network topology labels
  • Power distribution failures: Use custom power domain labels

The hierarchy of domains is also important—typically, you want to spread across larger domains first (regions), then medium-sized domains (zones), and finally smaller domains (nodes).

Basic Implementation

To implement topology spread constraints, you need to add the topologySpreadConstraints field to your pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: example-app
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: example-app
  containers:
  - name: nginx
    image: nginx:1.20

This configuration ensures that the difference between the number of pods with the label app=example-app across any two zones is at most 1. If this cannot be achieved, the pod will remain in the Pending state.

Let's break down how this works in practice:

Assume you have a cluster with three zones (zone-a, zone-b, and zone-c) and the following pod distribution:

  • zone-a: 2 pods with label app=example-app
  • zone-b: 1 pod with label app=example-app
  • zone-c: 1 pod with label app=example-app

The current skew is 1 (maximum 2 - minimum 1 = 1), which equals the maxSkew specified.

If you create a new pod with the above constraint:

  • It can be scheduled in zone-b or zone-c (bringing either to 2 pods)
  • It cannot be scheduled in zone-a (would increase skew to 2)

If all zones already had 2 pods each (skew = 0), the new pod could be scheduled in any zone, as it would result in a skew of 1, which is acceptable.

Advanced Configuration Patterns

Topology spread constraints can be configured in many ways to achieve different distribution patterns, from a single strict constraint to layered combinations of hard and soft rules across several topology keys.

These configuration patterns can be combined in various ways to address complex distribution requirements. The flexibility of the constraint system allows for tailored solutions that balance high availability, resource utilization, and performance considerations.
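
One pattern worth highlighting, assuming a cluster where the matchLabelKeys field is available (beta since Kubernetes 1.27): include the Deployment-managed pod-template-hash label in the constraint so that each revision is spread independently during rolling updates instead of being counted against pods from the previous revision.

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app
  matchLabelKeys:
    - pod-template-hash   # set automatically on Deployment pods; scopes skew to one revision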

Implementation with Deployments and StatefulSets

Topology spread constraints are more commonly used with Deployments, StatefulSets, and other workload controllers rather than with individual pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 9
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-server
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-server
      containers:
      - name: nginx
        image: nginx:1.20

This Deployment creates 9 replicas that will be evenly distributed across both zones and nodes. With 3 zones, you would have 3 pods per zone, and these 3 pods in each zone would be distributed across different nodes.

The scheduling process for this deployment would work as follows:

  1. Initial scheduling: The first pod can be scheduled anywhere since there's no skew yet.
  2. Zone balancing: Subsequent pods will be distributed to maintain balanced zones.
  3. Node balancing: Within each zone, pods will be distributed across nodes.
  4. Scaling up: When adding pods, they'll be placed to maintain the minimum skew.
  5. Scaling down: constraints are only enforced at scheduling time, so removing pods can leave the distribution uneven; a descheduler can rebalance it if needed.
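
To watch the zone balancing while scaling up, you can change the replica count and follow where new pods land; with 3 zones, 12 replicas should settle at 4 pods per zone (label and Deployment names are taken from the example above):

# Scale up and watch placement as pods are scheduled
kubectl scale deployment web-server --replicas=12
kubectl get pods -l app=web-server -o wide --watch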

StatefulSet Example

StatefulSets have unique considerations due to their ordered creation and stable identities:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: distributed-database
spec:
  serviceName: "database"
  replicas: 6
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: database
      containers:
      - name: postgres
        image: postgres:13
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

For StatefulSets, consider the following special considerations:

  1. Pod creation order: StatefulSets create pods in order (0, 1, 2, ...), which may temporarily violate constraints.
  2. PersistentVolumes: Volume availability in specific zones may impact distribution (see the storage check after this list).
  3. Pod deletion order: StatefulSets delete in reverse order, which may temporarily affect distribution.
  4. Pod Management Policy: Using Parallel pod management policy can help achieve faster balanced distribution.
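
For the PersistentVolume consideration above, topology-aware provisioning helps: a StorageClass with volumeBindingMode: WaitForFirstConsumer delays volume creation until the pod is scheduled, so each volume lands in whatever zone the spread constraints choose. A quick check (substitute the class used by your volumeClaimTemplates):

# Inspect the binding mode of the StorageClass
kubectl get storageclass <storage-class-name> -o jsonpath='{.volumeBindingMode}{"\n"}'

If this prints Immediate, volumes may be provisioned in a zone before scheduling happens, which can conflict with the zone constraint.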

Handling Different Cluster Topologies

Different cluster topologies require different constraint configurations:

  1. Multi-zone Clusters
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    

    This configuration works well for clusters spanning multiple availability zones within a single region. The constraint ensures that pods are distributed evenly across zones, minimizing the impact of zone-level failures.
    In a three-zone cluster with 9 replicas, this would result in 3 pods per zone. If a zone fails, two-thirds of the application capacity would remain available.
  2. Single-zone, Multi-rack Clusters
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/rack
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    

    For on-premises deployments or single-zone clusters with multiple racks, this configuration distributes pods across physical racks. This protects against rack-level failures such as power or network issues.
    With 4 racks and 12 replicas, this would place 3 pods per rack. A rack failure would result in a 25% capacity reduction.
  3. Multi-region, Multi-zone Clusters
    topologySpreadConstraints:
    - maxSkew: 2
      topologyKey: topology.kubernetes.io/region
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    

    This hierarchical approach first distributes pods across regions (with some flexibility) and then ensures even distribution across zones within each region. This configuration provides protection against both region and zone failures.
    In a cluster spanning 2 regions with 3 zones each (6 zones total) and 18 replicas, this would ideally place 9 pods per region and 3 pods per zone.
  4. Custom Topology Hierarchy
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    - maxSkew: 1
      topologyKey: custom.topology/network-segment
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
    

    This three-level hierarchy distributes pods across zones, then network segments within zones, and finally across nodes within network segments. The first two constraints use DoNotSchedule for strict enforcement, while the node-level constraint uses ScheduleAnyway for flexibility.
    This complex topology awareness provides resilience against multiple failure types simultaneously.

Advanced Use Cases

Topology spread constraints enable several advanced use cases:

Zone-aware StatefulSet Distribution

  • Distribute database instances across zones
  • Maintain even distribution of stateful workloads
  • Ensure data availability during zone failures
  • Balance read replicas across multiple zones
  • Minimize cross-zone data transfer costs
  • Provide consistent performance across zones
  • Enable zone-local reads with distributed writes
  • Support disaster recovery scenarios
  • Implement quorum-based consensus systems
  • Example StatefulSet configuration:
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: zone-aware-db
    spec:
      replicas: 6
      podManagementPolicy: Parallel
      selector:
        matchLabels:
          app: database
      template:
        metadata:
          labels:
            app: database
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: database
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - database
                topologyKey: kubernetes.io/hostname
          containers:
          - name: db
            image: postgres:13
            env:
            # Note: the Downward API exposes pod fields, not node labels, so the zone
            # cannot be injected directly; expose the node name and resolve the zone
            # from node labels at startup if needed
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
    

Hardware-aware Workload Distribution

  • Distribute GPU workloads across available GPU nodes
  • Balance CPU-intensive tasks across CPU types
  • Optimize for specialized hardware utilization
  • Avoid resource contention on accelerator cards
  • Balance network-intensive workloads across NICs
  • Ensure high-performance storage access
  • Distribute workloads based on CPU architecture
  • Optimize for NUMA topology
  • Balance memory-intensive workloads
  • Example for GPU workload distribution:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-workload
    spec:
      replicas: 12
      selector:
        matchLabels:
          app: ml-training
      template:
        metadata:
          labels:
            app: ml-training
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: nvidia.com/gpu.product
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: ml-training
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: ml-training
          containers:
          - name: training
            image: ml-training:latest
            resources:
              limits:
                nvidia.com/gpu: 1
    

Cost-optimized Multi-cloud Distribution

  • Balance workloads across cloud providers
  • Optimize for spot instance availability
  • Distribute load based on regional pricing
  • Balance between on-premises and cloud resources
  • Implement cloud bursting patterns
  • Optimize for data transfer costs
  • Use constraints alongside node taints for cost zones
  • Balance reserved instances usage
  • Implement follow-the-sun deployment patterns
  • Example multi-cloud constraint:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cost-optimized-app
    spec:
      replicas: 20
      selector:
        matchLabels:
          app: web-service
      template:
        metadata:
          labels:
            app: web-service
        spec:
          topologySpreadConstraints:
          - maxSkew: 5  # Allows some imbalance for cost optimization
            topologyKey: cloud.kubernetes.io/provider
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: web-service
          - maxSkew: 2
            topologyKey: node.kubernetes.io/instance-type
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: web-service
          - maxSkew: 1
            topologyKey: cloud.kubernetes.io/instance-lifecycle
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: web-service
          containers:
          - name: web
            image: web-service:latest
    

Application Component Co-location

  • Keep related components in the same zone for performance
  • Maintain component ratios across topology domains
  • Optimize for communication latency
  • Balance frontend and backend components
  • Distribute cache instances alongside application servers
  • Optimize data locality for analytics workloads
  • Implement sharded architectures with balanced distribution
  • Ensure messaging systems are properly distributed
  • Balance data producers and consumers
  • Example for component distribution:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: frontend
    spec:
      replicas: 9
      selector:
        matchLabels:
          app: frontend
      template:
        metadata:
          labels:
            app: frontend
            component: web
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                component: web
          containers:
          - name: frontend
            image: frontend:latest
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: backend
    spec:
      replicas: 9
      selector:
        matchLabels:
          app: backend
      template:
        metadata:
          labels:
            app: backend
            component: api
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                component: api
          affinity:
            podAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      component: web
                  topologyKey: topology.kubernetes.io/zone
          containers:
          - name: backend
            image: backend:latest
    

These advanced use cases demonstrate the versatility of topology spread constraints beyond simple high-availability scenarios. By combining constraints with other Kubernetes features like affinity rules, resource requirements, and taints/tolerations, you can implement sophisticated workload placement strategies.

Real-world Example: Highly Available Database Cluster

Here's a comprehensive example of a highly available database cluster using topology spread constraints:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ha-postgresql
spec:
  serviceName: "postgresql"
  replicas: 5  # Primary + 4 replicas
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      terminationGracePeriodSeconds: 60
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: postgresql
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/rack
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: postgresql
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: postgresql
            topologyKey: kubernetes.io/hostname
      initContainers:
      - name: init-postgresql
        image: postgres:13-alpine
        command: ['/bin/sh', '-c']
        args:
        - |
          # Determine if this is primary (index 0) or replica
          HOSTNAME=$(hostname)
          ORDINAL=${HOSTNAME##*-}
          if [ "$ORDINAL" = "0" ]; then
            echo "Initializing as primary"
            # Primary-specific initialization
          else
            echo "Initializing as replica"
            # Replica-specific initialization
          fi
      containers:
      - name: postgresql
        image: postgres:13
        ports:
        - containerPort: 5432
          name: postgresql
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgresql-secret
              key: password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: NODE_NAME   # the zone can be resolved from this node's labels at startup
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        - name: postgresql-config
          mountPath: /etc/postgresql
        resources:
          requests:
            cpu: 1
            memory: 2Gi
          limits:
            cpu: 2
            memory: 4Gi
        livenessProbe:
          exec:
            command:
            - pg_isready
            - -U
            - postgres
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - pg_isready
            - -U
            - postgres
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
      volumes:
      - name: postgresql-config
        configMap:
          name: postgresql-config
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "premium-storage"
      resources:
        requests:
          storage: 100Gi

This StatefulSet configuration ensures that:

  1. PostgreSQL instances are evenly distributed across zones (strictly enforced)
  2. Within zones, instances are distributed across racks (best effort)
  3. No two instances run on the same node (required anti-affinity)
  4. Each instance has stable storage and identity
  5. Proper initialization happens based on whether the instance is primary or replica
  6. Appropriate health checks are configured
  7. Resources are properly allocated

This example demonstrates how topology spread constraints can be combined with other Kubernetes features to create a comprehensive high-availability solution.

Monitoring and Troubleshooting

To effectively use topology spread constraints, you need to monitor and troubleshoot their behavior:

  1. Checking Pod Distribution
    # View pod placement by node (NAME and NODE columns from -o wide)
    kubectl get pods -l app=my-app -o wide --sort-by=".spec.nodeName" | \
      awk '{print $1, $7}' | \
      column -t
    
    # Count pods per zone
    kubectl get pods -l app=my-app -o jsonpath='{.items[*].spec.nodeName}' | \
      tr ' ' '\n' | \
      xargs -I{} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' | \
      sort | \
      uniq -c
    
  2. Diagnosing Pending Pods
    # Check pod events
    kubectl describe pod <pending-pod-name>
    
    # Look for scheduling events with messages about topology spread constraints
    # Example output:
    # Warning  FailedScheduling  8s (x22 over 5m)  default-scheduler  0/6 nodes are available: 
    # 1 node(s) didn't match Pod's node affinity/selector, 5 node(s) didn't match pod topology spread constraints.
    
  3. Visualizing Topology Distribution
    # Using kubectl-topology plugin (if installed)
    kubectl topology pods -l app=my-app --by-node
    
    # Custom script to visualize distribution
    # (the inspector pod needs a ServiceAccount allowed to list pods and nodes; the quoted
    #  'EOF' keeps your local shell from expanding the script before it is applied)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: topology-inspector
    spec:
      containers:
      - name: inspector
        image: bitnami/kubectl
        command: 
        - "bash"
        - "-c"
        - |
          echo "Topology Distribution Analysis"
          echo "=============================="
          echo "Zone distribution:"
          kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' | sort -k2 | awk '{print $2}' | uniq -c
          
          echo -e "\nPod distribution by zone:"
          for zone in $(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' | sort -u); do
            echo -e "\nZone: $zone"
            # List the nodes in this zone, then count pods running on any of them
            kubectl get nodes -l topology.kubernetes.io/zone=$zone -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' > /tmp/zone-nodes
            kubectl get pods -l app=my-app -o wide --no-headers | grep -c -w -F -f /tmp/zone-nodes
          done
          
          echo -e "\nDetailed pod placement:"
          kubectl get pods -l app=my-app -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' | sort -k2
        
          # Stay alive for debugging
          sleep 3600
    EOF
    
    # Retrieve the visualization results
    kubectl logs -f topology-inspector
    
  4. Analyzing Scheduler Decisions
    • Enable scheduler logging with higher verbosity
    • Look for "TopologySpreadConstraint" entries in scheduler logs
    • Check scheduler metrics related to topology spread constraints
    # Enable verbose scheduler logging (with kubeadm the scheduler runs as a static pod,
    # so edit its manifest on the control-plane node rather than a Deployment)
    sudo vi /etc/kubernetes/manifests/kube-scheduler.yaml
    
    # Add to spec.containers[0].command:
    # - --v=4
    
    # Check logs for topology spread constraint decisions
    kubectl -n kube-system logs -l component=kube-scheduler | grep -i topologyspread
    
    # Check metrics for scheduling attempts and failures (the scheduler exposes metrics
    # on its secure port 10259; the endpoint may require an authenticated request)
    kubectl -n kube-system port-forward pod/kube-scheduler-<control-plane-node> 10259:10259
    curl -k https://localhost:10259/metrics | grep -i schedule_attempts
    

Integration with Pod Affinity and Anti-Affinity

Topology spread constraints work alongside pod affinity and anti-affinity rules, creating powerful combinations for advanced placement control:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: combined-placement-controls
spec:
  replicas: 12
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-app
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web-app
              topologyKey: kubernetes.io/hostname
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cache
              topologyKey: kubernetes.io/hostname
      containers:
      - name: web-app
        image: web-app:latest

This configuration implements a sophisticated placement strategy:

  1. Primary High Availability: Strictly enforces even distribution across zones, ensuring the application remains available during zone-level failures
  2. Secondary Efficiency: Preferentially spreads pods across different nodes within each zone, minimizing the impact of node failures
  3. Performance Optimization: Tries to place web-app pods on the same nodes as cache pods when possible, improving data locality and reducing network latency
  4. Balanced Priorities: Uses required constraints for critical availability guarantees and preferred constraints for optimization

The different mechanisms complement each other:

  • Topology spread constraints handle the quantitative distribution across domains
  • Pod anti-affinity handles qualitative separation between specific pods
  • Pod affinity handles co-location for performance optimization

This combination addresses multiple concerns simultaneously:

  • High availability through proper distribution
  • Performance optimization through strategic co-location
  • Resource efficiency through balanced placement

Performance Considerations

Topology spread constraints add work to every scheduling cycle and can change scheduling outcomes in ways that require careful consideration. Each constraint is evaluated for every candidate node, so clusters with many constrained pods and many domains see higher scheduler latency; DoNotSchedule constraints can leave pods Pending when domains fill unevenly; and because constraints are only enforced at scheduling time, skew that develops later (after scale-downs, node failures, or rolling updates) is not corrected automatically unless a rebalancing tool such as the descheduler is used.

Advanced Configuration: MatchLabelKeys and NodeAffinityPolicy

Kubernetes 1.25 introduced additional configuration options for topology spread constraints (matchLabelKeys, nodeAffinityPolicy, and nodeTaintsPolicy, initially behind feature gates) that provide even more flexibility:

apiVersion: v1
kind: Pod
metadata:
  name: advanced-constraints-pod
  labels:
    app: web
    environment: production
    team: alpha
    version: v2
    criticality: high
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - web
    matchLabelKeys:
      - environment
      - team
    nodeAffinityPolicy: Honor
    nodeTaintsPolicy: Honor
  containers:
  - name: web
    image: web-app:2.3

These advanced features provide powerful capabilities:

  1. matchLabelKeys
    • Automatically includes pod labels with these keys in the label selector
    • Simplifies configuration by dynamically incorporating pod metadata
    • Enables consistent constraints across multiple deployments
    • In the example, pods will be counted for skew if they match:
      • app=web (from explicit matchExpressions) AND
      • environment=production (from current pod's label via matchLabelKeys) AND
      • team=alpha (from current pod's label via matchLabelKeys)
    • Changes to pod labels will automatically affect constraint matching
    • Useful for multi-dimensional topologies (app + environment + team)
    • Reduces duplication in constraint specifications
  2. nodeAffinityPolicy
    • Controls whether node affinity/anti-affinity is respected when calculating pod topology spread skew
    • Values: Honor (default) or Ignore
    • With Honor, only nodes that satisfy the pod's node affinity are considered
    • With Ignore, all nodes are considered regardless of node affinity
    • Impacts skew calculation by changing the set of eligible nodes
    • Example use case: When using node affinity for specialized hardware but wanting even zone distribution
  3. nodeTaintsPolicy
    • Controls whether node taints are respected when calculating pod topology spread skew
    • Values: Honor or Ignore (unlike nodeAffinityPolicy, the default when unset is Ignore)
    • With Honor, nodes with taints that the pod doesn't tolerate are not considered
    • With Ignore, all nodes are considered regardless of taints
    • Useful when certain nodes are reserved but should still factor into distribution calculations
    • Example use case: When using taints for special workloads but wanting to maintain zone balance

These options allow for more sophisticated constraint policies that:

  • Adapt dynamically to pod metadata
  • Interact intelligently with other scheduling mechanisms
  • Provide more accurate skew calculations in complex environments
  • Reduce configuration overhead for multi-dimensional constraints
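
To make the matchLabelKeys behavior concrete: for the advanced-constraints-pod example above, the scheduler effectively evaluates the constraint as if the selector had been expanded like this (a sketch only; the expansion happens inside the scheduler, not in your manifest):

labelSelector:
  matchExpressions:
  - key: app
    operator: In
    values:
    - web
  - key: environment        # value taken from the incoming pod via matchLabelKeys
    operator: In
    values:
    - production
  - key: team               # value taken from the incoming pod via matchLabelKeys
    operator: In
    values:
    - alpha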

Best Practices

To effectively use topology spread constraints, follow these best practices:

Start with Relaxed Constraints

  • Begin with higher maxSkew values and tighten them as needed
  • Use ScheduleAnyway during initial implementation
  • Tighten constraints gradually based on observed behavior
  • Monitor scheduling success rates and adjust accordingly
  • Consider workload characteristics when setting initial constraints
  • Test with realistic pod counts and cluster configurations
  • Example progressive implementation:
    # Initial implementation
    topologySpreadConstraints:
    - maxSkew: 3
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      
    # After validation
    topologySpreadConstraints:
    - maxSkew: 2
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      
    # Final configuration
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
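
  • To confirm that tightening a constraint is not causing scheduling failures, watch for Pending pods and FailedScheduling events after each change; a quick check (label value illustrative):
    # Pods of the app that are stuck Pending
    kubectl get pods -l app=my-app --field-selector=status.phase=Pending

    # Recent scheduling failures and their reasons
    kubectl get events --field-selector reason=FailedScheduling --sort-by=.lastTimestamp | tail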
    

Combine with Pod Disruption Budgets

  • Use PDBs to maintain distribution during maintenance
  • Ensure budget aligns with topology constraints
  • Prevent disruptions that would violate constraints
  • Set PDB parameters based on topology domain counts
  • Account for expected maintenance scenarios in PDB settings
  • Consider zone-level impacts in multi-zone deployments
  • Example PDB configuration:
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: zonal-pdb
    spec:
      minAvailable: "33%"  # Ensures at least one zone worth of pods remains
      selector:
        matchLabels:
          app: my-app
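
  • A quick way to confirm the budget is active (the ALLOWED DISRUPTIONS column should become non-zero once enough pods are Ready):
    kubectl get pdb zonal-pdb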
    

Consider Startup Order for StatefulSets

  • StatefulSets create pods in sequential order
  • This may temporarily violate topology constraints
  • Design for graceful startup with higher initial maxSkew
  • Consider custom controllers for complex stateful workloads
  • Use init containers to handle bootstrap coordination
  • Implement readiness gates for topology-aware initialization
  • Example StatefulSet with ordinal-based priority:
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: distributed-db
    spec:
      podManagementPolicy: Parallel  # Use parallel for better distribution during startup
      replicas: 9
      template:
        # ... other configuration
        spec:
          initContainers:
          - name: topology-bootstrap
            image: bitnami/kubectl
            command: ["bash", "-c"]
            args:
              - |
                # Get pod ordinal
                ORDINAL=$(echo $HOSTNAME | rev | cut -d'-' -f1 | rev)
                # Get zone
                ZONE=$(kubectl get node $NODE_NAME -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
                # Topology-aware initialization logic
                if [ "$ORDINAL" -lt "3" ]; then
                  # Primary zone initialization
                  echo "Initializing as primary in zone $ZONE"
                else
                  # Secondary zone initialization
                  echo "Initializing as replica in zone $ZONE"
                fi
            env:
              - name: NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
    

Label Nodes Appropriately

  • Ensure all nodes have required topology labels
  • Add custom topology domains as needed
  • Standardize label naming conventions
  • Validate node labels regularly
  • Document topology labeling scheme
  • Consider automation for label management
  • Implement label verification in CI/CD pipelines
  • Example node labeling:
    # Add rack label to nodes
    kubectl label nodes node1 node2 node3 topology.kubernetes.io/rack=rack1
    kubectl label nodes node4 node5 node6 topology.kubernetes.io/rack=rack2
    
    # Add custom topology domain
    kubectl label nodes node1 node4 custom.topology/power-supply=ups1
    kubectl label nodes node2 node5 custom.topology/power-supply=ups2
    kubectl label nodes node3 node6 custom.topology/power-supply=ups3
    
    # Add network segment information
    kubectl label nodes node1 node2 custom.topology/network-segment=segment1
    kubectl label nodes node3 node4 custom.topology/network-segment=segment2
    kubectl label nodes node5 node6 custom.topology/network-segment=segment3
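
  • To catch nodes that are still missing a required topology label (useful for the regular validation mentioned above), select on label absence; the rack key is taken from the example:
    # Nodes that do not yet carry the rack label
    kubectl get nodes -l '!topology.kubernetes.io/rack'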
    

Implement Hierarchical Constraints

  • Use multiple constraints with different topology keys
  • Order constraints from largest to smallest domains
  • Consider the impact of each level on the others
  • Test complex hierarchies thoroughly before production
  • Document the hierarchy and its purpose
  • Monitor distribution at each level
  • Example hierarchical implementation:
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/region
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: critical-app
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: critical-app
    - maxSkew: 2
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: critical-app
    

Conclusion

Kubernetes Topology Spread Constraints provide a powerful mechanism for controlling pod distribution across infrastructure topology domains. By leveraging this feature, you can:

  1. Improve application availability by distributing workloads across failure domains
  2. Optimize resource utilization by preventing concentration in specific regions or zones
  3. Enhance performance by controlling workload placement with respect to network topology
  4. Implement complex high-availability patterns for stateful and stateless workloads
  5. Achieve cost optimization by balancing workloads across different pricing domains
  6. Create sophisticated multi-dimensional distribution strategies
  7. Implement regulatory compliance requirements for geographic distribution
  8. Protect against various failure scenarios from hardware to regional outages

When combined with other Kubernetes scheduling features like affinity rules, taints, tolerations, and priority classes, topology spread constraints enable sophisticated workload placement strategies that can significantly improve the resilience and efficiency of your applications.

As Kubernetes environments grow in complexity and scale across multiple regions, cloud providers, and infrastructure types, topology spread constraints become an essential tool for managing workload distribution and ensuring consistent application performance and availability. They represent a key capability for organizations implementing global, highly available Kubernetes platforms for mission-critical applications.