Autoscaling & Resource Management

Understanding Kubernetes autoscaling mechanisms and resource management strategies

Autoscaling in Kubernetes

Kubernetes provides several autoscaling mechanisms to adjust resources based on workload demands, ensuring optimal performance and cost efficiency. Autoscaling is essential for applications with variable workloads, as it helps maintain performance during traffic spikes while reducing costs during periods of low activity.

Kubernetes offers three complementary autoscaling mechanisms that work at different levels:

  • Horizontal Pod Autoscaler (HPA): Scales the number of pod replicas
  • Vertical Pod Autoscaler (VPA): Adjusts CPU and memory resources for containers
  • Cluster Autoscaler: Changes the number of nodes in the cluster

Each autoscaler operates independently but can be used together to create a comprehensive scaling strategy.

Horizontal Pod Autoscaler (HPA)

Basic Concept

  • Automatically scales pods: Increases or decreases the number of running pods in a deployment, statefulset, or replicaset
  • Based on CPU/memory usage: Monitors resource utilization across pods to make scaling decisions
  • Supports custom metrics: Can scale based on application-specific metrics from Prometheus or other sources
  • Maintains performance: Prevents degradation by adding pods before resources are exhausted
  • Manages traffic fluctuations: Adapts to changing demand patterns automatically
  • Built-in stabilization: Prevents thrashing by implementing cooldown periods between scaling operations

HPA Definition

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2        # Minimum number of pods to maintain
  maxReplicas: 10       # Maximum number of pods to scale to
  metrics:
  - type: Resource      # Using built-in resource metrics
    resource:
      name: cpu         # CPU utilization as the scaling metric
      target:
        type: Utilization      # Based on percentage utilization
        averageUtilization: 80 # Target 80% CPU utilization across pods

The HPA controller continuously:

  1. Fetches metrics from the Metrics API
  2. Calculates the desired replica count based on current vs. target utilization
  3. Updates the replicas field of the target resource
  4. Waits for the scaling action to take effect
  5. Repeats the process (default every 15 seconds)
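
The calculation in step 2 follows the documented formula: desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]. You can watch this loop act on the web-app HPA defined above with standard kubectl commands:

# Watch current vs. target utilization and the replica count
kubectl get hpa web-app --watch

# Inspect scaling events and status conditions in detail
kubectl describe hpa web-app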

Multiple Metrics

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 80
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 80

When multiple metrics are specified, the HPA calculates a desired replica count for each metric independently and scales to the largest of them, so the most demanding metric always wins.

Vertical Pod Autoscaler (VPA)

The Vertical Pod Autoscaler adjusts the CPU and memory resources allocated to containers within pods, rather than changing the number of pods. This is ideal for applications that cannot be horizontally scaled or that benefit from having properly sized resources.

VPA works by:

  1. Analyzing historical resource usage of containers
  2. Recommending optimal CPU and memory settings
  3. Applying these settings automatically (in Auto mode) or providing them for manual implementation

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: Auto  # Options: Auto, Recreate, Off, Initial
  resourcePolicy:
    containerPolicies:
    - containerName: '*'  # Apply to all containers in the pod
      minAllowed:         # Minimum resources to allocate
        cpu: 100m
        memory: 50Mi
      maxAllowed:         # Maximum resources to allocate
        cpu: 1
        memory: 500Mi
      controlledResources: ["cpu", "memory"]  # Resources to manage

VPA update modes explained:

  • Auto: Applies recommendations at pod creation and to running pods; updating a running pod currently requires evicting and restarting it
  • Recreate: Applies recommendations at pod creation and evicts running pods whenever the recommendations change significantly
  • Off: Only provides recommendations without applying them (visible in the VPA status)
  • Initial: Only applies recommendations at pod creation time
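
In any mode, the controller publishes its current recommendations in the VPA object's status, which you can inspect directly:

# Shows target, lower bound, and upper bound recommendations per container
kubectl describe vpa web-app-vpa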

Cluster Autoscaler

Cluster Autoscaler intelligently manages node scaling by:

  • Regularly checking for pods that can't be scheduled
  • Simulating scheduling to determine if adding a node would help
  • Respecting pod disruption budgets when removing nodes
  • Considering node groups and their properties
  • Safely draining nodes before termination
  • Respecting scale-down delays to prevent thrashing

# Example for GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        # Prevents the Cluster Autoscaler from evicting its own pod
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      serviceAccountName: cluster-autoscaler  # Account with node management permissions
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0  # Use a release matching your cluster's Kubernetes minor version
        command:
        - ./cluster-autoscaler
        - --cloud-provider=gce            # Cloud provider-specific
        - --nodes=2:10:default-pool       # Min:Max:NodePoolName
        - --scale-down-delay-after-add=10m  # Wait time after scaling up
        - --scale-down-unneeded-time=10m    # Time before considering a node for removal
        - --skip-nodes-with-local-storage=false  # Consider nodes with local storage
        - --balance-similar-node-groups    # Balance node groups with similar instances
        resources:
          requests:
            cpu: 100m
            memory: 300Mi

For different cloud providers, the configuration varies:

AWS:

--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>

Azure:

--node-resource-group=<resource-group>
--azure-vm-type=AKS

Resource Management

Resource management is a critical aspect of Kubernetes that affects scheduling, performance, and reliability. Properly configuring resources ensures that applications get what they need while maintaining cluster efficiency.

Resource Requests

  • Minimum guaranteed resources: The amount of resources the container is guaranteed to get
  • Used for scheduling decisions: Kubernetes scheduler only places pods on nodes with enough available resources
  • Affects pod placement: Higher requests may limit which nodes can accept the pod
  • Resource reservation: These resources are reserved for the container, even if unused
  • QoS classification: Contributes to the pod's Quality of Service class
  • Cluster capacity planning: Helps determine total cluster resource needs

Resource Limits

  • Maximum allowable resources: The upper bound a container can consume
  • Throttling enforcement: CPU is throttled if it exceeds the limit
  • OOM killer priority: Containers exceeding memory limits are terminated first
  • Performance boundary: Prevents a single container from consuming all resources
  • Noisy neighbor prevention: Protects co-located workloads from resource starvation
  • Predictable behavior: Creates consistent performance characteristics

Understanding the difference between requests and limits is crucial:

  • Requests: What the container is guaranteed to get
  • Limits: The maximum it can use before throttling/termination

When configuring resources, consider these principles:

  1. Set requests based on actual application needs
  2. Configure limits to prevent runaway containers
  3. Monitor actual usage to refine values over time
  4. Consider the application's behavior under resource constraints

QoS Classes

Kubernetes assigns every pod one of three Quality of Service (QoS) classes based on its resource configuration: Guaranteed (every container has requests equal to limits for both CPU and memory), Burstable (at least one container has a request or limit set, but the pod does not qualify as Guaranteed), and BestEffort (no requests or limits at all). QoS classes affect several aspects of container runtime:

Pod Eviction Order:

  • Under resource pressure, BestEffort pods are evicted first
  • Then Burstable pods with the highest usage above requests
  • Guaranteed pods are evicted only as a last resort

Resource Reclamation:

  • Guaranteed pods have the highest priority for resource retention
  • Burstable pods may face CPU throttling when the node is under pressure
  • BestEffort pods receive resources only when available

Scheduling Priority:

  • QoS class does not directly affect scheduling; placement is driven by the resource requests that also determine the class
  • Guaranteed pods have more predictable performance characteristics

Guaranteed QoS Example:

apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "200Mi"  # 200 mebibytes of memory
        cpu: "500m"      # 500 millicores = 0.5 CPU cores
      limits:
        memory: "200Mi"  # Must match request for Guaranteed QoS
        cpu: "500m"      # Must match request for Guaranteed QoS

Burstable QoS Example:

apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "100Mi"
        cpu: "200m"
      limits:
        memory: "200Mi"  # Higher than request
        cpu: "800m"      # Higher than request

BestEffort QoS Example:

apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: app
    image: nginx
    # No resources specified
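
Kubernetes records the assigned class in the pod's status, so you can verify it directly:

# Prints Guaranteed, Burstable, or BestEffort
kubectl get pod guaranteed-pod -o jsonpath='{.status.qosClass}'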

Resource Quotas

Resource Quotas allow cluster administrators to restrict resource consumption per namespace. This is essential for multi-tenant clusters where fair resource distribution is important.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: team-a
spec:
  hard:
    # Compute resource constraints
    requests.cpu: "10"      # Total CPU requests across all pods
    requests.memory: 20Gi   # Total memory requests across all pods
    limits.cpu: "20"        # Total CPU limits across all pods
    limits.memory: 40Gi     # Total memory limits across all pods
    
    # Object count constraints
    pods: "20"              # Maximum number of pods
    services: "10"          # Maximum number of services
    configmaps: "20"        # Maximum number of ConfigMaps
    persistentvolumeclaims: "5"  # Maximum number of PVCs
    
    # Storage constraints
    requests.storage: "500Gi"    # Total storage requests
    gold.storageclass.storage.k8s.io/requests.storage: "300Gi"  # Storage class specific

When a ResourceQuota is active in a namespace:

  1. Users must specify resource requests and limits for all containers
  2. The quota system tracks usage and prevents creation of resources that exceed quota
  3. Requests for resources that would exceed quota are rejected with a 403 FORBIDDEN error
  4. Quotas can be adjusted dynamically as business needs change
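
Current consumption against the quota can be checked at any time:

# Shows used vs. hard values for each constrained resource
kubectl describe resourcequota compute-quota -n team-a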

Limit Ranges

LimitRanges provide constraints on resource allocations per object in a namespace, allowing administrators to enforce minimum and maximum resource usage per pod or container. They also enable setting default values when not explicitly specified by developers.

Default Limits

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
  - default:          # Default limits if not specified
      memory: 512Mi
      cpu: 500m
    defaultRequest:   # Default requests if not specified
      memory: 256Mi
      cpu: 250m
    type: Container   # These apply to containers

    # Optional fields for comprehensive control:
    # max:            # Maximum limits allowed
    #   memory: 2Gi
    #   cpu: 2
    # min:            # Minimum requests required
    #   memory: 50Mi
    #   cpu: 50m
    # maxLimitRequestRatio:  # Maximum limit to request ratio
    #   memory: 4
    #   cpu: 4

LimitRanges can enforce:

  • Default resource requests and limits when not specified
  • Minimum and maximum resource constraints
  • Ratio between requests and limits to prevent extreme differences
  • Storage size constraints for PersistentVolumeClaims
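
The effective defaults and constraints for a namespace can be reviewed with:

# Shows min, max, default request, and default limit per resource type
kubectl describe limitrange default-limits -n team-a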

Container Constraints

apiVersion: v1
kind: LimitRange
metadata:
  name: constraint-limits
spec:
  limits:
  - min:
      cpu: 100m
      memory: 50Mi
    max:
      cpu: 2
      memory: 4Gi
    type: Container

Resource Metrics

Monitoring resource usage is essential for effective resource management and troubleshooting. Kubernetes provides built-in commands to view current resource consumption.

# View resource usage for pods
kubectl top pod -n default
# Sample output:
# NAME                       CPU(cores)   MEMORY(bytes)
# nginx-6799fc88d8-dnz7t     1m           9Mi
# redis-79486b66b7-lrm48     3m           11Mi

# View resource usage for specific pods
kubectl top pod -l app=nginx -n default

# View resource usage for nodes
kubectl top node
# Sample output:
# NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# node-1         368m         18%    1726Mi          45%
# node-2         132m         6%     1056Mi          27%

# Get the requests and limits configured for a pod's containers
kubectl describe pod <pod-name> | grep -A 3 -E "Limits|Requests"

These metrics are provided by the Metrics Server, which must be installed in the cluster. For more comprehensive monitoring, consider solutions like:

  • Prometheus + Grafana for detailed metrics collection and visualization
  • Kubernetes Dashboard for a web UI showing cluster resources
  • Vertical Pod Autoscaler in recommendation mode for resource optimization insights
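
If kubectl top fails with an error such as "Metrics API not available", the Metrics Server is probably not installed. It can be deployed from its published manifest (check that the release supports your cluster version):

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml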

Custom Metrics Autoscaling

Scaling based on CPU and memory might not always reflect your application's actual load. Custom metrics autoscaling allows you to scale based on application-specific metrics like queue depth, request latency, or any other business metric.

Custom Metrics Adapter

  • Extends metrics API: Implements the Kubernetes metrics API for custom data sources
  • Supports business metrics: Allows scaling on application-specific indicators
  • Enables scaling on non-resource metrics: Queue length, request rate, error rate, etc.
  • Integrates with monitoring systems: Works with Prometheus, Datadog, Google Stackdriver, etc.
  • Enables application-specific scaling: Tailors scaling behavior to your specific workload
  • Multiple metric support: Can combine various metrics for complex scaling decisions

Custom HPA Example

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-processor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-processor
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External          # External metric from monitoring system
    external:
      metric:
        name: queue_messages_ready   # Metric name in monitoring system
        selector:
          matchLabels:
            queue: "worker_tasks"    # Additional labels to filter metric
      target:
        type: AverageValue           # How to interpret the metric
        averageValue: 30             # Target 30 messages per pod
  - type: Pods              # Example of pod metric (from Prometheus)
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 1000
  - type: Object            # Example of object metric (e.g., ingress)
    object:
      metric:
        name: requests_per_second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-ingress
      target:
        type: Value
        value: 10000

Common custom metrics adapters:

  • Prometheus Adapter: Exposes Prometheus metrics to the Kubernetes API
  • Stackdriver Adapter: Exposes Google Cloud Monitoring metrics
  • Azure Adapter: Exposes Azure Monitor metrics
  • Datadog Adapter: Exposes Datadog metrics
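
As an illustration, a Prometheus Adapter rule that could back the http_requests_per_second pod metric used above might look like the following sketch; the series and label names are assumptions about your instrumentation, not fixed values:

rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'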

Best Practices

Autoscaling Strategies

Implementing effective autoscaling goes beyond simply enabling the autoscalers. Strategic approaches can significantly improve application performance and resource efficiency.

Predictive Scaling

  • Historical patterns: Analyzing past usage to predict future needs
  • Machine learning models: Using ML to forecast load patterns with greater accuracy
  • Pre-emptive scaling: Scaling up before expected traffic increases (e.g., 9am workday start)
  • Avoiding cold starts: Initializing resources before they're needed to reduce latency
  • Handling known events: Planning for sales, promotions, or scheduled batch processes
  • Seasonal adjustments: Accommodating daily, weekly, or seasonal traffic patterns
  • Feedback loops: Continuously improving predictions based on actual outcomes

Proactive vs Reactive

  • Scale before demand (proactive): Preventing performance degradation by anticipating needs
  • Scale after threshold breach (reactive): Responding to actual measured conditions
  • Hybrid approaches: Combining predictive baseline with reactive fine-tuning
  • Cost vs performance trade-offs: Balancing resource efficiency with user experience
  • Business requirements alignment: Matching scaling strategy to business priorities
  • Over-provisioning critical components: Maintaining excess capacity for mission-critical services
  • Graceful degradation planning: Defining how systems behave under extreme load

Advanced scaling implementations:

  1. Multi-dimensional scaling: Using combinations of metrics (CPU, memory, custom)
  2. Schedule-based scaling: Setting different min/max replicas based on time of day (see the CronJob sketch after this list)
  3. Cascading autoscaling: Coordinating scaling across multiple dependent components
  4. Controlled rollouts: Gradually increasing capacity to validate system behavior
  5. Circuit breaking: Automatically degrading non-critical features during peak loads
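
A common way to implement schedule-based scaling is a CronJob that patches the HPA before a known traffic peak. A minimal sketch, assuming a service account (here called hpa-patcher) with permission to patch HorizontalPodAutoscalers:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-before-peak
spec:
  schedule: "0 8 * * 1-5"    # 08:00 on weekdays, ahead of the 9am workday start
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-patcher   # hypothetical account with patch rights on HPAs
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - patch
            - hpa
            - web-app
            - -p
            - '{"spec": {"minReplicas": 5}}'

A matching CronJob can lower minReplicas again once the peak has passed.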

Advanced Configuration

# HPA with behavior configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 4
        periodSeconds: 30
      selectPolicy: Max

With this behavior block, scale-down uses the highest recommendation from the last 5 minutes and removes at most 10% of the current replicas per minute, while scale-up acts immediately and may either double the replica count or add 4 pods every 30 seconds, whichever is greater (selectPolicy: Max).

Troubleshooting Autoscaling
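
When an HPA reports <unknown> targets or never scales, the usual suspects are missing resource requests on the target pods (utilization cannot be computed without them) or an unhealthy Metrics Server. A few commands to start with:

# Inspect HPA conditions, current metrics, and recent scaling events
kubectl describe hpa web-app

# Confirm the Metrics API itself is responding
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes

# Check that the Metrics Server deployment is healthy
kubectl get deployment metrics-server -n kube-system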

Resource Optimization

Right-sizing

  • Analyze actual usage
  • Adjust requests and limits
  • Implement VPA recommendations
  • Periodic review
  • Workload profiling

Cost Management

  • Set namespace quotas
  • Monitor utilization
  • Optimize node selection
  • Use spot/preemptible instances
  • Implement scale-to-zero for idle workloads (HPA alone cannot scale below one replica; this typically requires an event-driven autoscaler such as KEDA)

Implementation Checklist

  1. Install the Metrics Server so resource metrics are available to the autoscalers
  2. Set resource requests and limits on every container
  3. Configure HPA with appropriate min/max replicas and target utilization
  4. Run VPA in Off mode first to gather recommendations before automating updates
  5. Enable Cluster Autoscaler with sensible bounds for each node group
  6. Apply ResourceQuotas and LimitRanges in shared namespaces
  7. Tune scaling behavior (stabilization windows, scaling policies) to prevent thrashing
  8. Monitor actual usage and revisit all of these values regularly

Remember that autoscaling is an iterative process. Begin with conservative settings, monitor behavior in production, and adjust as you learn more about your application's actual needs and performance characteristics.