Autoscaling & Resource Management
Understanding Kubernetes autoscaling mechanisms and resource management strategies
Autoscaling in Kubernetes
Kubernetes provides several autoscaling mechanisms to adjust resources based on workload demands, ensuring optimal performance and cost efficiency. Autoscaling is essential for applications with variable workloads, as it helps maintain performance during traffic spikes while reducing costs during periods of low activity.
Kubernetes offers three complementary autoscaling mechanisms that work at different levels:
- Horizontal Pod Autoscaler (HPA): Scales the number of pod replicas
- Vertical Pod Autoscaler (VPA): Adjusts CPU and memory resources for containers
- Cluster Autoscaler: Changes the number of nodes in the cluster
Each autoscaler operates independently but can be used together to create a comprehensive scaling strategy.
Horizontal Pod Autoscaler (HPA)
Basic Concept
- Automatically scales pods: Increases or decreases the number of running pods in a Deployment, StatefulSet, or ReplicaSet
- Based on CPU/memory usage: Monitors resource utilization across pods to make scaling decisions
- Supports custom metrics: Can scale based on application-specific metrics from Prometheus or other sources
- Maintains performance: Prevents degradation by adding pods before resources are exhausted
- Manages traffic fluctuations: Adapts to changing demand patterns automatically
- Built-in stabilization: Prevents thrashing by implementing cooldown periods between scaling operations
HPA Definition
The HPA controller continuously:
- Fetches metrics from the Metrics API
- Calculates the desired replica count based on current vs. target utilization
- Updates the replicas field of the target resource
- Waits for the scaling action to take effect
- Repeats the process (default every 15 seconds)
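The loop above is driven entirely by an HPA object. A minimal manifest might look like the following sketch (the Deployment name `web-app` and the 70% CPU target are placeholder values):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:          # the resource whose replicas field is updated
    apiVersion: apps/v1
    kind: Deployment
    name: web-app          # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale when average CPU exceeds 70% of requests
```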
Multiple Metrics
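An HPA can list several metrics; the controller computes a desired replica count for each and uses the highest. A sketch combining CPU and memory targets (names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: multi-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app          # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```

Using the highest of the per-metric results ensures no single target is breached, at the cost of scaling up more eagerly.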
Vertical Pod Autoscaler (VPA)
The Vertical Pod Autoscaler adjusts the CPU and memory resources allocated to containers within pods, rather than changing the number of pods. This is ideal for applications that cannot be horizontally scaled or that benefit from having properly sized resources.
VPA works by:
- Analyzing historical resource usage of containers
- Recommending optimal CPU and memory settings
- Applying these settings automatically (in Auto mode) or providing them for manual implementation
VPA update modes explained:
- Auto: Updates pod resources during their lifetime; currently this requires evicting and recreating the pod
- Recreate: Assigns resources at creation and evicts pods when recommendations change significantly
- Off: Only provides recommendations without applying them (visible in the VPA status)
- Initial: Only applies recommendations at pod creation time
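VPA is installed separately from core Kubernetes (it lives in the kubernetes/autoscaler project). A sketch of a VPA object targeting a hypothetical Deployment, with bounds to keep recommendations within sane limits:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app            # placeholder Deployment name
  updatePolicy:
    updateMode: "Auto"       # or "Off" to collect recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: "*"     # apply to all containers in the pod
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
```

Running with `updateMode: "Off"` first is a low-risk way to inspect recommendations before letting VPA restart pods.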
Cluster Autoscaler
Cluster Autoscaler automatically adjusts the size of the Kubernetes cluster based on:
- Pending pods that cannot be scheduled due to insufficient resources
- Nodes in the cluster that have been underutilized for an extended period
The autoscaler works across various cloud providers including AWS, GCP, Azure, and others, with provider-specific implementations.
Cluster Autoscaler intelligently manages node scaling by:
- Regularly checking for pods that can't be scheduled
- Simulating scheduling to determine if adding a node would help
- Respecting pod disruption budgets when removing nodes
- Considering node groups and their properties
- Safely draining nodes before termination
- Respecting scale-down delays to prevent thrashing
For different cloud providers, the configuration varies:
AWS:
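On AWS, the Cluster Autoscaler is usually deployed in-cluster and discovers Auto Scaling Groups via tags. A sketch of the relevant container command (the cluster name and ASG tags are placeholders):

```yaml
# Excerpt of the cluster-autoscaler container spec for AWS
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
- --balance-similar-node-groups        # treat equivalent ASGs as one pool
- --skip-nodes-with-system-pods=false  # allow draining nodes running kube-system pods
```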
Azure:
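On Azure, the self-managed autoscaler points at VM Scale Sets; on AKS the platform-managed autoscaler is typically enabled per node pool instead (for example via `az aks update --enable-cluster-autoscaler`). A sketch of the self-managed flags (the scale set name is a placeholder):

```yaml
# Excerpt of the cluster-autoscaler container spec for Azure
command:
- ./cluster-autoscaler
- --cloud-provider=azure
- --nodes=1:10:my-vmss-node-pool       # min:max:VMSS name (placeholder)
- --scale-down-delay-after-add=10m     # wait before considering scale-down
- --scale-down-unneeded-time=10m       # node must be underutilized this long
```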
Resource Management
Resource management is a critical aspect of Kubernetes that affects scheduling, performance, and reliability. Properly configuring resources ensures that applications get what they need while maintaining cluster efficiency.
Resource Requests
- Minimum guaranteed resources: The amount of resources the container is guaranteed to get
- Used for scheduling decisions: Kubernetes scheduler only places pods on nodes with enough available resources
- Affects pod placement: Higher requests may limit which nodes can accept the pod
- Resource reservation: These resources are reserved for the container, even if unused
- QoS classification: Contributes to the pod's Quality of Service class
- Cluster capacity planning: Helps determine total cluster resource needs
Resource Limits
- Maximum allowable resources: The upper bound a container can consume
- Throttling enforcement: CPU is throttled if it exceeds the limit
- OOM enforcement: A container that exceeds its memory limit is killed by the OOM killer and typically restarted
- Performance boundary: Prevents a single container from consuming all resources
- Noisy neighbor prevention: Protects co-located workloads from resource starvation
- Predictable behavior: Creates consistent performance characteristics
Understanding the difference between requests and limits is crucial:
- Requests: What the container is guaranteed to get
- Limits: The maximum it can use before throttling/termination
When configuring resources, consider these principles:
- Set requests based on actual application needs
- Configure limits to prevent runaway containers
- Monitor actual usage to refine values over time
- Consider the application's behavior under resource constraints
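These principles come together in the `resources` block of a container spec. A sketch with placeholder values chosen for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx:1.25        # placeholder image
    resources:
      requests:              # guaranteed; used by the scheduler
        cpu: 250m
        memory: 256Mi
      limits:                # hard ceiling
        cpu: 500m            # CPU throttled above this
        memory: 512Mi        # container OOM-killed above this
```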
QoS Classes
Kubernetes assigns one of three QoS classes to pods:
- Guaranteed: Requests = Limits for all containers (highest priority)
- Burstable: At least one container has a request or limit set, but the pod does not meet the Guaranteed criteria (medium priority)
- BestEffort: No Requests or Limits specified (lowest priority)
These classes directly impact how the kubelet handles resource pressure and which pods get evicted first when nodes run out of resources.
QoS classes affect several aspects of container runtime:
Pod Eviction Order:
- Under resource pressure, BestEffort pods are evicted first
- Then Burstable pods with the highest usage above requests
- Guaranteed pods are evicted only as a last resort
Resource Reclamation:
- Guaranteed pods have the highest priority for resource retention
- Burstable pods may face CPU throttling when the node is under pressure
- BestEffort pods receive resources only when available
Scheduling Priority:
- The scheduler does not read QoS classes directly, but the requests that determine them drive pod placement
- Guaranteed pods have more predictable performance characteristics
Burstable QoS Example:
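A pod is Burstable when it sets requests below its limits (or sets only some of them). A minimal sketch:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: nginx:1.25      # placeholder image
    resources:
      requests:            # requests < limits => Burstable
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
```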
BestEffort QoS Example:
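A pod with no requests or limits at all falls into BestEffort, making it the first candidate for eviction under pressure:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: app
    image: nginx:1.25      # placeholder image
    # no resources block => BestEffort QoS
```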
Resource Quotas
Resource Quotas allow cluster administrators to restrict resource consumption per namespace. This is essential for multi-tenant clusters where fair resource distribution is important.
When a ResourceQuota is active in a namespace:
- Users must specify requests or limits for any resource the quota tracks (e.g., CPU, memory)
- The quota system tracks usage and prevents creation of resources that exceed quota
- Requests for resources that would exceed quota are rejected with a 403 FORBIDDEN error
- Quotas can be adjusted dynamically as business needs change
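A sketch of a ResourceQuota capping aggregate compute resources and pod count in a hypothetical `team-a` namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a        # placeholder namespace
spec:
  hard:
    requests.cpu: "10"     # sum of all CPU requests in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"       # sum of all CPU limits
    limits.memory: 40Gi
    pods: "50"             # maximum number of pods
```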
Limit Ranges
LimitRanges provide constraints on resource allocations per object in a namespace, allowing administrators to enforce minimum and maximum resource usage per pod or container. They also enable setting default values when not explicitly specified by developers.
Default Limits
LimitRanges can enforce:
- Default resource requests and limits when not specified
- Minimum and maximum resource constraints
- Ratio between requests and limits to prevent extreme differences
- Storage size constraints for PersistentVolumeClaims
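A sketch of a LimitRange that fills in defaults for containers that omit a `resources` block (values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a        # placeholder namespace
spec:
  limits:
  - type: Container
    default:               # applied as limits when none are specified
      cpu: 500m
      memory: 512Mi
    defaultRequest:        # applied as requests when none are specified
      cpu: 100m
      memory: 128Mi
```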
Container Constraints
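Beyond defaults, a LimitRange can bound what any single container may request. A sketch with placeholder bounds:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-constraints
spec:
  limits:
  - type: Container
    min:                       # rejects containers requesting less
      cpu: 50m
      memory: 64Mi
    max:                       # rejects containers requesting more
      cpu: "2"
      memory: 2Gi
    maxLimitRequestRatio:
      cpu: "4"                 # a limit may be at most 4x its request
```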
Resource Metrics
Monitoring resource usage is essential for effective resource management and troubleshooting. Kubernetes provides built-in commands to view current resource consumption.
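The built-in commands are `kubectl top` variants (shown here with a placeholder namespace); they require the Metrics Server described below:

```shell
kubectl top nodes                    # per-node CPU and memory usage
kubectl top pods -n my-namespace     # per-pod usage in a namespace
kubectl top pods --containers        # break usage down by container
```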
These metrics are provided by the Metrics Server, which must be installed in the cluster. For more comprehensive monitoring, consider solutions like:
- Prometheus + Grafana for detailed metrics collection and visualization
- Kubernetes Dashboard for a web UI showing cluster resources
- Vertical Pod Autoscaler in recommendation mode for resource optimization insights
Custom Metrics Autoscaling
Scaling based on CPU and memory might not always reflect your application's actual load. Custom metrics autoscaling allows you to scale based on application-specific metrics like queue depth, request latency, or any other business metric.
Custom Metrics Adapter
- Extends metrics API: Implements the Kubernetes metrics API for custom data sources
- Supports business metrics: Allows scaling on application-specific indicators
- Enables scaling on non-resource metrics: Queue length, request rate, error rate, etc.
- Integrates with monitoring systems: Works with Prometheus, Datadog, Google Stackdriver, etc.
- Enables application-specific scaling: Tailors scaling behavior to your specific workload
- Multiple metric support: Can combine various metrics for complex scaling decisions
Custom HPA Example
Common custom metrics adapters:
- Prometheus Adapter: Exposes Prometheus metrics to the Kubernetes API
- Stackdriver Adapter: Exposes Google Cloud Monitoring metrics
- Azure Adapter: Exposes Azure Monitor metrics
- Datadog Adapter: Exposes Datadog metrics
Best Practices
- Set appropriate requests and limits
- Use HPA for handling variable load
- Implement VPA for optimizing resource allocation
- Configure Cluster Autoscaler for infrastructure scaling
- Monitor actual resource usage
- Set resource quotas for fair sharing
- Implement default limit ranges
- Choose appropriate QoS classes
Autoscaling Strategies
Implementing effective autoscaling goes beyond simply enabling the autoscalers. Strategic approaches can significantly improve application performance and resource efficiency.
Predictive Scaling
- Historical patterns: Analyzing past usage to predict future needs
- Machine learning models: Using ML to forecast load patterns with greater accuracy
- Pre-emptive scaling: Scaling up before expected traffic increases (e.g., 9am workday start)
- Avoiding cold starts: Initializing resources before they're needed to reduce latency
- Handling known events: Planning for sales, promotions, or scheduled batch processes
- Seasonal adjustments: Accommodating daily, weekly, or seasonal traffic patterns
- Feedback loops: Continuously improving predictions based on actual outcomes
Proactive vs Reactive
- Scale before demand (proactive): Preventing performance degradation by anticipating needs
- Scale after threshold breach (reactive): Responding to actual measured conditions
- Hybrid approaches: Combining predictive baseline with reactive fine-tuning
- Cost vs performance trade-offs: Balancing resource efficiency with user experience
- Business requirements alignment: Matching scaling strategy to business priorities
- Over-provisioning critical components: Maintaining excess capacity for mission-critical services
- Graceful degradation planning: Defining how systems behave under extreme load
Advanced scaling implementations:
- Multi-dimensional scaling: Using combinations of metrics (CPU, memory, custom)
- Schedule-based scaling: Setting different min/max replicas based on time of day
- Cascading autoscaling: Coordinating scaling across multiple dependent components
- Controlled rollouts: Gradually increasing capacity to validate system behavior
- Circuit breaking: Automatically degrading non-critical features during peak loads
Advanced Configuration
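The `autoscaling/v2` HPA API exposes a `behavior` section for tuning scaling speed and stabilization, which covers several of the strategies above (fast scale-up, damped scale-down). A sketch with illustrative values:

```yaml
# behavior section of an autoscaling/v2 HPA spec
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # react to spikes immediately
    policies:
    - type: Percent
      value: 100                     # at most double the replicas...
      periodSeconds: 60              # ...per minute
  scaleDown:
    stabilizationWindowSeconds: 300  # wait 5 minutes before shrinking
    policies:
    - type: Pods
      value: 2                       # remove at most 2 pods per minute
      periodSeconds: 60
```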
Troubleshooting Autoscaling
Common autoscaling issues:
- Metrics unavailability
- Resource constraints preventing scaling
- Inappropriate threshold settings
- Scaling limits too restrictive
- Stabilization window too long
- Insufficient node resources
Resource Optimization
Right-sizing
- Analyze actual usage
- Adjust requests and limits
- Implement VPA recommendations
- Periodic review
- Workload profiling
Cost Management
- Set namespace quotas
- Monitor utilization
- Optimize node selection
- Use spot/preemptible instances
- Implement scale-to-zero
Implementation Checklist
Before implementing autoscaling:
- Define scaling objectives
- Performance targets (response time, throughput)
- Cost constraints
- Reliability requirements
- Priority of different workloads
- Identify appropriate metrics
- Resource metrics (CPU, memory)
- Custom application metrics
- Business KPIs
- Leading indicators of load
- Set reasonable thresholds
- Based on application benchmarking
- Not too sensitive (to avoid thrashing)
- Not too conservative (to avoid delayed scaling)
- Different for scale-up vs scale-down
- Determine min/max replica counts
- Minimum for baseline availability
- Maximum based on budget and resource constraints
- Consider quota limits
- Plan for worst-case scenarios
- Configure scaling behavior
- Stabilization windows
- Step vs. continuous scaling
- Scale-up vs. scale-down rates
- Cooldown periods
- Implement proper monitoring
- Alerting on scaling events
- Dashboards for scaling metrics
- Historical trending
- Cost analysis
- Test scaling scenarios
- Load testing to verify scaling triggers
- Chaos testing to ensure resilience
- Edge cases (e.g., metric unavailability)
- Gradual vs. sudden load changes
- Document scaling policies
- Scaling objectives and strategies
- Expected behavior under different conditions
- Troubleshooting guidelines
- Regular review process
Remember that autoscaling is an iterative process. Begin with conservative settings, monitor behavior in production, and adjust as you learn more about your application's actual needs and performance characteristics.