Kubernetes Horizontal Pod Autoscaler Advanced Configurations

Advanced configuration techniques and patterns for Kubernetes Horizontal Pod Autoscaler

Introduction to Horizontal Pod Autoscaler

The Horizontal Pod Autoscaler (HPA) is a powerful Kubernetes resource that automatically scales the number of pods in a Deployment, ReplicaSet, or StatefulSet based on observed metrics. While the basic HPA implementation scales based on CPU and memory utilization, advanced configurations enable sophisticated scaling behaviors based on custom and external metrics, complex scaling algorithms, and integration with other Kubernetes components.

At its core, the HPA follows a control loop pattern, periodically adjusting the number of replicas to match the specified metric targets. The controller fetches metrics from three API groups: the resource metrics API (metrics.k8s.io, for CPU and memory), the custom metrics API (custom.metrics.k8s.io, for in-cluster application metrics), and the external metrics API (external.metrics.k8s.io, for metrics from systems outside the cluster). Based on these metrics, the controller calculates the desired number of replicas and adjusts the target's scale accordingly.
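
The calculation itself is a simple ratio, as documented for the HPA algorithm; when several metrics are configured, each is evaluated independently and the largest resulting replica count wins:

  desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)

  For example, 4 replicas averaging 200m CPU against a 100m target yield ceil(4 * 200/100) = 8 replicas.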

Advanced HPA configurations are essential for applications with complex scaling requirements that go beyond simple CPU or memory utilization. This is particularly important for applications with:

  1. Workload-specific metrics: Applications where CPU/memory don't directly correlate with user load
  2. External dependencies: Systems that need to scale based on external service metrics
  3. Business-driven scaling: Workloads that scale based on business metrics like queue length or request rates
  4. Complex scaling behaviors: Applications requiring sophisticated scaling algorithms with stabilization windows

Understanding advanced HPA configurations enables architects and operators to implement precise, application-specific autoscaling strategies that optimize both performance and resource utilization.

Scaling on Custom Metrics

Beyond the default CPU and memory metrics, HPAs can scale based on application-specific custom metrics collected within the Kubernetes cluster:

Implementing the Custom Metrics API

  • Deploy a metrics solution that implements the Kubernetes custom metrics API
  • Popular options include Prometheus Adapter, Datadog, and Azure Monitor
  • Metrics must be exposed following the Kubernetes metrics schema
  • The adapter translates between your metrics system and the Kubernetes API
  • Example installation of Prometheus Adapter:
    # Add the Prometheus community Helm repository
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    
    # Install Prometheus Adapter with custom configuration
    helm install prometheus-adapter prometheus-community/prometheus-adapter \
      --namespace monitoring \
      --values prometheus-adapter-values.yaml
    

Creating Application-Specific Metrics

  • Instrument your application to expose custom metrics
  • Common libraries: Prometheus client, Micrometer, StatsD
  • Metrics should be relevant to application performance and load
  • Examples include request count, queue depth, response time
  • Sample Go code for exposing a custom metric:
    package main
    
    import (
        "log"
        "net/http"
        
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )
    
    var (
        // Gauge tracking the number of requests currently being handled
        activeRequests = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "http_requests_active",
            Help: "The current number of active HTTP requests.",
        })
        
        // Counter of all requests, labelled so PromQL can aggregate by endpoint
        requestsTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "http_requests_total",
                Help: "The total number of HTTP requests.",
            },
            []string{"method", "endpoint", "status"},
        )
    )
    
    func init() {
        prometheus.MustRegister(activeRequests)
        prometheus.MustRegister(requestsTotal)
    }
    
    // handler updates both metrics around the real request handling
    func handler(w http.ResponseWriter, r *http.Request) {
        activeRequests.Inc()
        defer activeRequests.Dec()
        
        w.Write([]byte("ok"))
        requestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
    }
    
    func main() {
        http.HandleFunc("/", handler)
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }
    

Configuring Prometheus Adapter

  • Create a configuration file that defines how to convert Prometheus metrics to Kubernetes metrics
  • Specify metrics discovery rules and naming conventions
  • Map Prometheus queries to Kubernetes object resources
  • Define scaling metrics for specific deployments or applications
  • Example prometheus-adapter-values.yaml:
    rules:
      default: false
      custom:
        - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace:
                resource: namespace
              pod:
                resource: pod
          name:
            matches: "http_requests_per_second"
            as: "http_requests_per_second"
          metricsQuery: 'sum(rate(http_requests_total{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
        
        - seriesQuery: 'rabbitmq_queue_messages{namespace!="",service!=""}'
          resources:
            overrides:
              namespace:
                resource: namespace
              service:
                resource: service
          name:
            matches: "rabbitmq_queue_messages"
            as: "rabbitmq_queue_messages"
          metricsQuery: 'avg(rabbitmq_queue_messages{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    

HPA on Custom Metrics

  • Configure an HPA to use the custom metrics exposed by the adapter
  • Specify the target metric type as "Pods" or "Object"
  • Define an appropriate target value based on application characteristics
  • Example HPA using request rate metric:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: api-server-hpa
      namespace: production
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api-server
      minReplicas: 3
      maxReplicas: 20
      metrics:
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: 500
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
    

Scaling on External Metrics

External metrics enable scaling based on metrics from systems outside the Kubernetes cluster, such as cloud services or external monitoring systems:

# HPA based on SQS queue length
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: message-processor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: message-processor
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: sqs_messages_visible
        selector:
          matchLabels:
            queue-name: my-processing-queue
      target:
        type: AverageValue
        averageValue: 30

# HPA based on Prometheus metrics from external API
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 5
  maxReplicas: 100
  metrics:
  - type: External
    external:
      metric:
        name: nginx_connections_active
        selector:
          matchLabels:
            service: gateway
      target:
        type: Value
        value: 5000

Implementation of external metrics requires:

  1. An external metrics adapter that fetches metrics from external systems
  2. Configuration of the adapter to expose metrics in the Kubernetes API format
  3. Proper IAM or authentication configuration for accessing external metrics sources

Common external metric adapters include:

  • Prometheus Adapter with federated Prometheus instances
  • CloudWatch Adapter for AWS metrics
  • Azure Metrics Adapter for Azure Monitor metrics
  • Stackdriver Adapter for Google Cloud metrics
  • Custom adapters for proprietary monitoring systems
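
Whichever adapter you choose, you can confirm that its metrics are actually reaching the Kubernetes API by querying the aggregated metrics endpoints directly (the jq pipe is optional and only pretty-prints the response):

# List external metrics currently served by the adapter
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .

# List custom (in-cluster) metrics currently served by the adapter
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .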

Advanced Scaling Behaviors

The behavior field in HPA specifications enables fine-grained control over scaling decisions:
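
As a quick reference, here is a sketch of a spec.behavior block combining both directions (the numbers are illustrative, not recommendations):

# behavior fragment of an autoscaling/v2 HorizontalPodAutoscaler spec
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # react immediately to rising load
    policies:
    - type: Percent
      value: 100                     # at most double the replica count per minute
      periodSeconds: 60
    - type: Pods
      value: 4                       # or add at most 4 pods per minute
      periodSeconds: 60
    selectPolicy: Max                # apply whichever policy allows the larger change
  scaleDown:
    stabilizationWindowSeconds: 300  # require 5 minutes of lower load before shrinking
    policies:
    - type: Percent
      value: 10                      # remove at most 10% of replicas per minute
      periodSeconds: 60

The examples in the rest of this section reuse the same fields (stabilizationWindowSeconds, policies, selectPolicy) in context.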

Multi-Metric and Compound Scaling

Combining multiple metrics in a single HPA enables sophisticated scaling decisions based on various aspects of application performance:

Combining CPU and Custom Metrics

  • Configure HPA with both resource and custom metrics
  • Scale based on whichever metric requires more pods
  • Balance resource utilization with application-specific needs
  • Useful for applications with varying workload characteristics
  • Example multi-metric HPA:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-api-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-api
      minReplicas: 5
      maxReplicas: 50
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: 1000
    

Scaling on Object Metrics

  • Scale based on metrics from dependent services
  • Use the "Object" metric type to reference specific Kubernetes objects
  • Coordinate scaling between frontend and backend services
  • Prevent bottlenecks in service chains
  • Example HPA with object metrics:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: worker-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: worker
      minReplicas: 3
      maxReplicas: 30
      metrics:
      - type: Object
        object:
          metric:
            name: rabbitmq_queue_messages
          describedObject:
            apiVersion: v1
            kind: Service
            name: rabbitmq
          target:
            type: Value
            value: 100
    

Weighted Multi-Metric Scaling

  • Implement custom metrics that combine multiple factors
  • Create composite metrics in your monitoring system
  • Use PromQL or similar query languages to create weighted averages
  • Balance multiple aspects of application performance
  • Example composite metric in Prometheus Adapter:
    rules:
      custom:
        - seriesQuery: 'http_request_duration_seconds_count{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace:
                resource: namespace
              pod:
                resource: pod
          name:
            matches: "composite_load_factor"
            as: "composite_load_factor"
          metricsQuery: >
            (
              sum(rate(http_request_duration_seconds_sum{<<.LabelMatchers>>}[2m])) /
              sum(rate(http_request_duration_seconds_count{<<.LabelMatchers>>}[2m]))
            ) * 10 +
            (
              sum(rate(http_requests_total{<<.LabelMatchers>>}[2m])) / 100
            )
    
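
An HPA can then consume the composite metric like any other per-pod metric. A minimal sketch reusing the web-api Deployment from the earlier multi-metric example (the target value is illustrative and depends on how the composite is weighted):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-composite-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: composite_load_factor
      target:
        type: AverageValue
        averageValue: "10"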

Multiple HPAs for Complex Scaling Scenarios

  • Create separate HPAs for different scaling concerns
  • Use non-overlapping min/max replica ranges
  • Implement different scaling behaviors for different metrics
  • Create modular scaling configurations
  • Use with care: multiple HPAs targeting the same workload can make conflicting scaling decisions, so prefer a single multi-metric HPA unless the behaviors genuinely cannot be combined
  • Example deployment with multiple HPAs:
    # Base HPA for CPU-based scaling with wider stabilization
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: api-cpu-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api-server
      minReplicas: 5
      maxReplicas: 30
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 75
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600
    ---
    # Response time HPA with more aggressive scaling
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: api-response-time-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api-server
      minReplicas: 5
      maxReplicas: 50
      metrics:
      - type: Pods
        pods:
          metric:
            name: request_duration_seconds
          target:
            type: AverageValue
            averageValue: 0.1
      behavior:
        scaleUp:
          policies:
          - type: Percent
            value: 300
            periodSeconds: 60
          stabilizationWindowSeconds: 0
    

Integrating with KEDA for Event-Driven Autoscaling

Kubernetes Event-Driven Autoscaling (KEDA) extends HPA capabilities for event-driven and serverless workloads:

  1. KEDA Architecture and Components
    • KEDA Operator extends Kubernetes with ScaledObject CRD
    • KEDA Metrics Adapter converts event sources to metrics
    • Supports 40+ event sources out of the box
    • Seamlessly integrates with existing HPA infrastructure
    • Example KEDA installation:
      # Add KEDA Helm repository
      helm repo add kedacore https://kedacore.github.io/charts
      
      # Install KEDA in its own namespace
      helm install keda kedacore/keda --namespace keda --create-namespace
      
  2. Scaling on Message Queues
    • Automatically scale based on queue length
    • Support for RabbitMQ, Kafka, Azure Service Bus, AWS SQS, etc.
    • Configure queue-specific authentication and connection details
    • Set appropriate scaling thresholds for message processing
    • Example RabbitMQ scaling:
      apiVersion: keda.sh/v1alpha1
      kind: ScaledObject
      metadata:
        name: rabbitmq-scaler
        namespace: processing
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: order-processor
        minReplicaCount: 0   # Scale to zero when no messages
        maxReplicaCount: 30
        triggers:
        - type: rabbitmq
          metadata:
            protocol: amqp
            queueName: orders
            host: amqp://rabbitmq.messaging.svc:5672
            queueLength: '5'  # Target messages per pod
            vhostName: /
          authenticationRef:
            name: rabbitmq-trigger-auth
      ---
      apiVersion: keda.sh/v1alpha1
      kind: TriggerAuthentication
      metadata:
        name: rabbitmq-trigger-auth
        namespace: processing
      spec:
        secretTargetRef:
        - parameter: username
          name: rabbitmq-auth
          key: username
        - parameter: password
          name: rabbitmq-auth
          key: password
      
  3. Cron-Based Scaling
    • Schedule scaling operations based on time patterns
    • Prepare for predictable traffic patterns
    • Scale up before anticipated load and down afterward
    • Support for timezone-aware scheduling
    • Example cron-based scaling:
      apiVersion: keda.sh/v1alpha1
      kind: ScaledObject
      metadata:
        name: scheduled-scaler
        namespace: web
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: web-app
        minReplicaCount: 3
        maxReplicaCount: 20
        triggers:
        - type: cron
          metadata:
            timezone: "UTC"
            start: 30 8 * * 1-5  # 8:30 AM UTC weekdays
            end: 30 17 * * 1-5   # 5:30 PM UTC weekdays
            desiredReplicas: "15"
        - type: cron
          metadata:
            timezone: "UTC"
            start: 30 17 * * 1-5  # 5:30 PM UTC weekdays
            end: 30 8 * * 1-5     # 8:30 AM UTC weekdays
            desiredReplicas: "3"
      
  4. Advanced KEDA Patterns
    • Combine multiple triggers in a single ScaledObject
    • Implement scaling cooldown periods
    • Use scaling jobs (ScaledJob) for worker-based processing; see the sketch after this example
    • Create custom scalers for specialized metrics
    • Example multi-trigger scaling:
      apiVersion: keda.sh/v1alpha1
      kind: ScaledObject
      metadata:
        name: advanced-scaler
        namespace: processing
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: data-processor
        pollingInterval: 15
        cooldownPeriod: 300
        minReplicaCount: 1
        maxReplicaCount: 50
        advanced:
          restoreToOriginalReplicaCount: true
          horizontalPodAutoscalerConfig:
            behavior:
              scaleDown:
                stabilizationWindowSeconds: 300
                policies:
                - type: Percent
                  value: 10
                  periodSeconds: 60
        triggers:
        - type: kafka
          metadata:
            bootstrapServers: kafka.messaging:9092
            consumerGroup: processing-group
            topic: data-events
            lagThreshold: '10'
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring:9090
            metricName: processing_latency_seconds
            threshold: '0.5'
            query: avg(processing_latency_seconds{namespace="processing"})
      
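
For the scaling-jobs pattern mentioned above, KEDA provides a separate ScaledJob resource that launches a Kubernetes Job per batch of pending work instead of resizing a Deployment. A minimal sketch, with a hypothetical worker image and queue, reusing the rabbitmq-trigger-auth credentials from earlier:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: report-generator
  namespace: processing
spec:
  jobTargetRef:
    backoffLimit: 3
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: worker
          image: registry.example.com/report-worker:latest  # hypothetical image
  pollingInterval: 30            # check the queue every 30 seconds
  maxReplicaCount: 20            # cap on concurrently running Jobs
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  triggers:
  - type: rabbitmq
    metadata:
      protocol: amqp
      queueName: reports
      host: amqp://rabbitmq.messaging.svc:5672
      queueLength: '1'           # roughly one Job per pending message
    authenticationRef:
      name: rabbitmq-trigger-auth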

Performance Testing and Tuning

Proper testing and tuning ensures HPA configurations perform optimally in production:
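
A practical starting point is to replay representative load against the application (with any load generator, for example hey or k6) while watching how the controller responds. Using the api-server-hpa example from earlier:

# Watch observed metric values and replica counts as the load changes
kubectl get hpa api-server-hpa -n production -w

# Inspect scaling events and the controller's most recent decisions
kubectl describe hpa api-server-hpa -n production

# Check per-pod resource usage during the test (requires metrics-server)
kubectl top pods -n production

Tune target values, min/max replica bounds, and behavior policies based on the observed scaling latency and any oscillation between scale-up and scale-down.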

Specialized HPA Use Cases

Advanced HPA configurations can address specialized scaling requirements:

Machine Learning Inference Scaling

  • Scale based on GPU utilization and inference queue depth
  • Use custom metrics from ML frameworks
  • Implement predictive scaling for batch inference jobs
  • Balance performance and expensive GPU resource costs
  • Example ML inference scaling:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: inference-server-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: inference-server
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metric:
            name: gpu_utilization
          target:
            type: AverageValue
            averageValue: 75
      - type: Pods
        pods:
          metric:
            name: inference_queue_depth
          target:
            type: AverageValue
            averageValue: 5
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
        scaleUp:
          policies:
          - type: Pods
            value: 1
            periodSeconds: 30
    

Batch Processing Workloads

  • Implement job-aware scaling for batch processors
  • Scale based on job queue length and processing latency
  • Use KEDA with job-specific triggers
  • Implement scale-to-zero for efficient resource usage
  • Example batch processing scaling with KEDA:
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: batch-processor-scaler
      namespace: data-processing
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: batch-processor
      minReplicaCount: 0
      maxReplicaCount: 100
      pollingInterval: 10
      cooldownPeriod: 300
      triggers:
      - type: postgresql
        metadata:
          connectionFromEnv: POSTGRES_CONNECTION
          query: "SELECT COUNT(*) FROM jobs WHERE status = 'pending'"
          targetQueryValue: "5"
          activationTargetQueryValue: "1"
      - type: prometheus
        metadata:
          serverAddress: http://prometheus.monitoring:9090
          query: sum(rate(job_processing_duration_seconds_sum[5m])) / sum(rate(job_processing_duration_seconds_count[5m]))
          threshold: '30'
    

Game Server Scaling

  • Implement player-count based scaling for game servers
  • Scale based on connection metrics and server load
  • Use predictive scaling for known peak times
  • Implement gradual scaling to prevent player disruption
  • Example game server scaling:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: game-server-hpa
      namespace: gaming
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: game-server
      minReplicas: 5
      maxReplicas: 100
      metrics:
      - type: External
        external:
          metric:
            name: player_count_per_server
          target:
            type: AverageValue
            averageValue: 75
      - type: Pods
        pods:
          metric:
            name: server_tick_latency_ms
          target:
            type: AverageValue
            averageValue: 20
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600
          policies:
          - type: Percent
            value: 5
            periodSeconds: 60
        scaleUp:
          policies:
          - type: Percent
            value: 20
            periodSeconds: 60
          - type: Pods
            value: 10
            periodSeconds: 60
          selectPolicy: Max
    

Seasonal and Time-Based Scaling

  • Implement predictive scaling for known traffic patterns
  • Combine KEDA cron triggers with metrics-based HPA
  • Pre-scale for anticipated traffic spikes
  • Scale down during known quiet periods
  • Example combination of predictive and reactive scaling:
    # KEDA ScaledObject for time-based predictive scaling
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: predictive-scaler
      namespace: retail
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: retail-api
      minReplicaCount: 3
      maxReplicaCount: 50
      triggers:
      - type: cron
        metadata:
          timezone: "America/New_York"
          start: 0 8 * * *      # 8:00 AM
          end: 0 9 * * *        # 9:00 AM
          desiredReplicas: "20" # Pre-scale for morning rush
      - type: cron
        metadata:
          timezone: "America/New_York"
          start: 0 12 * * *     # 12:00 PM
          end: 0 13 * * *       # 1:00 PM
          desiredReplicas: "25" # Pre-scale for lunch time
      - type: cron
        metadata:
          timezone: "America/New_York"
          start: 0 17 * * *     # 5:00 PM
          end: 0 18 * * *       # 6:00 PM
          desiredReplicas: "15" # Pre-scale for evening
    ---
    # Standard HPA for reactive scaling based on actual load
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: reactive-scaler
      namespace: retail
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: retail-api
      minReplicas: 3
      maxReplicas: 50
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 65
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: 800
    

Best Practices and Pitfalls

Implementing effective HPAs requires attention to several best practices and awareness of common pitfalls:
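
One of the most common pitfalls: utilization-based resource metrics are calculated as a percentage of each container's requests, and containers without requests leave utilization undefined for that pod, so every container in a scaled workload should declare requests. A minimal sketch of the relevant pod-template fragment (values illustrative):

# Container resources in the scaled workload's pod template
resources:
  requests:
    cpu: 250m        # averageUtilization targets are computed against this value
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi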

Integration with Vertical Pod Autoscaler

Combining Horizontal Pod Autoscaler with Vertical Pod Autoscaler (VPA) creates a comprehensive scaling strategy:

  1. Complementary Autoscaling
    • HPA handles scaling out for increased load
    • VPA optimizes resource requests for individual pods
    • Achieve both horizontal and vertical efficiency
    • Properly configure to prevent conflicts
    • Example VPA configuration alongside HPA:
      apiVersion: autoscaling.k8s.io/v1
      kind: VerticalPodAutoscaler
      metadata:
        name: api-server-vpa
      spec:
        targetRef:
          apiVersion: "apps/v1"
          kind: Deployment
          name: api-server
        updatePolicy:
          updateMode: "Auto"
        resourcePolicy:
          containerPolicies:
          - containerName: '*'
            minAllowed:
              cpu: 50m
              memory: 100Mi
            maxAllowed:
              cpu: 1000m
              memory: 1Gi
            controlledResources: ["cpu", "memory"]
      
  2. Avoiding Conflicts
    • Configure non-overlapping responsibilities
    • Use VPA in "Off" or "Initial" mode with active HPA
    • Let HPA handle replica count, VPA handle resource requests
    • Monitor for oscillation between the two systems
    • Example configuration for coexistence:
      apiVersion: autoscaling.k8s.io/v1
      kind: VerticalPodAutoscaler
      metadata:
        name: api-server-vpa
      spec:
        targetRef:
          apiVersion: "apps/v1"
          kind: Deployment
          name: api-server
        updatePolicy:
          updateMode: "Initial"  # Only set resources when pods are created
        resourcePolicy:
          containerPolicies:
          - containerName: '*'
            minAllowed:
              cpu: 50m
              memory: 100Mi
            maxAllowed:
              cpu: 1000m
              memory: 1Gi
      

Emerging Trends in Kubernetes Autoscaling

The Kubernetes autoscaling ecosystem continues to evolve with several emerging trends:

  1. Machine Learning-Based Autoscaling
    • Predictive scaling based on historical patterns
    • Anomaly detection for unusual traffic patterns
    • Reinforcement learning for optimal scaling decisions
    • Custom controllers using ML frameworks
    • Example pattern:
      Metrics Collection → ML Model Training → Prediction Generation → Proactive Scaling
      
  2. Cost-Aware Autoscaling
    • Balance performance requirements with cost constraints
    • Implement budget-based scaling decisions
    • Optimize for spot/preemptible instance usage
    • Integration with FinOps tooling
    • Example cost-aware scaling approach:
      1. Define performance SLOs and cost constraints
      2. Implement custom metrics for cost-per-request
      3. Create scaling policies with cost thresholds
      4. Optimize instance type selection based on workload
      5. Scale down aggressively during low-traffic periods
      
  3. Federated Autoscaling
    • Coordinate scaling across multiple clusters
    • Implement global load balancing with local scaling
    • Optimize workload placement across regions
    • Support for hybrid and multi-cloud environments
    • Example federated architecture:
      Global Traffic Director → Regional Load Balancers → Cluster-Level HPAs → Pod Scaling
      

Advanced Horizontal Pod Autoscaler configurations enable sophisticated, application-specific scaling behaviors that optimize both performance and resource utilization. By leveraging custom metrics, external metrics, and advanced scaling behaviors, organizations can create autoscaling strategies tailored to their specific workload characteristics and business requirements.

As Kubernetes continues to evolve, the autoscaling ecosystem will likely become even more sophisticated, with increased integration between different scaling components, better predictive capabilities, and more fine-grained control over scaling decisions.