The Horizontal Pod Autoscaler (HPA) is a powerful Kubernetes resource that automatically scales the number of pods in a Deployment, ReplicaSet, or StatefulSet based on observed metrics. While the basic HPA implementation scales on CPU and memory utilization, advanced configurations enable sophisticated scaling behaviors based on custom and external metrics, complex scaling algorithms, and integration with other Kubernetes components.
At its core, the HPA follows a control loop pattern, periodically adjusting the number of replicas to match the specified metric targets. The controller fetches metrics from three APIs: the resource metrics API (for CPU and memory), the custom metrics API (for other in-cluster metrics), and the external metrics API (for metrics originating outside the cluster). Based on these metrics, the controller calculates the desired number of replicas and adjusts the scale of the target accordingly.
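As a point of reference, a single autoscaling/v2 HPA can draw on all three APIs at once. The sketch below is illustrative only: the target Deployment, metric names, and thresholds are assumptions, not prescriptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # Resource metrics API: built-in CPU/memory utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Custom metrics API: an in-cluster, per-pod metric (name is illustrative)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  # External metrics API: a metric from outside the cluster (name is illustrative)
  - type: External
    external:
      metric:
        name: queue_messages_ready
      target:
        type: AverageValue
        averageValue: "30"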
Advanced HPA configurations are essential for applications with complex scaling requirements that go beyond simple CPU or memory utilization. This is particularly important for applications with:
Workload-specific metrics: Applications where CPU/memory don't directly correlate with user load
External dependencies: Systems that need to scale based on external service metrics
Business-driven scaling: Workloads that scale based on business metrics like queue length or request rates
Complex scaling behaviors: Applications requiring sophisticated scaling algorithms with stabilization windows
Understanding advanced HPA configurations enables architects and operators to implement precise, application-specific autoscaling strategies that optimize both performance and resource utilization.
Controlled Scale-Down
Implement gradual scale-down to prevent service disruption
Set percentage-based policies to limit the rate of termination
Use longer periods for critical services
Allow time for connection draining and graceful termination
Example controlled scale-down:
behavior:
  scaleDown:
    policies:
    - type: Percent
      value: 10 # Only scale down by 10% at a time
      periodSeconds: 120
    stabilizationWindowSeconds: 600
Rapid Scale-Up
Configure aggressive scaling for sudden traffic spikes
Use shorter periods for faster response to load
Combine with pod-based limits to prevent over-provisioning
Useful for event-driven and batch processing workloads
Example rapid scale-up:
behavior:
  scaleUp:
    policies:
    - type: Percent
      value: 200 # Add up to 200% of the current replicas per period
      periodSeconds: 30
    - type: Pods
      value: 15 # But never add more than 15 pods at once
      periodSeconds: 30
    selectPolicy: Min # Apply the more restrictive of the two policies
    stabilizationWindowSeconds: 0 # React immediately to load
Kubernetes Event-Driven Autoscaling (KEDA) extends HPA capabilities for event-driven and serverless workloads:
KEDA Architecture and Components
KEDA Operator extends Kubernetes with ScaledObject CRD
KEDA Metrics Adapter converts event sources to metrics
Supports 40+ event sources out of the box
Seamlessly integrates with existing HPA infrastructure
Example KEDA installation:
# Add KEDA Helm repository
helm repo add kedacore https://kedacore.github.io/charts
# Install KEDA in its own namespace
helm install keda kedacore/keda --namespace keda --create-namespace
Scaling on Message Queues
Automatically scale based on queue length
Support for RabbitMQ, Kafka, Azure Service Bus, AWS SQS, etc.
Configure queue-specific authentication and connection details
Set appropriate scaling thresholds for message processing
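As an illustration of queue-based scaling, a KEDA ScaledObject for a RabbitMQ-backed worker might look roughly like the following. The Deployment, queue, and authentication names are assumptions, and other brokers use different trigger metadata.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor          # Deployment to scale (illustrative name)
  minReplicaCount: 0               # KEDA can scale to zero between bursts
  maxReplicaCount: 30
  triggers:
  - type: rabbitmq
    metadata:
      queueName: orders
      mode: QueueLength            # Scale on queue depth
      value: "20"                  # Target roughly 20 messages per replica
    authenticationRef:
      name: rabbitmq-trigger-auth  # TriggerAuthentication holding the connection details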
Proper testing and tuning ensures HPA configurations perform optimally in production:
Load Testing for Autoscaling
Generate realistic traffic patterns to test scaling
Simulate both gradual and sudden traffic spikes
Measure scaling responsiveness and stability
Test both scaling up and scaling down behavior
Example load testing approach:
# Using k6 for HTTP load testing with ramp-up and plateau
cat <<EOF > load-test.js
import http from 'k6/http';
import { sleep } from 'k6';
export const options = {
  stages: [
    { duration: '5m', target: 100 },  // Ramp up to 100 users
    { duration: '10m', target: 100 }, // Stay at 100 users
    { duration: '5m', target: 500 },  // Ramp up to 500 users
    { duration: '20m', target: 500 }, // Stay at 500 users
    { duration: '5m', target: 0 },    // Ramp down to 0 users
  ],
};
export default function () {
  http.get('https://api.example.com/test-endpoint');
  sleep(1);
}
EOF
# Run the load test
k6 run load-test.js
HPA Metrics Analysis
Monitor HPA decision-making in real-time
Track scaling events and their triggers
Analyze metric fluctuations and their impact
Identify potential flapping or unnecessary scaling
Example HPA monitoring commands:
# Watch HPA status and decisions
kubectl get hpa -w
# Describe HPA for detailed view of current metrics and targets
kubectl describe hpa api-server-hpa
# Get HPA events
kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler
# Use kube-prometheus-stack to visualize HPA metrics
# Example Grafana query for HPA metrics
sum(kube_horizontalpodautoscaler_status_current_replicas{namespace="production"}) by (horizontalpodautoscaler)
# Count scaling events (changes in desired replicas) over the last 5 minutes
sum(changes(kube_horizontalpodautoscaler_status_desired_replicas{namespace="production"}[5m])) by (horizontalpodautoscaler)
Optimizing Scaling Parameters
Adjust scaling thresholds based on performance data
Fine-tune stabilization windows to balance responsiveness and stability
Optimize scaling policies based on application-specific needs
Implement gradual parameter adjustments with careful monitoring
Example tuning process:
1. Start with conservative settings (sketched as an HPA fragment after this list):
- CPU target: 70% utilization
- Stabilization window: 300 seconds
- Scale up policy: 100% with 60-second period
- Scale down policy: 10% with 60-second period
2. Conduct load tests and collect data:
- Monitor resource utilization vs. pod count
- Measure scaling response time
- Check for oscillations in replica count
3. Adjust parameters based on observations:
- If scaling is too slow: Reduce stabilization window, increase scale-up percentage
- If oscillating: Increase stabilization window, reduce scale percentages
- If resource utilization spikes: Lower target thresholds
4. Retest and iterate until optimal performance is achieved
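Expressed as HPA configuration, the conservative baseline from step 1 corresponds roughly to the fragment below; the values are the starting points listed above, not tuned recommendations.
# Relevant portions of the HPA spec for the conservative baseline in step 1
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70          # 70% CPU target
behavior:
  scaleUp:
    policies:
    - type: Percent
      value: 100                      # Up to 100% more replicas per period
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300   # 300-second stabilization window
    policies:
    - type: Percent
      value: 10                       # At most 10% fewer replicas per period
      periodSeconds: 60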
Simulating Production Scenarios
Test realistic traffic patterns from production data
Simulate failures and service degradation
Create chaos testing scenarios for scaling behavior
Understand the relationship between different metrics
Test multi-metric HPAs thoroughly before production
Monitor for scaling conflicts between metrics
Example of potentially conflicting metrics:
Problem: Scaling on both CPU utilization and request latency
When CPU utilization is high, request latency typically increases
Both metrics then recommend scaling at the same time; since the HPA acts on the largest recommendation across its metrics, a correlated second metric adds little signal while making behavior harder to reason about and tune
Solution: Choose one primary metric or use weighted composite metrics
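If a weighted composite is the right answer, one way to build it is a Prometheus recording rule that normalizes each signal against its own target and weights the results. Everything below (metric names, labels, weights, and targets) is an illustrative assumption rather than a recommendation:
# Hypothetical recording rule blending two normalized signals into one metric;
# a value of 1.0 means "at the desired operating point".
groups:
- name: composite-scaling-signal
  rules:
  - record: app:scaling_load:composite
    expr: |
      0.6 * (avg(rate(http_requests_total{job="api-server"}[2m])) / 100)
        +
      0.4 * (avg(rate(container_cpu_usage_seconds_total{pod=~"api-server-.*"}[2m])) / 0.7)
The composite can then be exposed to the HPA through the custom or external metrics API (for example via prometheus-adapter) and used as a single scaling signal.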
Handling Startup Delays
Account for application startup time in scaling decisions
Validate HPA configurations in staging environments
Simulate expected traffic patterns and spikes
Monitor and adjust based on real-world behavior
Gradually roll out changes to production
Example testing strategy:
1. Configure identical HPA in staging and production
2. Run load tests in staging that mirror production patterns
3. Monitor scaling behavior and resource utilization
4. Adjust thresholds and policies based on observations
5. Implement changes in production with careful monitoring
Documentation and Monitoring
Document scaling decisions and rationale
Set up alerts for unexpected scaling events (an example alert rule follows this list)
Monitor scaling patterns over time
Regularly review and adjust configurations
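For the alerting point, a starting point with kube-prometheus-stack (already referenced in the monitoring commands above) could be a PrometheusRule along these lines; the alert names and thresholds are assumptions to tune for your environment.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-scaling-alerts
  namespace: monitoring
spec:
  groups:
  - name: hpa.rules
    rules:
    - alert: HPAMaxedOut
      expr: |
        kube_horizontalpodautoscaler_status_current_replicas
          >= kube_horizontalpodautoscaler_spec_max_replicas
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "HPA {{ $labels.horizontalpodautoscaler }} has been at maxReplicas for 15 minutes"
    - alert: HPAFlapping
      expr: |
        changes(kube_horizontalpodautoscaler_status_desired_replicas[30m]) > 10
      labels:
        severity: warning
      annotations:
        summary: "HPA {{ $labels.horizontalpodautoscaler }} changed desired replicas more than 10 times in 30 minutes"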
Example monitoring dashboard metrics:
- Current/desired replica count over time
- Scaling events frequency and triggers
- Metric values vs. target thresholds
- Resource utilization per pod
- Application performance vs. replica count
The Kubernetes autoscaling ecosystem continues to evolve with several emerging trends:
Machine Learning-Based Autoscaling
Predictive scaling based on historical patterns
Anomaly detection for unusual traffic patterns
Reinforcement learning for optimal scaling decisions
Custom controllers using ML frameworks
Example pattern:
Metrics Collection → ML Model Training → Prediction Generation → Proactive Scaling
Cost-Aware Autoscaling
Balance performance requirements with cost constraints
Implement budget-based scaling decisions
Optimize for spot/preemptible instance usage
Integration with FinOps tooling
Example cost-aware scaling approach:
1. Define performance SLOs and cost constraints
2. Implement custom metrics for cost-per-request
3. Create scaling policies with cost thresholds
4. Optimize instance type selection based on workload
5. Scale down aggressively during low-traffic periods
Federated Autoscaling
Coordinate scaling across multiple clusters
Implement global load balancing with local scaling
Optimize workload placement across regions
Support for hybrid and multi-cloud environments
Example federated architecture:
Global Traffic Director → Regional Load Balancers → Cluster-Level HPAs → Pod Scaling
Advanced Horizontal Pod Autoscaler configurations enable sophisticated, application-specific scaling behaviors that optimize both performance and resource utilization. By leveraging custom metrics, external metrics, and advanced scaling behaviors, organizations can create autoscaling strategies tailored to their specific workload characteristics and business requirements.
As Kubernetes continues to evolve, the autoscaling ecosystem will likely become even more sophisticated, with increased integration between different scaling components, better predictive capabilities, and more fine-grained control over scaling decisions.