Kubernetes has emerged as a powerful platform for deploying, managing, and scaling artificial intelligence (AI) and machine learning (ML) workloads. The inherent orchestration capabilities of Kubernetes, combined with its extensibility and ecosystem, make it well-suited to address the unique challenges of AI/ML workflows:
- Resource management: Efficiently allocating GPU, TPU, and other specialized hardware resources
- Workload orchestration: Handling complex training jobs, batch inference, and serving pipelines
- Scalability: Dynamically scaling resources based on training and inference demands
- Reproducibility: Ensuring consistent environments for model training and deployment
- Workflow automation: Streamlining ML pipelines from data ingestion to model serving

This guide explores best practices, tools, and patterns for effectively running AI/ML workloads on Kubernetes, helping organizations build robust, scalable, and efficient machine learning infrastructure.
A Kubernetes cluster optimized for AI/ML typically includes several specialized components:
- Node pools with hardware accelerators: GPU/TPU-enabled worker nodes (a toleration sketch for such node pools follows the first GPU pod example below)
- Custom schedulers: For optimized placement of ML workloads
- Storage systems: Optimized for ML datasets and model artifacts
- Resource quotas: For fair sharing of compute resources
- Custom operators: For ML-specific resources and workflows

Kubernetes provides mechanisms to request and allocate hardware accelerators:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: ml-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2   # Request 2 NVIDIA GPUs
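GPU node pools are commonly tainted so that only accelerator workloads are scheduled onto them, which means the pod also needs a matching toleration. A minimal sketch, assuming the node pool is tainted with nvidia.com/gpu=present:NoSchedule and labeled accelerator=nvidia-gpu (both the taint and the label name are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-tolerating
spec:
  # Only schedule onto nodes labeled as GPU nodes (label name is an assumption)
  nodeSelector:
    accelerator: nvidia-gpu
  # Tolerate the taint applied to the GPU node pool (taint key/value are assumptions)
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
  containers:
  - name: ml-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1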
For Google Cloud TPUs:
apiVersion: v1
kind: Pod
metadata:
  name: tpu-pod
spec:
  containers:
  - name: ml-container
    image: tensorflow/tensorflow:latest-tpu
    resources:
      limits:
        cloud-tpus.google.com/v3: 8   # Request 8 TPU v3 cores
A typical TensorFlow training pod requests GPUs and mounts the dataset and model output from persistent volumes:

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-training
spec:
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:2.9.0-gpu
    resources:
      limits:
        nvidia.com/gpu: 4
    volumeMounts:
    - name: dataset
      mountPath: /data
    - name: model-output
      mountPath: /output
    command: ["python", "/app/train.py", "--epochs=100", "--batch-size=64"]
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: training-data-pvc
  - name: model-output
    persistentVolumeClaim:
      claimName: model-output-pvc
The equivalent PyTorch training pod; the NCCL environment variables control the communication backend used for multi-GPU training:

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training
spec:
  containers:
  - name: pytorch
    image: pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 4
    env:
    - name: NCCL_DEBUG
      value: "INFO"
    - name: NCCL_SOCKET_IFNAME
      value: "eth0"
    volumeMounts:
    - name: dataset
      mountPath: /data
    - name: model-output
      mountPath: /output
    command: ["python", "/app/train.py"]
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: training-data-pvc
  - name: model-output
    persistentVolumeClaim:
      claimName: model-output-pvc
A JAX/Flax training pod follows the same pattern:

apiVersion: v1
kind: Pod
metadata:
  name: jax-training
spec:
  containers:
  - name: jax
    image: gcr.io/jax-project/jax:latest
    resources:
      limits:
        nvidia.com/gpu: 4
    volumeMounts:
    - name: dataset
      mountPath: /data
    - name: model-output
      mountPath: /output
    command: ["python", "/app/train_flax.py"]
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: training-data-pvc
  - name: model-output
    persistentVolumeClaim:
      claimName: model-output-pvc
Kubernetes facilitates distributed training through its networking and orchestration capabilities:
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completionMode: Indexed   # Required for the job-completion-index annotation used below
  parallelism: 4            # Launch 4 pods in parallel
  completions: 4            # Require 4 successful completions
  template:
    spec:
      containers:
      - name: trainer
        image: custom-ml-framework:latest
        env:
        - name: WORLD_SIZE
          value: "4"
        - name: RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
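Distributed workers also need a stable address for rendezvous (for example, a master/rank-0 endpoint). A common pattern is a headless Service over the training pods; the sketch below relies on the job-name label that Kubernetes adds automatically to pods created by a Job, and the port is the conventional PyTorch/NCCL rendezvous port (adjust for your framework):

apiVersion: v1
kind: Service
metadata:
  name: distributed-training
spec:
  clusterIP: None   # Headless: gives each pod a stable DNS entry instead of a load-balanced VIP
  selector:
    job-name: distributed-training   # Label added automatically to pods created by the Job
  ports:
  - name: rendezvous
    port: 29500     # Conventional PyTorch/NCCL rendezvous port; framework-dependent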
Kubeflow is a comprehensive ML platform for Kubernetes with components for:
- Kubeflow Pipelines: Workflow orchestration for ML pipelines
- Katib: Hyperparameter tuning and neural architecture search
- KFServing/KServe: Model serving with autoscaling
- Training Operators: Distributed training for various frameworks (see the PyTorchJob sketch after the pipeline example below)
- Notebooks: Interactive development environments

Kubeflow Pipeline example:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: ml-pipeline
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pipeline-data-pvc   # Illustrative PVC backing the shared /data volume
  templates:
  - name: ml-pipeline
    dag:
      tasks:
      - name: data-preprocessing
        template: data-preprocessing
      - name: model-training
        template: model-training
        dependencies: [data-preprocessing]
      - name: model-evaluation
        template: model-evaluation
        dependencies: [model-training]
      - name: model-deployment
        template: model-deployment
        dependencies: [model-evaluation]
  - name: data-preprocessing
    container:
      image: data-preprocessing:latest
      command: ["python", "/app/preprocess.py"]
      volumeMounts:
      - name: data
        mountPath: /data
  - name: model-training
    container:
      image: model-training:latest
      command: ["python", "/app/train.py"]
      resources:
        limits:
          nvidia.com/gpu: 2
  - name: model-evaluation
    container:
      image: model-evaluation:latest
      command: ["python", "/app/evaluate.py"]
  - name: model-deployment
    container:
      image: model-deployment:latest
      command: ["python", "/app/deploy.py"]
MLflow provides experiment tracking, model registry, and deployment capabilities:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow
        image: mlflow/mlflow:latest
        ports:
        - containerPort: 5000
        env:
        - name: MLFLOW_S3_ENDPOINT_URL
          value: "http://minio-service:9000"
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: mlflow-s3-credentials
              key: accessKey
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: mlflow-s3-credentials
              key: secretKey
        - name: MLFLOW_TRACKING_URI
          value: "postgresql://mlflow:mlflow@postgres-service:5432/mlflow"
        command:
        - mlflow
        - server
        - --host=0.0.0.0
        - --backend-store-uri=postgresql://mlflow:mlflow@postgres-service:5432/mlflow
        - --default-artifact-root=s3://mlflow/artifacts
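To make the tracking server reachable from training pods and notebooks, it is typically fronted by a Service. A brief sketch matching the labels above (the service name is illustrative):

apiVersion: v1
kind: Service
metadata:
  name: mlflow-tracking
spec:
  selector:
    app: mlflow
  ports:
  - port: 5000        # Port clients use, e.g. MLFLOW_TRACKING_URI=http://mlflow-tracking:5000
    targetPort: 5000  # Container port of the mlflow server
  type: ClusterIP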
Kubernetes supports various model serving architectures:
Single-model serving:
- Simple deployment with one model per service
- Direct scaling based on single-model demand
- Simpler monitoring and management

Multi-model serving:
- Multiple models served from a single deployment
- Efficient resource utilization
- Typically uses specialized serving frameworks

Batch inference:
- Processes large volumes of data asynchronously
- Optimized for throughput over latency
- Often implemented using Kubernetes Jobs or CronJobs (see the CronJob sketch below)
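A minimal sketch of the batch-inference pattern: a CronJob that periodically runs a scoring script over new data. The image, script path, and PVC name are illustrative:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch-inference
spec:
  schedule: "0 2 * * *"   # Run every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: batch-scorer
            image: batch-inference:latest   # Illustrative image
            command: ["python", "/app/score.py", "--input=/data/new", "--output=/data/predictions"]
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: data
              mountPath: /data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: inference-data-pvc   # Illustrative PVC
          restartPolicy: OnFailure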
A single-model serving deployment using TensorFlow Serving, exposed through a ClusterIP Service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      containers:
      - name: tensorflow-serving
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: "resnet"
        volumeMounts:
        - name: model-volume
          mountPath: /models/resnet
        resources:
          limits:
            memory: "2Gi"
            cpu: "1"
            nvidia.com/gpu: 1
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-storage
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-serving
spec:
  selector:
    app: tensorflow-serving
  ports:
  - port: 8501
    targetPort: 8501
  type: ClusterIP
KServe provides Kubernetes custom resources for serverless inference:
apiVersion : "serving.kserve.io/v1beta1"
kind : "InferenceService"
metadata :
name : "bert-sentiment"
spec :
predictor :
tensorflow :
storageUri : "s3://models/bert-sentiment/model"
resources :
limits :
memory : "2Gi"
cpu : "1"
nvidia.com/gpu : 1
containerConcurrency : 10
transformer :
containers :
- image : "gcr.io/project/bert-preprocessor:v1"
name : bert-preprocessor
For model servers deployed as plain Deployments, a HorizontalPodAutoscaler can combine CPU utilization with a custom request-rate metric (the Pods metric assumes a custom-metrics adapter such as prometheus-adapter is installed):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: inference_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
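The inference_requests_per_second metric does not exist by default; it has to be exposed through the custom metrics API. A sketch of a prometheus-adapter rule that derives it from a hypothetical model_inference_requests_total counter (the metric and label names are assumptions):

# prometheus-adapter rule configuration, sketch only
rules:
- seriesQuery: 'model_inference_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "model_inference_requests_total"
    as: "inference_requests_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'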
Efficient data management is critical for ML workflows:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-dataset-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
  storageClassName: fast-nfs
For read-only datasets used across multiple pods:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: read-only-dataset
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: high-performance-ro
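Consumers then mount the claim with readOnly set, so many training pods can share the same dataset safely. A brief sketch (the pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: dataset-consumer
spec:
  containers:
  - name: trainer
    image: ml-training:latest
    volumeMounts:
    - name: dataset
      mountPath: /data
      readOnly: true        # Mount the shared dataset read-only inside the container
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: read-only-dataset
      readOnly: true        # Attach the claim in read-only mode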
Kubernetes Jobs can efficiently handle large-scale data preprocessing:
apiVersion: batch/v1
kind: Job
metadata:
  name: data-preprocessing
spec:
  completionMode: Indexed   # Required for the job-completion-index annotation used below
  parallelism: 20           # Process 20 chunks in parallel
  completions: 100          # 100 total chunks to process
  template:
    spec:
      containers:
      - name: data-processor
        image: data-processor:latest
        env:
        - name: CHUNK_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        volumeMounts:
        - name: raw-data
          mountPath: /raw
        - name: processed-data
          mountPath: /processed
      volumes:
      - name: raw-data
        persistentVolumeClaim:
          claimName: raw-dataset-pvc
      - name: processed-data
        persistentVolumeClaim:
          claimName: processed-dataset-pvc
      restartPolicy: OnFailure
NVIDIA time-slicing and MPS let multiple containers share a physical GPU. Sharing is configured in the NVIDIA device plugin rather than per pod, and Kubernetes only accepts whole-number requests for extended resources, so each container still requests an integer number of nvidia.com/gpu; with time-slicing enabled, each advertised GPU is a time-sliced replica of a physical device:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-shared-pod
spec:
  containers:
  - name: ml-container-1
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1   # One time-sliced GPU replica, not a whole physical GPU
  - name: ml-container-2
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1   # One time-sliced GPU replica
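The sharing itself is declared in the NVIDIA device plugin configuration. A sketch of a time-slicing config that advertises each physical GPU as four schedulable replicas (the exact format depends on the k8s-device-plugin version):

# Passed to the NVIDIA device plugin, e.g. via a ConfigMap referenced by its config file option
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # Each physical GPU is exposed as 4 time-sliced nvidia.com/gpu units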
Node affinity can pin training pods to specific GPU models, using node labels such as nvidia.com/gpu.product published by GPU feature discovery:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - Tesla-V100
            - A100
  containers:
  - name: ml-trainer
    image: ml-training:latest
    resources:
      limits:
        nvidia.com/gpu: 4
Namespace-scoped ResourceQuotas keep individual teams from monopolizing cluster compute, including GPUs:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-team-a
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 500Gi
    requests.nvidia.com/gpu: 20
    limits.nvidia.com/gpu: 20
Protecting sensitive models and training data:
Data and model storage:
- Use encrypted persistent volumes for model and data storage
- Implement at-rest encryption for sensitive datasets
- Apply role-based access control to storage resources

Serving endpoints:
- Implement TLS for model serving endpoints
- Use network policies to restrict access to inference services
- Apply Pod Security Standards to model serving containers

Access control:
- Implement fine-grained RBAC for ML resources (see the Role sketch after the network policy below)
- Use service accounts with minimal permissions
- Separate development, training, and production environments

A NetworkPolicy restricting which workloads can reach the model servers:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-model-server-policy
spec:
  podSelector:
    matchLabels:
      app: model-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          purpose: application
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 8501
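On the access-control side, a narrowly scoped Role bound to a training service account illustrates the least-privilege pattern (all names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-job-runner
  namespace: ml-team-a
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "list", "watch", "delete"]   # Manage training Jobs only
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]                       # Read-only access to pods and logs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-job-runner-binding
  namespace: ml-team-a
subjects:
- kind: ServiceAccount
  name: training-runner
  namespace: ml-team-a
roleRef:
  kind: Role
  name: training-job-runner
  apiGroup: rbac.authorization.k8s.io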
Key metrics to monitor for ML workloads:
- Training metrics: Loss, accuracy, learning rate
- Inference metrics: Latency, throughput, error rates
- Resource utilization: GPU memory, GPU utilization, RAM
- Model drift: Prediction distribution, feature distribution
- Batch processing: Job completion rates, processing time
With the Prometheus Operator, a ServiceMonitor scrapes the model servers' metrics endpoint and a PrometheusRule alerts on high inference latency:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: model-serving-monitor
spec:
  selector:
    matchLabels:
      app: model-server
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-serving-alerts
spec:
  groups:
  - name: ml.serving.alerts
    rules:
    - alert: HighModelLatency
      expr: histogram_quantile(0.95, sum(rate(model_inference_latency_seconds_bucket[5m])) by (le, model)) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High inference latency for model"
        description: "95th percentile latency for model {{ $labels.model }} is above the 100ms threshold"
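A ServiceMonitor selects Services (not pods) and references a named port, so the model-server Service must expose a port named metrics. A brief sketch, assuming the metrics endpoint listens on 9090 (an illustrative port):

apiVersion: v1
kind: Service
metadata:
  name: model-server
  labels:
    app: model-server       # Matched by the ServiceMonitor's selector
spec:
  selector:
    app: model-server
  ports:
  - name: http
    port: 8501
    targetPort: 8501
  - name: metrics           # Named port referenced by the ServiceMonitor endpoint
    port: 9090
    targetPort: 9090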
Architecture for large-scale distributed training:
apiVersion: v1
kind: ConfigMap
metadata:
  name: training-config
data:
  config.yaml: |
    training:
      learning_rate: 0.001
      batch_size: 256
      epochs: 100
    distributed:
      strategy: "data_parallel"
      nodes: 16
      gpus_per_node: 8
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
spec:
  parallelism: 16   # 16 nodes
  completions: 16
  backoffLimit: 4
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: job-name
                operator: In
                values:
                - distributed-training-job
            topologyKey: kubernetes.io/hostname
      containers:
      - name: distributed-trainer
        image: ml-training:latest
        resources:
          limits:
            nvidia.com/gpu: 8
            memory: "128Gi"
            cpu: "32"
        volumeMounts:
        - name: training-config
          mountPath: /etc/config
        - name: dataset
          mountPath: /data
        - name: model-output
          mountPath: /output
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: training-config
        configMap:
          name: training-config
      - name: dataset
        persistentVolumeClaim:
          claimName: training-dataset
      - name: model-output
        persistentVolumeClaim:
          claimName: model-checkpoint-store
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi
      restartPolicy: OnFailure
Knative Serving supports scale-to-zero inference services with revision-based canary rollouts; here 10% of traffic is shifted to a candidate revision:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ml-inference
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/maxScale: "20"
    spec:
      containers:
      - image: gcr.io/project/model-server:v2
        resources:
          limits:
            nvidia.com/gpu: 1
  traffic:
  - tag: current
    revisionName: ml-inference-v1
    percent: 90
  - tag: candidate
    revisionName: ml-inference-v2
    percent: 10
Emerging hardware support:
- Direct integration with FPGAs, ASICs, and custom ML accelerators
- Enhanced scheduling for heterogeneous compute resources
- Specialized operators for new accelerator types

Federated and multi-cluster learning:
- Coordinating model training across multiple Kubernetes clusters
- Privacy-preserving distributed training architectures
- Multi-cluster model aggregation and synchronization

MLOps automation:
- End-to-end MLOps pipelines with enhanced automation
- GitOps for ML models and infrastructure
- Continuous training and evaluation with automated deployment

Kubernetes has become the de facto platform for running AI and ML workloads at scale, providing the orchestration, resource management, and ecosystem integration needed for modern machine learning operations. By leveraging Kubernetes' capabilities for hardware acceleration, distributed training, and scalable serving, organizations can build flexible, efficient, and production-ready ML infrastructure.
As the field continues to evolve, Kubernetes is adapting to meet the specialized requirements of AI/ML workloads, with custom resource definitions, specialized operators, and enhanced integration with ML frameworks and tools. Whether you're building a research environment, developing production ML pipelines, or deploying large-scale inference services, Kubernetes provides a robust foundation for your AI/ML infrastructure.