Kubernetes for AI/ML Workloads

Deploying, managing, and scaling AI and machine learning workloads on Kubernetes

Introduction to AI/ML on Kubernetes

Kubernetes has emerged as a powerful platform for deploying, managing, and scaling artificial intelligence (AI) and machine learning (ML) workloads. The inherent orchestration capabilities of Kubernetes, combined with its extensibility and ecosystem, make it well-suited to address the unique challenges of AI/ML workflows:

  • Resource management: Efficiently allocating GPU, TPU, and specialized hardware resources
  • Workload orchestration: Handling complex training jobs, batch inference, and serving pipelines
  • Scalability: Dynamically scaling resources based on training and inference demands
  • Reproducibility: Ensuring consistent environments for model training and deployment
  • Workflow automation: Streamlining ML pipelines from data ingestion to model serving

This comprehensive guide explores best practices, tools, and patterns for effectively running AI/ML workloads on Kubernetes, helping organizations build robust, scalable, and efficient machine learning infrastructure.

Kubernetes Architecture for AI/ML

Core Components for AI/ML Workloads

A Kubernetes cluster optimized for AI/ML typically includes several specialized components:

  1. Node pools with hardware accelerators: GPU/TPU-enabled worker nodes (see the scheduling sketch after this list)
  2. Custom schedulers: For optimized placement of ML workloads
  3. Storage systems: Optimized for ML datasets and model artifacts
  4. Resource quotas: For fair sharing of compute resources
  5. Custom operators: For ML-specific resources and workflows
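
For example, nodes in a GPU pool are commonly labeled and tainted so that only accelerator workloads land on them, and pods opt in with a matching node selector and toleration. A minimal sketch, assuming a node label accelerator=nvidia-gpu and a taint nvidia.com/gpu=present:NoSchedule on the GPU nodes:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    accelerator: nvidia-gpu  # assumed label applied to the GPU node pool
  tolerations:
  - key: nvidia.com/gpu  # assumed taint key on GPU nodes
    operator: Exists
    effect: NoSchedule
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1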

GPU/TPU Resource Management

Kubernetes provides mechanisms to request and allocate hardware accelerators:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: ml-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2  # Request 2 NVIDIA GPUs

For Google Cloud TPUs:

apiVersion: v1
kind: Pod
metadata:
  name: tpu-pod
spec:
  containers:
  - name: ml-container
    image: tensorflow/tensorflow:latest-tpu
    resources:
      limits:
        cloud-tpus.google.com/v3: 8  # Request 8 TPU v3 cores

Machine Learning Frameworks on Kubernetes

Distributed Training

Kubernetes facilitates distributed training through its networking and orchestration capabilities:

apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completionMode: Indexed  # exposes a per-pod completion index for rank assignment
  parallelism: 4  # Launch 4 pods in parallel
  completions: 4  # Require 4 successful completions
  template:
    spec:
      containers:
      - name: trainer
        image: custom-ml-framework:latest
        env:
        - name: WORLD_SIZE
          value: "4"
        - name: RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure

MLOps Tools and Frameworks

Kubeflow

Kubeflow is a comprehensive ML platform for Kubernetes whose core components include:

  1. Kubeflow Pipelines: Workflow orchestration for ML pipelines
  2. Katib: Hyperparameter tuning and neural architecture search
  3. KFServing/KServe: Model serving with autoscaling
  4. Training Operators: Distributed training for various frameworks
  5. Notebooks: Interactive development environments

Kubeflow Pipelines executes pipelines as Argo Workflows under the hood; an equivalent pipeline expressed directly as an Argo Workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: ml-pipeline
  templates:
  - name: ml-pipeline
    dag:
      tasks:
      - name: data-preprocessing
        template: data-preprocessing
      - name: model-training
        template: model-training
        dependencies: [data-preprocessing]
      - name: model-evaluation
        template: model-evaluation
        dependencies: [model-training]
      - name: model-deployment
        template: model-deployment
        dependencies: [model-evaluation]
        
  - name: data-preprocessing
    container:
      image: data-preprocessing:latest
      command: ["python", "/app/preprocess.py"]
      volumeMounts:
      - name: data
        mountPath: /data
        
  - name: model-training
    container:
      image: model-training:latest
      command: ["python", "/app/train.py"]
      resources:
        limits:
          nvidia.com/gpu: 2
          
  - name: model-evaluation
    container:
      image: model-evaluation:latest
      command: ["python", "/app/evaluate.py"]
      
  - name: model-deployment
    container:
      image: model-deployment:latest
      command: ["python", "/app/deploy.py"]

MLflow on Kubernetes

MLflow provides experiment tracking, model registry, and deployment capabilities:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:latest
        ports:
        - containerPort: 5000
        env:
        - name: MLFLOW_S3_ENDPOINT_URL
          value: "http://minio-service:9000"
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: mlflow-s3-credentials
              key: accessKey
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: mlflow-s3-credentials
              key: secretKey
        - name: MLFLOW_TRACKING_URI
          value: "postgresql://mlflow:mlflow@postgres-service:5432/mlflow"
        command:
        - mlflow
        - server
        - --host=0.0.0.0
        - --backend-store-uri=postgresql://mlflow:mlflow@postgres-service:5432/mlflow
        - --default-artifact-root=s3://mlflow/artifacts
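
To make the tracking server reachable from training jobs and notebooks inside the cluster, a ClusterIP Service can be placed in front of the Deployment; the name and port below are a sketch matching the manifest above:

apiVersion: v1
kind: Service
metadata:
  name: mlflow-tracking
spec:
  selector:
    app: mlflow
  ports:
  - port: 5000
    targetPort: 5000
  type: ClusterIP

Clients inside the cluster can then point MLFLOW_TRACKING_URI at http://mlflow-tracking:5000.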

Model Serving on Kubernetes

Model Serving Architectures

Kubernetes supports various model serving architectures:

Single-model Serving

  • Simple deployment with one model per service
  • Direct scaling based on single model demand
  • Simpler monitoring and management

Multi-model Serving

  • Multiple models served from single deployment
  • Efficient resource utilization
  • Typically uses specialized serving frameworks (see the sketch after this list)
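
One widely used option is NVIDIA Triton Inference Server, which serves every model found in a shared model repository. A hedged sketch, assuming an NGC image tag and a PVC named triton-model-repo that holds the repository:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3  # assumed release tag
        args: ["tritonserver", "--model-repository=/models"]
        ports:
        - containerPort: 8000  # HTTP inference
        - containerPort: 8001  # gRPC inference
        - containerPort: 8002  # Prometheus metrics
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-repository
          mountPath: /models
      volumes:
      - name: model-repository
        persistentVolumeClaim:
          claimName: triton-model-repo  # assumed PVC containing all served models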

Batch Inference

  • Process large volumes of data asynchronously
  • Optimized for throughput over latency
  • Often implemented using Kubernetes Jobs or CronJobs (see the CronJob sketch after this list)
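
A nightly batch-scoring run, for instance, can be expressed as a CronJob; the schedule, image, and volume names below are assumptions for illustration:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch-inference
spec:
  schedule: "0 2 * * *"  # run every night at 02:00
  concurrencyPolicy: Forbid  # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: batch-scorer
            image: batch-inference:latest  # assumed scoring image
            command: ["python", "/app/score.py", "--input=/data/incoming", "--output=/data/scored"]
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: data
              mountPath: /data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: batch-data-pvc  # assumed shared data volume
          restartPolicy: OnFailure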

TensorFlow Serving Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      containers:
      - name: tensorflow-serving
        image: tensorflow/serving:latest-gpu  # GPU build, since the container requests a GPU
        ports:
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: "resnet"
        volumeMounts:
        - name: model-volume
          mountPath: /models/resnet
        resources:
          limits:
            memory: "2Gi"
            cpu: "1"
            nvidia.com/gpu: 1
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-storage
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-serving
spec:
  selector:
    app: tensorflow-serving
  ports:
  - port: 8501
    targetPort: 8501
  type: ClusterIP

KServe (formerly KFServing)

KServe provides Kubernetes custom resources for serverless inference:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-sentiment"
spec:
  predictor:
    containerConcurrency: 10
    tensorflow:
      storageUri: "s3://models/bert-sentiment/model"
      resources:
        limits:
          memory: "2Gi"
          cpu: "1"
          nvidia.com/gpu: 1
  transformer:
    containers:
    - image: "gcr.io/project/bert-preprocessor:v1"
      name: bert-preprocessor

Horizontal Pod Autoscaling for ML Models

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: inference_requests_per_second
      target:
        type: AverageValue
        averageValue: 100

Data Management for ML on Kubernetes

Dataset Management

Efficient data management is critical for ML workflows:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-dataset-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
  storageClassName: fast-nfs

For read-only datasets used across multiple pods:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: read-only-dataset
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: high-performance-ro

Data Preprocessing at Scale

Kubernetes Jobs can efficiently handle large-scale data preprocessing:

apiVersion: batch/v1
kind: Job
metadata:
  name: data-preprocessing
spec:
  completionMode: Indexed  # required so each pod receives its chunk index
  parallelism: 20  # Process 20 chunks in parallel
  completions: 100  # 100 total chunks to process
  template:
    spec:
      containers:
      - name: data-processor
        image: data-processor:latest
        env:
        - name: CHUNK_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        volumeMounts:
        - name: raw-data
          mountPath: /raw
        - name: processed-data
          mountPath: /processed
      volumes:
      - name: raw-data
        persistentVolumeClaim:
          claimName: raw-dataset-pvc
      - name: processed-data
        persistentVolumeClaim:
          claimName: processed-dataset-pvc
      restartPolicy: OnFailure

Resource Optimization for ML Workloads

GPU Sharing and Partitioning

NVIDIA MPS and time-slicing let multiple containers share a physical GPU. Note that Kubernetes extended resources can only be requested in whole units, so fractional requests such as nvidia.com/gpu: 0.5 are rejected by the API server. Instead, the NVIDIA device plugin is configured to advertise several shared replicas per physical GPU, and each container requests one of them:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-shared-pod
spec:
  containers:
  - name: ml-container-1
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1  # One shared replica (a slice of a physical GPU)
  - name: ml-container-2
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # One shared replica (a slice of a physical GPU)
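
Time-slicing itself is enabled in the NVIDIA device plugin's configuration rather than in the pod spec. A minimal sketch of such a config, with the ConfigMap name and replica count chosen for illustration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2  # each physical GPU is advertised as 2 schedulable units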

Node Affinity for Specialized Hardware

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - Tesla-V100
            - A100
  containers:
  - name: ml-trainer
    image: ml-training:latest
    resources:
      limits:
        nvidia.com/gpu: 4

Resource Quotas for ML Teams

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-team-a
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 500Gi
    requests.nvidia.com/gpu: 20
    limits.nvidia.com/gpu: 20

Security Considerations for ML on Kubernetes

Model and Data Security

Protecting sensitive models and training data:

Encrypted Storage

  • Use encrypted PVs for model and data storage (see the StorageClass sketch after this list)
  • Implement at-rest encryption for sensitive datasets
  • Apply role-based access control for storage resources
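
On most clouds, at-rest encryption can be requested directly from the CSI driver through a StorageClass. A sketch for the AWS EBS CSI driver (class name and parameters are assumptions; other drivers expose similar options):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-ml-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"  # encrypt volumes at rest
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer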

Secure Model Serving

  • Implement TLS for model serving endpoints
  • Use network policies to restrict access to inference services
  • Apply Pod Security Standards (via the built-in Pod Security admission controller) to model serving namespaces

Access Controls

  • Implement fine-grained RBAC for ML resources (see the sketch after this list)
  • Use service accounts with minimal permissions
  • Separate development, training, and production environments
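
A minimal sketch of namespace-scoped RBAC for a training team; names and the resource list are assumptions, and real setups would extend the rules to ML-specific CRDs (for example Kubeflow training jobs):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-trainer
  namespace: ml-team-a
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "list", "watch", "delete"]
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-trainer-binding
  namespace: ml-team-a
subjects:
- kind: ServiceAccount
  name: training-sa  # assumed service account used by training jobs
  namespace: ml-team-a
roleRef:
  kind: Role
  name: ml-trainer
  apiGroup: rbac.authorization.k8s.io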

Network Policies for ML Services

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-model-server-policy
spec:
  podSelector:
    matchLabels:
      app: model-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          purpose: application
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 8501

Monitoring and Observability

ML-specific Metrics

Key metrics to monitor for ML workloads:

  1. Training metrics: Loss, accuracy, learning rate
  2. Inference metrics: Latency, throughput, error rates
  3. Resource utilization: GPU memory, GPU utilization, RAM
  4. Model drift: Prediction distribution, feature distribution
  5. Batch processing: Job completion rates, processing time

Prometheus and Grafana Integration

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: model-serving-monitor
spec:
  selector:
    matchLabels:
      app: model-server
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-serving-alerts
spec:
  groups:
  - name: ml.serving.alerts
    rules:
    - alert: HighModelLatency
      expr: histogram_quantile(0.95, sum(rate(model_inference_latency_seconds_bucket[5m])) by (le, model)) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High inference latency for model"
        description: "95th percentile latency for model {{ $labels.model }} is above 100ms threshold"

Case Studies: Real-world ML on Kubernetes

Large-scale Training Infrastructure

Architecture for large-scale distributed training:

apiVersion: v1
kind: ConfigMap
metadata:
  name: training-config
data:
  config.yaml: |
    training:
      learning_rate: 0.001
      batch_size: 256
      epochs: 100
      distributed:
        strategy: "data_parallel"
        nodes: 16
        gpus_per_node: 8
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
spec:
  parallelism: 16  # 16 nodes
  completions: 16
  backoffLimit: 4
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: job-name
                  operator: In
                  values:
                  - distributed-training-job
              topologyKey: kubernetes.io/hostname
      containers:
      - name: distributed-trainer
        image: ml-training:latest
        resources:
          limits:
            nvidia.com/gpu: 8
            memory: "128Gi"
            cpu: "32"
        volumeMounts:
        - name: training-config
          mountPath: /etc/config
        - name: dataset
          mountPath: /data
        - name: model-output
          mountPath: /output
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: training-config
        configMap:
          name: training-config
      - name: dataset
        persistentVolumeClaim:
          claimName: training-dataset
      - name: model-output
        persistentVolumeClaim:
          claimName: model-checkpoint-store
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi
      restartPolicy: OnFailure

Online Serving with Canary Deployments

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ml-inference
spec:
  template:
    metadata:
      name: ml-inference-v2  # names this revision so the candidate traffic target below resolves
      annotations:
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/maxScale: "20"
    spec:
      containers:
      - image: gcr.io/project/model-server:v2
        resources:
          limits:
            nvidia.com/gpu: 1
  traffic:
  - tag: current
    revisionName: ml-inference-v1
    percent: 90
  - tag: candidate
    revisionName: ml-inference-v2
    percent: 10

Emerging Technologies and Approaches

Hardware Acceleration Beyond GPUs

  • Direct integration with FPGAs, ASICs, and custom ML accelerators
  • Enhanced scheduling for heterogeneous compute resources
  • Specialized operators for new accelerator types

Federated Learning

  • Coordinating model training across multiple Kubernetes clusters
  • Privacy-preserving distributed training architectures
  • Multi-cluster model aggregation and synchronization

AI Workflow Automation

  • End-to-end MLOps pipelines with enhanced automation
  • GitOps for ML models and infrastructure
  • Continuous training and evaluation with automated deployment

Conclusion

Kubernetes has become the de facto platform for running AI and ML workloads at scale, providing the orchestration, resource management, and ecosystem integration needed for modern machine learning operations. By leveraging Kubernetes' capabilities for hardware acceleration, distributed training, and scalable serving, organizations can build flexible, efficient, and production-ready ML infrastructure.

As the field continues to evolve, Kubernetes is adapting to meet the specialized requirements of AI/ML workloads, with custom resource definitions, specialized operators, and enhanced integration with ML frameworks and tools. Whether you're building a research environment, developing production ML pipelines, or deploying large-scale inference services, Kubernetes provides a robust foundation for your AI/ML infrastructure.