
Kubernetes AI/ML Platform Integration

Comprehensive guide to deploying, scaling, and managing AI/ML workloads in Kubernetes with distributed training orchestration, model serving patterns, and GPU/TPU optimization

Introduction to Kubernetes AI/ML Platform Integration

Training Orchestration

Coordinate distributed training across multiple nodes and accelerators

Model Serving

Deploy models with high availability, scalability, and version control

Pipeline Automation

Streamline end-to-end ML workflows from data preparation to inference

Resource Optimization

Efficiently manage GPU, TPU, and specialized hardware accelerators

This comprehensive guide explores architecture patterns, implementation strategies, and operational best practices for integrating AI/ML platforms with Kubernetes, enabling organizations to build scalable, production-grade machine learning infrastructure.

AI/ML on Kubernetes Architecture

Core Components

A complete Kubernetes AI/ML platform consists of several essential components working together:

┌────────────────────────────────────────────────────────────────────────────┐
│                                                                            │
│                    Kubernetes AI/ML Platform                               │
│                                                                            │
│  ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐     │
│  │                 │      │                 │      │                 │     │
│  │  Data Pipeline  │      │  Training       │      │  Model Serving  │     │
│  │  Orchestration  │ ─────▶  Orchestration  │ ─────▶  Infrastructure │     │
│  │                 │      │                 │      │                 │     │
│  └─────────────────┘      └─────────────────┘      └─────────────────┘     │
│         │                        │                        │                │
│         │                        │                        │                │
│         ▼                        ▼                        ▼                │
│  ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐     │
│  │                 │      │                 │      │                 │     │
│  │  Storage &      │      │  Accelerator    │      │  Monitoring &   │     │
│  │  Data Management│      │  Management     │      │  Observability  │     │
│  │                 │      │                 │      │                 │     │
│  └─────────────────┘      └─────────────────┘      └─────────────────┘     │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Platform Integration Options

  • Kubeflow: Comprehensive ML toolkit with pipelines, notebook environments, and model serving

  • MLflow on Kubernetes: Experiment tracking, model registry, and deployment management

  • Seldon Core: Advanced model serving with canary deployments and explainability

  • Ray on Kubernetes: Distributed computing framework for scalable ML workloads (a minimal cluster manifest follows this list)

  • KServe: Serverless inference serving with multi-model, multi-framework support
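
Most of these platforms appear in later sections; Ray does not, so as an illustration, a minimal KubeRay RayCluster manifest might look like the following. This is a sketch assuming the KubeRay operator is already installed; image tags, group names, and resource sizes are placeholders:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ml-ray-cluster
spec:
  rayVersion: "2.9.0"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 2
    minReplicas: 1
    maxReplicas: 4
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0-gpu
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 8Gi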

Setting Up the Foundation

Prerequisites

Infrastructure Preparation

Configure your Kubernetes infrastructure for ML workloads:

# Install NVIDIA device plugin for GPU support
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml

# Verify GPU nodes are properly configured
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'

# Install Node Feature Discovery (labels nodes with hardware features such as GPU models)
kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=master"

Storage Configuration

Set up appropriate storage for ML datasets and model artifacts:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ml-data-storage
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  fsType: ext4
  replication-type: none
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  accessModes:
    # gce-pd volumes support only ReadWriteOnce; use an NFS- or Filestore-backed
    # StorageClass instead if multiple pods must mount this data concurrently
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: ml-data-storage

Kubeflow Deployment

Core Installation

Deploy Kubeflow as a comprehensive ML platform:

# Install Kubeflow using the manifests approach
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Apply the manifests using kustomize
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Kubeflow Components Configuration

Configure essential Kubeflow components:

# Jupyter Notebook Controller configuration
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: data-science-notebook
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
      - name: notebook
        image: jupyter/tensorflow-notebook:latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 8Gi
          requests:
            cpu: 1
            memory: 4Gi
        volumeMounts:
        - name: data
          mountPath: /home/jovyan/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: training-data-pvc
---
# Kubeflow Pipelines configuration
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-training-pipeline-
  namespace: kubeflow
spec:
  entrypoint: ml-pipeline
  arguments:
    parameters:
    - name: data-path
      value: /mnt/data
    - name: epochs
      value: "50"
    - name: batch-size
      value: "64"
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: training-data-pvc
  templates:
  - name: ml-pipeline
    steps:
    - - name: data-preprocessing
        template: data-preprocessing
    - - name: model-training
        template: model-training
    - - name: model-evaluation
        template: model-evaluation
    - - name: model-deployment
        template: model-deployment

Distributed Training Orchestration

MPI Operator for Distributed Training

Configure MPI-based distributed training for deep learning:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmark
spec:
  slotsPerWorker: 1  # one slot per GPU on each worker; 4 workers x 1 slot matches mpirun -np 4
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: horovod/horovod:0.21.1-tf2.4.1-torch1.8.0-mxnet1.8.0.0-py3.7-cuda11.0
            name: mpi-launcher
            command:
            - mpirun
            - -np
            - "4"
            - --allow-run-as-root
            - python
            - /examples/tensorflow2_keras_mnist.py
            resources:
              limits:
                cpu: 1
                memory: 2Gi
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - image: horovod/horovod:0.21.1-tf2.4.1-torch1.8.0-mxnet1.8.0.0-py3.7-cuda11.0
            name: mpi-worker
            resources:
              limits:
                nvidia.com/gpu: 1
                cpu: 4
                memory: 16Gi
              requests:
                nvidia.com/gpu: 1
                cpu: 2
                memory: 8Gi

PyTorch Distributed Training

Implement PyTorch distributed training with the PyTorch operator:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
              command:
                - python
                - /opt/pytorch/examples/imagenet/main.py
                - --arch=resnet50
                - --batch-size=128
                - --epochs=90
                - --dist-backend=nccl
                - --multiprocessing-distributed
                - --world-size=3
                - --rank=0
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
              command:
                - python
                - /opt/pytorch/examples/imagenet/main.py
                - --arch=resnet50
                - --batch-size=128
                - --epochs=90
                - --dist-backend=nccl
                - --multiprocessing-distributed
              resources:
                limits:
                  nvidia.com/gpu: 1

TensorFlow Training

Configure TensorFlow distributed training:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tensorflow-training
spec:
  cleanPodPolicy: Running
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.7.0-gpu
            command:
              - python
              - /opt/model/train.py
              - --model_dir=/opt/model/output
              - --train_steps=10000
            # TF_CONFIG is injected automatically by the training operator for each
            # replica, so it does not need to be set by hand
            ports:
            - containerPort: 2222
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.7.0-gpu
            command:
              - python
              - /opt/model/train.py
              - --model_dir=/opt/model/output
              - --train_steps=10000
            ports:
            - containerPort: 2222
            resources:
              limits:
                nvidia.com/gpu: 1

Model Serving Infrastructure

KServe/KFServing

Deploy models with KServe for sophisticated inference serving:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mnist-classifier
spec:
  predictor:
    tensorflow:
      storageUri: "gs://kserve-examples/models/tensorflow/mnist"
      resources:
        limits:
          cpu: "4"
          memory: 8Gi
          nvidia.com/gpu: 1
        requests:
          cpu: "1"
          memory: 2Gi

Advanced KServe configuration with a canary rollout. In v1beta1, setting canaryTrafficPercent on the predictor routes that share of traffic to the most recently deployed revision while the previous ready revision continues to serve the remainder:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: image-classifier
  annotations:
    serving.kserve.io/deploymentMode: Serverless
    serving.kserve.io/autoscalerClass: hpa
    serving.kserve.io/metric: cpu
    serving.kserve.io/targetUtilizationPercentage: '75'
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    containers:
    - name: kserve-container
      image: kserve/image-classifier:latest
      ports:
      - containerPort: 8080
      env:
      - name: MODEL_PATH
        value: /mnt/models
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
          nvidia.com/gpu: 1
        requests:
          cpu: "1"
          memory: 2Gi
      volumeMounts:
      - name: model-volume
        mountPath: /mnt/models
    volumes:
    - name: model-volume
      persistentVolumeClaim:
        claimName: model-store-pvc
    canaryTrafficPercent: 20
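
Once the canary proves healthy, the new revision can be promoted by raising canaryTrafficPercent to 100 (or removing the field), at which point it receives all traffic and becomes the stable revision.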

Seldon Core

Deploy models with Seldon Core for advanced serving capabilities:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: sentiment-analysis
  namespace: models
spec:
  name: sentiment-analyzer
  predictors:
  - name: main
    replicas: 2
    graph:
      name: sentiment-model
      implementation: SKLEARN_SERVER
      modelUri: "gs://seldon-models/sklearn/sentiment"
      envSecretRefName: gcp-credentials
    engineResources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1
        memory: 2Gi
    componentSpecs:
    - spec:
        containers:
        - name: sentiment-model
          resources:
            requests:
              cpu: 1
              memory: 2Gi
              nvidia.com/gpu: 1
            limits:
              cpu: 2
              memory: 4Gi
              nvidia.com/gpu: 1

Model Servers with TensorFlow Serving

Deploy TensorFlow models with TensorFlow Serving:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      containers:
      - name: tensorflow-serving
        image: tensorflow/serving:2.6.0
        args:
        - "--model_config_file=/models/models.config"
        - "--model_config_file_poll_wait_seconds=60"
        - "--rest_api_port=8501"
        - "--port=8500"
        - "--rest_api_timeout_in_ms=5000"
        ports:
        - containerPort: 8501
          name: rest
        - containerPort: 8500
          name: grpc
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            nvidia.com/gpu: 1
          requests:
            cpu: "1"
            memory: 2Gi
        volumeMounts:
        - name: model-configs
          mountPath: /models
        - name: model-data
          mountPath: /model-data
      volumes:
      - name: model-configs
        configMap:
          name: model-config
      - name: model-data
        persistentVolumeClaim:
          claimName: model-store-pvc
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-config
data:
  models.config: |
    model_config_list: {
      config: {
        name: "image_classifier",
        base_path: "/model-data/image_classifier",
        model_platform: "tensorflow"
      },
      config: {
        name: "text_classifier",
        base_path: "/model-data/text_classifier",
        model_platform: "tensorflow"
      }
    }

MLOps Pipelines and Workflows

Kubeflow Pipelines

Create reproducible ML workflows with Kubeflow Pipelines:

# pipeline.py
import kfp
from kfp import dsl
from kfp import components

# Define pipeline components
preprocess_op = components.load_component_from_file('preprocess_component.yaml')
train_op = components.load_component_from_file('train_component.yaml')
evaluate_op = components.load_component_from_file('evaluate_component.yaml')
deploy_op = components.load_component_from_file('deploy_component.yaml')

# Define the pipeline
@dsl.pipeline(
    name='ML Training Pipeline',
    description='End-to-end ML training and deployment pipeline'
)
def ml_pipeline(
    data_path: str,
    epochs: int = 10,
    batch_size: int = 32,
    learning_rate: float = 0.001,
    model_name: str = 'image-classifier'
):
    # Preprocess data
    preprocess_task = preprocess_op(
        data_path=data_path
    )
    
    # Train model
    train_task = train_op(
        preprocessed_data=preprocess_task.outputs['preprocessed_data'],
        epochs=epochs,
        batch_size=batch_size,
        learning_rate=learning_rate
    )
    train_task.set_gpu_limit(1)
    
    # Evaluate model
    evaluate_task = evaluate_op(
        model=train_task.outputs['model'],
        test_data=preprocess_task.outputs['test_data']
    )
    
    # Deploy model if accuracy threshold is met
    with dsl.Condition(evaluate_task.outputs['accuracy'] > 0.85):
        deploy_task = deploy_op(
            model=train_task.outputs['model'],
            model_name=model_name
        )

# Compile the pipeline
kfp.compiler.Compiler().compile(ml_pipeline, 'ml_pipeline.yaml')

Component definition example:

# train_component.yaml
name: Train Model
description: Trains a machine learning model
inputs:
  - name: preprocessed_data
    description: Preprocessed training data
    type: Dataset
  - name: epochs
    description: Number of training epochs
    type: Integer
    default: 10
  - name: batch_size
    description: Training batch size
    type: Integer
    default: 32
  - name: learning_rate
    description: Learning rate for optimization
    type: Float
    default: 0.001
outputs:
  - name: model
    description: Trained model
    type: Model
implementation:
  container:
    image: gcr.io/ml-pipeline/tensorflow-gpu-training:latest
    command:
      - python
      - /opt/train.py
      - --data-path
      - {inputPath: preprocessed_data}
      - --epochs
      - {inputValue: epochs}
      - --batch-size
      - {inputValue: batch_size}
      - --learning-rate
      - {inputValue: learning_rate}
      - --model-output
      - {outputPath: model}
    # Resource limits (GPU, memory) are applied from the pipeline code, e.g.
    # train_task.set_gpu_limit(1), rather than in the component specification

Argo Workflows for ML

Use Argo Workflows for complex ML pipelines:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: machine-learning-workflow-
spec:
  entrypoint: ml-pipeline
  volumes:
  - name: workdir
    persistentVolumeClaim:
      claimName: ml-pipeline-data
  arguments:
    parameters:
    - name: data-path
      value: /mnt/data/training
    - name: model-name
      value: vision-transformer
  templates:
  - name: ml-pipeline
    inputs:
      parameters:
      - name: data-path
      - name: model-name
    dag:
      tasks:
      - name: fetch-data
        template: fetch-data
        arguments:
          parameters:
          - name: data-path
            value: "{{inputs.parameters.data-path}}"
      - name: preprocess
        dependencies: [fetch-data]
        template: preprocess
        arguments:
          parameters:
          - name: input-path
            value: "{{tasks.fetch-data.outputs.parameters.output-path}}"
      - name: train-model
        dependencies: [preprocess]
        template: train-model
        arguments:
          parameters:
          - name: preprocessed-data
            value: "{{tasks.preprocess.outputs.parameters.preprocessed-data}}"
          - name: model-name
            value: "{{inputs.parameters.model-name}}"
      - name: evaluate-model
        dependencies: [train-model]
        template: evaluate-model
        arguments:
          parameters:
          - name: model-path
            value: "{{tasks.train-model.outputs.parameters.model-path}}"
          - name: test-data
            value: "{{tasks.preprocess.outputs.parameters.test-data}}"
      - name: deploy-model
        dependencies: [evaluate-model]
        template: deploy-model
        arguments:
          parameters:
          - name: model-path
            value: "{{tasks.train-model.outputs.parameters.model-path}}"
          - name: model-name
            value: "{{inputs.parameters.model-name}}"
          - name: accuracy
            value: "{{tasks.evaluate-model.outputs.parameters.accuracy}}"
        when: "{{tasks.evaluate-model.outputs.parameters.accuracy}} >= 0.85"

  - name: train-model
    inputs:
      parameters:
      - name: preprocessed-data
      - name: model-name
    outputs:
      parameters:
      - name: model-path
        valueFrom:
          path: /tmp/model-path
    container:
      image: tensorflow/tensorflow:2.7.0-gpu
      command: [python, /opt/train.py]
      args:
      - --data-path
      - "{{inputs.parameters.preprocessed-data}}"
      - --model-name
      - "{{inputs.parameters.model-name}}"
      - --output-path
      - "/mnt/output/models/{{inputs.parameters.model-name}}"
      resources:
        limits:
          memory: 16Gi
          nvidia.com/gpu: 1
        requests:
          memory: 8Gi
          cpu: 4
      volumeMounts:
      - name: workdir
        mountPath: /mnt/output

GPU and Hardware Acceleration

GPU Resource Management

Configure effective GPU resource allocation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-inference
  template:
    metadata:
      labels:
        app: gpu-inference
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-a100
      containers:
      - name: inference-server
        image: inference-server:latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 8Gi
          requests:
            cpu: 2
            memory: 4Gi
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
        - name: GPU_MEMORY_FRACTION
          value: "0.8"

GPU Sharing and Partitioning

Implement GPU sharing for more efficient resource utilization:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sharing-inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      nodeSelector:
        nvidia.com/gpu.product: Tesla-A100
      containers:
      - name: inference-server
        image: inference-server:latest
        resources:
          limits:
            # Fractional GPU requests are not supported; with MIG (Multi-Instance GPU)
            # each partition is exposed as its own schedulable resource, for example:
            nvidia.com/mig-2g.10gb: 1
            memory: 4Gi
          requests:
            cpu: 1
            memory: 2Gi
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: TF_FORCE_GPU_ALLOW_GROWTH
          value: "true"

Multi-GPU Training Configuration

Configure multi-GPU training for larger models:

apiVersion: batch/v1
kind: Job
metadata:
  name: multi-gpu-training
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        # label selectors are exact-match; for "4 or more" use node affinity with the Gt operator
        nvidia.com/gpu.count: "4"
      containers:
      - name: training
        image: pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
        command:
        - python
        - -m
        - torch.distributed.launch
        - --nproc_per_node=4
        - /workspace/train.py
        - --batch-size=128
        - --epochs=100
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: 32Gi
          requests:
            nvidia.com/gpu: 4
            cpu: 16
            memory: 24Gi
        env:
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_SOCKET_IFNAME
          value: eth0
        volumeMounts:
        - name: dataset
          mountPath: /data
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: dataset
        persistentVolumeClaim:
          claimName: training-data
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi

Experiment Tracking and Model Registry

MLflow on Kubernetes

Deploy MLflow for experiment tracking and model registry:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow
        # official image; the Postgres backend store and S3 artifact root below also
        # require psycopg2-binary and boto3, typically layered into a custom image
        image: ghcr.io/mlflow/mlflow:latest
        ports:
        - containerPort: 5000
        command:
        - mlflow
        - server
        - --backend-store-uri=postgresql://mlflow:password@postgres-mlflow:5432/mlflow
        - --default-artifact-root=s3://mlflow-artifacts/
        - --host=0.0.0.0
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: mlflow-s3-credentials
              key: access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: mlflow-s3-credentials
              key: secret-key
        resources:
          limits:
            cpu: 2
            memory: 4Gi
          requests:
            cpu: 1
            memory: 2Gi
        readinessProbe:
          httpGet:
            path: /
            port: 5000
          initialDelaySeconds: 30
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow
spec:
  selector:
    app: mlflow
  ports:
  - port: 80
    targetPort: 5000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mlflow-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - mlflow.example.com
    secretName: mlflow-tls
  rules:
  - host: mlflow.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: mlflow
            port:
              number: 80

TensorBoard for Experiment Visualization

Deploy TensorBoard for training visualization:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      containers:
      - name: tensorboard
        image: tensorflow/tensorflow:2.7.0
        command:
        - tensorboard
        - --logdir=/logs
        - --bind_all
        ports:
        - containerPort: 6006
        volumeMounts:
        - name: logs
          mountPath: /logs
        resources:
          limits:
            cpu: 2
            memory: 4Gi
          requests:
            cpu: 500m
            memory: 1Gi
      volumes:
      - name: logs
        persistentVolumeClaim:
          claimName: training-logs-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: tensorboard
spec:
  selector:
    app: tensorboard
  ports:
  - port: 80
    targetPort: 6006
  type: ClusterIP

Production ML Patterns

A/B Testing with Canary Deployments

Implement A/B testing for ML models:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-ab-test
spec:
  hosts:
  - "inference.example.com"
  gateways:
  - inference-gateway
  http:
  - match:
    - headers:
        cookie:
          regex: "^(.*?;)?(group=a)(;.*)?$"
    route:
    - destination:
        host: model-a-service
        port:
          number: 80
      weight: 100
  - match:
    - headers:
        cookie:
          regex: "^(.*?;)?(group=b)(;.*)?$"
    route:
    - destination:
        host: model-b-service
        port:
          number: 80
      weight: 100
  - route:
    - destination:
        host: model-a-service
        port:
          number: 80
      weight: 90
    - destination:
        host: model-b-service
        port:
          number: 80
      weight: 10
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: model-a-service
spec:
  host: model-a-service
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: model-b-service
spec:
  host: model-b-service
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
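
The VirtualService above binds to an inference-gateway that is not shown; a minimal definition might look like this (the host name matches the VirtualService and is an example value):

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: inference-gateway
spec:
  selector:
    istio: ingressgateway   # use Istio's default ingress gateway deployment
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "inference.example.com"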

Model Monitoring and Observability

Deploy monitoring for production ML models:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'model-metrics'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_app]
          action: keep
          regex: inference-service
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: "true"
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
data:
  model-performance.json: |-
    {
      "annotations": {
        "list": []
      },
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": 1,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prometheus"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "ms"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 0
          },
          "id": 1,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "title": "Model Inference Latency",
          "type": "timeseries"
        }
      ],
      "refresh": "5s",
      "schemaVersion": 38,
      "style": "dark",
      "tags": ["ml", "inference"],
      "templating": {
        "list": []
      },
      "time": {
        "from": "now-6h",
        "to": "now"
      },
      "timepicker": {},
      "timezone": "",
      "title": "ML Model Performance",
      "version": 0,
      "weekStart": ""
    }
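
If the cluster runs the Prometheus Operator rather than a hand-maintained prometheus.yml, the equivalent scrape target can be declared with a ServiceMonitor. A sketch, assuming the inference pods are exposed by a Service labeled app: inference-service with a port named metrics, and that the operator selects monitors by the release label:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inference-service
  labels:
    release: prometheus   # must match the Prometheus operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: inference-service
  endpoints:
  - port: metrics
    path: /metrics
    interval: 15s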

Model Versioning and Rollback

Implement robust model versioning and rollback strategies:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-versioning-demo
  annotations:
    kubernetes.io/change-cause: "Deploy model version v2.3.1"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-inference
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: model-inference
        model-version: v2.3.1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8081"
    spec:
      containers:
      - name: model-server
        image: model-registry.example.com/inference-server:v2.3.1
        imagePullPolicy: Always
        env:
        - name: MODEL_PATH
          value: "gs://models/text-classification/v2.3.1"
        - name: MODEL_VERSION
          value: "v2.3.1"
        resources:
          limits:
            cpu: 2
            memory: 4Gi
            nvidia.com/gpu: 1
          requests:
            cpu: 1
            memory: 2Gi
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 8081
          name: metrics
        readinessProbe:
          httpGet:
            path: /v1/health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /v1/health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 15
        volumeMounts:
        - name: model-config
          mountPath: /etc/model-server
      volumes:
      - name: model-config
        configMap:
          name: model-server-config
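
Because the Deployment records a kubernetes.io/change-cause for each model version and rolls out with maxUnavailable: 0, a misbehaving model can be reverted without downtime using kubectl rollout undo, and kubectl rollout history shows which model version each revision introduced.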

Best Practices and Recommendations

Resource Management

  1. Right-size resources: Match container requests to workload needs
  2. Use GPU node pools: Create dedicated GPU node pools
  3. Pod priorities: Prioritize training jobs appropriately (see the PriorityClass sketch after this list)
  4. QoS classes: Use Guaranteed QoS for inference services
  5. Optimize data access: Use high-performance storage

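For the pod-priority recommendation above, dedicated PriorityClasses let serving workloads preempt (or avoid being preempted by) batch training. A sketch; the names and values are illustrative:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-inference-critical
value: 100000
globalDefault: false
description: "Latency-sensitive model serving; should not be preempted by training jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-training-batch
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "Batch training jobs; may be preempted when serving needs capacity"

Pods opt in by setting spec.priorityClassName to one of these classes.
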
Security

  1. Secure artifacts: Protect model files with access controls
  2. Implement RBAC: Use fine-grained access control
  3. Network policies: Segment ML workloads appropriately (see the NetworkPolicy sketch after this list)
  4. Secrets management: Secure API keys and credentials
  5. Image scanning: Scan ML container images regularly

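For network segmentation, a policy that admits only the intended clients keeps model servers from being reachable cluster-wide. A sketch; the namespace and labels are illustrative:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-ingress-only
  namespace: models
spec:
  podSelector:
    matchLabels:
      app: model-inference
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: api-gateway
    ports:
    - protocol: TCP
      port: 8080
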
Scaling

  1. Horizontal Pod Autoscaling: Scale based on metrics (see the HPA sketch after this list)
  2. Vertical Pod Autoscaling: Adjust resource requests
  3. Node Autoscaling: Scale based on pending pods
  4. Spot Instances: Use for non-critical training
  5. Distributed training: Scale across multiple nodes

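A Horizontal Pod Autoscaler for the gpu-inference Deployment shown earlier can scale on CPU utilization out of the box; GPU or request-rate metrics would require a custom or external metrics adapter. A sketch:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
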
Conclusion

Kubernetes has become the platform of choice for orchestrating AI/ML workloads at scale, providing the foundation for modern machine learning operations. By integrating specialized ML platforms with Kubernetes, organizations can build robust, scalable, and production-grade infrastructure for the entire machine learning lifecycle.

Infrastructure Standardization

Consistent deployment patterns across development and production

Resource Efficiency

Optimal utilization of specialized hardware accelerators

Workflow Automation

End-to-end ML pipelines with reproducible results

Operational Resilience

High availability and fault tolerance for ML services

Scalability

Seamless scaling from experimentation to production deployment

Observability

Comprehensive monitoring and troubleshooting capabilities