Kubernetes has emerged as a powerful platform for deploying, managing, and scaling artificial intelligence (AI) and machine learning (ML) workloads. The inherent orchestration capabilities of Kubernetes, combined with its extensibility and ecosystem, make it well-suited to address the unique challenges of AI/ML workflows:
- Resource management: Efficiently allocating GPU, TPU, and other specialized hardware resources
- Workload orchestration: Handling complex training jobs, batch inference, and serving pipelines
- Scalability: Dynamically scaling resources based on training and inference demands
- Reproducibility: Ensuring consistent environments for model training and deployment
- Workflow automation: Streamlining ML pipelines from data ingestion to model serving

This guide explores best practices, tools, and patterns for effectively running AI/ML workloads on Kubernetes, helping organizations build robust, scalable, and efficient machine learning infrastructure.
A Kubernetes cluster optimized for AI/ML typically includes several specialized components:
- Node pools with hardware accelerators: GPU/TPU-enabled worker nodes (a toleration sketch for such node pools follows the first GPU pod example below)
- Custom schedulers: For optimized placement of ML workloads
- Storage systems: Optimized for ML datasets and model artifacts
- Resource quotas: For fair sharing of compute resources
- Custom operators: For ML-specific resources and workflows

Kubernetes provides mechanisms to request and allocate hardware accelerators:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: ml-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2   # Request 2 NVIDIA GPUs
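GPU node pools are commonly tainted so that only accelerator workloads are scheduled onto them, which means the pod also needs a matching toleration. A minimal sketch, assuming the node pool is tainted with nvidia.com/gpu=present:NoSchedule and labeled accelerator=nvidia-gpu (both the taint and the label name are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-tolerating
spec:
  # Only schedule onto nodes labeled as GPU nodes (label name is an assumption)
  nodeSelector:
    accelerator: nvidia-gpu
  # Tolerate the taint applied to the GPU node pool (taint key/value are assumptions)
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
  containers:
  - name: ml-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1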
For Google Cloud TPUs:
apiVersion: v1
kind: Pod
metadata:
  name: tpu-pod
spec:
  containers:
  - name: ml-container
    image: tensorflow/tensorflow:latest-tpu
    resources:
      limits:
        cloud-tpus.google.com/v3: 8   # Request 8 TPU v3 cores
A typical TensorFlow training pod requests GPUs and mounts the dataset and model output from persistent volumes:

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-training
spec:
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:2.9.0-gpu
    resources:
      limits:
        nvidia.com/gpu: 4
    volumeMounts:
    - name: dataset
      mountPath: /data
    - name: model-output
      mountPath: /output
    command: ["python", "/app/train.py", "--epochs=100", "--batch-size=64"]
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: training-data-pvc
  - name: model-output
    persistentVolumeClaim:
      claimName: model-output-pvc
The equivalent PyTorch training pod; the NCCL environment variables control the communication backend used for multi-GPU training:

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training
spec:
  containers:
  - name: pytorch
    image: pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 4
    env:
    - name: NCCL_DEBUG
      value: "INFO"
    - name: NCCL_SOCKET_IFNAME
      value: "eth0"
    volumeMounts:
    - name: dataset
      mountPath: /data
    - name: model-output
      mountPath: /output
    command: ["python", "/app/train.py"]
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: training-data-pvc
  - name: model-output
    persistentVolumeClaim:
      claimName: model-output-pvc
A JAX/Flax training pod follows the same pattern:

apiVersion: v1
kind: Pod
metadata:
  name: jax-training
spec:
  containers:
  - name: jax
    image: gcr.io/jax-project/jax:latest
    resources:
      limits:
        nvidia.com/gpu: 4
    volumeMounts:
    - name: dataset
      mountPath: /data
    - name: model-output
      mountPath: /output
    command: ["python", "/app/train_flax.py"]
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: training-data-pvc
  - name: model-output
    persistentVolumeClaim:
      claimName: model-output-pvc
Kubernetes facilitates distributed training through its networking and orchestration capabilities:
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completionMode: Indexed   # Required for the job-completion-index annotation used below
  parallelism: 4            # Launch 4 pods in parallel
  completions: 4            # Require 4 successful completions
  template:
    spec:
      containers:
      - name: trainer
        image: custom-ml-framework:latest
        env:
        - name: WORLD_SIZE
          value: "4"
        - name: RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
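Distributed workers also need a stable address for rendezvous (for example, a master/rank-0 endpoint). A common pattern is a headless Service over the training pods; the sketch below relies on the job-name label that Kubernetes adds automatically to pods created by a Job, and the port is the conventional PyTorch/NCCL rendezvous port (adjust for your framework):

apiVersion: v1
kind: Service
metadata:
  name: distributed-training
spec:
  clusterIP: None   # Headless: gives each pod a stable DNS entry instead of a load-balanced VIP
  selector:
    job-name: distributed-training   # Label added automatically to pods created by the Job
  ports:
  - name: rendezvous
    port: 29500     # Conventional PyTorch/NCCL rendezvous port; framework-dependent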
Kubeflow is a comprehensive ML platform for Kubernetes with components for:
- Kubeflow Pipelines: Workflow orchestration for ML pipelines
- Katib: Hyperparameter tuning and neural architecture search
- KFServing/KServe: Model serving with autoscaling
- Training Operators: Distributed training for various frameworks (see the PyTorchJob sketch after the pipeline example below)
- Notebooks: Interactive development environments

Kubeflow Pipeline example:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: ml-pipeline
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pipeline-data-pvc   # Illustrative PVC backing the shared /data volume
  templates:
  - name: ml-pipeline
    dag:
      tasks:
      - name: data-preprocessing
        template: data-preprocessing
      - name: model-training
        template: model-training
        dependencies: [data-preprocessing]
      - name: model-evaluation
        template: model-evaluation
        dependencies: [model-training]
      - name: model-deployment
        template: model-deployment
        dependencies: [model-evaluation]
  - name: data-preprocessing
    container:
      image: data-preprocessing:latest
      command: ["python", "/app/preprocess.py"]
      volumeMounts:
      - name: data
        mountPath: /data
  - name: model-training
    container:
      image: model-training:latest
      command: ["python", "/app/train.py"]
      resources:
        limits:
          nvidia.com/gpu: 2
  - name: model-evaluation
    container:
      image: model-evaluation:latest
      command: ["python", "/app/evaluate.py"]
  - name: model-deployment
    container:
      image: model-deployment:latest
      command: ["python", "/app/deploy.py"]
MLflow provides experiment tracking, model registry, and deployment capabilities:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow
        image: mlflow/mlflow:latest
        ports:
        - containerPort: 5000
        env:
        - name: MLFLOW_S3_ENDPOINT_URL
          value: "http://minio-service:9000"
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: mlflow-s3-credentials
              key: accessKey
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: mlflow-s3-credentials
              key: secretKey
        - name: MLFLOW_TRACKING_URI
          value: "postgresql://mlflow:mlflow@postgres-service:5432/mlflow"
        command:
        - mlflow
        - server
        - --host=0.0.0.0
        - --backend-store-uri=postgresql://mlflow:mlflow@postgres-service:5432/mlflow
        - --default-artifact-root=s3://mlflow/artifacts
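To make the tracking server reachable from training pods and notebooks, it is typically fronted by a Service. A brief sketch matching the labels above (the service name is illustrative):

apiVersion: v1
kind: Service
metadata:
  name: mlflow-tracking
spec:
  selector:
    app: mlflow
  ports:
  - port: 5000        # Port clients use, e.g. MLFLOW_TRACKING_URI=http://mlflow-tracking:5000
    targetPort: 5000  # Container port of the mlflow server
  type: ClusterIP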
Kubernetes supports various model serving architectures:
Single-model serving:
- Simple deployment with one model per service
- Direct scaling based on single-model demand
- Simpler monitoring and management

Multi-model serving:
- Multiple models served from a single deployment
- Efficient resource utilization
- Typically uses specialized serving frameworks

Batch inference:
- Processes large volumes of data asynchronously
- Optimized for throughput over latency
- Often implemented using Kubernetes Jobs or CronJobs (see the CronJob sketch below)
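A minimal sketch of the batch-inference pattern: a CronJob that periodically runs a scoring script over new data. The image, script path, and PVC name are illustrative:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch-inference
spec:
  schedule: "0 2 * * *"   # Run every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: batch-scorer
            image: batch-inference:latest   # Illustrative image
            command: ["python", "/app/score.py", "--input=/data/new", "--output=/data/predictions"]
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: data
              mountPath: /data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: inference-data-pvc   # Illustrative PVC
          restartPolicy: OnFailure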
A single-model serving deployment using TensorFlow Serving, exposed through a ClusterIP Service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      containers:
      - name: tensorflow-serving
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: "resnet"
        volumeMounts:
        - name: model-volume
          mountPath: /models/resnet
        resources:
          limits:
            memory: "2Gi"
            cpu: "1"
            nvidia.com/gpu: 1
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-storage
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-serving
spec:
  selector:
    app: tensorflow-serving
  ports:
  - port: 8501
    targetPort: 8501
  type: ClusterIP
KServe provides Kubernetes custom resources for serverless inference:
apiVersion : "serving.kserve.io/v1beta1"
kind : "InferenceService"
metadata :
name : "bert-sentiment"
spec :
predictor :
tensorflow :
storageUri : "s3://models/bert-sentiment/model"
resources :
limits :
memory : "2Gi"
cpu : "1"
nvidia.com/gpu : 1
containerConcurrency : 10
transformer :
containers :
- image : "gcr.io/project/bert-preprocessor:v1"
name : bert-preprocessor
For model servers deployed as plain Deployments, a HorizontalPodAutoscaler can combine CPU utilization with a custom request-rate metric (the Pods metric assumes a custom-metrics adapter such as prometheus-adapter is installed):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: inference_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
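The inference_requests_per_second metric does not exist by default; it has to be exposed through the custom metrics API. A sketch of a prometheus-adapter rule that derives it from a hypothetical model_inference_requests_total counter (the metric and label names are assumptions):

# prometheus-adapter rule configuration, sketch only
rules:
- seriesQuery: 'model_inference_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "model_inference_requests_total"
    as: "inference_requests_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'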
Efficient data management is critical for ML workflows:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-dataset-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
  storageClassName: fast-nfs
For read-only datasets used across multiple pods:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: read-only-dataset
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: high-performance-ro
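Consumers then mount the claim with readOnly set, so many training pods can share the same dataset safely. A brief sketch (the pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: dataset-consumer
spec:
  containers:
  - name: trainer
    image: ml-training:latest
    volumeMounts:
    - name: dataset
      mountPath: /data
      readOnly: true        # Mount the shared dataset read-only inside the container
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: read-only-dataset
      readOnly: true        # Attach the claim in read-only mode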
Kubernetes Jobs can efficiently handle large-scale data preprocessing:
apiVersion: batch/v1
kind: Job
metadata:
  name: data-preprocessing
spec:
  completionMode: Indexed   # Required for the job-completion-index annotation used below
  parallelism: 20           # Process 20 chunks in parallel
  completions: 100          # 100 total chunks to process
  template:
    spec:
      containers:
      - name: data-processor
        image: data-processor:latest
        env:
        - name: CHUNK_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        volumeMounts:
        - name: raw-data
          mountPath: /raw
        - name: processed-data
          mountPath: /processed
      volumes:
      - name: raw-data
        persistentVolumeClaim:
          claimName: raw-dataset-pvc
      - name: processed-data
        persistentVolumeClaim:
          claimName: processed-dataset-pvc
      restartPolicy: OnFailure
NVIDIA time-slicing and MPS let multiple containers share a physical GPU. Sharing is configured in the NVIDIA device plugin rather than per pod, and Kubernetes only accepts whole-number requests for extended resources, so each container still requests an integer number of nvidia.com/gpu; with time-slicing enabled, each advertised GPU is a time-sliced replica of a physical device:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-shared-pod
spec:
  containers:
  - name: ml-container-1
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1   # One time-sliced GPU replica, not a whole physical GPU
  - name: ml-container-2
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1   # One time-sliced GPU replica
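The sharing itself is declared in the NVIDIA device plugin configuration. A sketch of a time-slicing config that advertises each physical GPU as four schedulable replicas (the exact format depends on the k8s-device-plugin version):

# Passed to the NVIDIA device plugin, e.g. via a ConfigMap referenced by its config file option
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # Each physical GPU is exposed as 4 time-sliced nvidia.com/gpu units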
Node affinity can pin training pods to specific GPU models, using node labels such as nvidia.com/gpu.product published by GPU feature discovery:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - Tesla-V100
            - A100
  containers:
  - name: ml-trainer
    image: ml-training:latest
    resources:
      limits:
        nvidia.com/gpu: 4
Namespace-scoped ResourceQuotas keep individual teams from monopolizing cluster compute, including GPUs:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-team-a
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 500Gi
    requests.nvidia.com/gpu: 20
    limits.nvidia.com/gpu: 20
Protecting sensitive models and training data:
Data and model storage:
- Use encrypted persistent volumes for model and data storage
- Implement at-rest encryption for sensitive datasets
- Apply role-based access control to storage resources

Serving endpoints:
- Implement TLS for model serving endpoints
- Use network policies to restrict access to inference services
- Apply Pod Security Standards to model serving containers

Access control:
- Implement fine-grained RBAC for ML resources (see the Role sketch after the network policy below)
- Use service accounts with minimal permissions
- Separate development, training, and production environments

A NetworkPolicy restricting which workloads can reach the model servers:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-model-server-policy
spec:
  podSelector:
    matchLabels:
      app: model-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          purpose: application
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 8501
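On the access-control side, a narrowly scoped Role bound to a training service account illustrates the least-privilege pattern (all names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-job-runner
  namespace: ml-team-a
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "list", "watch", "delete"]   # Manage training Jobs only
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]                       # Read-only access to pods and logs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-job-runner-binding
  namespace: ml-team-a
subjects:
- kind: ServiceAccount
  name: training-runner
  namespace: ml-team-a
roleRef:
  kind: Role
  name: training-job-runner
  apiGroup: rbac.authorization.k8s.io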
Key metrics to monitor for ML workloads:
- Training metrics: Loss, accuracy, learning rate
- Inference metrics: Latency, throughput, error rates
- Resource utilization: GPU memory, GPU utilization, RAM
- Model drift: Prediction distribution, feature distribution
- Batch processing: Job completion rates, processing time
With the Prometheus Operator, a ServiceMonitor scrapes the model servers' metrics endpoint and a PrometheusRule alerts on high inference latency:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: model-serving-monitor
spec:
  selector:
    matchLabels:
      app: model-server
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-serving-alerts
spec:
  groups:
  - name: ml.serving.alerts
    rules:
    - alert: HighModelLatency
      expr: histogram_quantile(0.95, sum(rate(model_inference_latency_seconds_bucket[5m])) by (le, model)) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High inference latency for model"
        description: "95th percentile latency for model {{ $labels.model }} is above the 100ms threshold"
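A ServiceMonitor selects Services (not pods) and references a named port, so the model-server Service must expose a port named metrics. A brief sketch, assuming the metrics endpoint listens on 9090 (an illustrative port):

apiVersion: v1
kind: Service
metadata:
  name: model-server
  labels:
    app: model-server       # Matched by the ServiceMonitor's selector
spec:
  selector:
    app: model-server
  ports:
  - name: http
    port: 8501
    targetPort: 8501
  - name: metrics           # Named port referenced by the ServiceMonitor endpoint
    port: 9090
    targetPort: 9090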
Architecture for large-scale distributed training:
apiVersion: v1
kind: ConfigMap
metadata:
  name: training-config
data:
  config.yaml: |
    training:
      learning_rate: 0.001
      batch_size: 256
      epochs: 100
    distributed:
      strategy: "data_parallel"
      nodes: 16
      gpus_per_node: 8
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
spec:
  parallelism: 16   # 16 nodes
  completions: 16
  backoffLimit: 4
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: job-name
                operator: In
                values:
                - distributed-training-job
            topologyKey: kubernetes.io/hostname
      containers:
      - name: distributed-trainer
        image: ml-training:latest
        resources:
          limits:
            nvidia.com/gpu: 8
            memory: "128Gi"
            cpu: "32"
        volumeMounts:
        - name: training-config
          mountPath: /etc/config
        - name: dataset
          mountPath: /data
        - name: model-output
          mountPath: /output
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: training-config
        configMap:
          name: training-config
      - name: dataset
        persistentVolumeClaim:
          claimName: training-dataset
      - name: model-output
        persistentVolumeClaim:
          claimName: model-checkpoint-store
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi
      restartPolicy: OnFailure
Knative Serving supports scale-to-zero inference services with revision-based canary rollouts; here 10% of traffic is shifted to a candidate revision:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ml-inference
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/maxScale: "20"
    spec:
      containers:
      - image: gcr.io/project/model-server:v2
        resources:
          limits:
            nvidia.com/gpu: 1
  traffic:
  - tag: current
    revisionName: ml-inference-v1
    percent: 90
  - tag: candidate
    revisionName: ml-inference-v2
    percent: 10
Emerging hardware support:
- Direct integration with FPGAs, ASICs, and custom ML accelerators
- Enhanced scheduling for heterogeneous compute resources
- Specialized operators for new accelerator types

Federated and multi-cluster learning:
- Coordinating model training across multiple Kubernetes clusters
- Privacy-preserving distributed training architectures
- Multi-cluster model aggregation and synchronization

MLOps automation:
- End-to-end MLOps pipelines with enhanced automation
- GitOps for ML models and infrastructure
- Continuous training and evaluation with automated deployment

Kubernetes has become the de facto platform for running AI and ML workloads at scale, providing the orchestration, resource management, and ecosystem integration needed for modern machine learning operations. By leveraging Kubernetes' capabilities for hardware acceleration, distributed training, and scalable serving, organizations can build flexible, efficient, and production-ready ML infrastructure.
As the field continues to evolve, Kubernetes is adapting to meet the specialized requirements of AI/ML workloads, with custom resource definitions, specialized operators, and enhanced integration with ML frameworks and tools. Whether you're building a research environment, developing production ML pipelines, or deploying large-scale inference services, Kubernetes provides a robust foundation for your AI/ML infrastructure.