Effective monitoring and logging are crucial for maintaining healthy Kubernetes clusters and applications. A comprehensive monitoring and logging strategy provides visibility into cluster health and application performance, and helps you troubleshoot issues before they affect end users.
Kubernetes observability consists of three main pillars:
- Monitoring: Collecting and analyzing metrics about system performance and behavior
- Logging: Capturing and storing event data and messages from applications and system components
- Tracing: Following the path of requests through distributed systems to identify bottlenecks

The Metrics Server provides the built-in, lightweight option:
- Basic cluster metrics: Lightweight, in-memory metrics collector
- CPU and memory usage: Resource utilization for pods and nodes
- Pod and node metrics: Key metrics for orchestration decisions
- Horizontal Pod Autoscaling: Provides metrics for the HPA controller
- Simple API: Accessible via the Kubernetes API (metrics.k8s.io)
- No historical data: Only provides current state, not long-term analysis

Prometheus is the most common choice for full-featured metrics collection:
- Time-series database: Purpose-built database for metrics with timestamps
- Metrics collection: Pull-based architecture with service discovery
- Query language (PromQL): Powerful language for metrics analysis and aggregation
- Alert management: Define alert conditions and routing
- Federation: Scale to large deployments by federating multiple Prometheus servers
- Service discovery: Auto-discovers targets via the Kubernetes API
- Extensive integrations: Wide ecosystem of exporters for various systems

Grafana is the standard visualization layer on top of Prometheus:
- Visualization platform: Create rich, interactive dashboards
- Dashboard creation: Drag-and-drop interface with extensive customization
- Metrics exploration: Query and explore metrics in real time
- Alert notification: Define alerts based on metrics thresholds
- Annotation: Mark events on time-series graphs
- Multi-datasource: Connect to Prometheus, Elasticsearch, and many other data sources
- Template variables: Create dynamic, reusable dashboards

Below is a simplified Prometheus deployment. In production environments, you would typically use the Prometheus Operator or a solution like the Kube Prometheus Stack.
apiVersion : apps/v1
kind : Deployment
metadata :
name : prometheus
namespace : monitoring
labels :
app : prometheus
component : core
spec :
replicas : 1
selector :
matchLabels :
app : prometheus
component : core
template :
metadata :
labels :
app : prometheus
component : core
spec :
serviceAccountName : prometheus # Must have permissions to access API server
containers :
- name : prometheus
image : prom/prometheus:v2.30.3
args :
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=15d"
- "--web.console.libraries=/etc/prometheus/console_libraries"
- "--web.console.templates=/etc/prometheus/consoles"
ports :
- containerPort : 9090
name : web
volumeMounts :
- name : config-volume
mountPath : /etc/prometheus
- name : prometheus-data
mountPath : /prometheus
resources :
requests :
cpu : 500m
memory : 500Mi
limits :
cpu : 1
memory : 1Gi
volumes :
- name : config-volume
configMap :
name : prometheus-config
- name : prometheus-data
persistentVolumeClaim :
claimName : prometheus-data
The Prometheus configuration would be stored in a ConfigMap:
apiVersion : v1
kind : ConfigMap
metadata :
name : prometheus-config
namespace : monitoring
data :
prometheus.yml : |
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
The kubectl top command provides a quick way to view resource usage for pods:
kubectl top pod -n default
Output example:
NAME CPU(cores) MEMORY(bytes)
nginx-85b98978db-2cgnj 1m 9Mi
redis-78586f566b-jz5jq 1m 10Mi
webapp-7d7ffd5cc9-nv2xz 10m 45Mi
View resource utilization across all nodes in the cluster:
kubectl top nodes
Output example:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
worker-node-1 248m 12% 1867Mi 24%
worker-node-2 134m 6% 2507Mi 32%
worker-node-3 324m 16% 2120Mi 27%
The Prometheus Operator uses ServiceMonitor CRDs to define which services should be monitored:
apiVersion : monitoring.coreos.com/v1
kind : ServiceMonitor
metadata :
name : app-metrics
namespace : monitoring
labels :
release : prometheus # Used by Prometheus Operator for selection
spec :
selector :
matchLabels :
app : myapp # Will monitor all services with this label
namespaceSelector :
matchNames :
- default
- myapp-namespace
endpoints :
- port : metrics # The service port name to scrape
path : /metrics # The metrics path (default is /metrics)
interval : 15s # Scrape interval
scrapeTimeout : 10s # Timeout for each scrape request
honorLabels : true # Keep original metric labels
metricRelabelings : # Optional transformations of metrics
- sourceLabels : [ __name__ ]
regex : 'api_http_requests_total'
action : keep
For custom metrics to work with Horizontal Pod Autoscalers, you'll need to deploy the Prometheus Adapter:
apiVersion : autoscaling/v2
kind : HorizontalPodAutoscaler
metadata :
name : myapp-hpa
spec :
scaleTargetRef :
apiVersion : apps/v1
kind : Deployment
name : myapp
minReplicas : 2
maxReplicas : 10
metrics :
- type : Pods
pods :
metric :
name : http_requests_per_second # Custom metric
target :
type : AverageValue
averageValue : 100
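The adapter is what translates Prometheus series into the custom metrics API the HPA consumes. Below is a minimal sketch of an adapter rule that could expose the http_requests_per_second metric used above; the ConfigMap name and the http_requests_total series are assumptions, and in practice the adapter is usually configured through its Helm chart values:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config   # assumed name; often managed by the Helm chart
  namespace: monitoring
data:
  config.yaml: |
    rules:
    # Expose http_requests_total as the per-second rate "http_requests_per_second"
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'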
Components of logging:
- Container logs
  - Application logs written to stdout/stderr
  - Captured by the container runtime
  - Accessible via kubectl logs
  - Limited retention and search capabilities
  - Lost when containers are restarted or deleted
- Node-level logging
  - Container logs stored on each node in /var/log/containers/
  - Kubelet manages log rotation and cleanup
  - Logs may include system components (kubelet, container runtime)
  - Limited by node storage capacity
  - Lost when nodes are terminated
- Cluster-level logging
  - Centralized logging architecture
  - Log aggregation from all nodes and components
  - Persistent storage independent of node lifecycle
  - Enables cross-node and historical log analysis
  - Required for production environments
- Log aggregation
  - Collection agents (Fluentd, Fluent Bit, Filebeat)
  - Transport mechanism (Kafka, Redis, direct HTTP)
  - Log forwarding and filtering
  - Buffering and retry mechanisms
  - Metadata enrichment (adding Kubernetes context)
- Log analysis
  - Indexing and storage (Elasticsearch)
  - Search and query capabilities
  - Visualization (Kibana, Grafana)
  - Alerting based on log patterns
  - Long-term archiving and compliance

In Kubernetes, containers should log to stdout and stderr instead of files. This example demonstrates a simple logging pattern:
apiVersion : v1
kind : Pod
metadata :
name : counter
labels :
app : counter
component : example
spec :
containers :
- name : count
image : busybox
args :
- /bin/sh
- -c
- >
i=0;
while true;
do
echo "$i: $(date)";
i=$((i+1));
sleep 1;
done
resources :
requests :
memory : "64Mi"
cpu : "100m"
limits :
memory : "128Mi"
cpu : "200m"
You can access these logs using kubectl:
# Follow logs in real-time
kubectl logs -f counter
# Show logs with timestamps
kubectl logs --timestamps counter
# Show logs for previous container instance (after restart)
kubectl logs --previous counter
# Show last 100 lines
kubectl logs --tail=100 counter
For multi-container pods, you must specify the container name:
kubectl logs counter -c count
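When an application can only write to a file inside its container, a streaming sidecar can tail that file to stdout so the logs enter the normal kubelet pipeline. A minimal sketch of this pattern; the container names and the /var/log/app path are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
  - name: app
    image: busybox
    # Hypothetical app that only writes to a file
    args:
    - /bin/sh
    - -c
    - >
      i=0;
      while true;
      do
        echo "$i: $(date)" >> /var/log/app/app.log;
        i=$((i+1));
        sleep 5;
      done
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-streamer
    image: busybox
    # Sidecar tails the file so the container runtime captures it as stdout
    args:
    - /bin/sh
    - -c
    - tail -n+1 -F /var/log/app/app.log
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  volumes:
  - name: app-logs
    emptyDir: {}

The application output is then available via kubectl logs app-with-log-sidecar -c log-streamer.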
Best practices for application logging:
- Use structured logging (JSON format)
- Include consistent metadata (request ID, user ID, etc.)
- Use appropriate log levels (DEBUG, INFO, WARN, ERROR)
- Don't write logs to files inside containers
- Consider log volume in high-traffic services

Elasticsearch is the most common backend for centralized log storage:
- Log storage: Distributed, scalable document store optimized for search
- Full-text search: Advanced search capabilities with analyzers and tokenizers
- Analytics: Real-time aggregations and analytics on log data
- Visualization: Integration with visualization tools
- Scalability: Horizontal scaling with sharding and replication
- Schema flexibility: Dynamic mapping for varying log formats
- Retention management: Index lifecycle policies for log rotation and archival

Fluentd handles log collection and forwarding:
- Log collection: Tail files and receive events from various sources
- Unified logging: Consistent format across all logging sources
- Multiple outputs: Send to multiple destinations (Elasticsearch, S3, etc.)
- Plugin architecture: Extensible with custom plugins
- Buffering: Handles log spikes and network issues
- Kubernetes integration: Native metadata enrichment
- Filtering: Parse, transform, and filter logs before forwarding
- Performance: Fluent Bit is a lightweight alternative for edge cases

Kibana provides the visualization layer:
- Log visualization: Rich visualization of log data
- Dashboard creation: Custom dashboards for different teams/use cases
- Log exploration: Interactive exploration and search of log data
- Alert management: Alerting based on log patterns and thresholds
- Reporting: Generate reports from log data
- Machine learning: Anomaly detection in log patterns
- Role-based access: Control access to different log sources

Alternatives to the EFK stack include:
- Loki: Horizontally scalable, multi-tenant log aggregation system designed for cost efficiency
- Grafana: Can visualize logs from multiple sources, including Loki and Elasticsearch
- ELK Stack: Combined Elasticsearch, Logstash, and Kibana deployment
- Datadog/New Relic/Dynatrace: Commercial observability platforms with integrated logging
- CloudWatch Logs: Native solution for AWS EKS clusters
- Google Cloud Logging: Native solution for GKE clusters
- Azure Monitor: Native solution for AKS clusters

Here's an example of a production-ready Elasticsearch deployment as part of an EFK (Elasticsearch, Fluentd, Kibana) stack:
apiVersion : apps/v1
kind : StatefulSet
metadata :
name : elasticsearch
namespace : logging
labels :
app : elasticsearch
component : logging
spec :
serviceName : elasticsearch
replicas : 3
updateStrategy :
type : RollingUpdate
selector :
matchLabels :
app : elasticsearch
template :
metadata :
labels :
app : elasticsearch
spec :
initContainers :
- name : fix-permissions
image : busybox
command : [ "sh" , "-c" , "chown -R 1000:1000 /usr/share/elasticsearch/data" ]
volumeMounts :
- name : data
mountPath : /usr/share/elasticsearch/data
- name : increase-vm-max-map
image : busybox
command : [ "sysctl" , "-w" , "vm.max_map_count=262144" ]
securityContext :
privileged : true
- name : increase-fd-ulimit
image : busybox
command : [ "sh" , "-c" , "ulimit -n 65536" ]
containers :
- name : elasticsearch
image : docker.elastic.co/elasticsearch/elasticsearch:7.15.0
env :
- name : cluster.name
value : kubernetes-logs
- name : node.name
valueFrom :
fieldRef :
fieldPath : metadata.name
- name : discovery.seed_hosts
value : "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
- name : cluster.initial_master_nodes
value : "elasticsearch-0,elasticsearch-1,elasticsearch-2"
- name : ES_JAVA_OPTS
value : "-Xms1g -Xmx1g"
resources :
limits :
cpu : 1000m
memory : 2Gi
requests :
cpu : 500m
memory : 1Gi
ports :
- containerPort : 9200
name : http
- containerPort : 9300
name : transport
volumeMounts :
- name : data
mountPath : /usr/share/elasticsearch/data
readinessProbe :
httpGet :
path : /_cluster/health
port : http
initialDelaySeconds : 20
timeoutSeconds : 5
volumeClaimTemplates :
- metadata :
name : data
spec :
accessModes : [ "ReadWriteOnce" ]
storageClassName : standard
resources :
requests :
storage : 100Gi
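The StatefulSet's serviceName and the discovery.seed_hosts addresses rely on a headless Service named elasticsearch, which is not shown above; a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
  labels:
    app: elasticsearch
spec:
  clusterIP: None        # headless: gives each pod a stable DNS name
  selector:
    app: elasticsearch
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300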
A complementary Fluentd DaemonSet to collect logs from all nodes:
apiVersion : apps/v1
kind : DaemonSet
metadata :
name : fluentd
namespace : logging
labels :
app : fluentd
component : logging
spec :
selector :
matchLabels :
app : fluentd
template :
metadata :
labels :
app : fluentd
spec :
serviceAccount : fluentd
serviceAccountName : fluentd
tolerations :
- key : node-role.kubernetes.io/master
effect : NoSchedule
containers :
- name : fluentd
image : fluent/fluentd-kubernetes-daemonset:v1.13-debian-elasticsearch7
env :
- name : FLUENT_ELASTICSEARCH_HOST
value : "elasticsearch.logging.svc.cluster.local"
- name : FLUENT_ELASTICSEARCH_PORT
value : "9200"
- name : FLUENT_ELASTICSEARCH_SCHEME
value : "http"
- name : FLUENT_ELASTICSEARCH_USER
value : "elastic"
- name : FLUENT_ELASTICSEARCH_PASSWORD
valueFrom :
secretKeyRef :
name : elasticsearch-credentials
key : password
resources :
limits :
memory : 512Mi
requests :
cpu : 100m
memory : 200Mi
volumeMounts :
- name : varlog
mountPath : /var/log
- name : varlibdockercontainers
mountPath : /var/lib/docker/containers
readOnly : true
- name : config
mountPath : /fluentd/etc/fluent.conf
subPath : fluent.conf
volumes :
- name : varlog
hostPath :
path : /var/log
- name : varlibdockercontainers
hostPath :
path : /var/lib/docker/containers
- name : config
configMap :
name : fluentd-config
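The DaemonSet mounts fluent.conf from a ConfigMap named fluentd-config that is not shown above. A minimal sketch of what it might contain; the fluentd-kubernetes-daemonset image ships a more complete default configuration, so treat this as illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    # Tail container logs written by the container runtime
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
      </parse>
    </source>

    # Enrich events with Kubernetes metadata (namespace, pod, labels)
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    # Forward everything to Elasticsearch
    <match **>
      @type elasticsearch
      host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
      port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
      scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME']}"
      user "#{ENV['FLUENT_ELASTICSEARCH_USER']}"
      password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD']}"
      logstash_format true
    </match>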
And Kibana for visualization:
apiVersion : apps/v1
kind : Deployment
metadata :
name : kibana
namespace : logging
labels :
app : kibana
component : logging
spec :
replicas : 1
selector :
matchLabels :
app : kibana
template :
metadata :
labels :
app : kibana
spec :
containers :
- name : kibana
image : docker.elastic.co/kibana/kibana:7.15.0
env :
- name : ELASTICSEARCH_HOSTS
value : http://elasticsearch:9200
resources :
limits :
cpu : 1000m
memory : 1Gi
requests :
cpu : 500m
memory : 512Mi
ports :
- containerPort : 5601
name : http
readinessProbe :
httpGet :
path : /api/status
port : http
initialDelaySeconds : 30
timeoutSeconds : 10
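Kibana also needs a Service so users (or an Ingress) can reach the UI on port 5601; a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
  labels:
    app: kibana
spec:
  selector:
    app: kibana
  ports:
  - name: http
    port: 5601
    targetPort: http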
Monitoring best practices:

- Set appropriate resource limits
  - Prevent monitoring tools from consuming excessive resources
  - Ensure monitoring components have sufficient resources
  - Monitor the monitoring system itself
  - Consider dedicated nodes for monitoring infrastructure
- Configure alerts properly
  - Define meaningful alert thresholds based on business impact
  - Implement alert severity levels (info, warning, critical)
  - Prevent alert fatigue with proper grouping and routing
  - Establish clear escalation paths and on-call procedures
  - Include runbooks with actionable resolution steps
- Use persistent storage
  - Ensure metrics data survives pod restarts
  - Use appropriate storage classes for performance
  - Size volumes appropriately for retention needs
  - Consider using remote storage for long-term metrics
- Implement retention policies
  - Define tiered retention based on metric importance
  - Downsample older metrics to reduce storage requirements
  - Archive historical data to cold storage if needed
  - Align retention with compliance requirements
- Regularly back up metrics
  - Back up critical metrics data
  - Test restoration procedures
  - Include dashboards and alert configurations in backups
  - Version control monitoring configurations
- Monitor critical components
  - Monitor both applications and infrastructure
  - Include all cluster components (etcd, API server, etc.)
  - Monitor external dependencies
  - Set up synthetic monitoring for end-user experience
  - Implement SLOs (Service Level Objectives) for key services

Prometheus alerts are defined as PrometheusRules in the Prometheus Operator:
apiVersion : monitoring.coreos.com/v1
kind : PrometheusRule
metadata :
name : high-cpu-usage
namespace : monitoring
labels :
prometheus : k8s # Label used by Prometheus to find rules
role : alert-rules
spec :
groups :
- name : cpu
rules :
- alert : HighCPUUsage
expr : sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, namespace) > 0.9
for : 5m # Must be true for this duration before firing
labels :
severity : warning
team : platform
annotations :
summary : "High CPU usage for pod {{ $labels.pod }} in namespace {{ $labels.namespace }}"
description : "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been using more than 90% of its CPU limit for 5 minutes."
runbook_url : "https://wiki.example.com/runbooks/high-cpu-usage"
dashboard_url : "https://grafana.example.com/d/pods?var-namespace={{ $labels.namespace }}&var-pod={{ $labels.pod }}"
- alert : PodCrashLooping
expr : rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5
for : 15m
labels :
severity : critical
team : app-team
annotations :
summary : "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
description : "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value | printf \" %.0f \" }} times in the last 15 minutes"
AlertManager then handles routing and notification:
apiVersion : monitoring.coreos.com/v1
kind : Alertmanager
metadata :
name : main
namespace : monitoring
spec :
replicas : 3
configSecret : alertmanager-config
---
apiVersion : v1
kind : Secret
metadata :
name : alertmanager-config
namespace : monitoring
stringData :
alertmanager.yaml : |
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
smtp_auth_username: 'alertmanager'
smtp_auth_password: 'password'
route:
group_by: ['namespace', 'alertname', 'job']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'slack-notifications'
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
- match_re:
service: ^(foo|bar)$
receiver: 'team-X-mails'
receivers:
- name: 'slack-notifications'
slack_configs:
- channel: '#alerts'
send_resolved: true
title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
text: "{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:*\n{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`\n{{ end }}{{ end }}"
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'your-pagerduty-integration-key'
send_resolved: true
- name: 'team-X-mails'
email_configs:
- to: 'team-X+alerts@example.com'
send_resolved: true
A cluster dashboard typically covers:
- Resource utilization: CPU, memory, disk, and network usage
- Performance metrics: Request latency, throughput, error rates
- Health status: Pod/node availability, readiness/liveness probe results
- Alert overview: Active alerts, recent resolutions, alert history
- Cluster capacity: Available resources, scheduling headroom
- Workload metrics: Deployment status, replica counts, scaling events
- Custom application metrics: Business-specific indicators

Example Grafana dashboard JSON for pod resources:
{
"title" : "Kubernetes Pod Resources" ,
"panels" : [
{
"title" : "CPU Usage" ,
"type" : "graph" ,
"targets" : [
{
"expr" : "sum(rate(container_cpu_usage_seconds_total{namespace= \" $namespace \" , pod= \" $pod \" , container!= \" POD \" , container!= \"\" }[5m])) by (container)" ,
"legendFormat" : "{{container}}"
}
]
},
{
"title" : "Memory Usage" ,
"type" : "graph" ,
"targets" : [
{
"expr" : "sum(container_memory_usage_bytes{namespace= \" $namespace \" , pod= \" $pod \" , container!= \" POD \" , container!= \"\" }) by (container)" ,
"legendFormat" : "{{container}}"
}
]
}
],
"templating" : {
"list" : [
{
"name" : "namespace" ,
"type" : "query" ,
"query" : "label_values(kube_pod_info, namespace)"
},
{
"name" : "pod" ,
"type" : "query" ,
"query" : "label_values(kube_pod_info{namespace= \" $namespace \" }, pod)"
}
]
}
}
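If Grafana is deployed with a dashboard-provisioning sidecar (as in the kube-prometheus-stack Helm chart), a dashboard like the one above can be loaded from a labeled ConfigMap. A sketch assuming that setup; the label key, namespace, and ConfigMap name depend on the chart values, and the data value would hold the full dashboard JSON shown above (truncated here):

apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-resources-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # label watched by the provisioning sidecar (configurable)
data:
  pod-resources.json: |
    {"title": "Kubernetes Pod Resources", "panels": []}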
Logging dashboards typically include:
- Log patterns: Visualize log volume patterns and trends
- Error rates: Track error frequency and types
- Application logs: Filter and search application-specific logs
- System logs: Monitor Kubernetes components and node-level logs
- Log correlations: Connect logs to related metrics and events
- Audit logs: Track security-relevant actions in the cluster
- Custom parsing: Extract and visualize structured log fields

Example Kibana dashboard configuration:
{
"objects" : [
{
"attributes" : {
"title" : "Kubernetes Logs Overview" ,
"panelsJSON" : "[
{
\" embeddableConfig \" : {},
\" gridData \" : {
\" h \" : 15,
\" i \" : \" 1 \" ,
\" w \" : 24,
\" x \" : 0,
\" y \" : 0
},
\" panelIndex \" : \" 1 \" ,
\" panelRefName \" : \" panel_0 \" ,
\" title \" : \" Log Volume Over Time \" ,
\" version \" : \" 7.10.0 \"
},
{
\" embeddableConfig \" : {},
\" gridData \" : {
\" h \" : 15,
\" i \" : \" 2 \" ,
\" w \" : 24,
\" x \" : 24,
\" y \" : 0
},
\" panelIndex \" : \" 2 \" ,
\" panelRefName \" : \" panel_1 \" ,
\" title \" : \" Errors by Namespace \" ,
\" version \" : \" 7.10.0 \"
}
]" ,
"timeRestore" : false ,
"kibanaSavedObjectMeta" : {
"searchSourceJSON" : "{ \" query \" :{ \" language \" : \" kuery \" , \" query \" : \"\" }, \" filter \" :[]}"
}
},
"id" : "kubernetes-logs-overview" ,
"type" : "dashboard"
}
]
}
Modern observability platforms combine metrics, logs, and traces:
- Cross-referencing: Navigate from metrics to related logs
- Service maps: Visualize service dependencies

Key resource and performance metrics to track include:
- CPU usage: Container and node CPU utilization, throttling events
- Memory consumption: RSS, working set, page faults, OOM events
- Network traffic: Bytes sent/received, packet rate, connection states
- Disk I/O: Read/write operations, latency, queue depth, errors
- Response times: Request latency percentiles (p50, p95, p99), processing time

Service-level indicators can be expressed directly as PromQL queries:

Availability: Percentage of successful requests (100% - error rate)
sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Latency : Request duration distribution and percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Error rate : Percentage of failed requests
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Throughput : Request rate per second
sum(rate(http_requests_total[5m]))
Saturation : How "full" your service is (resource utilization)
max(sum by(pod) (rate(container_cpu_usage_seconds_total[5m])) / sum by(pod) (kube_pod_container_resource_limits_cpu_cores))
Service Level Objectives (SLOs):
- Targets set for service performance and reliability
- Example: 99.9% availability over 30 days
- Error budgets: allowable downtime within the SLO

Measuring SLO compliance:
# Recording rule for availability SLO
- record: availability:slo_ratio_30d
expr: sum_over_time(availability:ratio[30d]) / count_over_time(availability:ratio[30d])
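With that recorded series in place, an alert can fire when the SLO is breached. A minimal sketch, assuming the availability:ratio recording rule already exists and a 99.9% target:

- alert: AvailabilitySLOViolation
  expr: availability:slo_ratio_30d < 0.999
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "30-day availability has dropped below the 99.9% SLO target"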
Log aggregation collects logs from all nodes and applications into a centralized system. A Fluentd DaemonSet is a common approach:
apiVersion : apps/v1
kind : DaemonSet
metadata :
name : fluentd
namespace : logging
labels :
k8s-app : fluentd-logging
spec :
selector :
matchLabels :
name : fluentd
template :
metadata :
labels :
name : fluentd
spec :
tolerations :
- key : node-role.kubernetes.io/master
effect : NoSchedule
containers :
- name : fluentd
image : fluent/fluentd:v1.13
env :
- name : FLUENT_ELASTICSEARCH_HOST
value : "elasticsearch.logging"
- name : FLUENT_ELASTICSEARCH_PORT
value : "9200"
resources :
limits :
memory : 500Mi
requests :
cpu : 100m
memory : 200Mi
volumeMounts :
- name : varlog
mountPath : /var/log
- name : varlibdockercontainers
mountPath : /var/lib/docker/containers
readOnly : true
- name : config-volume
mountPath : /fluentd/etc/fluent.conf
subPath : fluent.conf
terminationGracePeriodSeconds : 30
volumes :
- name : varlog
hostPath :
path : /var/log
- name : varlibdockercontainers
hostPath :
path : /var/lib/docker/containers
- name : config-volume
configMap :
name : fluentd-config
Structured logging greatly improves searchability and analysis:
{
"timestamp" : "2023-04-01T12:34:56.789Z" ,
"level" : "INFO" ,
"service" : "payment-service" ,
"trace_id" : "abc123def456" ,
"user_id" : "user-789" ,
"message" : "Payment processed successfully" ,
"amount" : 125.50 ,
"currency" : "USD" ,
"payment_method" : "credit_card"
}
Managing log volume is crucial for both cost and performance:
- Hot storage: Recent logs (1-7 days) in fast storage for active searching
- Warm storage: Medium-term logs (7-30 days) for less frequent access
- Cold storage: Long-term archival (30+ days) for compliance and auditing

Example Elasticsearch Index Lifecycle Policy:
{
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms" ,
"actions" : {
"rollover" : {
"max_age" : "1d" ,
"max_size" : "50gb"
},
"set_priority" : {
"priority" : 100
}
}
},
"warm" : {
"min_age" : "7d" ,
"actions" : {
"shrink" : {
"number_of_shards" : 1
},
"forcemerge" : {
"max_num_segments" : 1
},
"set_priority" : {
"priority" : 50
}
}
},
"cold" : {
"min_age" : "30d" ,
"actions" : {
"set_priority" : {
"priority" : 0
}
}
},
"delete" : {
"min_age" : "90d" ,
"actions" : {
"delete" : {}
}
}
}
}
}
Common monitoring issues:
- Resource constraints
  - Symptom: Monitoring components OOMKilled or throttled
  - Solution: Increase resource limits, enable the Vertical Pod Autoscaler
  - Prevention: Regularly check resource usage trends and scale proactively
- Storage problems
  - Symptom: PersistentVolumes filling up, failed writes
  - Solution: Expand PVs, implement proper retention policies
  - Prevention: Set up alerts for storage capacity thresholds (>80%)
- Network connectivity
  - Symptom: Failed scrapes, missing metrics, connection timeouts
  - Solution: Check network policies, DNS resolution, firewall rules
  - Prevention: Monitor network latency and error rates between components
- Configuration errors
  - Symptom: Invalid scrape configs, dashboard errors, alert rule syntax errors
  - Solution: Validate configurations, use linters, implement CI/CD checks
  - Prevention: Version control monitoring configurations, implement change management
- Data retention
  - Symptom: Missing historical data, excessive disk usage, slow queries
  - Solution: Optimize retention periods, implement data downsampling
  - Prevention: Configure appropriate retention policies based on data importance
- Cardinality explosion
  - Symptom: High memory usage in Prometheus, slow queries, OOM issues
  - Solution: Limit label cardinality, use recording rules, optimize queries
  - Prevention: Review metric naming and labeling practices, monitor cardinality growth
- Certificate expiration
  - Symptom: TLS errors in monitoring components
  - Solution: Renew certificates, implement automated certificate management
  - Prevention: Set up alerts for certificate expiration dates

Service meshes add a further layer of observability. Istio metrics provide detailed service-to-service communication data:
# Example Prometheus scrape config for Istio
scrape_configs :
- job_name : 'istio-mesh'
kubernetes_sd_configs :
- role : endpoints
relabel_configs :
- source_labels : [ __meta_kubernetes_service_name , __meta_kubernetes_endpoint_port_name ]
action : keep
regex : istio-telemetry;prometheus
Traffic monitoring: Request volume, success rates, latency between services
Service graphs: Visual representation of service dependencies and health
# Grafana dashboard query for service dependency
sum(rate(istio_requests_total{reporter="source"}[5m])) by (source_workload, destination_workload)
Distributed tracing : End-to-end request tracing with Jaeger or Zipkin
# Jaeger deployment example
apiVersion : apps/v1
kind : Deployment
metadata :
name : jaeger
namespace : tracing
spec :
selector :
matchLabels :
app : jaeger
template :
metadata :
labels :
app : jaeger
spec :
containers :
- name : jaeger
image : jaegertracing/all-in-one:1.24
ports :
- containerPort : 16686
name : ui
- containerPort : 14268
name : collector
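A Service exposing the Jaeger UI and collector ports makes the all-in-one deployment reachable inside the cluster; a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: tracing
spec:
  selector:
    app: jaeger
  ports:
  - name: ui
    port: 16686
  - name: collector
    port: 14268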
Application metrics : Custom instrumentation of application code
// Go example with Prometheus client
httpRequestsTotal := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests by status code and method",
},
[]string{"status_code", "method"},
)
prometheus.MustRegister(httpRequestsTotal)
// In HTTP handler
httpRequestsTotal.WithLabelValues(strconv.Itoa(statusCode), method).Inc()
Business metrics : KPIs specific to your domain (orders, users, etc.)
# Python example
from prometheus_client import Counter
payments_total = Counter('payments_total', 'Total payments processed',
['status', 'payment_method'])
# When processing a payment
payments_total.labels(status='success', payment_method='credit_card').inc()
External metrics : Metrics from external systems (databases, APIs, etc.)
# Example exporter for MySQL
apiVersion : apps/v1
kind : Deployment
metadata :
name : mysql-exporter
spec :
selector :
matchLabels :
app : mysql-exporter
template :
metadata :
labels :
app : mysql-exporter
spec :
containers :
- name : exporter
image : prom/mysqld-exporter:v0.13.0
env :
- name : DATA_SOURCE_NAME
valueFrom :
secretKeyRef :
name : mysql-credentials
key : data-source-name
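For Prometheus to scrape the exporter, its metrics port (9104 by default for mysqld-exporter) must be exposed through a Service and, when using the Prometheus Operator, selected by a ServiceMonitor. A sketch with assumed names and labels:

apiVersion: v1
kind: Service
metadata:
  name: mysql-exporter
  labels:
    app: mysql-exporter
spec:
  selector:
    app: mysql-exporter
  ports:
  - name: metrics
    port: 9104
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mysql-exporter
  labels:
    release: prometheus     # must match the Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: mysql-exporter
  endpoints:
  - port: metrics
    interval: 30s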
Synthetic monitoring : Simulated user interactions to test availability
# Blackbox exporter for probing endpoints
apiVersion : monitoring.coreos.com/v1
kind : PrometheusRule
metadata :
name : synthetic-monitoring
spec :
groups :
- name : synthetic
rules :
- alert : EndpointDown
expr : probe_success{job="blackbox"} == 0
for : 5m
labels :
severity : critical
annotations :
summary : "Endpoint {{ $labels.instance }} is down"
Auto-scaling : HPA, VPA, and Cluster Autoscaler for dynamic resource allocation
# Custom metrics HPA example
apiVersion : autoscaling/v2
kind : HorizontalPodAutoscaler
metadata :
name : api-hpa
spec :
scaleTargetRef :
apiVersion : apps/v1
kind : Deployment
name : api
minReplicas : 2
maxReplicas : 10
metrics :
- type : Pods
pods :
metric :
name : http_requests_per_second
target :
type : AverageValue
averageValue : 500
Self-healing : Automated remediation based on monitoring signals
# Example of Kubernetes Event-driven Autoscaling (KEDA)
apiVersion : keda.sh/v1alpha1
kind : ScaledObject
metadata :
name : queue-consumer
spec :
scaleTargetRef :
name : consumer
triggers :
- type : prometheus
metadata :
serverAddress : http://prometheus.monitoring:9090
metricName : queue_depth
threshold : "10"
query : sum(rabbitmq_queue_messages_ready{queue="tasks"})
Capacity planning : Predictive scaling based on historical patterns
# Prometheus recording rule for capacity planning
apiVersion : monitoring.coreos.com/v1
kind : PrometheusRule
metadata :
name : capacity-planning
spec :
groups :
- name : capacity
rules :
- record : capacity:cpu:prediction:next_7d
expr : predict_linear(avg_over_time(cluster:cpu:usage:ratio[30d])[60d:1h], 86400 * 7)
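The prediction above assumes a cluster:cpu:usage:ratio recording rule that tracks overall CPU utilization. A sketch of one possible definition; exact metric names vary with the kube-state-metrics version:

- record : cluster:cpu:usage:ratio
  expr : sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) / sum(kube_node_status_allocatable{resource="cpu"})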
Performance optimization : Automated tuning based on observed metrics
# Example of a performance tuning profile (OpenShift PerformanceProfile)
apiVersion : performance.openshift.io/v2
kind : PerformanceProfile
metadata :
name : performance
spec :
additionalKernelArgs :
- "nmi_watchdog=0"
- "audit=0"
- "processor.max_cstate=1"
cpu :
isolated : "1-3"
reserved : "0"
hugepages :
defaultHugepagesSize : "1G"
pages :
- size : "1G"
count : 4
Finally, some overall best practices for Kubernetes monitoring and logging:

- Monitor cluster health
  - Set up comprehensive monitoring for all Kubernetes components
  - Monitor control plane components (API server, etcd, scheduler)
  - Track node health metrics (kubelet, container runtime)
  - Use dedicated tools for Kubernetes-specific monitoring
- Implement log rotation
  - Configure log rotation policies to prevent disk exhaustion
  - Set appropriate retention periods based on importance
  - Consider regulatory requirements for log retention
  - Implement automated archiving for long-term storage
- Set up alerting
  - Define meaningful alerts with clear thresholds
  - Implement different severity levels and routing
  - Create actionable alerts with runbook links
  - Avoid alert fatigue with proper grouping and silencing
  - Test alert delivery and escalation paths regularly
- Use persistent storage
  - Store metrics and logs on reliable persistent volumes
  - Size storage appropriately based on retention needs
  - Monitor storage usage and set up alerts for capacity thresholds
  - Use storage classes appropriate for the workload (SSD for hot data)
- Take regular backups
  - Back up monitoring and logging configurations
  - Include dashboards, alert rules, and custom scripts
  - Test restoration procedures periodically
  - Consider disaster recovery scenarios
- Monitor applications
  - Instrument applications with custom metrics
  - Track business-relevant indicators
  - Implement health checks and readiness probes
  - Monitor external dependencies
  - Create application-specific dashboards
- Track resource usage
  - Monitor CPU, memory, disk, and network utilization
  - Set up resource requests and limits based on actual usage
  - Identify resource bottlenecks and optimization opportunities
  - Implement cost allocation and showback
- Implement security monitoring
  - Monitor for suspicious activities and policy violations
  - Track authentication and authorization events
  - Implement audit logging for sensitive operations (see the sketch after this list)
  - Scan for vulnerabilities and misconfigurations
  - Set up alerts for security-related events
- Track performance
  - Define and track SLIs (Service Level Indicators)
  - Establish SLOs (Service Level Objectives) for critical services
  - Monitor latency, error rates, and throughput
  - Implement distributed tracing for complex applications
  - Conduct regular performance reviews
- Plan capacity
  - Track growth trends and forecast future needs
  - Implement predictive analysis for capacity requirements
  - Set up automated scaling mechanisms
  - Plan for seasonal variations in workload
  - Balance resource efficiency with performance requirements
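As referenced in the security monitoring item above, audit logging is configured through an API server audit policy. A minimal sketch that records metadata for Secret access and full request bodies for workload changes; the rule scope is illustrative:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record who accessed Secrets, but not the secret contents
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]
# Record full request bodies for changes to workloads
- level: Request
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: "apps"
    resources: ["deployments", "statefulsets", "daemonsets"]
# Everything else at the Metadata level
- level: Metadata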