Effective monitoring and logging are crucial for maintaining healthy Kubernetes clusters and applications. A comprehensive monitoring and logging strategy provides visibility into cluster health and application performance, and helps you troubleshoot issues before they affect end users.
Kubernetes observability consists of three main pillars:
- Monitoring: Collecting and analyzing metrics about system performance and behavior
- Logging: Capturing and storing event data and messages from applications and system components
- Tracing: Following the path of requests through distributed systems to identify bottlenecks

The Metrics Server provides the built-in, lightweight option:
- Basic cluster metrics: Lightweight, in-memory metrics collector
- CPU and memory usage: Resource utilization for pods and nodes
- Pod and node metrics: Key metrics for orchestration decisions
- Horizontal Pod Autoscaling: Provides metrics for the HPA controller
- Simple API: Accessible via the Kubernetes API (metrics.k8s.io)
- No historical data: Only provides current state, not long-term analysis

Prometheus is the most common choice for full-featured metrics collection:
- Time-series database: Purpose-built database for metrics with timestamps
- Metrics collection: Pull-based architecture with service discovery
- Query language (PromQL): Powerful language for metrics analysis and aggregation
- Alert management: Define alert conditions and routing
- Federation: Scale to large deployments by federating multiple Prometheus servers
- Service discovery: Auto-discovers targets via the Kubernetes API
- Extensive integrations: Wide ecosystem of exporters for various systems

Grafana is the standard visualization layer on top of Prometheus:
- Visualization platform: Create rich, interactive dashboards
- Dashboard creation: Drag-and-drop interface with extensive customization
- Metrics exploration: Query and explore metrics in real time
- Alert notification: Define alerts based on metrics thresholds
- Annotation: Mark events on time-series graphs
- Multi-datasource: Connect to Prometheus, Elasticsearch, and many other data sources
- Template variables: Create dynamic, reusable dashboards

Below is a simplified Prometheus deployment. In production environments, you would typically use the Prometheus Operator or a solution like the Kube Prometheus Stack.
apiVersion : apps/v1
kind : Deployment
metadata :
name : prometheus
namespace : monitoring
labels :
app : prometheus
component : core
spec :
replicas : 1
selector :
matchLabels :
app : prometheus
component : core
template :
metadata :
labels :
app : prometheus
component : core
spec :
serviceAccountName : prometheus # Must have permissions to access API server
containers :
- name : prometheus
image : prom/prometheus:v2.30.3
args :
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=15d"
- "--web.console.libraries=/etc/prometheus/console_libraries"
- "--web.console.templates=/etc/prometheus/consoles"
ports :
- containerPort : 9090
name : web
volumeMounts :
- name : config-volume
mountPath : /etc/prometheus
- name : prometheus-data
mountPath : /prometheus
resources :
requests :
cpu : 500m
memory : 500Mi
limits :
cpu : 1
memory : 1Gi
volumes :
- name : config-volume
configMap :
name : prometheus-config
- name : prometheus-data
persistentVolumeClaim :
claimName : prometheus-data
The Prometheus configuration would be stored in a ConfigMap:
apiVersion : v1
kind : ConfigMap
metadata :
name : prometheus-config
namespace : monitoring
data :
prometheus.yml : |
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
The kubectl top command provides a quick way to view resource usage for pods:
kubectl top pod -n default
Output example:
NAME CPU(cores) MEMORY(bytes)
nginx-85b98978db-2cgnj 1m 9Mi
redis-78586f566b-jz5jq 1m 10Mi
webapp-7d7ffd5cc9-nv2xz 10m 45Mi
View resource utilization across all nodes in the cluster:
kubectl top nodes
Output example:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
worker-node-1 248m 12% 1867Mi 24%
worker-node-2 134m 6% 2507Mi 32%
worker-node-3 324m 16% 2120Mi 27%
The Prometheus Operator uses ServiceMonitor CRDs to define which services should be monitored:
apiVersion : monitoring.coreos.com/v1
kind : ServiceMonitor
metadata :
name : app-metrics
namespace : monitoring
labels :
release : prometheus # Used by Prometheus Operator for selection
spec :
selector :
matchLabels :
app : myapp # Will monitor all services with this label
namespaceSelector :
matchNames :
- default
- myapp-namespace
endpoints :
- port : metrics # The service port name to scrape
path : /metrics # The metrics path (default is /metrics)
interval : 15s # Scrape interval
scrapeTimeout : 10s # Timeout for each scrape request
honorLabels : true # Keep original metric labels
metricRelabelings : # Optional transformations of metrics
- sourceLabels : [ __name__ ]
regex : 'api_http_requests_total'
action : keep
For custom metrics to work with Horizontal Pod Autoscalers, you'll need to deploy the Prometheus Adapter:
apiVersion : autoscaling/v2
kind : HorizontalPodAutoscaler
metadata :
name : myapp-hpa
spec :
scaleTargetRef :
apiVersion : apps/v1
kind : Deployment
name : myapp
minReplicas : 2
maxReplicas : 10
metrics :
- type : Pods
pods :
metric :
name : http_requests_per_second # Custom metric
target :
type : AverageValue
averageValue : 100
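The adapter is what translates Prometheus series into the custom metrics API the HPA consumes. Below is a minimal sketch of an adapter rule that could expose the http_requests_per_second metric used above; the ConfigMap name and the http_requests_total series are assumptions, and in practice the adapter is usually configured through its Helm chart values:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config   # assumed name; often managed by the Helm chart
  namespace: monitoring
data:
  config.yaml: |
    rules:
    # Expose http_requests_total as the per-second rate "http_requests_per_second"
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'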
Components of logging:
- Container logs
  - Application logs written to stdout/stderr
  - Captured by the container runtime
  - Accessible via kubectl logs
  - Limited retention and search capabilities
  - Lost when containers are restarted or deleted
- Node-level logging
  - Container logs stored on each node in /var/log/containers/
  - Kubelet manages log rotation and cleanup
  - Logs may include system components (kubelet, container runtime)
  - Limited by node storage capacity
  - Lost when nodes are terminated
- Cluster-level logging
  - Centralized logging architecture
  - Log aggregation from all nodes and components
  - Persistent storage independent of node lifecycle
  - Enables cross-node and historical log analysis
  - Required for production environments
- Log aggregation
  - Collection agents (Fluentd, Fluent Bit, Filebeat)
  - Transport mechanism (Kafka, Redis, direct HTTP)
  - Log forwarding and filtering
  - Buffering and retry mechanisms
  - Metadata enrichment (adding Kubernetes context)
- Log analysis
  - Indexing and storage (Elasticsearch)
  - Search and query capabilities
  - Visualization (Kibana, Grafana)
  - Alerting based on log patterns
  - Long-term archiving and compliance

In Kubernetes, containers should log to stdout and stderr instead of files. This example demonstrates a simple logging pattern:
apiVersion : v1
kind : Pod
metadata :
name : counter
labels :
app : counter
component : example
spec :
containers :
- name : count
image : busybox
args :
- /bin/sh
- -c
- >
i=0;
while true;
do
echo "$i: $(date)";
i=$((i+1));
sleep 1;
done
resources :
requests :
memory : "64Mi"
cpu : "100m"
limits :
memory : "128Mi"
cpu : "200m"
You can access these logs using kubectl:
# Follow logs in real-time
kubectl logs -f counter
# Show logs with timestamps
kubectl logs --timestamps counter
# Show logs for previous container instance (after restart)
kubectl logs --previous counter
# Show last 100 lines
kubectl logs --tail=100 counter
For multi-container pods, you must specify the container name:
kubectl logs counter -c count
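When an application can only write to a file inside its container, a streaming sidecar can tail that file to stdout so the logs enter the normal kubelet pipeline. A minimal sketch of this pattern; the container names and the /var/log/app path are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
  - name: app
    image: busybox
    # Hypothetical app that only writes to a file
    args:
    - /bin/sh
    - -c
    - >
      i=0;
      while true;
      do
        echo "$i: $(date)" >> /var/log/app/app.log;
        i=$((i+1));
        sleep 5;
      done
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-streamer
    image: busybox
    # Sidecar tails the file so the container runtime captures it as stdout
    args:
    - /bin/sh
    - -c
    - tail -n+1 -F /var/log/app/app.log
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  volumes:
  - name: app-logs
    emptyDir: {}

The application output is then available via kubectl logs app-with-log-sidecar -c log-streamer.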
Best practices for application logging:
- Use structured logging (JSON format)
- Include consistent metadata (request ID, user ID, etc.)
- Use appropriate log levels (DEBUG, INFO, WARN, ERROR)
- Don't write logs to files inside containers
- Consider log volume in high-traffic services

Elasticsearch is the most common backend for centralized log storage:
- Log storage: Distributed, scalable document store optimized for search
- Full-text search: Advanced search capabilities with analyzers and tokenizers
- Analytics: Real-time aggregations and analytics on log data
- Visualization: Integration with visualization tools
- Scalability: Horizontal scaling with sharding and replication
- Schema flexibility: Dynamic mapping for varying log formats
- Retention management: Index lifecycle policies for log rotation and archival

Fluentd handles log collection and forwarding:
- Log collection: Tail files and receive events from various sources
- Unified logging: Consistent format across all logging sources
- Multiple outputs: Send to multiple destinations (Elasticsearch, S3, etc.)
- Plugin architecture: Extensible with custom plugins
- Buffering: Handles log spikes and network issues
- Kubernetes integration: Native metadata enrichment
- Filtering: Parse, transform, and filter logs before forwarding
- Performance: Fluent Bit is a lightweight alternative for edge cases

Kibana provides the visualization layer:
- Log visualization: Rich visualization of log data
- Dashboard creation: Custom dashboards for different teams/use cases
- Log exploration: Interactive exploration and search of log data
- Alert management: Alerting based on log patterns and thresholds
- Reporting: Generate reports from log data
- Machine learning: Anomaly detection in log patterns
- Role-based access: Control access to different log sources

Alternatives to the EFK stack include:
- Loki: Horizontally scalable, multi-tenant log aggregation system designed for cost efficiency
- Grafana: Can visualize logs from multiple sources, including Loki and Elasticsearch
- ELK Stack: Combined Elasticsearch, Logstash, and Kibana deployment
- Datadog/New Relic/Dynatrace: Commercial observability platforms with integrated logging
- CloudWatch Logs: Native solution for AWS EKS clusters
- Google Cloud Logging: Native solution for GKE clusters
- Azure Monitor: Native solution for AKS clusters

Here's an example of a production-ready Elasticsearch deployment as part of an EFK (Elasticsearch, Fluentd, Kibana) stack:
apiVersion : apps/v1
kind : StatefulSet
metadata :
name : elasticsearch
namespace : logging
labels :
app : elasticsearch
component : logging
spec :
serviceName : elasticsearch
replicas : 3
updateStrategy :
type : RollingUpdate
selector :
matchLabels :
app : elasticsearch
template :
metadata :
labels :
app : elasticsearch
spec :
initContainers :
- name : fix-permissions
image : busybox
command : [ "sh" , "-c" , "chown -R 1000:1000 /usr/share/elasticsearch/data" ]
volumeMounts :
- name : data
mountPath : /usr/share/elasticsearch/data
- name : increase-vm-max-map
image : busybox
command : [ "sysctl" , "-w" , "vm.max_map_count=262144" ]
securityContext :
privileged : true
- name : increase-fd-ulimit
image : busybox
command : [ "sh" , "-c" , "ulimit -n 65536" ]
containers :
- name : elasticsearch
image : docker.elastic.co/elasticsearch/elasticsearch:7.15.0
env :
- name : cluster.name
value : kubernetes-logs
- name : node.name
valueFrom :
fieldRef :
fieldPath : metadata.name
- name : discovery.seed_hosts
value : "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
- name : cluster.initial_master_nodes
value : "elasticsearch-0,elasticsearch-1,elasticsearch-2"
- name : ES_JAVA_OPTS
value : "-Xms1g -Xmx1g"
resources :
limits :
cpu : 1000m
memory : 2Gi
requests :
cpu : 500m
memory : 1Gi
ports :
- containerPort : 9200
name : http
- containerPort : 9300
name : transport
volumeMounts :
- name : data
mountPath : /usr/share/elasticsearch/data
readinessProbe :
httpGet :
path : /_cluster/health
port : http
initialDelaySeconds : 20
timeoutSeconds : 5
volumeClaimTemplates :
- metadata :
name : data
spec :
accessModes : [ "ReadWriteOnce" ]
storageClassName : standard
resources :
requests :
storage : 100Gi
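The StatefulSet's serviceName and the discovery.seed_hosts addresses rely on a headless Service named elasticsearch, which is not shown above; a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
  labels:
    app: elasticsearch
spec:
  clusterIP: None        # headless: gives each pod a stable DNS name
  selector:
    app: elasticsearch
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300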
A complementary Fluentd DaemonSet to collect logs from all nodes:
apiVersion : apps/v1
kind : DaemonSet
metadata :
name : fluentd
namespace : logging
labels :
app : fluentd
component : logging
spec :
selector :
matchLabels :
app : fluentd
template :
metadata :
labels :
app : fluentd
spec :
serviceAccount : fluentd
serviceAccountName : fluentd
tolerations :
- key : node-role.kubernetes.io/master
effect : NoSchedule
containers :
- name : fluentd
image : fluent/fluentd-kubernetes-daemonset:v1.13-debian-elasticsearch7
env :
- name : FLUENT_ELASTICSEARCH_HOST
value : "elasticsearch.logging.svc.cluster.local"
- name : FLUENT_ELASTICSEARCH_PORT
value : "9200"
- name : FLUENT_ELASTICSEARCH_SCHEME
value : "http"
- name : FLUENT_ELASTICSEARCH_USER
value : "elastic"
- name : FLUENT_ELASTICSEARCH_PASSWORD
valueFrom :
secretKeyRef :
name : elasticsearch-credentials
key : password
resources :
limits :
memory : 512Mi
requests :
cpu : 100m
memory : 200Mi
volumeMounts :
- name : varlog
mountPath : /var/log
- name : varlibdockercontainers
mountPath : /var/lib/docker/containers
readOnly : true
- name : config
mountPath : /fluentd/etc/fluent.conf
subPath : fluent.conf
volumes :
- name : varlog
hostPath :
path : /var/log
- name : varlibdockercontainers
hostPath :
path : /var/lib/docker/containers
- name : config
configMap :
name : fluentd-config
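The DaemonSet mounts fluent.conf from a ConfigMap named fluentd-config that is not shown above. A minimal sketch of what it might contain; the fluentd-kubernetes-daemonset image ships a more complete default configuration, so treat this as illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    # Tail container logs written by the container runtime
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
      </parse>
    </source>

    # Enrich events with Kubernetes metadata (namespace, pod, labels)
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    # Forward everything to Elasticsearch
    <match **>
      @type elasticsearch
      host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
      port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
      scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME']}"
      user "#{ENV['FLUENT_ELASTICSEARCH_USER']}"
      password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD']}"
      logstash_format true
    </match>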
And Kibana for visualization:
apiVersion : apps/v1
kind : Deployment
metadata :
name : kibana
namespace : logging
labels :
app : kibana
component : logging
spec :
replicas : 1
selector :
matchLabels :
app : kibana
template :
metadata :
labels :
app : kibana
spec :
containers :
- name : kibana
image : docker.elastic.co/kibana/kibana:7.15.0
env :
- name : ELASTICSEARCH_HOSTS
value : http://elasticsearch:9200
resources :
limits :
cpu : 1000m
memory : 1Gi
requests :
cpu : 500m
memory : 512Mi
ports :
- containerPort : 5601
name : http
readinessProbe :
httpGet :
path : /api/status
port : http
initialDelaySeconds : 30
timeoutSeconds : 10
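Kibana also needs a Service so users (or an Ingress) can reach the UI on port 5601; a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
  labels:
    app: kibana
spec:
  selector:
    app: kibana
  ports:
  - name: http
    port: 5601
    targetPort: http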
Monitoring best practices:

- Set appropriate resource limits
  - Prevent monitoring tools from consuming excessive resources
  - Ensure monitoring components have sufficient resources
  - Monitor the monitoring system itself
  - Consider dedicated nodes for monitoring infrastructure
- Configure alerts properly
  - Define meaningful alert thresholds based on business impact
  - Implement alert severity levels (info, warning, critical)
  - Prevent alert fatigue with proper grouping and routing
  - Establish clear escalation paths and on-call procedures
  - Include runbooks with actionable resolution steps
- Use persistent storage
  - Ensure metrics data survives pod restarts
  - Use appropriate storage classes for performance
  - Size volumes appropriately for retention needs
  - Consider using remote storage for long-term metrics
- Implement retention policies
  - Define tiered retention based on metric importance
  - Downsample older metrics to reduce storage requirements
  - Archive historical data to cold storage if needed
  - Align retention with compliance requirements
- Regularly back up metrics
  - Back up critical metrics data
  - Test restoration procedures
  - Include dashboards and alert configurations in backups
  - Version control monitoring configurations
- Monitor critical components
  - Monitor both applications and infrastructure
  - Include all cluster components (etcd, API server, etc.)
  - Monitor external dependencies
  - Set up synthetic monitoring for end-user experience
  - Implement SLOs (Service Level Objectives) for key services

Prometheus alerts are defined as PrometheusRules in the Prometheus Operator:
apiVersion : monitoring.coreos.com/v1
kind : PrometheusRule
metadata :
name : high-cpu-usage
namespace : monitoring
labels :
prometheus : k8s # Label used by Prometheus to find rules
role : alert-rules
spec :
groups :
- name : cpu
rules :
- alert : HighCPUUsage
expr : sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, namespace) > 0.9
for : 5m # Must be true for this duration before firing
labels :
severity : warning
team : platform
annotations :
summary : "High CPU usage for pod {{ $labels.pod }} in namespace {{ $labels.namespace }}"
description : "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been using more than 90% of its CPU limit for 5 minutes."
runbook_url : "https://wiki.example.com/runbooks/high-cpu-usage"
dashboard_url : "https://grafana.example.com/d/pods?var-namespace={{ $labels.namespace }}&var-pod={{ $labels.pod }}"
- alert : PodCrashLooping
expr : rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5
for : 15m
labels :
severity : critical
team : app-team
annotations :
summary : "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
description : "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value | printf \" %.0f \" }} times in the last 15 minutes"
AlertManager then handles routing and notification:
apiVersion : monitoring.coreos.com/v1
kind : Alertmanager
metadata :
name : main
namespace : monitoring
spec :
replicas : 3
configSecret : alertmanager-config
---
apiVersion : v1
kind : Secret
metadata :
name : alertmanager-config
namespace : monitoring
stringData :
alertmanager.yaml : |
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
smtp_auth_username: 'alertmanager'
smtp_auth_password: 'password'
route:
group_by: ['namespace', 'alertname', 'job']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'slack-notifications'
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
- match_re:
service: ^(foo|bar)$
receiver: 'team-X-mails'
receivers:
- name: 'slack-notifications'
slack_configs:
- channel: '#alerts'
send_resolved: true
title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
text: "{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:*\n{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`\n{{ end }}{{ end }}"
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'your-pagerduty-integration-key'
send_resolved: true
- name: 'team-X-mails'
email_configs:
- to: 'team-X+alerts@example.com'
send_resolved: true
A cluster dashboard typically covers:
- Resource utilization: CPU, memory, disk, and network usage
- Performance metrics: Request latency, throughput, error rates
- Health status: Pod/node availability, readiness/liveness probe results
- Alert overview: Active alerts, recent resolutions, alert history
- Cluster capacity: Available resources, scheduling headroom
- Workload metrics: Deployment status, replica counts, scaling events
- Custom application metrics: Business-specific indicators

Example Grafana dashboard JSON for pod resources:
{
"title" : "Kubernetes Pod Resources" ,
"panels" : [
{
"title" : "CPU Usage" ,
"type" : "graph" ,
"targets" : [
{
"expr" : "sum(rate(container_cpu_usage_seconds_total{namespace= \" $namespace \" , pod= \" $pod \" , container!= \" POD \" , container!= \"\" }[5m])) by (container)" ,
"legendFormat" : "{{container}}"
}
]
},
{
"title" : "Memory Usage" ,
"type" : "graph" ,
"targets" : [
{
"expr" : "sum(container_memory_usage_bytes{namespace= \" $namespace \" , pod= \" $pod \" , container!= \" POD \" , container!= \"\" }) by (container)" ,
"legendFormat" : "{{container}}"
}
]
}
],
"templating" : {
"list" : [
{
"name" : "namespace" ,
"type" : "query" ,
"query" : "label_values(kube_pod_info, namespace)"
},
{
"name" : "pod" ,
"type" : "query" ,
"query" : "label_values(kube_pod_info{namespace= \" $namespace \" }, pod)"
}
]
}
}
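If Grafana is deployed with a dashboard-provisioning sidecar (as in the kube-prometheus-stack Helm chart), a dashboard like the one above can be loaded from a labeled ConfigMap. A sketch assuming that setup; the label key, namespace, and ConfigMap name depend on the chart values, and the data value would hold the full dashboard JSON shown above (truncated here):

apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-resources-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # label watched by the provisioning sidecar (configurable)
data:
  pod-resources.json: |
    {"title": "Kubernetes Pod Resources", "panels": []}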
Logging dashboards typically include:
- Log patterns: Visualize log volume patterns and trends
- Error rates: Track error frequency and types
- Application logs: Filter and search application-specific logs
- System logs: Monitor Kubernetes components and node-level logs
- Log correlations: Connect logs to related metrics and events
- Audit logs: Track security-relevant actions in the cluster
- Custom parsing: Extract and visualize structured log fields

Example Kibana dashboard configuration:
{
"objects" : [
{
"attributes" : {
"title" : "Kubernetes Logs Overview" ,
"panelsJSON" : "[
{
\" embeddableConfig \" : {},
\" gridData \" : {
\" h \" : 15,
\" i \" : \" 1 \" ,
\" w \" : 24,
\" x \" : 0,
\" y \" : 0
},
\" panelIndex \" : \" 1 \" ,
\" panelRefName \" : \" panel_0 \" ,
\" title \" : \" Log Volume Over Time \" ,
\" version \" : \" 7.10.0 \"
},
{
\" embeddableConfig \" : {},
\" gridData \" : {
\" h \" : 15,
\" i \" : \" 2 \" ,
\" w \" : 24,
\" x \" : 24,
\" y \" : 0
},
\" panelIndex \" : \" 2 \" ,
\" panelRefName \" : \" panel_1 \" ,
\" title \" : \" Errors by Namespace \" ,
\" version \" : \" 7.10.0 \"
}
]" ,
"timeRestore" : false ,
"kibanaSavedObjectMeta" : {
"searchSourceJSON" : "{ \" query \" :{ \" language \" : \" kuery \" , \" query \" : \"\" }, \" filter \" :[]}"
}
},
"id" : "kubernetes-logs-overview" ,
"type" : "dashboard"
}
]
}
Modern observability platforms combine metrics, logs, and traces:
- Cross-referencing: Navigate from metrics to related logs
- Service maps: Visualize service dependencies

Key resource and performance metrics to track include:
- CPU usage: Container and node CPU utilization, throttling events
- Memory consumption: RSS, working set, page faults, OOM events
- Network traffic: Bytes sent/received, packet rate, connection states
- Disk I/O: Read/write operations, latency, queue depth, errors
- Response times: Request latency percentiles (p50, p95, p99), processing time

Service-level indicators can be expressed directly as PromQL queries:

Availability: Percentage of successful requests (100% - error rate)
sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Latency : Request duration distribution and percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Error rate : Percentage of failed requests
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Throughput : Request rate per second
sum(rate(http_requests_total[5m]))
Saturation : How "full" your service is (resource utilization)
max(sum by(pod) (rate(container_cpu_usage_seconds_total[5m])) / sum by(pod) (kube_pod_container_resource_limits_cpu_cores))
Service Level Objectives (SLOs):
- Targets set for service performance and reliability
- Example: 99.9% availability over 30 days
- Error budgets: allowable downtime within the SLO

Measuring SLO compliance:
# Recording rule for availability SLO
- record: availability:slo_ratio_30d
expr: sum_over_time(availability:ratio[30d]) / count_over_time(availability:ratio[30d])
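With that recorded series in place, an alert can fire when the SLO is breached. A minimal sketch, assuming the availability:ratio recording rule already exists and a 99.9% target:

- alert: AvailabilitySLOViolation
  expr: availability:slo_ratio_30d < 0.999
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "30-day availability has dropped below the 99.9% SLO target"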
Log aggregation collects logs from all nodes and applications into a centralized system. A Fluentd DaemonSet is a common approach:
apiVersion : apps/v1
kind : DaemonSet
metadata :
name : fluentd
namespace : logging
labels :
k8s-app : fluentd-logging
spec :
selector :
matchLabels :
name : fluentd
template :
metadata :
labels :
name : fluentd
spec :
tolerations :
- key : node-role.kubernetes.io/master
effect : NoSchedule
containers :
- name : fluentd
image : fluent/fluentd:v1.13
env :
- name : FLUENT_ELASTICSEARCH_HOST
value : "elasticsearch.logging"
- name : FLUENT_ELASTICSEARCH_PORT
value : "9200"
resources :
limits :
memory : 500Mi
requests :
cpu : 100m
memory : 200Mi
volumeMounts :
- name : varlog
mountPath : /var/log
- name : varlibdockercontainers
mountPath : /var/lib/docker/containers
readOnly : true
- name : config-volume
mountPath : /fluentd/etc/fluent.conf
subPath : fluent.conf
terminationGracePeriodSeconds : 30
volumes :
- name : varlog
hostPath :
path : /var/log
- name : varlibdockercontainers
hostPath :
path : /var/lib/docker/containers
- name : config-volume
configMap :
name : fluentd-config
Structured logging greatly improves searchability and analysis:
{
"timestamp" : "2023-04-01T12:34:56.789Z" ,
"level" : "INFO" ,
"service" : "payment-service" ,
"trace_id" : "abc123def456" ,
"user_id" : "user-789" ,
"message" : "Payment processed successfully" ,
"amount" : 125.50 ,
"currency" : "USD" ,
"payment_method" : "credit_card"
}
Managing log volume is crucial for both cost and performance:
- Hot storage: Recent logs (1-7 days) in fast storage for active searching
- Warm storage: Medium-term logs (7-30 days) for less frequent access
- Cold storage: Long-term archival (30+ days) for compliance and auditing

Example Elasticsearch Index Lifecycle Policy:
{
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms" ,
"actions" : {
"rollover" : {
"max_age" : "1d" ,
"max_size" : "50gb"
},
"set_priority" : {
"priority" : 100
}
}
},
"warm" : {
"min_age" : "7d" ,
"actions" : {
"shrink" : {
"number_of_shards" : 1
},
"forcemerge" : {
"max_num_segments" : 1
},
"set_priority" : {
"priority" : 50
}
}
},
"cold" : {
"min_age" : "30d" ,
"actions" : {
"set_priority" : {
"priority" : 0
}
}
},
"delete" : {
"min_age" : "90d" ,
"actions" : {
"delete" : {}
}
}
}
}
}
Common monitoring issues:
- Resource constraints
  - Symptom: Monitoring components OOMKilled or throttled
  - Solution: Increase resource limits, enable the Vertical Pod Autoscaler
  - Prevention: Regularly check resource usage trends and scale proactively
- Storage problems
  - Symptom: PersistentVolumes filling up, failed writes
  - Solution: Expand PVs, implement proper retention policies
  - Prevention: Set up alerts for storage capacity thresholds (>80%)
- Network connectivity
  - Symptom: Failed scrapes, missing metrics, connection timeouts
  - Solution: Check network policies, DNS resolution, firewall rules
  - Prevention: Monitor network latency and error rates between components
- Configuration errors
  - Symptom: Invalid scrape configs, dashboard errors, alert rule syntax errors
  - Solution: Validate configurations, use linters, implement CI/CD checks
  - Prevention: Version control monitoring configurations, implement change management
- Data retention
  - Symptom: Missing historical data, excessive disk usage, slow queries
  - Solution: Optimize retention periods, implement data downsampling
  - Prevention: Configure appropriate retention policies based on data importance
- Cardinality explosion
  - Symptom: High memory usage in Prometheus, slow queries, OOM issues
  - Solution: Limit label cardinality, use recording rules, optimize queries
  - Prevention: Review metric naming and labeling practices, monitor cardinality growth
- Certificate expiration
  - Symptom: TLS errors in monitoring components
  - Solution: Renew certificates, implement automated certificate management
  - Prevention: Set up alerts for certificate expiration dates

Service meshes add a further layer of observability. Istio metrics provide detailed service-to-service communication data:
# Example Prometheus scrape config for Istio
scrape_configs :
- job_name : 'istio-mesh'
kubernetes_sd_configs :
- role : endpoints
relabel_configs :
- source_labels : [ __meta_kubernetes_service_name , __meta_kubernetes_endpoint_port_name ]
action : keep
regex : istio-telemetry;prometheus
Traffic monitoring: Request volume, success rates, latency between services
Service graphs: Visual representation of service dependencies and health
# Grafana dashboard query for service dependency
sum(rate(istio_requests_total{reporter="source"}[5m])) by (source_workload, destination_workload)
Distributed tracing : End-to-end request tracing with Jaeger or Zipkin
# Jaeger deployment example
apiVersion : apps/v1
kind : Deployment
metadata :
name : jaeger
namespace : tracing
spec :
selector :
matchLabels :
app : jaeger
template :
metadata :
labels :
app : jaeger
spec :
containers :
- name : jaeger
image : jaegertracing/all-in-one:1.24
ports :
- containerPort : 16686
name : ui
- containerPort : 14268
name : collector
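A Service exposing the Jaeger UI and collector ports makes the all-in-one deployment reachable inside the cluster; a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: tracing
spec:
  selector:
    app: jaeger
  ports:
  - name: ui
    port: 16686
  - name: collector
    port: 14268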
Application metrics : Custom instrumentation of application code
// Go example with Prometheus client
httpRequestsTotal := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests by status code and method",
},
[]string{"status_code", "method"},
)
prometheus.MustRegister(httpRequestsTotal)
// In HTTP handler
httpRequestsTotal.WithLabelValues(strconv.Itoa(statusCode), method).Inc()
Business metrics : KPIs specific to your domain (orders, users, etc.)
# Python example
from prometheus_client import Counter
payments_total = Counter('payments_total', 'Total payments processed',
['status', 'payment_method'])
# When processing a payment
payments_total.labels(status='success', payment_method='credit_card').inc()
External metrics : Metrics from external systems (databases, APIs, etc.)
# Example exporter for MySQL
apiVersion : apps/v1
kind : Deployment
metadata :
name : mysql-exporter
spec :
selector :
matchLabels :
app : mysql-exporter
template :
metadata :
labels :
app : mysql-exporter
spec :
containers :
- name : exporter
image : prom/mysqld-exporter:v0.13.0
env :
- name : DATA_SOURCE_NAME
valueFrom :
secretKeyRef :
name : mysql-credentials
key : data-source-name
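For Prometheus to scrape the exporter, its metrics port (9104 by default for mysqld-exporter) must be exposed through a Service and, when using the Prometheus Operator, selected by a ServiceMonitor. A sketch with assumed names and labels:

apiVersion: v1
kind: Service
metadata:
  name: mysql-exporter
  labels:
    app: mysql-exporter
spec:
  selector:
    app: mysql-exporter
  ports:
  - name: metrics
    port: 9104
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mysql-exporter
  labels:
    release: prometheus     # must match the Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: mysql-exporter
  endpoints:
  - port: metrics
    interval: 30s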
Synthetic monitoring : Simulated user interactions to test availability
# Blackbox exporter for probing endpoints
apiVersion : monitoring.coreos.com/v1
kind : PrometheusRule
metadata :
name : synthetic-monitoring
spec :
groups :
- name : synthetic
rules :
- alert : EndpointDown
expr : probe_success{job="blackbox"} == 0
for : 5m
labels :
severity : critical
annotations :
summary : "Endpoint {{ $labels.instance }} is down"
Auto-scaling : HPA, VPA, and Cluster Autoscaler for dynamic resource allocation
# Custom metrics HPA example
apiVersion : autoscaling/v2
kind : HorizontalPodAutoscaler
metadata :
name : api-hpa
spec :
scaleTargetRef :
apiVersion : apps/v1
kind : Deployment
name : api
minReplicas : 2
maxReplicas : 10
metrics :
- type : Pods
pods :
metric :
name : http_requests_per_second
target :
type : AverageValue
averageValue : 500
Self-healing : Automated remediation based on monitoring signals
# Example of Kubernetes Event-driven Autoscaling (KEDA)
apiVersion : keda.sh/v1alpha1
kind : ScaledObject
metadata :
name : queue-consumer
spec :
scaleTargetRef :
name : consumer
triggers :
- type : prometheus
metadata :
serverAddress : http://prometheus.monitoring:9090
metricName : queue_depth
threshold : "10"
query : sum(rabbitmq_queue_messages_ready{queue="tasks"})
Capacity planning : Predictive scaling based on historical patterns
# Prometheus recording rule for capacity planning
apiVersion : monitoring.coreos.com/v1
kind : PrometheusRule
metadata :
name : capacity-planning
spec :
groups :
- name : capacity
rules :
- record : capacity:cpu:prediction:next_7d
expr : predict_linear(avg_over_time(cluster:cpu:usage:ratio[30d])[60d:1h], 86400 * 7)
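The prediction above assumes a cluster:cpu:usage:ratio recording rule that tracks overall CPU utilization. A sketch of one possible definition; exact metric names vary with the kube-state-metrics version:

- record : cluster:cpu:usage:ratio
  expr : sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) / sum(kube_node_status_allocatable{resource="cpu"})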
Performance optimization : Automated tuning based on observed metrics
# Example of a performance tuning profile (OpenShift PerformanceProfile)
apiVersion : performance.openshift.io/v2
kind : PerformanceProfile
metadata :
name : performance
spec :
additionalKernelArgs :
- "nmi_watchdog=0"
- "audit=0"
- "processor.max_cstate=1"
cpu :
isolated : "1-3"
reserved : "0"
hugepages :
defaultHugepagesSize : "1G"
pages :
- size : "1G"
count : 4
Finally, some overall best practices for Kubernetes monitoring and logging:

- Monitor cluster health
  - Set up comprehensive monitoring for all Kubernetes components
  - Monitor control plane components (API server, etcd, scheduler)
  - Track node health metrics (kubelet, container runtime)
  - Use dedicated tools for Kubernetes-specific monitoring
- Implement log rotation
  - Configure log rotation policies to prevent disk exhaustion
  - Set appropriate retention periods based on importance
  - Consider regulatory requirements for log retention
  - Implement automated archiving for long-term storage
- Set up alerting
  - Define meaningful alerts with clear thresholds
  - Implement different severity levels and routing
  - Create actionable alerts with runbook links
  - Avoid alert fatigue with proper grouping and silencing
  - Test alert delivery and escalation paths regularly
- Use persistent storage
  - Store metrics and logs on reliable persistent volumes
  - Size storage appropriately based on retention needs
  - Monitor storage usage and set up alerts for capacity thresholds
  - Use storage classes appropriate for the workload (SSD for hot data)
- Take regular backups
  - Back up monitoring and logging configurations
  - Include dashboards, alert rules, and custom scripts
  - Test restoration procedures periodically
  - Consider disaster recovery scenarios
- Monitor applications
  - Instrument applications with custom metrics
  - Track business-relevant indicators
  - Implement health checks and readiness probes
  - Monitor external dependencies
  - Create application-specific dashboards
- Track resource usage
  - Monitor CPU, memory, disk, and network utilization
  - Set up resource requests and limits based on actual usage
  - Identify resource bottlenecks and optimization opportunities
  - Implement cost allocation and showback
- Implement security monitoring
  - Monitor for suspicious activities and policy violations
  - Track authentication and authorization events
  - Implement audit logging for sensitive operations (see the sketch after this list)
  - Scan for vulnerabilities and misconfigurations
  - Set up alerts for security-related events
- Track performance
  - Define and track SLIs (Service Level Indicators)
  - Establish SLOs (Service Level Objectives) for critical services
  - Monitor latency, error rates, and throughput
  - Implement distributed tracing for complex applications
  - Conduct regular performance reviews
- Plan capacity
  - Track growth trends and forecast future needs
  - Implement predictive analysis for capacity requirements
  - Set up automated scaling mechanisms
  - Plan for seasonal variations in workload
  - Balance resource efficiency with performance requirements
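As referenced in the security monitoring item above, audit logging is configured through an API server audit policy. A minimal sketch that records metadata for Secret access and full request bodies for workload changes; the rule scope is illustrative:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record who accessed Secrets, but not the secret contents
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]
# Record full request bodies for changes to workloads
- level: Request
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: "apps"
    resources: ["deployments", "statefulsets", "daemonsets"]
# Everything else at the Metadata level
- level: Metadata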