Monitoring & Logging

Understanding Kubernetes monitoring, logging, and observability practices

Effective monitoring and logging are crucial for maintaining healthy Kubernetes clusters and applications. A comprehensive monitoring and logging strategy provides visibility into cluster health and application performance, and helps you troubleshoot issues before they affect end users.

Kubernetes observability consists of three main pillars:

  1. Monitoring: Collecting and analyzing metrics about system performance and behavior
  2. Logging: Capturing and storing event data and messages from applications and system components
  3. Tracing: Following the path of requests through distributed systems to identify bottlenecks

Monitoring Components

Metrics Server

  • Basic cluster metrics: Lightweight, in-memory metrics collector
  • CPU and memory usage: Resource utilization for pods and nodes
  • Pod and node metrics: Key metrics for orchestration decisions
  • Horizontal Pod Autoscaling: Provides metrics for the HPA controller
  • Simple API: Accessible via Kubernetes API (metrics.k8s.io)
  • No historical data: Only provides current state (not for long-term analysis)
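
A quick way to verify that the Metrics Server is working is to query the metrics.k8s.io API directly (jq is assumed here only for readable output):

# Raw node metrics from the Metrics Server API
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .

# Raw pod metrics for a namespace
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods" | jq .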

Prometheus

  • Time-series database: Purpose-built database for metrics with timestamps
  • Metrics collection: Pull-based architecture with service discovery
  • Query language (PromQL): Powerful language for metrics analysis and aggregation
  • Alert management: Define alert conditions and routing
  • Federation: Scale to large deployments by federating multiple Prometheus servers
  • Service discovery: Auto-discovers targets via Kubernetes API
  • Extensive integrations: Wide ecosystem of exporters for various systems

Grafana

  • Visualization platform: Create rich, interactive dashboards
  • Dashboard creation: Drag-and-drop interface with extensive customization
  • Metrics exploration: Query and explore metrics in real-time
  • Alert notification: Define alerts based on metrics thresholds
  • Annotation: Mark events on time-series graphs
  • Multi-datasource: Connect to Prometheus, Elasticsearch, and many other data sources
  • Template variables: Create dynamic, reusable dashboards

Basic Monitoring Setup

Below is a simplified Prometheus deployment. In production environments, you would typically use the Prometheus Operator or a solution like Kube Prometheus Stack.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
    component: core
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: core
  template:
    metadata:
      labels:
        app: prometheus
        component: core
    spec:
      serviceAccountName: prometheus # Must have permissions to access API server
      containers:
      - name: prometheus
        image: prom/prometheus:v2.30.3
        args:
          - "--config.file=/etc/prometheus/prometheus.yml"
          - "--storage.tsdb.path=/prometheus"
          - "--storage.tsdb.retention.time=15d"
          - "--web.console.libraries=/etc/prometheus/console_libraries"
          - "--web.console.templates=/etc/prometheus/consoles"
        ports:
        - containerPort: 9090
          name: web
        volumeMounts:
        - name: config-volume
          mountPath: /etc/prometheus
        - name: prometheus-data
          mountPath: /prometheus
        resources:
          requests:
            cpu: 500m
            memory: 500Mi
          limits:
            cpu: 1
            memory: 1Gi
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      - name: prometheus-data
        persistentVolumeClaim:
          claimName: prometheus-data
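
The serviceAccountName above needs read access to the Kubernetes API for service discovery and scraping. A minimal RBAC sketch (resource names assumed to match the Deployment):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring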

The Prometheus configuration would be stored in a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    scrape_configs:
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
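
A common addition is an annotation-driven job that scrapes any pod annotated with prometheus.io/scrape: "true" (a widely used convention, not a Kubernetes built-in). It would be appended under scrape_configs in the ConfigMap above:

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)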

Resource Metrics

Pod Metrics

The kubectl top command provides a quick way to view resource usage for pods:

kubectl top pod -n default

Output example:

NAME                     CPU(cores)   MEMORY(bytes)   
nginx-85b98978db-2cgnj   1m           9Mi            
redis-78586f566b-jz5jq   1m           10Mi           
webapp-7d7ffd5cc9-nv2xz  10m          45Mi           

Node Metrics

View resource utilization across all nodes in the cluster:

kubectl top node

Output example:

NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
worker-node-1   248m         12%    1867Mi          24%       
worker-node-2   134m         6%     2507Mi          32%       
worker-node-3   324m         16%    2120Mi          27%       
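
If your kubectl version supports it, --sort-by helps identify the heaviest consumers quickly:

# Pods across all namespaces, sorted by memory usage
kubectl top pod --all-namespaces --sort-by=memory

# Nodes sorted by CPU usage
kubectl top node --sort-by=cpu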

Custom Metrics

The Prometheus Operator uses ServiceMonitor CRDs to define which services should be monitored:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    release: prometheus  # Used by Prometheus Operator for selection
spec:
  selector:
    matchLabels:
      app: myapp  # Will monitor all services with this label
  namespaceSelector:
    matchNames:
      - default
      - myapp-namespace
  endpoints:
  - port: metrics  # The service port name to scrape
    path: /metrics  # The metrics path (default is /metrics)
    interval: 15s  # Scrape interval
    scrapeTimeout: 10s  # Timeout for each scrape request
    honorLabels: true  # Keep original metric labels
    metricRelabelings:  # Optional transformations of metrics
    - sourceLabels: [__name__]
      regex: 'api_http_requests_total'
      action: keep

For custom metrics to work with the Horizontal Pod Autoscaler, you'll also need to deploy the Prometheus Adapter, which exposes Prometheus metrics through the Kubernetes custom metrics API. The HPA can then reference a custom metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second  # Custom metric
      target:
        type: AverageValue
        averageValue: 100
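
The adapter translates Prometheus series into custom metrics through a rules configuration. A sketch of the idea (the ConfigMap name and exact mount path depend on how the adapter is installed):

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'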

Logging Architecture

Container Logging

In Kubernetes, containers should log to stdout and stderr instead of files. This example demonstrates a simple logging pattern:

apiVersion: v1
kind: Pod
metadata:
  name: counter
  labels:
    app: counter
    component: example
spec:
  containers:
  - name: count
    image: busybox
    args:
    - /bin/sh
    - -c
    - >
      i=0;
      while true;
      do
        echo "$i: $(date)";
        i=$((i+1));
        sleep 1;
      done
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"
      limits:
        memory: "128Mi"
        cpu: "200m"

You can access these logs using kubectl:

# Follow logs in real-time
kubectl logs -f counter

# Show logs with timestamps
kubectl logs --timestamps counter

# Show logs for previous container instance (after restart)
kubectl logs --previous counter

# Show last 100 lines
kubectl logs --tail=100 counter

For multi-container pods, you must specify the container name:

kubectl logs counter -c count

Best practices for application logging:

  • Use structured logging (JSON format)
  • Include consistent metadata (request ID, user ID, etc.)
  • Use appropriate log levels (DEBUG, INFO, WARN, ERROR)
  • Don't write logs to files inside containers (if an application can only log to files, stream them to stdout with a sidecar, as sketched below)
  • Consider log volume in high-traffic services
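
When an application can only write to files, a streaming sidecar can tail them to stdout so the standard log pipeline still picks them up. A minimal sketch (container names and paths are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
  - name: app
    image: busybox
    args: [/bin/sh, -c, 'while true; do echo "$(date) app log" >> /var/log/app/app.log; sleep 1; done']
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  - name: log-streamer  # Forwards the file to stdout for kubectl logs and log collectors
    image: busybox
    args: [/bin/sh, -c, 'tail -n+1 -F /var/log/app/app.log']
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  volumes:
  - name: logs
    emptyDir: {}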

Logging Solutions

Elasticsearch

  • Log storage: Distributed, scalable document store optimized for search
  • Full-text search: Advanced search capabilities with analyzers and tokenizers
  • Analytics: Real-time aggregations and analytics on log data
  • Visualization: Integration with visualization tools
  • Scalability: Horizontal scaling with sharding and replication
  • Schema flexibility: Dynamic mapping for varying log formats
  • Retention management: Index lifecycle policies for log rotation and archival

Fluentd / Fluent Bit

  • Log collection: Tail files and receive events from various sources
  • Unified logging: Consistent format across all logging sources
  • Multiple outputs: Send to multiple destinations (Elasticsearch, S3, etc.)
  • Plugin architecture: Extensible with custom plugins
  • Buffering: Handles log spikes and network issues
  • Kubernetes integration: Native metadata enrichment
  • Filtering: Parse, transform, and filter logs before forwarding
  • Performance: Fluent Bit is a lightweight, lower-footprint alternative well suited to edge and resource-constrained deployments

Kibana

  • Log visualization: Rich visualization of log data
  • Dashboard creation: Custom dashboards for different teams/use cases
  • Log exploration: Interactive exploration and search of log data
  • Alert management: Alerting based on log patterns and thresholds
  • Reporting: Generate reports from log data
  • Machine learning: Anomaly detection in log patterns
  • Role-based access: Control access to different log sources

Other Solutions

  • Loki: Horizontally-scalable, multi-tenant log aggregation system designed for cost efficiency
  • Grafana: Can visualize logs from multiple sources including Loki and Elasticsearch
  • ELK Stack: Combined Elasticsearch, Logstash, and Kibana deployment
  • Datadog/New Relic/Dynatrace: Commercial observability platforms with integrated logging
  • CloudWatch Logs: Native solution for AWS EKS clusters
  • Google Cloud Logging: Native solution for GKE clusters
  • Azure Monitor: Native solution for AKS clusters

EFK Stack Deployment

Here's an example Elasticsearch StatefulSet deployment as part of an EFK (Elasticsearch, Fluentd, Kibana) stack; treat it as a starting point rather than a hardened production setup (no TLS or authentication is configured):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
  labels:
    app: elasticsearch
    component: logging
spec:
  serviceName: elasticsearch
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
      - name: fix-permissions
        image: busybox
        command: ["sh", "-c", "chown -R 1000:1000 /usr/share/elasticsearch/data"]
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
      - name: increase-vm-max-map
        image: busybox
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true
      - name: increase-fd-ulimit
        image: busybox
        command: ["sh", "-c", "ulimit -n 65536"]
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
        env:
        - name: cluster.name
          value: kubernetes-logs
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: discovery.seed_hosts
          value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        - name: ES_JAVA_OPTS
          value: "-Xms1g -Xmx1g"
        resources:
          limits:
            cpu: 1000m
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 1Gi
        ports:
        - containerPort: 9200
          name: http
        - containerPort: 9300
          name: transport
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
        readinessProbe:
          httpGet:
            path: /_cluster/health
            port: http
          initialDelaySeconds: 20
          timeoutSeconds: 5
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard
      resources:
        requests:
          storage: 100Gi
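
The serviceName and discovery.seed_hosts values above assume a headless Service, which gives each StatefulSet pod a stable DNS name:

apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
  labels:
    app: elasticsearch
spec:
  clusterIP: None  # Headless: pods are addressed individually (elasticsearch-0.elasticsearch, ...)
  selector:
    app: elasticsearch
  ports:
  - port: 9200
    name: http
  - port: 9300
    name: transport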

A complementary Fluentd DaemonSet to collect logs from all nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
  labels:
    app: fluentd
    component: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.13-debian-elasticsearch7
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        - name: FLUENT_ELASTICSEARCH_SCHEME
          value: "http"
        - name: FLUENT_ELASTICSEARCH_USER
          value: "elastic"
        - name: FLUENT_ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: elasticsearch-credentials
              key: password
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluentd/etc/fluent.conf
          subPath: fluent.conf
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: fluentd-config
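
The fluentd ServiceAccount referenced above needs read access to pod and namespace metadata so logs can be enriched with Kubernetes labels. A minimal RBAC sketch:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging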

And Kibana for visualization:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
  labels:
    app: kibana
    component: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:7.15.0
        env:
        - name: ELASTICSEARCH_HOSTS
          value: http://elasticsearch:9200
        resources:
          limits:
            cpu: 1000m
            memory: 1Gi
          requests:
            cpu: 500m
            memory: 512Mi
        ports:
        - containerPort: 5601
          name: http
        readinessProbe:
          httpGet:
            path: /api/status
            port: http
          initialDelaySeconds: 30
          timeoutSeconds: 10
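
To reach the Kibana UI you would typically add a Service (and an Ingress in production); for quick access, a port-forward works:

apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  selector:
    app: kibana
  ports:
  - port: 5601
    targetPort: http

# Then open http://localhost:5601
kubectl port-forward -n logging svc/kibana 5601:5601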

Monitoring Best Practices

Alert Management

Prometheus alerts are defined as PrometheusRules in the Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: high-cpu-usage
  namespace: monitoring
  labels:
    prometheus: k8s  # Label used by Prometheus to find rules
    role: alert-rules
spec:
  groups:
  - name: cpu
    rules:
    - alert: HighCPUUsage
      expr: sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, namespace) > 0.9
      for: 5m  # Must be true for this duration before firing
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "High CPU usage for pod {{ $labels.pod }} in namespace {{ $labels.namespace }}"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been using more than 90% of its CPU limit for 5 minutes."
        runbook_url: "https://wiki.example.com/runbooks/high-cpu-usage"
        dashboard_url: "https://grafana.example.com/d/pods?var-namespace={{ $labels.namespace }}&var-pod={{ $labels.pod }}"
    
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5
      for: 15m
      labels:
        severity: critical
        team: app-team
      annotations:
        summary: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting roughly {{ $value | printf \"%.0f\" }} times every 5 minutes (averaged over the last 15 minutes)"

AlertManager then handles routing and notification:

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 3
  configSecret: alertmanager-config
---
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'
    
    route:
      group_by: ['namespace', 'alertname', 'job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'slack-notifications'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty-critical'
        continue: true
      - match_re:
          service: ^(foo|bar)$
        receiver: 'team-X-mails'
    
    receivers:
    - name: 'slack-notifications'
      slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: "{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:*\n{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`\n{{ end }}{{ end }}"
    
    - name: 'pagerduty-critical'
      pagerduty_configs:
      - service_key: 'your-pagerduty-integration-key'
        send_resolved: true
    
    - name: 'team-X-mails'
      email_configs:
      - to: 'team-X+alerts@example.com'
        send_resolved: true

Dashboards

Metrics Dashboard

  • Resource utilization: CPU, memory, disk, and network usage
  • Performance metrics: Request latency, throughput, error rates
  • Health status: Pod/node availability, readiness/liveness probe results
  • Alert overview: Active alerts, recent resolutions, alert history
  • Cluster capacity: Available resources, scheduling headroom
  • Workload metrics: Deployment status, replica counts, scaling events
  • Custom application metrics: Business-specific indicators

Example Grafana dashboard JSON for pod resources:

{
  "title": "Kubernetes Pod Resources",
  "panels": [
    {
      "title": "CPU Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$namespace\", pod=\"$pod\", container!=\"POD\", container!=\"\"}[5m])) by (container)",
          "legendFormat": "{{container}}"
        }
      ]
    },
    {
      "title": "Memory Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(container_memory_usage_bytes{namespace=\"$namespace\", pod=\"$pod\", container!=\"POD\", container!=\"\"}) by (container)",
          "legendFormat": "{{container}}"
        }
      ]
    }
  ],
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)"
      },
      {
        "name": "pod",
        "type": "query",
        "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)"
      }
    ]
  }
}
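
If Grafana is deployed via kube-prometheus-stack, its dashboard sidecar can load JSON like the above from ConfigMaps carrying a grafana_dashboard label (the label key is configurable; this is a sketch):

apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-resources-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"  # Label watched by the dashboard sidecar
data:
  pod-resources.json: |
    { "title": "Kubernetes Pod Resources", "panels": [], "templating": { "list": [] } }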

Logging Dashboard

  • Log patterns: Visualize log volume patterns and trends
  • Error rates: Track error frequency and types
  • Application logs: Filter and search application-specific logs
  • System logs: Monitor Kubernetes components and node-level logs
  • Log correlations: Connect logs to related metrics and events
  • Audit logs: Track security-relevant actions in the cluster
  • Custom parsing: Extract and visualize structured log fields

Example Kibana dashboard saved object (abridged; in a real export, panelsJSON is a single escaped JSON string, shown expanded here for readability):

{
  "objects": [
    {
      "attributes": {
        "title": "Kubernetes Logs Overview",
        "panelsJSON": "[
          {
            \"embeddableConfig\": {},
            \"gridData\": {
              \"h\": 15,
              \"i\": \"1\",
              \"w\": 24,
              \"x\": 0,
              \"y\": 0
            },
            \"panelIndex\": \"1\",
            \"panelRefName\": \"panel_0\",
            \"title\": \"Log Volume Over Time\",
            \"version\": \"7.10.0\"
          },
          {
            \"embeddableConfig\": {},
            \"gridData\": {
              \"h\": 15,
              \"i\": \"2\",
              \"w\": 24,
              \"x\": 24,
              \"y\": 0
            },
            \"panelIndex\": \"2\",
            \"panelRefName\": \"panel_1\",
            \"title\": \"Errors by Namespace\",
            \"version\": \"7.10.0\"
          }
        ]",
        "timeRestore": false,
        "kibanaSavedObjectMeta": {
          "searchSourceJSON": "{\"query\":{\"language\":\"kuery\",\"query\":\"\"},\"filter\":[]}"
        }
      },
      "id": "kubernetes-logs-overview",
      "type": "dashboard"
    }
  ]
}

Integrated Dashboards

Modern observability platforms combine metrics, logs, and traces:

  • Cross-referencing: Navigate from metrics to related logs
  • Service maps: Visualize service dependencies

Performance Monitoring

Key Metrics

  • CPU usage: Container and node CPU utilization, throttling events
  • Memory consumption: RSS, working set, page faults, OOM events
  • Network traffic: Bytes sent/received, packet rate, connection states
  • Disk I/O: Read/write operations, latency, queue depth, errors
  • Response times: Request latency percentiles (p50, p95, p99), processing time
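
PromQL sketches for two of the signals above, assuming cAdvisor and kube-state-metrics are being scraped:

# Fraction of CPU periods throttled, per pod (cAdvisor)
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (namespace, pod)
  / sum(rate(container_cpu_cfs_periods_total[5m])) by (namespace, pod)

# Containers whose last termination was an OOM kill (kube-state-metrics)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1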

Service Level Indicators (SLIs)

  • Availability: Percentage of successful requests (100% - error rate)
    sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
    
  • Latency: Request duration distribution and percentiles
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
    
  • Error rate: Percentage of failed requests
    sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
    
  • Throughput: Request rate per second
    sum(rate(http_requests_total[5m]))
    
  • Saturation: How "full" your service is (resource utilization)
    max(sum by(pod) (rate(container_cpu_usage_seconds_total[5m])) / sum by(pod) (kube_pod_container_resource_limits_cpu_cores))
    

Service Level Objectives (SLOs)

  • Targets set for service performance and reliability
  • Example: 99.9% availability over 30 days
  • Error budgets: Allowable downtime within SLO
  • Measuring SLO compliance:
    # Recording rule for availability SLO
    - record: availability:slo_ratio_30d
      expr: sum_over_time(availability:ratio[30d]) / count_over_time(availability:ratio[30d])
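
  • Error budget alerting: Alert on how quickly the budget is being consumed; a fast-burn sketch for a 99.9% SLO, reusing the availability:ratio series above (the 14.4 multiplier follows common multi-window burn-rate guidance and should be tuned):
    # Fires when roughly 2% of the 30-day error budget burns within one hour
    - alert: ErrorBudgetFastBurn
      expr: (1 - avg_over_time(availability:ratio[1h])) > 14.4 * (1 - 0.999)
      for: 5m
      labels:
        severity: critical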
    

Log Management

Log Aggregation

Log aggregation collects logs from all nodes and applications into a centralized system. A Fluentd DaemonSet is a common approach:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.13-debian-elasticsearch7  # Bundles the Elasticsearch output and honors FLUENT_ELASTICSEARCH_* variables
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config-volume
          mountPath: /fluentd/etc/fluent.conf
          subPath: fluent.conf
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config-volume
        configMap:
          name: fluentd-config
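
The config-volume above expects a fluentd-config ConfigMap. A minimal sketch of fluent.conf, assuming Docker's json-file log format and the plugins bundled in the fluentd-kubernetes-daemonset image (tail input, Kubernetes metadata filter, Elasticsearch output):

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
      </parse>
    </source>

    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging
      port 9200
      logstash_format true
    </match>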

Log Structuring

Structured logging greatly improves searchability and analysis:

{
  "timestamp": "2023-04-01T12:34:56.789Z",
  "level": "INFO",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "user_id": "user-789",
  "message": "Payment processed successfully",
  "amount": 125.50,
  "currency": "USD",
  "payment_method": "credit_card"
}
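
One way to produce logs in this shape from application code is shown in this minimal Python sketch using only the standard library (field names mirror the example above):

import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via the `extra` mechanism
        payload.update(getattr(record, "extra_fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)  # stdout, not a file inside the container
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Payment processed successfully",
            extra={"extra_fields": {"amount": 125.50, "currency": "USD",
                                    "payment_method": "credit_card"}})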

Log Retention and Archiving

Managing log volume is crucial for both cost and performance:

  • Hot storage: Recent logs (1-7 days) in fast storage for active searching
  • Warm storage: Medium-term logs (7-30 days) for less frequent access
  • Cold storage: Long-term archival (30+ days) for compliance and auditing

Example Elasticsearch Index Lifecycle Policy:

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_size": "50gb"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
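
Applying the policy and wiring it to new indices is done through the Elasticsearch API; a sketch assuming the policy JSON above is saved as ilm-policy.json and the Fluentd indices use the default logstash-* pattern:

# Create or update the lifecycle policy
curl -X PUT "http://elasticsearch.logging:9200/_ilm/policy/k8s-logs" \
  -H 'Content-Type: application/json' -d @ilm-policy.json

# Attach the policy to new indices via an index template
# (the rollover action also requires bootstrapping an initial write index for the alias)
curl -X PUT "http://elasticsearch.logging:9200/_index_template/k8s-logs" \
  -H 'Content-Type: application/json' \
  -d '{"index_patterns":["logstash-*"],"template":{"settings":{"index.lifecycle.name":"k8s-logs","index.lifecycle.rollover_alias":"logstash"}}}'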

Advanced Topics

Service Mesh Monitoring

  • Istio metrics: Detailed service-to-service communication metrics
    # Example Prometheus scrape config for Istio
    scrape_configs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: istio-telemetry;prometheus
    
  • Traffic monitoring: Request volume, success rates, latency between services
  • Service graphs: Visual representation of service dependencies and health
    # Grafana dashboard query for service dependency
    sum(rate(istio_requests_total{reporter="source"}[5m])) by (source_workload, destination_workload)
    
  • Distributed tracing: End-to-end request tracing with Jaeger or Zipkin
    # Jaeger deployment example
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: jaeger
      namespace: tracing
    spec:
      selector:
        matchLabels:
          app: jaeger
      template:
        metadata:
          labels:
            app: jaeger
        spec:
          containers:
          - name: jaeger
            image: jaegertracing/all-in-one:1.24
            ports:
            - containerPort: 16686
              name: ui
            - containerPort: 14268
              name: collector
    

Custom Metrics

  • Application metrics: Custom instrumentation of application code
    // Go example with Prometheus client
    httpRequestsTotal := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests by status code and method",
        },
        []string{"status_code", "method"},
    )
    prometheus.MustRegister(httpRequestsTotal)
    
    // In HTTP handler
    httpRequestsTotal.WithLabelValues(strconv.Itoa(statusCode), method).Inc()
    
  • Business metrics: KPIs specific to your domain (orders, users, etc.)
    # Python example
    from prometheus_client import Counter
    
    payments_total = Counter('payments_total', 'Total payments processed', 
                             ['status', 'payment_method'])
    
    # When processing a payment
    payments_total.labels(status='success', payment_method='credit_card').inc()
    
  • External metrics: Metrics from external systems (databases, APIs, etc.)
    # Example exporter for MySQL
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mysql-exporter
    spec:
      selector:
        matchLabels:
          app: mysql-exporter
      template:
        metadata:
          labels:
            app: mysql-exporter
        spec:
          containers:
          - name: exporter
            image: prom/mysqld-exporter:v0.13.0
            env:
            - name: DATA_SOURCE_NAME
              valueFrom:
                secretKeyRef:
                  name: mysql-credentials
                  key: data-source-name
    
  • Synthetic monitoring: Simulated user interactions to test availability
    # Alert when a Blackbox Exporter probe fails (assumes a 'blackbox' scrape job is already probing your endpoints)
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: synthetic-monitoring
    spec:
      groups:
      - name: synthetic
        rules:
        - alert: EndpointDown
          expr: probe_success{job="blackbox"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Endpoint {{ $labels.instance }} is down"
    

Automated Actions

  • Auto-scaling: HPA, VPA, and Cluster Autoscaler for dynamic resource allocation
    # Custom metrics HPA example
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: api-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: 500
    
  • Self-healing: Automated remediation based on monitoring signals
    # Example of Kubernetes Event-driven Autoscaling (KEDA)
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: queue-consumer
    spec:
      scaleTargetRef:
        name: consumer
      triggers:
      - type: prometheus
        metadata:
          serverAddress: http://prometheus.monitoring:9090
          metricName: queue_depth
          threshold: "10"
          query: sum(rabbitmq_queue_messages_ready{queue="tasks"})
    
  • Capacity planning: Predictive scaling based on historical patterns
    # Prometheus recording rule for capacity planning
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: capacity-planning
    spec:
      groups:
      - name: capacity
        rules:
        - record: capacity:cpu:prediction:next_7d
          expr: predict_linear(avg_over_time(cluster:cpu:usage:ratio[30d])[60d:1h], 86400 * 7)
    
  • Performance optimization: Automated tuning based on observed metrics
    # Example OpenShift PerformanceProfile (Node Tuning Operator) for low-latency nodes
    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: performance
    spec:
      additionalKernelArgs:
        - "nmi_watchdog=0"
        - "audit=0"
        - "processor.max_cstate=1"
      cpu:
        isolated: "1-3"
        reserved: "0"
      hugepages:
        defaultHugepagesSize: "1G"
        pages:
          - size: "1G"
            count: 4
    

Best Practices Checklist