Monitoring & Logging

Learn how to monitor Docker containers and implement effective logging strategies

Docker Monitoring & Logging

Effective monitoring and logging are essential for maintaining reliable Docker environments. They help with troubleshooting, performance optimization, and ensuring application health. A comprehensive monitoring and logging strategy provides visibility into container behavior, enables proactive issue detection, and helps maintain service level objectives.

Docker containers present unique monitoring challenges due to their ephemeral nature, isolation characteristics, and the dynamic environments they often operate in. An effective container observability strategy must account for these characteristics while providing meaningful insights across the entire container lifecycle.

Container Monitoring

Built-in Monitoring

Docker provides basic monitoring capabilities out of the box that require no additional tools or setup:

# View real-time resource usage for all containers
docker stats
# Sample output:
# CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
# 1a2b3c4d5e6f   web       0.25%     15.23MiB / 1.952GiB   0.76%     648B / 648B       0B / 0B           2
# 2a3b4c5d6e7f   redis     0.05%     2.746MiB / 1.952GiB   0.14%     1.45kB / 1.05kB   0B / 0B           5

# View detailed container stats in JSON format (useful for scripting)
docker stats --no-stream --format "{{json .}}" container_name | jq '.'
# Pretty prints JSON output with jq for better readability

# Filter stats to show only specific containers
docker stats $(docker ps --format '{{.Names}}' | grep "api-")

# Display specific columns only
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"

# Inspect container configuration details
docker inspect container_name

# Extract specific configuration information
docker inspect --format '{{.HostConfig.Memory}}' container_name
docker inspect --format '{{.State.Health.Status}}' container_name
docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' container_name

# View container events in real-time
docker events --filter 'type=container'

Built-in monitoring tools provide a good starting point but have limitations:

  • No historical data retention
  • Limited metrics granularity
  • No alerting capabilities
  • Minimal visualization options
  • Container-centric rather than application-centric view

cAdvisor

Google's container advisor (cAdvisor) provides more detailed container metrics with a web UI and exposes Prometheus-compatible endpoints:

# Run cAdvisor container with all necessary volume mounts
docker run -d \
  --name=cadvisor \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --privileged \
  --device=/dev/kmsg \
  --restart=always \
  gcr.io/cadvisor/cadvisor:v0.47.0

cAdvisor provides:

  • Detailed resource usage statistics (CPU, memory, network, filesystem)
  • Container lifecycle events
  • Historical performance data (short-term, in-memory storage)
  • Container hierarchy visualization
  • Prometheus metrics endpoint at /metrics
  • Hardware-specific monitoring (NVML for NVIDIA GPUs)
  • Auto-discovery of containers

Key metrics exposed by cAdvisor:

  • CPU usage breakdown (user, system, throttling)
  • Memory details (RSS, cache, swap, working set)
  • Network statistics (RX/TX bytes, errors, dropped packets)
  • Filesystem usage and I/O statistics
  • Per-process statistics within containers

Accessing the UI:

  • Web interface at http://localhost:8080
  • API endpoints for programmatic access:
    • /api/v1.3/containers - All container stats
    • /api/v1.3/docker/[container_name] - Specific container stats
    • /metrics - Prometheus-formatted metrics
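
The endpoints above can be checked quickly from the host. The sketch below assumes cAdvisor is published on localhost:8080 as in the run command earlier; exact JSON fields may vary between cAdvisor versions:

# Confirm the Prometheus endpoint is serving container metrics
curl -s http://localhost:8080/metrics | grep -m 5 '^container_cpu_usage_seconds_total'

# Query the REST API and list the names of discovered containers
curl -s http://localhost:8080/api/v1.3/containers | jq '[.subcontainers[]?.name]'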

Prometheus + Grafana

A powerful combination for metrics collection, storage, querying, and visualization, widely considered the industry standard for container monitoring:

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.44.0
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped
    
  node-exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped
    
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    restart: unless-stopped
    
  grafana:
    image: grafana/grafana:9.5.1
    container_name: grafana
    user: "472"
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin_password
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_DOMAIN=localhost
    restart: unless-stopped

volumes:
  prometheus_data: {}
  grafana_data: {}

Example prometheus.yml configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    
  - job_name: 'docker'
    metrics_path: /metrics
    static_configs:
      - targets: ['172.17.0.1:9323']  # Docker daemon metrics (requires daemon configuration)

  # Auto-discover and scrape containers with prometheus.io annotations
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        filters:
          - name: label
            values: ['prometheus.io.scrape=true']
    relabel_configs:
      # Build the scrape address from the container's network IP and the port label
      - source_labels: ['__meta_docker_network_ip', '__meta_docker_container_label_prometheus_io_port']
        regex: '(.+);(.+)'
        target_label: __address__
        replacement: '$1:$2'
      # Strip the leading slash from the container name
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container_name
      - source_labels: ['__meta_docker_container_label_prometheus_io_job']
        target_label: job

This setup provides:

  • Prometheus: Time-series database with a powerful query language (PromQL)
  • Node Exporter: Host-level metrics (CPU, memory, disk, network)
  • cAdvisor: Container-specific metrics
  • Grafana: Visualization dashboards and alerting
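
With the Compose file and prometheus.yml in place, you can bring the stack up and confirm that Prometheus has discovered its scrape targets (the port mapping matches the Compose example above):

# Start the monitoring stack
docker compose up -d

# List the jobs Prometheus is actively scraping
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels.job'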

You can enable Docker daemon metrics by adding to /etc/docker/daemon.json:

{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}
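
After restarting the daemon, a quick check confirms the endpoint is reachable (the port matches the metrics-addr above; metric names can vary between Docker versions):

sudo systemctl restart docker
curl -s http://localhost:9323/metrics | grep -m 5 '^engine_daemon'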

Common Prometheus metrics for containers:

  • container_cpu_usage_seconds_total - Cumulative CPU time consumed
  • container_memory_usage_bytes - Current memory usage, including cache
  • container_memory_rss - Resident Set Size (actual memory usage)
  • container_network_receive_bytes_total - Network bytes received
  • container_network_transmit_bytes_total - Network bytes sent
  • container_fs_reads_bytes_total - Bytes read from disk
  • container_fs_writes_bytes_total - Bytes written to disk

Example PromQL queries:

# CPU usage percentage per container
sum(rate(container_cpu_usage_seconds_total{name=~".+"}[1m])) by (name) * 100

# Memory usage in MB per container
sum(container_memory_usage_bytes{name=~".+"}) by (name) / 1024 / 1024

# Network receive throughput
rate(container_network_receive_bytes_total{name=~".+"}[5m])
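
The same queries can be run outside the UI through the Prometheus HTTP API, which is useful for scripting (assumes Prometheus is reachable on localhost:9090):

# Run a PromQL query via the HTTP API and print the resulting series
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{name=~".+"}[1m])) by (name) * 100' \
  | jq '.data.result[] | {container: .metric.name, cpu_percent: .value[1]}'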

Logging Strategies

Configure Logging Drivers

Logging drivers can be configured at the daemon level (for all containers) or per container:

# Set logging driver for a specific container
docker run --log-driver=syslog \
  --log-opt syslog-address=udp://logs.example.com:514 \
  --log-opt syslog-facility=daemon \
  --log-opt tag="{{.Name}}/{{.ID}}" \
  nginx

# Use JSON file driver with size-based rotation
docker run --log-driver=json-file \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  --log-opt compress=true \
  nginx

# Configure AWS CloudWatch logging
docker run --log-driver=awslogs \
  --log-opt awslogs-region=us-west-2 \
  --log-opt awslogs-group=my-container-logs \
  --log-opt awslogs-stream=web-app \
  my-web-app

# Send logs to Fluentd with custom labels
docker run --log-driver=fluentd \
  --log-opt fluentd-address=localhost:24224 \
  --log-opt tag="docker.{{.Name}}" \
  --log-opt fluentd-async=true \
  --log-opt labels=environment,service_name \
  --label environment=production \
  --label service_name=api \
  my-api-service

Configure default logging driver for all containers in /etc/docker/daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true",
    "labels": "environment,project",
    "env": "HOSTNAME,ENVIRONMENT"
  }
}

After modifying daemon.json, restart the Docker daemon:

sudo systemctl restart docker

You can verify the current logging driver configuration:

docker info --format '{{.LoggingDriver}}'

For container-specific logging configuration:

docker inspect --format '{{.HostConfig.LogConfig}}' container_name

Common logging options across drivers:

  • mode - blocking (default) or non-blocking
  • max-buffer-size - buffer size for non-blocking mode
  • env - include specific environment variables in logs
  • labels - include container labels in logs
  • tag - customize log message tag format
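
For example, non-blocking mode trades guaranteed delivery for lower application latency when the logging backend is slow; the values below are illustrative:

# Use non-blocking log delivery with a larger ring buffer
docker run -d \
  --log-driver=json-file \
  --log-opt mode=non-blocking \
  --log-opt max-buffer-size=4m \
  nginx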

Viewing Container Logs

Basic Log Commands

# View container logs (shows all logs)
docker logs container_name

# Follow container logs in real-time (like tail -f)
docker logs -f container_name

# Show timestamps with logs (useful for debugging)
docker logs -t container_name

# Show last n lines (instead of entire log history)
docker logs --tail=100 container_name

# Combine options for real-time monitoring with timestamps
docker logs -f -t --tail=50 container_name

# Filter logs by time (show logs since specific timestamp)
docker logs --since 2023-07-01T00:00:00 container_name

# Show logs until a specific timestamp
docker logs --until 2023-07-02T00:00:00 container_name

# Show logs from relative time (e.g., last hour)
docker logs --since=1h container_name

# Grep for specific patterns in logs
docker logs container_name | grep ERROR

# Show extra attributes (environment variables and labels) attached to log entries
docker logs --details container_name

# Count occurrences of specific log patterns
docker logs container_name | grep -c "Connection refused"

# Extract log timestamps for timing analysis
docker logs -t container_name | awk '{print $1}' | sort

# Monitor multiple containers by iterating over running container IDs
for c in $(docker ps -q); do docker logs --tail=100 "$c" 2>&1; done | grep ERROR

# Show container logs with colorized output for different log levels
docker logs container_name | grep --color -E "ERROR|WARN|INFO|DEBUG|$"

Note that docker logs only works with containers using the json-file, local, or journald logging drivers. For other logging drivers (like syslog, fluentd, etc.), you'll need to access logs through their respective systems.
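
If you want docker logs to keep working while still controlling disk usage, the local driver is a good default; it stores logs in a compact binary format and enables rotation out of the box:

# Keep logs readable via `docker logs` with built-in rotation
docker run -d \
  --log-driver=local \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  --name web-local nginx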

Log Retrieval Performance Considerations

  • For large log volumes, use --tail to limit output
  • Avoid frequent calls to docker logs on busy production systems
  • Consider using the --since flag to limit time range
  • For high-traffic containers, use a dedicated logging solution instead of docker logs
  • The docker logs command reads the entire log file before applying filters, which can be resource-intensive

Log Rotation

Log rotation is essential to prevent disk space exhaustion. Configure it in the Docker daemon settings:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}

Here max-size caps the size of each log file before rotation, max-file sets how many rotated files to keep, and compress gzips rotated files. Note that daemon.json must be valid JSON, so inline comments are not allowed.

Container-specific log rotation can be configured at runtime:

# Configure log rotation for a specific container
docker run -d \
  --log-opt max-size=5m \
  --log-opt max-file=5 \
  --log-opt compress=true \
  --name web nginx

For existing deployments with unmanaged log files, you can use external log rotation:

# Example logrotate configuration (/etc/logrotate.d/docker-containers)
/var/lib/docker/containers/*/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

There is no command to rotate a container's logs on demand; with the json-file and local drivers, rotation happens automatically when a log file reaches max-size.

Common log rotation issues:

  • Docker daemon restart required for daemon.json changes to take effect
  • Containers created before log rotation configuration won't have it applied
  • Very high log volume containers may still face issues despite rotation
  • copytruncate in logrotate may miss logs during rotation window
  • Nested JSON logs can be challenging to parse after rotation
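
To spot containers whose logs are growing without rotation, check the size of the json-file logs on the host (the paths below assume the default Docker data root and usually require root):

# Show the largest container log files
sudo du -h /var/lib/docker/containers/*/*-json.log | sort -h | tail -n 10

# Map a log file back to its container name
docker inspect --format '{{.Name}} {{.LogPath}}' $(docker ps -aq)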

Multi-container Log Aggregation

Tools like Fluentd, Logstash, or Filebeat can collect logs from multiple containers and forward them to centralized logging systems:

version: '3.8'
services:
  app:
    image: my-app
    logging:
      driver: fluentd
      options:
        fluentd-address: fluentd:24224
        tag: app.{{.Name}}
        fluentd-async: "true"
        fluentd-async-connect: "true"
        labels: "environment,service_name,version"
        env: "NODE_ENV,SERVICE_VERSION"
    labels:
      environment: production
      service_name: api
      version: "1.0.0"
    environment:
      NODE_ENV: production
      SERVICE_VERSION: "1.0.0"
  
  web:
    image: nginx
    logging:
      driver: fluentd
      options:
        fluentd-address: fluentd:24224
        tag: web.{{.Name}}
        fluentd-async: "true"
    depends_on:
      - app
  
  db:
    image: postgres
    logging:
      driver: fluentd
      options:
        fluentd-address: fluentd:24224
        tag: db.{{.Name}}
        fluentd-async: "true"
  
  fluentd:
    image: fluentd/fluentd:v1.16-1
    volumes:
      - ./fluentd/conf:/fluentd/etc
      - fluentd-data:/fluentd/log
    ports:
      - "24224:24224"
      - "24224:24224/udp"
    environment:
      - FLUENTD_CONF=fluent.conf
    restart: always

volumes:
  fluentd-data:

Example Fluentd configuration (fluent.conf):

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

# Parse Docker logs
<filter **>
  @type parser
  key_name log
  reserve_data true
  remove_key_name_field true
  <parse>
    @type json
    json_parser json
  </parse>
</filter>

# Add Kubernetes metadata if running in K8s
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

# Output to Elasticsearch and keep a local copy of logs for debugging.
# A single <match **> with @type copy is used because Fluentd routes each
# event only to the first matching <match> block.
<match **>
  @type copy
  <store>
    @type elasticsearch
    host elasticsearch
    port 9200
    logstash_format true
    logstash_prefix fluentd
    <buffer>
      @type file
      path /fluentd/log/buffer
      flush_thread_count 2
      flush_interval 5s
      chunk_limit_size 2M
      queue_limit_length 32
      retry_max_interval 30
      retry_forever true
    </buffer>
  </store>
  <store>
    @type file
    path /fluentd/log/${tag}/%Y/%m/%d.%H.%M
    append true
    <format>
      @type json
    </format>
    <buffer tag,time>
      @type file
      path /fluentd/log/file-buffer
      timekey 1h
      timekey_use_utc true
      timekey_wait 10m
    </buffer>
  </store>
</match>

Alternative log aggregation solutions:

  • Filebeat: Lightweight log shipper from Elastic Stack
    filebeat:
      image: docker.elastic.co/beats/filebeat:8.8.0
      volumes:
        - ./filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
        - /var/lib/docker/containers:/var/lib/docker/containers:ro
        - /var/run/docker.sock:/var/run/docker.sock:ro
      user: root
      restart: always
    
  • Logstash: More powerful log processing pipeline
    logstash:
      image: docker.elastic.co/logstash/logstash:8.8.0
      volumes:
        - ./logstash/pipeline:/usr/share/logstash/pipeline:ro
        - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
      ports:
        - "5000:5000/tcp"
        - "5000:5000/udp"
        - "9600:9600"
      environment:
        LS_JAVA_OPTS: "-Xmx512m -Xms512m"
      restart: always
    
  • Vector: High-performance observability data pipeline
    vector:
      image: timberio/vector:0.29.1-alpine
      volumes:
        - ./vector.toml:/etc/vector/vector.toml:ro
        - /var/lib/docker/containers:/var/lib/docker/containers:ro
        - /var/run/docker.sock:/var/run/docker.sock:ro
      ports:
        - "8686:8686"
      restart: always
    

Centralized Logging

ELK Stack Example

A production-ready ELK stack deployment with proper configuration and resource settings:

version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.8.1
    container_name: elasticsearch
    environment:
      - node.name=elasticsearch
      - cluster.name=docker-cluster
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=true
      - ELASTIC_PASSWORD=changeme
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks:
      - elk
    healthcheck:
      test: ["CMD-SHELL", "curl -s http://localhost:9200 | grep -q 'You Know, for Search'"]
      interval: 10s
      timeout: 10s
      retries: 120
  
  logstash:
    image: docker.elastic.co/logstash/logstash:8.8.1
    container_name: logstash
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
    ports:
      - "5044:5044"
      - "5000:5000/tcp"
      - "5000:5000/udp"
      - "9600:9600"
    environment:
      LS_JAVA_OPTS: "-Xmx256m -Xms256m"
    networks:
      - elk
    depends_on:
      - elasticsearch
  
  kibana:
    image: docker.elastic.co/kibana/kibana:8.8.1
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_URL=http://elasticsearch:9200
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=changeme
    networks:
      - elk
    depends_on:
      - elasticsearch
    healthcheck:
      test: ["CMD-SHELL", "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'"]
      interval: 10s
      timeout: 10s
      retries: 120
      
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.8.1
    container_name: filebeat
    command: filebeat -e -strict.perms=false
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    user: root
    networks:
      - elk
    depends_on:
      - elasticsearch
      - logstash

networks:
  elk:
    driver: bridge

volumes:
  es_data:
    driver: local

## Application Logging Best Practices

::steps
### Log to STDOUT/STDERR
- **Write logs to standard output and error streams**
  - Docker captures anything written to stdout/stderr
  - No need to manage log files within containers
  - Reduces complexity and disk space issues
  - Allows container restarts without losing logs
  - Makes logs available through `docker logs` command

- **Let Docker handle log collection**
  - Docker daemon manages log storage and rotation
  - Consistent logging mechanism across all containers
  - Enables centralized log configuration
  - Simplifies logging architecture
  - Allows switching logging drivers without application changes

- **Follows the 12-factor app methodology**
  - Principle IV: Treat logs as event streams
  - Application doesn't concern itself with storage/routing
  - Decouples log generation from log processing
  - Enables easier horizontal scaling
  - Promotes separation of concerns

- **Example logging practices:**
  ```javascript
  // Node.js example - good practice
  console.log(JSON.stringify({
    level: 'info',
    message: 'User logged in',
    timestamp: new Date().toISOString(),
    userId: user.id
  }));
  
  // Bad practice - writing to files
  // fs.appendFileSync('/var/log/app.log', 'User logged in\n');
  ```

Structured Logging

  • Use JSON or other structured formats
    • Enables machine-readable log processing
    • Maintains relationships between data fields
    • Simplifies parsing and querying
    • Preserves data types and hierarchies
    • Facilitates automated analysis
  • Include essential metadata fields
    • timestamp: ISO 8601 format with timezone (e.g., 2023-07-01T12:34:56.789Z)
    • level: Severity level (e.g., debug, info, warn, error)
    • service: Service or component name
    • message: Human-readable description
    • traceId: Distributed tracing identifier
  • Add correlation IDs for request tracking
    • Enables tracking requests across multiple services
    • Helps with distributed system debugging
    • Simplifies complex workflow analysis
    • Essential for microservice architectures
    • Example fields: requestId, correlationId, traceId, spanId
  • Include contextual information
    • User identifiers (anonymized if needed)
    • Resource identifiers (e.g., orderId, productId)
    • Operation being performed
    • Source information (e.g., IP address, user agent)
    • Performance metrics (e.g., duration, resource usage)
  • Example structured log format:
    {
      "timestamp": "2023-07-01T12:34:56.789Z",
      "level": "error",
      "service": "payment-service",
      "message": "Payment processing failed",
      "traceId": "abc123def456",
      "userId": "user-789",
      "orderId": "order-456",
      "error": {
        "code": "INSUFFICIENT_FUNDS",
        "message": "Insufficient funds in account"
      },
      "paymentMethod": "credit_card",
      "amount": 99.95,
      "duration_ms": 236
    }
    

Log Levels

  • Implement proper log levels
    • DEBUG: Detailed information for development/debugging
    • INFO: Confirmation that things are working as expected
    • WARN: Something unexpected but not necessarily an error
    • ERROR: Something failed that should be investigated
    • FATAL/CRITICAL: System is unusable, immediate attention required
  • Configure appropriate level for each environment
    • Development: DEBUG or INFO for maximum visibility
    • Testing/QA: INFO or WARN to reduce noise
    • Production: WARN or ERROR to minimize performance impact
    • Use environment variables to control log levels
    • Example:
      # Set log level via environment variable
      docker run -e LOG_LEVEL=INFO my-app
      
  • Security considerations
    • Never log credentials, tokens, or API keys
    • Hash or mask sensitive personal information
    • Comply with relevant regulations (GDPR, CCPA, HIPAA)
    • Be cautious with stack traces in production
    • Implement log field redaction for sensitive data
    • Example:
      // Logging with redaction
      logger.info({
        user: { id: user.id, email: redactEmail(user.email) },
        action: "profile_update",
        changes: redactSensitiveFields(changes)
      });
      
  • Include actionable information in error logs
    • Error type and message
    • Stack trace (in development/testing)
    • Context that led to the error
    • Request parameters (sanitized)
    • Correlation IDs for tracing
    • Suggested resolution steps where applicable
  • Performance considerations
    • Log volume impacts system performance
    • Use sampling for high-volume debug logs
    • Consider asynchronous logging for performance-critical paths
    • Implement circuit breakers for logging failures
    • Monitor and alert on abnormal log volume
::

Health Checks & Monitoring

Health checks provide automated monitoring of container health status, enabling Docker to detect and recover from application failures.

# Add health check to Dockerfile
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
  CMD curl -f http://localhost/ || exit 1

Health check parameters:

  • interval: Time between checks (default: 30s)
  • timeout: Maximum time for check to complete (default: 30s)
  • start-period: Initial grace period (default: 0s)
  • retries: Consecutive failures before unhealthy (default: 3)

Health checks can also be defined in Docker Compose:

version: '3.8'
services:
  web:
    image: nginx
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
  
  api:
    image: my-api:latest
    healthcheck:
      test: ["CMD", "wget", "-O", "/dev/null", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 5
      start_period: 30s
  
  redis:
    image: redis:alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
  
  postgres:
    image: postgres:13
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s

Health check best practices:

  • Design checks that validate core functionality, not just that the process is running
  • Keep checks lightweight to avoid resource consumption
  • Include proper timeouts to prevent hanging checks
  • Set appropriate start periods for slow-starting applications
  • Implement health endpoints that check dependencies
  • Use exit codes properly (0 = healthy, 1 = unhealthy)
  • Avoid complex scripts that might fail for reasons unrelated to application health

Health check states:

  • starting: During start period, not counted as unhealthy yet
  • healthy: Check is passing
  • unhealthy: Check is failing

You can check container health status:

docker inspect --format='{{.State.Health.Status}}' container_name
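
Health transitions are also published as events, which makes it straightforward to watch for failures or list unhealthy containers (supported on current Docker versions):

# Stream health status changes as they happen
docker events --filter 'event=health_status'

# List containers currently reporting unhealthy
docker ps --filter 'health=unhealthy'

# Show the output of recent health check runs for a container
docker inspect --format '{{json .State.Health.Log}}' container_name | jq '.'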

Monitoring Metrics

Prometheus Configuration

Example prometheus.yml for comprehensive Docker monitoring:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

  # Add labels to all time series or alerts
  external_labels:
    environment: production
    region: us-west-1

# Rule files contain recording rules and alerting rules
rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  # Docker daemon metrics
  - job_name: 'docker'
    static_configs:
      - targets: ['docker-host:9323']
    metrics_path: '/metrics'
    scheme: 'http'
  
  # Container metrics from cAdvisor
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    metrics_path: '/metrics'
    scheme: 'http'
    scrape_interval: 10s
  
  # Host metrics from node-exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    metrics_path: '/metrics'
    scheme: 'http'
  
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # Application metrics
  - job_name: 'application'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/actuator/prometheus'
    scheme: 'http'
  
  # Auto-discover containers with Prometheus annotations
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        filters:
          - name: label
            values: ['prometheus.io.scrape=true']
    relabel_configs:
      # Use the container name as the instance label
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'instance'
        replacement: '$1'
      # Extract metrics path from container label
      - source_labels: ['__meta_docker_container_label_prometheus_io_metrics_path']
        regex: '(.+)'
        target_label: '__metrics_path__'
        replacement: '$1'
      # Build the scrape address from the container IP and the port label
      - source_labels: ['__meta_docker_network_ip', '__meta_docker_container_label_prometheus_io_port']
        regex: '(.+);(.+)'
        target_label: '__address__'
        replacement: '$1:$2'
      # Add container labels as Prometheus labels
      - action: labelmap
        regex: __meta_docker_container_label_(.+)

For alerting, add an AlertManager configuration:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Example alert rules file (/etc/prometheus/rules/container_alerts.yml)
groups:
- name: container_alerts
  rules:
  - alert: ContainerCpuUsage
    expr: (sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) / scalar(count(node_cpu_seconds_total{mode="idle"}))) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage (instance {{ $labels.instance }})"
      description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}%\n  LABELS = {{ $labels }}"
  
  - alert: ContainerMemoryUsage
    expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container Memory usage (instance {{ $labels.instance }})"
      description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}%\n  LABELS = {{ $labels }}"

Grafana Dashboard Setup

Basic Setup

  1. Access Grafana (default: http://localhost:3000)
    • Ensure Grafana service is running and port is accessible
    • Check for any proxy or network restrictions
  2. Login with default credentials (admin/admin)
    • You'll be prompted to change the password on first login
    • Set a strong password and store it securely
    • Consider setting up additional users with appropriate permissions
  3. Add Prometheus data source
    • Navigate to Configuration > Data Sources
    • Click "Add data source" and select "Prometheus"
    • Set URL to http://prometheus:9090 (or appropriate address)
    • Set scrape interval to match Prometheus configuration
    • Test the connection to ensure it works
    • Advanced settings:
      HTTP Method: GET
      Type: Server (default)
      Access: Server (default)
      Disable metrics lookup: No
      Custom query parameters: None
      
  4. Import Docker monitoring dashboards
    • Navigate to Dashboards > Import
    • Enter dashboard ID or upload JSON file
    • Recommended dashboard IDs:
      • 893: Docker and system monitoring (1 server)
      • 10619: Docker monitoring with Prometheus
      • 11467: Container metrics dashboard
      • 1860: Node Exporter Full
    • Adjust variables to match your environment
    • Save dashboard with appropriate name and folder
  5. Configure alerts
    • Navigate to Alerting > Alert Rules
    • Create alert rules based on critical metrics
    • Set appropriate thresholds and evaluation intervals
    • Configure notification channels (email, Slack, PagerDuty, etc.)
    • Test alerts to ensure proper delivery
    • Example alert rule:
      Rule name: High Container CPU Usage
      Data source: Prometheus
      Expression: max by(name) (rate(container_cpu_usage_seconds_total{name!=""}[1m]) * 100) > 80
      Evaluation interval: 1m
      Pending period: 5m
      
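Data sources can also be created without the UI through Grafana's HTTP API, which is convenient for scripted setups; the credentials and URLs below match the Compose example earlier and are otherwise assumptions:

# Create the Prometheus data source via the Grafana API
curl -s -u admin:admin_password \
  -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus:9090","access":"proxy","isDefault":true}'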

Dashboard Recommendations

  • Docker Host metrics dashboard
    • System load, CPU, memory, disk, and network
    • Host uptime and stability metrics
    • Docker daemon metrics
    • Number of running containers
    • Resource utilization trends
    • Example panels:
      • CPU Usage by Container (stacked graph)
      • Memory Usage by Container (stacked graph)
      • Container Status Count (stat panel)
      • System Load (gauge)
      • Disk Space Usage (pie chart)
  • Container resource usage dashboard
    • Per-container CPU, memory, and I/O metrics
    • Container restart counts
    • Health check status
    • Network traffic by container
    • Key visualizations:
      • Heatmap of container resource usage
      • Time-series charts for each resource type
      • Top N resource consumers (table)
      • Container lifecycle events (annotations)
      • Resource limit vs. actual usage comparison
  • Application-specific metrics dashboards
    • Business KPIs relevant to your application
    • Request rates, error rates, and latencies
    • Database connection pool status
    • Cache hit ratios
    • Custom instrumentation metrics
    • User experience metrics
    • Example: E-commerce dashboard with:
      • Orders per minute
      • Cart abandonment rate
      • Payment processing time
      • Product search latency
      • Active user sessions
  • Alert thresholds visualization
    • Combine metrics with alert thresholds
    • Visual indicators for approaching thresholds
    • Alert history and frequency
    • Mean time to resolution tracking
    • Alert correlation with system events
    • Example panel: Graph with colored threshold bands
  • Log correlation views
    • Combined metrics and log panels
    • Error rate correlation with log volume
    • Event markers on time-series charts
    • Log context for anomalies
    • Drill-down from metrics to logs
    • Example: Request latency chart with error log entries as annotations

Alert Configuration

# Comprehensive Alertmanager configuration example
global:
  # The smarthost and SMTP sender used for mail notifications
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  
  # The auth token for Slack
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXX'
  
  # PagerDuty integration
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  
  # Default notification settings
  resolve_timeout: 5m

# Templates for notifications
templates:
  - '/etc/alertmanager/template/*.tmpl'

# The root route on which all alerts enter
route:
  # Default receiver
  receiver: 'team-emails'
  
  # Group alerts by category and environment
  group_by: ['alertname', 'environment', 'service']
  
  # Wait 30s to aggregate alerts of the same group
  group_wait: 30s
  
  # Send updated notification if new alerts added to group
  group_interval: 5m
  
  # How long to wait before sending a notification again
  repeat_interval: 4h
  
  # Child routes
  routes:
  - receiver: 'critical-pages'
    matchers:
      - severity="critical"
    # Don't wait to send critical alerts
    group_wait: 0s
    # Lower interval for critical alerts
    repeat_interval: 1h
    # Continue to forward to other child routes
    continue: true
    
  - receiver: 'slack-notifications'
    matchers:
      - severity=~"warning|info"
    # Continue processing other child routes
    continue: true
    
  - receiver: 'database-team'
    matchers:
      - service=~"database|postgres|mysql"
    
  - receiver: 'frontend-team'
    matchers:
      - service=~"frontend|ui|web"

# Inhibition rules prevent notifications of lower severity alerts if a higher severity
# alert is already firing for the same issue
inhibit_rules:
- source_matchers:
    - severity="critical"
  target_matchers:
    - severity="warning"
  # Only inhibit if the alertname is the same
  equal: ['alertname', 'instance']

receivers:
  - name: 'team-emails'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
        html: '{{ template "email.default.html" . }}'
        headers:
          Subject: '{{ template "email.subject" . }}'
  
  - name: 'critical-pages'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        send_resolved: true
        description: '{{ template "pagerduty.default.description" . }}'
        severity: 'critical'
        client: 'Alertmanager'
        client_url: 'https://alertmanager.example.com'
  
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#monitoring'
        send_resolved: true
        icon_url: 'https://avatars3.githubusercontent.com/u/3380462'
        title: '{{ template "slack.default.title" . }}'
        title_link: 'https://alertmanager.example.com/#/alerts'
        text: '{{ template "slack.default.text" . }}'
        footer: 'Prometheus Alertmanager'
        actions:
          - type: 'button'
            text: 'Runbook 📚'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
  
  - name: 'database-team'
    slack_configs:
      - channel: '#db-alerts'
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
  
  - name: 'frontend-team'
    slack_configs:
      - channel: '#frontend-alerts'
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'

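As with Prometheus, the Alertmanager configuration can be validated before deployment using amtool from the official image (the image tag here is an assumption):

# Check the Alertmanager configuration for syntax and schema errors
docker run --rm -v "$PWD/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro" \
  --entrypoint amtool prom/alertmanager:v0.25.0 \
  check-config /etc/alertmanager/alertmanager.yml
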
Debugging with Logs

Effective log analysis is crucial for troubleshooting container issues. Here are practical techniques for extracting valuable information from container logs:

# Debugging container issues with error filtering
docker logs --tail=100 -t container_name | grep ERROR

# Follow logs with specific pattern matching (using extended regex)
docker logs -f container_name | grep -E "error|exception|fail|fatal|panic"

# Extract logs with context (3 lines before and after each match)
docker logs container_name | grep -A 3 -B 3 "Exception"

# Find all occurrences of specific error types and count them
docker logs container_name | grep -o "OutOfMemoryError" | wc -l

# Search for logs within a specific time window
docker logs --since 2023-07-01T10:00:00 --until 2023-07-01T11:00:00 container_name

# Save logs to file for offline analysis
docker logs container_name > container.log

# Compare logs across time periods
docker logs --since 2h container_name > recent.log
docker logs --since 4h --until 2h container_name > older.log
diff recent.log older.log | less

# Parse JSON logs for better readability
docker logs container_name | grep -v '^$' | jq '.'

# Extract specific fields from JSON logs
docker logs container_name | grep -v '^$' | jq 'select(.level=="error") | {timestamp, message, error}'

# Follow logs and highlight different log levels with colors
docker logs -f container_name | GREP_COLOR="01;31" grep -E --color=always 'ERROR|$' | GREP_COLOR="01;33" grep -E --color=always 'WARN|$'

# Extract logs for a specific request ID
docker logs container_name | grep "request-123456"

# Find slow operations (e.g., queries taking more than 1 second)
docker logs container_name | grep -E "took [1-9][0-9]{3,}ms"

# Extract stack traces
docker logs container_name | grep -A 20 "Exception" | grep -v "^$"

# Analyze log volume by time (log lines per minute)
docker logs -t container_name | cut -d' ' -f1 | cut -d'T' -f2 | cut -c1-5 | sort | uniq -c

Advanced debugging techniques:

  • Correlate logs with system events (deployments, scaling, etc.)
  • Compare logs across multiple containers for distributed issues
  • Use timestamps to create a sequence of events
  • Check container environment variables for configuration issues
  • Analyze container startup logs separately from runtime logs
  • Monitor both application and infrastructure logs simultaneously
  • Use regex patterns to extract structured data from unstructured logs

Advanced Monitoring Topics

Distributed Tracing

  • Implement OpenTelemetry or Jaeger
    • Track requests as they flow through distributed systems
    • Generate trace IDs to correlate logs across services
    • Instrument code with OpenTelemetry SDK
    • Deploy Jaeger or Zipkin collectors
    • Example Jaeger deployment:
      version: '3.8'
      services:
        jaeger:
          image: jaegertracing/all-in-one:1.37
          ports:
            - "6831:6831/udp"  # Jaeger agent - accepts spans in Thrift format
            - "16686:16686"    # Jaeger UI
          environment:
            - COLLECTOR_ZIPKIN_HOST_PORT=:9411
      
  • Trace requests across multiple containers
    • Propagate trace context between services
    • Capture request path across microservices
    • Record parent-child relationships between spans
    • Preserve baggage items for request context
    • Example client instrumentation (Node.js):
      const tracer = opentelemetry.trace.getTracer('my-service');
      const span = tracer.startSpan('process-order');
      try {
        // Add attributes to span
        span.setAttribute('order.id', orderId);
        span.setAttribute('customer.id', customerId);
        
        // Create child span
        const dbSpan = tracer.startSpan('database-query', {
          parent: span,
        });
        // Process database operations
        dbSpan.end();
        
        // Propagate context to HTTP requests
        const headers = {};
        opentelemetry.propagation.inject(opentelemetry.context.active(), headers);
        await axios.post('http://inventory-service/check', data, { headers });
      } catch (error) {
        span.recordException(error);
        span.setStatus({ code: SpanStatusCode.ERROR });
      } finally {
        span.end();
      }
      
  • Measure service latency
    • Calculate time spent in each service
    • Break down processing time by operation
    • Identify slow components or dependencies
    • Compare latency across different environments
    • Correlate latency with resource utilization
    • Example span attributes for latency analysis:
      database.query.duration_ms: 45.2
      http.request.duration_ms: 120.7
      cache.lookup_time_ms: 2.1
      business_logic.processing_time_ms: 15.8
      
  • Identify bottlenecks
    • Analyze critical path in request processing
    • Find operations with highest latency contribution
    • Detect contention points and resource constraints
    • Quantify impact of external dependencies
    • Trace-based hotspot analysis techniques
    • Example hotspot dashboard showing service latency breakdown
  • Visualize service dependencies
    • Generate service dependency graphs
    • Analyze traffic patterns between services
    • Calculate error rates between service pairs
    • Identify redundant or unnecessary calls
    • Detect circular dependencies
    • Example visualization tools:
      • Jaeger UI service graph
      • Grafana service graph panel
      • Kiali for service mesh visualization
      • Custom D3.js visualization

Custom Metrics

  • Expose application-specific metrics
    • Identify key business and technical metrics
    • Instrument application code with metrics
    • Expose metrics endpoints (/metrics)
    • Design meaningful metric names and labels
    • Balance cardinality and usefulness
    • Example custom metrics:
      # Counter for business events
      order_total{status="completed",payment_method="credit_card"} 1550.75
      
      # Gauge for active resource usage
      active_user_sessions{region="us-west"} 1250
      
      # Histogram for latency distributions
      api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="0.1"} 1500
      api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="0.5"} 1950
      api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="1.0"} 1990
      api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="+Inf"} 2000
      
  • Implement Prometheus client libraries
    • Use official client libraries for language-specific instrumentation
    • Create counters, gauges, histograms, and summaries
    • Register metrics with registry
    • Set up middleware for standard metrics
    • Add custom labels for filtering and aggregation
    • Example implementation (Python):
      from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
      
      # Create metrics
      REQUEST_COUNT = Counter('app_request_count', 'Application Request Count', ['method', 'endpoint', 'status'])
      REQUEST_LATENCY = Histogram('app_request_latency_seconds', 'Application Request Latency', ['method', 'endpoint'])
      ACTIVE_SESSIONS = Gauge('app_active_sessions', 'Active Sessions', ['region'])
      
      # Start metrics endpoint
      start_http_server(8000)
      
      # Update metrics in code
      def process_request(request):
          ACTIVE_SESSIONS.labels(region='us-west').inc()
          
          with REQUEST_LATENCY.labels(method='POST', endpoint='/api/v1/order').time():
              # Process request
              result = handle_request(request)
          
          REQUEST_COUNT.labels(method='POST', endpoint='/api/v1/order', status=result.status_code).inc()
          ACTIVE_SESSIONS.labels(region='us-west').dec()
          
          return result
      
  • Create custom dashboards
    • Design purpose-specific visualization panels
    • Combine technical and business metrics
    • Create drill-down capabilities
    • Use appropriate visualization types
    • Implement dynamic variables for filtering
    • Example Grafana dashboard JSON structure:
      {
        "title": "E-Commerce Platform Dashboard",
        "panels": [
          {
            "title": "Order Volume by Payment Method",
            "type": "barchart",
            "datasource": "Prometheus",
            "targets": [
              {
                "expr": "sum by(payment_method) (order_total)",
                "legendFormat": "{{payment_method}}"
              }
            ]
          },
          {
            "title": "API Latency (95th Percentile)",
            "type": "timeseries",
            "datasource": "Prometheus",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, endpoint))",
                "legendFormat": "{{endpoint}}"
              }
            ]
          }
        ]
      }
      
  • Set relevant thresholds
    • Define SLIs (Service Level Indicators) and SLOs (Service Level Objectives)
    • Create alert thresholds based on business impact
    • Implement multi-level thresholds (warning, critical)
    • Use historical data to establish baselines
    • Account for traffic patterns and seasonality
    • Example alert thresholds:
      - alert: APIHighLatency
        expr: histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, endpoint)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency on {{ $labels.endpoint }}"
          description: "95th percentile latency is {{ $value }}s, which exceeds the SLO of 0.5s"
      
  • Correlate with business metrics
    • Connect technical metrics to business outcomes
    • Measure conversion impact of performance issues
    • Track costs associated with resource usage
    • Create composite KPI dashboards
    • Implement business impact scoring
    • Example correlation queries:
      # Conversion rate vs. page load time
      sum(rate(purchase_completed_total[1h])) / sum(rate(product_page_view_total[1h]))
      
      # Revenue impact of errors
      sum(rate(order_total{status="error"}[1h]))
      
      # Customer satisfaction correlation
      rate(support_ticket_created{category="performance"}[1d]) / rate(active_user_sessions[1d])
      

Automated Responses

  • Implement auto-scaling based on metrics
    • Configure horizontal pod autoscaling
    • Use custom metrics for scaling decisions
    • Set appropriate cooldown periods
    • Implement predictive scaling for known patterns
    • Test scaling behavior under various conditions
    • Example Docker Swarm service scaling:
      # Autoscaling with docker service update
      while true; do
        # Get current metrics
        REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=sum\(rate\(http_requests_total\[1m\]\)\) | jq -r '.data.result[0].value[1]')
        
        # Calculate desired replicas (1 replica per 100 requests/second)
        DESIRED=$(echo "$REQUESTS / 100" | bc)
        if [ $DESIRED -lt 2 ]; then DESIRED=2; fi
        if [ $DESIRED -gt 10 ]; then DESIRED=10; fi
        
        # Get current replicas
        CURRENT=$(docker service ls --filter name=myapp --format "{{.Replicas}}" | cut -d '/' -f1)
        
        # Scale if needed
        if [ $DESIRED -ne $CURRENT ]; then
          echo "Scaling from $CURRENT to $DESIRED replicas"
          docker service update --replicas $DESIRED myapp
        fi
        
        sleep 30
      done
      
  • Configure auto-healing for failed containers
    • Set appropriate restart policies
    • Implement health checks for accurate failure detection
    • Configure liveness and readiness probes
    • Capture diagnostic information before restarts
    • Implement circuit breakers for dependent services
    • Example Docker Compose configuration:
      services:
        app:
          image: myapp:latest
          restart: unless-stopped
          healthcheck:
            test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
            interval: 10s
            timeout: 5s
            retries: 3
            start_period: 40s
          deploy:
            restart_policy:
              condition: on-failure
              max_attempts: 3
              window: 120s
          logging:
            driver: "json-file"
            options:
              max-size: "10m"
              max-file: "3"
      
  • Create runbooks for common issues
    • Document standard troubleshooting procedures
    • Include diagnostic commands and expected output
    • Link alerts to specific runbook sections
    • Provide escalation paths for unresolved issues
    • Maintain version control for runbooks
    • Example runbook structure:
      # API Service High Latency Runbook
      
      ## Symptoms
      - API response time > 500ms (95th percentile)
      - Increased error rate in downstream services
      - Alert: APIHighLatency firing
      
      ## Diagnostic Steps
      1. Check container resource usage:
      
      docker stats api-service
      
      2. Examine database connection pool:
      
      curl http://api-service:8080/actuator/metrics/hikaricp.connections.usage
      
      3. Check for slow queries:
      
      docker logs api-service | grep "slow query"
      
      ## Resolution Steps
      1. If connection pool utilization > 80%:
      - Increase pool size in config
      - Restart service with `docker restart api-service`
      
      2. If slow queries detected:
      - Check database indexes
      - Optimize identified queries
      
      3. If CPU/memory usage high:
      - Scale service: `docker service scale api-service=3`
      
      ## Escalation
      If issue persists after 15 minutes:
      - Contact: database-team@example.com
      - Slack: #database-support
      
  • Develop automated remediation
    • Implement scripted responses to common problems
    • Create self-healing capabilities
    • Add circuit breakers for degraded dependencies
    • Implement graceful degradation modes
    • Balance automation with human oversight
    • Example automated remediation script:
      #!/bin/bash
      # Automated database connection reset
      
      # Check connection errors
      ERROR_COUNT=$(docker logs --since 5m db-service | grep "connection reset" | wc -l)
      
      if [ $ERROR_COUNT -gt 10 ]; then
        echo "Detected database connection issues, performing remediation"
        
        # Capture diagnostics before remediation
        docker logs --since 15m db-service > /var/log/remediation/db-$(date +%s).log
        
        # Perform remediation
        docker exec db-service /scripts/connection-reset.sh
        
        # Verify fix
        sleep 5
        NEW_ERRORS=$(docker logs --since 1m db-service | grep "connection reset" | wc -l)
        
        # Notify on outcome
        if [ $NEW_ERRORS -eq 0 ]; then
          echo "Remediation successful" | slack-notify "#monitoring"
        else
          echo "Remediation failed, escalating" | slack-notify "#monitoring" --urgent
          # Trigger PagerDuty incident
          pagerduty-trigger "Database connection issues persist after remediation"
        fi
      fi
      
  • Setup escalation policies
    • Define clear escalation thresholds
    • Create tiered response teams
    • Implement on-call rotation schedules
    • Track mean time to acknowledge and resolve
    • Document communication protocols
    • Example escalation policy:
      # PagerDuty escalation policy
      escalation_policies:
        - name: "API Service Escalation"
          description: "Escalation policy for API service incidents"
          num_loops: 2
          escalation_rules:
            - escalation_delay_in_minutes: 15
              targets:
                - id: "PXXXXXX"  # Primary on-call engineer
                  type: "user_reference"
            - escalation_delay_in_minutes: 15
              targets:
                - id: "PXXXXXX"  # Secondary on-call engineer
                  type: "user_reference"
            - escalation_delay_in_minutes: 30
              targets:
                - id: "PXXXXXX"  # Engineering manager
                  type: "user_reference"
      

Performance Monitoring

Best Practices Checklist

Logging Best Practices

  • Log to STDOUT/STDERR
    • Follow container best practices
    • Enable centralized collection
    • Avoid managing log files inside containers
    • Let Docker logging drivers handle transport
    • Example: Configure applications to write directly to stdout/stderr rather than log files
  • Use structured logging format
    • Implement JSON-formatted logs
    • Include consistent metadata fields
    • Use proper data types within JSON
    • Add correlation IDs for request tracking
    • Maintain schema consistency
    • Example structured log:
      {
        "timestamp": "2023-07-10T15:23:45.123Z",
        "level": "error",
        "service": "order-service",
        "message": "Failed to process payment",
        "traceId": "abc123def456",
        "orderId": "order-789",
        "errorCode": "PAYMENT_DECLINED",
        "duration_ms": 345
      }
      
  • Implement log rotation
    • Configure size and time-based rotation
    • Set appropriate retention periods
    • Compress rotated logs
    • Monitor disk usage
    • Handle rotation gracefully
    • Example Docker logging configuration:
      {
        "log-driver": "json-file",
        "log-opts": {
          "max-size": "20m",
          "max-file": "5",
          "compress": "true"
        }
      }
      
  • Set appropriate log levels
    • Use DEBUG for development environments
    • Use INFO or WARN for production
    • Reserve ERROR for actionable issues
    • Make log levels configurable at runtime
    • Use consistent level definitions across services
    • Example log level configuration:
      logging:
        level:
          root: WARN
          com.example.api: INFO
          com.example.database: WARN
          com.example.payment: INFO
      
  • Configure centralized logging
    • Aggregate logs from all containers
    • Implement proper indexing and search
    • Set up log parsing and normalization
    • Configure access controls for logs
    • Establish retention and archival policies
    • Example EFK stack setup:
      • Filebeat or Fluentd to collect logs
      • Elasticsearch for storage and indexing
      • Kibana for visualization and search
      • Curator for index lifecycle management
  • Avoid sensitive data in logs
    • Implement data masking for PII
    • Never log credentials or secrets
    • Truncate potentially large payloads
    • Remove sensitive headers
    • Anonymize personal identifiers
    • Example masking implementation:
      function logRequest(req) {
        const sanitized = {
          method: req.method,
          path: req.path,
          query: sanitizeObject(req.query),
          headers: sanitizeHeaders(req.headers),
          body: sanitizeObject(req.body)
        };
        logger.info({ request: sanitized }, "Incoming request");
      }
      
      function sanitizeObject(obj) {
        const masked = {...obj};
        const sensitiveFields = ['password', 'token', 'ssn', 'credit_card'];
        
        for (const field of sensitiveFields) {
          if (masked[field]) masked[field] = '***REDACTED***';
        }
        return masked;
      }
      

Monitoring Best Practices

  • Monitor both host and containers
    • Track host-level resources (CPU, memory, disk, network)
    • Monitor container-specific metrics
    • Correlate container performance with host constraints
    • Watch for noisy neighbor problems
    • Track Docker daemon health
    • Example monitoring stack:
      • Node Exporter for host metrics
      • cAdvisor for container metrics
      • Docker daemon metrics endpoint
      • Process-specific metrics from applications
  • Implement alerting with appropriate thresholds
    • Create multi-level alerts (warning/critical)
    • Avoid alert fatigue with proper thresholds
    • Include runbook links in alert notifications
    • Group related alerts to reduce noise
    • Implement alert de-duplication
    • Example alerting configuration:
      - alert: ContainerHighCpuUsage
        expr: rate(container_cpu_usage_seconds_total{name!=""}[1m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU usage high ({{ $labels.name }})"
          description: "Container CPU usage is above 80% for 5 minutes"
          runbook_url: "https://wiki.example.com/runbooks/container-cpu"
      
  • Use visualization dashboards
    • Create role-specific dashboards
    • Include both overview and detailed views
    • Use appropriate visualization types
    • Implement template variables for filtering
    • Balance information density with readability
    • Example dashboard organization:
      • Executive overview: High-level health and KPIs
      • Operations dashboard: System health and resource usage
      • Developer dashboard: Application performance and errors
      • Service-specific dashboards: Detailed metrics for each service
      • On-call dashboard: Current alerts and recent incidents
  • Track business-relevant metrics
    • Monitor key performance indicators (KPIs)
    • Create business-technical correlation views
    • Measure user experience metrics
    • Track conversion and engagement metrics
    • Monitor transaction value and volume
    • Example business metrics:
      # Business metrics in Prometheus format
      # HELP order_value_total Total value of orders in currency
      # TYPE order_value_total counter
      order_value_total{currency="USD",status="completed"} 15234.50
      
      # HELP checkout_started Total number of checkout processes started
      # TYPE checkout_started counter
      checkout_started 5423
      
      # HELP checkout_completed Total number of checkout processes completed
      # TYPE checkout_completed counter
      checkout_completed 4231
      
  • Implement health checks
    • Create meaningful application health endpoints
    • Check dependencies in health probes
    • Implement readiness vs. liveness separation
    • Make health checks lightweight
    • Include version and dependency information
    • Example health check endpoint:
      app.get('/health', async (req, res) => {
        try {
          // Check database connectivity
          const dbStatus = await checkDatabase();
          
          // Check cache service
          const cacheStatus = await checkRedis();
          
          // Check external API dependency
          const apiStatus = await checkExternalApi();
          
          const allHealthy = dbStatus && cacheStatus && apiStatus;
          
          res.status(allHealthy ? 200 : 503).json({
            status: allHealthy ? 'healthy' : 'unhealthy',
            version: '1.2.3',
            timestamp: new Date().toISOString(),
            components: {
              database: dbStatus ? 'up' : 'down',
              cache: cacheStatus ? 'up' : 'down',
              api: apiStatus ? 'up' : 'down'
            }
          });
        } catch (error) {
          res.status(500).json({ status: 'error', error: error.message });
        }
      });
      
  • Plan for monitoring scalability
    • Design for growth in container count
    • Implement metric aggregation for high-cardinality data
    • Use appropriate retention policies
    • Consider resource requirements for monitoring tools
    • Implement federation for large-scale deployments
    • Example scalability techniques:
      • Prometheus hierarchical federation
      • Service discovery for dynamic environments
      • Metric aggregation and downsampling
      • Sharding metrics by service or namespace
      • Custom recording rules for common queries
      # Recording rules for efficient querying
      groups:
      - name: container_aggregation
        interval: 1m
        rules:
        - record: job:container_cpu:usage_rate5m
          expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (job)
        - record: job:container_memory:usage_bytes
          expr: sum(container_memory_usage_bytes) by (job)