Monitoring & Logging

Learn how to monitor Docker containers and implement effective logging strategies

Docker Monitoring & Logging

Effective monitoring and logging are essential for maintaining reliable Docker environments. They help with troubleshooting, performance optimization, and ensuring application health. A comprehensive monitoring and logging strategy provides visibility into container behavior, enables proactive issue detection, and helps maintain service level objectives.

Docker containers present unique monitoring challenges due to their ephemeral nature, isolation characteristics, and the dynamic environments they often operate in. An effective container observability strategy must account for these characteristics while providing meaningful insights across the entire container lifecycle.

Container Monitoring

Built-in Monitoring

Docker provides basic monitoring capabilities out of the box that require no additional tools or setup:

# View real-time resource usage for all containers
docker stats
# Sample output:
# CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
# 1a2b3c4d5e6f   web       0.25%     15.23MiB / 1.952GiB   0.76%     648B / 648B       0B / 0B           2
# 2a3b4c5d6e7f   redis     0.05%     2.746MiB / 1.952GiB   0.14%     1.45kB / 1.05kB   0B / 0B           5

# View detailed container stats in JSON format (useful for scripting)
docker stats --no-stream --format "{{json .}}" container_name | jq '.'
# Pretty prints JSON output with jq for better readability

# Filter stats to show only specific containers
docker stats $(docker ps --format '{{.Names}}' | grep "api-")

# Display specific columns only
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"

# Inspect container configuration details
docker inspect container_name

# Extract specific configuration information
docker inspect --format '{{.HostConfig.Memory}}' container_name
docker inspect --format '{{.State.Health.Status}}' container_name
docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' container_name

# View container events in real-time
docker events --filter 'type=container'

Built-in monitoring tools provide a good starting point but have limitations:

  • No historical data retention
  • Limited metrics granularity
  • No alerting capabilities
  • Minimal visualization options
  • Container-centric rather than application-centric view

cAdvisor

Google's container advisor (cAdvisor) provides more detailed container metrics with a web UI and exposes Prometheus-compatible endpoints:

# Run cAdvisor container with all necessary volume mounts
docker run -d \
  --name=cadvisor \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --privileged \
  --device=/dev/kmsg \
  --restart=always \
  gcr.io/cadvisor/cadvisor:v0.47.0

cAdvisor provides:

  • Detailed resource usage statistics (CPU, memory, network, filesystem)
  • Container lifecycle events
  • Historical performance data (short-term, in-memory storage)
  • Container hierarchy visualization
  • Prometheus metrics endpoint at /metrics
  • Hardware-specific monitoring (NVML for NVIDIA GPUs)
  • Auto-discovery of containers

Key metrics exposed by cAdvisor:

  • CPU usage breakdown (user, system, throttling)
  • Memory details (RSS, cache, swap, working set)
  • Network statistics (RX/TX bytes, errors, dropped packets)
  • Filesystem usage and I/O statistics
  • Per-process statistics within containers

Accessing the UI:

  • Web interface at http://localhost:8080
  • API endpoints for programmatic access:
    • /api/v1.3/containers - All container stats
    • /api/v1.3/docker/[container_name] - Specific container stats
    • /metrics - Prometheus-formatted metrics
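
The endpoints above can be checked quickly from the host. The sketch below assumes cAdvisor is published on localhost:8080 as in the run command earlier; exact JSON fields may vary between cAdvisor versions:

# Confirm the Prometheus endpoint is serving container metrics
curl -s http://localhost:8080/metrics | grep -m 5 '^container_cpu_usage_seconds_total'

# Query the REST API and list the names of discovered containers
curl -s http://localhost:8080/api/v1.3/containers | jq '[.subcontainers[]?.name]'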

Prometheus + Grafana

A powerful combination for metrics collection, storage, querying, and visualization, widely considered the industry standard for container monitoring:

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.44.0
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped
    
  node-exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped
    
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    restart: unless-stopped
    
  grafana:
    image: grafana/grafana:9.5.1
    container_name: grafana
    user: "472"
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin_password
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_DOMAIN=localhost
    restart: unless-stopped

volumes:
  prometheus_data: {}
  grafana_data: {}

Example prometheus.yml configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    
  - job_name: 'docker'
    metrics_path: /metrics
    static_configs:
      - targets: ['172.17.0.1:9323']  # Docker daemon metrics (requires daemon configuration)

  # Auto-discover and scrape containers with prometheus.io annotations
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        filters:
          - name: label
            values: ['prometheus.io.scrape=true']
    relabel_configs:
      # Build the scrape address from the container's network IP and the port label
      - source_labels: ['__meta_docker_network_ip', '__meta_docker_container_label_prometheus_io_port']
        regex: '(.+);(.+)'
        target_label: __address__
        replacement: '$1:$2'
      # Strip the leading slash from the container name
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container_name
      - source_labels: ['__meta_docker_container_label_prometheus_io_job']
        target_label: job

This setup provides:

  • Prometheus: Time-series database with a powerful query language (PromQL)
  • Node Exporter: Host-level metrics (CPU, memory, disk, network)
  • cAdvisor: Container-specific metrics
  • Grafana: Visualization dashboards and alerting
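
With the Compose file and prometheus.yml in place, you can bring the stack up and confirm that Prometheus has discovered its scrape targets (the port mapping matches the Compose example above):

# Start the monitoring stack
docker compose up -d

# List the jobs Prometheus is actively scraping
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels.job'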

You can enable Docker daemon metrics by adding to /etc/docker/daemon.json:

{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}
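
After restarting the daemon, a quick check confirms the endpoint is reachable (the port matches the metrics-addr above; metric names can vary between Docker versions):

sudo systemctl restart docker
curl -s http://localhost:9323/metrics | grep -m 5 '^engine_daemon'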

Common Prometheus metrics for containers:

  • container_cpu_usage_seconds_total - Cumulative CPU time consumed
  • container_memory_usage_bytes - Current memory usage, including cache
  • container_memory_rss - Resident Set Size (actual memory usage)
  • container_network_receive_bytes_total - Network bytes received
  • container_network_transmit_bytes_total - Network bytes sent
  • container_fs_reads_bytes_total - Bytes read from disk
  • container_fs_writes_bytes_total - Bytes written to disk

Example PromQL queries:

# CPU usage percentage per container
sum(rate(container_cpu_usage_seconds_total{name=~".+"}[1m])) by (name) * 100

# Memory usage in MB per container
sum(container_memory_usage_bytes{name=~".+"}) by (name) / 1024 / 1024

# Network receive throughput
rate(container_network_receive_bytes_total{name=~".+"}[5m])
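
The same queries can be run outside the UI through the Prometheus HTTP API, which is useful for scripting (assumes Prometheus is reachable on localhost:9090):

# Run a PromQL query via the HTTP API and print the resulting series
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{name=~".+"}[1m])) by (name) * 100' \
  | jq '.data.result[] | {container: .metric.name, cpu_percent: .value[1]}'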

Logging Strategies

Configure Logging Drivers

Logging drivers can be configured at the daemon level (for all containers) or per container:

# Set logging driver for a specific container
docker run --log-driver=syslog \
  --log-opt syslog-address=udp://logs.example.com:514 \
  --log-opt syslog-facility=daemon \
  --log-opt tag="{{.Name}}/{{.ID}}" \
  nginx

# Use JSON file driver with size-based rotation
docker run --log-driver=json-file \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  --log-opt compress=true \
  nginx

# Configure AWS CloudWatch logging
docker run --log-driver=awslogs \
  --log-opt awslogs-region=us-west-2 \
  --log-opt awslogs-group=my-container-logs \
  --log-opt awslogs-stream=web-app \
  my-web-app

# Send logs to Fluentd with custom labels
docker run --log-driver=fluentd \
  --log-opt fluentd-address=localhost:24224 \
  --log-opt tag="docker.{{.Name}}" \
  --log-opt fluentd-async=true \
  --log-opt labels=environment,service_name \
  --label environment=production \
  --label service_name=api \
  my-api-service

Configure default logging driver for all containers in /etc/docker/daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true",
    "labels": "environment,project",
    "env": "HOSTNAME,ENVIRONMENT"
  }
}

After modifying daemon.json, restart the Docker daemon:

sudo systemctl restart docker

You can verify the current logging driver configuration:

docker info --format '{{.LoggingDriver}}'

For container-specific logging configuration:

docker inspect --format '{{.HostConfig.LogConfig}}' container_name

Common logging options across drivers:

  • mode - blocking (default) or non-blocking
  • max-buffer-size - buffer size for non-blocking mode
  • env - include specific environment variables in logs
  • labels - include container labels in logs
  • tag - customize log message tag format
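
For example, non-blocking mode trades guaranteed delivery for lower application latency when the logging backend is slow; the values below are illustrative:

# Use non-blocking log delivery with a larger ring buffer
docker run -d \
  --log-driver=json-file \
  --log-opt mode=non-blocking \
  --log-opt max-buffer-size=4m \
  nginx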

Viewing Container Logs

Basic Log Commands

# View container logs (shows all logs)
docker logs container_name

# Follow container logs in real-time (like tail -f)
docker logs -f container_name

# Show timestamps with logs (useful for debugging)
docker logs -t container_name

# Show last n lines (instead of entire log history)
docker logs --tail=100 container_name

# Combine options for real-time monitoring with timestamps
docker logs -f -t --tail=50 container_name

# Filter logs by time (show logs since specific timestamp)
docker logs --since 2023-07-01T00:00:00 container_name

# Show logs until a specific timestamp
docker logs --until 2023-07-02T00:00:00 container_name

# Show logs from relative time (e.g., last hour)
docker logs --since=1h container_name

# Grep for specific patterns in logs
docker logs container_name | grep ERROR

# Show extra attributes (environment variables and labels) attached to log entries
docker logs --details container_name

# Count occurrences of specific log patterns
docker logs container_name | grep -c "Connection refused"

# Extract log timestamps for timing analysis
docker logs -t container_name | awk '{print $1}' | sort

# Monitor multiple containers by iterating over running container IDs
for c in $(docker ps -q); do docker logs --tail=100 "$c" 2>&1; done | grep ERROR

# Show container logs with colorized output for different log levels
docker logs container_name | grep --color -E "ERROR|WARN|INFO|DEBUG|$"

Note that docker logs only works with containers using the json-file, local, or journald logging drivers. For other logging drivers (like syslog, fluentd, etc.), you'll need to access logs through their respective systems.
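
If you want docker logs to keep working while still controlling disk usage, the local driver is a good default; it stores logs in a compact binary format and enables rotation out of the box:

# Keep logs readable via `docker logs` with built-in rotation
docker run -d \
  --log-driver=local \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  --name web-local nginx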

Log Retrieval Performance Considerations

  • For large log volumes, use --tail to limit output
  • Avoid frequent calls to docker logs on busy production systems
  • Consider using the --since flag to limit time range
  • For high-traffic containers, use a dedicated logging solution instead of docker logs
  • The docker logs command reads the entire log file before applying filters, which can be resource-intensive

Log Rotation

Log rotation is essential to prevent disk space exhaustion. Configure it in the Docker daemon settings:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}

Here max-size caps the size of each log file before rotation, max-file sets how many rotated files to keep, and compress gzips rotated files. Note that daemon.json must be valid JSON, so inline comments are not allowed.

Container-specific log rotation can be configured at runtime:

# Configure log rotation for a specific container
docker run -d \
  --log-opt max-size=5m \
  --log-opt max-file=5 \
  --log-opt compress=true \
  --name web nginx

For existing deployments with unmanaged log files, you can use external log rotation:

# Example logrotate configuration (/etc/logrotate.d/docker-containers)
/var/lib/docker/containers/*/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

There is no command to rotate a container's logs on demand; with the json-file and local drivers, rotation happens automatically when a log file reaches max-size.

Common log rotation issues:

  • Docker daemon restart required for daemon.json changes to take effect
  • Containers created before log rotation configuration won't have it applied
  • Very high log volume containers may still face issues despite rotation
  • copytruncate in logrotate may miss logs during rotation window
  • Nested JSON logs can be challenging to parse after rotation
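
To spot containers whose logs are growing without rotation, check the size of the json-file logs on the host (the paths below assume the default Docker data root and usually require root):

# Show the largest container log files
sudo du -h /var/lib/docker/containers/*/*-json.log | sort -h | tail -n 10

# Map a log file back to its container name
docker inspect --format '{{.Name}} {{.LogPath}}' $(docker ps -aq)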

Multi-container Log Aggregation

Tools like Fluentd, Logstash, or Filebeat can collect logs from multiple containers and forward them to centralized logging systems:

version: '3.8'
services:
  app:
    image: my-app
    logging:
      driver: fluentd
      options:
        fluentd-address: fluentd:24224
        tag: app.{{.Name}}
        fluentd-async: "true"
        fluentd-async-connect: "true"
        labels: "environment,service_name,version"
        env: "NODE_ENV,SERVICE_VERSION"
    labels:
      environment: production
      service_name: api
      version: "1.0.0"
    environment:
      NODE_ENV: production
      SERVICE_VERSION: "1.0.0"
  
  web:
    image: nginx
    logging:
      driver: fluentd
      options:
        fluentd-address: fluentd:24224
        tag: web.{{.Name}}
        fluentd-async: "true"
    depends_on:
      - app
  
  db:
    image: postgres
    logging:
      driver: fluentd
      options:
        fluentd-address: fluentd:24224
        tag: db.{{.Name}}
        fluentd-async: "true"
  
  fluentd:
    image: fluentd/fluentd:v1.16-1
    volumes:
      - ./fluentd/conf:/fluentd/etc
      - fluentd-data:/fluentd/log
    ports:
      - "24224:24224"
      - "24224:24224/udp"
    environment:
      - FLUENTD_CONF=fluent.conf
    restart: always

volumes:
  fluentd-data:

Example Fluentd configuration (fluent.conf):

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

# Parse Docker logs
<filter **>
  @type parser
  key_name log
  reserve_data true
  remove_key_name_field true
  <parse>
    @type json
    json_parser json
  </parse>
</filter>

# Add Kubernetes metadata if running in K8s
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

# Output to Elasticsearch and keep a local copy of logs for debugging.
# A single <match **> with @type copy is used because Fluentd routes each
# event only to the first matching <match> block.
<match **>
  @type copy
  <store>
    @type elasticsearch
    host elasticsearch
    port 9200
    logstash_format true
    logstash_prefix fluentd
    <buffer>
      @type file
      path /fluentd/log/buffer
      flush_thread_count 2
      flush_interval 5s
      chunk_limit_size 2M
      queue_limit_length 32
      retry_max_interval 30
      retry_forever true
    </buffer>
  </store>
  <store>
    @type file
    path /fluentd/log/${tag}/%Y/%m/%d.%H.%M
    append true
    <format>
      @type json
    </format>
    <buffer tag,time>
      @type file
      path /fluentd/log/file-buffer
      timekey 1h
      timekey_use_utc true
      timekey_wait 10m
    </buffer>
  </store>
</match>

Alternative log aggregation solutions:

  • Filebeat: Lightweight log shipper from Elastic Stack
    filebeat:
      image: docker.elastic.co/beats/filebeat:8.8.0
      volumes:
        - ./filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
        - /var/lib/docker/containers:/var/lib/docker/containers:ro
        - /var/run/docker.sock:/var/run/docker.sock:ro
      user: root
      restart: always
    
  • Logstash: More powerful log processing pipeline
    logstash:
      image: docker.elastic.co/logstash/logstash:8.8.0
      volumes:
        - ./logstash/pipeline:/usr/share/logstash/pipeline:ro
        - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
      ports:
        - "5000:5000/tcp"
        - "5000:5000/udp"
        - "9600:9600"
      environment:
        LS_JAVA_OPTS: "-Xmx512m -Xms512m"
      restart: always
    
  • Vector: High-performance observability data pipeline
    vector:
      image: timberio/vector:0.29.1-alpine
      volumes:
        - ./vector.toml:/etc/vector/vector.toml:ro
        - /var/lib/docker/containers:/var/lib/docker/containers:ro
        - /var/run/docker.sock:/var/run/docker.sock:ro
      ports:
        - "8686:8686"
      restart: always
    

Centralized Logging

ELK Stack Example

A production-ready ELK stack deployment with proper configuration and resource settings:

version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.8.1
    container_name: elasticsearch
    environment:
      - node.name=elasticsearch
      - cluster.name=docker-cluster
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=true
      - ELASTIC_PASSWORD=changeme
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks:
      - elk
    healthcheck:
      test: ["CMD-SHELL", "curl -s http://localhost:9200 | grep -q 'You Know, for Search'"]
      interval: 10s
      timeout: 10s
      retries: 120
  
  logstash:
    image: docker.elastic.co/logstash/logstash:8.8.1
    container_name: logstash
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
    ports:
      - "5044:5044"
      - "5000:5000/tcp"
      - "5000:5000/udp"
      - "9600:9600"
    environment:
      LS_JAVA_OPTS: "-Xmx256m -Xms256m"
    networks:
      - elk
    depends_on:
      - elasticsearch
  
  kibana:
    image: docker.elastic.co/kibana/kibana:8.8.1
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_URL=http://elasticsearch:9200
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=changeme
    networks:
      - elk
    depends_on:
      - elasticsearch
    healthcheck:
      test: ["CMD-SHELL", "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'"]
      interval: 10s
      timeout: 10s
      retries: 120
      
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.8.1
    container_name: filebeat
    command: filebeat -e -strict.perms=false
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    user: root
    networks:
      - elk
    depends_on:
      - elasticsearch
      - logstash

networks:
  elk:
    driver: bridge

volumes:
  es_data:
    driver: local

## Application Logging Best Practices

::steps
### Log to STDOUT/STDERR
- **Write logs to standard output and error streams**
  - Docker captures anything written to stdout/stderr
  - No need to manage log files within containers
  - Reduces complexity and disk space issues
  - Allows container restarts without losing logs
  - Makes logs available through `docker logs` command

- **Let Docker handle log collection**
  - Docker daemon manages log storage and rotation
  - Consistent logging mechanism across all containers
  - Enables centralized log configuration
  - Simplifies logging architecture
  - Allows switching logging drivers without application changes

- **Follows the 12-factor app methodology**
  - Principle IV: Treat logs as event streams
  - Application doesn't concern itself with storage/routing
  - Decouples log generation from log processing
  - Enables easier horizontal scaling
  - Promotes separation of concerns

- **Example logging practices:**
  ```javascript
  // Node.js example - good practice
  console.log(JSON.stringify({
    level: 'info',
    message: 'User logged in',
    timestamp: new Date().toISOString(),
    userId: user.id
  }));
  
  // Bad practice - writing to files
  // fs.appendFileSync('/var/log/app.log', 'User logged in\n');
  ```

Structured Logging

  • Use JSON or other structured formats
    • Enables machine-readable log processing
    • Maintains relationships between data fields
    • Simplifies parsing and querying
    • Preserves data types and hierarchies
    • Facilitates automated analysis
  • Include essential metadata fields
    • timestamp: ISO 8601 format with timezone (e.g., 2023-07-01T12:34:56.789Z)
    • level: Severity level (e.g., debug, info, warn, error)
    • service: Service or component name
    • message: Human-readable description
    • traceId: Distributed tracing identifier
  • Add correlation IDs for request tracking
    • Enables tracking requests across multiple services
    • Helps with distributed system debugging
    • Simplifies complex workflow analysis
    • Essential for microservice architectures
    • Example fields: requestId, correlationId, traceId, spanId
  • Include contextual information
    • User identifiers (anonymized if needed)
    • Resource identifiers (e.g., orderId, productId)
    • Operation being performed
    • Source information (e.g., IP address, user agent)
    • Performance metrics (e.g., duration, resource usage)
  • Example structured log format:
    {
      "timestamp": "2023-07-01T12:34:56.789Z",
      "level": "error",
      "service": "payment-service",
      "message": "Payment processing failed",
      "traceId": "abc123def456",
      "userId": "user-789",
      "orderId": "order-456",
      "error": {
        "code": "INSUFFICIENT_FUNDS",
        "message": "Insufficient funds in account"
      },
      "paymentMethod": "credit_card",
      "amount": 99.95,
      "duration_ms": 236
    }
    

Log Levels

  • Implement proper log levels
    • DEBUG: Detailed information for development/debugging
    • INFO: Confirmation that things are working as expected
    • WARN: Something unexpected but not necessarily an error
    • ERROR: Something failed that should be investigated
    • FATAL/CRITICAL: System is unusable, immediate attention required
  • Configure appropriate level for each environment
    • Development: DEBUG or INFO for maximum visibility
    • Testing/QA: INFO or WARN to reduce noise
    • Production: WARN or ERROR to minimize performance impact
    • Use environment variables to control log levels
    • Example:
      # Set log level via environment variable
      docker run -e LOG_LEVEL=INFO my-app
      
  • Security considerations
    • Never log credentials, tokens, or API keys
    • Hash or mask sensitive personal information
    • Comply with relevant regulations (GDPR, CCPA, HIPAA)
    • Be cautious with stack traces in production
    • Implement log field redaction for sensitive data
    • Example:
      // Logging with redaction
      logger.info({
        user: { id: user.id, email: redactEmail(user.email) },
        action: "profile_update",
        changes: redactSensitiveFields(changes)
      });
      
  • Include actionable information in error logs
    • Error type and message
    • Stack trace (in development/testing)
    • Context that led to the error
    • Request parameters (sanitized)
    • Correlation IDs for tracing
    • Suggested resolution steps where applicable
  • Performance considerations
    • Log volume impacts system performance
    • Use sampling for high-volume debug logs
    • Consider asynchronous logging for performance-critical paths
    • Implement circuit breakers for logging failures
    • Monitor and alert on abnormal log volume
::

Health Checks & Monitoring

Health checks provide automated monitoring of container health status, enabling Docker to detect and recover from application failures.

# Add health check to Dockerfile
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
  CMD curl -f http://localhost/ || exit 1

Health check parameters:

  • interval: Time between checks (default: 30s)
  • timeout: Maximum time for check to complete (default: 30s)
  • start-period: Initial grace period (default: 0s)
  • retries: Consecutive failures before unhealthy (default: 3)

Health checks can also be defined in Docker Compose:

version: '3.8'
services:
  web:
    image: nginx
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
  
  api:
    image: my-api:latest
    healthcheck:
      test: ["CMD", "wget", "-O", "/dev/null", "-q", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 5
      start_period: 30s
  
  redis:
    image: redis:alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
  
  postgres:
    image: postgres:13
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s

Health check best practices:

  • Design checks that validate core functionality, not just that the process is running
  • Keep checks lightweight to avoid resource consumption
  • Include proper timeouts to prevent hanging checks
  • Set appropriate start periods for slow-starting applications
  • Implement health endpoints that check dependencies
  • Use exit codes properly (0 = healthy, 1 = unhealthy)
  • Avoid complex scripts that might fail for reasons unrelated to application health

Health check states:

  • starting: During start period, not counted as unhealthy yet
  • healthy: Check is passing
  • unhealthy: Check is failing

You can check container health status:

docker inspect --format='{{.State.Health.Status}}' container_name
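
Health transitions are also published as events, which makes it straightforward to watch for failures or list unhealthy containers (supported on current Docker versions):

# Stream health status changes as they happen
docker events --filter 'event=health_status'

# List containers currently reporting unhealthy
docker ps --filter 'health=unhealthy'

# Show the output of recent health check runs for a container
docker inspect --format '{{json .State.Health.Log}}' container_name | jq '.'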

Monitoring Metrics

Prometheus Configuration

Example prometheus.yml for comprehensive Docker monitoring:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

  # Add labels to all time series or alerts
  external_labels:
    environment: production
    region: us-west-1

# Rule files contain recording rules and alerting rules
rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  # Docker daemon metrics
  - job_name: 'docker'
    static_configs:
      - targets: ['docker-host:9323']
    metrics_path: '/metrics'
    scheme: 'http'
  
  # Container metrics from cAdvisor
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    metrics_path: '/metrics'
    scheme: 'http'
    scrape_interval: 10s
  
  # Host metrics from node-exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    metrics_path: '/metrics'
    scheme: 'http'
  
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # Application metrics
  - job_name: 'application'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/actuator/prometheus'
    scheme: 'http'
  
  # Auto-discover containers with Prometheus annotations
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        filters:
          - name: label
            values: ['prometheus.io.scrape=true']
    relabel_configs:
      # Use the container name as the instance label
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'instance'
        replacement: '$1'
      # Extract metrics path from container label
      - source_labels: ['__meta_docker_container_label_prometheus_io_metrics_path']
        regex: '(.+)'
        target_label: '__metrics_path__'
        replacement: '$1'
      # Build the scrape address from the container IP and the port label
      - source_labels: ['__meta_docker_network_ip', '__meta_docker_container_label_prometheus_io_port']
        regex: '(.+);(.+)'
        target_label: '__address__'
        replacement: '$1:$2'
      # Add container labels as Prometheus labels
      - action: labelmap
        regex: __meta_docker_container_label_(.+)

For alerting, add an AlertManager configuration:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Example alert rules file (/etc/prometheus/rules/container_alerts.yml)
groups:
- name: container_alerts
  rules:
  - alert: ContainerCpuUsage
    expr: (sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) / scalar(count(node_cpu_seconds_total{mode="idle"}))) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage (instance {{ $labels.instance }})"
      description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}%\n  LABELS = {{ $labels }}"
  
  - alert: ContainerMemoryUsage
    expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container Memory usage (instance {{ $labels.instance }})"
      description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}%\n  LABELS = {{ $labels }}"

Grafana Dashboard Setup

Basic Setup

  1. Access Grafana (default: http://localhost:3000)
    • Ensure Grafana service is running and port is accessible
    • Check for any proxy or network restrictions
  2. Login with default credentials (admin/admin)
    • You'll be prompted to change the password on first login
    • Set a strong password and store it securely
    • Consider setting up additional users with appropriate permissions
  3. Add Prometheus data source
    • Navigate to Configuration > Data Sources
    • Click "Add data source" and select "Prometheus"
    • Set URL to http://prometheus:9090 (or appropriate address)
    • Set scrape interval to match Prometheus configuration
    • Test the connection to ensure it works
    • Advanced settings:
      HTTP Method: GET
      Type: Server (default)
      Access: Server (default)
      Disable metrics lookup: No
      Custom query parameters: None
      
  4. Import Docker monitoring dashboards
    • Navigate to Dashboards > Import
    • Enter dashboard ID or upload JSON file
    • Recommended dashboard IDs:
      • 893: Docker and system monitoring (1 server)
      • 10619: Docker monitoring with Prometheus
      • 11467: Container metrics dashboard
      • 1860: Node Exporter Full
    • Adjust variables to match your environment
    • Save dashboard with appropriate name and folder
  5. Configure alerts
    • Navigate to Alerting > Alert Rules
    • Create alert rules based on critical metrics
    • Set appropriate thresholds and evaluation intervals
    • Configure notification channels (email, Slack, PagerDuty, etc.)
    • Test alerts to ensure proper delivery
    • Example alert rule:
      Rule name: High Container CPU Usage
      Data source: Prometheus
      Expression: max by(name) (rate(container_cpu_usage_seconds_total{name!=""}[1m]) * 100) > 80
      Evaluation interval: 1m
      Pending period: 5m
      
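Data sources can also be created without the UI through Grafana's HTTP API, which is convenient for scripted setups; the credentials and URLs below match the Compose example earlier and are otherwise assumptions:

# Create the Prometheus data source via the Grafana API
curl -s -u admin:admin_password \
  -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus:9090","access":"proxy","isDefault":true}'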

Dashboard Recommendations

  • Docker Host metrics dashboard
    • System load, CPU, memory, disk, and network
    • Host uptime and stability metrics
    • Docker daemon metrics
    • Number of running containers
    • Resource utilization trends
    • Example panels:
      • CPU Usage by Container (stacked graph)
      • Memory Usage by Container (stacked graph)
      • Container Status Count (stat panel)
      • System Load (gauge)
      • Disk Space Usage (pie chart)
  • Container resource usage dashboard
    • Per-container CPU, memory, and I/O metrics
    • Container restart counts
    • Health check status
    • Network traffic by container
    • Key visualizations:
      • Heatmap of container resource usage
      • Time-series charts for each resource type
      • Top N resource consumers (table)
      • Container lifecycle events (annotations)
      • Resource limit vs. actual usage comparison
  • Application-specific metrics dashboards
    • Business KPIs relevant to your application
    • Request rates, error rates, and latencies
    • Database connection pool status
    • Cache hit ratios
    • Custom instrumentation metrics
    • User experience metrics
    • Example: E-commerce dashboard with:
      • Orders per minute
      • Cart abandonment rate
      • Payment processing time
      • Product search latency
      • Active user sessions
  • Alert thresholds visualization
    • Combine metrics with alert thresholds
    • Visual indicators for approaching thresholds
    • Alert history and frequency
    • Mean time to resolution tracking
    • Alert correlation with system events
    • Example panel: Graph with colored threshold bands
  • Log correlation views
    • Combined metrics and log panels
    • Error rate correlation with log volume
    • Event markers on time-series charts
    • Log context for anomalies
    • Drill-down from metrics to logs
    • Example: Request latency chart with error log entries as annotations

Alert Configuration

# Comprehensive Alertmanager configuration example
global:
  # The smarthost and SMTP sender used for mail notifications
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  
  # The auth token for Slack
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXX'
  
  # PagerDuty integration
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  
  # Default notification settings
  resolve_timeout: 5m

# Templates for notifications
templates:
  - '/etc/alertmanager/template/*.tmpl'

# The root route on which all alerts enter
route:
  # Default receiver
  receiver: 'team-emails'
  
  # Group alerts by category and environment
  group_by: ['alertname', 'environment', 'service']
  
  # Wait 30s to aggregate alerts of the same group
  group_wait: 30s
  
  # Send updated notification if new alerts added to group
  group_interval: 5m
  
  # How long to wait before sending a notification again
  repeat_interval: 4h
  
  # Child routes
  routes:
  - receiver: 'critical-pages'
    matchers:
      - severity="critical"
    # Don't wait to send critical alerts
    group_wait: 0s
    # Lower interval for critical alerts
    repeat_interval: 1h
    # Continue to forward to other child routes
    continue: true
    
  - receiver: 'slack-notifications'
    matchers:
      - severity=~"warning|info"
    # Continue processing other child routes
    continue: true
    
  - receiver: 'database-team'
    matchers:
      - service=~"database|postgres|mysql"
    
  - receiver: 'frontend-team'
    matchers:
      - service=~"frontend|ui|web"

# Inhibition rules prevent notifications of lower severity alerts if a higher severity
# alert is already firing for the same issue
inhibit_rules:
- source_matchers:
    - severity="critical"
  target_matchers:
    - severity="warning"
  # Only inhibit if the alertname is the same
  equal: ['alertname', 'instance']

receivers:
  - name: 'team-emails'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
        html: '{{ template "email.default.html" . }}'
        headers:
          Subject: '{{ template "email.subject" . }}'
  
  - name: 'critical-pages'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        send_resolved: true
        description: '{{ template "pagerduty.default.description" . }}'
        severity: 'critical'
        client: 'Alertmanager'
        client_url: 'https://alertmanager.example.com'
  
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#monitoring'
        send_resolved: true
        icon_url: 'https://avatars3.githubusercontent.com/u/3380462'
        title: '{{ template "slack.default.title" . }}'
        title_link: 'https://alertmanager.example.com/#/alerts'
        text: '{{ template "slack.default.text" . }}'
        footer: 'Prometheus Alertmanager'
        actions:
          - type: 'button'
            text: 'Runbook 📚'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
  
  - name: 'database-team'
    slack_configs:
      - channel: '#db-alerts'
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
  
  - name: 'frontend-team'
    slack_configs:
      - channel: '#frontend-alerts'
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'

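As with Prometheus, the Alertmanager configuration can be validated before deployment using amtool from the official image (the image tag here is an assumption):

# Check the Alertmanager configuration for syntax and schema errors
docker run --rm -v "$PWD/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro" \
  --entrypoint amtool prom/alertmanager:v0.25.0 \
  check-config /etc/alertmanager/alertmanager.yml
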
Debugging with Logs

Effective log analysis is crucial for troubleshooting container issues. Here are practical techniques for extracting valuable information from container logs:

# Debugging container issues with error filtering
docker logs --tail=100 -t container_name | grep ERROR

# Follow logs with specific pattern matching (using extended regex)
docker logs -f container_name | grep -E "error|exception|fail|fatal|panic"

# Extract logs with context (3 lines before and after each match)
docker logs container_name | grep -A 3 -B 3 "Exception"

# Find all occurrences of specific error types and count them
docker logs container_name | grep -o "OutOfMemoryError" | wc -l

# Search for logs within a specific time window
docker logs --since 2023-07-01T10:00:00 --until 2023-07-01T11:00:00 container_name

# Save logs to file for offline analysis
docker logs container_name > container.log

# Compare logs across time periods
docker logs --since 2h container_name > recent.log
docker logs --since 4h --until 2h container_name > older.log
diff recent.log older.log | less

# Parse JSON logs for better readability
docker logs container_name | grep -v '^$' | jq '.'

# Extract specific fields from JSON logs
docker logs container_name | grep -v '^$' | jq 'select(.level=="error") | {timestamp, message, error}'

# Follow logs and highlight different log levels with colors
docker logs -f container_name | GREP_COLOR="01;31" grep -E --color=always 'ERROR|$' | GREP_COLOR="01;33" grep -E --color=always 'WARN|$'

# Extract logs for a specific request ID
docker logs container_name | grep "request-123456"

# Find slow operations (e.g., queries taking more than 1 second)
docker logs container_name | grep -E "took [1-9][0-9]{3,}ms"

# Extract stack traces
docker logs container_name | grep -A 20 "Exception" | grep -v "^$"

# Analyze log volume by time (log lines per minute)
docker logs -t container_name | cut -d' ' -f1 | cut -d'T' -f2 | cut -c1-5 | sort | uniq -c

Advanced debugging techniques:

  • Correlate logs with system events (deployments, scaling, etc.)
  • Compare logs across multiple containers for distributed issues
  • Use timestamps to create a sequence of events
  • Check container environment variables for configuration issues
  • Analyze container startup logs separately from runtime logs
  • Monitor both application and infrastructure logs simultaneously
  • Use regex patterns to extract structured data from unstructured logs

Advanced Monitoring Topics

Distributed Tracing

  • Implement OpenTelemetry or Jaeger
    • Track requests as they flow through distributed systems
    • Generate trace IDs to correlate logs across services
    • Instrument code with OpenTelemetry SDK
    • Deploy Jaeger or Zipkin collectors
    • Example Jaeger deployment:
      version: '3.8'
      services:
        jaeger:
          image: jaegertracing/all-in-one:1.37
          ports:
            - "6831:6831/udp"  # Jaeger agent - accepts spans in Thrift format
            - "16686:16686"    # Jaeger UI
          environment:
            - COLLECTOR_ZIPKIN_HOST_PORT=:9411
      
  • Trace requests across multiple containers
    • Propagate trace context between services
    • Capture request path across microservices
    • Record parent-child relationships between spans
    • Preserve baggage items for request context
    • Example client instrumentation (Node.js):
      const tracer = opentelemetry.trace.getTracer('my-service');
      const span = tracer.startSpan('process-order');
      try {
        // Add attributes to span
        span.setAttribute('order.id', orderId);
        span.setAttribute('customer.id', customerId);
        
        // Create child span
        const dbSpan = tracer.startSpan('database-query', {
          parent: span,
        });
        // Process database operations
        dbSpan.end();
        
        // Propagate context to HTTP requests
        const headers = {};
        opentelemetry.propagation.inject(opentelemetry.context.active(), headers);
        await axios.post('http://inventory-service/check', data, { headers });
      } catch (error) {
        span.recordException(error);
        span.setStatus({ code: SpanStatusCode.ERROR });
      } finally {
        span.end();
      }
      
  • Measure service latency
    • Calculate time spent in each service
    • Break down processing time by operation
    • Identify slow components or dependencies
    • Compare latency across different environments
    • Correlate latency with resource utilization
    • Example span attributes for latency analysis:
      database.query.duration_ms: 45.2
      http.request.duration_ms: 120.7
      cache.lookup_time_ms: 2.1
      business_logic.processing_time_ms: 15.8
      
  • Identify bottlenecks
    • Analyze critical path in request processing
    • Find operations with highest latency contribution
    • Detect contention points and resource constraints
    • Quantify impact of external dependencies
    • Trace-based hotspot analysis techniques
    • Example hotspot dashboard showing service latency breakdown
  • Visualize service dependencies
    • Generate service dependency graphs
    • Analyze traffic patterns between services
    • Calculate error rates between service pairs
    • Identify redundant or unnecessary calls
    • Detect circular dependencies
    • Example visualization tools:
      • Jaeger UI service graph
      • Grafana service graph panel
      • Kiali for service mesh visualization
      • Custom D3.js visualization

Custom Metrics

  • Expose application-specific metrics
    • Identify key business and technical metrics
    • Instrument application code with metrics
    • Expose metrics endpoints (/metrics)
    • Design meaningful metric names and labels
    • Balance cardinality and usefulness
    • Example custom metrics:
      # Counter for business events
      order_total{status="completed",payment_method="credit_card"} 1550.75
      
      # Gauge for active resource usage
      active_user_sessions{region="us-west"} 1250
      
      # Histogram for latency distributions
      api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="0.1"} 1500
      api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="0.5"} 1950
      api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="1.0"} 1990
      api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="+Inf"} 2000
      
  • Implement Prometheus client libraries
    • Use official client libraries for language-specific instrumentation
    • Create counters, gauges, histograms, and summaries
    • Register metrics with registry
    • Set up middleware for standard metrics
    • Add custom labels for filtering and aggregation
    • Example implementation (Python):
      from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
      
      # Create metrics
      REQUEST_COUNT = Counter('app_request_count', 'Application Request Count', ['method', 'endpoint', 'status'])
      REQUEST_LATENCY = Histogram('app_request_latency_seconds', 'Application Request Latency', ['method', 'endpoint'])
      ACTIVE_SESSIONS = Gauge('app_active_sessions', 'Active Sessions', ['region'])
      
      # Start metrics endpoint
      start_http_server(8000)
      
      # Update metrics in code
      def process_request(request):
          ACTIVE_SESSIONS.labels(region='us-west').inc()
          
          with REQUEST_LATENCY.labels(method='POST', endpoint='/api/v1/order').time():
              # Process request
              result = handle_request(request)
          
          REQUEST_COUNT.labels(method='POST', endpoint='/api/v1/order', status=result.status_code).inc()
          ACTIVE_SESSIONS.labels(region='us-west').dec()
          
          return result
      
  • Create custom dashboards
    • Design purpose-specific visualization panels
    • Combine technical and business metrics
    • Create drill-down capabilities
    • Use appropriate visualization types
    • Implement dynamic variables for filtering
    • Example Grafana dashboard JSON structure:
      {
        "title": "E-Commerce Platform Dashboard",
        "panels": [
          {
            "title": "Order Volume by Payment Method",
            "type": "barchart",
            "datasource": "Prometheus",
            "targets": [
              {
                "expr": "sum by(payment_method) (order_total)",
                "legendFormat": "{{payment_method}}"
              }
            ]
          },
          {
            "title": "API Latency (95th Percentile)",
            "type": "timeseries",
            "datasource": "Prometheus",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, endpoint))",
                "legendFormat": "{{endpoint}}"
              }
            ]
          }
        ]
      }
      
  • Set relevant thresholds
    • Define SLIs (Service Level Indicators) and SLOs (Service Level Objectives)
    • Create alert thresholds based on business impact
    • Implement multi-level thresholds (warning, critical)
    • Use historical data to establish baselines
    • Account for traffic patterns and seasonality
    • Example alert thresholds:
      - alert: APIHighLatency
        expr: histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, endpoint)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency on {{ $labels.endpoint }}"
          description: "95th percentile latency is {{ $value }}s, which exceeds the SLO of 0.5s"
      
  • Correlate with business metrics
    • Connect technical metrics to business outcomes
    • Measure conversion impact of performance issues
    • Track costs associated with resource usage
    • Create composite KPI dashboards
    • Implement business impact scoring
    • Example correlation queries:
      # Conversion rate vs. page load time
      sum(rate(purchase_completed_total[1h])) / sum(rate(product_page_view_total[1h]))
      
      # Revenue impact of errors
      sum(rate(order_total{status="error"}[1h]))
      
      # Customer satisfaction correlation
      rate(support_ticket_created{category="performance"}[1d]) / rate(active_user_sessions[1d])
      

Automated Responses

  • Implement auto-scaling based on metrics
    • Configure horizontal pod autoscaling
    • Use custom metrics for scaling decisions
    • Set appropriate cooldown periods
    • Implement predictive scaling for known patterns
    • Test scaling behavior under various conditions
    • Example Docker Swarm service scaling:
      # Autoscaling with docker service update
      while true; do
        # Get current metrics
        REQUESTS=$(curl -s http://prometheus:9090/api/v1/query?query=sum\(rate\(http_requests_total\[1m\]\)\) | jq -r '.data.result[0].value[1]')
        
        # Calculate desired replicas (1 replica per 100 requests/second)
        DESIRED=$(echo "$REQUESTS / 100" | bc)
        if [ $DESIRED -lt 2 ]; then DESIRED=2; fi
        if [ $DESIRED -gt 10 ]; then DESIRED=10; fi
        
        # Get current replicas
        CURRENT=$(docker service ls --filter name=myapp --format "{{.Replicas}}" | cut -d '/' -f1)
        
        # Scale if needed
        if [ $DESIRED -ne $CURRENT ]; then
          echo "Scaling from $CURRENT to $DESIRED replicas"
          docker service update --replicas $DESIRED myapp
        fi
        
        sleep 30
      done
      
  • Configure auto-healing for failed containers
    • Set appropriate restart policies
    • Implement health checks for accurate failure detection
    • Configure liveness and readiness probes
    • Capture diagnostic information before restarts
    • Implement circuit breakers for dependent services
    • Example Docker Compose configuration:
      services:
        app:
          image: myapp:latest
          restart: unless-stopped
          healthcheck:
            test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
            interval: 10s
            timeout: 5s
            retries: 3
            start_period: 40s
          deploy:
            restart_policy:
              condition: on-failure
              max_attempts: 3
              window: 120s
          logging:
            driver: "json-file"
            options:
              max-size: "10m"
              max-file: "3"
      
  • Create runbooks for common issues
    • Document standard troubleshooting procedures
    • Include diagnostic commands and expected output
    • Link alerts to specific runbook sections
    • Provide escalation paths for unresolved issues
    • Maintain version control for runbooks
    • Example runbook structure:
      # API Service High Latency Runbook
      
      ## Symptoms
      - API response time > 500ms (95th percentile)
      - Increased error rate in downstream services
      - Alert: APIHighLatency firing
      
      ## Diagnostic Steps
      1. Check container resource usage:
      
      docker stats api-service
      
      2. Examine database connection pool:
      
      curl http://api-service:8080/actuator/metrics/hikaricp.connections.usage
      
      3. Check for slow queries:
      
      docker logs api-service | grep "slow query"
      
      ## Resolution Steps
      1. If connection pool utilization > 80%:
      - Increase pool size in config
      - Restart service with `docker restart api-service`
      
      2. If slow queries detected:
      - Check database indexes
      - Optimize identified queries
      
      3. If CPU/memory usage high:
      - Scale service: `docker service scale api-service=3`
      
      ## Escalation
      If issue persists after 15 minutes:
      - Contact: database-team@example.com
      - Slack: #database-support
      
  • Develop automated remediation
    • Implement scripted responses to common problems
    • Create self-healing capabilities
    • Add circuit breakers for degraded dependencies
    • Implement graceful degradation modes
    • Balance automation with human oversight
    • Example automated remediation script:
      #!/bin/bash
      # Automated database connection reset
      
      # Check connection errors
      ERROR_COUNT=$(docker logs --since 5m db-service | grep "connection reset" | wc -l)
      
      if [ $ERROR_COUNT -gt 10 ]; then
        echo "Detected database connection issues, performing remediation"
        
        # Capture diagnostics before remediation
        docker logs --since 15m db-service > /var/log/remediation/db-$(date +%s).log
        
        # Perform remediation
        docker exec db-service /scripts/connection-reset.sh
        
        # Verify fix
        sleep 5
        NEW_ERRORS=$(docker logs --since 1m db-service | grep "connection reset" | wc -l)
        
        # Notify on outcome
        if [ $NEW_ERRORS -eq 0 ]; then
          echo "Remediation successful" | slack-notify "#monitoring"
        else
          echo "Remediation failed, escalating" | slack-notify "#monitoring" --urgent
          # Trigger PagerDuty incident
          pagerduty-trigger "Database connection issues persist after remediation"
        fi
      fi
      
  • Setup escalation policies
    • Define clear escalation thresholds
    • Create tiered response teams
    • Implement on-call rotation schedules
    • Track mean time to acknowledge and resolve
    • Document communication protocols
    • Example escalation policy:
      # PagerDuty escalation policy
      escalation_policies:
        - name: "API Service Escalation"
          description: "Escalation policy for API service incidents"
          num_loops: 2
          escalation_rules:
            - escalation_delay_in_minutes: 15
              targets:
                - id: "PXXXXXX"  # Primary on-call engineer
                  type: "user_reference"
            - escalation_delay_in_minutes: 15
              targets:
                - id: "PXXXXXX"  # Secondary on-call engineer
                  type: "user_reference"
            - escalation_delay_in_minutes: 30
              targets:
                - id: "PXXXXXX"  # Engineering manager
                  type: "user_reference"
      

Performance Monitoring

Best Practices Checklist

Logging Best Practices

  • Log to STDOUT/STDERR
    • Follow container best practices
    • Enable centralized collection
    • Avoid managing log files inside containers
    • Let Docker logging drivers handle transport
    • Example: Configure applications to write directly to stdout/stderr rather than log files
  • Use structured logging format
    • Implement JSON-formatted logs
    • Include consistent metadata fields
    • Use proper data types within JSON
    • Add correlation IDs for request tracking
    • Maintain schema consistency
    • Example structured log:
      {
        "timestamp": "2023-07-10T15:23:45.123Z",
        "level": "error",
        "service": "order-service",
        "message": "Failed to process payment",
        "traceId": "abc123def456",
        "orderId": "order-789",
        "errorCode": "PAYMENT_DECLINED",
        "duration_ms": 345
      }
      
  • Implement log rotation
    • Configure size and time-based rotation
    • Set appropriate retention periods
    • Compress rotated logs
    • Monitor disk usage
    • Handle rotation gracefully
    • Example Docker logging configuration:
      {
        "log-driver": "json-file",
        "log-opts": {
          "max-size": "20m",
          "max-file": "5",
          "compress": "true"
        }
      }
      
  • Set appropriate log levels
    • Use DEBUG for development environments
    • Use INFO or WARN for production
    • Reserve ERROR for actionable issues
    • Make log levels configurable at runtime
    • Use consistent level definitions across services
    • Example log level configuration:
      logging:
        level:
          root: WARN
          com.example.api: INFO
          com.example.database: WARN
          com.example.payment: INFO
      
  • Configure centralized logging
    • Aggregate logs from all containers
    • Implement proper indexing and search
    • Set up log parsing and normalization
    • Configure access controls for logs
    • Establish retention and archival policies
    • Example EFK stack setup:
      • Filebeat or Fluentd to collect logs
      • Elasticsearch for storage and indexing
      • Kibana for visualization and search
      • Curator for index lifecycle management
  • Avoid sensitive data in logs
    • Implement data masking for PII
    • Never log credentials or secrets
    • Truncate potentially large payloads
    • Remove sensitive headers
    • Anonymize personal identifiers
    • Example masking implementation:
      function logRequest(req) {
        const sanitized = {
          method: req.method,
          path: req.path,
          query: sanitizeObject(req.query),
          headers: sanitizeHeaders(req.headers),
          body: sanitizeObject(req.body)
        };
        logger.info({ request: sanitized }, "Incoming request");
      }
      
      function sanitizeObject(obj) {
        const masked = {...obj};
        const sensitiveFields = ['password', 'token', 'ssn', 'credit_card'];
        
        for (const field of sensitiveFields) {
          if (masked[field]) masked[field] = '***REDACTED***';
        }
        return masked;
      }
      

Monitoring Best Practices

  • Monitor both host and containers
    • Track host-level resources (CPU, memory, disk, network)
    • Monitor container-specific metrics
    • Correlate container performance with host constraints
    • Watch for noisy neighbor problems
    • Track Docker daemon health
    • Example monitoring stack:
      • Node Exporter for host metrics
      • cAdvisor for container metrics
      • Docker daemon metrics endpoint
      • Process-specific metrics from applications
  • Implement alerting with appropriate thresholds
    • Create multi-level alerts (warning/critical)
    • Avoid alert fatigue with proper thresholds
    • Include runbook links in alert notifications
    • Group related alerts to reduce noise
    • Implement alert de-duplication
    • Example alerting configuration:
      - alert: ContainerHighCpuUsage
        expr: rate(container_cpu_usage_seconds_total{name!=""}[1m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU usage high ({{ $labels.name }})"
          description: "Container CPU usage is above 80% for 5 minutes"
          runbook_url: "https://wiki.example.com/runbooks/container-cpu"
      
  • Use visualization dashboards
    • Create role-specific dashboards
    • Include both overview and detailed views
    • Use appropriate visualization types
    • Implement template variables for filtering
    • Balance information density with readability
    • Example dashboard organization:
      • Executive overview: High-level health and KPIs
      • Operations dashboard: System health and resource usage
      • Developer dashboard: Application performance and errors
      • Service-specific dashboards: Detailed metrics for each service
      • On-call dashboard: Current alerts and recent incidents
  • Track business-relevant metrics
    • Monitor key performance indicators (KPIs)
    • Create business-technical correlation views
    • Measure user experience metrics
    • Track conversion and engagement metrics
    • Monitor transaction value and volume
    • Example business metrics:
      # Business metrics in Prometheus format
      # HELP order_value_total Total value of orders in currency
      # TYPE order_value_total counter
      order_value_total{currency="USD",status="completed"} 15234.50
      
      # HELP checkout_started Total number of checkout processes started
      # TYPE checkout_started counter
      checkout_started 5423
      
      # HELP checkout_completed Total number of checkout processes completed
      # TYPE checkout_completed counter
      checkout_completed 4231
      
  • Implement health checks
    • Create meaningful application health endpoints
    • Check dependencies in health probes
    • Implement readiness vs. liveness separation
    • Make health checks lightweight
    • Include version and dependency information
    • Example health check endpoint:
      app.get('/health', async (req, res) => {
        try {
          // Check database connectivity
          const dbStatus = await checkDatabase();
          
          // Check cache service
          const cacheStatus = await checkRedis();
          
          // Check external API dependency
          const apiStatus = await checkExternalApi();
          
          const allHealthy = dbStatus && cacheStatus && apiStatus;
          
          res.status(allHealthy ? 200 : 503).json({
            status: allHealthy ? 'healthy' : 'unhealthy',
            version: '1.2.3',
            timestamp: new Date().toISOString(),
            components: {
              database: dbStatus ? 'up' : 'down',
              cache: cacheStatus ? 'up' : 'down',
              api: apiStatus ? 'up' : 'down'
            }
          });
        } catch (error) {
          res.status(500).json({ status: 'error', error: error.message });
        }
      });
      
  • Plan for monitoring scalability
    • Design for growth in container count
    • Implement metric aggregation for high-cardinality data
    • Use appropriate retention policies
    • Consider resource requirements for monitoring tools
    • Implement federation for large-scale deployments
    • Example scalability techniques:
      • Prometheus hierarchical federation
      • Service discovery for dynamic environments
      • Metric aggregation and downsampling
      • Sharding metrics by service or namespace
      • Custom recording rules for common queries
      # Recording rules for efficient querying
      groups:
      - name: container_aggregation
        interval: 1m
        rules:
        - record: job:container_cpu:usage_rate5m
          expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (job)
        - record: job:container_memory:usage_bytes
          expr: sum(container_memory_usage_bytes) by (job)