Effective monitoring and logging are essential for maintaining reliable Docker environments. They help with troubleshooting, performance optimization, and ensuring application health. A comprehensive monitoring and logging strategy provides visibility into container behavior, enables proactive issue detection, and helps maintain service level objectives.
Docker containers present unique monitoring challenges due to their ephemeral nature, isolation characteristics, and the dynamic environments they often operate in. An effective container observability strategy must account for these characteristics while providing meaningful insights across the entire container lifecycle.
Docker provides basic monitoring capabilities out of the box that require no additional tools or setup:
# View real-time resource usage for all containers
docker stats
# Sample output:
# CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
# 1a2b3c4d5e6f web 0.25% 15.23MiB / 1.952GiB 0.76% 648B / 648B 0B / 0B 2
# 2a3b4c5d6e7f redis 0.05% 2.746MiB / 1.952GiB 0.14% 1.45kB / 1.05kB 0B / 0B 5
# View detailed container stats in JSON format (useful for scripting)
docker stats --no-stream --format "{{json .}}" container_name | jq '.'
# Pretty prints JSON output with jq for better readability
# Filter stats to show only specific containers
docker stats $(docker ps --format '{{.Names}}' | grep "api-")
# Display specific columns only
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
# Inspect container configuration details
docker inspect container_name
# Extract specific configuration information
docker inspect --format '{{.HostConfig.Memory}}' container_name
docker inspect --format '{{.State.Health.Status}}' container_name
docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' container_name
# View container events in real-time
docker events --filter 'type=container'
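The event stream can be narrowed further and reshaped with a Go template, which is handy for feeding events into scripts; a small sketch using standard `docker events` filters and format fields:

```bash
# Watch only lifecycle events that usually indicate trouble
docker events \
  --filter 'type=container' \
  --filter 'event=die' \
  --filter 'event=oom' \
  --format '{{.Time}} {{.Actor.Attributes.name}} {{.Action}}'
```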
Built-in monitoring tools provide a good starting point but have limitations:

- No historical data retention
- Limited metrics granularity
- No alerting capabilities
- Minimal visualization options
- Container-centric rather than application-centric view

Google's container advisor (cAdvisor) provides more detailed container metrics with a web UI and exposes Prometheus-compatible endpoints:
# Run cAdvisor container with all necessary volume mounts
docker run -d \
--name=cadvisor \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--privileged \
--device=/dev/kmsg \
--restart=always \
gcr.io/cadvisor/cadvisor:v0.47.0
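Once the container is up, a quick sanity check (assuming the published port 8080 above) confirms that the web UI and the Prometheus endpoint are reachable:

```bash
# Verify cAdvisor is serving metrics before wiring it into Prometheus
curl -sf http://localhost:8080/metrics | head -n 5
# Open the web UI in a browser
xdg-open http://localhost:8080 2>/dev/null || open http://localhost:8080
```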
cAdvisor provides:
- Detailed resource usage statistics (CPU, memory, network, filesystem)
- Container lifecycle events
- Historical performance data (short-term, in-memory storage)
- Container hierarchy visualization
- Prometheus metrics endpoint at /metrics
- Hardware-specific monitoring (NVML for NVIDIA GPUs)
- Auto-discovery of containers

Key metrics exposed by cAdvisor:

- CPU usage breakdown (user, system, throttling)
- Memory details (RSS, cache, swap, working set)
- Network statistics (RX/TX bytes, errors, dropped packets)
- Filesystem usage and I/O statistics
- Per-process statistics within containers

Accessing the UI:

- Web interface at http://localhost:8080
- API endpoints for programmatic access:
  - /api/v1.3/containers - all container stats
  - /api/v1.3/docker/[container_name] - stats for a specific container
  - /metrics - Prometheus-formatted metrics

Prometheus and Grafana form a powerful combination for metrics collection, storage, querying, and visualization, and are widely considered the industry standard for container monitoring:
version : '3.8'
services :
prometheus :
image : prom/prometheus:v2.44.0
container_name : prometheus
command :
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=15d'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
volumes :
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
ports :
- "9090:9090"
restart : unless-stopped
node-exporter :
image : prom/node-exporter:v1.5.0
container_name : node-exporter
volumes :
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command :
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports :
- "9100:9100"
restart : unless-stopped
cadvisor :
image : gcr.io/cadvisor/cadvisor:v0.47.0
container_name : cadvisor
volumes :
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
ports :
- "8080:8080"
restart : unless-stopped
grafana :
image : grafana/grafana:9.5.1
container_name : grafana
user : "472"
ports :
- "3000:3000"
volumes :
- grafana_data:/var/lib/grafana
depends_on :
- prometheus
environment :
- GF_SECURITY_ADMIN_PASSWORD=admin_password
- GF_USERS_ALLOW_SIGN_UP=false
- GF_SERVER_DOMAIN=localhost
restart : unless-stopped
volumes :
prometheus_data : {}
grafana_data : {}
Example prometheus.yml configuration:
global :
scrape_interval : 15s
evaluation_interval : 15s
scrape_configs :
- job_name : 'prometheus'
static_configs :
- targets : [ 'localhost:9090' ]
- job_name : 'node-exporter'
static_configs :
- targets : [ 'node-exporter:9100' ]
- job_name : 'cadvisor'
static_configs :
- targets : [ 'cadvisor:8080' ]
- job_name : 'docker'
metrics_path : /metrics
static_configs :
- targets : [ '172.17.0.1:9323' ] # Docker daemon metrics (requires daemon configuration)
# Auto-discover and scrape containers with prometheus.io annotations
- job_name : 'docker-containers'
docker_sd_configs :
- host : unix:///var/run/docker.sock
filters :
- name : label
values : [ 'prometheus.io.scrape=true' ]
relabel_configs :
- source_labels : [ '__meta_docker_container_label_prometheus_io_port' ]
target_label : __metrics_path__
replacement : /metrics
- source_labels : [ '__meta_docker_container_name' ]
target_label : container_name
- source_labels : [ '__meta_docker_container_label_prometheus_io_job' ]
target_label : job
This setup provides:

- Prometheus: time-series database with a powerful query language (PromQL)
- Node Exporter: host-level metrics (CPU, memory, disk, network)
- cAdvisor: container-specific metrics
- Grafana: visualization dashboards and alerting

You can enable Docker daemon metrics by adding the following to /etc/docker/daemon.json:
{
"metrics-addr" : "0.0.0.0:9323" ,
"experimental" : true
}
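After restarting the daemon, you can confirm that the metrics endpoint is live; a quick check, assuming the 0.0.0.0:9323 address configured above:

```bash
sudo systemctl restart docker
# Daemon metrics are exposed in Prometheus text format under the engine_daemon_* prefix
curl -s http://localhost:9323/metrics | grep -m 5 '^engine_daemon'
```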
Common Prometheus metrics for containers:

- container_cpu_usage_seconds_total - cumulative CPU time consumed
- container_memory_usage_bytes - current memory usage, including cache
- container_memory_rss - resident set size (actual memory usage)
- container_network_receive_bytes_total - network bytes received
- container_network_transmit_bytes_total - network bytes sent
- container_fs_reads_bytes_total - bytes read from disk
- container_fs_writes_bytes_total - bytes written to disk

Example PromQL queries:
# CPU usage percentage per container
sum(rate(container_cpu_usage_seconds_total{name=~".+"}[1m])) by (name) * 100
# Memory usage in MB per container
sum(container_memory_usage_bytes{name=~".+"}) by (name) / 1024 / 1024
# Network receive throughput
rate(container_network_receive_bytes_total{name=~".+"}[5m])
Docker offers multiple logging drivers to handle container logs, each suited to different environments and requirements:

- json-file (default): logs stored as JSON files on the host; simple to use and configure; local access to logs via docker logs; limited by local disk space; requires log rotation configuration
- syslog: forwards logs to a syslog daemon; integrates with existing syslog infrastructure; supports UDP, TCP, and TLS transport; no access to logs via docker logs; standard format understood by many log analyzers
- journald: forwards logs to the systemd journal; structured logging with metadata; index-based searching; good integration with systemd-based systems; logs accessible via journalctl or docker logs
- fluentd: forwards logs to a Fluentd collector; highly configurable log processing; supports multiple output destinations; requires Fluentd to be running; no access to logs via docker logs
- awslogs: sends logs to Amazon CloudWatch; native AWS integration; centralized logs for AWS deployments; region-specific configuration; managed retention and access control
- splunk: sends logs directly to Splunk; enterprise-grade log management; supports Splunk indexing and searching; requires the Splunk HTTP Event Collector; commercial solution with advanced features
- gelf (Graylog): sends logs in GELF format; supports compression; better structured data than syslog; works well with Graylog and Logstash; handles multi-line messages properly
- loki: designed for Grafana Loki; label-based indexing (like Prometheus); highly efficient storage; great integration with Grafana; optimized for cost-effectiveness

Logging drivers can be configured at the daemon level (for all containers) or per container:
# Set logging driver for a specific container
docker run --log-driver=syslog \
--log-opt syslog-address=udp://logs.example.com:514 \
--log-opt syslog-facility=daemon \
--log-opt tag="{{.Name}}/{{.ID}}" \
nginx
# Use JSON file driver with size-based rotation
docker run --log-driver=json-file \
--log-opt max-size=10m \
--log-opt max-file=3 \
--log-opt compress=true \
nginx
# Configure AWS CloudWatch logging
docker run --log-driver=awslogs \
--log-opt awslogs-region=us-west-2 \
--log-opt awslogs-group=my-container-logs \
--log-opt awslogs-stream=web-app \
my-web-app
# Send logs to Fluentd with custom labels
docker run --log-driver=fluentd \
--log-opt fluentd-address=localhost:24224 \
--log-opt tag="docker.{{.Name}}" \
--log-opt fluentd-async=true \
--log-opt labels=environment,service_name \
--label environment=production \
--label service_name=api \
my-api-service
Configure default logging driver for all containers in /etc/docker/daemon.json
:
{
"log-driver" : "json-file" ,
"log-opts" : {
"max-size" : "10m" ,
"max-file" : "3" ,
"compress" : "true" ,
"labels" : "environment,project" ,
"env" : "HOSTNAME,ENVIRONMENT"
}
}
After modifying daemon.json, restart the Docker daemon:
sudo systemctl restart docker
You can verify the current logging driver configuration:
docker info --format '{{.LoggingDriver}}'
For container-specific logging configuration:
docker inspect --format '{{.HostConfig.LogConfig}}' container_name
Common logging options across drivers:

- mode - blocking (default) or non-blocking log delivery
- max-buffer-size - buffer size for non-blocking mode
- env - include specific environment variables in logs
- labels - include container labels in logs
- tag - customize the log message tag format
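The mode and max-buffer-size options are most useful together; a minimal sketch of non-blocking delivery, where a ring buffer drops messages instead of blocking the application if the driver can't keep up:

```bash
# Use a 4 MB in-memory ring buffer so slow log delivery never blocks the app
docker run -d \
  --log-driver=fluentd \
  --log-opt fluentd-address=localhost:24224 \
  --log-opt mode=non-blocking \
  --log-opt max-buffer-size=4m \
  nginx
```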
# View container logs (shows all logs)
docker logs container_name
# Follow container logs in real-time (like tail -f)
docker logs -f container_name
# Show timestamps with logs (useful for debugging)
docker logs -t container_name
# Show last n lines (instead of entire log history)
docker logs --tail=100 container_name
# Combine options for real-time monitoring with timestamps
docker logs -f -t --tail=50 container_name
# Filter logs by time (show logs since specific timestamp)
docker logs --since 2023-07-01T00:00:00 container_name
# Show logs until a specific timestamp
docker logs --until 2023-07-02T00:00:00 container_name
# Show logs from relative time (e.g., last hour)
docker logs --since=1h container_name
# Grep for specific patterns in logs
docker logs container_name | grep ERROR
# Output logs in different formats (useful for parsing)
docker logs --details container_name
# Count occurrences of specific log patterns
docker logs container_name | grep -c "Connection refused"
# Extract and format log timestamps for timing analysis
docker logs -t container_name | awk '{print $1, $2}' | sort
# Monitor multiple containers simultaneously
docker logs $( docker ps -q ) 2>&1 | grep ERROR
# Show container logs with colorized output for different log levels
docker logs container_name | grep --color -E "ERROR|WARN|INFO|DEBUG|$"
Note that docker logs only works with containers using the json-file, local, or journald logging drivers. For other logging drivers (syslog, fluentd, etc.), you'll need to access logs through their respective systems.
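With the journald driver, for example, logs remain available to docker logs, but journalctl offers richer filtering; a brief sketch (the CONTAINER_NAME field is added automatically by the driver):

```bash
# Run a container with the journald driver
docker run -d --name web --log-driver=journald nginx
# Follow its logs through the systemd journal
journalctl CONTAINER_NAME=web -f
# Show only entries from the last 30 minutes with priority err or higher
journalctl CONTAINER_NAME=web --since "30 min ago" -p err
```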
Performance tips for docker logs:

- For large log volumes, use --tail to limit output
- Avoid frequent calls to docker logs on busy production systems
- Consider using the --since flag to limit the time range
- For high-traffic containers, use a dedicated logging solution instead of docker logs
- The docker logs command reads the entire log file before applying filters, which can be resource-intensive

Log rotation is essential to prevent disk space exhaustion. Configure it in the Docker daemon settings:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}

Here max-size is the maximum size of each log file before rotation, max-file is the number of rotated files to keep, and compress enables compression of rotated files (comments are not valid JSON, so keep explanations outside the file).
Container-specific log rotation can be configured at runtime:
# Configure log rotation for a specific container
docker run -d \
--log-opt max-size=5m \
--log-opt max-file=5 \
--log-opt compress=true \
--name web nginx
For existing deployments with unmanaged log files, you can use external log rotation:
# Example logrotate configuration (/etc/logrotate.d/docker-containers)
/var/lib/docker/containers/*/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
copytruncate
}
Docker's built-in json-file rotation happens automatically once a log file reaches max-size; there is no command to trigger it manually, and new logging options only apply to containers created after the configuration change, so existing containers must be recreated.
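To see how much space container logs are actually consuming, you can inspect the json-file logs directly; a quick check, assuming the default /var/lib/docker data root (root access required):

```bash
# List per-container json-file log sizes, largest last
sudo du -h /var/lib/docker/containers/*/*-json.log | sort -h
# Map a log file back to its container name
docker inspect --format '{{.Name}} {{.LogPath}}' $(docker ps -q)
```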
Common log rotation issues:

- Docker daemon restart required for daemon.json changes to take effect
- Containers created before the log rotation configuration won't have it applied
- Very high log volume containers may still face issues despite rotation
- copytruncate in logrotate may miss logs during the rotation window
- Nested JSON logs can be challenging to parse after rotation

Tools like Fluentd, Logstash, or Filebeat can collect logs from multiple containers and forward them to centralized logging systems:
version : '3.8'
services :
app :
image : my-app
logging :
driver : fluentd
options :
fluentd-address : fluentd:24224
tag : app.{{.Name}}
fluentd-async : "true"
fluentd-async-connect : "true"
labels : "environment,service_name,version"
env : "NODE_ENV,SERVICE_VERSION"
labels :
environment : production
service_name : api
version : "1.0.0"
environment :
NODE_ENV : production
SERVICE_VERSION : "1.0.0"
web :
image : nginx
logging :
driver : fluentd
options :
fluentd-address : fluentd:24224
tag : web.{{.Name}}
fluentd-async : "true"
depends_on :
- app
db :
image : postgres
logging :
driver : fluentd
options :
fluentd-address : fluentd:24224
tag : db.{{.Name}}
fluentd-async : "true"
fluentd :
image : fluentd/fluentd:v1.16-1
volumes :
- ./fluentd/conf:/fluentd/etc
- fluentd-data:/fluentd/log
ports :
- "24224:24224"
- "24224:24224/udp"
environment :
- FLUENTD_CONF=fluent.conf
restart : always
volumes :
fluentd-data :
Example Fluentd configuration (fluent.conf):
<source>
@type forward
port 24224
bind 0.0.0.0
</source>
# Parse Docker logs
<filter **>
@type parser
key_name log
reserve_data true
remove_key_name_field true
<parse>
@type json
json_parser json
</parse>
</filter>
# Add Kubernetes metadata if running in K8s
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
# Output to Elasticsearch and keep a local copy of logs for debugging.
# Only the first matching <match> block handles an event, so both outputs
# are combined here with the copy plugin.
<match **>
  @type copy
  <store>
    @type elasticsearch
    host elasticsearch
    port 9200
    logstash_format true
    logstash_prefix fluentd
    <buffer>
      @type file
      path /fluentd/log/buffer
      flush_thread_count 2
      flush_interval 5s
      chunk_limit_size 2M
      queue_limit_length 32
      retry_max_interval 30
      retry_forever true
    </buffer>
  </store>
  <store>
    @type file
    path /fluentd/log/${tag}/%Y/%m/%d.%H.%M
    append true
    <format>
      @type json
    </format>
    <buffer tag,time>
      @type file
      path /fluentd/log/copy-buffer
      timekey 1h
      timekey_use_utc true
      timekey_wait 10m
    </buffer>
  </store>
</match>
Alternative log aggregation solutions:
Filebeat : Lightweight log shipper from Elastic Stack
filebeat :
image : docker.elastic.co/beats/filebeat:8.8.0
volumes :
- ./filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
user : root
restart : always
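The snippet above mounts a ./filebeat.yml that isn't shown; a minimal sketch of what such a file might contain, assuming Docker autodiscovery with hints and the Elasticsearch host used in the other examples (the file name and output host are assumptions):

```bash
# Hypothetical minimal filebeat.yml, written via a heredoc for brevity
cat > filebeat.yml <<'EOF'
filebeat.autodiscover:
  providers:
    - type: docker
      hints.enabled: true
processors:
  - add_docker_metadata: ~
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
EOF
```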
Logstash : More powerful log processing pipeline
logstash :
image : docker.elastic.co/logstash/logstash:8.8.0
volumes :
- ./logstash/pipeline:/usr/share/logstash/pipeline:ro
- ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
ports :
- "5000:5000/tcp"
- "5000:5000/udp"
- "9600:9600"
environment :
LS_JAVA_OPTS : "-Xmx512m -Xms512m"
restart : always
Vector : High-performance observability data pipeline
vector :
image : timberio/vector:0.29.1-alpine
volumes :
- ./vector.toml:/etc/vector/vector.toml:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
ports :
- "8686:8686"
restart : always
For production environments, consider these centralized logging solutions:
ELK Stack (Elasticsearch, Logstash, Kibana) Open-source solution with powerful search capabilities Highly scalable with clustering support Rich visualization and dashboarding with Kibana Flexible log processing with Logstash pipelines Can be self-hosted or used as a managed service (Elastic Cloud) Resource-intensive, requires careful sizing and tuning Community and commercial support options available Graylog Open-source log management with enterprise features MongoDB for metadata storage, Elasticsearch for searching Stream-based routing and processing Built-in user management and role-based access control Alerting and reporting capabilities Native GELF format support Generally easier to set up than full ELK stack Splunk Enterprise-grade log analysis platform Powerful search processing language (SPL) Advanced analytics and machine learning capabilities Comprehensive dashboarding and visualization Extensive integration ecosystem Commercial solution with licensing costs Available as cloud service or self-hosted AWS CloudWatch Logs Native AWS service with tight integration Automatic scaling with no infrastructure to manage Log Insights for querying and analysis Integration with other AWS services Metric extraction from logs Pay-as-you-go pricing model Best for AWS-centric environments Google Cloud Logging Native GCP logging service Automatic ingestion from GKE and other GCP services Log Explorer for search and analysis Error Reporting for automatic grouping of errors Log-based metrics and alerting Integration with Cloud Monitoring Ideal for GCP-based workloads Azure Monitor Logs Native Azure service (formerly Log Analytics) Kusto Query Language (KQL) for log analysis Integrated with Azure Application Insights Workbooks for interactive reporting AI-powered analytics Unified with metrics in Azure Monitor Best choice for Azure-deployed applications Loki (Grafana Loki) Log aggregation system designed to be cost-effective Label-based indexing similar to Prometheus Works well with existing Grafana deployments Lower resource requirements than Elasticsearch Horizontal scalability Open-source with commercial support options Strong integration with Prometheus ecosystem Datadog Logs SaaS platform with unified observability Combines logs, metrics, and traces ML-powered analytics and anomaly detection Real-time log processing and filtering Extensive integration catalog Tag-based correlation across different data types Commercial solution with subscription pricing A production-ready ELK stack deployment with proper configuration and resource settings:
version : '3.8'
services :
elasticsearch :
image : docker.elastic.co/elasticsearch/elasticsearch:8.8.1
container_name : elasticsearch
environment :
- node.name=elasticsearch
- cluster.name=docker-cluster
- discovery.type=single-node
- bootstrap.memory_lock=true
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
- xpack.security.enabled=true
- ELASTIC_PASSWORD=changeme
ulimits :
memlock :
soft : -1
hard : -1
nofile :
soft : 65536
hard : 65536
cap_add :
- IPC_LOCK
volumes :
- es_data:/usr/share/elasticsearch/data
ports :
- "9200:9200"
networks :
- elk
healthcheck :
test : [ "CMD-SHELL" , "curl -s http://localhost:9200 | grep -q 'You Know, for Search'" ]
interval : 10s
timeout : 10s
retries : 120
logstash :
image : docker.elastic.co/logstash/logstash:8.8.1
container_name : logstash
volumes :
- ./logstash/pipeline:/usr/share/logstash/pipeline
- ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
ports :
- "5044:5044"
- "5000:5000/tcp"
- "5000:5000/udp"
- "9600:9600"
environment :
LS_JAVA_OPTS : "-Xmx256m -Xms256m"
networks :
- elk
depends_on :
- elasticsearch
kibana :
image : docker.elastic.co/kibana/kibana:8.8.1
container_name : kibana
ports :
- "5601:5601"
environment :
- ELASTICSEARCH_URL=http://elasticsearch:9200
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
- ELASTICSEARCH_USERNAME=kibana_system
- ELASTICSEARCH_PASSWORD=changeme
networks :
- elk
depends_on :
- elasticsearch
healthcheck :
test : [ "CMD-SHELL" , "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'" ]
interval : 10s
timeout : 10s
retries : 120
filebeat :
image : docker.elastic.co/beats/filebeat:8.8.1
container_name : filebeat
command : filebeat -e -strict.perms=false
volumes :
- ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
user : root
networks :
- elk
depends_on :
- elasticsearch
- logstash
networks :
elk :
driver : bridge
volumes :
es_data :
driver : local
## Application Logging Best Practices
::steps
### Log to STDOUT/STDERR
- **Write logs to standard output and error streams**
- Docker captures anything written to stdout/stderr
- No need to manage log files within containers
- Reduces complexity and disk space issues
- Allows container restarts without losing logs
- Makes logs available through `docker logs` command
- **Let Docker handle log collection**
- Docker daemon manages log storage and rotation
- Consistent logging mechanism across all containers
- Enables centralized log configuration
- Simplifies logging architecture
- Allows switching logging drivers without application changes
- **Follows the 12-factor app methodology**
- Principle IV: Treat logs as event streams
- Application doesn't concern itself with storage/routing
- Decouples log generation from log processing
- Enables easier horizontal scaling
- Promotes separation of concerns
- **Example logging practices:**

```javascript
// Node.js example - good practice
console.log(JSON.stringify({
  level: 'info',
  message: 'User logged in',
  timestamp: new Date().toISOString(),
  userId: user.id
}));

// Bad practice - writing to files
// fs.appendFileSync('/var/log/app.log', 'User logged in\n');
```
### Use Structured Logging

- **Use JSON or other structured formats**
  - Enables machine-readable log processing
  - Maintains relationships between data fields
  - Simplifies parsing and querying
  - Preserves data types and hierarchies
  - Facilitates automated analysis
- **Include essential metadata fields**
  - timestamp: ISO 8601 format with timezone (e.g., 2023-07-01T12:34:56.789Z)
  - level: severity level (e.g., debug, info, warn, error)
  - service: service or component name
  - message: human-readable description
  - traceId: distributed tracing identifier
- **Add correlation IDs for request tracking**
  - Enables tracking requests across multiple services
  - Helps with distributed system debugging
  - Simplifies complex workflow analysis
  - Essential for microservice architectures
  - Example fields: requestId, correlationId, traceId, spanId
- **Include contextual information**
  - User identifiers (anonymized if needed)
  - Resource identifiers (e.g., orderId, productId)
  - Operation being performed
  - Source information (e.g., IP address, user agent)
  - Performance metrics (e.g., duration, resource usage)

Example structured log format:
{
"timestamp" : "2023-07-01T12:34:56.789Z" ,
"level" : "error" ,
"service" : "payment-service" ,
"message" : "Payment processing failed" ,
"traceId" : "abc123def456" ,
"userId" : "user-789" ,
"orderId" : "order-456" ,
"error" : {
"code" : "INSUFFICIENT_FUNDS" ,
"message" : "Insufficient funds in account"
},
"paymentMethod" : "credit_card" ,
"amount" : 99.95 ,
"duration_ms" : 236
}
### Use Appropriate Log Levels

- **Implement proper log levels**
  - DEBUG: detailed information for development/debugging
  - INFO: confirmation that things are working as expected
  - WARN: something unexpected, but not necessarily an error
  - ERROR: something failed and should be investigated
  - FATAL/CRITICAL: the system is unusable; immediate attention required
- **Configure the appropriate level for each environment**
  - Development: DEBUG or INFO for maximum visibility
  - Testing/QA: INFO or WARN to reduce noise
  - Production: WARN or ERROR to minimize performance impact
  - Use environment variables to control log levels. Example:
# Set log level via environment variable
docker run -e LOG_LEVEL=INFO my-app
### Handle Sensitive Data Carefully

- **Security considerations**
  - Never log credentials, tokens, or API keys
  - Hash or mask sensitive personal information
  - Comply with relevant regulations (GDPR, CCPA, HIPAA)
  - Be cautious with stack traces in production
  - Implement log field redaction for sensitive data
- **Example:**
// Logging with redaction
logger.info({
  user: { id: user.id, email: redactEmail(user.email) },
  action: "profile_update",
  changes: redactSensitiveFields(changes)
});
- **Include actionable information in error logs**
  - Error type and message
  - Stack trace (in development/testing)
  - Context that led to the error
  - Request parameters (sanitized)
  - Correlation IDs for tracing
  - Suggested resolution steps where applicable
- **Performance considerations**
  - Log volume impacts system performance
  - Use sampling for high-volume debug logs
  - Consider asynchronous logging for performance-critical paths
  - Implement circuit breakers for logging failures
  - Monitor and alert on abnormal log volume
::

## Health Checks

Health checks provide automated monitoring of container health status, enabling Docker to detect and recover from application failures.
# Add health check to Dockerfile
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
CMD curl -f http://localhost/ || exit 1
Health check parameters:

- interval: time between checks (default: 30s)
- timeout: maximum time for a check to complete (default: 30s)
- start-period: initial grace period before failures count (default: 0s)
- retries: consecutive failures before the container is marked unhealthy (default: 3)
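Health checks can also be attached at run time to images that don't define a HEALTHCHECK; a small sketch using the equivalent docker run flags:

```bash
# Equivalent health check supplied at container start rather than in the image
docker run -d --name web \
  --health-cmd="curl -f http://localhost/ || exit 1" \
  --health-interval=30s \
  --health-timeout=3s \
  --health-start-period=60s \
  --health-retries=3 \
  nginx
```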
version : '3.8'
services :
web :
image : nginx
healthcheck :
test : [ "CMD" , "curl" , "-f" , "http://localhost" ]
interval : 30s
timeout : 10s
retries : 3
start_period : 40s
api :
image : my-api:latest
healthcheck :
test : [ "CMD" , "wget" , "-O" , "/dev/null" , "-q" , "http://localhost:8080/health" ]
interval : 15s
timeout : 5s
retries : 5
start_period : 30s
redis :
image : redis:alpine
healthcheck :
test : [ "CMD" , "redis-cli" , "ping" ]
interval : 10s
timeout : 5s
retries : 3
postgres :
image : postgres:13
healthcheck :
test : [ "CMD-SHELL" , "pg_isready -U postgres" ]
interval : 10s
timeout : 5s
retries : 5
start_period : 20s
Health check best practices:
- Design checks that validate core functionality, not just that the process is running
- Keep checks lightweight to avoid resource consumption
- Include proper timeouts to prevent hanging checks
- Set appropriate start periods for slow-starting applications
- Implement health endpoints that check dependencies
- Use exit codes properly (0 = healthy, 1 = unhealthy)
- Avoid complex scripts that might fail for reasons unrelated to application health

Health check states:

- starting: during the start period, not yet counted as unhealthy
- healthy: the check is passing
- unhealthy: the check is failing

You can check container health status:
docker inspect --format= '{{.State.Health.Status}}' container_name
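Beyond inspecting a single container, you can list everything that is currently failing its health check and watch status changes over time:

```bash
# List only containers whose health check is failing
docker ps --filter health=unhealthy
# Watch health status for all running containers every 5 seconds
watch -n 5 'docker ps --format "table {{.Names}}\t{{.Status}}"'
```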
Key metrics to monitor:

- CPU usage: overall utilization percentage; user vs. system time; CPU throttling events; load average; context switches and interrupts; CPU usage per core/thread
- Memory consumption: RSS (resident set size, actual physical memory); virtual memory size; cache and buffer usage; heap vs. non-heap (for JVM applications); memory limit usage percentage; OOM (out-of-memory) kill events; memory page faults
- Disk I/O: read/write operations per second; bytes read/written per second; read/write latency; disk queue length; disk space usage (per volume); inode usage; filesystem errors
- Network traffic: bytes received/transmitted per second; packets received/transmitted per second; network errors and dropped packets; connection count (established, time-wait, etc.); TCP retransmits; DNS query time; network latency
- Container health status: health check status; uptime/age; restart count and reasons; init process status; container state changes; exit codes from previous runs
- Application-specific metrics: request rate and throughput; response time (average, percentiles); error rate and types; queue lengths and processing time; cache hit/miss ratio; database query performance; business KPIs (orders, users, conversions)
- Container lifecycle metrics: container restart count; container creation/destruction rate; image pull time; container start time; build duration; failed starts/deployments
- Resource orchestration metrics: number of running containers; resource allocation vs. usage; scheduling failures; node health and capacity; autoscaling events; deployment success rate; resource distribution balance

Example prometheus.yml for comprehensive Docker monitoring:
global :
scrape_interval : 15s
evaluation_interval : 15s
scrape_timeout : 10s
# Add labels to all time series or alerts
external_labels :
environment : production
region : us-west-1
# Rule files contain recording rules and alerting rules
rule_files :
- "/etc/prometheus/rules/*.yml"
scrape_configs :
# Docker daemon metrics
- job_name : 'docker'
static_configs :
- targets : [ 'docker-host:9323' ]
metrics_path : '/metrics'
scheme : 'http'
# Container metrics from cAdvisor
- job_name : 'cadvisor'
static_configs :
- targets : [ 'cadvisor:8080' ]
metrics_path : '/metrics'
scheme : 'http'
scrape_interval : 10s
# Host metrics from node-exporter
- job_name : 'node-exporter'
static_configs :
- targets : [ 'node-exporter:9100' ]
metrics_path : '/metrics'
scheme : 'http'
# Prometheus self-monitoring
- job_name : 'prometheus'
static_configs :
- targets : [ 'localhost:9090' ]
# Application metrics
- job_name : 'application'
static_configs :
- targets : [ 'app:8080' ]
metrics_path : '/actuator/prometheus'
scheme : 'http'
# Auto-discover containers with Prometheus annotations
- job_name : 'docker-containers'
docker_sd_configs :
- host : unix:///var/run/docker.sock
filters :
- name : label
values : [ 'prometheus.io.scrape=true' ]
relabel_configs :
# Use the container name as the instance label
- source_labels : [ '__meta_docker_container_name' ]
regex : '/(.*)'
target_label : 'instance'
replacement : '$1'
# Extract metrics path from container label
- source_labels : [ '__meta_docker_container_label_prometheus_io_metrics_path' ]
regex : '(.+)'
target_label : '__metrics_path__'
replacement : '$1'
# Build the scrape address from the container address and the port label
- source_labels : [ '__address__' , '__meta_docker_container_label_prometheus_io_port' ]
regex : '([^:]+)(?::\d+)?;(\d+)'
target_label : '__address__'
replacement : '$1:$2'
# Add container labels as Prometheus labels
- action : labelmap
regex : __meta_docker_container_label_(.+)
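For the docker_sd_configs job above to pick up a workload, the container has to carry the matching labels; a brief sketch of starting a service so it is auto-discovered (the my-app image, its port 8080, and the /metrics path are assumptions):

```bash
# Label a container so the docker_sd_configs job scrapes it automatically
docker run -d \
  --label prometheus.io.scrape=true \
  --label prometheus.io.port=8080 \
  --label prometheus.io.metrics_path=/metrics \
  my-app
```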
For alerting, add an AlertManager configuration:
alerting :
alertmanagers :
- static_configs :
- targets : [ 'alertmanager:9093' ]
# Example alert rules file (/etc/prometheus/rules/container_alerts.yml)
groups :
- name : container_alerts
rules :
- alert : ContainerCpuUsage
expr : (sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) / scalar(count(node_cpu_seconds_total{mode="idle"}))) * 100 > 80
for : 5m
labels :
severity : warning
annotations :
summary : "Container CPU usage (instance {{ $labels.instance }})"
description : "Container CPU usage is above 80% \n VALUE = {{ $value }}% \n LABELS = {{ $labels }}"
- alert : ContainerMemoryUsage
expr : (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100) > 80
for : 5m
labels :
severity : warning
annotations :
summary : "Container Memory usage (instance {{ $labels.instance }})"
description : "Container Memory usage is above 80% \n VALUE = {{ $value }}% \n LABELS = {{ $labels }}"
1. Access Grafana (default: http://localhost:3000)
   - Ensure the Grafana service is running and the port is accessible
   - Check for any proxy or network restrictions
2. Log in with the default credentials (admin/admin)
   - You'll be prompted to change the password on first login
   - Set a strong password and store it securely
   - Consider setting up additional users with appropriate permissions
3. Add the Prometheus data source
   - Navigate to Configuration > Data Sources
   - Click "Add data source" and select "Prometheus"
   - Set the URL to http://prometheus:9090 (or the appropriate address)
   - Set the scrape interval to match the Prometheus configuration
   - Test the connection to ensure it works
   - Advanced settings:
HTTP Method: GET
Type: Server (default)
Access: Server (default)
Disable metrics lookup: No
Custom query parameters: None
4. Import Docker monitoring dashboards
   - Navigate to Dashboards > Import
   - Enter a dashboard ID or upload a JSON file
   - Recommended dashboard IDs: 893 (Docker and system monitoring, 1 server), 10619 (Docker monitoring with Prometheus), 11467 (container metrics dashboard), 1860 (Node Exporter Full)
   - Adjust variables to match your environment
   - Save the dashboard with an appropriate name and folder
5. Configure alerts
   - Navigate to Alerting > Alert Rules
   - Create alert rules based on critical metrics
   - Set appropriate thresholds and evaluation intervals
   - Configure notification channels (email, Slack, PagerDuty, etc.)
   - Test alerts to ensure proper delivery
   - Example alert rule:
Rule name: High Container CPU Usage
Data source: Prometheus
Expression: max by(container_name) (rate(container_cpu_usage_seconds_total{container_name!=""}[1m]) * 100) > 80
Evaluation interval: 1m
Pending period: 5m
Recommended dashboard types:

- Docker host metrics dashboard: system load, CPU, memory, disk, and network; host uptime and stability metrics; Docker daemon metrics; number of running containers; resource utilization trends. Example panels: CPU usage by container (stacked graph), memory usage by container (stacked graph), container status count (stat panel), system load (gauge), disk space usage (pie chart)
- Container resource usage dashboard: per-container CPU, memory, and I/O metrics; container restart counts; health check status; network traffic by container. Key visualizations: heatmap of container resource usage, time-series charts for each resource type, top-N resource consumers (table), container lifecycle events (annotations), resource limit vs. actual usage comparison
- Application-specific metrics dashboards: business KPIs relevant to your application; request rates, error rates, and latencies; database connection pool status; cache hit ratios; custom instrumentation metrics; user experience metrics. Example: an e-commerce dashboard with orders per minute, cart abandonment rate, payment processing time, product search latency, and active user sessions
- Alert thresholds visualization: combine metrics with alert thresholds; visual indicators for approaching thresholds; alert history and frequency; mean time to resolution tracking; alert correlation with system events. Example panel: graph with colored threshold bands
- Log correlation views: combined metrics and log panels; error rate correlation with log volume; event markers on time-series charts; log context for anomalies; drill-down from metrics to logs. Example: request latency chart with error log entries as annotations

Configure alerts for critical conditions:
- Container restarts: alert when containers restart too frequently; track crash loops and failure patterns; set thresholds based on service criticality; correlate with deployment events. Example query: increase(kube_pod_container_status_restarts_total[1h]) > 3
- High resource usage: CPU utilization exceeding thresholds (e.g., >80% for 15 min); memory approaching limits (e.g., >90% of limit); disk space running low (e.g., <10% free); network saturation. Example query: container_memory_usage_bytes / container_spec_memory_limit_bytes * 100 > 90
- Health check failures: failed health checks for critical services; increased health check latency; intermittent health check failures; health check timeout frequency. Example query: rate(container_health_status_changes_total{health_status="unhealthy"}[5m]) > 0
- Error log patterns: increased error rate in application logs; critical error messages or patterns; authentication failures; database connection issues. Example query: rate(log_messages_total{level="error"}[5m]) > 10
- Abnormal application behavior: request latency spikes; increased error responses (HTTP 5xx); abnormal traffic patterns; unexpected drops in transactions; database query performance degradation. Example query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
- Infrastructure issues: Docker daemon errors; storage driver issues; network connectivity problems; host resource exhaustion; system-level errors. Example query: rate(docker_engine_daemon_errors_total[5m]) > 0
# Comprehensive Alertmanager configuration example
global :
# The smarthost and SMTP sender used for mail notifications
smtp_smarthost : 'smtp.example.org:587'
smtp_from : 'alertmanager@example.org'
smtp_auth_username : 'alertmanager'
smtp_auth_password : 'password'
# The auth token for Slack
slack_api_url : 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXX'
# PagerDuty integration
pagerduty_url : 'https://events.pagerduty.com/v2/enqueue'
# Default notification settings
resolve_timeout : 5m
# Templates for notifications
templates :
- '/etc/alertmanager/template/*.tmpl'
# The root route on which all alerts enter
route :
# Default receiver
receiver : 'team-emails'
# Group alerts by category and environment
group_by : [ 'alertname' , 'environment' , 'service' ]
# Wait 30s to aggregate alerts of the same group
group_wait : 30s
# Send updated notification if new alerts added to group
group_interval : 5m
# How long to wait before sending a notification again
repeat_interval : 4h
# Child routes
routes :
- receiver : 'critical-pages'
matchers :
- severity="critical"
# Don't wait to send critical alerts
group_wait : 0s
# Lower interval for critical alerts
repeat_interval : 1h
# Continue to forward to other child routes
continue : true
- receiver : 'slack-notifications'
matchers :
- severity=~"warning|info"
# Continue processing other child routes
continue : true
- receiver : 'database-team'
matchers :
- service=~"database|postgres|mysql"
- receiver : 'frontend-team'
matchers :
- service=~"frontend|ui|web"
# Inhibition rules prevent notifications of lower severity alerts if a higher severity
# alert is already firing for the same issue
inhibit_rules :
- source_matchers :
- severity="critical"
target_matchers :
- severity="warning"
# Only inhibit if the alertname is the same
equal : [ 'alertname' , 'instance' ]
receivers :
- name : 'team-emails'
email_configs :
- to : 'team@example.com'
send_resolved : true
html : '{{ template "email.default.html" . }}'
headers :
Subject : '{{ template "email.subject" . }}'
- name : 'critical-pages'
pagerduty_configs :
- service_key : 'your-pagerduty-service-key'
send_resolved : true
description : '{{ template "pagerduty.default.description" . }}'
severity : 'critical'
client : 'Alertmanager'
client_url : 'https://alertmanager.example.com'
- name : 'slack-notifications'
slack_configs :
- channel : '#monitoring'
send_resolved : true
icon_url : 'https://avatars3.githubusercontent.com/u/3380462'
title : '{{ template "slack.default.title" . }}'
title_link : 'https://alertmanager.example.com/#/alerts'
text : '{{ template "slack.default.text" . }}'
footer : 'Prometheus Alertmanager'
actions :
- type : 'button'
text : 'Runbook 📚'
url : '{{ (index .Alerts 0).Annotations.runbook_url }}'
- name : 'database-team'
slack_configs :
- channel : '#db-alerts'
send_resolved : true
title : '{{ template "slack.default.title" . }}'
text : '{{ template "slack.default.text" . }}'
- name : 'frontend-team'
slack_configs :
- channel : '#frontend-alerts'
send_resolved : true
title : '{{ template "slack.default.title" . }}'
text : '{{ template "slack.default.text" . }}'
Effective log analysis is crucial for troubleshooting container issues. Here are practical techniques for extracting valuable information from container logs:
# Debugging container issues with error filtering
docker logs --tail=100 -t container_name | grep ERROR
# Follow logs with specific pattern matching (using extended regex)
docker logs -f container_name | grep -E "error|exception|fail|fatal|panic"
# Extract logs with context (3 lines before and after each match)
docker logs container_name | grep -A 3 -B 3 "Exception"
# Find all occurrences of specific error types and count them
docker logs container_name | grep -o "OutOfMemoryError" | wc -l
# Search for logs within a specific time window
docker logs --since 2023-07-01T10:00:00 --until 2023-07-01T11:00:00 container_name
# Save logs to file for offline analysis
docker logs container_name > container.log
# Compare logs across time periods
docker logs --since 2h container_name > recent.log
docker logs --since 4h --until 2h container_name > older.log
diff recent.log older.log | less
# Parse JSON logs for better readability
docker logs container_name | grep -v '^$' | jq '.'
# Extract specific fields from JSON logs
docker logs container_name | grep -v '^$' | jq 'select(.level=="error") | {timestamp, message, error}'
# Follow logs and highlight different log levels with colors
docker logs -f container_name | GREP_COLOR="01;31" grep -E --color=always 'ERROR|$' | GREP_COLOR="01;33" grep -E --color=always 'WARN|$'
# Extract logs for a specific request ID
docker logs container_name | grep "request-123456"
# Find slow operations (e.g., queries taking more than 1 second)
docker logs container_name | grep -E "took [1-9][0-9]{3,}ms"
# Extract stack traces
docker logs container_name | grep -A 20 "Exception" | grep -v "^$"
# Analyze log volume by time (logs per minute)
docker logs -t container_name | cut -d ' ' -f1 | cut -d 'T' -f2 | cut -c1-8 | sort | uniq -c
Advanced debugging techniques:

- Correlate logs with system events (deployments, scaling, etc.)
- Compare logs across multiple containers for distributed issues
- Use timestamps to create a sequence of events
- Check container environment variables for configuration issues
- Analyze container startup logs separately from runtime logs
- Monitor both application and infrastructure logs simultaneously
- Use regex patterns to extract structured data from unstructured logs

For distributed tracing, implement OpenTelemetry or Jaeger:

- Track requests as they flow through distributed systems
- Generate trace IDs to correlate logs across services
- Instrument code with the OpenTelemetry SDK
- Deploy Jaeger or Zipkin collectors

Example Jaeger deployment:
version : '3.8'
services :
jaeger :
image : jaegertracing/all-in-one:1.37
ports :
- "6831:6831/udp" # Jaeger agent - accepts spans in Thrift format
- "16686:16686" # Jaeger UI
environment :
- COLLECTOR_ZIPKIN_HOST_PORT=:9411
Trace requests across multiple containers:

- Propagate trace context between services
- Capture the request path across microservices
- Record parent-child relationships between spans
- Preserve baggage items for request context

Example client instrumentation (Node.js):
const tracer = opentelemetry.trace.getTracer('my-service');
const span = tracer.startSpan('process-order');
try {
  // Add attributes to the span
  span.setAttribute('order.id', orderId);
  span.setAttribute('customer.id', customerId);
  // Create a child span for the database work
  const dbSpan = tracer.startSpan('database-query', {
    parent: span,
  });
  // Process database operations
  dbSpan.end();
  // Propagate context to outgoing HTTP requests
  const headers = {};
  opentelemetry.propagation.inject(opentelemetry.context.active(), headers);
  await axios.post('http://inventory-service/check', data, { headers });
} catch (error) {
  span.recordException(error);
  span.setStatus({ code: SpanStatusCode.ERROR });
} finally {
  span.end();
}
Measure service latency:

- Calculate time spent in each service
- Break down processing time by operation
- Identify slow components or dependencies
- Compare latency across different environments
- Correlate latency with resource utilization

Example span attributes for latency analysis:
database.query.duration_ms: 45.2
http.request.duration_ms: 120.7
cache.lookup_time_ms: 2.1
business_logic.processing_time_ms: 15.8
Identify bottlenecks:

- Analyze the critical path in request processing
- Find operations with the highest latency contribution
- Detect contention points and resource constraints
- Quantify the impact of external dependencies
- Apply trace-based hotspot analysis (e.g., a dashboard showing service latency breakdown)

Visualize service dependencies:

- Generate service dependency graphs
- Analyze traffic patterns between services
- Calculate error rates between service pairs
- Identify redundant or unnecessary calls
- Detect circular dependencies
- Example visualization tools: Jaeger UI service graph, Grafana service graph panel, Kiali for service mesh visualization, custom D3.js visualizations

Expose application-specific metrics:

- Identify key business and technical metrics
- Instrument application code with metrics
- Expose metrics endpoints (/metrics)
- Design meaningful metric names and labels
- Balance cardinality and usefulness

Example custom metrics:
# Counter for business events
order_total{status="completed",payment_method="credit_card"} 1550.75
# Gauge for active resource usage
active_user_sessions{region="us-west"} 1250
# Histogram for latency distributions
api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="0.1"} 1500
api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="0.5"} 1950
api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="1.0"} 1990
api_request_duration_seconds_bucket{endpoint="/api/v1/products",le="+Inf"} 2000
Implement Prometheus client libraries:

- Use official client libraries for language-specific instrumentation
- Create counters, gauges, histograms, and summaries
- Register metrics with the registry
- Set up middleware for standard metrics
- Add custom labels for filtering and aggregation

Example implementation (Python):
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
# Create metrics
REQUEST_COUNT = Counter('app_request_count', 'Application Request Count', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('app_request_latency_seconds', 'Application Request Latency', ['method', 'endpoint'])
ACTIVE_SESSIONS = Gauge('app_active_sessions', 'Active Sessions', ['region'])
# Start metrics endpoint
start_http_server(8000)
# Update metrics in code
def process_request(request):
ACTIVE_SESSIONS.labels(region='us-west').inc()
with REQUEST_LATENCY.labels(method='POST', endpoint='/api/v1/order').time():
# Process request
result = handle_request(request)
REQUEST_COUNT.labels(method='POST', endpoint='/api/v1/order', status=result.status_code).inc()
ACTIVE_SESSIONS.labels(region='us-west').dec()
return result
Create custom dashboards:

- Design purpose-specific visualization panels
- Combine technical and business metrics
- Create drill-down capabilities
- Use appropriate visualization types
- Implement dynamic variables for filtering

Example Grafana dashboard JSON structure:
{
"title" : "E-Commerce Platform Dashboard" ,
"panels" : [
{
"title" : "Order Volume by Payment Method" ,
"type" : "barchart" ,
"datasource" : "Prometheus" ,
"targets" : [
{
"expr" : "sum by(payment_method) (order_total)" ,
"legendFormat" : "{{payment_method}}"
}
]
},
{
"title" : "API Latency (95th Percentile)" ,
"type" : "timeseries" ,
"datasource" : "Prometheus" ,
"targets" : [
{
"expr" : "histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, endpoint))" ,
"legendFormat" : "{{endpoint}}"
}
]
}
]
}
Set relevant thresholds:

- Define SLIs (Service Level Indicators) and SLOs (Service Level Objectives)
- Create alert thresholds based on business impact
- Implement multi-level thresholds (warning, critical)
- Use historical data to establish baselines
- Account for traffic patterns and seasonality

Example alert thresholds:
- alert : APIHighLatency
expr : histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, endpoint)) > 0.5
for : 5m
labels :
severity : warning
annotations :
summary : "High API latency on {{ $labels.endpoint }}"
description : "95th percentile latency is {{ $value }}s, which exceeds the SLO of 0.5s"
Correlate with business metrics:

- Connect technical metrics to business outcomes
- Measure the conversion impact of performance issues
- Track costs associated with resource usage
- Create composite KPI dashboards
- Implement business impact scoring

Example correlation queries:
# Conversion rate vs. page load time
sum(rate(purchase_completed_total[1h])) / sum(rate(product_page_view_total[1h]))
# Revenue impact of errors
sum(rate(order_total{status="error"}[1h]))
# Customer satisfaction correlation
rate(support_ticket_created{category="performance"}[1d]) / rate(active_user_sessions[1d])
Implement auto-scaling based on metrics:

- Configure horizontal pod autoscaling
- Use custom metrics for scaling decisions
- Set appropriate cooldown periods
- Implement predictive scaling for known patterns
- Test scaling behavior under various conditions

Example Docker Swarm service scaling:
# Autoscaling with docker service update
while true; do
  # Get the current request rate from Prometheus
  REQUESTS=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(http_requests_total[1m]))' \
    | jq -r '.data.result[0].value[1]')
  # Calculate desired replicas (1 replica per 100 requests/second)
  DESIRED=$(echo "$REQUESTS / 100" | bc)
  if [ "$DESIRED" -lt 2 ]; then DESIRED=2; fi
  if [ "$DESIRED" -gt 10 ]; then DESIRED=10; fi
  # Get current replicas
  CURRENT=$(docker service ls --filter name=myapp --format "{{.Replicas}}" | cut -d '/' -f1)
  # Scale if needed
  if [ "$DESIRED" -ne "$CURRENT" ]; then
    echo "Scaling from $CURRENT to $DESIRED replicas"
    docker service update --replicas "$DESIRED" myapp
  fi
  sleep 30
done
Configure auto-healing for failed containers:

- Set appropriate restart policies
- Implement health checks for accurate failure detection
- Configure liveness and readiness probes
- Capture diagnostic information before restarts
- Implement circuit breakers for dependent services

Example Docker Compose configuration:
services :
app :
image : myapp:latest
restart : unless-stopped
healthcheck :
test : [ "CMD" , "curl" , "-f" , "http://localhost:8080/health" ]
interval : 10s
timeout : 5s
retries : 3
start_period : 40s
deploy :
restart_policy :
condition : on-failure
max_attempts : 3
window : 120s
logging :
driver : "json-file"
options :
max-size : "10m"
max-file : "3"
Create runbooks for common issues:

- Document standard troubleshooting procedures
- Include diagnostic commands and expected output
- Link alerts to specific runbook sections
- Provide escalation paths for unresolved issues
- Maintain version control for runbooks

Example runbook structure:
# API Service High Latency Runbook
## Symptoms
- API response time > 500ms (95th percentile)
- Increased error rate in downstream services
- Alert: APIHighLatency firing
## Diagnostic Steps
1. Check container resource usage:
docker stats api-service
2. Examine database connection pool:
curl http://api-service:8080/actuator/metrics/hikaricp.connections.usage
3. Check for slow queries:
docker logs api-service | grep "slow query"
## Resolution Steps
1. If connection pool utilization > 80%:
- Increase pool size in config
- Restart service with `docker restart api-service`
2. If slow queries detected:
- Check database indexes
- Optimize identified queries
3. If CPU/memory usage high:
- Scale service: `docker service scale api-service=3`
## Escalation
If issue persists after 15 minutes:
- Contact: database-team@example.com
- Slack: #database-support
Develop automated remediation:

- Implement scripted responses to common problems
- Create self-healing capabilities
- Add circuit breakers for degraded dependencies
- Implement graceful degradation modes
- Balance automation with human oversight

Example automated remediation script:
#!/bin/bash
# Automated database connection reset
# Check for connection errors in the last 5 minutes
ERROR_COUNT=$(docker logs --since 5m db-service | grep -c "connection reset")
if [ "$ERROR_COUNT" -gt 10 ]; then
  echo "Detected database connection issues, performing remediation"
  # Capture diagnostics before remediation
  docker logs --since 15m db-service > /var/log/remediation/db-$(date +%s).log
  # Perform remediation
  docker exec db-service /scripts/connection-reset.sh
  # Verify the fix
  sleep 5
  NEW_ERRORS=$(docker logs --since 1m db-service | grep -c "connection reset")
  # Notify on the outcome
  if [ "$NEW_ERRORS" -eq 0 ]; then
    echo "Remediation successful" | slack-notify "#monitoring"
  else
    echo "Remediation failed, escalating" | slack-notify "#monitoring" --urgent
    # Trigger PagerDuty incident
    pagerduty-trigger "Database connection issues persist after remediation"
  fi
fi
Set up escalation policies:

- Define clear escalation thresholds
- Create tiered response teams
- Implement on-call rotation schedules
- Track mean time to acknowledge and resolve
- Document communication protocols

Example escalation policy:
# PagerDuty escalation policy
escalation_policies :
- name : "API Service Escalation"
description : "Escalation policy for API service incidents"
num_loops : 2
escalation_rules :
- escalation_delay_in_minutes : 15
targets :
- id : "PXXXXXX" # Primary on-call engineer
type : "user_reference"
- escalation_delay_in_minutes : 15
targets :
- id : "PXXXXXX" # Secondary on-call engineer
type : "user_reference"
- escalation_delay_in_minutes : 30
targets :
- id : "PXXXXXX" # Engineering manager
type : "user_reference"
For comprehensive monitoring:
Combine infrastructure and application metrics:

- Correlate container metrics with application performance
- Identify resource constraints affecting application behavior
- View system and application health holistically
- Create dashboards showing both layers together
- Implement consistent labeling across metric types

Establish performance baselines:

- Capture normal behavior patterns
- Document expected resource utilization
- Create baseline metrics for different load levels
- Use percentiles rather than averages
- Update baselines after significant changes

Example baseline document:
API Service Baseline (10 rps):
- CPU: 0.2-0.4 cores (p95)
- Memory: 250-350MB (p95)
- Latency: 75-150ms (p95)
- Error rate: <0.1%
Track historical trends:

- Store metrics with appropriate retention policies
- Analyze seasonal patterns and growth trends
- Compare current performance to historical data
- Detect gradual degradation over time
- Correlate performance changes with application updates

Example trend analysis:
# PromQL query for weekly latency comparison
avg_over_time(histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{endpoint="/api/v1/search"}[5m])) by (le))[7d:1d])
Correlate logs with metrics:

- Connect error spikes with log events
- Add trace IDs to both logs and metrics
- Use common timestamps and identifiers
- Create linked visualizations
- Implement log-derived metrics
- Example correlation techniques: use Grafana's Explore view to show logs and metrics side by side; add log annotations to metric graphs; create composite dashboards with both data types; use trace IDs to link between systems

Monitor from both internal and external perspectives:

- Internal: resource usage, component health
- External: end-user experience, global availability
- Edge performance: CDN, DNS, SSL
- Third-party dependencies
- Regional variations in performance

Example multi-perspective monitoring:
# Internal metric (container health)
- name : "container_health"
query : "container_memory_usage_bytes{name=~'api-.*'} / container_spec_memory_limit_bytes{name=~'api-.*'} * 100 > 80"
# External metric (user experience)
- name : "user_experience"
query : "probe_duration_seconds{job='blackbox',target='https://api.example.com/health'} > 0.5"
Implement synthetic checks:

- Simulate user interactions regularly
- Test critical business workflows
- Monitor from multiple geographic locations
- Set up transactional checks
- Alert on business-impacting failures

Example synthetic check configuration:
# Blackbox exporter configuration
- job_name : 'blackbox'
metrics_path : /probe
params :
module : [ http_2xx ] # Look for a HTTP 200 response
static_configs :
- targets :
- https://api.example.com/health
- https://api.example.com/ready
- https://www.example.com/login
relabel_configs :
- source_labels : [ __address__ ]
target_label : __param_target
- source_labels : [ __param_target ]
target_label : instance
- target_label : __address__
replacement : blackbox-exporter:9115
Log to STDOUT/STDERR:

- Follow container best practices
- Enable centralized collection
- Avoid managing log files inside containers
- Let Docker logging drivers handle transport
- Example: configure applications to write directly to stdout/stderr rather than to log files

Use a structured logging format:

- Implement JSON-formatted logs
- Include consistent metadata fields
- Use proper data types within JSON
- Add correlation IDs for request tracking
- Maintain schema consistency

Example structured log:
{
"timestamp" : "2023-07-10T15:23:45.123Z" ,
"level" : "error" ,
"service" : "order-service" ,
"message" : "Failed to process payment" ,
"traceId" : "abc123def456" ,
"orderId" : "order-789" ,
"errorCode" : "PAYMENT_DECLINED" ,
"duration_ms" : 345
}
Implement log rotation:

- Configure size- and time-based rotation
- Set appropriate retention periods
- Compress rotated logs
- Monitor disk usage
- Handle rotation gracefully

Example Docker logging configuration:
{
"log-driver" : "json-file" ,
"log-opts" : {
"max-size" : "20m" ,
"max-file" : "5" ,
"compress" : "true"
}
}
Set appropriate log levels:

- Use DEBUG for development environments
- Use INFO or WARN for production
- Reserve ERROR for actionable issues
- Make log levels configurable at runtime
- Use consistent level definitions across services

Example log level configuration:
logging :
level :
root : WARN
com.example.api : INFO
com.example.database : WARN
com.example.payment : INFO
Configure centralized logging:

- Aggregate logs from all containers
- Implement proper indexing and search
- Set up log parsing and normalization
- Configure access controls for logs
- Establish retention and archival policies
- Example EFK stack setup: Filebeat or Fluentd to collect logs; Elasticsearch for storage and indexing; Kibana for visualization and search; Curator for index lifecycle management

Avoid sensitive data in logs:

- Implement data masking for PII
- Never log credentials or secrets
- Truncate potentially large payloads
- Remove sensitive headers
- Anonymize personal identifiers

Example masking implementation:
function logRequest(req) {
  const sanitized = {
    method: req.method,
    path: req.path,
    query: sanitizeObject(req.query),
    headers: sanitizeHeaders(req.headers),
    body: sanitizeObject(req.body)
  };
  logger.info({ request: sanitized }, "Incoming request");
}

function sanitizeObject(obj) {
  const masked = { ...obj };
  const sensitiveFields = ['password', 'token', 'ssn', 'credit_card'];
  for (const field of sensitiveFields) {
    if (masked[field]) masked[field] = '***REDACTED***';
  }
  return masked;
}
Monitor both host and containers:

- Track host-level resources (CPU, memory, disk, network)
- Monitor container-specific metrics
- Correlate container performance with host constraints
- Watch for noisy neighbor problems
- Track Docker daemon health
- Example monitoring stack: Node Exporter for host metrics, cAdvisor for container metrics, the Docker daemon metrics endpoint, and process-specific metrics from applications

Implement alerting with appropriate thresholds:

- Create multi-level alerts (warning/critical)
- Avoid alert fatigue with proper thresholds
- Include runbook links in alert notifications
- Group related alerts to reduce noise
- Implement alert de-duplication

Example alerting configuration:
- alert : ContainerHighCpuUsage
expr : rate(container_cpu_usage_seconds_total{name!=""}[1m]) * 100 > 80
for : 5m
labels :
severity : warning
annotations :
summary : "Container CPU usage high ({{ $labels.name }})"
description : "Container CPU usage is above 80% for 5 minutes"
runbook_url : "https://wiki.example.com/runbooks/container-cpu"
Use visualization dashboards:

- Create role-specific dashboards
- Include both overview and detailed views
- Use appropriate visualization types
- Implement template variables for filtering
- Balance information density with readability
- Example dashboard organization: executive overview (high-level health and KPIs); operations dashboard (system health and resource usage); developer dashboard (application performance and errors); service-specific dashboards (detailed metrics for each service); on-call dashboard (current alerts and recent incidents)

Track business-relevant metrics:

- Monitor key performance indicators (KPIs)
- Create business-technical correlation views
- Measure user experience metrics
- Track conversion and engagement metrics
- Monitor transaction value and volume

Example business metrics:
# Business metrics in Prometheus format
# HELP order_value_total Total value of orders in currency
# TYPE order_value_total counter
order_value_total{currency="USD",status="completed"} 15234.50
# HELP checkout_started Total number of checkout processes started
# TYPE checkout_started counter
checkout_started 5423
# HELP checkout_completed Total number of checkout processes completed
# TYPE checkout_completed counter
checkout_completed 4231
Implement health checks:

- Create meaningful application health endpoints
- Check dependencies in health probes
- Separate readiness from liveness
- Make health checks lightweight
- Include version and dependency information

Example health check endpoint:
app.get('/health', async (req, res) => {
  try {
    // Check database connectivity
    const dbStatus = await checkDatabase();
    // Check cache service
    const cacheStatus = await checkRedis();
    // Check external API dependency
    const apiStatus = await checkExternalApi();
    const allHealthy = dbStatus && cacheStatus && apiStatus;
    res.status(allHealthy ? 200 : 503).json({
      status: allHealthy ? 'healthy' : 'unhealthy',
      version: '1.2.3',
      timestamp: new Date().toISOString(),
      components: {
        database: dbStatus ? 'up' : 'down',
        cache: cacheStatus ? 'up' : 'down',
        api: apiStatus ? 'up' : 'down'
      }
    });
  } catch (error) {
    res.status(500).json({ status: 'error', error: error.message });
  }
});
Plan for monitoring scalability:

- Design for growth in container count
- Implement metric aggregation for high-cardinality data
- Use appropriate retention policies
- Consider the resource requirements of the monitoring tools themselves
- Implement federation for large-scale deployments
- Example scalability techniques: Prometheus hierarchical federation; service discovery for dynamic environments; metric aggregation and downsampling; sharding metrics by service or namespace; custom recording rules for common queries
# Recording rules for efficient querying
groups :
- name : container_aggregation
interval : 1m
rules :
- record : job:container_cpu:usage_rate5m
expr : sum(rate(container_cpu_usage_seconds_total[5m])) by (job)
- record : job:container_memory:usage_bytes
expr : sum(container_memory_usage_bytes) by (job)