Docker Swarm is Docker's native clustering and orchestration solution that turns a group of Docker hosts into a single virtual host, providing high availability and load balancing for containerized applications. Unlike external orchestrators, Swarm is built into the Docker engine, offering a simpler approach to container orchestration with minimal setup overhead while maintaining the familiar Docker CLI experience.
Manager nodes: Maintain cluster state, orchestrate services, and handle API requests; an odd number of managers (3, 5, or 7) is recommended for high availability
Distributed state store (Raft consensus): Built-in database that maintains consistent cluster state across manager nodes; requires a majority of managers (N/2+1) to remain available (a quick quorum check is sketched after this list)
API endpoints: Expose the Docker API for cluster management; automatically load-balanced across manager nodes
Orchestration decision making: Schedule tasks, reconcile desired state, handle node failures, and perform rolling updates
High availability configuration: Replicated state across multiple manager nodes with leader election for fault tolerance
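Because of the quorum rule above, a cluster of N managers tolerates the loss of (N-1)/2 of them. A quick, minimal way to check this from the CLI (a sketch; it only counts nodes, nothing more):
# Count manager nodes and work out quorum and fault tolerance
MANAGERS=$(docker node ls --filter role=manager --format '{{.Hostname}}' | wc -l)
echo "Managers: $MANAGERS"
echo "Quorum needed: $(( MANAGERS / 2 + 1 ))"
echo "Failures tolerated: $(( (MANAGERS - 1) / 2 ))"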
# Initialize a new swarm (on manager node)
docker swarm init --advertise-addr <MANAGER-IP>
# --advertise-addr specifies the address other nodes will use to connect to this manager
# If omitted, Docker will attempt to auto-detect the IP, which may pick the wrong interface on hosts with multiple network interfaces
# Output will provide a token to join worker nodes
# Example output:
# docker swarm join --token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-8vxv8rssmk743ojnwacrr2e7c <MANAGER-IP>:2377
# This token authenticates new nodes to the swarm, ensuring secure cluster expansion
# Join a node as worker
docker swarm join --token <TOKEN> <MANAGER-IP>:2377
# Workers receive and execute tasks but cannot manage the cluster state
# Port 2377 is the standard swarm management port
# Join a node as manager
docker swarm join-token manager
# Run this on an existing manager; it prints the join command with a manager-specific token
# Then use the provided token to join as manager
docker swarm join --token <MANAGER-TOKEN> <MANAGER-IP>:2377
# Manager nodes participate in the Raft consensus and can manage the cluster
# List nodes in the swarm
docker node ls
# Shows all nodes with their roles, availability status, and manager status
# MANAGER STATUS column shows "Leader" for the primary manager node
In Docker Swarm, services are the definitions of tasks to execute on the manager or worker nodes:
Services define the container image, replicas, networks, and more, forming the core abstraction layer for deploying applications
Tasks are the individual containers running across the swarm, with each task corresponding to one container instance
The scheduler distributes tasks across available nodes based on resource availability, placement constraints, and service requirements
The orchestrator maintains the desired state by continuously monitoring and reconciling the actual state with the desired configuration, automatically replacing failed containers and handling scaling operations
Services follow a declarative model, where you specify the desired state of your application (what you want to happen), and Swarm continuously works to ensure that state is maintained, handling failures and updates automatically.
# Create a replicated service
docker service create --name webserver \
--replicas 3 \ # Deploy 3 instances of the container
--publish 80:80 \ # Publish port 80 on all swarm nodes
--update-delay 10s \ # Wait 10s between updating each container
--update-parallelism 1 \ # Update one container at a time
--restart-condition on-failure \ # Automatically restart containers that exit non-zero
nginx:latest # Container image to use
# List services
docker service ls
# Shows all services with their ID, name, mode (replicated/global), replicas, image, and ports
# Inspect a service
docker service inspect webserver
# Provides detailed JSON output of the service configuration, including:
# - Container configuration
# - Resource constraints
# - Networks
# - Update and rollback configuration
# - Placement constraints
# View service tasks (containers)
docker service ps webserver
# Shows all tasks (containers) for the service with their:
# - ID, name, image
# - Node assignment
# - Desired and current state
# - Error messages (if any)
# - When the task was created and updated
# Scale a service
docker service scale webserver=5
# Increases or decreases the number of replicas
# Swarm will create or remove containers to match the desired count
# Scale multiple services at once with space-separated arguments: docker service scale webserver=5 api=3
# Update a service
docker service update --image nginx:1.21 \ # Change the image version
--limit-cpu 0.5 \ # Add CPU limits
--limit-memory 512M \ # Add memory limits
--rollback-parallelism 2 \ # Configure rollback behavior
webserver
# Remove a service
docker service rm webserver
# Removes the service and stops all associated containers
# Create service with specific publishing mode
docker service create --name web \
--publish mode=host,target=80,published=8080 \
nginx
# 'host' mode bypasses the routing mesh: the port is published only on nodes running a task, so at most one task per node can use that port
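For comparison, the default routing-mesh behaviour can be spelled out explicitly with mode=ingress. This is a sketch with an arbitrary service name and published port:
# Equivalent long-form syntax for the default (routing mesh) publishing mode
docker service create --name web-ingress \
--publish mode=ingress,target=80,published=8080 \
nginx
# With ingress mode, port 8080 answers on every swarm node and is load-balanced across tasks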
Docker Swarm provides built-in service discovery, allowing containers to find and communicate with each other using service names:
# Service discovery example
docker service create --name db \
--network backend-network \ # Join the overlay network
--mount type=volume,source=db-data,target=/var/lib/postgresql/data \ # Persistent storage
--env POSTGRES_PASSWORD_FILE=/run/secrets/db_password \ # Use secrets for security
--replicas 1 \ # Database typically runs a single instance
postgres:13 # Database image
docker service create --name api \
--network backend-network \ # Same network as the db service
--env DB_HOST=db \ # Reference the database by service name
--env DB_PORT=5432 \ # Standard PostgreSQL port
--replicas 3 \ # Run multiple API instances
--update-order start-first \ # Start new tasks before stopping old ones during updates
myapi:latest # API image
Service discovery works through:
Internal DNS server: Every container in Swarm can query the embedded DNS server
Service VIPs (Virtual IPs): Each service gets a virtual IP in the overlay network
DNS round-robin: Services created with --endpoint-mode dnsrr resolve their name to the individual task IPs instead of a single VIP
Example connection from inside a container:
# Connect to database from inside the api container
docker exec -it $(docker ps -q -f name=api) bash
ping db # Resolves to the service VIP
psql -h db -U postgres -d myapp # Connect to database using service name
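To observe the DNS round-robin behaviour described above rather than a single VIP, a service can be created with dnsrr endpoint mode. A minimal sketch, reusing the backend network and an assumed image:
# Create a service whose name resolves to individual task IPs instead of a VIP
docker service create --name api-dnsrr \
--network backend-network \
--endpoint-mode dnsrr \
--replicas 3 \
myapi:latest
# nslookup api-dnsrr from another container on the network returns one A record per task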
For services with multiple replicas, connections are automatically load-balanced:
# Create frontend service that connects to the API
docker service create --name frontend \
--network backend-network \
--env API_URL=http://api:8000/ \ # API service name is automatically resolved and load-balanced
--publish 80:80 \
frontend:latest
Stacks allow you to define and deploy complete applications with multiple services using a compose file:
# Deploy a stack
docker stack deploy -c docker-compose.yml myapp
# This creates all services, networks, and volumes defined in the compose file
# All resources are labeled with the stack name for easy management
# Services are named as: <stack-name>_<service-name>
# List stacks
docker stack ls
# Shows all stacks with the number of services in each
# List services in a stack
docker stack services myapp
# Shows detailed information about each service in the stack, including:
# - Replica status (current/desired)
# - Image and configuration
# - Port mappings
# List tasks in a stack
docker stack ps myapp
# Shows all running containers across the stack with:
# - Status (running, failed, etc.)
# - Error messages if applicable
# - Node assignment
# - Current and desired state
# Remove a stack
docker stack rm myapp
# Removes all services and networks created by the stack
# Does NOT remove volumes to protect data
Stacks provide several advantages:
Declarative deployment: Define entire application topology in a single file
Version control: Store compose files in source control for application versioning
Application integrity: Deploy or remove all components together
Simplified management: Group related services for easier monitoring and updates
Resource isolation: Each stack has its own namespace for services
# docker-compose.yml for Swarm deployment
version: '3.8' # Compose file format version with swarm support
services:
web:
image: nginx:latest
ports:
- "80:80" # Published port (accessible from outside)
deploy:
mode: replicated # 'replicated' (scaled) or 'global' (one per node)
replicas: 3 # Number of container instances
update_config:
parallelism: 1 # Update one container at a time
delay: 10s # Wait 10s between updates
order: start-first # Start new tasks before stopping old ones
failure_action: rollback # Auto-rollback on failure
restart_policy:
condition: on-failure # Restart if container exits non-zero
max_attempts: 3 # Maximum restart attempts
window: 120s # Time window to evaluate restart attempts
placement:
constraints:
- node.role == worker # Only run on worker nodes
preferences:
- spread: node.labels.datacenter # Spread across datacenters
resources:
limits:
cpus: '0.5' # Maximum CPU usage
memory: 512M # Maximum memory usage
reservations:
cpus: '0.1' # Guaranteed CPU allocation
memory: 128M # Guaranteed memory allocation
api:
image: myapi:latest
deploy:
replicas: 3
placement:
constraints:
- node.labels.zone == frontend # Only run on nodes with this label
max_replicas_per_node: 1 # Limit instances per node for HA
update_config:
parallelism: 2 # Update two containers at once
rollback_config:
parallelism: 3 # Faster rollback if needed
environment:
- DB_HOST=db # Service discovery by name
- API_KEY_FILE=/run/secrets/api_key # Reference a secret
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s # Initial grace period
networks:
- backend # Connect to backend network
secrets:
- api_key # Reference to secret defined below
db:
image: postgres:13
volumes:
- db-data:/var/lib/postgresql/data # Persistent storage
deploy:
placement:
constraints:
- node.labels.zone == database # Only run on database nodes
replicas: 1 # Single database instance
restart_policy:
condition: any # Always restart
environment:
- POSTGRES_PASSWORD_FILE=/run/secrets/db_password
networks:
- backend
secrets:
- db_password
networks:
backend:
driver: overlay # Multi-host network
attachable: true # Allow standalone containers to connect
driver_opts:
encrypted: "true" # Encrypt traffic on this network
volumes:
db-data: # Named volume for persistent data
driver: local # Use local storage (default)
# For production, consider using a volume driver that supports replication
secrets:
api_key:
file: ./secrets/api_key.txt # Load from local file during deployment
db_password:
file: ./secrets/db_password.txt
# Create a secret
echo "mypassword" | docker secret create db_password -
# Secrets are stored encrypted in the Raft log
# They are never written to disk unencrypted
# `-` at the end indicates input from stdin
# Create a service with a secret
docker service create --name db \
--secret db_password \ # Reference the created secret
--env POSTGRES_PASSWORD_FILE=/run/secrets/db_password \ # Tell application where to find it
--secret source=ssl_cert,target=server.crt \ # Custom target path for a secret
--secret source=ssl_key,target=server.key,mode=0400 \ # Custom file permissions
postgres:13
# Inside the container, secrets appear as files in /run/secrets/
# Example:
# /run/secrets/db_password
# /run/secrets/server.crt
# /run/secrets/server.key
# List secrets
docker secret ls
# Shows all secrets with creation time and ID
# Inspect a secret (shows metadata only, not the actual content)
docker secret inspect db_password
# Remove a secret
docker secret rm db_password
# Note: Cannot remove secrets used by running services
# Must update or remove services first
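# Rotating a secret in place (a sketch; the _v2 name is illustrative):
# create a new secret, then swap it on the service while keeping the same mount path
echo "newpassword" | docker secret create db_password_v2 -
docker service update \
--secret-rm db_password \
--secret-add source=db_password_v2,target=db_password \
db
# Once no service references the old secret, it can be removed with docker secret rm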
# Create a secret from a file
docker secret create ssl_cert ./server.crt
Configs allow you to store and distribute configuration files across the swarm:
# Create a config
docker config create nginx_conf nginx.conf
# Configs are similar to secrets but intended for non-sensitive data
# Unlike secrets, configs are not encrypted at rest and are mounted directly into the container's filesystem
# Also stored in the Swarm Raft log for distribution
# Create service with config
docker service create --name webserver \
--config source=nginx_conf,target=/etc/nginx/nginx.conf \ # Mount at specific path
--config source=proxy_params,target=/etc/nginx/proxy.conf,mode=0444 \ # Read-only
--publish 80:80 \
nginx:latest
# Inside the container, configs appear as files at the specified paths
# Unlike volumes, configs are immutable after creation
# To update a config, create a new one and update the service
# List configs
docker config ls
# Inspect a config
docker config inspect nginx_conf
# Shows metadata and creation information
# Remove a config
docker config rm nginx_conf
# Like secrets, configs in use cannot be removed
# Update a service to use a new config
docker service update --config-rm nginx_conf --config-add source=nginx_conf_v2,target=/etc/nginx/nginx.conf webserver
Configs are ideal for:
Application configuration files
Static web content
Default settings
SSL certificates (if not sensitive)
Any file that needs to be identical across all service instances
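A small end-to-end sketch of the points above (the names and paths are illustrative):
# Create a config from stdin and mount it into a throwaway service
echo "greeting=hello" | docker config create app_settings -
docker service create --name config-demo \
--config source=app_settings,target=/etc/app-settings.conf \
--replicas 1 \
alpine:latest sleep 3600
# On the node running the task, confirm the file arrived
docker exec -it $(docker ps -q -f name=config-demo) cat /etc/app-settings.conf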
# Create a service with resource constraints
docker service create --name resource-limited \
--limit-cpu 0.5 \ # Maximum CPU usage (50% of one core)
--limit-memory 512M \ # Maximum memory usage (512MB)
--reserve-cpu 0.2 \ # Guaranteed CPU (20% of one core)
--reserve-memory 256M \ # Guaranteed memory (256MB)
--generic-resource "gpu=1" \ # Request 1 GPU (the node must advertise it via node-generic-resources in the daemon config)
--ulimit nofile=65536:65536 \ # Set file descriptor limits
--ulimit nproc=4096:4096 \ # Process limits
nginx:latest
# Resource constraints ensure:
# 1. Fair resource distribution across services
# 2. Protection against noisy neighbors
# 3. Predictable performance
# 4. Informed placement decisions by the scheduler
# The scheduler uses reservations for placement: a task only lands on a node
# with enough unreserved capacity to satisfy the reservation
# Limits, in contrast, cap the resources a running container may consume
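# To confirm what the scheduler will enforce, read the stored resource spec back (sketch):
docker service inspect --format '{{json .Spec.TaskTemplate.Resources}}' resource-limited
# Prints the Limits and Reservations recorded for the service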
# Update with rolling update strategy
docker service update \
--image nginx:1.21 \ # New image to deploy
--update-parallelism 2 \ # Update 2 tasks at a time
--update-delay 20s \ # Wait 20s between updating each batch
--update-order start-first \ # Start new tasks before stopping old ones
--update-failure-action pause \ # Pause updates if a task fails
--update-monitor 30s \ # Monitor new tasks for 30s before proceeding
--update-max-failure-ratio 0.2 \ # Allow 20% of tasks to fail before pausing
--health-cmd "curl -f http://localhost/ || exit 1" \ # Health check command
--health-interval 5s \ # Check health every 5s during update
--health-retries 3 \ # Number of retries for health check
webserver
# The rolling update process:
# 1. Start 2 new containers with the updated image
# 2. Wait for them to be healthy (30s monitoring period)
# 3. If healthy, stop 2 old containers
# 4. Wait 20s (update-delay)
# 5. Repeat until all containers are updated
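# The progress of an in-flight update can be followed via the recorded update status (sketch):
docker service inspect --format '{{.UpdateStatus.State}}: {{.UpdateStatus.Message}}' webserver
# Only meaningful after at least one update; states include updating, paused,
# completed, rollback_started, and rollback_completed
watch docker service ps webserver
# Live view of old tasks shutting down and new ones starting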
# Rollback to previous version
docker service update --rollback webserver
# This reverts to the configuration before the last update
# Configure rollback behavior
docker service update \
--rollback-parallelism 3 \ # Rollback 3 tasks at a time
--rollback-delay 10s \ # Wait 10s between batches
--rollback-failure-action continue \ # Continue rollback even if tasks fail
--rollback-order stop-first \ # Stop old tasks before starting new ones
--rollback-monitor 10s \ # Monitor new tasks for 10s before proceeding
webserver
# Check update/rollback status
docker service inspect --pretty webserver
# Shows current update status, progress, and configuration
# View update history
docker service ps --no-trunc webserver
# Shows all previous versions of tasks with their images
Rolling updates and rollbacks allow you to:
Deploy new versions without downtime
Test updates with canary-style partial rollouts (see the sketch after this list)
Quickly revert problematic changes
Maintain control over the update process and timing
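One simple way to approximate a canary using only the flags shown above: update a single task and set a long delay so it can be observed before the rest follow (a sketch with an illustrative image tag and delay).
# Update one task, then wait 10 minutes before touching the next
docker service update \
--image nginx:1.22 \
--update-parallelism 1 \
--update-delay 10m \
--update-failure-action rollback \
webserver
# If the first updated task misbehaves, roll back manually before the delay expires
docker service update --rollback webserver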
# List nodes
docker node ls
# Output shows:
# - NODE ID: Unique identifier for each node
# - HOSTNAME: Node's hostname
# - STATUS: Ready, Down, or Disconnected
# - AVAILABILITY: Active, Pause, or Drain
# - MANAGER STATUS: Leader, Reachable, or Unreachable (for manager nodes)
# - ENGINE VERSION: Docker engine version
# Inspect node
docker node inspect node-1
# Returns detailed information about the node:
# - Labels and node attributes
# - Resource availability
# - Platform and architecture
# - Network addresses
# - TLS certificate information
# - Status and health
# Format inspect output for specific information
docker node inspect -f '{{.Status.Addr}}' node-1
# Returns just the node's IP address
# Set node availability
docker node update --availability drain node-1 # Prepare for maintenance
# Drain: Stops scheduling new tasks and removes existing tasks from the node
# Existing tasks are gracefully rescheduled to other nodes
# Perfect for maintenance operations and updates
docker node update --availability active node-1 # Return to service
# Active: Node accepts new tasks and retains existing tasks
# Pause: Node keeps existing tasks but won't receive new ones
# Add labels to nodes
docker node update --label-add zone=frontend node-1
docker node update --label-add datacenter=east --label-add cpu=high-performance node-1
# Labels enable sophisticated scheduling strategies:
# - Geographic distribution
# - Hardware differentiation
# - Environment separation (prod/staging)
# - Specialized workload targeting
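# Example: target the labels added above with placement constraints (a sketch)
docker service create --name edge-proxy \
--constraint 'node.labels.zone==frontend' \
--constraint 'node.labels.datacenter==east' \
nginx:latest
# Tasks are only scheduled on nodes that satisfy every --constraint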
# Filter nodes by labels
docker node ls --filter node.label=zone=frontend
# Promote worker to manager
docker node promote node-2
# Converts worker to manager role
# Joins the Raft consensus group
# Receives a copy of the cluster state
# Can now accept management commands
# Demote manager to worker
docker node demote node-2
# Removes node from management plane
# If this would break quorum, the operation fails
# Best practice: Demote before removing manager nodes
# Remove a node from the swarm
# First, on the node to remove:
docker swarm leave
# Then on a manager node:
docker node rm node-3
# Backup swarm state (on a manager node)
# Stop Docker first so the Raft data on disk is consistent, then restart it afterwards
systemctl stop docker
tar -czvf swarm-backup.tar.gz /var/lib/docker/swarm
systemctl start docker
# This includes:
# - Raft logs and consensus data
# - TLS certificates and keys
# - Secret data (encrypted)
# - Node information and join tokens
# For a more comprehensive backup, include:
docker service ls > services.txt
docker service inspect $(docker service ls -q) > service-details.json
docker secret ls > secrets.txt
docker config ls > configs.txt
docker network ls --filter driver=overlay > networks.txt
# Restore swarm
# 1. Stop Docker on the manager
systemctl stop docker
# 2. Restore the backup
tar -xzvf swarm-backup.tar.gz -C /
# Backup should be restored to the same path structure
# 3. Start Docker
systemctl start docker
# Docker will initialize using the restored swarm state
# The node resumes its role (leader or follower)
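# If the restore is onto a fresh machine, or the other managers are lost, re-initialize
# from the restored state so this node becomes the single manager of a new cluster:
docker swarm init --force-new-cluster
# Remaining managers and workers must then re-join using freshly generated tokens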
# 4. Verify the restore
docker node ls
docker service ls
# Alternative backup approach if the cluster runs Docker UCP (Universal Control Plane)
# The UCP backup file also captures the swarm configuration
docker container run --log-driver none --rm -i --name ucp \
-v /var/run/docker.sock:/var/run/docker.sock \
docker/ucp:latest backup --id $(docker container ls -q --filter name=ucp-controller) \
--passphrase "secret" > ucp-backup.tar.gz
For a complete disaster recovery plan:
Regular automated backups of swarm state (a script sketch follows this list)
Documentation of all custom configurations
Scripts to recreate services if backup is unavailable
Testing of restore procedures in a sandbox environment
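A minimal automation sketch for the first point, combining the commands shown earlier (paths, scheduling, and retention are assumptions to adapt):
#!/bin/bash
# swarm-backup.sh - run periodically (e.g. via cron) on a non-leader manager
set -euo pipefail
BACKUP_DIR=/backups/swarm
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"
# Export service, secret, config, and network definitions for reference
docker service ls > "$BACKUP_DIR/services-$DATE.txt"
docker service inspect $(docker service ls -q) > "$BACKUP_DIR/service-details-$DATE.json"
docker secret ls > "$BACKUP_DIR/secrets-$DATE.txt"
docker config ls > "$BACKUP_DIR/configs-$DATE.txt"
docker network ls --filter driver=overlay > "$BACKUP_DIR/networks-$DATE.txt"
# Stop Docker briefly so the Raft data is consistent, then archive and restart
systemctl stop docker
tar -czf "$BACKUP_DIR/swarm-state-$DATE.tar.gz" /var/lib/docker/swarm
systemctl start docker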
# Check swarm status
docker info | grep Swarm
# Verify if node is in swarm mode and its role (manager/worker)
# View task logs
docker service logs service_name
# Add --details to see metadata including which node is running each task
docker service logs --details service_name
# Check failed tasks
docker service ps --filter "desired-state=shutdown" service_name
# Failed tasks are given the desired state "shutdown"; check the CURRENT STATE and ERROR columns
# (there is no "actual-state" filter; supported filters include id, name, node, and desired-state)
# View task history
docker service ps --no-trunc service_name
# Shows all tasks including previously failed attempts
# Verify network connectivity
docker run --rm --network service_network alpine ping service_name
# Tests DNS resolution and connectivity within the overlay network (the network must have been created with --attachable)
# Check service configuration
docker service inspect --pretty service_name
# Human-readable service configuration
# Examine network configuration
docker network inspect service_network
# Shows network details, connected containers, and subnet information
# Check node status and health
docker node inspect --pretty node_name
# Human-readable node information including health status
# View service constraints
docker service inspect --format '{{.Spec.TaskTemplate.Placement.Constraints}}' service_name
# Shows placement constraints that might prevent scheduling
# Debug service discovery
docker run --rm --network service_network nicolaka/netshoot \
dig service_name
# Advanced network debugging using specialized tools