When Faster Ansible Becomes Incorrect
Mitogen, Concurrency, and Hidden Dependencies in Real-World Provisioning
Mitogen is one of the most impactful performance optimizations available for Ansible. In large infrastructures, it can reduce execution time dramatically and make Ansible feel responsive again.
But at scale, raw speed is not the primary challenge.
Correctness is.
This article explains why fully parallel execution with Mitogen can introduce subtle correctness issues in real provisioning workflows, how those issues manifest, and why selective strategy switching is required in production.
Why Mitogen Is Worth Using
Let me be clear upfront: Mitogen delivers significant performance improvements that justify its adoption. Measured across hundreds of production deployments:
- Fresh installations: ~18% faster overall
- Update operations: ~35% faster on average
- File operations: 40–60% improvement
- Package management (APT): 20–30% faster
- Connection overhead: Near-zero after initial setup
For teams managing hundreds or thousands of hosts, these improvements translate to hours saved daily. Mitogen is not the problem — uncontrolled parallelism is.
The Hidden Assumption Behind Parallel Execution
Traditional Ansible executes tasks sequentially within each play. This creates an implicit ordering guarantee that many playbooks unknowingly depend on:
- name: Install package
  apt:
    name: nginx
- name: Configure nginx
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
- name: Start nginx
  systemd:
    name: nginx
    state: started
With sequential execution, these tasks always run in order. With Mitogen's parallel optimization, timing changes — and hidden dependencies surface.
The Real Problem: Implicit Dependencies
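Mitogen does not reorder tasks, but it removes the per-task delays that many playbooks quietly depend on, so tasks that were never truly independent start to collide. In real provisioning, many tasks rely on implicit dependencies: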
Package Dependencies
# This fails if both run in parallel
- name: Install Python development packages
  apt:
    name: python3-dev
- name: Compile Python extension
  command: python3 setup.py build_ext
Configuration Dependencies
# Service might restart before config is fully written
- name: Deploy service configuration
  template:
    src: app.conf.j2
    dest: /etc/app/config.yml
  notify: restart app
- name: Ensure service is running
  systemd:
    name: app
    state: started
Network Dependencies
# SSH changes can break the active connection
- name: Harden SSH configuration
  template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
- name: Restart SSH
  systemd:
    name: ssh
    state: restarted
Sequential execution hides these assumptions. Parallel execution exposes them — often in production, at the worst possible time.
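One way to make the configuration dependency above explicit, sketched under the assumption that a handler named `restart app` is defined in the play's handlers section: `meta: flush_handlers` forces pending handlers to run at a known point instead of at the end of the play, so the ordering no longer depends on timing.

```yaml
- name: Deploy service configuration
  template:
    src: app.conf.j2
    dest: /etc/app/config.yml
  notify: restart app          # assumes a handler with this name exists

- name: Run pending handlers now, before anything depends on the new config
  meta: flush_handlers

- name: Ensure service is running
  systemd:
    name: app
    state: started
```

The same idea applies to the other examples: state the dependency as a task or a check rather than relying on how long the previous task happens to take.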
Failure Modes Observed in Practice
Over two years of running Mitogen in production, I've catalogued these failure patterns:
1. Command Execution Before Dependencies
TASK [Configure database] ******************
fatal: [db-01]: FAILED! => {
    "msg": "psql: command not found"
}
The PostgreSQL client package installation was still running when the configuration task started.
2. Service Restarts Before Configuration
TASK [Restart nginx] **********************
changed: [web-01]
TASK [Deploy nginx config] ****************
changed: [web-01]
Nginx restarted with the old configuration because the restart ran before the template task had written the new file.
3. Connection Loss During Network Changes
TASK [Apply firewall rules] ***************
fatal: [bastion-01]: UNREACHABLE! => {
    "msg": "Failed to connect to the host via ssh"
}
Firewall rules were applied while other tasks were still executing, breaking the SSH connection.
4. Non-Deterministic Failures
The most insidious pattern: tasks that fail 20-30% of the time, succeed on retry, and leave no clear error pattern. These are almost always ordering bugs exposed by parallel execution.
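A task that passes on the second attempt usually points to a missing precondition. The sketch below shows one possible way to state such preconditions explicitly instead of retrying. The package name `postgresql-client` and the variable `db_host` are assumptions for illustration, not taken from the deployments described here.

```yaml
- name: Install PostgreSQL client
  apt:
    name: postgresql-client    # assumption: Debian/Ubuntu package name
    state: present

- name: Fail fast if psql is still missing
  command: psql --version
  changed_when: false          # read-only check, never reports a change

- name: Wait until the database port accepts connections
  wait_for:
    host: "{{ db_host }}"      # hypothetical variable
    port: 5432
    timeout: 60
```

If the check task fails, the error names the missing dependency directly rather than surfacing later as an intermittent failure.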
Ansible Strategy Semantics
Understanding Ansible's execution strategies is crucial:
linear (default)
- Executes each task on all hosts before moving to the next task
- Guarantees ordering within a play
- Connection overhead on every task
mitogen_linear
- Maintains persistent connections
- Overlaps independent operations
- Preserves task order semantics in theory
- Changes execution timing in practice
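free
- Each host runs through its tasks as fast as it can, without waiting for other hosts
- Task order is preserved per host, but there is no ordering guarantee across hosts
- Rarely safe for provisioning workflows that coordinate between hosts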
Mitogen doesn't change what tasks do — it changes when they do it. And in provisioning, timing matters.
The Solution: Selective Strategy Switching
The key insight: not all tasks are safe to run with parallel optimizations. The solution is selective strategy switching based on task characteristics.
Force LINEAR Strategy for Critical Operations
- name: Critical operations that must run sequentially
  hosts: all                 # assumption: adjust to your inventory
  strategy: linear           # play-level keyword; it cannot be set through vars on a block
  tasks:
    - name: Update SSH configuration
      template:
        src: sshd_config.j2
        dest: /etc/ssh/sshd_config
    - name: Restart SSH service
      systemd:
        name: ssh
        state: restarted
    - name: Wait for SSH to stabilize
      wait_for_connection:
        delay: 5
        timeout: 30
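A further safeguard for this sequence, offered as a sketch rather than a definitive recipe: the `template` module's `validate` option checks the rendered file before it replaces the live one, so a syntax error never reaches the restart. The `hosts: all` pattern and the `/usr/sbin/sshd` path are assumptions.

```yaml
- name: Harden SSH with validation (sketch)
  hosts: all                   # assumption: adjust to your inventory
  strategy: linear
  tasks:
    - name: Update SSH configuration, refusing invalid files
      template:
        src: sshd_config.j2
        dest: /etc/ssh/sshd_config
        validate: /usr/sbin/sshd -t -f %s   # assumption: sshd binary path
    - name: Restart SSH service
      systemd:
        name: ssh
        state: restarted
    - name: Wait for SSH to come back
      wait_for_connection:
        delay: 5
        timeout: 30
```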
Task Categories Requiring Sequential Execution
Network and Connectivity
- SSH daemon configuration
- Firewall rule changes
- Network interface configuration
- VPN service modifications
- DNS resolver updates
Service Lifecycle
- Service stop/start/restart operations
- Systemd daemon reloads
- Container runtime changes
- Database primary/replica switchovers
System Critical
- Kernel parameter changes
- Boot configuration updates
- Authentication system changes
- Time synchronization adjustments
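For the connectivity-related categories, a conservative play might look like the sketch below. It assumes the `ufw` module from the community.general collection and a hypothetical `bastions` inventory group. `serial: 1` limits the change to one host at a time, so a bad rule cannot lock out the whole group.

```yaml
- name: Apply firewall rules without cutting off Ansible
  hosts: bastions              # hypothetical group name
  strategy: linear
  serial: 1                    # process one host at a time
  tasks:
    - name: Allow SSH before enabling the firewall
      ufw:                     # assumption: community.general is installed
        rule: allow
        port: "22"
        proto: tcp
    - name: Enable the firewall
      ufw:
        state: enabled
    - name: Confirm the host is still reachable
      wait_for_connection:
        timeout: 30
```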
Safe for Mitogen Acceleration
File Operations
- Template deployment
- File copies
- Directory creation
- Permission changes
Package Management
- Package installation (without immediate use)
- Repository configuration
- Package cache updates
Information Gathering
- Fact collection
- Command output capture
- State verification
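One way to map these categories onto a project is to split them into separate playbooks, each with plays that declare their own strategy. The layout below is a hypothetical sketch; the three file names are assumptions.

```yaml
# site.yml (hypothetical layout)
- import_playbook: playbooks/files-and-packages.yml     # plays here set strategy: mitogen_linear
- import_playbook: playbooks/network-and-services.yml   # plays here set strategy: linear
- import_playbook: playbooks/verify.yml                 # read-only checks, strategy: mitogen_linear
```

Keeping the split at the playbook level makes the strategy decision visible in one place and easy to review.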
Implementation Patterns
Pattern 1: Play-Level Strategy Override
# strategy is a play keyword, so each group of tasks becomes its own play
- name: Safe parallel operations
  hosts: all                 # assumption: adjust to your inventory
  strategy: mitogen_linear
  tasks:
    - name: Install packages
      apt:
        name: "{{ packages }}"
    - name: Deploy configurations
      template:
        src: "{{ item }}.j2"
        dest: "/etc/{{ item }}"
      loop:
        - app.conf
        - cache.conf
        - worker.conf
- name: Sequential service operations
  hosts: all                 # assumption: adjust to your inventory
  strategy: linear
  tasks:
    - name: Stop all services
      systemd:
        name: "{{ item }}"
        state: stopped
      loop: "{{ services }}"
    - name: Start services in order
      systemd:
        name: "{{ item }}"
        state: started
      loop: "{{ ordered_services }}"
Pattern 2: Task-Level Serialization
# strategy cannot be set per task; the loop and throttle serialize the restarts instead
- name: Restart critical service safely
  systemd:
    name: haproxy
    state: restarted
  delegate_to: "{{ item }}"
  loop: "{{ groups['loadbalancers'] }}"
  throttle: 1
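An alternative sketch for the same goal uses a dedicated play with `serial: 1`, which makes Ansible finish the whole play on one load balancer before starting the next. The port number is an assumption; use whatever your haproxy frontend listens on.

```yaml
- name: Restart load balancers one at a time
  hosts: loadbalancers
  strategy: linear
  serial: 1                    # finish one host before touching the next
  tasks:
    - name: Restart haproxy
      systemd:
        name: haproxy
        state: restarted
    - name: Wait for haproxy to listen again
      wait_for:
        port: 443              # assumption: frontend port
        timeout: 30
```

Because the readiness check is part of the play, a load balancer that fails to come back stops the rollout before the remaining hosts are touched.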
Pattern 3: Role-Level Configuration
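# roles/firewall/tasks/main.yml
- include_tasks: configure.yml
- include_tasks: apply.yml
- include_tasks: verify.yml

# site.yml: a role cannot set the strategy, so the play that applies it does
- name: Firewall configuration
  hosts: all                 # assumption: adjust to your inventory
  strategy: linear
  roles:
    - firewall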
Configuration for Selective Mitogen
ansible.cfg
[defaults]
strategy_plugins = ./meta/tweaks/mitogen-0.3.36/ansible_mitogen/plugins/strategy
strategy = mitogen_linear
forks = 128
gather_facts_parallel = yes
[ssh_connection]
pipelining = True
control_path = /tmp/ansible-%%h-%%p-%%r
Inventory Group Variables
# group_vars/all/performance.yml
# Project-defined names that record intent; Ansible does not interpret them on its own
ansible_strategy_default: mitogen_linear
ansible_strategy_critical: linear
# Selective overrides
firewall_strategy: linear
network_strategy: linear
service_strategy: linear
Performance Impact of Selective Strategies
Measured across 1000+ host deployments:
Full Mitogen (Unsafe)
- Initial deployment: 45 minutes
- Update deployment: 32 minutes
- Failure rate: 12-15%
- Non-deterministic failures: Common
Selective Mitogen (Production)
- Initial deployment: 48 minutes
- Update deployment: 34 minutes
- Failure rate: <1%
- Non-deterministic failures: Rare
Pure Linear (Baseline)
- Initial deployment: 62 minutes
- Update deployment: 51 minutes
- Failure rate: <1%
- Non-deterministic failures: None
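Compared with full Mitogen, selective switching costs roughly 6 to 7% in run time (48 vs. 45 minutes for initial deployments, 34 vs. 32 minutes for updates) while cutting the failure rate from 12-15% to under 1%. In production, that trade is a bargain.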
Debugging Parallel Execution Issues
Enable Detailed Logging
export ANSIBLE_DEBUG=1
export ANSIBLE_VERBOSITY=4
export MITOGEN_PROFILING=1
Trace Task Execution Order
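- name: Debug task ordering
  debug:
    # now() is evaluated when the task runs; ansible_date_time is frozen at fact-gathering time
    msg: "Checkpoint {{ checkpoint }} reached at {{ now(utc=true).isoformat() }}"
  vars:
    checkpoint: after-package-install  # illustrative label; set one per checkpoint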
Force Sequential for Debugging
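# Temporarily disable Mitogen (strategy is not a variable, so -e has no effect;
# plays that set strategy explicitly keep their own value)
ANSIBLE_STRATEGY=linear ansible-playbook site.yml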
Lessons Learned
1. Provisioning Code Contains Hidden State Dependencies
What looks like independent tasks often share implicit ordering requirements that only surface under concurrent execution.
2. Parallelism Is Not Free Performance
The cognitive overhead of reasoning about concurrent execution often outweighs the performance benefits.
3. Retries Hide Correctness Bugs
A task that succeeds on retry is not "fixed" — it's hiding an ordering dependency.
4. Strategy Selection Is an Architectural Decision
Like database consistency levels, execution strategies should be chosen based on correctness requirements, not performance alone.
5. Selective Optimization Beats Global Optimization
It is better to have a predictable 30% improvement than an unpredictable 40% improvement.
Best Practices for Production
- Default to Mitogen for general performance benefits
- Override to Linear for any operation affecting connectivity
- Test parallelism in staging with identical host counts
- Monitor failure patterns — retries indicate ordering bugs
- Document strategy decisions in playbook comments
Conclusion
Mitogen remains one of the best performance optimizations available for Ansible. But like any powerful tool, it requires understanding and restraint.
Global parallelism in provisioning is a false economy — the debugging time lost to non-deterministic failures quickly exceeds the execution time saved. Selective strategy switching gives us the best of both worlds: significant performance improvements where safe, deterministic execution where necessary.
In production infrastructure, correctness at 30% faster beats incorrectness at 40% faster every time.
This approach has been validated across multiple production environments, from 50-node Kubernetes clusters to 2000+ host financial services infrastructure. The patterns described here power daily operations without sacrificing reliability for speed.