When Faster Ansible Becomes Incorrect

Tags: DevOps, Ansible, Performance, Mitogen, Infrastructure as Code, Best Practices

Mitogen, Concurrency, and Hidden Dependencies in Real-World Provisioning

Mitogen is one of the most impactful performance optimizations available for Ansible. In large infrastructures, it can reduce execution time dramatically and make Ansible feel responsive again.

But at scale, raw speed is not the primary challenge.

Correctness is.

This article explains why fully parallel execution with Mitogen can introduce subtle correctness issues in real provisioning workflows, how those issues manifest, and why selective strategy switching is required in production.

Why Mitogen Is Worth Using

Let me be clear upfront: Mitogen delivers significant performance improvements that justify its adoption. Measured across hundreds of production deployments:

  • Fresh installations: ~18% faster overall
  • Update operations: ~35% faster on average
  • File operations: 40–60% improvement
  • Package management (APT): 20–30% faster
  • Connection overhead: Near-zero after initial setup

For teams managing hundreds or thousands of hosts, these improvements translate to hours saved daily. Mitogen is not the problem — uncontrolled parallelism is.

The Hidden Assumption Behind Parallel Execution

Traditional Ansible executes tasks sequentially within each play. This creates an implicit ordering guarantee that many playbooks unknowingly depend on:

- name: Install package
  apt:
    name: nginx

- name: Configure nginx
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf

- name: Start nginx
  systemd:
    name: nginx
    state: started

With sequential execution, these tasks always run in order. With Mitogen's parallel optimization, timing changes — and hidden dependencies surface.
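One way to make such an ordering assumption visible is to check the precondition right before the step that needs it. The following is a minimal sketch under that assumption, not the playbook used in production: it adds a `nginx -t` step, which exits non-zero when the binary is missing or the configuration does not parse, so a timing problem fails at a known point instead of at service start.

```yaml
- name: Install package
  apt:
    name: nginx
    state: present

- name: Configure nginx
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf

# Explicit precondition: the binary exists and the deployed config parses.
- name: Verify nginx is installed and the configuration is valid
  command: nginx -t
  changed_when: false   # a pure check, never reported as a change

- name: Start nginx
  systemd:
    name: nginx
    state: started
```

If the check fails, the host stops at the verification task, which points directly at the missing dependency.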

The Real Problem: Implicit Dependencies

Fully parallel execution is only safe when tasks are independent and order-agnostic. In real provisioning, many tasks rely on implicit dependencies:

Package Dependencies

# This fails if both run in parallel
- name: Install Python development packages
  apt:
    name: python3-dev

- name: Compile Python extension
  command: python3 setup.py build_ext
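
A hedged sketch of how this dependency could be stated explicitly: gather the installed packages with `package_facts` and assert that `python3-dev` is present before compiling. The source directory `/opt/myext` is a hypothetical placeholder, and on Debian-family hosts `package_facts` may need the python3-apt library on the target.

```yaml
- name: Install Python development packages
  apt:
    name: python3-dev
    state: present

- name: Collect the list of installed packages
  package_facts:
    manager: apt

# Stops the host here with a readable message instead of a compiler error.
- name: Confirm python3-dev is installed before compiling
  assert:
    that:
      - "'python3-dev' in ansible_facts.packages"
    fail_msg: "python3-dev is not installed yet; refusing to compile"

- name: Compile Python extension
  command: python3 setup.py build_ext
  args:
    chdir: /opt/myext   # hypothetical source directory
```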

Configuration Dependencies

# Service might restart before config is fully written
- name: Deploy service configuration
  template:
    src: app.conf.j2
    dest: /etc/app/config.yml
  notify: restart app

- name: Ensure service is running
  systemd:
    name: app
    state: started
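
The ordering between the configuration write and the restart can be pinned down with `meta: flush_handlers`, which runs any notified handlers at that point in the play rather than at the end. A minimal sketch, assuming a hypothetical `app_servers` group and a handler named to match the `notify` above:

```yaml
- name: Deploy and start the application
  hosts: app_servers          # hypothetical inventory group
  handlers:
    - name: restart app
      systemd:
        name: app
        state: restarted
  tasks:
    - name: Deploy service configuration
      template:
        src: app.conf.j2
        dest: /etc/app/config.yml
      notify: restart app

    # Run the pending restart now, after the template task has finished.
    - name: Apply pending handlers
      meta: flush_handlers

    - name: Ensure service is running
      systemd:
        name: app
        state: started
```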

Network Dependencies

# SSH changes can break the active connection
- name: Harden SSH configuration
  template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config

- name: Restart SSH
  systemd:
    name: ssh
    state: restarted
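
A defensive variant of this sequence, sketched under the assumption of a Debian-style host where the service is called `ssh` and the daemon lives at `/usr/sbin/sshd`: the template is validated before it replaces the live file, the restart only happens when the file changed, and the play confirms the host is still reachable afterwards.

```yaml
- name: Harden SSH configuration
  template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    validate: /usr/sbin/sshd -t -f %s   # reject a config that sshd cannot parse
  register: sshd_config

- name: Restart SSH
  systemd:
    name: ssh
    state: restarted
  when: sshd_config is changed

# Fails the host explicitly if the new configuration locked Ansible out.
- name: Confirm the host is still reachable
  wait_for_connection:
    delay: 5
    timeout: 60
```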

Sequential execution hides these assumptions. Parallel execution exposes them — often in production, at the worst possible time.

Failure Modes Observed in Practice

Over two years of running Mitogen in production, I've catalogued these failure patterns:

1. Command Execution Before Dependencies

TASK [Configure database] ******************
fatal: [db-01]: FAILED! => {
  "msg": "psql: command not found"
}

The PostgreSQL client package installation was still running when the configuration task started.

2. Service Restarts Before Configuration

TASK [Restart nginx] **********************
changed: [web-01]

TASK [Deploy nginx config] ****************
changed: [web-01]

Nginx restarted with old configuration because the handler fired before the template task completed.

3. Connection Loss During Network Changes

TASK [Apply firewall rules] ***************
fatal: [bastion-01]: UNREACHABLE! => {
  "msg": "Failed to connect to the host via ssh"
}

Firewall rules were applied while other tasks were still executing, breaking the SSH connection.

4. Non-Deterministic Failures

The most insidious pattern: tasks that fail 20-30% of the time, succeed on retry, and leave no clear error pattern. These are almost always ordering bugs exposed by parallel execution.

Ansible Strategy Semantics

Understanding Ansible's execution strategies is crucial:

linear (default)

  • Executes each task on all hosts before moving to the next task
  • Guarantees ordering within a play
  • Connection overhead on every task

mitogen_linear

  • Maintains persistent connections
  • Overlaps independent operations
  • Preserves task order semantics in theory
  • Changes execution timing in practice

free

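  • Each host runs through the play as fast as it can, without waiting for other hosts
  • Task order is preserved on each host, but there are no ordering guarantees across hosts
  • Rarely safe for provisioning workflows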

Mitogen doesn't change what tasks do — it changes when they do it. And in provisioning, timing matters.
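
For reference, a strategy is selected with the play-level `strategy` keyword. The sketch below is illustrative only (the host pattern and task names are placeholders): under `free`, a fast host may reach step 2 while a slower host is still on step 1, although each individual host still runs step 1 before step 2.

```yaml
- name: Strategy is selected per play
  hosts: all                 # assumption: adjust to your inventory
  strategy: free             # alternatives: linear (default), or mitogen_linear with the Mitogen plugin installed
  gather_facts: false
  tasks:
    - name: Step 1
      debug:
        msg: "step 1 on {{ inventory_hostname }}"

    - name: Step 2
      debug:
        msg: "step 2 on {{ inventory_hostname }}"
```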

The Solution: Selective Strategy Switching

The key insight: not all tasks are safe to run with parallel optimizations. The solution is selective strategy switching based on task characteristics.

Force LINEAR Strategy for Critical Operations

- name: Critical operations that must run sequentially
  hosts: all                # assumption: adjust to your inventory
  strategy: linear          # play-level keyword; Ansible has no per-block or per-task strategy setting
  tasks:
    - name: Update SSH configuration
      template:
        src: sshd_config.j2
        dest: /etc/ssh/sshd_config

    - name: Restart SSH service
      systemd:
        name: ssh
        state: restarted

    - name: Wait for SSH to stabilize
      wait_for_connection:
        delay: 5
        timeout: 30

Task Categories Requiring Sequential Execution

Network and Connectivity

  • SSH daemon configuration
  • Firewall rule changes
  • Network interface configuration
  • VPN service modifications
  • DNS resolver updates

Service Lifecycle

  • Service stop/start/restart operations
  • Systemd daemon reloads
  • Container runtime changes
  • Database primary/replica switchovers

System Critical

  • Kernel parameter changes
  • Boot configuration updates
  • Authentication system changes
  • Time synchronization adjustments

Safe for Mitogen Acceleration

File Operations

  • Template deployment
  • File copies
  • Directory creation
  • Permission changes

Package Management

  • Package installation (without immediate use)
  • Repository configuration
  • Package cache updates

Information Gathering

  • Fact collection
  • Command output capture
  • State verification

Implementation Patterns

Pattern 1: Play-Level Strategy Override

# strategy is a play keyword, so tasks are grouped into separate plays by sensitivity
- name: Safe parallel operations
  hosts: all                  # assumption: adjust to your inventory
  strategy: mitogen_linear
  tasks:
    - name: Install packages
      apt:
        name: "{{ packages }}"

    - name: Deploy configurations
      template:
        src: "{{ item }}.j2"
        dest: "/etc/{{ item }}"
      loop:
        - app.conf
        - cache.conf
        - worker.conf

- name: Sequential service operations
  hosts: all                  # assumption: adjust to your inventory
  strategy: linear
  tasks:
    - name: Stop all services
      systemd:
        name: "{{ item }}"
        state: stopped
      loop: "{{ services }}"

    - name: Start services in order
      systemd:
        name: "{{ item }}"
        state: started
      loop: "{{ ordered_services }}"

Pattern 2: Task-Level Serialization

# A strategy cannot be set on a single task; throttle: 1 limits this task to one host at a time
- name: Restart critical service safely
  systemd:
    name: haproxy
    state: restarted
  delegate_to: "{{ item }}"
  loop: "{{ groups['loadbalancers'] }}"
  throttle: 1

Pattern 3: Role-Level Configuration

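# site.yml: the strategy is set on the play that applies the role
- name: Firewall configuration
  hosts: all                # assumption: adjust to your inventory
  strategy: linear
  roles:
    - firewall

# roles/firewall/tasks/main.yml
- include_tasks: configure.yml
- include_tasks: apply.yml
- include_tasks: verify.yml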

Configuration for Selective Mitogen

ansible.cfg

[defaults]
strategy_plugins = ./meta/tweaks/mitogen-0.3.36/ansible_mitogen/plugins/strategy
strategy = mitogen_linear
forks = 128

[ssh_connection]
pipelining = True
control_path = /tmp/ansible-%%h-%%p-%%r

Inventory Group Variables

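# group_vars/all/performance.yml
# Project conventions: Ansible does not act on these names by itself;
# they record the intended strategy for each area.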
ansible_strategy_default: mitogen_linear
ansible_strategy_critical: linear

# Selective overrides
firewall_strategy: linear
network_strategy: linear
service_strategy: linear

Performance Impact of Selective Strategies

Measured across 1000+ host deployments:

Full Mitogen (Unsafe)

  • Initial deployment: 45 minutes
  • Update deployment: 32 minutes
  • Failure rate: 12-15%
  • Non-deterministic failures: Common

Selective Mitogen (Production)

  • Initial deployment: 48 minutes
  • Update deployment: 34 minutes
  • Failure rate: <1%
  • Non-deterministic failures: Rare

Pure Linear (Baseline)

  • Initial deployment: 62 minutes
  • Update deployment: 51 minutes
  • Failure rate: <1%
  • Non-deterministic failures: None

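Selective Mitogen is about 6-7% slower than full Mitogen (48 vs. 45 minutes for initial deployments, 34 vs. 32 minutes for updates), yet it cuts the failure rate from 12-15% to under 1% and remains roughly 23-33% faster than the pure linear baseline. That trade-off for correctness is a bargain in production environments.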

Debugging Parallel Execution Issues

Enable Detailed Logging

export ANSIBLE_DEBUG=1
export ANSIBLE_VERBOSITY=4
export MITOGEN_PROFILING=1

Trace Task Execution Order

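# now() is evaluated when the task runs; ansible_date_time is frozen at fact-gathering time
- name: Debug task ordering
  debug:
    msg: "Checkpoint {{ checkpoint }} reached at {{ now(utc=true, fmt='%H:%M:%S.%f') }}"
  vars:
    checkpoint: after-package-install  # label set by hand; Ansible provides no ansible_task_name variable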

Force Sequential for Debugging

# Temporarily disable Mitogen (plays that set their own strategy keep it)
ANSIBLE_STRATEGY=linear ansible-playbook site.yml

Lessons Learned

1. Provisioning Code Contains Hidden State Dependencies

What looks like independent tasks often share implicit ordering requirements that only surface under concurrent execution.

2. Parallelism Is Not Free Performance

The cognitive overhead of reasoning about concurrent execution often outweighs the performance benefits.

3. Retries Hide Correctness Bugs

A task that succeeds on retry is not "fixed" — it's hiding an ordering dependency.
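
As an illustration, a blind retry can often be replaced by an explicit wait on the condition the task actually depends on. The sketch below assumes a PostgreSQL server listening on the default port 5432 on the managed host; the `psql` command is a placeholder for the real configuration step.

```yaml
# State the dependency: the database must accept connections first.
- name: Wait until PostgreSQL accepts connections
  wait_for:
    host: 127.0.0.1
    port: 5432
    timeout: 60

- name: Configure database
  command: psql --command "SELECT 1"   # placeholder for the real configuration command
  changed_when: false
```

If the wait times out, the failure names the missing precondition instead of surfacing as an intermittent error in the later task.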

4. Strategy Selection Is an Architectural Decision

Like database consistency levels, execution strategies should be chosen based on correctness requirements, not performance alone.

5. Selective Optimization Beats Global Optimization

A predictable 30% improvement is better than an unpredictable 40% improvement.

Best Practices for Production

  1. Default to Mitogen for general performance benefits
  2. Override to Linear for any operation affecting connectivity
  3. Test parallelism in staging with identical host counts
  4. Monitor failure patterns — retries indicate ordering bugs
  5. Document strategy decisions in playbook comments

Conclusion

Mitogen remains one of the best performance optimizations available for Ansible. But like any powerful tool, it requires understanding and restraint.

Global parallelism in provisioning is a false economy — the debugging time lost to non-deterministic failures quickly exceeds the execution time saved. Selective strategy switching gives us the best of both worlds: significant performance improvements where safe, deterministic execution where necessary.

In production infrastructure, correctness at 30% faster beats incorrectness at 40% faster every time.


This approach has been validated across multiple production environments, from 50-node Kubernetes clusters to 2000+ host financial services infrastructure. The patterns described here power daily operations without sacrificing reliability for speed.

When Faster Ansible Becomes Incorrect - Patrick Paechnatz