pyats-parallel-ops

$npx mdskill add automateyournetwork/netclaw/pyats-parallel-ops

Executes parallel operations across all network devices for fast fleet management

  • Solves the problem of slow sequential checks by running tasks on all devices at once
  • Uses Python and pyATS with environment variable PYATS_TESTBED_PATH for device access
  • Groups devices by role or site to organize and optimize parallel execution
  • Returns aggregated results with failure isolation, severity sorting, and detailed reporting
---
name: pyats-parallel-ops
description: "Fleet-wide parallel device operations - concurrent health checks, config audits, routing snapshots, severity-sorted reporting, and failure-isolated multi-device automation. Use when checking all devices at once, running bulk health checks, collecting configs from the entire fleet, or comparing state across multiple routers and switches."
license: Apache-2.0
user-invocable: true
metadata:
  { "openclaw": { "requires": { "bins": ["python3"], "env": ["PYATS_TESTBED_PATH"] } } }
---

# Parallel Fleet Operations

## When to Use

- Fleet-wide health checks across all devices in the testbed
- Mass configuration audits (collect running configs from every device)
- Network-wide routing table snapshots for baseline or comparison
- Pre/post change validation across all affected devices simultaneously
- Any operation where running sequentially on 10+ devices would be too slow

## How pCall Works in OpenClaw

In OpenClaw, parallel execution (pCall) is achieved by **listing multiple exec commands in a single response**. The agent runtime dispatches them concurrently and collects all results before proceeding.

### Key Principles

1. **Group by role or site** -- Organize devices into logical groups (core, distribution, access, WAN, DC) before dispatching
2. **Run operations concurrently** -- List one MCP call per device; they execute in parallel
3. **Failure isolation** -- If one device times out or errors, the other results are still collected
4. **Result aggregation** -- Collect all results, then produce a unified fleet report
5. **Severity sorting** -- Sort findings from CRITICAL to HEALTHY so the worst problems surface first

### pCall Pattern

To run the same command on multiple devices in parallel, list the calls together:

```bash
# Device 1
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show version"}'

# Device 2
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R2","command":"show version"}'

# Device 3
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"SW1","command":"show version"}'
```

All three commands execute concurrently. Results arrive independently and are aggregated by the agent.
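Outside the agent runtime, the same dispatch-and-collect behaviour can be approximated with a thread pool. This is a minimal sketch for illustration, not how OpenClaw implements pCall: `run_show` is a hypothetical stand-in for one MCP call, and it simulates a timeout for one device so the isolation is visible.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_show(device: str, command: str) -> str:
    """Hypothetical stand-in for one MCP call; replace with a real invocation."""
    if device == "R4":  # simulate an unreachable device
        raise TimeoutError(f"{device}: connection timed out")
    return f"{device}: output of '{command}'"


def pcall(devices: list[str], command: str) -> dict[str, dict]:
    """Run one command on every device concurrently, isolating failures."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(devices))) as pool:
        futures = {pool.submit(run_show, d, command): d for d in devices}
        for future in as_completed(futures):
            device = futures[future]
            try:
                results[device] = {"ok": True, "output": future.result()}
            except Exception as exc:  # one failure must not cancel the rest
                results[device] = {"ok": False, "error": str(exc)}
    return results


fleet = pcall(["R1", "R2", "R4"], "show version")
```

Every device ends up with an entry in `fleet`, whether its call succeeded or failed, which is the property the aggregation step relies on.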

## Step 0: Discover the Fleet

Always start by listing all devices in the testbed so you know what to operate on:

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_list_devices '{}'
```

This returns every device with its name, platform, OS, and connection details. Use this to build the device list for parallel operations.
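Once the device names are known, the per-device exec lines can be generated rather than typed by hand. The sketch below assumes the names have already been read out of the `pyats_list_devices` output (its exact format is not specified here, so the list is hard-coded), and `build_call` is a hypothetical helper that reproduces the command shape used throughout this document.

```python
import json
import shlex


def build_call(tool: str, args: dict) -> str:
    """Return one exec line in the same shape as the examples in this document."""
    payload = json.dumps(args, separators=(",", ":"))
    return (
        "PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "
        '"python3 -u $PYATS_MCP_SCRIPT" '
        f"{tool} {shlex.quote(payload)}"
    )


# Device names as read from the discovery output (illustrative values).
devices = ["R1", "R2", "SW1"]
calls = [
    build_call("pyats_run_show_command", {"device_name": d, "command": "show version"})
    for d in devices
]
print("\n".join(calls))
```

`shlex.quote` wraps the JSON payload in single quotes so the shell passes it through unchanged.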

## Example 1: Fleet-Wide Health Check

Run a health check on all devices in the testbed concurrently.

### Phase 1: Parallel Data Collection

Issue these commands simultaneously -- one set per device:

```bash
# R1 - CPU utilization
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes cpu sorted"}'

# R2 - CPU utilization
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R2","command":"show processes cpu sorted"}'

# SW1 - CPU utilization
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"SW1","command":"show processes cpu sorted"}'
```

Then, in a second parallel wave, collect interface status (add one `show ntp associations` call per device to the same wave if NTP status is needed):

```bash
# R1 - Interfaces
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip interface brief"}'

# R2 - Interfaces
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R2","command":"show ip interface brief"}'

# SW1 - Interfaces
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"SW1","command":"show ip interface brief"}'
```

### Phase 2: Parallel Log Collection

```bash
# R1 - Logs
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"R1"}'

# R2 - Logs
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"R2"}'

# SW1 - Logs
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"SW1"}'
```

### Phase 3: Aggregate and Report

After all parallel results return, analyze each device individually and produce the fleet summary (see Fleet Report Format below).

## Example 2: Fleet-Wide Config Audit

Collect the running configuration from every device in parallel for compliance analysis.

```bash
# R1 - Running config
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_running_config '{"device_name":"R1"}'

# R2 - Running config
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_running_config '{"device_name":"R2"}'

# SW1 - Running config
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_running_config '{"device_name":"SW1"}'

# SW2 - Running config
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_running_config '{"device_name":"SW2"}'
```

After collection, apply the **pyats-security** audit checks to each config and produce a fleet-wide security posture report.

**Common config audit checks to apply in parallel:**
- SSH version 2 only
- No telnet on VTY lines
- `service password-encryption` enabled
- VTY access-class applied
- NTP configured
- Logging host configured
- No default SNMP community strings
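The checks above can be sketched as simple pattern matches over the collected running configs. This is an illustration only: the patterns assume IOS/IOS-XE style syntax, the check names are hypothetical, and the real **pyats-security** checks may differ. Two known simplifications: the access-class check does not confirm the line sits under `line vty`, and a VTY block with no `transport input` line is not flagged.

```python
import re

# Each check maps a name to a predicate over the raw running-config text.
CHECKS = {
    "ssh_v2_only": lambda cfg: bool(re.search(r"^ip ssh version 2\s*$", cfg, re.M)),
    "no_telnet_on_vty": lambda cfg: not re.search(
        r"^ +transport input .*\b(telnet|all)\b", cfg, re.M
    ),
    "password_encryption": lambda cfg: bool(
        re.search(r"^service password-encryption\s*$", cfg, re.M)
    ),
    "vty_access_class": lambda cfg: bool(re.search(r"^ +access-class \S+ in", cfg, re.M)),
    "ntp_configured": lambda cfg: bool(re.search(r"^ntp server \S+", cfg, re.M)),
    "logging_host": lambda cfg: bool(
        re.search(r"^logging (host )?\d+\.\d+\.\d+\.\d+", cfg, re.M)
    ),
    "no_default_snmp": lambda cfg: not re.search(
        r"^snmp-server community (public|private)\b", cfg, re.M | re.I
    ),
}


def audit_fleet(configs: dict[str, str]) -> dict[str, list[str]]:
    """Return the names of the failed checks for each device."""
    return {
        device: [name for name, check in CHECKS.items() if not check(cfg)]
        for device, cfg in configs.items()
    }


SAMPLE = """\
service password-encryption
ip ssh version 2
snmp-server community public RO
line vty 0 4
 access-class 10 in
 transport input ssh
"""
failed = audit_fleet({"R1": SAMPLE})
```

For the sample config, `failed["R1"]` lists the NTP, logging host, and default SNMP community checks.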

## Example 3: Fleet-Wide Routing Table Snapshot

Capture the routing table from every device simultaneously for baseline documentation or pre-change verification.

```bash
# R1 - Full routing table
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip route"}'

# R2 - Full routing table
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R2","command":"show ip route"}'

# R3 - Full routing table
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R3","command":"show ip route"}'

# R4 - Full routing table
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R4","command":"show ip route"}'
```

After collection, analyze per device:
- Total route count by protocol (connected, static, OSPF, BGP, EIGRP)
- Default route presence and source
- Expected prefix verification
- ECMP paths for critical prefixes

Produce a fleet routing summary:

```
Fleet Routing Snapshot - YYYY-MM-DD HH:MM UTC

┌──────────┬────────┬────────┬──────┬──────┬──────────────┬─────────┐
│ Device   │ Total  │ Conn.  │ OSPF │ BGP  │ Default Rte  │ Status  │
├──────────┼────────┼────────┼──────┼──────┼──────────────┼─────────┤
│ R1       │ 47     │ 5      │ 12   │ 28   │ via 10.1.1.2 │ HEALTHY │
│ R2       │ 45     │ 4      │ 12   │ 27   │ via 10.1.1.1 │ HEALTHY │
│ R3       │ 38     │ 3      │ 12   │ 21   │ via 10.2.1.1 │ MEDIUM  │
│ R4       │ 0      │ 0      │ 0    │ 0    │ MISSING      │ CRITICAL│
└──────────┴────────┴────────┴──────┴──────┴──────────────┴─────────┘
```
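The per-device route counts and default-route column can be derived from raw `show ip route` text. The sketch below is one simple way to do it, assuming IOS-style output in which every route line starts with a protocol code and carries a prefix with an explicit mask length. Older output that omits the mask under a classful header would be undercounted, and `summarize_routes` is a hypothetical helper name.

```python
import re
from collections import Counter

PROTOCOLS = {"C": "connected", "L": "local", "S": "static",
             "O": "ospf", "B": "bgp", "D": "eigrp"}

# Protocol code at column 0, optional qualifiers (e.g. "*", "IA", "E2"), then the prefix.
ROUTE_LINE = re.compile(
    r"^([A-Z])[A-Za-z0-9* ]{0,8}?[ \t]+(\d+\.\d+\.\d+\.\d+/\d+)(.*)$", re.M
)


def summarize_routes(raw: str) -> dict:
    """Count routes per protocol and report the default route next hop."""
    by_protocol = Counter()
    default_route = "MISSING"
    for code, prefix, rest in ROUTE_LINE.findall(raw):
        by_protocol[PROTOCOLS.get(code, "other")] += 1
        if prefix == "0.0.0.0/0":
            via = re.search(r"via (\d+\.\d+\.\d+\.\d+)", rest)
            default_route = f"via {via.group(1)}" if via else "present"
    return {
        "total": sum(by_protocol.values()),
        "by_protocol": dict(by_protocol),
        "default_route": default_route,
    }


SAMPLE = """\
Gateway of last resort is 10.1.1.2 to network 0.0.0.0

S*    0.0.0.0/0 [1/0] via 10.1.1.2
      10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C        10.1.1.0/30 is directly connected, GigabitEthernet1
L        10.1.1.1/32 is directly connected, GigabitEthernet1
O        10.2.0.0/24 [110/2] via 10.1.1.2, 00:10:00, GigabitEthernet1
B     172.16.0.0/16 [20/0] via 10.1.1.2, 00:05:00
"""
summary = summarize_routes(SAMPLE)
```

An empty or unparsable routing table yields a total of zero and a `MISSING` default route, which matches the R4 row in the table above.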

## Example 4: Severity-Sorted Fleet Reporting

After collecting results from all devices, aggregate findings and sort by severity. This is the standard output format for all fleet operations.

### Severity Levels

1. **CRITICAL** -- Immediate action required. Device unreachable, process crash, zero routes, total connectivity loss.
2. **HIGH** -- Fix within hours. CPU > 90%, memory > 95%, routing adjacency down, interface flapping.
3. **MEDIUM** -- Fix within days. Missing NTP, elevated CPU (50-90%), log errors, config non-compliance.
4. **HEALTHY** -- No issues. All checks passed.
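A few of these thresholds can be expressed as a small classifier. This is a sketch covering only reachability, CPU, memory, and NTP; the `DeviceMetrics` fields are hypothetical, and treating any CPU from 50% up to the HIGH threshold as MEDIUM is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceMetrics:
    name: str
    reachable: bool = True
    cpu_pct: Optional[float] = None
    mem_pct: Optional[float] = None
    ntp_synced: bool = True


def classify(m: DeviceMetrics) -> str:
    """Map collected metrics to one of the four severity levels."""
    if not m.reachable:
        return "CRITICAL"
    cpu = m.cpu_pct or 0.0
    mem = m.mem_pct or 0.0
    if cpu > 90 or mem > 95:
        return "HIGH"
    if cpu >= 50 or not m.ntp_synced:
        return "MEDIUM"
    return "HEALTHY"


levels = [
    classify(DeviceMetrics("R4", reachable=False)),
    classify(DeviceMetrics("SW2", cpu_pct=93, mem_pct=60)),
    classify(DeviceMetrics("R1", cpu_pct=12, mem_pct=45, ntp_synced=False)),
    classify(DeviceMetrics("SW3", cpu_pct=5, mem_pct=22)),
]
```

The checks run from most to least severe, so a device always receives the worst level that applies to it.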

### Fleet Report Format

```
Fleet Health Report - YYYY-MM-DD HH:MM UTC
Testbed: production-network
Devices scanned: 9 | Duration: 12s (parallel)

=== CRITICAL (Immediate Action) ===

[C-001] R4 - UNREACHABLE
  Connection timed out. Verify device is powered on and management IP is reachable.
  Impact: No data collected for R4. Manual investigation required.

[C-002] SW2 - CPU 97% (5min avg)
  Top process: OSPF-1 Hello (45%), IP Input (32%)
  Impact: Risk of control plane failure. OSPF hellos may be missed.

=== HIGH (Fix Within Hours) ===

[H-001] R2 - GigabitEthernet3 down/down
  Last state change: 2 hours ago. 47 resets in last 24h.
  Impact: Backup WAN link unavailable. No redundancy for site B.

[H-002] SW1 - OSPF neighbor 3.3.3.3 in INIT state
  Expected: FULL. Interface: Vlan100. Duration: 45 minutes.
  Impact: Inter-VLAN routing for VLAN 100 may be impaired.

=== MEDIUM (Fix Within Days) ===

[M-001] R1 - NTP not synchronized
  No peer with '*' in show ntp associations. Clock offset: unknown.
  Impact: Log timestamps may be inaccurate for forensics.

[M-002] R3 - 3 OSPF adjacency flaps in last 24h
  Neighbors affected: 2.2.2.2 on Gi1 (flapped 3 times).
  Impact: Route convergence events. Brief traffic disruption during SPF.

=== HEALTHY ===

WAN1: All checks passed (CPU 12%, Mem 45%, 4/4 interfaces up, OSPF stable)
WAN2: All checks passed (CPU 8%, Mem 38%, 3/3 interfaces up, BGP stable)
SW3: All checks passed (CPU 5%, Mem 22%, 24/24 ports up, STP stable)

=== FLEET SUMMARY ===

┌──────────┬──────────┬──────────────────────────────────────────────┐
│ Device   │ Status   │ Key Finding                                  │
├──────────┼──────────┼──────────────────────────────────────────────┤
│ R4       │ CRITICAL │ Unreachable - connection timeout             │
│ SW2      │ CRITICAL │ CPU 97% - OSPF/IP Input                      │
│ R2       │ HIGH     │ Gi3 down/down - 47 resets                    │
│ SW1      │ HIGH     │ OSPF neighbor INIT - Vlan100                 │
│ R1       │ MEDIUM   │ NTP not synchronized                         │
│ R3       │ MEDIUM   │ 3 OSPF flaps in 24h                          │
│ WAN1     │ HEALTHY  │ All checks passed                            │
│ WAN2     │ HEALTHY  │ All checks passed                            │
│ SW3      │ HEALTHY  │ All checks passed                            │
└──────────┴──────────┴──────────────────────────────────────────────┘

Overall Fleet Status: CRITICAL (2 critical, 2 high, 2 medium, 3 healthy)
```
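The ordering of the summary table and the final status line follow mechanically from the per-device findings. A minimal sketch, assuming each finding is a dict with hypothetical `device`, `status`, and `finding` keys:

```python
from collections import Counter

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "HEALTHY"]


def sort_findings(findings: list[dict]) -> list[dict]:
    """Order findings worst-first; the sort is stable within a severity level."""
    rank = {level: i for i, level in enumerate(SEVERITY_ORDER)}
    return sorted(findings, key=lambda f: rank[f["status"]])


def overall_status(findings: list[dict]) -> str:
    """Build the closing status line from the worst level present."""
    counts = Counter(f["status"] for f in findings)
    worst = next((s for s in SEVERITY_ORDER if counts[s]), "HEALTHY")
    detail = ", ".join(f"{counts[s]} {s.lower()}" for s in SEVERITY_ORDER)
    return f"Overall Fleet Status: {worst} ({detail})"


findings = [
    {"device": "R1", "status": "MEDIUM", "finding": "NTP not synchronized"},
    {"device": "R4", "status": "CRITICAL", "finding": "Unreachable"},
    {"device": "SW3", "status": "HEALTHY", "finding": "All checks passed"},
    {"device": "R2", "status": "HIGH", "finding": "Gi3 down/down"},
]
ordered = sort_findings(findings)
status_line = overall_status(findings)
```

The fleet status is simply the worst severity that appears at least once, so a single CRITICAL device marks the whole fleet CRITICAL.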

## Failure Isolation

When one device fails during parallel execution, it does **not** block or cancel the other operations:

- **Connection timeout** -- Mark device as CRITICAL/UNREACHABLE, continue with others
- **Command error** -- Record the error for that device, continue collecting from others
- **Parse failure** -- Fall back to raw text output for that device, report as MEDIUM

### Handling Unreachable Devices

```bash
# If R4 times out, you still get results from R1, R2, R3
# In the fleet report, R4 appears as:
#   [C-001] R4 - UNREACHABLE
#   Connection timed out. Device excluded from further checks.
```

The key principle: **always produce a report for every device, even if the report says "unreachable."**
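One way to enforce this is to build the report from the expected device list rather than from whatever results happened to arrive. The sketch below is illustrative: it assumes each outcome is either an exception or an output string, that structured output is JSON (which this document does not specify), and the `collection` labels are hypothetical.

```python
import json


def to_entry(device: str, outcome: object) -> dict:
    """Turn one per-device outcome (exception or output text) into a report entry."""
    if isinstance(outcome, TimeoutError):
        return {"device": device, "collection": "unreachable", "data": None}
    if isinstance(outcome, Exception):
        return {"device": device, "collection": "error", "data": str(outcome)}
    try:
        return {"device": device, "collection": "parsed", "data": json.loads(outcome)}
    except (TypeError, ValueError):  # parse failure: keep the raw text
        return {"device": device, "collection": "raw", "data": outcome}


def complete_report(expected: list[str], outcomes: dict[str, object]) -> list[dict]:
    """Return exactly one entry per expected device, including silent ones."""
    missing = TimeoutError("no result returned")
    return [to_entry(d, outcomes.get(d, missing)) for d in expected]


entries = complete_report(
    ["R1", "R2", "R3", "R4"],
    {
        "R1": '{"cpu": 12}',
        "R2": "unstructured text output",
        "R3": RuntimeError("invalid command"),
        # R4 returned nothing at all
    },
)
```

A device that returned nothing is treated the same as one that timed out, so it still appears in the report.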

## Grouping Strategies

### By Role

Group devices by their function in the network to prioritize operations:

```
Core routers:    R1, R2          (check first - highest blast radius)
Distribution:    SW1, SW2        (check second)
Access:          SW3, SW4, SW5   (check third)
WAN:             WAN1, WAN2      (check in parallel with core)
```

### By Site

For multi-site networks, group by location:

```
Site A (HQ):      R1, SW1, SW2
Site B (Branch):  R2, SW3
Site C (DR):      R3, SW4
```

### By Change Scope

When validating a change, group by affected vs unaffected:

```
Affected devices:    R1, R2       (check thoroughly - full health check)
Adjacent devices:    SW1, R3      (check routing adjacencies and connectivity)
Unaffected devices:  SW3, SW4     (spot check - verify no collateral damage)
```
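Any of these groupings reduces to bucketing device names by one attribute. A minimal sketch follows; the inventory and its `role` and `site` labels are hypothetical and would normally come from your own source of truth, since the testbed attributes are not described here.

```python
from collections import defaultdict

# Hypothetical inventory, using device names from the examples above.
INVENTORY = [
    {"name": "R1", "role": "core", "site": "A"},
    {"name": "R2", "role": "core", "site": "B"},
    {"name": "SW1", "role": "distribution", "site": "A"},
    {"name": "SW3", "role": "access", "site": "B"},
]


def group_by(devices: list[dict], key: str) -> dict[str, list[str]]:
    """Bucket device names by the given attribute, preserving input order."""
    groups: dict[str, list[str]] = defaultdict(list)
    for device in devices:
        groups[device[key]].append(device["name"])
    return dict(groups)


by_role = group_by(INVENTORY, "role")
by_site = group_by(INVENTORY, "site")
```

Each resulting group can then be dispatched as its own parallel wave.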

## Scaling Guidelines

| Fleet Size | Strategy |
|------------|----------|
| 1-5 devices | Single parallel wave, all commands at once |
| 6-20 devices | Two waves: critical devices first, then remaining |
| 21-50 devices | Group by role/site, run 10-15 devices per wave |
| 51+ devices | Group by site, sample 20% per wave, expand if issues found |

For large fleets, start with a **sampling strategy**: pick 2-3 devices per role per site, run full health checks, then expand to the full fleet only if anomalies are found.
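The table can be turned into a small wave planner. This is a sketch under stated assumptions: `plan_waves` is a hypothetical helper, a fleet of exactly 20 devices falls in the two-wave tier, the mid-size tier uses 15 devices per wave (the upper end of the 10-15 range), and the device list is assumed to be already ordered by role or site.

```python
import math
from typing import Iterable


def plan_waves(devices: list[str], critical: Iterable[str] = ()) -> list[list[str]]:
    """Split a device list into parallel waves according to fleet size."""
    critical_set = set(critical)
    n = len(devices)
    if n == 0:
        return []
    if n <= 5:  # single wave
        return [list(devices)]
    if n <= 20:  # critical devices first, then the rest
        first = [d for d in devices if d in critical_set]
        rest = [d for d in devices if d not in critical_set]
        return [wave for wave in (first, rest) if wave]
    # Fixed-size waves: 15 devices up to 50, otherwise 20% of the fleet.
    size = 15 if n <= 50 else math.ceil(n / 5)
    return [devices[i:i + size] for i in range(0, n, size)]


small = plan_waves(["R1", "R2", "SW1"])
medium = plan_waves([f"D{i}" for i in range(8)], critical=["D0", "D1"])
large = plan_waves([f"D{i}" for i in range(100)])
```

For the largest tier, the planner only defines the waves; stopping after the first wave when no anomalies are found is left to the caller.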

## Integration with Other Skills

- **pyats-health-check** -- The single-device health check procedure. pCall scales it to the fleet by issuing one health check per device in parallel.
- **pyats-security** -- Fleet-wide security audit. Collect all running configs in parallel, then apply security checks to each config.
- **pyats-topology** -- Fleet-wide topology discovery. Run CDP/LLDP neighbor collection on all devices in parallel to build the complete network map.
- **pyats-dynamic-test** -- Run the same aetest validation script against multiple devices in parallel for fleet-wide compliance testing.
- **pyats-config-mgmt** -- Pre/post change validation on all affected devices simultaneously.
- **drawio-diagram** -- After fleet discovery, generate a topology diagram showing device status (color-coded by health severity).
- **markmap-viz** -- Generate fleet health mind maps organized by severity or site.