pyats-health-check

Name: pyats-health-check
Author: automateyournetwork/netclaw
$npx mdskill add automateyournetwork/netclaw/pyats-health-check
Monitors and analyzes network device health metrics
Solves device health monitoring for CPU, memory, interfaces, and NTP
Uses pyATS tools and device APIs to collect and parse data
Analyzes thresholds and patterns to detect anomalies or issues
Returns structured JSON with health status and actionable insights
SKILL.md
.github/skills/pyats-health-checkView on GitHub ↗
---
name: pyats-health-check
description: "Comprehensive network device health monitoring - CPU, memory, interfaces, hardware, NTP, logging, environment, and uptime analysis. Use when running a device health check, monitoring CPU or memory usage, checking interface errors, or validating NTP sync."
license: Apache-2.0
user-invocable: true
metadata:
  { "openclaw": { "requires": { "bins": ["python3"], "env": ["PYATS_TESTBED_PATH"] } } }

netshell:
  mcp_tools:
    - mcp: pyats-mcp
      tools:
        - pyats_run_show_command
        - pyats_parse_show_command
        - pyats_get_device_info
        - pyats_list_devices
---

# Device Health Check

## When to Use

- Proactive daily/weekly health monitoring
- Pre-change and post-change validation
- Incident response — first thing you run when alerted
- Capacity planning and trending
- Compliance checks for operational readiness

## Health Check Procedure

Always run health checks in this exact order. Each section builds on the previous one.

### Step 1: Device Identity & Uptime

Run `show version` to establish baseline identity.

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show version"}'
```

**Extract and report:**
- Hostname, model, serial number
- IOS-XE version and image filename
- Uptime (flag if < 24 hours — indicates recent reload)
- Last reload reason (flag if unexpected: crash, power failure)
- Total/available memory
- License status

**Thresholds:**
- Uptime < 24h → WARNING: Recent reload
- Uptime < 1h → CRITICAL: Very recent reload, check for crash
- Last reload reason contains "crash" or "error" → CRITICAL

### Step 2: CPU Utilization

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes cpu sorted"}'
```

**Thresholds (5-second / 1-minute / 5-minute averages):**
- < 50% → HEALTHY
- 50-75% → WARNING: Elevated CPU
- 75-90% → HIGH: Investigate top processes
- > 90% → CRITICAL: Immediate investigation required

**Top processes to watch:**
- `IP Input` — high traffic volume or routing loops
- `BGP Router` / `BGP I/O` — large BGP table or instability
- `OSPF-1 Hello` — OSPF adjacency issues
- `Crypto IKMP` / `Crypto Engine` — IPsec overhead
- `SNMP ENGINE` — polling storm
- `ARP Input` — ARP storm or L2 loop

### Step 3: Memory Utilization

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes memory sorted"}'
```

Also run:
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform resources"}'
```

**Thresholds:**
- Used < 70% → HEALTHY
- 70-85% → WARNING: Memory pressure
- 85-95% → HIGH: May impact routing table updates
- > 95% → CRITICAL: Risk of process crashes or OOM

**Memory consumers to watch:**
- `BGP Router` — large BGP table (full internet table = ~1M routes)
- `CEF process` — large FIB
- `OSPF Router` — large OSPF LSDB
- `HTTP CORE` — web server / RESTCONF overhead
- `IOSD iomem` — I/O memory for packet buffers

### Step 4: Interface Status

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip interface brief"}'
```

Then for each active interface, get detailed counters:
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show interfaces"}'
```

**Report for each interface:**
- Admin status (up/down) and protocol status (up/down)
- IP address and subnet
- Speed, duplex, MTU
- Input/output rate (bps and pps)
- Error counters: CRC, input errors, output errors, drops, overruns
- Resets counter (flag if incrementing — indicates flapping)
- Last input/output timestamps

**Flags:**
- Interface up/down → WARNING: Check physical or protocol
- CRC errors > 0 → WARNING: Physical layer issue (cabling, optics, duplex mismatch)
- Input errors incrementing → WARNING: Packet corruption
- Output drops > 0 → WARNING: Congestion or QoS issue
- Resets incrementing → CRITICAL: Interface flapping
- Line protocol down on configured interface → CRITICAL

### Step 5: Hardware & Environment

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show inventory"}'
```

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform"}'
```

**Report:** Module status (ok/fail), serial numbers, PID, transceiver types and DOM readings.

### Step 6: NTP Synchronization

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ntp associations"}'
```

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show clock"}'
```

**Flags:**
- No NTP peer synchronized (no `*` in associations) → CRITICAL for logging/forensics
- Clock offset > 100ms → WARNING
- Clock offset > 1s → CRITICAL
- No NTP configured at all → CRITICAL

### Step 7: System Logs

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"R1"}'
```

**Scan for these patterns:**
- `%SYS-*-RELOAD` — reload events
- `%LINEPROTO-5-UPDOWN` — interface flaps
- `%OSPF-*-ADJCHG` — OSPF adjacency changes
- `%BGP-*-ADJCHANGE` — BGP peer state changes
- `%DUAL-*-NBRCHANGE` — EIGRP neighbor changes
- `%SYS-2-MALLOCFAIL` — memory allocation failure (CRITICAL)
- `%SYS-3-CPUHOG` — process monopolizing CPU (HIGH)
- `%TRACKING-*` — IP SLA or object tracking changes
- `%SEC-*` / `%AUTHMGR-*` — security events
- `%PLATFORM-*-CRASH` — crash events (CRITICAL)
- `Traceback` — software bug (CRITICAL — open TAC case)

### Step 8: Connectivity Validation

Test reachability to critical infrastructure:

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"R1","command":"ping 8.8.8.8 repeat 5"}'
```

**Thresholds:**
- 100% success, RTT < 50ms → HEALTHY
- 100% success, RTT > 100ms → WARNING: High latency
- 80-99% success → WARNING: Packet loss
- < 80% success → CRITICAL: Significant packet loss
- 0% success → CRITICAL: No reachability

## Health Report Format

Always produce a summary table:

```
Device: R1 (devnetsandboxiosxec8k.cisco.com)
Model: C8000V | IOS-XE: 17.x.x | Uptime: XXd XXh

┌──────────────────┬──────────┬─────────────────────────┐
│ Check            │ Status   │ Details                 │
├──────────────────┼──────────┼─────────────────────────┤
│ CPU (5min avg)   │ HEALTHY  │ 12%                     │
│ Memory           │ HEALTHY  │ 45% used (1.2G/2.6G)   │
│ Interfaces       │ WARNING  │ Gi2 down/down           │
│ Hardware         │ HEALTHY  │ All modules OK          │
│ NTP              │ HEALTHY  │ Synced, offset 2ms      │
│ Logs             │ WARNING  │ 3 OSPF adjacency flaps  │
│ Connectivity     │ HEALTHY  │ 100% to 8.8.8.8, 23ms  │
└──────────────────┴──────────┴─────────────────────────┘

Overall: WARNING — 2 items need attention
```

Severity order: CRITICAL > HIGH > WARNING > HEALTHY. Overall status = worst individual status.

## NetBox Cross-Reference (MISSION02 Enhancement)

When NetBox is available ($NETBOX_MCP_SCRIPT is set), cross-reference device state against the source of truth after Steps 1 and 4:

### Interface State Validation

Query NetBox for expected interface states:

```bash
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"dcim.interfaces","filters":{"device":"R1"},"brief":true}'
```

**Compare NetBox intent vs device reality:**
- NetBox shows interface enabled but device shows down → CRITICAL: Unexpected outage
- NetBox shows interface disabled but device shows up → WARNING: Undocumented activation
- Interface exists on device but not in NetBox → WARNING: Undocumented interface
- Interface in NetBox but not on device → WARNING: NetBox stale data

### IP Address Validation

Query NetBox for expected IP assignments:

```bash
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"ipam.ip-addresses","filters":{"device":"R1"}}'
```

**Compare:** Flag any IP_DRIFT where the device IP differs from NetBox.

## Fleet-Wide Health (pCall)

To run health checks across ALL devices simultaneously, first list all devices:

```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_list_devices
```

Then run Steps 1-8 on each device concurrently using multiple exec commands. Collect all results and produce a fleet summary:

```
┌──────────┬──────────┬──────┬────────┬──────────┬─────────────┐
│ Device   │ CPU      │ Mem  │ Intf   │ NTP      │ Overall     │
├──────────┼──────────┼──────┼────────┼──────────┼─────────────┤
│ R1       │ HEALTHY  │ WARN │ HEALTHY│ HEALTHY  │ WARNING     │
│ R2       │ HEALTHY  │ OK   │ CRIT   │ HEALTHY  │ CRITICAL    │
│ SW1      │ HIGH     │ OK   │ HEALTHY│ CRIT     │ CRITICAL    │
└──────────┴──────────┴──────┴────────┴──────────┴─────────────┘
```

Sort devices by severity (CRITICAL first) for triage prioritization.

## GAIT Audit Trail

After completing a health check, record the session in GAIT:

```bash
python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn '{"input":{"role":"assistant","content":"Health check completed on R1: CPU HEALTHY (12%), Memory WARNING (78%), Interfaces HEALTHY, NTP HEALTHY. Overall: WARNING.","artifacts":[]}}'
```