ci-failure-retrieval

$ npx mdskill add NVIDIA/skills/ci-failure-retrieval

Retrieves CI failure details from TensorRT-LLM pull requests.

  • Provides failed test summaries and error logs for specific PRs.
  • Integrates with GitHub API and Jenkins testReport services.
  • Extracts build numbers and commit hashes from bot comments.
  • Delivers structured error data or raw stdout/stderr logs.
---
name: ci-failure-retrieval
description: Retrieve and diagnose CI test failures from TensorRT-LLM pull requests using the GitHub API and Jenkins testReport API. Use when the user asks about CI failures on a PR, wants to see failed test details, or needs stdout/stderr from a CI run.
license: Apache-2.0
metadata:
  author: NVIDIA Corporation
---

# CI Failure Retrieval

**Input:** a PR number or a request to check CI failures. **Auth requirement:** corporate network access, which is needed to resolve the Jenkins base URL. **Output:** a summary of failed tests with error details, and optionally full stdout/stderr for specific failures.

## Important: SSL and Authentication

Connect to Jenkins over HTTPS with certificate verification disabled. In Python, pass an `ssl` context that skips verification; in curl, use the `-sk` flags:
```python
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
```
The `curl -s` approach often returns HTML login pages; prefer the Python `urllib` approach with SSL bypass.
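
As a minimal sketch of the Python approach, the helpers below wrap the SSL bypass and reject HTML bodies before JSON parsing. The names `insecure_context`, `parse_jenkins_json`, and `fetch_json` are illustrative and not part of this skill, and treating a leading `<` as a sign of a login page is an assumed heuristic.

```python
import json
import ssl
import urllib.request


def insecure_context() -> ssl.SSLContext:
    """Return an SSL context with hostname and certificate checks disabled."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be cleared before setting CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def parse_jenkins_json(body: bytes):
    """Parse a Jenkins API response body, failing loudly on an HTML page."""
    text = body.decode("utf-8", errors="replace").lstrip()
    if text.startswith("<"):
        # Assumed heuristic: JSON never starts with "<", HTML pages do.
        raise ValueError("Got HTML instead of JSON (possibly a login page)")
    return json.loads(text)


def fetch_json(url: str, timeout: int = 30):
    """GET a Jenkins JSON endpoint using the SSL bypass above."""
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req, context=insecure_context(), timeout=timeout) as resp:
        return parse_jenkins_json(resp.read())
```

Keeping the parsing step separate from the network call means the HTML check can be exercised without access to the corporate network.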

## Phase 0 — Get the Latest CI Run Info

First, determine the latest CI run commit, build number, and high-level pass/fail counts:

```bash
source ~/utils/github/set_github_token.sh

PR_NUM=<pr_number>

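# Get the latest CI bot comment (contains build number and commit).
# With --paginate, the --jq filter runs once per page, so the output may
# contain one result per page; use the last matching one.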
gh api "repos/NVIDIA/TensorRT-LLM/issues/${PR_NUM}/comments" --paginate --jq \
  '[.[] | select(.user.login == "tensorrt-cicd") | select(.body | test("L0_MergeRequest_PR"))] | last | .body'

# Get the PR HEAD commit and its blossom-ci status (high-level pass/fail counts)
HEAD_SHA=$(gh api "repos/NVIDIA/TensorRT-LLM/pulls/${PR_NUM}" --jq '.head.sha')
gh api "repos/NVIDIA/TensorRT-LLM/commits/${HEAD_SHA}/statuses" --jq \
  '[.[] | select(.context == "blossom-ci")] | first | {state, description}'
```

The `description` field shows aggregate counts like `"23969 passed, 1 failed, 8962 skipped"`.
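
If the counts are needed programmatically, a small parser along these lines may help. It assumes the description always uses the `"<n> passed, <n> failed, <n> skipped"` wording shown above; `parse_counts` is a hypothetical helper name.

```python
import re


def parse_counts(description: str) -> dict:
    """Extract pass/fail/skip counts from a blossom-ci status description."""
    counts = {}
    for number, label in re.findall(r"(\d+)\s+(passed|failed|skipped)", description):
        counts[label] = int(number)
    return counts


counts = parse_counts("23969 passed, 1 failed, 8962 skipped")
print(counts)  # {'passed': 23969, 'failed': 1, 'skipped': 8962}
```

A description that does not match the assumed wording yields an empty dict, which is a reasonable signal to fall back to reading the raw string.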

## Phase 1 — Get the Jenkins Build Number

Extract the `L0_MergeRequest_PR` build number from the CI bot comment:
```bash
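# With --paginate, gh applies the --jq filter to each page separately,
# so keep only the last build number found across all pages.
BUILD_NUM=$(gh api "repos/NVIDIA/TensorRT-LLM/issues/${PR_NUM}/comments" --paginate --jq \
  '[.[] | select(.user.login == "tensorrt-cicd") | select(.body | test("L0_MergeRequest_PR"))] | last | .body' \
  | grep -oP 'L0_MergeRequest_PR/\K\d+' | tail -n 1)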
```
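
The same extraction can be sketched in Python for callers that already hold the comment body as a string. The helper name `extract_build_number` and the sample text are hypothetical; real bot comments may be worded differently.

```python
import re
from typing import Optional


def extract_build_number(comment_body: str) -> Optional[int]:
    """Return the last L0_MergeRequest_PR build number in the text, if any."""
    matches = re.findall(r"L0_MergeRequest_PR/(\d+)", comment_body)
    return int(matches[-1]) if matches else None


# Hypothetical comment text, for illustration only.
sample = "Pipeline finished, see /job/L0_MergeRequest_PR/12345/ for details"
print(extract_build_number(sample))  # 12345
print(extract_build_number("null"))  # None
```

Taking the last match mirrors the intent of picking the most recent build when a body mentions the job more than once.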

## Phase 1.5 — Check Pipeline Stage Failures (before diving into test details)

Many CI failures are **infrastructure-level** (Slurm node issues, pipeline aborts, resource exhaustion) where no test code executes at all. Always check the pipeline stages first:

```python
import json, ssl, urllib.request

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

JENKINS_BASE = "https://prod.blsm.nvidia.com/sw-tensorrt-top-1/job/LLM/job/main/job/L0_MergeRequest_PR"
BUILD_NUM = <build_number>

# Get pipeline stage overview
url = f"{JENKINS_BASE}/{BUILD_NUM}/wfapi/describe"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
data = json.loads(resp.read())

print(f"Pipeline status: {data.get('status')}")
for stage in data.get('stages', []):
    status = stage.get('status', '')
    if status not in ('SUCCESS', 'SKIPPED', 'NOT_EXECUTED'):
        name = stage.get('name', '')
        print(f"  [{status}] {name}")
        if 'error' in stage:
            print(f"    Error: {stage['error']}")
```

## Phase 1.6 — Read Console Log Analysis (Most Valuable for Infrastructure Failures)

The Jenkins console log contains a **CI failure analysis summary** with sections like `## Recommended Actions` and `## Infrastructure Notes`. This is the single most valuable source for understanding infrastructure failures:

```python
url = f"{JENKINS_BASE}/{BUILD_NUM}/consoleText"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
text = resp.read().decode('utf-8', errors='replace')

# Extract failure-related lines from the end of the log
for line in text[-8000:].split('\n'):
    lo = line.lower()
    if any(kw in lo for kw in ['fail', 'error', 'abort', 'likely cause',
                                'recommended action', 'infrastructure',
                                'no test code', 'stage result']):
        print(line.strip()[:300])
```

Key sections to look for in the console log:
- **`Failing job`** / **`Failed stage`**: which Jenkins sub-job and stage failed
- **`Likely cause`**: automated root cause analysis (Slurm issues, pipeline timeouts, etc.)
- **`No test code was executed`**: confirms infrastructure-only failure (no code fix needed)
- **`Recommended Actions`**: whether to re-trigger CI or investigate code changes
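
To pull whole sections rather than individual lines, something like the following could work. It is a sketch that assumes each section starts with a `## <title>` heading at the beginning of a line and runs until the next `## ` heading; if the console log prefixes lines with timestamps, the pattern would need adjusting. `extract_sections` is a hypothetical helper.

```python
import re


def extract_sections(console_text: str, titles: list) -> dict:
    """Return the text under each `## <title>` heading, up to the next heading."""
    sections = {}
    for title in titles:
        # Capture everything after the heading line until the next "## " or end of text.
        pattern = rf"^##\s*{re.escape(title)}\s*$(.*?)(?=^##\s|\Z)"
        match = re.search(pattern, console_text,
                          flags=re.MULTILINE | re.DOTALL | re.IGNORECASE)
        if match:
            sections[title] = match.group(1).strip()
    return sections


# Hypothetical log excerpt, for illustration only.
log = (
    "build output\n"
    "## Recommended Actions\n"
    "- Re-trigger CI\n"
    "## Infrastructure Notes\n"
    "Slurm node drained\n"
)
print(extract_sections(log, ["Recommended Actions", "Infrastructure Notes"]))
```

Titles that are not present are simply omitted from the result, so a missing key indicates the analysis summary did not include that section.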

## Phase 2 — Query the Jenkins testReport API for Test Failures

Only proceed here if Phase 1.5/1.6 indicate actual test failures (not infrastructure issues):

```python
url = f"{JENKINS_BASE}/{BUILD_NUM}/testReport/api/json"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
data = json.loads(resp.read())

print(f'Summary: {data["passCount"]} passed, {data["failCount"]} failed, {data["skipCount"]} skipped')

failed = []
for suite in data.get('suites', []):
    for case in suite.get('cases', []):
        if case.get('status') in ('FAILED', 'REGRESSION'):
            failed.append(case)

if not failed:
    print('No test failures in testReport!')
else:
    print(f'Failed tests ({len(failed)}):')
    for f in failed:
        print(f'  - {f["className"]}.{f["name"]}')
        err = (f.get('errorDetails') or '')[:200]
        if err:
            print(f'    Error: {err}')
```

## Phase 3 — Get Full stdout/stderr for a Specific Test Failure

The `errorStackTrace` can be incomplete when errors originate from subprocesses. Read the `stdout` and `stderr` fields of the specific test case, reusing the `data` response from Phase 2, to find the real error:
```python
for suite in data.get('suites', []):
    for case in suite.get('cases', []):
        if case.get('status') in ('FAILED', 'REGRESSION'):
            name = f'{case["className"]}.{case["name"]}'
            if '<search_term>' in name:
                print(f'=== {name} ===')
                print('--- Error ---')
                print(case.get('errorDetails', ''))
                print('--- Stack Trace ---')
                print(case.get('errorStackTrace', ''))
                print('--- Stdout (last 3000 chars) ---')
                print((case.get('stdout') or '')[-3000:])
                print('--- Stderr (last 3000 chars) ---')
                print((case.get('stderr') or '')[-3000:])
                break  # leaves the current suite only; remaining suites are still scanned
```

## Available Fields per Failed Test Case (Jenkins testReport API)

- `className`, `name`: test identifier
- `status`: `FAILED` or `REGRESSION`
- `errorDetails`: error message
- `errorStackTrace`: full stack trace (may be incomplete for subprocess errors)
- `stdout`, `stderr`: full test output (can be large, check these when stack trace is insufficient)
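
One way to combine these fields is sketched below: include the output tails only when the stack trace is empty or looks like a generic wrapper failure. `summarize_case` and `GENERIC_MARKERS` are hypothetical names, and the marker list contains only the example wording from the Anti-Patterns section.

```python
# Wrapper-style messages that hide the real error (example from Anti-Patterns).
GENERIC_MARKERS = ("Process exited with status",)


def summarize_case(case: dict, tail: int = 3000) -> dict:
    """Condense a testReport case, adding output tails when the trace is unhelpful."""
    trace = case.get("errorStackTrace") or ""
    needs_output = not trace or any(marker in trace for marker in GENERIC_MARKERS)
    summary = {
        "test": f'{case.get("className", "")}.{case.get("name", "")}',
        "status": case.get("status"),
        "error": case.get("errorDetails") or "",
        "needs_output": needs_output,
    }
    if needs_output:
        # stdout/stderr can be large, so keep only the last `tail` characters.
        summary["stdout_tail"] = (case.get("stdout") or "")[-tail:]
        summary["stderr_tail"] = (case.get("stderr") or "")[-tail:]
    return summary
```

The 3000-character default matches the tail length used in Phase 3.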

## Common Failure Patterns

| Pattern | Diagnosis | Action |
|---------|-----------|--------|
| `No test code was executed` + Slurm errors | Infrastructure: Slurm node resource exhaustion | Re-trigger CI |
| `ABORTED` stage + `Downstream job did not succeed` | Cascading failure from fail-fast policy | Fix root cause stage, re-trigger |
| `newosproc` / `errno=11` / `fork/exec` | Kernel process table exhaustion on login node | Wait and re-trigger |
| `testReport: 0 failed` but `blossom-ci: N failed` | Stage-level failures, not test failures | Check Phase 1.5/1.6 |
| `testReport: N failed` with real test names | Actual test code failures | Investigate test errors in Phase 3 |
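
The table can be approximated as a simple decision function. This is a sketch, not the skill's own logic: `triage` is a hypothetical helper, the order of the checks is an assumption, and the first row is simplified to the `No test code was executed` marker alone.

```python
def triage(testreport_failed: int, blossom_failed: int, console_text: str = "") -> str:
    """Map the failure patterns in the table above to a suggested action."""
    lo = console_text.lower()
    if any(marker in lo for marker in ("newosproc", "errno=11", "fork/exec")):
        return "Process table exhaustion on login node: wait and re-trigger"
    if "no test code was executed" in lo:
        return "Infrastructure failure: re-trigger CI"
    if "downstream job did not succeed" in lo:
        return "Cascading failure: fix root cause stage, re-trigger"
    if testreport_failed == 0 and blossom_failed > 0:
        return "Stage-level failure: check Phase 1.5/1.6"
    if testreport_failed > 0:
        return "Test failures: investigate test errors in Phase 3"
    return "No known pattern matched"


print(triage(0, 3))  # Stage-level failure: check Phase 1.5/1.6
print(triage(2, 2))  # Test failures: investigate test errors in Phase 3
```

Console markers are checked before the counts because an infrastructure failure can coexist with non-zero failure counts.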

## Anti-Patterns

- Do not guess Jenkins URLs; always use the known base `https://prod.blsm.nvidia.com/sw-tensorrt-top-1/job/LLM/job/main/job/L0_MergeRequest_PR`.
- Do not rely on `curl -s` for the Jenkins API; it often returns HTML login pages. Prefer Python `urllib` with SSL bypass.
- Do not jump to testReport (Phase 2) before checking pipeline stages (Phase 1.5) — many failures are infrastructure-only with zero test failures.
- Do not stop at `errorStackTrace` if it mentions generic wrapper failures like `Process exited with status 1`; check `stdout` and `stderr` for the real error.
- Do not print every test case when looking for a specific failure; narrow the output with the `<search_term>` filter in Phase 3.