sast-gap-analyzer

$npx mdskill add H-mmer/pentest-agents/sast-gap-analyzer

CONTEXT: Authorized security research. Analyzing validation completeness in source code data flows.

SKILL.md
.github/skills/sast-gap-analyzerView on GitHub ↗
---
name: sast-gap-analyzer
description: "Analyzes validation gaps in data flows. Takes traced flows and identifies where checks are missing, insufficient, or bypassable. The 'interaction reasoning' step — finds bugs that exist in the gaps between individually correct-looking code. MUST run on Opus. Use via /sast command."
---
CONTEXT: Authorized security research. Analyzing validation completeness in source code data flows.

## Why This Agent Exists

This is where the actual vulnerability identification happens. Previous agents mapped entry points, dangerous operations, and the flows between them. Your job is to find the GAPS — places where the validation chain is insufficient for the dangerous operation it's supposed to protect.

Most real vulnerabilities are not missing checks. They're checks that are ALMOST right:
- Bounds check uses `<=` when it should use `<` (off-by-one)
- Size checked as `uint32` but used as `int16` (truncation)
- Length validated but signedness not considered (signed overflow)
- Check and use are separated by a window where another thread can intervene (TOCTOU)
- Two checks are individually correct but contradictory conditions can be satisfied simultaneously (the SACK bug pattern)
- Sentinel value chosen without considering wraparound (FFmpeg slice_count collision)

## Inputs

- `flows.json` — traced data flows with validation chains and hot/warm/cold ratings
- `static-warnings.json` (if available) — warnings from CodeQL/Semgrep/Cppcheck for this file
- Access to full source code for verification

## Gap Analysis Categories

### 1. Missing check
The data reaches the dangerous operation with NO validation at all.
- Example: `memcpy(buf, pkt->data, pkt->len)` where `pkt->len` is never checked against `sizeof(buf)`

### 2. Insufficient bounds
A check exists but doesn't cover the actual constraint.
- Check says `len < MAX_AUTH_BYTES (400)` but buffer is only 128 bytes → 272-byte overflow
- Check says `index < array_size` but doesn't check `index >= 0` (signed index)

### 3. Type mismatch
The check operates on a different type than the dangerous operation uses.
- Checked as `uint32_t` but stored in `uint16_t` → truncation bypasses the check
- Compared as `int` but the subtraction overflows → comparison result is wrong
- `size_t` on 64-bit but downstream uses `int` for the same value

### 4. Semantic gap
The check validates the WRONG property.
- Checks that a pointer is non-NULL but doesn't check it points to valid memory
- Checks string length but not string content (injection)
- Checks file extension but not file content (upload bypass)

### 5. Ordering / TOCTOU
The check is correct at the moment it runs but the protected condition can change before use.
- File permissions checked then file opened (race with symlink swap)
- Lock released between check and use
- Signal handler modifying shared state between check and use

### 6. Interaction gap
Individually correct checks that interact to create an impossible-but-satisfiable condition.
- SACK pattern: `sack_start` not checked against window → signed overflow makes `SEQ_LEQ(start, hole) && SEQ_GT(start, max_ack)` both true simultaneously
- FFmpeg pattern: `memset(-1)` creates sentinel 65535, counter is `int32` → counter reaches sentinel value and check gives wrong answer

**This is the hardest category. Interaction gaps are why this agent runs on Opus.**

### 7. Error path gap
The check exists on the normal path but an error/cleanup path skips it.
- Exception handler that frees memory then continues
- Error return that doesn't clean up state
- Fallthrough in switch/case that bypasses validation

### 8. PHP-specific gap patterns

When analyzing PHP flows, these patterns produce most real bugs:

- **Loose comparison on secrets**: `if ($_GET['token'] == $real_token)`. If `$real_token` starts with `0e` followed by digits (e.g., `0e462097431906509019562988736854`), any other `"0e[digits]"` input compares equal (both parsed as `0 * 10^N`). `hash('md5', ...)` and `sha1(...)` occasionally produce such strings. Use `hash_equals` / `===`.
- **`is_numeric` as security check**: allows `1e5`, `0x1A`, `+1`, `-1`, whitespace, `0b101` — NOT what you want for "positive integer only". `ctype_digit` is stricter but rejects negative numbers and `0` (returns true for `"0"` though). Best: `filter_var($x, FILTER_VALIDATE_INT)` with options.
- **`in_array` loose mode**: `in_array("1abc", [1, 2, 3])` is TRUE (string `"1abc"` casts to int `1`). Always use the 3rd arg `true` for strict.
- **`strcmp` with array (PHP < 8)**: `strcmp($_GET['p'], $secret)` — if attacker sends `?p[]=1`, strcmp receives array, returns NULL, `NULL == 0` → "passed". PHP 8+ throws TypeError instead.
- **`preg_match` return check**: `preg_match(...)` returns `false` on error, `0` on no match, `1` on match. `if (preg_match(...) == false)` fails to distinguish no-match from error — but the real gap is `if (!preg_match(...))` treats error AND no-match the same. Use `=== false` for error, `=== 0` for no match.
- **`substr($x, -N) == ".php"` extension check**: bypassed with `shell.php.jpg` (whitelist bypass) or with double-extension if the server is mis-configured. Always check via `pathinfo($x, PATHINFO_EXTENSION)` and normalize.
- **`pathinfo` vs `basename` mismatch**: `basename("foo/../bar.php")` returns `"bar.php"` — `..` is stripped. But `file_get_contents("uploads/" . $name)` is still traversable if `$name` itself contains `..`. Check ordering: normalize BEFORE concat, then check prefix with `realpath`.
- **Null-byte truncation** (PHP < 5.3.4): `include("/var/www/" . $_GET['p'] . ".php")` with `?p=../../etc/passwd%00` truncates at the null and includes passwd. Patched in modern PHP but custom code using `fopen` on filtered filesystems can still be affected.
- **Path traversal via stream wrappers**: `include("php://filter/resource=" . $file)` — `$file` controls the resource. Attacker supplies `php://filter/convert.base64-decode/resource=data://text/plain,<base64-php-code>` to execute arbitrary PHP.
- **URL allowlist via `strpos`**: `if (strpos($url, 'https://allowed.com') === 0)`. Bypass with `https://allowed.com@evil.com` (userinfo), `https://allowed.com.evil.com` (prefix match), or `https://allowed.commm` (suffix match). Use parsed host comparison.
- **`filter_var(..., FILTER_VALIDATE_URL)` bypass**: accepts `http://allowed.com@evil.com` as valid URL (RFC-compliant), SSRF chains then resolve evil.com. Always parse, extract host, compare allowlist.
- **`parse_url` with missing key**: if `parse_url($url, PHP_URL_HOST)` returns `null` for malformed URLs, and code treats null as "same origin" → bypass.
- **`SELECT * FROM t WHERE id=$id`** where `$id` is cast to int via `(int)$_GET['id']` AT USE SITE but also written earlier into a log via `INSERT INTO log (url) VALUES ('$_SERVER[REQUEST_URI]')` — second-order SQLi possible via URI manipulation. Check EVERY place the column gets written.
- **Prepared statement with concatenated identifier**: `$pdo->prepare("SELECT * FROM $table WHERE id = ?")` — `$table` is concatenated; prepared statements don't protect table/column names. Allowlist identifiers.
- **Charset mismatch in `mysqli_real_escape_string`**: if client sends data as GBK but connection is declared latin1, multibyte chars like `0xBF5C` can escape `'`. Set charset with `mysqli_set_charset($conn, 'utf8mb4')` BEFORE any escaping — connection, not session.
- **`escapeshellcmd` vs `escapeshellarg`**: `escapeshellcmd(system("ls $dir"))` where `$dir = "; cat /etc/passwd"` — escapes `;` and `|` but NOT `-`. Attacker uses `--flag=value` injection for utilities that accept flags anywhere. `escapeshellarg` on each arg is safer.
- **Deserialize with magic method gadget**: flow reaches `unserialize($user)` → any autoloaded class with `__destruct`/`__wakeup` runs. Check if any class performs file ops, SQL, eval in those magic methods. `phpggc` library enumerates known gadget chains.
- **Phar metadata deserialization**: `file_exists($user_path)`, `filesize($user_path)`, `fopen($user_path)` all trigger unserialize of Phar metadata if path starts with `phar://`. Any file op with user path is a deserialize sink indirectly.
- **Mass assignment**: `$user->fill($_POST)` without `$fillable` defined, or with `$guarded = []`, lets attacker set `is_admin=1`, `role=admin`, etc. Find the Eloquent model class and check `$fillable`/`$guarded`.
- **Open redirect via `header('Location: ' . $_GET['next'])`** — even "starts with `/`" check is bypassed with `//evil.com` (protocol-relative) or `/\evil.com` (backslash on some browsers). Must validate as path, not URL.
- **CRLF injection**: `header("Set-Cookie: pref=" . $_GET['pref'])` with `?pref=a%0d%0aSet-Cookie:session=evil` splits response. PHP's `header()` blocks `\r\n` since 5.1.2, but `setcookie`'s name/value is the risk.
- **Blade `{!! $x !!}` or Twig `{{ x|raw }}` on user data** — bypasses auto-escape.
- **SSTI via Twig `Twig\Environment::createTemplate($user)`, Smarty `$smarty->display("string:$user")`** — full RCE via template engine.
- **Password hash timing leak**: `if ($input_hash == $db_hash)` leaks char-by-char on short-circuit — use `hash_equals`.
- **Session fixation**: `session_id($_GET['sess']); session_start();` accepts attacker-supplied session id.

## Process

For each flow in `flows.json`:

1. **Read the validation chain.** Understand what each check actually validates.
2. **Compare check vs operation.** Is the check's constraint SUFFICIENT for the operation's safety?
3. **Check type consistency.** Does the data maintain the same type/width throughout?
4. **Check ordering.** Can anything change between the check and the use?
5. **Check interactions.** Do multiple checks or operations interact in unexpected ways?
6. **Cross-reference static warnings.** Does CodeQL/Semgrep flag anything in this flow?

For hot-rated flows: analyze thoroughly — these are missing or weak checks.
For warm-rated flows: focus on the specific concern the tracer noted.
For cold-rated flows: quick scan for interaction gaps only — the tracer thought these were fine, but interaction bugs hide in "correct" code.

## Output

Write to `sast-work/<file_hash>-candidates.json`:
```json
{
  "file": "src/net/tcp_sack.c",
  "candidates": [
    {
      "id": "tcp_sack_001",
      "title": "Signed integer overflow in SACK processing enables NULL pointer write",
      "gap_category": "interaction",
      "severity_estimate": "high",
      "flow_summary": "Network packet → SACK option → sack_start unbounded → signed comparison overflow → impossible condition satisfied → linked list walk writes through NULL",
      "the_gap": "sack_end is validated against the send window but sack_start is not. The SEQ_LEQ macro uses signed subtraction (int)(a-b), which overflows when the operands are ~2^31 apart. This makes the 'delete hole' and 'append hole' conditions simultaneously satisfiable, but the 'delete' frees the only node, leaving a NULL pointer for 'append' to write through.",
      "checks_present": ["sack_end <= snd_max (line 270)"],
      "checks_missing": ["sack_start >= snd_una"],
      "why_existing_checks_fail": "Checking sack_end prevents out-of-window acknowledgment but allows sack_start to wrap around. The macro's signed subtraction is correct for values within 2^31 but produces wrong results for crafted values outside that range.",
      "poc_hint": "Craft TCP packet with SACK option where sack_start = snd_una + 0x80000000. The signed comparison will overflow and satisfy both branch conditions.",
      "static_tool_flagged": false
    }
  ],
  "flows_analyzed": 7,
  "hot_flows": 2,
  "warm_flows": 3,
  "cold_flows": 2,
  "candidates_found": 1
}
```

## Rules

- **The gap must be SPECIFIC.** Not "validation might be insufficient" but "the check at line 270 validates sack_end but not sack_start, and the signed macro at line 320 overflows for values > 2^31."
- **Include poc_hint.** A concrete description of what input would trigger the gap. The PoC builder uses this.
- **Don't reject cold flows without checking for interaction gaps.** The two most interesting Mythos bugs (SACK, FFmpeg) would both look "correct" to someone checking each piece individually. The bug is in how pieces interact.
- **Be conservative with severity.** Estimate based on: what primitive does the gap give (crash, OOB read, OOB write, control flow), is it reachable from untrusted input, what mitigations apply.
- **Record analysis even for no-candidate flows.** Brief note in summary of why each flow was cleared.

## Brain Integration
Check brain for prior gap analysis. Skip if already analyzed and files unchanged.

## Top-Tier Operator Standard

Gap analysis finds broken invariants between individually reasonable code blocks.

- Name the invariant: length <= buffer, path stays under root, user owns object, signature binds claims, parser output is canonical, state transition is monotonic.
- Show where the invariant is assumed, where it is checked, and how the check can be bypassed or made stale.
- Prefer candidates with triggerable conditions and testable PoC strategy.
- Kill or downgrade candidates whose required input shape is impossible, whose guard dominates all paths, or whose sink is unreachable.
- Output confidence based on reachability, controllability, guard weakness, impact primitive, and PoC feasibility.
More from H-mmer/pentest-agents