document-scanning

$npx mdskill add Community-Access/accessibility-agents/document-scanning

Scan folders to extract document metadata for accessibility audits.

  • Enables inventory building for Office and PDF files.
  • Integrates with git diff to detect document changes.
  • Executes platform-specific discovery commands on Windows or macOS.
  • Delivers structured property data including title and author.
SKILL.md
.github/skills/document-scanningView on GitHub ↗
---
name: document-scanning
description: Document discovery, inventory building, and metadata extraction for accessibility audits. Use when scanning folders for Office documents (.docx, .xlsx, .pptx) and PDFs, building file inventories, detecting changes via git diff, or extracting document properties like title, author, and language.
---

# Document Scanning

## Supported File Types

| Extension | Type | Sub-Agent |
|-----------|------|-----------|
| .docx | Word document | word-accessibility |
| .xlsx | Excel workbook | excel-accessibility |
| .pptx | PowerPoint presentation | powerpoint-accessibility |
| .pdf | PDF document | pdf-accessibility |

## File Discovery Commands

### PowerShell (Windows)

```powershell
# Non-recursive scan
Get-ChildItem -Path "<folder>" -File -Include *.docx,*.xlsx,*.pptx,*.pdf

# Recursive scan
Get-ChildItem -Path "<folder>" -File -Include *.docx,*.xlsx,*.pptx,*.pdf -Recurse |
  Where-Object { $_.Name -notlike '~$*' -and $_.Name -notlike '*.tmp' -and $_.Name -notlike '*.bak' } |
  Where-Object { $_.FullName -notmatch '[\\/](\.git|node_modules|__pycache__|\.vscode)[\\/]' }
```

### Bash (macOS)

```bash
# Non-recursive scan
find "<folder>" -maxdepth 1 -type f \( -name "*.docx" -o -name "*.xlsx" -o -name "*.pptx" -o -name "*.pdf" \) ! -name "~\$*"

# Recursive scan
find "<folder>" -type f \( -name "*.docx" -o -name "*.xlsx" -o -name "*.pptx" -o -name "*.pdf" \) \
  ! -name "~\$*" ! -name "*.tmp" ! -name "*.bak" \
  ! -path "*/.git/*" ! -path "*/node_modules/*" ! -path "*/__pycache__/*" ! -path "*/.vscode/*"
```

## Delta Detection

### Git-based

```bash
# Files changed since last commit
git diff --name-only HEAD~1 HEAD -- '*.docx' '*.xlsx' '*.pptx' '*.pdf'

# Files changed since a specific tag
git diff --name-only <tag> HEAD -- '*.docx' '*.xlsx' '*.pptx' '*.pdf'

# Files changed in the last N days
git log --since="N days ago" --name-only --diff-filter=ACMR --pretty="" -- '*.docx' '*.xlsx' '*.pptx' '*.pdf' | sort -u
```

### Timestamp-based (PowerShell)

```powershell
# Files modified since a specific date
Get-ChildItem -Path "<folder>" -File -Include *.docx,*.xlsx,*.pptx,*.pdf -Recurse |
  Where-Object { $_.LastWriteTime -gt [datetime]"2025-01-01" }
```

## Files to Skip

Always exclude these patterns during scanning:

- `~$*` - Office lock/temp files (created when a document is open)
- `*.tmp` - Temporary files
- `*.bak` - Backup files
- Files inside `.git/`, `node_modules/`, `.vscode/`, `__pycache__/` directories

## Scan Configuration Files

| File | Purpose |
|------|---------|
| `.a11y-office-config.json` | Rule enable/disable for Word, Excel, PowerPoint |
| `.a11y-pdf-config.json` | Rule enable/disable for PDF scanning |

### Scan Profiles

| Profile | Rules | Severities | Use Case |
|---------|-------|------------|----------|
| Strict | All | Error, Warning, Tip | Public-facing, legally required documents |
| Moderate | All | Error, Warning | Most organizations |
| Minimal | All | Error only | Triaging large document libraries |

## Context Passing Format

When delegating to a sub-agent, always provide this context block:

```text
## Document Scan Context
- **File:** [full path]
- **Scan Profile:** [strict | moderate | minimal]
- **Severity Filter:** [error, warning, tip]
- **Disabled Rules:** [list or "none"]
- **User Notes:** [any specifics]
- **Part of Batch:** [yes/no - if yes, indicate X of Y]
```
More from Community-Access/accessibility-agents