liteparse-targeted
$
npx mdskill add run-llama/benchmark-claude-pdfs/liteparse-targetedExtract text from documents locally with the `lit` CLI — a fast, model-free parser (a drop-in, faster replacement for `pdftotext`/`pypdf`).
SKILL.md
.github/skills/liteparse-targetedView on GitHub ↗
--- name: liteparse-targeted description: Use this skill whenever a task involves a document file (PDF, DOCX, PPTX, XLSX, or image) and you need to read it or get text, tables, or specific values out of it — including to answer a question about its contents, look up a figure, extract data, or convert it to text/JSON. Provides fast, local, model-free extraction with no cloud or API key. Reach for this instead of ad-hoc pdftotext/pypdf/textract whenever a question or task references a document file. compatibility: Requires Node 18+ and `@llamaindex/liteparse` installed globally (`npm i -g @llamaindex/liteparse`). LibreOffice for Office files; ImageMagick for images. license: MIT metadata: author: LlamaIndex version: "0.3.0" --- # LiteParse Extract text from documents locally with the `lit` CLI — a fast, model-free parser (a drop-in, faster replacement for `pdftotext`/`pypdf`). ## Answering a question about ONE document: stream and search in a SINGLE shell command `lit parse` writes plain text to **stdout**, so pipe it straight into your normal search tools in **one Bash command** — exactly how you would use `pdftotext -layout file.pdf - | grep`. Do **not** write an intermediate file, and do **not** use the Read or Grep *tools* on a saved file: each of those is an extra agent round-trip. Keep parse+search fused in one command: ```bash lit parse ./input.pdf --format text --no-ocr | grep -i -n -A3 -B3 "total assets" | head -40 lit parse ./input.pdf --format text --no-ocr | sed -n '900,945p' ``` - **Born-digital PDF** (has a real text layer): add `--no-ocr` — much faster, identical text. - **Scanned PDF / image**: drop `--no-ocr` (OCR on). If the value is missing from the OCR text or the digits look wrong, **read the page visually instead of trusting OCR**: render it with `lit screenshot ./input.pdf --target-pages "N" -o ./shots/` and view the PNG. - **Multi-column tables**: piped `--format text` keeps most layout; if columns collapse so you can't tell which column a number is in, render that page and read it visually. ## Answering MANY questions about the same document(s): parse once, reuse Only here is it worth materializing a file (so you don't re-parse per question): ```bash lit parse ./inputs/<doc>.pdf --format text --no-ocr -o ./parsed/<doc>.txt # once per doc grep -i -n -A3 -B3 "total assets" ./parsed/<doc>.txt # then search the file ``` ## Core flags `--format text|json` · `--no-ocr` · `--dpi <n>` (default 150) · `--target-pages "1-5,10"` · `--ocr-language <iso>` · `lit batch-parse ./in ./out`. Use `--format json` only when you need bounding boxes / layout (it is much larger — still search it, don't load it whole). ## Setup PDF works out of the box. If `lit` is missing: `npm i -g @llamaindex/liteparse` (verify `lit --version`). Office docs need LibreOffice; images need ImageMagick (auto-converted to PDF).
More from run-llama/benchmark-claude-pdfs