liteparse-targeted

$ npx mdskill add run-llama/benchmark-claude-pdfs/liteparse-targeted

Extract text from documents locally with the `lit` CLI — a fast, model-free parser (a drop-in, faster replacement for `pdftotext`/`pypdf`).

SKILL.md
.github/skills/liteparse-targeted
---
name: liteparse-targeted
description: Use this skill whenever a task involves a document file (PDF, DOCX, PPTX, XLSX, or image) and you need to read it or get text, tables, or specific values out of it — including to answer a question about its contents, look up a figure, extract data, or convert it to text/JSON. Provides fast, local, model-free extraction with no cloud or API key. Reach for this instead of ad-hoc pdftotext/pypdf/textract whenever a question or task references a document file.
compatibility: Requires Node 18+ and `@llamaindex/liteparse` installed globally (`npm i -g @llamaindex/liteparse`). LibreOffice for Office files; ImageMagick for images.
license: MIT
metadata:
  author: LlamaIndex
  version: "0.3.0"
---

# LiteParse

Extract text from documents locally with the `lit` CLI — a fast, model-free parser (a drop-in,
faster replacement for `pdftotext`/`pypdf`).

## Answering a question about ONE document: stream and search in a SINGLE shell command

`lit parse` writes plain text to **stdout**, so pipe it straight into your normal search tools in
**one Bash command** — exactly how you would use `pdftotext -layout file.pdf - | grep`. Do **not**
write an intermediate file, and do **not** use the Read or Grep *tools* on a saved file: each of
those is an extra agent round-trip. Keep parse+search fused in one command:

```bash
lit parse ./input.pdf --format text --no-ocr | grep -i -n -A3 -B3 "total assets" | head -40
lit parse ./input.pdf --format text --no-ocr | sed -n '900,945p'
```

- **Born-digital PDF** (has a real text layer): add `--no-ocr` — much faster, identical text.
- **Scanned PDF / image**: drop `--no-ocr` (OCR on). If the value is missing from the OCR text or the
  digits look wrong, **read the page visually instead of trusting OCR**: render it with
  `lit screenshot ./input.pdf --target-pages "N" -o ./shots/` and view the PNG.
- **Multi-column tables**: piped `--format text` keeps most layout; if columns collapse so you can't
  tell which column a number is in, render that page and read it visually.
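
The born-digital vs. scanned choice above can be folded into one call. A minimal sketch, assuming that `--no-ocr` on a scanned PDF yields empty text (not confirmed here); `doc_search` is a hypothetical helper, not part of liteparse:

```bash
# Hypothetical helper (not shipped with liteparse): parse and search in one command.
# Usage: doc_search <file> <pattern>
doc_search() {
  local file=$1 pattern=$2 text
  # Fast path: treat the file as born-digital and skip OCR.
  text=$(lit parse "$file" --format text --no-ocr)
  # Assumption: whitespace-only output means there is no text layer, so retry with OCR on.
  if [ -z "$(printf '%s' "$text" | tr -d '[:space:]')" ]; then
    text=$(lit parse "$file" --format text)
  fi
  # Same search shape as the examples above: context lines, capped output.
  printf '%s\n' "$text" | grep -i -n -A3 -B3 -- "$pattern" | head -40
}
```

Call it as `doc_search ./input.pdf "total assets"`. The OCR retry only guards against a missing text layer; wrong-looking digits still call for `lit screenshot`.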

## Answering MANY questions about the same document(s): parse once, reuse

Only here is it worth materializing a file (so you don't re-parse per question):

```bash
lit parse ./inputs/<doc>.pdf --format text --no-ocr -o ./parsed/<doc>.txt   # once per doc
grep -i -n -A3 -B3 "total assets" ./parsed/<doc>.txt                        # then search the file
```
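
To keep "once per doc" true across repeated runs, the parse step can be guarded by a freshness check. A minimal sketch; `parse_once` is a hypothetical helper, and creating the output directory first is a precaution rather than a documented requirement of `-o`:

```bash
# Hypothetical helper: parse only when the saved text is missing, empty, or stale.
# Usage: parse_once <input.pdf> <output.txt>
parse_once() {
  local src=$1 out=$2
  # Re-parse if the output is absent/empty or the source was modified after it.
  if [ ! -s "$out" ] || [ "$src" -nt "$out" ]; then
    mkdir -p "$(dirname "$out")"
    lit parse "$src" --format text --no-ocr -o "$out"
  fi
}

# Example (paths are placeholders):
#   parse_once ./inputs/report.pdf ./parsed/report.txt
#   grep -i -n -A3 -B3 "total assets" ./parsed/report.txt
```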

## Core flags

`--format text|json` · `--no-ocr` · `--dpi <n>` (default 150) · `--target-pages "1-5,10"` ·
`--ocr-language <iso>`. For many files at once, use the `lit batch-parse ./in ./out` subcommand.
Use `--format json` only when you need bounding boxes / layout (it is much larger, so search it
rather than loading it whole).
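
Two sketches of how these flags might combine, assuming `--target-pages` can be passed to `lit parse` alongside `--format` and `--no-ocr`. `page_search` and `json_peek` are hypothetical helpers, and nothing here depends on the JSON schema:

```bash
# Hypothetical helper: search only a page range once you know roughly where to look.
# Usage: page_search <file> <pages> <pattern>     e.g. pages = "12-14"
page_search() {
  local file=$1 pages=$2 pattern=$3
  lit parse "$file" --format text --no-ocr --target-pages "$pages" \
    | grep -i -n -- "$pattern"
}

# Hypothetical helper: peek at JSON output without loading it whole.
# grep -o prints only the match plus up to 80 characters on each side,
# which stays bounded even if the JSON arrives as one very long line.
# Usage: json_peek <file> <pattern>   (drop --no-ocr for scanned input)
json_peek() {
  local file=$1 pattern=$2
  lit parse "$file" --format json --no-ocr \
    | grep -o -i -- ".\{0,80\}$pattern.\{0,80\}" | head -20
}
```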

## Setup

PDF works out of the box. If `lit` is missing, install it with `npm i -g @llamaindex/liteparse`
and verify with `lit --version`. Office docs need LibreOffice; images need ImageMagick (auto-converted to PDF).
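
A quick preflight can confirm these tools before a long run. A minimal sketch; `missing_tools` is a hypothetical helper, and the binary names (`soffice`/`libreoffice`, `magick`/`convert`) are assumptions about how LibreOffice and ImageMagick are installed on your system:

```bash
# Hypothetical preflight check: print the tools from this section that are not on PATH.
missing_tools() {
  local missing=""
  command -v lit >/dev/null 2>&1 || missing="$missing lit"
  # Assumption: LibreOffice is exposed as `soffice` or `libreoffice`.
  command -v soffice >/dev/null 2>&1 || command -v libreoffice >/dev/null 2>&1 \
    || missing="$missing libreoffice"
  # Assumption: ImageMagick is exposed as `magick` (v7) or `convert` (v6).
  command -v magick >/dev/null 2>&1 || command -v convert >/dev/null 2>&1 \
    || missing="$missing imagemagick"
  # Space-separated names without the leading space; an empty line means all were found.
  printf '%s\n' "${missing# }"
}
```

Only `lit` is needed for PDFs; the other two matter only for Office files and images.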