doc-reader
$ npx mdskill add HKUDS/Vibe-Trading/doc-reader
Extract text from PDFs using OCR for scanned documents.
- Processes papers, annual reports, and research documents: text pages in milliseconds, OCR pages in about 1-3 seconds each.
- Integrates with the read_document tool for direct file access.
- Selects extraction method based on whether pages are text or images.
- Returns structured JSON with page counts, OCR flags, and full text.
SKILL.md (`.github/skills/doc-reader`)
---
name: doc-reader
description: Read PDF documents (papers, annual reports, research reports), automatically extracting text pages and applying OCR to image/scanned pages. Use the `read_document` tool.
category: tool
---
# PDF Document Reading
## Purpose
Read the full text of PDF documents and automatically handle two page types:
- **Text pages** (most papers and digital reports) → extracted directly in milliseconds
- **Image / scanned pages** (annual report charts, scanned research reports) → recognized via OCR, with Chinese and English support
Applicable to PDF documents such as papers, annual reports, research reports, announcements, and contracts.
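The split between the two page types can be pictured with a small sketch. This is only an illustration of the idea, not the tool's actual logic: the function name `choose_extraction_method` and the `min_chars` threshold are hypothetical, and the source does not say how `read_document` decides that a page needs OCR.

```python
def choose_extraction_method(embedded_text: str, min_chars: int = 20) -> str:
    """Return "text" when a page carries enough embedded text, else "ocr".

    Hypothetical heuristic: the real tool's rule is not documented here.
    """
    # Count non-whitespace characters found in the page's text layer.
    visible = "".join(embedded_text.split())
    return "text" if len(visible) >= min_chars else "ocr"


# Three made-up pages: one with a text layer, two without usable text.
sample_pages = ["Abstract. We study the pricing of ...", "   \n", ""]
methods = [choose_extraction_method(p) for p in sample_pages]
print(methods)
```

A page with an empty or near-empty text layer falls through to OCR, which is why scanned pages and full-page charts end up on the slower path.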
## Usage
**Call the `read_document` tool directly (do not use bash to write a Python script):**
```
read_document(file_path="uploads/paper.pdf")
read_document(file_path="uploads/annual_report.pdf", pages="1-10")
read_document(file_path="uploads/research.pdf", pages="1,3,15-20")
```
**Forbidden**: do not run a Python script from bash to read PDFs. Call the tool directly.
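To make the `pages` syntax concrete, here is a minimal sketch of how a spec such as `"1,3,15-20"` could expand into page numbers. It is for illustration only and is not meant to be run in place of the tool. The helper `parse_pages` is hypothetical, and it assumes 1-based page numbers, inclusive ranges, and that pages past the end of the document are ignored, none of which the source states explicitly.

```python
def parse_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "1,3,15-20" into sorted 1-based page numbers."""
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "15-20"; assumed inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: pages beyond the end of the document are ignored.
        selected.update(range(start, min(end, total_pages) + 1))
    return sorted(selected)


print(parse_pages("1,3,15-20", total_pages=45))
```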
## Return Format
```json
{
"status": "ok",
"file": "paper.pdf",
"total_pages": 45,
"pages_read": 45,
"ocr_pages": 3,
"char_count": 52000,
"truncated": true,
"text": "--- Page 1 ---\n...\n--- Page 5 [OCR] ---\n..."
}
```
- `ocr_pages`: number of pages recognized via OCR (image / scanned pages)
- `truncated`: `true` when the text exceeds 15000 characters and has been cut off
- `[OCR]` indicates that the page content was obtained via image recognition
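Once a result has been returned, the page markers in `text` make it possible to separate pages and tell OCR pages from text pages. The sketch below assumes the markers always look exactly like the sample above (`--- Page N ---` and `--- Page N [OCR] ---`, each on its own line). The helper `split_pages` and the sample dictionary are hypothetical.

```python
import re

# Matches "--- Page 12 ---" and "--- Page 12 [OCR] ---" on their own line.
PAGE_MARKER = re.compile(r"^--- Page (\d+)( \[OCR\])? ---$", re.MULTILINE)


def split_pages(text: str) -> list[dict]:
    """Split the `text` field into one record per page marker."""
    matches = list(PAGE_MARKER.finditer(text))
    pages = []
    for i, match in enumerate(matches):
        # A page's body runs until the next marker or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pages.append({
            "page": int(match.group(1)),
            "ocr": match.group(2) is not None,
            "text": text[match.end():end].strip(),
        })
    return pages


# Made-up result shaped like the return format above.
result = {
    "status": "ok",
    "ocr_pages": 1,
    "truncated": False,
    "text": "--- Page 1 ---\nIntro\n--- Page 5 [OCR] ---\nChart caption",
}
pages = split_pages(result["text"])
ocr_page_numbers = [p["page"] for p in pages if p["ocr"]]
print(ocr_page_numbers)
```

Pages flagged as OCR are the ones worth checking manually when they contain tables or charts.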
## Typical Workflows
### Paper Summary
```
1. read_document(file_path="paper.pdf") → get the full text
2. Analyze the text and extract the abstract, methodology, and conclusion
3. Output the summary
```
### Annual Report Analysis
```
1. read_document(file_path="annual_report.pdf", pages="1-5") → read the summary first
2. Determine the key sections from the summary
3. read_document(file_path="...", pages="15-25") → read the financial-data section
4. Extract key metrics
```
### Research Report Review
```
1. read_document(file_path="research.pdf") → full text
2. Extract the core thesis, target price, and risk factors
```
## Notes
- Content longer than 15000 characters is truncated. Read long documents in chunks with the `pages` parameter
- OCR pages are slower (about 1-3 seconds per page), while pure text pages are processed in milliseconds
- OCR for charts and tables inside images may be imperfect, so complex tables should be checked manually
- Only PDF format is supported
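Reading a long document in chunks comes down to generating consecutive `pages` values. A minimal sketch, assuming a fixed number of pages per call: the helper `page_chunks` and its default `chunk_size` are hypothetical, and the right size depends on how dense the pages are, since the 15000 limit is counted in characters rather than pages.

```python
def page_chunks(total_pages: int, chunk_size: int = 5) -> list[str]:
    """Build `pages` strings such as "1-5", "6-10" covering the whole document."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    chunks = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A single trailing page is written as "11" rather than "11-11".
        chunks.append(f"{start}-{end}" if end > start else str(start))
    return chunks


for pages_arg in page_chunks(total_pages=12, chunk_size=5):
    # Each value would be passed as the `pages` argument of one tool call.
    print(f'read_document(file_path="uploads/annual_report.pdf", pages="{pages_arg}")')
```

If a chunk still comes back with `truncated` set to `true`, a smaller `chunk_size` is the obvious adjustment.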