bigdata-skill

Name: bigdata-skill
Author: daymade/claude-code-skills
$npx mdskill add daymade/claude-code-skills/bigdata-skill
Get the structured substrate the Bigdata.com MCP server doesn't hand over. The MCP returns clean prose and pre-synthesized tearsheets, but its search tool gives chunks with no per-chunk sentiment or entity spans, and its tearsheets give aggregate values — not the fiscal-period time series, universe screener, or per-field JSON you'd build a pipeline on. The official `bigdata-client` SDK plus a thin REST passthrough over the *same backend, same JWT* reach the official `/v1/*` endpoints that hold it. This skill bundles a toolkit that does exactly that — already debugged, already cost-guarded — so you don't re-pay the discovery cost.
SKILL.md
.github/skills/bigdata-skillView on GitHub ↗
---
name: bigdata-skill
description: >-
  Pull Bigdata.com (RavenPack) financial and news data via the official
  `bigdata-client` SDK and `/v1/*` REST endpoints — structured financials,
  prices, analyst estimates, daily entity-sentiment series, annotated chunk
  search, screener — when the Bigdata MCP returns only pre-synthesized tearsheets
  but you need the machine-readable substrate. Use when the user mentions
  Bigdata.com, RavenPack, a `bd_v2_` key, the bigdata MCP, rp_entity_id,
  chunk/query_unit cost, or wants structured financials, fundamentals, prices,
  sentiment, or annotated news.
---

# Bigdata.com SDK + REST Toolkit

Get the structured substrate the Bigdata.com MCP server doesn't hand over. The
MCP returns clean prose and pre-synthesized tearsheets, but its search tool
gives chunks with no per-chunk sentiment or entity spans, and its tearsheets
give aggregate values — not the fiscal-period time series, universe screener, or
per-field JSON you'd build a pipeline on. The official `bigdata-client` SDK plus
a thin REST passthrough over the *same backend, same JWT* reach the official
`/v1/*` endpoints that hold it. This skill bundles a toolkit that does exactly
that — already debugged, already cost-guarded — so you don't re-pay the
discovery cost.

## The core problem this solves (read this first)

The Bigdata MCP server answers "what's the sentiment around NVIDIA?" with a
readable paragraph or a pre-synthesized tearsheet — genuinely useful for a chat
turn. But the moment you need the **machine-readable substrate** to build a
pipeline on, the MCP doesn't hand it over:

- its **search** tool returns chunks with text + relevance only — **no per-chunk
  sentiment number, no entity character spans**;
- its **tearsheets** give aggregate values (a single sentiment score, a summary
  of estimates) — **not** a fiscal-period time series you can compute on, a
  universe screener, or per-field JSON.

The fix is a general pattern, not a Bigdata trick:

> **When an MCP data source returns only synthesized output but you need the
> structured fields underneath, drop to the vendor SDK or REST.** MCP optimizes
> for a chat turn, not a pipeline.

Crucially, for Bigdata these structured fields are **official, publicly
documented REST endpoints** (`docs.bigdata.com/api-reference/...`), not a hidden
backend — and Bigdata is **sunsetting the SDK (EOL 2026-12-31) in favour of this
REST API**, so the REST layer here is the forward-compatible path, not a hack.
The SDK (`bigdata_client.Bigdata`) covers search + knowledge-graph; **`bd._api.http`**
reaches every `/v1/*` endpoint the SDK never wrapped. The bundled
`bigdata_toolkit` packages both behind one `BigdataClient`.

## When to use this skill

Trigger on any of these, in any language:

- The user is using **Bigdata.com / RavenPack** and the MCP result feels thin —
  "where's the sentiment score?", "I need entity-level data", "the calendar".
- They want **forward / structured** financials for a ticker: analyst
  estimates, earnings or event calendar, earnings surprise, analyst ratings,
  price targets, a company screener / universe.
- They want **annotated news chunks** with numeric sentiment + entity spans, or
  a sentiment time series / co-mention graph.
- They mention a **`bd_v2_` API key**, `rp_entity_id`, `query_unit` / chunk
  cost, `bigdata-client`, or "the bigdata MCP isn't enough".
- They're building an **investment-research dataset** and need a reusable,
  cost-aware data-pull layer rather than one-off MCP calls.

## Setup (one time)

**1 — API key (never hardcode it).** The client fail-fasts if it's missing:

```bash
export BIGDATA_API_KEY=bd_v2_xxxxxxxx
```

**2 — An isolated Python env with the official SDK.** The bundled toolkit
imports `bigdata_client`; install it once:

```bash
uv venv .venv --python 3.12
uv pip install --python .venv/bin/python bigdata-client
# Behind a slow/blocked PyPI (e.g. mainland China) add a mirror, and unset any
# outbound proxy for the install step so uv reaches the index directly:
#   --index-url https://pypi.tuna.tsinghua.edu.cn/simple
```

**3 — Outbound proxy (only if your network needs one to reach
`api.bigdata.com`).** Two equivalent options — the official SDK accepts both: an
env var, or `BigdataClient(proxy=...)` in code. The env var is simplest:

```bash
export HTTPS_PROXY=http://<host>:<port>     # plus WSS_PROXY for chat/WebSocket
```

If a proxy does TLS interception (self-signed CA) and you hit SSL handshake
errors, the official fix is `BigdataClient(verify_ssl="<proxy-CA>.pem")` — not
blind retries.

**4 — Make the bundled package importable** by putting this skill's `scripts/`
on `PYTHONPATH` (or `sys.path.insert(0, "<this-skill>/scripts")`).

**Smoke-test the whole path** (entity resolve + quota are free; `--with-search`
adds one ~1 query_unit chunk search):

```bash
BIGDATA_API_KEY=bd_v2_xxx PYTHONPATH=scripts .venv/bin/python scripts/probe_example.py
```

## Quickstart

```python
import sys
sys.path.insert(0, "<this-skill>/scripts")          # so `import bigdata_toolkit` resolves
from bigdata_toolkit import (
    BigdataClient, EntityResolver, AnnotatedSearcher,
    StructuredDataREST, CostTracker, CostModel, rc,   # rc = SSL-retry wrapper
)

c  = BigdataClient()                                  # SDK + REST escape hatch, one object
er = EntityResolver(c)
nvda = rc(lambda: er.resolve_id("NVIDIA", country="US"))   # -> 'E09E2B'  (rp_entity_id is the gateway key)

# --- Structured financials the MCP does NOT expose (REST escape hatch) ---
rest = StructuredDataREST(c)
est  = rc(lambda: rest.analyst_estimates(nvda, period="quarter", limit=5))  # forward consensus
surp = rc(lambda: rest.latest_surprise(nvda))                               # last EPS/revenue surprise
cal  = rc(lambda: rest.events_calendar(nvda, categories=["earnings-call"],
                                       start_date="2026-06-01", end_date="2026-12-31"))

# --- Annotated chunks the MCP STRIPS: sentiment + entity spans (cost-guarded) ---
s    = AnnotatedSearcher(c)
docs = rc(lambda: s.search_entity(nvda, keyword="data center", chunk_limit=10))
# each chunk dict: {"sentiment": float, "entities": [{"key": rp_id, "start", "end"}], "text", ...}

# --- Always know your spend (chunk-billed; see Cost discipline) ---
ct = CostTracker(c); ct.snapshot()
# ... run a batch ...
print(ct.delta())     # {'delta_chunks':..., 'delta_query_units':..., 'usd_fast':...}
```

Wrap **every** network call in `rc(lambda: ...)` — a first-handshake `SSL:
UNEXPECTED_EOF` is common and the SDK's internal retry doesn't cover it.

## Routing — which capability answers the question

| The user wants… | Use | Module |
|---|---|---|
| Company name / ISIN / CUSIP / SEDOL → `rp_entity_id` | `EntityResolver.resolve_id` / `.resolve_by_isin` | `kg.py` (SDK) |
| Forward analyst consensus (revenue/EPS by fiscal period) | `StructuredDataREST.analyst_estimates` | `rest_ext.py` |
| Latest earnings surprise (actual vs estimate) | `.latest_surprise` | `rest_ext.py` |
| Upcoming earnings / event calendar (one name or whole market) | `.events_calendar` | `rest_ext.py` |
| Analyst ratings / price-target consensus | `.analyst_ratings` / `.price_target` | `rest_ext.py` |
| Full financial statements (income / balance / cash-flow, multi-year) | `.income_statement` / `.balance_sheet` / `.cash_flow_statement` | `rest_ext.py` |
| TTM valuation metrics & ratios (EV/EBITDA, ROE, P/E, margins) | `.key_metrics_ttm` / `.company_ratios_ttm` | `rest_ext.py` |
| Company profile (CEO, sector, employees, IPO date) | `.company_profile` | `rest_ext.py` |
| Daily OHLC prices / dividend history | `.daily_prices` / `.dividends` | `rest_ext.py` |
| Revenue by geography / product segment | `.revenue_geographic_segments` / `.revenue_product_segments` | `rest_ext.py` |
| Daily entity-sentiment time series (don't self-aggregate from chunks!) | `.entity_sentiment` | `rest_ext.py` |
| Co-mention graph (supply-chain / competitor / customer — ⚠️ chunk-billed) | `.connected_entities` | `rest_ext.py` |
| Build a universe by market-cap / sector / country | `.company_screener` | `rest_ext.py` |
| News/filing/transcript chunks with sentiment + entity spans | `AnnotatedSearcher.search_entity` | `search.py` (SDK) |
| Bulk-pull many searches 50% cheaper (portfolio backfill) | `BatchSearch` (create→upload→poll→download) | `rest_ext.py` |
| Track / forecast quota spend before a backfill | `CostTracker` / `CostModel` | `cost.py` |
| Hit an endpoint the toolkit hasn't wrapped yet | `client.http.post("v1/<resource>/query", body)` | `client.py` |

> `income/balance/cash-flow/daily-prices/dividends/revenue-segments` return
> `{fields, values}` — wrap them in `fields_values_to_records()` to get
> `[{field: value}]`. The `*_ttm` / `company_profile` endpoints are already flat.
> All structured endpoints above are **free** (0 chunks) except
> `connected_entities` and `AnnotatedSearcher` (chunk-billed).

## The two data faces (do NOT say "Bigdata fails for Chinese / A-shares")

This split is the most important non-obvious conclusion — state it precisely:

| Face | Path | A-share / Chinese verdict |
|---|---|---|
| **Structured financial** (estimates, calendar, surprise, ratings, target, screener, **financials, prices, dividends, revenue segments, daily entity-sentiment**) | REST (`rest_ext.py`) | **Works** — via `rp_entity_id` resolved from the **English name or ISIN** (not the Chinese name). Data is fresh. Minor holes (some A-share price-targets return the entity with no numeric target). The daily `entity_sentiment` series lives **here** and works for any resolvable entity — it is **not** the dead end below. |
| **Unstructured Chinese NLP** (Chinese-news entity detection, per-chunk Chinese sentiment) | SDK search (`search.py`) | **Dead end** — a data-source-level gap, not an SDK bug: Chinese entity detection ≈ 0, per-chunk CJK sentiment is a doc-level inherited value, and `language` mislabels Chinese filings as English. Pair Bigdata with a China-domestic source for Chinese-language *chunk* content; use Bigdata for the structured face (incl. aggregate `entity_sentiment`) + ISIN/KG crosswalk + English-language chunk sentiment. |

## Cost discipline

`1 query_unit = 10 chunks` (official). **Only chunk-search is billed** — the
structured `/v1/*` endpoints (estimates, financials, prices, calendar, surprise,
ratings, the sentiment time series, screener…) are **free** (0 chunks,
contract-tested). `connected_entities` (co-mentions) and `AnnotatedSearcher`
**are** chunk-billed.

Three levers when you do pay for chunks:

1. **`ChunkLimit`, never a bare `int`.** `Search.run(int)` is a *document* limit
   billed by the full chunk page; `ChunkLimit(n)` bills per chunk.
   `AnnotatedSearcher.search` forces `ChunkLimit` for you. (We observed roughly a
   52x gap once — **a single measured data point, not stated in the official
   docs**; treat the exact multiple as indicative. The rule "use `ChunkLimit`"
   holds regardless, because `max_chunks` is the official billing unit.)
2. **Rerank bills only the *returned* chunks** (official) — pass a
   `rerank_threshold` to recall broadly but pay only for the high-relevance hits.
3. **Batch search is 50% cheaper** (`$0.0075` vs `$0.015` / qu) — use
   `BatchSearch` for a large multi-query backfill.

Use `CostModel` to veto an over-budget job *before* running it, and
`CostTracker.snapshot()` / `delta()` to measure real spend. Full accounting →
`references/cost_accounting.md`.

## Known pitfalls (already solved — don't re-debug these)

Each cost real debugging time and is fixed or guarded in the toolkit. Full
reproductions and fixes in **`references/known_pitfalls.md`**:

1. **First-handshake `SSL: UNEXPECTED_EOF`** → wrap calls in `rc()`; the SDK's
   urllib3 retry only covers HTTP status, not the SSL EOF.
2. **`All(entity, Keyword(kw))` raises `TypeError`** → combine with the `&`
   operator (`entity & Keyword(kw)`); `All` takes a single iterable. (Fixed in
   `AnnotatedSearcher.entity_query`.)
3. **The 52x doc-limit billing trap** → always `ChunkLimit`, never a bare `int`.
4. **Closure capture in loops** → bind loop vars: `rc(lambda q=q, dr=dr: ...)`.
5. **`analyst_estimates(period="quarter")` 400s above `limit≈20`.**
6. **`company_screener` filters must nest under `"filters"`** — flat top-level
   keys don't 400, they're silently dropped → unfiltered universe.
7. **`Document.reporting_period` is always `None`** (the SDK model drops a field
   present on the REST wire) → `fetch_reporting_period_raw`.

## What this skill will not do

- **Never hardcode an API key.** `BigdataClient` reads `BIGDATA_API_KEY` and
  fail-fasts if absent — no plaintext fallback (that is exactly the pattern
  secret scanners catch).
- **Only ever reads — never writes or uploads.** Every method is a read-only
  query (`uploads` is `NotImplementedError` in API-key mode anyway), so the
  toolkit can't mutate your account or push data anywhere.
- **Never invent an endpoint or a schema.** Every signature here is runtime
  L4-verified or marked L3 (doc-confirmed, not yet run); see
  `references/verified_api_signatures.md`. For a new endpoint, confirm the path
  via `docs.bigdata.com/llms.txt` rather than guessing.

## File layout

```
bigdata-skill/
├── SKILL.md                       # this file — routing + setup + quickstart
├── scripts/
│   ├── bigdata_toolkit/           # the verified, cost-guarded package
│   │   ├── client.py              # BigdataClient: SDK (.bd) + REST escape hatch (.http/.conn)
│   │   ├── kg.py                  # EntityResolver: name/ISIN/CUSIP/SEDOL → rp_entity_id
│   │   ├── search.py              # AnnotatedSearcher: chunks + sentiment + entity spans (SDK)
│   │   ├── rest_ext.py            # StructuredDataREST (estimates/financials/prices/dividends/sentiment/co-mentions/screener) + BatchSearch + fields_values_to_records — official REST
│   │   ├── cost.py                # CostTracker + CostModel: chunk billing + budget veto
│   │   └── retry.py               # rc(): SSL/transient-error retry passthrough
│   └── probe_example.py           # runnable end-to-end smoke test
└── references/
    ├── escape_hatch_architecture.md  # WHY the MCP is lossy; bd._api.http mechanism; adding endpoints
    ├── verified_api_signatures.md    # L4/L3-verified signatures + the two data faces, with evidence
    ├── cost_accounting.md            # chunk billing, the 52x trap, CostModel/CostTracker, budgeting
    └── known_pitfalls.md             # every pitfall above, with reproduction + fix
```

## References

| Read when you need to… | File |
|---|---|
| Understand *why* the MCP is insufficient and how the REST escape hatch works (and how to wrap a new `/v1/*` endpoint) | `references/escape_hatch_architecture.md` |
| Look up an exact verified method signature + its verification level | `references/verified_api_signatures.md` |
| Budget a backfill or debug a surprise quota burn | `references/cost_accounting.md` |
| Diagnose an error you hit while pulling data | `references/known_pitfalls.md` |