---
name: sciverse-academic-retrieval
description: Citation-grade academic literature retrieval (search, semantic chunks, byte-range read, figure fetch) over Sciverse, an open scientific platform indexing peer-reviewed and preprint papers.
license: Apache-2.0
metadata:
  skill-author: OpenDataLab
---
# Sciverse Academic Retrieval
Connects to the **Sciverse** SCP Server via the SCP Hub MCP gateway to perform
**citation-grade scientific literature retrieval** over a corpus that includes
peer-reviewed papers (Nature, Cell, …), preprints (arXiv, bioRxiv, …) and other
academic sources.
The server exposes 5 tools designed for **RAG by autonomous research agents**:
structured metadata search, natural-language semantic chunk retrieval,
byte-range source-text reading, figure/table image fetching, and schema catalog
lookup. Search results carry stable `doc_id` / `chunk_id` values for
reproducible citation.
## Usage
```python
import asyncio
import json
import base64
from mcp.client.streamable_http import streamablehttp_client
from mcp import ClientSession


class SciverseClient:
    """Sciverse SCP Server client (5 academic-retrieval tools).

    All requests transparently proxied by the SCP Hub to the Sciverse backend.
    Authentication uses the SCP-HUB-API-KEY header (your SCP Platform key).
    """

    def __init__(self, server_url: str, api_key: str):
        self.server_url = server_url
        self.api_key = api_key
        self.session = None

    async def connect(self):
        try:
            self.transport = streamablehttp_client(
                url=self.server_url,
                headers={"SCP-HUB-API-KEY": self.api_key},
            )
            self.read, self.write, self.get_session_id = await self.transport.__aenter__()
            self.session_ctx = ClientSession(self.read, self.write)
            self.session = await self.session_ctx.__aenter__()
            await self.session.initialize()
            return True
        except Exception as e:
            print(f"[sciverse] connect failed: {e}")
            return False

    async def disconnect(self):
        if self.session:
            await self.session_ctx.__aexit__(None, None, None)
        if hasattr(self, "transport"):
            await self.transport.__aexit__(None, None, None)

    def parse_text_result(self, result):
        """Extract concatenated text from a tool result's content blocks.

        Works for: search_papers, semantic_search, read_content, list_catalog.
        Returns: str (the tool's JSON payload as text).
        """
        if isinstance(result, dict):
            content_list = result.get("content") or []
        else:
            content_list = getattr(result, "content", []) or []
        texts = []
        for item in content_list:
            if isinstance(item, dict):
                if item.get("type") == "text":
                    texts.append(item.get("text") or "")
            else:
                if getattr(item, "type", None) == "text":
                    texts.append(getattr(item, "text", "") or "")
        return "".join(texts)

    def parse_image_result(self, result):
        """Extract a figure/table image (used by get_resource).

        Returns: dict with keys 'mime_type' (e.g. 'image/png') and 'bytes'
        (decoded binary). Returns None if the result is not an image.
        """
        if isinstance(result, dict):
            content_list = result.get("content") or []
        else:
            content_list = getattr(result, "content", []) or []
        for item in content_list:
            data = item.get("data") if isinstance(item, dict) else getattr(item, "data", None)
            mime = item.get("mimeType") if isinstance(item, dict) else getattr(item, "mimeType", None)
            type_ = item.get("type") if isinstance(item, dict) else getattr(item, "type", None)
            if type_ == "image" and data:
                return {"mime_type": mime, "bytes": base64.b64decode(data)}
        return None
```
### Initialize and use
```python
# Assumes the imports and the SciverseClient class from the block above.
SERVER_URL = "https://scp.intern-ai.org.cn/api/v1/mcp/43/Sciverse"
API_KEY = "<YOUR_SCP_HUB_API_KEY>"


async def main():
    client = SciverseClient(SERVER_URL, API_KEY)
    if not await client.connect():
        print("connect failed")
        return
    try:
        # 1. Structured search: recent transformer papers
        result = await client.session.call_tool(
            "search_papers",
            arguments={
                "query": "transformer attention",  # BM25 over title/abstract/journal
                "year_from": 2023,
                "page_size": 5,
            },
        )
        papers = json.loads(client.parse_text_result(result))
        print(f"search_papers hits: {len(papers.get('hits', []))}")

        # 2. Semantic search: RAG-style chunk retrieval
        result = await client.session.call_tool(
            "semantic_search",
            arguments={"query": "How does transformer attention work?", "top_k": 3},
        )
        chunks = json.loads(client.parse_text_result(result))
        for hit in chunks.get("hits", []):
            print(f"  - {hit['title']} (score={hit['score']:.3f}, doc_id={hit['doc_id']})")

        # 3. Read content: expand context around a known offset
        if chunks.get("hits"):
            first = chunks["hits"][0]
            result = await client.session.call_tool(
                "read_content",
                arguments={"doc_id": first["doc_id"], "offset": first["offset"], "limit": 4096},
            )
            text_window = json.loads(client.parse_text_result(result))
            print(f"read_content next_offset={text_window.get('next_offset')} more={text_window.get('more')}")

        # 4. List catalog: discover available filter fields and operators
        result = await client.session.call_tool(
            "list_catalog", arguments={"include_sample_values": False},
        )
        catalog = json.loads(client.parse_text_result(result))
        print(f"available filter fields: {len(catalog.get('fields', []))}")

        # 5. Get resource: fetch a figure referenced inside read_content's Markdown
        #    (Only call after read_content returned a Markdown snippet containing
        #    an image reference such as ![alt](figures/fig-3.png).)
        # result = await client.session.call_tool(
        #     "get_resource", arguments={"file_name": "figures/fig-3.png"},
        # )
        # image = client.parse_image_result(result)
        # if image:
        #     from pathlib import Path
        #     Path("fig-3.png").write_bytes(image["bytes"])
    finally:
        await client.disconnect()


asyncio.run(main())
```
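Step 5 of the walkthrough writes the figure to a hard-coded `fig-3.png`. As a minimal sketch, the dict returned by `parse_image_result` could instead be saved with an extension derived from its `mime_type`. `save_image` is a hypothetical helper, not part of the skill, and the bytes below are a stand-in rather than a real server response.

```python
import mimetypes
import tempfile
from pathlib import Path


def save_image(image: dict, stem: str, out_dir: Path) -> Path:
    """Hypothetical helper: write a parse_image_result() dict to disk."""
    # Derive the extension from the MIME type; fall back to .bin if unknown.
    ext = mimetypes.guess_extension(image.get("mime_type") or "") or ".bin"
    path = out_dir / f"{stem}{ext}"
    path.write_bytes(image["bytes"])
    return path


# Stand-in for a parse_image_result() return value (PNG signature bytes only).
fake_image = {"mime_type": "image/png", "bytes": b"\x89PNG\r\n\x1a\n"}
saved = save_image(fake_image, "fig-3", Path(tempfile.mkdtemp()))
print(saved.name)
```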
### Tool: `search_papers`
Structured metadata search by author, journal, year, subject, etc. Use when
the user knows specific filter values ("Hinton's papers from 2020-2023",
"Nature papers on CRISPR"). Do **not** use for free-text Q&A — that's
`semantic_search`.
- **Args**:
  - `query` (str, optional) — BM25 keyword over title/abstract/journal
  - `title_contains` (str, optional) — Substring match on title
  - `abstract_contains` (str, optional) — Substring match on abstract
  - `authors` (list[str], optional) — Any of these authors matches
  - `year_from` / `year_to` (int, optional) — Publication year range (inclusive)
  - `journals` (list[str], optional) — Journal names (any match)
  - `subjects` (list[str], optional) — Subject classification (e.g. "biology")
  - `sort_by_year` (str, default `"desc"`) — `desc` / `asc` / `none`
  - `page` (int, default 1), `page_size` (int, default 10, max 50)
  - `filters_advanced` (list, optional) — Escape hatch with full operator set
    (`FILTER_OP_EQ`, `IN`, `CONTAINS`, `GTE`, `LTE`, …) for fields not surfaced above
- **Returns**: JSON `{hits: [...], total: int}` where each hit has
  `doc_id`, `title`, `author`, `abstract`, `publication_venue_name`,
  `publication_published_year`.
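A minimal sketch of assembling the `arguments` dict for `search_papers` from the parameters listed above. `build_search_args` is a hypothetical helper, not part of the Sciverse API; it only drops unset filters and enforces the documented `page_size` cap and `sort_by_year` choices on the client side.

```python
from typing import Any, Optional

ALLOWED_SORT = {"desc", "asc", "none"}  # documented sort_by_year values
MAX_PAGE_SIZE = 50                      # documented page_size upper bound


def build_search_args(
    query: Optional[str] = None,
    authors: Optional[list[str]] = None,
    journals: Optional[list[str]] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    sort_by_year: str = "desc",
    page: int = 1,
    page_size: int = 10,
) -> dict[str, Any]:
    """Hypothetical helper: build the arguments dict for search_papers."""
    if sort_by_year not in ALLOWED_SORT:
        raise ValueError(f"sort_by_year must be one of {sorted(ALLOWED_SORT)}")
    if year_from is not None and year_to is not None and year_from > year_to:
        raise ValueError("year_from must not exceed year_to")
    args: dict[str, Any] = {
        "query": query,
        "authors": authors,
        "journals": journals,
        "year_from": year_from,
        "year_to": year_to,
        "sort_by_year": sort_by_year,
        "page": max(1, page),
        "page_size": min(max(1, page_size), MAX_PAGE_SIZE),
    }
    # Omit unset optional filters so the server applies its own defaults.
    return {k: v for k, v in args.items() if v is not None}


# "Nature papers on CRISPR" expressed as structured filters.
args = build_search_args(query="CRISPR", journals=["Nature"], year_from=2020, page_size=200)
print(args)
```

The resulting dict can be passed as `arguments=args` to `client.session.call_tool("search_papers", ...)`.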
### Tool: `semantic_search`
Natural-language semantic search returning relevant **paper chunks** for
RAG-style answering. Use for free-text questions ("How does attention
work?"). Typical chain: `semantic_search` → pick chunk → `read_content`.
- **Args**:
  - `query` (str, required) — Natural-language query, 1-200 words optimal
  - `top_k` (int, default 10, max 30)
  - `source_types` (list[str], optional) — Filter by `web` / `pdf`
  - `mode` (str, default `"balanced"`) — `fast` (~200ms keyword only) /
    `balanced` (~600ms hybrid) / `quality` (~2-4s LLM-rewrite + hybrid)
- **Returns**: JSON `{hits: [...]}` where each hit has
  `chunk_id`, `doc_id`, `chunk` (the matched text), `score`, `title`,
  `offset` (byte offset into source doc — pass to `read_content` for expansion).
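Because several chunks can come from the same paper, it can help to group hits by `doc_id` before citing. The sketch below does that using the hit fields listed above. The helper names and the sample payload are invented for illustration, and it assumes a higher `score` means a better match, which the tool description does not state.

```python
from collections import defaultdict


def group_hits_by_doc(hits: list[dict]) -> dict[str, list[dict]]:
    """Group semantic_search hits by doc_id, highest score first per paper."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for hit in hits:
        grouped[hit["doc_id"]].append(hit)
    for chunks in grouped.values():
        # Assumption: larger score = more relevant.
        chunks.sort(key=lambda h: h["score"], reverse=True)
    return dict(grouped)


def format_citation(hit: dict) -> str:
    """Render one hit as a citation line carrying its stable identifiers."""
    return f'{hit["title"]} [doc_id={hit["doc_id"]}, chunk_id={hit["chunk_id"]}]'


# Illustrative payload only: IDs, titles, and scores are invented.
sample_hits = [
    {"chunk_id": "c1", "doc_id": "d1", "chunk": "...", "score": 0.71, "title": "Paper A", "offset": 0},
    {"chunk_id": "c2", "doc_id": "d2", "chunk": "...", "score": 0.65, "title": "Paper B", "offset": 2048},
    {"chunk_id": "c3", "doc_id": "d1", "chunk": "...", "score": 0.83, "title": "Paper A", "offset": 8192},
]
grouped = group_hits_by_doc(sample_hits)
citations = [format_citation(chunks[0]) for chunks in grouped.values()]
print("\n".join(citations))
```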
### Tool: `read_content`
Read a UTF-8 byte range of a paper's source text. Typically called with a
`doc_id`/`offset` returned by `semantic_search` to expand context (read more
bytes before or after a chunk for fuller answers).
- **Args**:
  - `doc_id` (str, required) — Paper ID from `search_papers` / `semantic_search`
  - `offset` (int, default 0) — Byte offset to start reading
  - `limit` (int, default 4096, max 16384) — Bytes to read
- **Returns**: JSON `{text: str, bytes_returned: int, next_offset: int, more: bool}`.
  Markdown text may contain figure references like `![alt](file_name)`; pass
  `file_name` to `get_resource` to fetch the image.
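The `next_offset` / `more` fields suggest a simple paging loop for reading more than one window. The sketch below shows one way to do it under stated assumptions: `read_span` and `window_start` are hypothetical helpers, and `fake_read` is an in-memory stand-in for the real tool call so the loop can run without a server.

```python
import asyncio
from typing import Awaitable, Callable

ReadFn = Callable[[str, int, int], Awaitable[dict]]
MAX_LIMIT = 16384  # documented upper bound for `limit`


def window_start(chunk_offset: int, before: int) -> int:
    """Start `before` bytes ahead of a chunk, never below offset 0."""
    return max(0, chunk_offset - before)


async def read_span(read: ReadFn, doc_id: str, offset: int,
                    max_bytes: int, limit: int = 4096) -> str:
    """Follow next_offset until `more` is false or max_bytes is reached."""
    parts: list[str] = []
    fetched = 0
    while fetched < max_bytes:
        page = await read(doc_id, offset, min(limit, MAX_LIMIT, max_bytes - fetched))
        parts.append(page["text"])
        fetched += page["bytes_returned"]
        if not page["more"] or page["bytes_returned"] == 0:
            break
        offset = page["next_offset"]
    return "".join(parts)


# Stand-in for read_content: serves byte slices of an in-memory ASCII document.
DOC = ("attention is all you need. " * 20).encode("utf-8")


async def fake_read(doc_id: str, offset: int, limit: int) -> dict:
    data = DOC[offset:offset + limit]
    end = offset + len(data)
    return {"text": data.decode("utf-8"), "bytes_returned": len(data),
            "next_offset": end, "more": end < len(DOC)}


text = asyncio.run(read_span(fake_read, "d1", window_start(100, 64), max_bytes=200, limit=64))
print(len(text))
```

Against the real server, `read` would wrap `client.session.call_tool("read_content", ...)` followed by `json.loads(client.parse_text_result(result))`.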
### Tool: `get_resource`
Fetch the binary bytes of a paper figure / table image referenced inside
`read_content`'s Markdown. Use when the user asks to see / describe a figure
and `read_content` output contains an image reference.
- **Args**:
  - `file_name` (str, required) — Relative path taken from a Markdown image
    reference (`![alt](file_name)`). Must not contain `..` or start with `/`.
- **Returns**: Image content block — `data` (base64) + `mimeType` (`image/*`).
  Multimodal agents (Claude, GPT-4V, Gemini, …) can read it directly.
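Before calling `get_resource`, an agent needs the image paths out of the Markdown returned by `read_content`. A minimal sketch, assuming the references use standard Markdown image syntax: `extract_figure_refs` and `is_safe_file_name` are hypothetical helpers that apply the two documented `file_name` constraints on the client side, and the snippet paths are invented.

```python
import re

# Standard Markdown image syntax: ![alt](target)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def is_safe_file_name(file_name: str) -> bool:
    """Apply the documented constraints: no '..' and no leading '/'."""
    return bool(file_name) and ".." not in file_name and not file_name.startswith("/")


def extract_figure_refs(markdown: str) -> list[str]:
    """Return image targets in order of appearance, skipping unsafe ones."""
    return [target for target in IMAGE_REF.findall(markdown) if is_safe_file_name(target)]


# Illustrative snippet; the paths are invented.
snippet = "See ![Figure 3](figures/fig-3.png) and ![bad](../secret.png)."
refs = extract_figure_refs(snippet)
print(refs)
```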
### Tool: `list_catalog`
Returns the schema catalog for `search_papers`: every field name, type,
whether it's filterable / sortable / default-returned, human description, and
applicable filter operators. Use when constructing precise `search_papers`
filters or facing an ambiguous field need.
- **Args**:
  - `include_sample_values` (bool, default `false`) — If `true`, also fetch
    the top-20 values for enum-like fields (cached for 24h; the first call
    takes a few hundred ms).
- **Returns**: JSON `{fields: [...]}` where each field has `name`, `type`
  (`string`/`integer`/`list[string]`/…), `filterable`, `sortable`,
  `default_return`, `description`, `applicable_operators`, and optionally
  `sample_values`.
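One way to use the catalog is to check that a field is filterable, and that an operator applies to it, before building a `search_papers` filter. The sketch below relies only on the field keys listed above. The helper names and the sample catalog entries are invented for illustration and do not reflect the real Sciverse schema.

```python
def filterable_fields(catalog: dict) -> dict[str, list[str]]:
    """Map each filterable field name to its applicable operators."""
    return {
        field["name"]: list(field.get("applicable_operators") or [])
        for field in catalog.get("fields", [])
        if field.get("filterable")
    }


def supports(catalog: dict, field_name: str, operator: str) -> bool:
    """True if the field is filterable and lists the given operator."""
    return operator in filterable_fields(catalog).get(field_name, [])


# Illustrative catalog only: these entries are invented.
sample_catalog = {
    "fields": [
        {"name": "publication_published_year", "type": "integer", "filterable": True,
         "sortable": True, "default_return": True, "description": "Publication year",
         "applicable_operators": ["FILTER_OP_EQ"]},
        {"name": "abstract", "type": "string", "filterable": False,
         "sortable": False, "default_return": True, "description": "Abstract text",
         "applicable_operators": []},
    ]
}
print(filterable_fields(sample_catalog))
```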
### Use Cases
- **Drug discovery / pharmacology**: literature scoping for a target before
triggering wet-lab skills; RAG context for ADMET / MoA reasoning.
- **Protein science**: gather structure/function papers around a UniProt ID
before predicting mutations or binding sites.
- **Genomics & rare disease**: pull recent papers on a variant / phenotype
for evidence-grade reasoning, then cite by `doc_id`.
- **Chemistry / materials**: find prior art around a SMILES or reaction
before computing properties.
- **Cross-domain literature review**: agentic survey writing — chain
`semantic_search` → `read_content` to assemble citation-grounded
summaries with stable `doc_id` references for verifiability.