claw-semantic-sim
$ npx mdskill add aAAaqwq/AGI-Super-Team/claw-semantic-sim

Calculate disease neglect metrics using PubMedBERT embeddings.
- Quantifies semantic isolation and knowledge transfer gaps in research.
- Depends on PubMedBERT embeddings and GBD taxonomy identifiers.
- Scores diseases using SII, KTP, RCC, and temporal drift metrics.
- Delivers multi-panel figures, markdown reports, and JSON data.
---
name: claw-semantic-sim
version: 0.1.0
description: Semantic Similarity Index for disease research literature using PubMedBERT embeddings
author: Manuel Corpas
license: MIT
tags:
- health-equity
- semantic-analysis
- NLP
- PubMedBERT
- disease-neglect
inputs:
  - name: diseases
    type: file
    format: [json, csv]
    description: List of diseases with GBD taxonomy identifiers
  - name: pubmed-cache
    type: directory
    format: directory
    description: Optional pre-fetched PubMed abstracts (skips retrieval if provided)
outputs:
  - name: figure
    type: file
    format: [png, pdf]
    description: Multi-panel figure showing semantic isolation landscape, disease rankings, and NTD analysis
  - name: report
    type: file
    format: markdown
    description: Semantic analysis report with isolation scores, knowledge transfer metrics, and temporal drift
  - name: metrics
    type: file
    format: json
    description: Machine-readable metrics for all diseases (SII, KTP, RCC, temporal drift)
metadata:
openclaw:
category: bioinformatics
homepage: https://github.com/ClawBio/ClawBio
min_python: "3.11"
dependencies:
- torch
- transformers
- pandas
- numpy
- h5py
- scikit-learn
- umap-learn
- matplotlib
- seaborn
- biopython
- networkx
- tqdm
system_dependencies: []
hardware:
- Apple Silicon MPS recommended (CPU fallback available)
---
# 🦖 Semantic Similarity Index
Measure how isolated or connected disease research is across the global biomedical literature, using PubMedBERT embeddings of PubMed abstracts for 175 diseases from the Global Burden of Disease (GBD) taxonomy.
## What it does
1. Takes a disease list (GBD taxonomy) as input
2. Retrieves PubMed abstracts (2000-2025) for each disease with quality filtering
3. Generates 768-dimensional PubMedBERT embeddings for every abstract
4. Computes four semantic equity metrics per disease:
   - **Semantic Isolation Index (SII)**: average cosine distance to k-nearest disease neighbours; higher = more isolated, less connected research
   - **Knowledge Transfer Potential (KTP)**: cross-disease centroid similarity; higher = more potential for research spillover
   - **Research Clustering Coefficient (RCC)**: within-disease embedding variance; higher = more diverse research approaches
   - **Temporal Semantic Drift**: cosine distance between yearly centroids; measures how research focus evolves
5. Generates publication-quality multi-panel figures:
   - **Panel A**: Semantic isolation by disease category (boxplot)
   - **Panel B**: Top 20 most semantically isolated diseases (bar chart, NTD/Global South colour-coded)
   - **Panel C**: Semantic isolation vs research volume (scatter with regression)
   - **Panel D**: NTD vs non-NTD significance test (Welch's t-test, Cohen's d)
6. Produces a markdown report with all metrics, rankings, and reproducibility bundle
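
The four metrics in step 4 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the skill's actual implementation. The function names, the default `k`, averaging KTP over all other disease centroids, taking RCC as the mean per-dimension variance, and averaging drift over consecutive years are all choices made here for illustration. The standard library is used only to keep the example dependency-free; an array library would be the natural choice for 768-dimensional embeddings at scale.

```python
import math
from statistics import fmean, pvariance

Vector = list[float]


def centroid(vectors: list[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    return [fmean(dim) for dim in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity (0 = same direction, 1 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms  # assumes neither vector is all zeros


def semantic_metrics(embeddings: dict[str, list[Vector]], k: int = 2) -> dict[str, dict[str, float]]:
    """Per-disease SII, KTP and RCC from abstract embeddings (hypothetical helper)."""
    centroids = {name: centroid(vecs) for name, vecs in embeddings.items()}
    out = {}
    for name, vecs in embeddings.items():
        # Distances from this disease's centroid to every other centroid, nearest first.
        dists = sorted(
            cosine_distance(centroids[name], other)
            for other_name, other in centroids.items()
            if other_name != name
        )
        out[name] = {
            # SII: mean cosine distance to the k nearest disease centroids.
            "sii": fmean(dists[:k]),
            # KTP (assumed form): mean cosine similarity to all other centroids.
            "ktp": fmean([1.0 - d for d in dists]),
            # RCC (assumed form): variance per dimension, averaged over dimensions.
            "rcc": fmean([pvariance(dim) for dim in zip(*vecs)]),
        }
    return out


def temporal_drift(by_year: dict[int, list[Vector]]) -> float:
    """Mean cosine distance between centroids of consecutive years (assumed form)."""
    years = sorted(by_year)
    cents = [centroid(by_year[y]) for y in years]
    steps = [cosine_distance(a, b) for a, b in zip(cents, cents[1:])]
    return fmean(steps) if steps else 0.0
```

With toy 2-dimensional vectors, a disease whose centroid lies between two others gets a lower SII and a higher KTP than either of them, which matches the reading of the metrics given in the list above.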
## Why this exists
If you ask ChatGPT to "measure research neglect for diseases," it will:
- Not know which embedding model to use for biomedical text
- Hallucinate metrics that sound plausible but have no methodological grounding
- Skip quality filtering (year coverage, abstract coverage, minimum papers)
- Not handle MPS acceleration or checkpointed batch processing
- Produce a single scatter plot with no disease classification
This skill encodes the correct methodological decisions:
- Uses PubMedBERT, a BERT model pretrained on PubMed text and a widely used baseline for biomedical NLP
- Fetches from PubMed with exponential backoff and NCBI rate limiting
- Quality filters: year coverage >= 70%, abstract coverage >= 95%, minimum 50 papers
- Batch embedding with Apple MPS acceleration and CPU fallback
- Checkpointed processing (resume after interruption)
- HDF5 storage with gzip compression and SHA-256 checksums
- Classification against WHO NTD list and Global South priority diseases
- Statistical significance testing (Welch's t-test, Cohen's d)
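
The quality filter in the list above could look roughly like the following. The thresholds (70%, 95%, 50 papers) and the 2000-2025 window come from this document, but the definitions are assumptions: "year coverage" is taken to be the fraction of years in the window with at least one paper, and "abstract coverage" the fraction of records with a non-empty abstract. The `Paper` class and `passes_quality_filter` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical minimal record for one PubMed entry."""
    year: int
    abstract: str


def passes_quality_filter(
    papers: list[Paper],
    start_year: int = 2000,
    end_year: int = 2025,
    min_year_coverage: float = 0.70,
    min_abstract_coverage: float = 0.95,
    min_papers: int = 50,
) -> bool:
    """Return True if a disease's paper set meets all three thresholds."""
    if len(papers) < min_papers:
        return False
    # Assumed definition: share of years in the window with at least one paper.
    years_expected = end_year - start_year + 1
    years_seen = {p.year for p in papers if start_year <= p.year <= end_year}
    year_coverage = len(years_seen) / years_expected
    # Assumed definition: share of records whose abstract is not blank.
    abstract_coverage = sum(1 for p in papers if p.abstract.strip()) / len(papers)
    return year_coverage >= min_year_coverage and abstract_coverage >= min_abstract_coverage
```

Checking the paper count first avoids a division by zero for diseases with no records at all.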
## Key Finding
Neglected tropical diseases (NTDs) are significantly more semantically isolated than other conditions (P < 0.001, Cohen's d above 0.8, conventionally a large effect). Their literature sits in knowledge silos with few cross-disciplinary research bridges. The 25 most isolated diseases are disproportionately Global South priority conditions.
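
For readers unfamiliar with the two statistics quoted above, here is a minimal standard-library sketch of Welch's t statistic (with Welch-Satterthwaite degrees of freedom) and Cohen's d using a pooled standard deviation. The function names are hypothetical and this is not taken from the skill's code. It stops short of a P value, which needs the t distribution; `scipy.stats.ttest_ind(a, b, equal_var=False)` is one way to obtain it.

```python
import math
from statistics import fmean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and degrees of freedom for two independent samples."""
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d with a pooled standard deviation (one common definition)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (fmean(a) - fmean(b)) / math.sqrt(pooled_var)
```

In this setting `a` would hold the SII values of NTDs and `b` those of all other diseases. Welch's test is the variant that does not assume equal variances in the two groups.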
## Pipeline
```
05-00-heim-sem-setup.py # Validate environment, create directories
05-01-heim-sem-fetch.py # Retrieve PubMed abstracts (checkpointed)
05-02-heim-sem-embed.py # Generate PubMedBERT embeddings (MPS/CPU)
05-03-heim-sem-compute.py # Compute SII, KTP, RCC, temporal drift
05-04-heim-sem-figures.py # Generate publication figures
05-05-heim-sem-integrate.py # Merge with biobank + clinical trial dimensions
```
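
One way to run the stages in order is a small driver like the sketch below. It assumes that each script can be run without arguments and that all six live in one directory; neither is confirmed by this document. `build_commands` and `run_pipeline` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path

# Stage scripts in execution order, as listed in the pipeline above.
STAGES = [
    "05-00-heim-sem-setup.py",
    "05-01-heim-sem-fetch.py",
    "05-02-heim-sem-embed.py",
    "05-03-heim-sem-compute.py",
    "05-04-heim-sem-figures.py",
    "05-05-heim-sem-integrate.py",
]


def build_commands(script_dir: Path = Path(".")) -> list[list[str]]:
    """Build one command per stage, using the current Python interpreter."""
    return [[sys.executable, str(script_dir / name)] for name in STAGES]


def run_pipeline(script_dir: Path = Path("."), dry_run: bool = False) -> None:
    """Run every stage in order, stopping at the first failure."""
    for cmd in build_commands(script_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a stage exits non-zero.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(dry_run=True)` prints the commands without executing them, which is a cheap way to check paths before starting a long fetch or embedding run.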
### Demo (works out of the box)
```bash
python semantic_sim.py --demo --output demo_report
```
The demo uses pre-computed embeddings and metrics for 175 GBD diseases, so it generates the full 4-panel figure without fetching abstracts or running the embedding model.
## Example Output
```
Semantic Similarity Index
=========================
Diseases analysed: 175
Total PubMed abstracts: 13,100,000
Embedding model: PubMedBERT (768-dim)
Metric Ranges:
SII: 0.0412 - 0.1893
KTP: 0.6234 - 0.9187
RCC: 0.0891 - 0.3421
Key Finding:
NTDs show +38% higher semantic isolation
P < 0.0001, Cohen's d = 0.84
14/25 most isolated diseases are Global South priority
Figures saved to: demo_report/
Fig5_Semantic_Structure.png (300 dpi)
Fig5_Semantic_Structure.pdf (vector)
Reproducibility:
commands.sh | environment.yml | checksums.sha256
```
## Interpretation Guide
- **High SII**: Disease research exists in a knowledge silo; limited cross-disciplinary bridges
- **Low KTP**: Research on this disease has few methodological overlaps with others
- **High RCC**: Diverse research approaches within the disease (many subtopics)
- **High Temporal Drift**: Research focus has shifted significantly over time
- NTDs shown in **red**, Global South diseases in **orange**, others in **grey**
- The scatter plot (Panel C) reveals the inverse relationship between research volume and isolation
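
To apply this guide to the JSON metrics output, a ranking by SII can be sketched as follows. The schema is an assumption: the document does not specify the JSON layout, so this example supposes an object keyed by disease name with an `"SII"` field per entry. The disease names and values in the usage note are placeholders, not results.

```python
import json


def rank_by_isolation(metrics_json: str, top_n: int = 20) -> list[tuple[str, float]]:
    """Return the top_n (disease, SII) pairs, most isolated first.

    Assumes a JSON object of the form {"<disease>": {"SII": <float>, ...}, ...}.
    """
    metrics = json.loads(metrics_json)
    pairs = [(name, float(values["SII"])) for name, values in metrics.items()]
    # Higher SII means more isolated, so sort in descending order.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:top_n]
```

For example, `rank_by_isolation('{"disease_a": {"SII": 0.05}, "disease_b": {"SII": 0.18}}', top_n=1)` returns `[("disease_b", 0.18)]`.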
## Citation
If you use this skill in a publication, please cite:
- Corpas, M. et al. (2026). HEIM: Health Equity Index for Measuring structural bias in biomedical research. Under review.
- Corpas, M. (2026). ClawBio. https://github.com/ClawBio/ClawBio