citation-network

Name: citation-network
Author: aipoch/medical-research-skills

$ npx mdskill add aipoch/medical-research-skills/citation-network

Convert citation pairs into visual networks for rapid literature analysis.

Identifies influential papers and research communities from citation data.
Requires a CSV input with source and target columns describing citation relationships.
De-duplicates papers by DOI or title to build an accurate directed graph.
Produces an interactive HTML view and a Gephi-compatible GEXF file.

---
name: citation-network
description: Build and visualize a citation network from a source/target CSV to identify key papers, communities, and emerging hotspots; use when you have citation pairs and need fast literature review or trend analysis.
license: MIT
author: aipoch
---
> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

## When to Use

- You have a citation relationship table (who cites whom) and want to quickly turn it into a directed network for analysis.
- You are conducting a literature review and need to identify influential papers (high in-degree / centrality) and core clusters.
- You want to detect community structures (research subfields) and compare them across time or datasets.
- You need an interactive, shareable visualization (HTML) or a Gephi-importable graph file (GEXF).
- You are positioning a new project and want evidence of research hotspots and bridging papers between communities.

## Key Features

- Builds a directed citation graph from a minimal CSV containing `source` and `target`.
- De-duplicates nodes by identifier (DOI recommended; otherwise unique titles).
- Exports:
  - `citation_network.gexf` for Gephi and other graph tools
  - `network_metrics.json` for basic network statistics
  - `citation_network.html` for interactive browser viewing (auto-generated by the build script)
- Keeps each execution reproducible and isolated in its own run directory under `outputs/runs/<timestamp>/`.
- Optionally controls the input encoding to avoid garbled characters (e.g., UTF-8 / UTF-8-SIG).

## Dependencies

- Python 3.10+
- pandas >= 2.0
- networkx >= 3.0
- (Optional, for HTML visualization) pyvis >= 0.3

## Example Usage

### 1) Initialize a run directory

```bash
python scripts/init_run.py
```

This creates a new run folder:

```text
outputs/runs/<timestamp>/
  config.json
  data/
  outputs/
```

### 2) Prepare the citation CSV (minimal)

Create `citations.csv` and place it into:

```text
outputs/runs/<timestamp>/data/citations.csv
```

Minimal CSV format:

```csv
source,target
Paper A,Paper B
Paper A,Paper C
```

Recommended DOI-based identifiers:

```csv
source,target
10.1234/abcd.1,10.1234/abcd.2
10.1234/abcd.1,10.1234/abcd.3
```
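DOIs are case-insensitive and are often copied with a resolver URL or a `doi:` prefix, so the same paper can end up under several spellings. The sketch below shows one way to canonicalize identifiers before writing `citations.csv`. `normalize_doi` is a hypothetical helper and not part of the skill's scripts; whether the builder performs any normalization itself is not documented here.

```python
import re

# Hypothetical helper (not part of the skill's scripts): canonicalize a DOI
# so that the same paper always maps to the same node identifier.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip surrounding whitespace and resolver/`doi:` prefixes, then lowercase."""
    cleaned = _DOI_PREFIX.sub("", raw.strip())
    return cleaned.lower()


# Two spellings of the same citing paper collapse to one identifier.
pairs = [
    ("https://doi.org/10.1234/ABCD.1", "10.1234/abcd.2"),
    ("doi:10.1234/abcd.1", "10.1234/abcd.3"),
]
rows = [(normalize_doi(source), normalize_doi(target)) for source, target in pairs]
```

After this step, `rows` can be written out under a `source,target` header.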

### 3) Confirm configuration

Open:

```text
outputs/runs/<timestamp>/config.json
```

Ensure the configured input filename and column names match your CSV (at minimum `source` and `target`). If characters appear garbled, set an explicit encoding (e.g., `utf-8` or `utf-8-sig`) in the `input_encoding` field, if the config supports it.
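A quick pre-flight check can confirm that the header matches before building. The sketch below uses only the standard library; `missing_columns` is a hypothetical helper, not one of the skill's scripts. It also shows why the encoding matters: a file saved with a byte-order mark and read as plain `utf-8` keeps the mark glued to the first column name, so `source` appears to be missing.

```python
import csv
from pathlib import Path


def missing_columns(csv_path, encoding="utf-8-sig", required=("source", "target")):
    """Return the required column names that are absent from the CSV header."""
    with Path(csv_path).open(encoding=encoding, newline="") as handle:
        # An empty file yields an empty header, so every column is reported.
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

Reading with `utf-8-sig` is a reasonable default for this check, because it strips a leading byte-order mark when one is present and otherwise behaves like `utf-8`.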

### 4) Build the citation network

```bash
python scripts/build_citation_network.py
```

The build script will also generate the HTML automatically (you do not need to run `scripts/export_gexf_html.py` manually).

### 5) Inspect outputs

Expected outputs under the same run directory:

- `citation_network.gexf` (import into Gephi)
- `network_metrics.json` (node/edge counts, density, etc.)
- `citation_network.html` (open in a browser)
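Besides opening the files in Gephi or a browser, the GEXF and JSON outputs can be checked programmatically. The following is a minimal sketch using networkx; `load_outputs` and `top_cited` are hypothetical helpers, and the sketch makes no assumption about which keys `network_metrics.json` contains.

```python
import json
from pathlib import Path

import networkx as nx


def load_outputs(out_dir):
    """Load the exported graph and metrics from the folder holding the outputs."""
    out_dir = Path(out_dir)
    graph = nx.read_gexf(str(out_dir / "citation_network.gexf"))
    metrics = json.loads(
        (out_dir / "network_metrics.json").read_text(encoding="utf-8")
    )
    return graph, metrics


def top_cited(graph, n=5):
    """Return the n (node, in-degree) pairs with the most incoming citations."""
    ranked = sorted(graph.in_degree(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]
```

Pass whichever folder inside the run directory holds the three output files, then call `top_cited(graph)` for a quick sanity check of the most-cited papers.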

## Implementation Details

### Data Model

- **Nodes**: papers, identified by the value in `source`/`target` (DOI preferred; otherwise a unique, consistent title string).
- **Edges**: directed citations `source -> target`.
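The data model above can be sketched with pandas and networkx. This is an illustration under stated assumptions, not the code in `scripts/build_citation_network.py`: `build_graph` is a hypothetical function, and the whitespace trimming and duplicate-row removal are choices made for the example.

```python
import io

import networkx as nx
import pandas as pd

# Inline sample data; the last row repeats the first citation pair.
CSV_TEXT = """source,target
Paper A,Paper B
Paper A,Paper C
Paper A,Paper B
"""


def build_graph(edges: pd.DataFrame) -> nx.DiGraph:
    """Build a directed graph with one node per distinct identifier."""
    cleaned = edges[["source", "target"]].dropna().astype(str)
    # Trim stray whitespace so " Paper B" and "Paper B" map to one node.
    cleaned = cleaned.apply(lambda column: column.str.strip())
    cleaned = cleaned.drop_duplicates()
    return nx.from_pandas_edgelist(
        cleaned, source="source", target="target", create_using=nx.DiGraph
    )


graph = build_graph(pd.read_csv(io.StringIO(CSV_TEXT)))
```

Because a `DiGraph` stores each node once, identifiers that are exactly equal are merged automatically; the edge direction `source -> target` is preserved.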

### Input Requirements and Constraints

- The network builder reads **only** the `source` and `target` columns.
- Additional columns (e.g., author/year/venue) are ignored by the current scripts.
- If you need metadata, maintain a separate table for downstream joining/annotation (not consumed by the builder), for example:

```csv
id,title,authors,year,doi
10.1234/abcd.1,Paper A,"Zhang, Wei; Li, Ming",2021,10.1234/abcd.1
10.1234/abcd.2,Paper B,"Wang, Fang",2019,10.1234/abcd.2
```
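For downstream annotation, a metadata table like the one above can be attached to graph nodes after the network is built. The sketch below assumes the `id` column matches the node identifiers exactly; `annotate` is a hypothetical helper and is not consumed by the builder.

```python
import io

import networkx as nx
import pandas as pd

META_CSV = '''id,title,authors,year,doi
10.1234/abcd.1,Paper A,"Zhang, Wei; Li, Ming",2021,10.1234/abcd.1
10.1234/abcd.2,Paper B,"Wang, Fang",2019,10.1234/abcd.2
'''


def annotate(graph, metadata, id_column="id"):
    """Copy metadata rows onto matching nodes; return nodes without metadata."""
    records = (
        metadata.drop_duplicates(id_column)
        .set_index(id_column)
        .to_dict(orient="index")
    )
    # Only annotate identifiers that actually occur in the graph.
    known = {node: attrs for node, attrs in records.items() if node in graph}
    nx.set_node_attributes(graph, known)
    return [node for node in graph if node not in records]


graph = nx.DiGraph(
    [("10.1234/abcd.1", "10.1234/abcd.2"), ("10.1234/abcd.1", "10.1234/abcd.3")]
)
metadata = pd.read_csv(io.StringIO(META_CSV), dtype={"id": str})
unmatched = annotate(graph, metadata)
```

The returned list of unmatched nodes is a convenient way to spot papers that appear in the citation pairs but are absent from the metadata table.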

### Run Directory Standard

- Always run `python scripts/init_run.py` before each execution to create a new run directory.
- All inputs, configs, and outputs must remain inside `outputs/runs/<timestamp>/`.
- By default, scripts operate on the latest run directory under `outputs/runs/`.
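Resolving the latest run directory could look like the sketch below. It assumes that the timestamp folder names sort lexicographically in chronological order (for example, zero-padded `YYYYMMDD_HHMMSS`); the actual naming scheme and lookup logic of the scripts are not specified here, and `latest_run_dir` is a hypothetical helper.

```python
from pathlib import Path


def latest_run_dir(runs_root="outputs/runs"):
    """Return the run directory whose name sorts last under runs_root."""
    root = Path(runs_root)
    candidates = [p for p in root.iterdir() if p.is_dir()] if root.is_dir() else []
    if not candidates:
        raise FileNotFoundError(f"no run directories found under {root}")
    # Assumption: names are sortable timestamps, so the maximum is the newest.
    return max(candidates, key=lambda path: path.name)
```

If the folder names are not sortable, comparing modification times (`path.stat().st_mtime`) is an alternative key.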

### Metrics and Analysis (Conceptual)

- Basic network statistics are exported to `network_metrics.json` (e.g., node/edge counts, density).
- Typical downstream analyses include:
  - centrality (degree, betweenness)
  - community detection (e.g., Louvain), if enabled/implemented in the pipeline
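These analyses can be run with networkx on a graph built from the citation pairs or loaded from the GEXF file. The sketch below is illustrative only: `analyze` and its result keys are hypothetical, and running Louvain on an undirected copy of the graph is a simplification chosen for the example rather than the pipeline's documented behavior.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities


def analyze(graph: nx.DiGraph, top_n: int = 3, seed: int = 0) -> dict:
    """Compute density, in-degree and betweenness rankings, and communities."""
    in_degree = dict(graph.in_degree())
    betweenness = nx.betweenness_centrality(graph)
    # Simplification: ignore edge direction when detecting communities.
    communities = louvain_communities(graph.to_undirected(), seed=seed)
    return {
        "density": nx.density(graph),
        "most_cited": sorted(in_degree, key=in_degree.get, reverse=True)[:top_n],
        "top_bridges": sorted(betweenness, key=betweenness.get, reverse=True)[:top_n],
        "communities": [sorted(members) for members in communities],
    }


# Two small clusters (A, B, C and D, E, F) joined by the citation C -> F.
demo = nx.DiGraph(
    [("A", "C"), ("B", "C"), ("A", "B"),
     ("D", "F"), ("E", "F"), ("D", "E"),
     ("C", "F")]
)
report = analyze(demo)
```

In this toy graph, `F` receives the most citations and `C` is the only node lying on shortest paths between other papers, which is the pattern to look for when searching for bridging papers between communities.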

### Common Failure Modes

- **Garbled characters**: ensure CSV is UTF-8/UTF-8-SIG; set `input_encoding` in `config.json` if available.
- **Duplicate nodes**: identical identifiers are treated as the same node, so inconsistent spellings of one paper show up as separate nodes; prefer DOIs or enforce unique, consistent titles.
- **Empty or missing output**: verify the CSV header names match the configured `source`/`target` columns.

### Related References

- Data cleaning checklist: `references/data-cleaning-checklist.md`
- Network metrics notes: `references/network-metrics-notes.md`
- Additional documentation: `references/README.md`
