bio-entrez-link

Name: bio-entrez-link
Author: GPTomics/bioSkills
$npx mdskill add GPTomics/bioSkills/bio-entrez-link
Connect NCBI databases to discover related biological records.
Navigates gene-to-protein and sequence-to-publication relationships.
Depends on Biopython Bio.Entrez and Entrez Direct APIs.
Executes cross-database queries using the ELink utility function.
Returns structured record links for further biological exploration.
SKILL.md
.github/skills/bio-entrez-linkView on GitHub ↗
---
name: bio-entrez-link
description: Find cross-references between NCBI databases using Biopython Bio.Entrez. Use when navigating from genes to proteins, sequences to publications, finding related records, or discovering database relationships.
tool_type: python
primary_tool: Bio.Entrez
---

## Version Compatibility

Reference examples tested with: BioPython 1.83+, Entrez Direct 21.0+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Entrez Link

Navigate between NCBI databases using Biopython's Entrez module (ELink utility).

**"Find related records across NCBI databases"** → Use ELink to discover cross-references (e.g., gene to protein, sequence to publication).
- Python: `Entrez.elink(dbfrom=..., db=..., id=...)` (BioPython)
- CLI: `elink -db protein -target gene` (Entrez Direct)

## Required Setup

```python
from Bio import Entrez

Entrez.email = 'your.email@example.com'  # Required by NCBI
Entrez.api_key = 'your_api_key'          # Optional, raises rate limit
```

## Core Function

### Entrez.elink() - Cross-Database Links

Find related records in the same or different databases.

```python
# Find proteins linked to a gene
handle = Entrez.elink(dbfrom='gene', db='protein', id='672')
record = Entrez.read(handle)
handle.close()

# Extract linked IDs
linkset = record[0]
if linkset['LinkSetDb']:
    links = linkset['LinkSetDb'][0]['Link']
    protein_ids = [link['Id'] for link in links]
    print(f"Found {len(protein_ids)} linked proteins")
```

**Key Parameters:**
| Parameter | Description | Example |
|-----------|-------------|---------|
| `dbfrom` | Source database | `'gene'` |
| `db` | Target database | `'protein'` |
| `id` | Source record ID(s) | `'672'` or `'672,675'` |
| `linkname` | Specific link type | `'gene_protein_refseq'` |
| `cmd` | Link command | `'neighbor'`, `'neighbor_score'` |

### ELink Result Structure

```python
record[0]                          # First linkset
record[0]['DbFrom']                # Source database
record[0]['IdList']                # Input IDs
record[0]['LinkSetDb']             # List of link results
record[0]['LinkSetDb'][0]['DbTo']  # Target database
record[0]['LinkSetDb'][0]['LinkName']  # Link name
record[0]['LinkSetDb'][0]['Link']  # List of linked records
record[0]['LinkSetDb'][0]['Link'][0]['Id']  # Linked ID
```

## Common Link Paths

### Gene to Other Databases

| From | To | Link Name | Description |
|------|-----|-----------|-------------|
| gene | protein | `gene_protein` | All proteins |
| gene | protein | `gene_protein_refseq` | RefSeq proteins only |
| gene | nucleotide | `gene_nuccore` | Nucleotide sequences |
| gene | nucleotide | `gene_nuccore_refseqrna` | RefSeq mRNA |
| gene | pubmed | `gene_pubmed` | Related publications |
| gene | homologene | `gene_homologene` | Homologs |
| gene | snp | `gene_snp` | SNPs in gene |
| gene | clinvar | `gene_clinvar` | Clinical variants |

### Nucleotide to Other Databases

| From | To | Link Name | Description |
|------|-----|-----------|-------------|
| nucleotide | protein | `nuccore_protein` | Encoded proteins |
| nucleotide | gene | `nuccore_gene` | Gene records |
| nucleotide | pubmed | `nuccore_pubmed` | Publications |
| nucleotide | taxonomy | `nuccore_taxonomy` | Organism taxonomy |
| nucleotide | biosample | `nuccore_biosample` | Sample info |
| nucleotide | sra | `nuccore_sra` | Related SRA data |

### Protein to Other Databases

| From | To | Link Name | Description |
|------|-----|-----------|-------------|
| protein | nucleotide | `protein_nuccore` | Coding sequences |
| protein | gene | `protein_gene` | Gene records |
| protein | pubmed | `protein_pubmed` | Publications |
| protein | structure | `protein_structure` | 3D structures |
| protein | cdd | `protein_cdd` | Conserved domains |

### PubMed Links

| From | To | Link Name | Description |
|------|-----|-----------|-------------|
| pubmed | pubmed | `pubmed_pubmed` | Related articles |
| pubmed | gene | `pubmed_gene` | Mentioned genes |
| pubmed | protein | `pubmed_protein` | Mentioned proteins |
| pubmed | nucleotide | `pubmed_nuccore` | Mentioned sequences |

## Code Patterns

### Gene to Protein

```python
from Bio import Entrez

Entrez.email = 'your.email@example.com'

def get_proteins_for_gene(gene_id):
    handle = Entrez.elink(dbfrom='gene', db='protein', id=gene_id, linkname='gene_protein_refseq')
    record = Entrez.read(handle)
    handle.close()

    if not record[0]['LinkSetDb']:
        return []
    return [link['Id'] for link in record[0]['LinkSetDb'][0]['Link']]

protein_ids = get_proteins_for_gene('672')  # BRCA1
print(f"RefSeq proteins: {protein_ids[:5]}")
```

### Nucleotide to Gene

```python
def get_gene_for_nucleotide(nuc_id):
    handle = Entrez.elink(dbfrom='nucleotide', db='gene', id=nuc_id)
    record = Entrez.read(handle)
    handle.close()

    if not record[0]['LinkSetDb']:
        return None
    return record[0]['LinkSetDb'][0]['Link'][0]['Id']

gene_id = get_gene_for_nucleotide('NM_007294')
print(f"Gene ID: {gene_id}")
```

### Find Related PubMed Articles

```python
def get_related_articles(pmid, max_results=10):
    handle = Entrez.elink(dbfrom='pubmed', db='pubmed', id=pmid, linkname='pubmed_pubmed')
    record = Entrez.read(handle)
    handle.close()

    if not record[0]['LinkSetDb']:
        return []
    links = record[0]['LinkSetDb'][0]['Link']
    return [link['Id'] for link in links[:max_results]]

related = get_related_articles('35412348')
print(f"Related articles: {related}")
```

### Get All Available Links

```python
def discover_links(db, record_id):
    handle = Entrez.elink(dbfrom=db, id=record_id, cmd='acheck')
    record = Entrez.read(handle)
    handle.close()

    links = {}
    for linkset in record[0].get('LinkSetDb', []):
        links[linkset['LinkName']] = linkset['DbTo']
    return links

available = discover_links('gene', '672')
for name, target in available.items():
    print(f"{name} -> {target}")
```

### Navigate Gene -> Protein -> Structure

**Goal:** Traverse multiple NCBI databases to find 3D structures associated with a gene of interest.

**Approach:** Chain two ELink calls: first link from gene to RefSeq proteins, then link those protein IDs to the structure database.

```python
def gene_to_structures(gene_id):
    # Gene to protein
    handle = Entrez.elink(dbfrom='gene', db='protein', id=gene_id, linkname='gene_protein_refseq')
    record = Entrez.read(handle)
    handle.close()

    if not record[0]['LinkSetDb']:
        return []
    protein_ids = [link['Id'] for link in record[0]['LinkSetDb'][0]['Link'][:5]]

    # Protein to structure
    handle = Entrez.elink(dbfrom='protein', db='structure', id=','.join(protein_ids))
    record = Entrez.read(handle)
    handle.close()

    structure_ids = []
    for linkset in record:
        if linkset['LinkSetDb']:
            structure_ids.extend([link['Id'] for link in linkset['LinkSetDb'][0]['Link']])
    return structure_ids

structures = gene_to_structures('672')
print(f"Structure IDs: {structures[:5]}")
```

### Link Multiple IDs at Once

**Goal:** Find cross-database links for a batch of source IDs in a single API call.

**Approach:** Pass comma-joined IDs to ELink; the result contains one linkset per input ID, so iterate through the record list to map each source ID to its linked targets.

```python
def batch_link(dbfrom, db, ids):
    if isinstance(ids, list):
        ids = ','.join(ids)

    handle = Entrez.elink(dbfrom=dbfrom, db=db, id=ids)
    record = Entrez.read(handle)
    handle.close()

    # Returns one linkset per input ID
    results = {}
    for linkset in record:
        source_id = linkset['IdList'][0]
        linked_ids = []
        if linkset['LinkSetDb']:
            linked_ids = [link['Id'] for link in linkset['LinkSetDb'][0]['Link']]
        results[source_id] = linked_ids
    return results

results = batch_link('gene', 'protein', ['672', '675', '7157'])
for gene, proteins in results.items():
    print(f"Gene {gene}: {len(proteins)} proteins")
```

### Get Publications for a Sequence

```python
def get_sequence_publications(accession):
    # First get the GI/UID
    handle = Entrez.esearch(db='nucleotide', term=f'{accession}[accn]')
    search = Entrez.read(handle)
    handle.close()

    if not search['IdList']:
        return []
    uid = search['IdList'][0]

    # Link to PubMed
    handle = Entrez.elink(dbfrom='nucleotide', db='pubmed', id=uid)
    record = Entrez.read(handle)
    handle.close()

    if not record[0]['LinkSetDb']:
        return []
    return [link['Id'] for link in record[0]['LinkSetDb'][0]['Link']]

pmids = get_sequence_publications('NM_007294')
print(f"PubMed IDs: {pmids[:5]}")
```

## Link Commands

| Command | Description |
|---------|-------------|
| `neighbor` | Default - get linked records |
| `neighbor_score` | Include relevance scores |
| `neighbor_history` | Store results in history |
| `acheck` | List all available links |
| `ncheck` | Check if any links exist |
| `lcheck` | Check specific link exists |
| `llinks` | Get URLs to Entrez links |
| `prlinks` | Get provider links (external) |

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| Empty `LinkSetDb` | No links exist | Check if record has linked data |
| `HTTPError 400` | Invalid ID or database | Verify ID exists in source database |
| `KeyError` | Missing expected field | Check if `LinkSetDb` is empty first |
| Single linkset expected, got list | Multiple input IDs | Iterate through record list |

## Decision Tree

```
Need to find related records?
├── Know what link you want?
│   └── Use elink with specific linkname
├── Discover what links exist?
│   └── Use elink with cmd='acheck'
├── Navigate to target database?
│   └── Use elink(dbfrom=X, db=Y, id=Z)
├── Find related records in same database?
│   └── Use elink(dbfrom=X, db=X) with neighbor
├── Chain multiple databases?
│   └── Call elink multiple times
└── Need the actual records?
    └── Use elink first, then efetch with IDs
```

## Related Skills

- entrez-search - Search databases before linking
- entrez-fetch - Retrieve records after finding linked IDs
- batch-downloads - Download many linked records efficiently