bio-uniprot-access

$npx mdskill add GPTomics/bioSkills/bio-uniprot-access

Fetch UniProt protein sequences and functional annotations instantly.

  • Retrieves protein sequences, GO terms, domains, and interaction data.
  • Integrates with UniProt REST API and BioPython libraries.
  • Adapts code dynamically when package versions differ from requirements.
  • Returns raw sequence text or structured JSON responses directly.
SKILL.md
.github/skills/bio-uniprot-accessView on GitHub ↗
---
name: bio-uniprot-access
description: Access UniProt protein database for sequences, annotations, and functional information. Use when retrieving protein data, GO terms, domain annotations, or protein-protein interactions.
tool_type: python
primary_tool: requests
---

## Version Compatibility

Reference examples tested with: BioPython 1.83+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# UniProt Access

Query UniProt for protein sequences, functional annotations, and cross-references.

**"Get protein information from UniProt"** → Fetch sequences, GO terms, domains, and cross-references for a protein accession.
- Python: `requests.get(f'https://rest.uniprot.org/uniprotkb/{acc}.json')` (UniProt REST API)
- Python: `ExPASy.get_sprot_raw(acc)` + `SwissProt.read()` (BioPython)

## UniProt REST API

### Fetch Single Entry

```python
import requests

def fetch_uniprot(accession, format='fasta'):
    '''Fetch UniProt entry. Formats: fasta, json, txt, xml, gff'''
    url = f'https://rest.uniprot.org/uniprotkb/{accession}.{format}'
    response = requests.get(url)
    response.raise_for_status()
    return response.text

sequence = fetch_uniprot('P53_HUMAN', 'fasta')
entry_json = fetch_uniprot('P04637', 'json')
```

### Search UniProt

**Goal:** Find UniProt protein entries matching gene name, organism, or functional criteria.

**Approach:** Query the UniProt REST search endpoint with structured query syntax and parse the JSON results for accessions and protein descriptions.

```python
def search_uniprot(query, format='json', size=25):
    '''Search UniProt with query syntax'''
    url = 'https://rest.uniprot.org/uniprotkb/search'
    params = {'query': query, 'format': format, 'size': size}
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.json() if format == 'json' else response.text

results = search_uniprot('gene:BRCA1 AND organism_id:9606')
for entry in results['results']:
    print(entry['primaryAccession'], entry['proteinDescription']['recommendedName']['fullName']['value'])
```

### Query Syntax

| Query | Description |
|-------|-------------|
| `gene:TP53` | Gene name |
| `organism_id:9606` | Human (NCBI taxonomy) |
| `reviewed:true` | Swiss-Prot only |
| `length:[100 TO 500]` | Sequence length range |
| `go:0006915` | GO term (apoptosis) |
| `keyword:kinase` | Keyword |
| `ec:2.7.1.1` | Enzyme classification |
| `database:pdb` | Has PDB structure |

### Combine Queries

```python
# Human kinases with structures
query = 'organism_id:9606 AND keyword:kinase AND database:pdb AND reviewed:true'
results = search_uniprot(query, size=100)
```

## Batch Retrieval

### Multiple Accessions

```python
def batch_fetch(accessions, format='fasta'):
    '''Fetch multiple entries'''
    url = 'https://rest.uniprot.org/uniprotkb/accessions'
    params = {'accessions': ','.join(accessions), 'format': format}
    response = requests.get(url, params=params)
    return response.text

accessions = ['P04637', 'P53_HUMAN', 'Q9Y6K9']
sequences = batch_fetch(accessions)
```

### Stream Large Results

```python
def search_all(query, format='tsv', fields=None):
    '''Stream all results for large queries'''
    url = 'https://rest.uniprot.org/uniprotkb/stream'
    params = {'query': query, 'format': format}
    if fields:
        params['fields'] = ','.join(fields)
    response = requests.get(url, params=params, stream=True)
    return response.text

# Get all human proteins as TSV
all_human = search_all('organism_id:9606 AND reviewed:true',
                       fields=['accession', 'gene_names', 'protein_name'])
```

## ID Mapping

### Map Between Databases

**Goal:** Convert identifiers between databases (e.g., Ensembl gene IDs to UniProt accessions) in batch.

**Approach:** Submit an asynchronous ID mapping job to the UniProt API, poll for completion, then retrieve the mapped results.

```python
import time

def map_ids(ids, from_db, to_db):
    '''Map IDs between databases'''
    url = 'https://rest.uniprot.org/idmapping/run'
    response = requests.post(url, data={'ids': ','.join(ids), 'from': from_db, 'to': to_db})
    job_id = response.json()['jobId']

    # Poll for results
    while True:
        status = requests.get(f'https://rest.uniprot.org/idmapping/status/{job_id}')
        if 'results' in status.json() or 'failedIds' in status.json():
            break
        time.sleep(1)

    results = requests.get(f'https://rest.uniprot.org/idmapping/results/{job_id}')
    return results.json()

# Ensembl gene IDs to UniProt
mapping = map_ids(['ENSG00000141510', 'ENSG00000171862'], 'Ensembl', 'UniProtKB')
for result in mapping['results']:
    print(result['from'], '->', result['to']['primaryAccession'])
```

### Common Database Codes

| Code | Database |
|------|----------|
| `UniProtKB` | UniProt accessions |
| `UniProtKB_AC-ID` | UniProt AC or ID |
| `Ensembl` | Ensembl gene ID |
| `RefSeq_Protein` | RefSeq protein |
| `PDB` | PDB ID |
| `GeneID` | NCBI Gene ID |
| `Gene_Name` | Gene symbols |

## Extract Specific Data

### Parse JSON Entry

**Goal:** Extract structured annotations (GO terms, domains, PDB structures) from a UniProt JSON entry.

**Approach:** Fetch the entry in JSON format and navigate the nested structure to pull accession, gene name, sequence, and cross-reference lists filtered by database type.

```python
import json

entry = json.loads(fetch_uniprot('P04637', 'json'))

accession = entry['primaryAccession']
gene_name = entry['genes'][0]['geneName']['value']
protein_name = entry['proteinDescription']['recommendedName']['fullName']['value']
sequence = entry['sequence']['value']
length = entry['sequence']['length']

# GO terms
go_terms = [ref for ref in entry.get('uniProtKBCrossReferences', [])
            if ref['database'] == 'GO']

# Domains (InterPro)
domains = [ref for ref in entry.get('uniProtKBCrossReferences', [])
           if ref['database'] == 'InterPro']

# PDB structures
pdb_refs = [ref for ref in entry.get('uniProtKBCrossReferences', [])
            if ref['database'] == 'PDB']
```

### Get Specific Fields (TSV)

```python
def get_fields(query, fields):
    '''Get specific fields as DataFrame'''
    import pandas as pd
    from io import StringIO

    url = 'https://rest.uniprot.org/uniprotkb/search'
    params = {'query': query, 'format': 'tsv', 'fields': ','.join(fields), 'size': 500}
    response = requests.get(url, params=params)
    return pd.read_csv(StringIO(response.text), sep='\t')

df = get_fields('organism_id:9606 AND keyword:kinase AND reviewed:true',
                ['accession', 'gene_names', 'protein_name', 'length', 'go_p'])
```

### Available Fields

| Field | Description |
|-------|-------------|
| `accession` | UniProt accession |
| `gene_names` | Gene names |
| `protein_name` | Protein name |
| `organism_name` | Species |
| `length` | Sequence length |
| `mass` | Molecular mass |
| `go_p` | GO biological process |
| `go_c` | GO cellular component |
| `go_f` | GO molecular function |
| `xref_pdb` | PDB cross-references |
| `ft_domain` | Domain features |
| `ft_binding` | Binding sites |

## Biopython Integration

```python
from Bio import SeqIO
from io import StringIO

fasta_text = fetch_uniprot('P04637', 'fasta')
record = SeqIO.read(StringIO(fasta_text), 'fasta')
print(record.id, len(record.seq))
```

## Related Skills

- database-access/entrez-fetch - NCBI protein access
- database-access/blast-searches - BLAST against UniProt
- structural-biology/structure-io - Download PDB structures
- structural-biology/alphafold-predictions - AlphaFold structures
More from GPTomics/bioSkills