bio-uniprot-access
$ npx mdskill add GPTomics/bioSkills/bio-uniprot-access

Fetch UniProt protein sequences and functional annotations instantly.
- Retrieves protein sequences, GO terms, domains, and interaction data.
- Integrates with the UniProt REST API and Biopython.
- Adapts code dynamically when package versions differ from requirements.
- Returns raw sequence text or structured JSON responses directly.
---
name: bio-uniprot-access
description: Access UniProt protein database for sequences, annotations, and functional information. Use when retrieving protein data, GO terms, domain annotations, or protein-protein interactions.
tool_type: python
primary_tool: requests
---
## Version Compatibility
Reference examples tested with: Biopython 1.83+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
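The version check described above can also be run from Python. The following is a minimal sketch using only the standard library; `installed_version` and `describe_call` are hypothetical helper names introduced here for illustration, not part of any package.

```python
from importlib import metadata
import inspect
import json


def installed_version(distribution):
    '''Return the installed version string, or None if the package is absent.'''
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def describe_call(func):
    '''Return a callable's name and signature as text, for comparison with an example.'''
    return f'{func.__name__}{inspect.signature(func)}'


# Distribution names as published on PyPI (note: 'biopython', not 'Bio')
for name in ('biopython', 'pandas', 'requests'):
    print(name, installed_version(name) or 'not installed')

# json.loads stands in for whichever function an example calls
print(describe_call(json.loads))
```

If a reported version is older than the tested one, checking the signature first avoids retrying a call that cannot succeed.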
# UniProt Access
Query UniProt for protein sequences, functional annotations, and cross-references.
**"Get protein information from UniProt"** → Fetch sequences, GO terms, domains, and cross-references for a protein accession.
- Python: `requests.get(f'https://rest.uniprot.org/uniprotkb/{acc}.json')` (UniProt REST API)
- Python: `ExPASy.get_sprot_raw(acc)` + `SwissProt.read()` (Biopython)
## UniProt REST API
### Fetch Single Entry
```python
import requests

def fetch_uniprot(accession, format='fasta'):
    '''Fetch a UniProt entry. Formats: fasta, json, txt, xml, gff'''
    url = f'https://rest.uniprot.org/uniprotkb/{accession}.{format}'
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

# Pass the primary accession (P04637 is P53_HUMAN); entry names are not guaranteed to resolve here
sequence = fetch_uniprot('P04637', 'fasta')
entry_json = fetch_uniprot('P04637', 'json')
```
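UniProt uses two kinds of identifier: accessions such as `P04637` and entry names such as `P53_HUMAN`. The sketch below tells them apart before a URL is built. It assumes the accession pattern published in the UniProt help pages; `is_accession` and `entry_url` are hypothetical helper names, and isoform identifiers such as `P04637-2` are deliberately not handled.

```python
import re

# Accession pattern as documented by UniProt: 6 or 10 characters
ACCESSION_RE = re.compile(
    r'[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}'
)


def is_accession(identifier):
    '''True if the whole string looks like a UniProtKB accession.'''
    return ACCESSION_RE.fullmatch(identifier) is not None


def entry_url(accession, fmt='fasta'):
    '''Build the single-entry URL, rejecting anything that is not an accession.'''
    if not is_accession(accession):
        raise ValueError(f'{accession!r} is not a UniProtKB accession')
    return f'https://rest.uniprot.org/uniprotkb/{accession}.{fmt}'


for candidate in ('P04637', 'Q9Y6K9', 'P53_HUMAN'):
    print(candidate, is_accession(candidate))
```

Entry names can be resolved to accessions with a search or with the ID mapping service before fetching.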
### Search UniProt
**Goal:** Find UniProt protein entries matching gene name, organism, or functional criteria.
**Approach:** Query the UniProt REST search endpoint with structured query syntax and parse the JSON results for accessions and protein descriptions.
```python
def search_uniprot(query, format='json', size=25):
    '''Search UniProtKB with query syntax'''
    url = 'https://rest.uniprot.org/uniprotkb/search'
    params = {'query': query, 'format': format, 'size': size}
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.json() if format == 'json' else response.text

results = search_uniprot('gene:BRCA1 AND organism_id:9606')
for entry in results['results']:
    # Unreviewed (TrEMBL) entries may carry submissionNames instead of recommendedName
    desc = entry['proteinDescription']
    name = desc.get('recommendedName') or desc.get('submissionNames', [{}])[0]
    print(entry['primaryAccession'], name.get('fullName', {}).get('value'))
```
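The search endpoint returns one page at a time, and further pages are advertised through a `Link` response header with `rel="next"`. The sketch below follows that header until it disappears. `next_page_url` and `iter_results` are hypothetical names, and the page fetcher is passed in as a function so the paging logic can be exercised without network access.

```python
import re

NEXT_LINK_RE = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header):
    '''Extract the rel="next" URL from a Link header, or None on the last page.'''
    if not link_header:
        return None
    match = NEXT_LINK_RE.search(link_header)
    return match.group(1) if match else None


def iter_results(first_url, get_page):
    '''Yield entries across pages. get_page(url) must return (payload, link_header).'''
    url = first_url
    while url:
        payload, link_header = get_page(url)
        yield from payload.get('results', [])
        url = next_page_url(link_header)


# A possible get_page built on requests (assumes `import requests`):
# def get_page(url):
#     response = requests.get(url)
#     response.raise_for_status()
#     return response.json(), response.headers.get('Link')
```

Keeping the HTTP call separate from the paging loop also makes it straightforward to add retries or rate limiting in one place.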
### Query Syntax
| Query | Description |
|-------|-------------|
| `gene:TP53` | Gene name |
| `organism_id:9606` | Human (NCBI taxonomy) |
| `reviewed:true` | Swiss-Prot only |
| `length:[100 TO 500]` | Sequence length range |
| `go:0006915` | GO term (apoptosis) |
| `keyword:kinase` | Keyword |
| `ec:2.7.1.1` | Enzyme classification |
| `database:pdb` | Has PDB structure |
### Combine Queries
```python
# Human kinases with structures
query = 'organism_id:9606 AND keyword:kinase AND database:pdb AND reviewed:true'
results = search_uniprot(query, size=100)
```
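Query strings like the one above can be assembled from parts rather than typed by hand. This is a small sketch, not a UniProt feature: `build_query` is a hypothetical helper that joins `field:value` clauses with `AND` and does no quoting or escaping, so values must be written exactly as they appear in the query syntax table.

```python
def build_query(*clauses, **fields):
    '''Join raw clauses and field=value pairs into one AND query.'''
    parts = list(clauses)
    for field, value in fields.items():
        if isinstance(value, bool):
            # UniProt expects lowercase true/false
            value = 'true' if value else 'false'
        parts.append(f'{field}:{value}')
    return ' AND '.join(parts)


query = build_query(organism_id=9606, keyword='kinase', database='pdb', reviewed=True)
print(query)
# organism_id:9606 AND keyword:kinase AND database:pdb AND reviewed:true

# Clauses that are awkward as keyword arguments can be passed as raw strings
print(build_query('length:[100 TO 500]', gene='TP53'))
# length:[100 TO 500] AND gene:TP53
```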
## Batch Retrieval
### Multiple Accessions
```python
def batch_fetch(accessions, format='fasta'):
    '''Fetch multiple entries by primary accession'''
    url = 'https://rest.uniprot.org/uniprotkb/accessions'
    params = {'accessions': ','.join(accessions), 'format': format}
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.text

# This endpoint expects accessions, so the entry name P53_HUMAN is given as P04637
accessions = ['P04637', 'Q9Y6K9']
sequences = batch_fetch(accessions)
```
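Because the accessions travel in the query string, long lists produce long URLs. One way to keep each request small is to de-duplicate the list and send it in batches. The sketch below shows only the batching step; `unique` and `chunked` are hypothetical helpers, and the batch size of 100 in the usage comment is an arbitrary assumption rather than a documented limit.

```python
def unique(items):
    '''Drop duplicates while keeping first-seen order.'''
    return list(dict.fromkeys(items))


def chunked(items, size):
    '''Yield consecutive slices of at most `size` items.'''
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(items), size):
        yield items[start:start + size]


accessions = unique(['P04637', 'Q9Y6K9', 'P04637'])
for batch in chunked(accessions, 2):
    print(batch)

# With the batch_fetch function from this section:
# fasta = ''.join(batch_fetch(batch) for batch in chunked(accessions, 100))
```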
### Stream Large Results
```python
def search_all(query, format='tsv', fields=None):
    '''Download all results for large queries'''
    url = 'https://rest.uniprot.org/uniprotkb/stream'
    params = {'query': query, 'format': format}
    if fields:
        params['fields'] = ','.join(fields)
    # response.text reads the whole body into memory, so stream=True adds nothing here
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.text

# Get all reviewed human proteins as TSV
all_human = search_all('organism_id:9606 AND reviewed:true',
                       fields=['accession', 'gene_names', 'protein_name'])
```
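For very large result sets it can help to process the TSV row by row instead of holding one large string. The sketch below turns any iterable of TSV lines into dictionaries keyed by the header row. `iter_tsv_rows` is a hypothetical helper, and the sample lines are hand-written to resemble UniProt TSV output rather than copied from a real response.

```python
import csv


def iter_tsv_rows(lines):
    '''Yield one dict per data row; the first line supplies the column names.'''
    # QUOTE_NONE keeps any double quotes inside a field as literal text
    yield from csv.DictReader(lines, delimiter='\t', quoting=csv.QUOTE_NONE)


# Hand-written sample in the shape of a UniProt TSV download
sample = [
    'Entry\tGene Names\tProtein names',
    'P04637\tTP53 P53\tCellular tumor antigen p53',
]
for row in iter_tsv_rows(sample):
    print(row['Entry'], row['Gene Names'].split())

# With requests, lines could come from a streamed response (assumes the
# server declares a text encoding so iter_lines yields str, not bytes):
# response = requests.get(url, params=params, stream=True)
# rows = iter_tsv_rows(response.iter_lines(decode_unicode=True))
```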
## ID Mapping
### Map Between Databases
**Goal:** Convert identifiers between databases (e.g., Ensembl gene IDs to UniProt accessions) in batch.
**Approach:** Submit an asynchronous ID mapping job to the UniProt API, poll for completion, then retrieve the mapped results.
```python
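import time

def map_ids(ids, from_db, to_db, poll_seconds=1):
    '''Map IDs between databases'''
    url = 'https://rest.uniprot.org/idmapping/run'
    response = requests.post(url, data={'ids': ','.join(ids), 'from': from_db, 'to': to_db})
    response.raise_for_status()
    job_id = response.json()['jobId']

    # Poll until the job finishes; a finished job redirects to its results
    while True:
        status = requests.get(f'https://rest.uniprot.org/idmapping/status/{job_id}').json()
        state = status.get('jobStatus')
        if 'results' in status or 'failedIds' in status or state == 'FINISHED':
            break
        if state not in (None, 'NEW', 'RUNNING'):
            raise RuntimeError(f'ID mapping job {job_id} ended with status {state}')
        time.sleep(poll_seconds)

    results = requests.get(f'https://rest.uniprot.org/idmapping/results/{job_id}')
    results.raise_for_status()
    return results.json()

# Ensembl gene IDs to UniProt
mapping = map_ids(['ENSG00000141510', 'ENSG00000171862'], 'Ensembl', 'UniProtKB')
for result in mapping['results']:
    target = result['to']
    # 'to' may be a plain accession or a full entry object, depending on the results endpoint
    accession = target['primaryAccession'] if isinstance(target, dict) else target
    print(result['from'], '->', accession)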
```
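A mapping job can return several targets for one input and can also report inputs it could not map. The sketch below groups a results payload by source ID and collects the failures. `summarize_mapping` is a hypothetical helper; the payload is a hand-written sample with placeholder IDs, and it assumes `to` may be either a plain string or an entry object with `primaryAccession`.

```python
from collections import defaultdict


def summarize_mapping(payload):
    '''Return ({source_id: [target, ...]}, [failed_id, ...]) from a results payload.'''
    mapped = defaultdict(list)
    for item in payload.get('results', []):
        target = item['to']
        if isinstance(target, dict):
            # Full entry object: keep only the accession
            target = target['primaryAccession']
        mapped[item['from']].append(target)
    return dict(mapped), list(payload.get('failedIds', []))


# Hand-written sample; ID_A, ID_B, ID_C and the targets are placeholders
sample = {
    'results': [
        {'from': 'ID_A', 'to': 'TARGET_1'},
        {'from': 'ID_A', 'to': {'primaryAccession': 'TARGET_2'}},
        {'from': 'ID_B', 'to': 'TARGET_3'},
    ],
    'failedIds': ['ID_C'],
}
mapped, failed = summarize_mapping(sample)
print(mapped)
print(failed)
```

Reporting failed IDs alongside the mapping makes silent drops visible when the input list is long.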
### Common Database Codes
| Code | Database |
|------|----------|
| `UniProtKB` | UniProt accessions |
| `UniProtKB_AC-ID` | UniProt AC or ID |
| `Ensembl` | Ensembl gene ID |
| `RefSeq_Protein` | RefSeq protein |
| `PDB` | PDB ID |
| `GeneID` | NCBI Gene ID |
| `Gene_Name` | Gene symbols |
## Extract Specific Data
### Parse JSON Entry
**Goal:** Extract structured annotations (GO terms, domains, PDB structures) from a UniProt JSON entry.
**Approach:** Fetch the entry in JSON format and navigate the nested structure to pull accession, gene name, sequence, and cross-reference lists filtered by database type.
```python
import json

entry = json.loads(fetch_uniprot('P04637', 'json'))

accession = entry['primaryAccession']
gene_name = entry['genes'][0]['geneName']['value']
protein_name = entry['proteinDescription']['recommendedName']['fullName']['value']
sequence = entry['sequence']['value']
length = entry['sequence']['length']

# GO terms
go_terms = [ref for ref in entry.get('uniProtKBCrossReferences', [])
            if ref['database'] == 'GO']

# Domains (InterPro)
domains = [ref for ref in entry.get('uniProtKBCrossReferences', [])
           if ref['database'] == 'InterPro']

# PDB structures
pdb_refs = [ref for ref in entry.get('uniProtKBCrossReferences', [])
            if ref['database'] == 'PDB']
```
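The filtered cross-references are still nested objects. The sketch below flattens the GO ones into small records. It assumes each GO cross-reference carries a `properties` list with `GoTerm` (for example `C:nucleus`) and `GoEvidenceType` keys; `go_annotations` is a hypothetical helper and the sample entry is hand-written in that shape.

```python
def go_annotations(entry):
    '''Flatten GO cross-references into dicts with id, aspect, term, and evidence.'''
    annotations = []
    for ref in entry.get('uniProtKBCrossReferences', []):
        if ref.get('database') != 'GO':
            continue
        props = {p['key']: p['value'] for p in ref.get('properties', [])}
        # GoTerm is assumed to look like 'C:nucleus' (aspect letter, colon, term name)
        aspect, _, term = props.get('GoTerm', '').partition(':')
        annotations.append({
            'id': ref['id'],
            'aspect': aspect,
            'term': term,
            'evidence': props.get('GoEvidenceType'),
        })
    return annotations


# Hand-written sample in the assumed shape of a UniProt JSON entry
sample_entry = {
    'uniProtKBCrossReferences': [
        {'database': 'GO', 'id': 'GO:0005634',
         'properties': [{'key': 'GoTerm', 'value': 'C:nucleus'},
                        {'key': 'GoEvidenceType', 'value': 'IDA:UniProtKB'}]},
        {'database': 'PDB', 'id': '1TUP', 'properties': []},
    ]
}
for annotation in go_annotations(sample_entry):
    print(annotation)
```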
### Get Specific Fields (TSV)
```python
def get_fields(query, fields):
    '''Get specific fields as a DataFrame (first page only, up to 500 rows)'''
    import pandas as pd
    from io import StringIO
    url = 'https://rest.uniprot.org/uniprotkb/search'
    params = {'query': query, 'format': 'tsv', 'fields': ','.join(fields), 'size': 500}
    response = requests.get(url, params=params)
    response.raise_for_status()
    return pd.read_csv(StringIO(response.text), sep='\t')

df = get_fields('organism_id:9606 AND keyword:kinase AND reviewed:true',
                ['accession', 'gene_names', 'protein_name', 'length', 'go_p'])
```
### Available Fields
| Field | Description |
|-------|-------------|
| `accession` | UniProt accession |
| `gene_names` | Gene names |
| `protein_name` | Protein name |
| `organism_name` | Species |
| `length` | Sequence length |
| `mass` | Molecular mass |
| `go_p` | GO biological process |
| `go_c` | GO cellular component |
| `go_f` | GO molecular function |
| `xref_pdb` | PDB cross-references |
| `ft_domain` | Domain features |
| `ft_binding` | Binding sites |
## Biopython Integration
```python
from Bio import SeqIO
from io import StringIO
fasta_text = fetch_uniprot('P04637', 'fasta')
record = SeqIO.read(StringIO(fasta_text), 'fasta')
print(record.id, len(record.seq))
```
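For UniProt FASTA records, `record.id` is a pipe-delimited identifier such as `sp|P04637|P53_HUMAN`, and the description line carries `OS=`, `OX=`, `GN=`, `PE=`, and `SV=` tags. The sketch below splits a header into those parts with the standard library. `parse_uniprot_header` is a hypothetical helper, the sample header follows the UniProt FASTA layout, and headers that do not have three pipe-separated fields will raise `ValueError`.

```python
import re

TAG = r'(?:OS|OX|GN|PE|SV)'


def parse_uniprot_header(description):
    '''Split a UniProt FASTA header (without the leading ">") into its parts.'''
    ident, _, rest = description.partition(' ')
    db, accession, entry_name = ident.split('|')
    # The protein name runs up to the first KEY=value tag
    protein_name = re.split(rf'\s{TAG}=', rest, maxsplit=1)[0]
    # Each tag value runs up to the next tag or the end of the line
    tags = dict(re.findall(rf'\b({TAG})=(.+?)(?=\s{TAG}=|$)', rest))
    return {'db': db, 'accession': accession, 'entry_name': entry_name,
            'protein_name': protein_name, 'tags': tags}


header = ('sp|P04637|P53_HUMAN Cellular tumor antigen p53 '
          'OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4')
parsed = parse_uniprot_header(header)
print(parsed['accession'], parsed['entry_name'], parsed['tags']['GN'])

# With a Biopython record: parse_uniprot_header(record.description)
```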
## Related Skills
- database-access/entrez-fetch - NCBI protein access
- database-access/blast-searches - BLAST against UniProt
- structural-biology/structure-io - Download PDB structures
- structural-biology/alphafold-predictions - AlphaFold structures