bio-blast-searches
$
npx mdskill add GPTomics/bioSkills/bio-blast-searchesRuns remote BLAST searches against NCBI databases to find sequence similarity
- Identifies unknown sequences by comparing them to NCBI's nr/nt databases
- Uses Biopython's Bio.Blast module and NCBIWWW for remote query execution
- Selects appropriate BLAST programs (blastn, blastp, etc.) based on input sequence type
- Returns parsed XML results with E-values, identity scores, and sequence matches
SKILL.md
.github/skills/bio-blast-searchesView on GitHub ↗
---
name: bio-blast-searches
description: Run remote BLAST searches against NCBI databases using Biopython Bio.Blast. Use when identifying unknown sequences, finding homologs, or searching for sequence similarity against NCBI's nr/nt databases.
tool_type: python
primary_tool: Bio.Blast.NCBIWWW
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, NCBI BLAST+ 2.15+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# BLAST Searches
**"BLAST this sequence against NCBI"** → Submit a query sequence to NCBI BLAST servers, retrieve XML results, and parse hits with E-values and identity scores.
- Python: `NCBIWWW.qblast()` + `NCBIXML.read()` (BioPython)
- CLI: `blastn -remote -db nt -query seq.fa` (BLAST+)
- R: `blast()` (rBLAST, local only)
Run BLAST searches against NCBI databases using Biopython's Bio.Blast module.
## Required Import
```python
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO
```
## BLAST Programs
| Program | Query | Database | Use Case |
|---------|-------|----------|----------|
| `blastn` | Nucleotide | Nucleotide | DNA/RNA sequence similarity |
| `blastp` | Protein | Protein | Protein sequence similarity |
| `blastx` | Nucleotide | Protein | Find protein hits for DNA query |
| `tblastn` | Protein | Nucleotide | Find DNA encoding protein-like |
| `tblastx` | Nucleotide | Nucleotide | Translated vs translated |
## Core Function
### NCBIWWW.qblast()
Submit a BLAST query to NCBI servers.
```python
from Bio.Blast import NCBIWWW
# Simple BLASTN search
result_handle = NCBIWWW.qblast('blastn', 'nt', sequence)
```
**Key Parameters:**
| Parameter | Description | Example |
|-----------|-------------|---------|
| `program` | BLAST program | `'blastn'`, `'blastp'` |
| `database` | Target database | `'nr'`, `'nt'`, `'refseq_rna'` |
| `sequence` | Query sequence | String or SeqRecord |
| `entrez_query` | Limit by Entrez query | `'Homo sapiens[organism]'` |
| `hitlist_size` | Max hits to return | `50` |
| `expect` | E-value threshold | `0.001` |
| `word_size` | Word size | `11` for blastn |
| `gapcosts` | Gap penalties | `'5 2'` (open, extend) |
| `format_type` | Output format | `'XML'` (default), `'Text'` |
## Common Databases
**Nucleotide:**
| Database | Description |
|----------|-------------|
| `nt` | All GenBank + EMBL + DDBJ |
| `refseq_rna` | RefSeq RNA sequences |
| `refseq_genomic` | RefSeq genomic sequences |
**Protein:**
| Database | Description |
|----------|-------------|
| `nr` | Non-redundant protein |
| `refseq_protein` | RefSeq proteins |
| `swissprot` | SwissProt (curated) |
| `pdb` | Protein structures |
## Parsing Results
### NCBIXML Parser
```python
from Bio.Blast import NCBIWWW, NCBIXML
# Run BLAST
result_handle = NCBIWWW.qblast('blastn', 'nt', sequence)
# Parse XML results
blast_record = NCBIXML.read(result_handle)
result_handle.close()
# Iterate hits
for alignment in blast_record.alignments:
print(f"Hit: {alignment.title}")
for hsp in alignment.hsps:
print(f" E-value: {hsp.expect}")
print(f" Score: {hsp.score}")
print(f" Identity: {hsp.identities}/{hsp.align_length}")
```
### Alignment/HSP Attributes
```python
# Alignment (hit) attributes
alignment.title # Hit description
alignment.accession # Accession number
alignment.length # Subject sequence length
alignment.hsps # List of HSPs
# HSP (High-scoring Segment Pair) attributes
hsp.score # Raw score
hsp.bits # Bit score
hsp.expect # E-value
hsp.identities # Number of identical positions
hsp.positives # Number of positive-scoring positions
hsp.gaps # Number of gaps
hsp.align_length # Alignment length
hsp.query # Aligned query sequence
hsp.match # Match line (| for identity)
hsp.sbjct # Aligned subject sequence
hsp.query_start # Query start position
hsp.query_end # Query end position
hsp.sbjct_start # Subject start position
hsp.sbjct_end # Subject end position
hsp.strand # Strand (blastn)
hsp.frame # Reading frame (blastx/tblastn)
```
## Code Patterns
### Basic BLASTN
```python
from Bio.Blast import NCBIWWW, NCBIXML
sequence = '''ATGAAAGCAATTTTCGTACTGAAAGGTTGGTGGCGCACTTCCTGA'''
print("Running BLASTN (this may take a minute)...")
result_handle = NCBIWWW.qblast('blastn', 'nt', sequence)
blast_record = NCBIXML.read(result_handle)
result_handle.close()
print(f"\nFound {len(blast_record.alignments)} hits")
for alignment in blast_record.alignments[:5]:
hsp = alignment.hsps[0]
print(f"\n{alignment.title[:70]}...")
print(f" E-value: {hsp.expect:.2e}")
print(f" Identity: {hsp.identities}/{hsp.align_length} ({100*hsp.identities/hsp.align_length:.1f}%)")
```
### BLASTP with Organism Filter
```python
from Bio.Blast import NCBIWWW, NCBIXML
protein_seq = '''MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH'''
result_handle = NCBIWWW.qblast(
'blastp',
'nr',
protein_seq,
entrez_query='Mammalia[organism]',
hitlist_size=20,
expect=0.001
)
blast_record = NCBIXML.read(result_handle)
result_handle.close()
for alignment in blast_record.alignments[:10]:
hsp = alignment.hsps[0]
print(f"{alignment.accession}: E={hsp.expect:.2e} - {alignment.title[:50]}...")
```
### BLAST from FASTA File
```python
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
record = SeqIO.read('query.fasta', 'fasta')
result_handle = NCBIWWW.qblast('blastn', 'nt', record.seq)
blast_record = NCBIXML.read(result_handle)
result_handle.close()
for alignment in blast_record.alignments[:5]:
print(f"{alignment.accession}: {alignment.title[:60]}...")
```
### Save Results to File
```python
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast('blastn', 'nt', sequence)
# Save XML for later parsing
with open('blast_results.xml', 'w') as out:
out.write(result_handle.read())
result_handle.close()
# Parse saved file
from Bio.Blast import NCBIXML
with open('blast_results.xml') as f:
blast_record = NCBIXML.read(f)
```
### Extract Top Hits
**Goal:** Run a BLAST search and return a structured list of top hits with identity, coverage, and E-values.
**Approach:** Submit query with `qblast`, parse the XML result, and extract key metrics from each alignment's best HSP.
**Reference (BioPython 1.83+):**
```python
def get_top_hits(sequence, program='blastn', database='nt', num_hits=10, evalue=0.01):
result_handle = NCBIWWW.qblast(
program, database, sequence,
hitlist_size=num_hits,
expect=evalue
)
blast_record = NCBIXML.read(result_handle)
result_handle.close()
hits = []
for alignment in blast_record.alignments:
hsp = alignment.hsps[0]
hits.append({
'accession': alignment.accession,
'title': alignment.title,
'evalue': hsp.expect,
'identity': hsp.identities / hsp.align_length,
'coverage': hsp.align_length / blast_record.query_length
})
return hits
hits = get_top_hits(my_sequence, num_hits=5)
for hit in hits:
print(f"{hit['accession']}: {hit['identity']:.1%} identity, E={hit['evalue']:.2e}")
```
### BLASTX (DNA to Protein)
```python
from Bio.Blast import NCBIWWW, NCBIXML
dna_sequence = '''ATGAAAGCAATTTTCGTACTGAAAGGTTGGTGGCGCACTTCCTGA'''
result_handle = NCBIWWW.qblast('blastx', 'nr', dna_sequence)
blast_record = NCBIXML.read(result_handle)
result_handle.close()
for alignment in blast_record.alignments[:5]:
hsp = alignment.hsps[0]
print(f"{alignment.accession}: frame {hsp.frame}, E={hsp.expect:.2e}")
print(f" {alignment.title[:60]}...")
```
### Multiple Sequences
```python
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
import time
def blast_multiple(fasta_file, program='blastn', database='nt'):
results = {}
for record in SeqIO.parse(fasta_file, 'fasta'):
print(f"BLASTing {record.id}...")
result_handle = NCBIWWW.qblast(program, database, str(record.seq), hitlist_size=5)
blast_record = NCBIXML.read(result_handle)
result_handle.close()
results[record.id] = blast_record
time.sleep(5) # Be nice to NCBI servers
return results
```
### Filter by Identity/Coverage
**Goal:** Retain only BLAST hits meeting minimum identity and query coverage thresholds.
**Approach:** Iterate all alignments and HSPs, compute identity and coverage fractions, and collect those passing both cutoffs.
**Reference (BioPython 1.83+):**
```python
def filter_blast_hits(blast_record, min_identity=0.9, min_coverage=0.8):
query_length = blast_record.query_length
filtered = []
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
identity = hsp.identities / hsp.align_length
coverage = hsp.align_length / query_length
if identity >= min_identity and coverage >= min_coverage:
filtered.append({
'accession': alignment.accession,
'title': alignment.title,
'identity': identity,
'coverage': coverage,
'evalue': hsp.expect
})
return filtered
```
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| Timeout | NCBI servers busy | Retry later, use smaller query |
| No hits | Wrong database or program | Check program/database match |
| Empty results | E-value too stringent | Increase expect parameter |
| URLError | Network issue | Check connection, retry |
## Timing Notes
- Remote BLAST can take 30 seconds to several minutes
- Longer queries take longer
- Peak times (US business hours) may be slower
- For many queries, consider local BLAST
## Decision Tree
```
Need to BLAST a sequence?
├── DNA query?
│ ├── Against DNA database? → blastn
│ └── Against protein database? → blastx
├── Protein query?
│ ├── Against protein database? → blastp
│ └── Against DNA database? → tblastn
├── Want to limit organisms?
│ └── Use entrez_query parameter
├── Need many searches?
│ └── Consider local-blast skill
└── Need results fast?
└── Use local-blast skill
```
## Related Skills
- local-blast - Run BLAST locally for faster, unlimited searches
- entrez-fetch - Fetch full records for BLAST hits
- sequence-io - Read query sequences from files
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.