bio-sequence-similarity
$
npx mdskill add GPTomics/bioSkills/bio-sequence-similarityDetects distant homologs using advanced sequence analysis
- Identifies orthologs and protein families missed by standard BLAST.
- Depends on NCBI BLAST+, HMMER, BioPython, and PSI-BLAST tools.
- Decides methods based on query type and required sensitivity level.
- Delivers homologous sequences via CLI commands or Python API calls.
SKILL.md
.github/skills/bio-sequence-similarityView on GitHub ↗
---
name: bio-sequence-similarity
description: Find homologous sequences using iterative BLAST (PSI-BLAST), profile HMMs (HMMER), and reciprocal best hit analysis. Use when identifying orthologs, distant homologs, or protein family members where standard BLAST is not sensitive enough.
tool_type: mixed
primary_tool: BLAST+
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, NCBI BLAST+ 2.15+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Sequence Similarity Searches
Advanced methods for finding homologous sequences beyond standard BLAST.
**"Find distant homologs"** → Use iterative search (PSI-BLAST) or profile HMMs (HMMER) to detect remote sequence similarity that standard BLAST misses.
- CLI: `psiblast -query seq.fa -db nr -num_iterations 3` or `hmmsearch profile.hmm seqdb`
- Python: `NcbipsiblastCommandline()` (BioPython)
## PSI-BLAST (Position-Specific Iterated BLAST)
Builds a position-specific scoring matrix (PSSM) through iterations to find distant homologs.
### Basic PSI-BLAST
```bash
psiblast -query protein.fasta -db nr -out results.txt -num_iterations 3
```
### Save PSSM for Reuse
```bash
psiblast -query protein.fasta -db nr \
-out results.txt \
-out_pssm pssm.asn \
-out_ascii_pssm pssm.txt \
-num_iterations 5
```
### Use Existing PSSM
```bash
psiblast -in_pssm pssm.asn -db nr -out results.txt
```
### Output Format
```bash
psiblast -query protein.fasta -db nr \
-out results.txt \
-outfmt 6 \
-num_iterations 3 \
-inclusion_ethresh 0.001
```
### Key Parameters
```bash
psiblast -query protein.fasta -db nr \
-num_iterations 5 \
-inclusion_ethresh 0.001 \
-evalue 0.01 \
-num_threads 8 \
-out results.txt
```
### PSI-BLAST Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| -num_iterations | 1 | Number of iterations |
| -inclusion_ethresh | 0.002 | E-value for PSSM inclusion |
| -evalue | 10 | E-value threshold for reporting |
| -num_threads | 1 | CPU threads |
## HMMER for Profile Searches
HMMER uses profile hidden Markov models for sensitive sequence searches.
### Search with Single Sequence
```bash
jackhmmer -o results.txt -A aligned.sto --cpu 8 query.fasta database.fasta
```
### Build Profile from Alignment
```bash
hmmbuild profile.hmm alignment.sto
```
### Search Database with Profile
```bash
hmmsearch -o results.txt --tblout hits.tbl profile.hmm database.fasta
hmmsearch -o results.txt --domtblout domains.tbl profile.hmm database.fasta
```
### Download Pfam Profiles
```bash
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
```
### Scan Sequence Against Pfam
```bash
hmmscan --tblout pfam_hits.tbl --domtblout domains.tbl Pfam-A.hmm query.fasta
```
### Parse HMMER Output
```bash
grep -v "^#" hits.tbl | head
awk '$5 < 1e-10' hits.tbl
```
### HMMER Output Columns (--tblout)
| Column | Description |
|--------|-------------|
| 1 | Target name |
| 2 | Accession |
| 3 | Query name |
| 4 | Query accession |
| 5 | E-value (full sequence) |
| 6 | Score (full sequence) |
| 7 | Bias |
| 8 | E-value (best domain) |
| 9 | Score (best domain) |
## Reciprocal Best Hit (RBH) Analysis
Find orthologs using bidirectional best hits.
### Create BLAST Databases
```bash
makeblastdb -in species_A.fasta -dbtype prot -out species_A_db
makeblastdb -in species_B.fasta -dbtype prot -out species_B_db
```
### Bidirectional BLAST
```bash
blastp -query species_A.fasta -db species_B_db -outfmt 6 -evalue 1e-5 -max_target_seqs 1 > A_vs_B.txt
blastp -query species_B.fasta -db species_A_db -outfmt 6 -evalue 1e-5 -max_target_seqs 1 > B_vs_A.txt
```
### Find Reciprocal Best Hits
```bash
awk 'FNR==NR {a[$1]=$2; next} $2 in a && a[$2]==$1 {print $1"\t"$2}' \
A_vs_B.txt B_vs_A.txt > reciprocal_best_hits.txt
```
### Python RBH Script
**Goal:** Identify orthologous gene pairs between two species using the reciprocal best hit criterion.
**Approach:** Parse forward and reverse BLAST results to extract the top hit per query, then retain only pairs where each sequence is the other's best match.
```python
def find_rbh(forward_blast, reverse_blast):
'''Find reciprocal best hits from BLAST results'''
forward = {}
with open(forward_blast) as f:
for line in f:
parts = line.strip().split('\t')
query, subject = parts[0], parts[1]
if query not in forward:
forward[query] = subject
reverse = {}
with open(reverse_blast) as f:
for line in f:
parts = line.strip().split('\t')
query, subject = parts[0], parts[1]
if query not in reverse:
reverse[query] = subject
rbh = []
for a, b in forward.items():
if b in reverse and reverse[b] == a:
rbh.append((a, b))
return rbh
rbh_pairs = find_rbh('A_vs_B.txt', 'B_vs_A.txt')
for a, b in rbh_pairs:
print(f'{a}\t{b}')
```
## Delta-BLAST
Uses conserved domain database for more sensitive initial search.
```bash
deltablast -query protein.fasta -db nr -rpsdb cdd_delta -out results.txt
```
## PHI-BLAST (Pattern-Hit Initiated)
Search with a pattern plus sequence.
```bash
phi_pattern="G-x(2)-[ST]-x-[RK]"
phiblast -query protein.fasta -db nr -pattern "$phi_pattern" -out results.txt
```
## Iterative Search with Biopython
```python
from Bio.Blast import NCBIWWW, NCBIXML
with open('query.fasta') as f:
query = f.read()
result_handle = NCBIWWW.qblast('psiblast', 'nr', query, expect=0.001, word_size=3)
with open('psiblast_result.xml', 'w') as out:
out.write(result_handle.read())
result_handle.close()
with open('psiblast_result.xml') as f:
records = NCBIXML.parse(f)
for record in records:
for alignment in record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 1e-10:
print(f'{alignment.hit_def[:50]}: E={hsp.expect}')
```
## HMMER with Biopython
```python
from Bio import SearchIO
results = SearchIO.parse('hmmsearch_output.txt', 'hmmer3-text')
for query_result in results:
print(f'Query: {query_result.id}')
for hit in query_result:
print(f' Hit: {hit.id}, E-value: {hit.evalue}')
for hsp in hit:
print(f' Domain: {hsp.bitscore} bits')
```
## Jackhmmer (Iterative HMMER)
Similar to PSI-BLAST but uses HMM profiles.
```bash
jackhmmer -N 5 -o results.txt --tblout hits.tbl query.fasta database.fasta
jackhmmer -N 5 -A iterations.sto --chkhmm checkpoint query.fasta database.fasta
```
## OrthoFinder for Multi-Species Orthologs
```bash
orthofinder -f proteomes/ -t 8
orthofinder -f proteomes/ -t 8 -M msa
```
### Prepare Input
```bash
mkdir proteomes
cp species_*.fasta proteomes/
```
### Output Files
| File | Content |
|------|---------|
| Orthogroups.tsv | All orthogroups |
| Orthogroups_SingleCopyOrthologues.txt | 1:1 orthologs |
| Species_Tree/ | Inferred species tree |
| Gene_Trees/ | Individual gene trees |
## E-value vs Bit Score
| E-value | Interpretation |
|---------|----------------|
| < 1e-50 | Highly significant, likely homolog |
| 1e-50 to 1e-10 | Significant, probable homolog |
| 1e-10 to 1e-3 | Marginal, possible remote homolog |
| > 0.01 | Not significant |
## Complete Ortholog Finding Pipeline
**Goal:** Run an end-to-end reciprocal best hit ortholog analysis from two proteome FASTA files.
**Approach:** Build BLAST databases for both species, run bidirectional best-hit searches, and extract reciprocal pairs using awk.
```bash
#!/bin/bash
SPECIES_A=$1
SPECIES_B=$2
EVALUE=1e-10
THREADS=8
echo "Building databases..."
makeblastdb -in $SPECIES_A -dbtype prot -out db_A
makeblastdb -in $SPECIES_B -dbtype prot -out db_B
echo "Running forward BLAST..."
blastp -query $SPECIES_A -db db_B -outfmt 6 -evalue $EVALUE \
-max_target_seqs 1 -num_threads $THREADS > forward.txt
echo "Running reverse BLAST..."
blastp -query $SPECIES_B -db db_A -outfmt 6 -evalue $EVALUE \
-max_target_seqs 1 -num_threads $THREADS > reverse.txt
echo "Finding reciprocal best hits..."
awk 'FNR==NR {best[$1]=$2; next}
$2 in best && best[$2]==$1 {print $1"\t"$2}' \
forward.txt reverse.txt > orthologs.txt
echo "Found $(wc -l < orthologs.txt) ortholog pairs"
rm -f db_A.* db_B.*
```
## Related Skills
- blast-searches - Basic remote BLAST
- local-blast - Local BLAST databases
- entrez-fetch - Download sequences
- alignment - Align identified homologs
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.