bio-metagenomics-strain-tracking
$
npx mdskill add GPTomics/bioSkills/bio-metagenomics-strain-trackingTrack bacterial strains at sub-species resolution using genome comparison tools
- Identify and monitor strain-level variation in metagenomic samples
- Uses MASH, sourmash, fastANI, and inStrain for genome comparison and profiling
- Analyzes distance metrics, ANI, and within-sample strain dynamics to detect variation
- Delivers strain tracking insights via CLI outputs and comparative reports
SKILL.md
.github/skills/bio-metagenomics-strain-trackingView on GitHub ↗
---
name: bio-metagenomics-strain-tracking
description: Track bacterial strains using MASH, sourmash, fastANI, and inStrain. Compare genomes, detect contamination, and monitor strain-level variation. Use when needing sub-species resolution for outbreak tracking, transmission analysis, or within-host strain dynamics.
tool_type: cli
primary_tool: MASH
---
## Version Compatibility
Reference examples tested with: Bowtie2 2.5.3+, MetaPhlAn 4.1+, numpy 1.26+, pandas 2.2+, samtools 1.19+, scipy 1.12+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Strain Tracking
**"Track bacterial strains across my samples"** → Resolve sub-species variation using genome sketching (Mash/sourmash), average nucleotide identity (fastANI), or within-sample strain profiling (inStrain) for outbreak tracking and transmission analysis.
- CLI: `mash dist`, `sourmash compare`, `fastANI`, `inStrain profile`
Identify and track bacterial strains at sub-species resolution.
## Tool Comparison
| Tool | Method | Best For |
|------|--------|----------|
| MASH | MinHash sketches | Fast distance estimation |
| sourmash | MinHash + containment | Metagenome comparisons |
| fastANI | ANI calculation | Accurate species/strain ID |
| inStrain | SNV profiling | Strain dynamics in metagenomes |
## MASH
### Installation
```bash
conda install -c bioconda mash
```
### Create Sketch
```bash
# Single genome
mash sketch -o genome.msh genome.fasta
# Multiple genomes
mash sketch -o reference_db.msh genomes/*.fasta
# From reads (with coverage)
mash sketch -m 2 -r -o reads.msh reads.fastq.gz
```
### Calculate Distance
```bash
# Pairwise distance
mash dist genome1.fasta genome2.fasta
# Query against database
mash dist reference_db.msh query.fasta > distances.tsv
# Screen for containment (metagenome)
mash screen reference_db.msh reads.fastq.gz > screen_results.tsv
```
### Interpret MASH Distance
| Distance | Interpretation |
|----------|----------------|
| < 0.05 | Same species/strain |
| 0.05-0.15 | Same species |
| 0.15-0.25 | Same genus |
| > 0.25 | Different genus |
### Cluster Genomes
```bash
# All-vs-all distances
mash triangle genomes/*.fasta > distances.phylip
# Build tree
mash triangle -E genomes/*.fasta > distances.tsv
```
## sourmash
### Installation
```bash
conda install -c bioconda sourmash
```
### Create Signatures
```bash
# Genome signature
sourmash sketch dna -p scaled=1000,k=31 genome.fasta -o genome.sig
# Multiple genomes
sourmash sketch dna -p scaled=1000,k=31 genomes/*.fasta -o genomes.sig
# Protein signatures
sourmash sketch protein -p scaled=100,k=10 proteins.faa -o proteins.sig
```
### Compare Signatures
```bash
# Pairwise comparison
sourmash compare *.sig -o comparison.npy --csv comparison.csv
# Search against database
sourmash search query.sig database.sig --threshold 0.8
# Gather (metagenome decomposition)
sourmash gather metagenome.sig database.sig -o gather_results.csv
```
### Taxonomy Assignment
```bash
# Download taxonomy database
sourmash database download gtdb-rs214-k31.zip
# Classify
sourmash lca classify --db gtdb-rs214-k31.lca.json.gz --query query.sig
# Summarize metagenome
sourmash lca summarize --db gtdb-rs214-k31.lca.json.gz --query metagenome.sig
```
## fastANI
### Installation
```bash
conda install -c bioconda fastani
```
### Calculate ANI
```bash
# Single pair
fastANI -q query.fasta -r reference.fasta -o ani_result.txt
# Query vs multiple references
fastANI -q query.fasta --rl reference_list.txt -o ani_results.txt
# All-vs-all
fastANI --ql genome_list.txt --rl genome_list.txt -o all_vs_all.txt --matrix
```
### Interpret ANI
| ANI | Interpretation |
|-----|----------------|
| >99% | Same strain |
| 95-99% | Same species |
| <95% | Different species |
## inStrain
For strain-level analysis in metagenomes.
### Installation
```bash
conda install -c bioconda instrain
```
### Profile Strains
```bash
# Map reads to reference
bowtie2 -x reference -1 reads_1.fq -2 reads_2.fq | \
samtools sort -o mapped.bam
# Profile with inStrain
inStrain profile mapped.bam reference.fasta -o instrain_output -p 8
```
### Compare Samples
```bash
# Profile multiple samples
for bam in sample*.bam; do
inStrain profile $bam reference.fasta -o ${bam%.bam}_IS -p 8
done
# Compare strain populations
inStrain compare -i sample*_IS -o comparison_IS -p 8
```
### Key Outputs
```bash
# SNV table
cat instrain_output/output/SNVs.tsv
# Gene-level info
cat instrain_output/output/gene_info.tsv
# Genome info
cat instrain_output/output/genome_info.tsv
```
## Complete Workflow: Outbreak Tracking
**Goal:** Identify potential outbreak clusters by computing pairwise genomic distances across isolate genomes using multiple complementary methods.
**Approach:** Sketch genomes with MASH for fast distance estimation, compute ANI with fastANI for accurate species-level resolution, compare sourmash signatures for containment analysis, and cluster close matches to identify transmission pairs.
```bash
#!/bin/bash
set -euo pipefail
GENOMES_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR
echo "=== MASH sketching ==="
mash sketch -o $OUTPUT_DIR/genomes.msh $GENOMES_DIR/*.fasta
echo "=== MASH distances ==="
mash dist $OUTPUT_DIR/genomes.msh $OUTPUT_DIR/genomes.msh > $OUTPUT_DIR/mash_distances.tsv
echo "=== fastANI ==="
ls $GENOMES_DIR/*.fasta > $OUTPUT_DIR/genome_list.txt
fastANI --ql $OUTPUT_DIR/genome_list.txt \
--rl $OUTPUT_DIR/genome_list.txt \
-o $OUTPUT_DIR/fastani_results.txt \
--matrix
echo "=== sourmash signatures ==="
sourmash sketch dna -p scaled=1000,k=31 $GENOMES_DIR/*.fasta -o $OUTPUT_DIR/all.sig
sourmash compare $OUTPUT_DIR/all.sig -o $OUTPUT_DIR/sourmash.npy --csv $OUTPUT_DIR/sourmash.csv
echo "=== Identify clusters ==="
python3 << 'EOF'
import pandas as pd
import numpy as np
# Load MASH distances
mash = pd.read_csv('${OUTPUT_DIR}/mash_distances.tsv', sep='\t', header=None,
names=['ref', 'query', 'distance', 'pvalue', 'shared'])
# Filter for close matches (potential outbreak cluster)
close = mash[(mash['distance'] < 0.001) & (mash['ref'] != mash['query'])]
print("Potential outbreak pairs (MASH distance < 0.001):")
print(close[['ref', 'query', 'distance']])
EOF
echo "=== Complete ==="
```
## Python Analysis
```python
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
# Load MASH distances
mash = pd.read_csv('mash_distances.tsv', sep='\t', header=None,
names=['ref', 'query', 'dist', 'pval', 'shared'])
# Pivot to matrix
samples = sorted(set(mash['ref'].tolist()))
dist_matrix = mash.pivot(index='ref', columns='query', values='dist').fillna(0)
dist_matrix = dist_matrix.loc[samples, samples]
# Cluster
condensed = squareform(dist_matrix.values)
Z = linkage(condensed, method='average')
# Cut tree at species level (0.05)
clusters = fcluster(Z, t=0.05, criterion='distance')
cluster_df = pd.DataFrame({'sample': samples, 'cluster': clusters})
print(cluster_df.groupby('cluster').size())
```
## Related Skills
- metagenomics/kraken-classification - Taxonomic classification
- genome-assembly/contamination-detection - Contamination screening
- phylogenetics/modern-tree-inference - Phylogenetic analysis
- metagenomics/metaphlan-profiling - Species profiling
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.