bio-workflows-metagenomics-pipeline
$
npx mdskill add GPTomics/bioSkills/bio-workflows-metagenomics-pipelineOrchestrate metagenomic analysis from FASTQ to functional profiles.
- Converts raw sequencing reads into taxonomic and functional datasets.
- Depends on Kraken2, Bracken, and HUMAnN for core analysis.
- Executes steps only when quality thresholds are met at checkpoints.
- Outputs structured taxonomic and functional profile reports.
SKILL.md
.github/skills/bio-workflows-metagenomics-pipelineView on GitHub ↗
---
name: bio-workflows-metagenomics-pipeline
description: End-to-end metagenomics workflow from FASTQ to taxonomic and functional profiles. Covers Kraken2 classification, Bracken abundance estimation, and HUMAnN functional profiling. Use when profiling metagenomic samples.
tool_type: cli
primary_tool: Kraken2
workflow: true
depends_on:
- read-qc/fastp-workflow
- metagenomics/kraken-classification
- metagenomics/metaphlan-profiling
- metagenomics/abundance-estimation
- metagenomics/functional-profiling
- metagenomics/metagenome-visualization
qc_checkpoints:
- after_qc: "Q30 >80%, host reads removed"
- after_classification: "Classification rate >60%, known taxa dominant"
- after_functional: "Pathway coverage reasonable, unmapped <50%"
---
## Version Compatibility
Reference examples tested with: Bowtie2 2.5.3+, Bracken 2.9+, HUMAnN 3.8+, Kraken2 2.1+, MetaPhlAn 4.1+, fastp 0.23+, matplotlib 3.8+, pandas 2.2+, seaborn 0.13+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Metagenomics Pipeline
**"Analyze my metagenomic samples from FASTQ to taxonomic and functional profiles"** → Orchestrate host depletion, Kraken2/Bracken taxonomic classification, MetaPhlAn profiling, HUMAnN3 functional analysis, and AMR gene detection.
Complete workflow from metagenomic FASTQ to taxonomic and functional profiles.
## Workflow Overview
```
FASTQ files
|
v
[1. QC & Host Removal] --> fastp + Bowtie2
|
v
[2. Taxonomic Classification]
|
+---> Kraken2 + Bracken (fast, database-dependent)
|
+---> MetaPhlAn (marker-based, standardized)
|
v
[3. Functional Profiling] --> HUMAnN
|
v
Taxonomic profiles + Pathway abundances
```
## Primary Path: Kraken2 + Bracken + HUMAnN
### Step 1: Quality Control and Host Removal
```bash
# QC with fastp
for sample in sample1 sample2 sample3; do
fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
-o trimmed/${sample}_R1.fq.gz -O trimmed/${sample}_R2.fq.gz \
--detect_adapter_for_pe \
--qualified_quality_phred 20 \
--length_required 50 \
--html qc/${sample}_fastp.html
done
# Remove host reads (human example)
for sample in sample1 sample2 sample3; do
bowtie2 -p 8 -x human_index \
-1 trimmed/${sample}_R1.fq.gz \
-2 trimmed/${sample}_R2.fq.gz \
--un-conc-gz host_removed/${sample}_R%.fq.gz \
> /dev/null 2> qc/${sample}_host_removal.log
done
```
### Step 2A: Kraken2 Classification
```bash
# Classify reads
for sample in sample1 sample2 sample3; do
kraken2 --db kraken2_db \
--threads 8 \
--paired \
--report kraken/${sample}.report \
--output kraken/${sample}.output \
host_removed/${sample}_R1.fq.gz \
host_removed/${sample}_R2.fq.gz
done
```
### Step 2B: Bracken Abundance Estimation
```bash
# Estimate species abundance
for sample in sample1 sample2 sample3; do
bracken -d kraken2_db \
-i kraken/${sample}.report \
-o bracken/${sample}.species.txt \
-r 150 \
-l S \
-t 10
done
# Combine samples into abundance matrix
combine_bracken_outputs.py \
--files bracken/*.species.txt \
-o bracken/combined_species.txt
```
### Step 2C: Alternative - MetaPhlAn Profiling
```bash
# Profile with MetaPhlAn 4
for sample in sample1 sample2 sample3; do
metaphlan host_removed/${sample}_R1.fq.gz,host_removed/${sample}_R2.fq.gz \
--bowtie2out metaphlan/${sample}.bowtie2.bz2 \
--input_type fastq \
--nproc 8 \
-o metaphlan/${sample}_profile.txt
done
# Merge profiles
merge_metaphlan_tables.py metaphlan/*_profile.txt > metaphlan/merged_abundance.txt
```
### Step 3: Functional Profiling with HUMAnN
```bash
# Run HUMAnN
for sample in sample1 sample2 sample3; do
# Concatenate paired reads
cat host_removed/${sample}_R1.fq.gz host_removed/${sample}_R2.fq.gz > \
host_removed/${sample}_concat.fq.gz
humann --input host_removed/${sample}_concat.fq.gz \
--output humann/${sample} \
--threads 8 \
--metaphlan-options "--bowtie2db metaphlan_db"
done
# Normalize and join tables
humann_renorm_table --input humann/sample1/sample1_pathabundance.tsv \
--output humann/sample1/sample1_pathabundance_cpm.tsv \
--units cpm
humann_join_tables --input humann \
--output humann/merged_pathabundance.tsv \
--file_name pathabundance
```
### Visualization
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load Bracken species table
species = pd.read_csv('bracken/combined_species.txt', sep='\t', index_col=0)
# Top 20 species heatmap
top20 = species.sum(axis=1).nlargest(20).index
plt.figure(figsize=(12, 8))
sns.heatmap(species.loc[top20], cmap='viridis', annot=False)
plt.title('Top 20 Species Abundance')
plt.tight_layout()
plt.savefig('top20_species_heatmap.pdf')
# Stacked bar plot
species_norm = species.div(species.sum()) * 100
top10 = species_norm.sum(axis=1).nlargest(10).index
other = species_norm.loc[~species_norm.index.isin(top10)].sum()
plot_data = species_norm.loc[top10].T
plot_data['Other'] = other
plot_data.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.ylabel('Relative Abundance (%)')
plt.legend(bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.savefig('species_barplot.pdf')
```
## Parameter Recommendations
| Step | Parameter | Value |
|------|-----------|-------|
| fastp | --length_required | 50 (metagenomic reads) |
| Kraken2 | --confidence | 0.0 (default) or 0.1 |
| Bracken | -r | Read length (e.g., 150) |
| Bracken | -l | S (species) or G (genus) |
| Bracken | -t | 10 (min reads threshold) |
| MetaPhlAn | --min_cu_len | 2000 (default) |
| HUMAnN | --threads | 8+ |
## Troubleshooting
| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| Low classification rate | Database mismatch, novel organisms | Try different database, check sample type |
| High unclassified | Novel microbes, host contamination | Remove host, use larger database |
| High host reads | Incomplete host removal | Use multiple host reference genomes |
| HUMAnN slow | Large files | Increase threads, pre-filter reads |
## Complete Pipeline Script
```bash
#!/bin/bash
set -e
THREADS=8
KRAKEN_DB="kraken2_standard_db"
HOST_INDEX="human_bt2_index"
SAMPLES="sample1 sample2 sample3"
OUTDIR="metagenomics_results"
mkdir -p ${OUTDIR}/{trimmed,host_removed,kraken,bracken,metaphlan,humann,qc}
# Step 1: QC
echo "=== QC ==="
for sample in $SAMPLES; do
fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
-o ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
-O ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
--length_required 50 \
--html ${OUTDIR}/qc/${sample}_fastp.html -w ${THREADS}
done
# Host removal
echo "=== Host Removal ==="
for sample in $SAMPLES; do
bowtie2 -p ${THREADS} -x ${HOST_INDEX} \
-1 ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
-2 ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
--un-conc-gz ${OUTDIR}/host_removed/${sample}_R%.fq.gz \
> /dev/null 2> ${OUTDIR}/qc/${sample}_host.log
done
# Step 2: Kraken2
echo "=== Kraken2 ==="
for sample in $SAMPLES; do
kraken2 --db ${KRAKEN_DB} --threads ${THREADS} --paired \
--report ${OUTDIR}/kraken/${sample}.report \
--output ${OUTDIR}/kraken/${sample}.output \
${OUTDIR}/host_removed/${sample}_R1.fq.gz \
${OUTDIR}/host_removed/${sample}_R2.fq.gz
done
# Bracken
echo "=== Bracken ==="
for sample in $SAMPLES; do
bracken -d ${KRAKEN_DB} \
-i ${OUTDIR}/kraken/${sample}.report \
-o ${OUTDIR}/bracken/${sample}.species.txt \
-r 150 -l S -t 10
done
echo "=== Pipeline Complete ==="
echo "Kraken reports: ${OUTDIR}/kraken/"
echo "Bracken abundances: ${OUTDIR}/bracken/"
```
## Related Skills
- metagenomics/kraken-classification - Kraken2 details
- metagenomics/metaphlan-profiling - MetaPhlAn parameters
- metagenomics/abundance-estimation - Bracken options
- metagenomics/functional-profiling - HUMAnN workflow
- metagenomics/metagenome-visualization - Plotting functions
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.