bio-genome-assembly-metagenome-assembly
$ npx mdskill add GPTomics/bioSkills/bio-genome-assembly-metagenome-assembly
Recover microbial genomes from complex metagenome samples.
- Reconstruct MAGs and resolve strain variation in mixed communities.
- Depends on metaFlye, metaSPAdes, QUAST, and SPAdes tools.
- Executes assembly pipelines based on read type and coverage.
- Outputs contigs and binning results for downstream analysis.
---
name: bio-genome-assembly-metagenome-assembly
description: Metagenome assembly from long reads (metaFlye) and short reads (metaSPAdes) with binning strategies. Use when reconstructing genomes from microbial communities, recovering metagenome-assembled genomes (MAGs), or resolving strain-level variation in complex samples.
tool_type: cli
primary_tool: metaFlye
---
## Version Compatibility
Reference examples tested with: QUAST 5.2+, SPAdes 3.15+, minimap2 2.26+, pandas 2.2+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Metagenome Assembly
**"Assemble genomes from my metagenome data"** → Reconstruct individual microbial genomes (MAGs) from mixed community sequencing reads using metagenome-aware assemblers and binning.
- CLI: `flye --meta --nano-raw reads.fq --out-dir flye_meta` (long-read), `metaspades.py -1 R1.fq -2 R2.fq -o spades_meta` (short-read)
## Overview
Metagenome assembly reconstructs genomes from mixed microbial communities. Long reads enable recovery of complete circular genomes and resolution of strain-level differences.
## metaFlye (Long Reads)
**Goal:** Assemble metagenome contigs from long reads handling uneven coverage across species.
**Approach:** Run Flye in --meta mode which accounts for varying coverage depths in mixed communities.
```bash
# ONT metagenome assembly
flye --nano-raw reads.fastq.gz \
--meta \
--out-dir flye_meta \
--threads 32
# PacBio HiFi metagenome
flye --pacbio-hifi reads.hifi.fastq.gz \
--meta \
--out-dir flye_meta_hifi \
--threads 32
# Key output files:
# assembly.fasta - assembled contigs
# assembly_graph.gfa - assembly graph
# assembly_info.txt - contig statistics
```
## metaSPAdes (Short Reads)
**Goal:** Assemble metagenome contigs from Illumina paired-end reads.
**Approach:** Run metaSPAdes which uses multi-kmer de Bruijn graph assembly optimized for metagenomes.
```bash
# Illumina paired-end metagenome
metaspades.py -1 R1.fastq.gz -2 R2.fastq.gz \
-o spades_meta \
-t 32 \
-m 500
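# With multiple libraries
# Note: the SPAdes 3.15 manual describes metaSPAdes as accepting a single
# paired-end library; confirm with `metaspades.py --help` that your version
# supports this form, or merge the libraries before assembly.
metaspades.py \
--pe1-1 lib1_R1.fq.gz --pe1-2 lib1_R2.fq.gz \
--pe2-1 lib2_R1.fq.gz --pe2-2 lib2_R2.fq.gz \
-o spades_meta -t 32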
```
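The choice between the two assemblers above comes down to read type. The following is a minimal sketch of that dispatch, assuming a hypothetical helper `build_assembly_command` and the read-type labels `ont`, `hifi`, and `illumina` (none of these names come from Flye or SPAdes). It only builds the argument list from the flags shown above and does not run anything.

```python
def build_assembly_command(read_type, reads, out_dir, threads=32):
    """Return the assembler argument list for a read type (hypothetical helper).

    read_type: 'ont', 'hifi', or 'illumina' (labels chosen for this sketch).
    reads: list with one long-read file, or [R1, R2] for paired-end Illumina.
    """
    flye_modes = {"ont": "--nano-raw", "hifi": "--pacbio-hifi"}
    if read_type in flye_modes:
        if len(reads) != 1:
            raise ValueError("long-read assembly expects exactly one read file")
        return ["flye", flye_modes[read_type], reads[0], "--meta",
                "--out-dir", out_dir, "--threads", str(threads)]
    if read_type == "illumina":
        if len(reads) != 2:
            raise ValueError("paired-end assembly expects R1 and R2")
        return ["metaspades.py", "-1", reads[0], "-2", reads[1],
                "-o", out_dir, "-t", str(threads)]
    raise ValueError(f"unknown read type: {read_type}")


# Build (but do not execute) the ONT command from the metaFlye section
cmd = build_assembly_command("ont", ["reads.fastq.gz"], "flye_meta")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed.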
## Hybrid Assembly
**Goal:** Combine long-read contiguity with short-read accuracy in metagenome assembly.
**Approach:** Assemble with metaFlye from long reads, then polish the assembly with Pilon using short reads.
```bash
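# Step 1: Assemble the long reads
flye --nano-raw ont_reads.fastq.gz \
--meta \
--out-dir flye_hybrid \
--threads 32
# Step 2: Map short reads to the assembly (Pilon needs a sorted, indexed BAM)
minimap2 -ax sr -t 32 flye_hybrid/assembly.fasta R1.fastq.gz R2.fastq.gz | \
samtools sort -o short_reads.bam -
samtools index short_reads.bam
# Step 3: Polish with short reads
pilon --genome flye_hybrid/assembly.fasta \
--frags short_reads.bam \
--output polished \
--threads 16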
```
## Key Parameters
### metaFlye
| Parameter | Description |
|-----------|-------------|
| --meta | Metagenome mode (handles uneven coverage) |
| --min-overlap | Minimum overlap for assembly (default: auto) |
| --genome-size | Estimated total size (optional for meta) |
| --iterations | Polishing iterations (default: 1) |
| --keep-haplotypes | Preserve strain variants |
### metaSPAdes
| Parameter | Description |
|-----------|-------------|
| -m | Memory limit in GB |
| --only-assembler | Skip error correction |
| -k | K-mer sizes (auto-selected by default) |
| --phred-offset | Quality encoding (33 or 64) |
## Binning Workflow
**Goal:** Recover individual genomes (MAGs) from a metagenome assembly.
**Approach:** Map reads back to the assembly for coverage, compute per-contig depth, bin with MetaBAT2, and assess quality with CheckM2.
**"Bin the contigs from my metagenome assembly into individual genomes"** → Map reads for coverage, cluster contigs by composition and coverage, then evaluate bins.
```bash
# Step 1: Map reads back to assembly
minimap2 -ax map-ont -t 32 assembly.fasta reads.fastq.gz | \
samtools sort -o mapped.bam -
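# Step 2: Generate depth file
# Note: the default --percentIdentity of 97 can discard most noisy long-read
# alignments; consider lowering it for raw ONT data.
jgi_summarize_bam_contig_depths --outputDepth depth.txt mapped.bam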
# Step 3: Bin with MetaBAT2
metabat2 -i assembly.fasta -a depth.txt -o bins/bin -t 32
# Step 4: Assess bin quality with CheckM2
checkm2 predict --input bins/ --output-directory checkm2_out -x fa --threads 32
```
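Before binning, it can help to check how much of the assembly MetaBAT2 will actually consider, since its default minimum contig length is 2500 bp (`-m`/`--minContig`). The sketch below assumes the depth file has the usual `contigName` and `contigLen` columns written by `jgi_summarize_bam_contig_depths`; the helper name `binnable_fraction` and the example rows are illustrative only.

```python
import csv
import io


def binnable_fraction(depth_handle, min_contig=2500):
    """Fraction of assembled bases in contigs at least min_contig bp long."""
    total = kept = 0
    for row in csv.DictReader(depth_handle, delimiter="\t"):
        length = int(float(row["contigLen"]))
        total += length
        if length >= min_contig:
            kept += length
    return kept / total if total else 0.0


# Illustrative depth table (made-up values, same layout as depth.txt)
example = (
    "contigName\tcontigLen\ttotalAvgDepth\tmapped.bam\tmapped.bam-var\n"
    "contig_1\t10000\t25.1\t25.1\t4.2\n"
    "contig_2\t2000\t3.0\t3.0\t1.1\n"
    "contig_3\t3000\t8.5\t8.5\t2.0\n"
)
fraction = binnable_fraction(io.StringIO(example))
print(f"Bases eligible for binning: {fraction:.1%}")
```

For a real run, replace the `io.StringIO` example with `open("depth.txt")`. A low fraction suggests the assembly is too fragmented for binning to recover much.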
## SemiBin2 (Deep Learning Binning)
**Goal:** Improve MAG recovery using deep learning-based contig binning.
**Approach:** Run SemiBin2, which uses a neural network on contig composition and coverage to assign contigs to bins. With `--environment`, it applies a model pretrained for that habitat instead of training on the sample.
```bash
# Single-sample binning
SemiBin2 single_easy_bin \
-i assembly.fasta \
-b mapped.bam \
-o semibin_out \
--environment global
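# Multi-sample binning (better for time-series)
# multi_easy_bin expects a combined FASTA whose contig names carry a sample
# prefix (see `SemiBin2 concatenate_fasta`), with each BAM mapped to that FASTA;
# combined_assembly.fasta is a placeholder name for that file
SemiBin2 multi_easy_bin \
-i combined_assembly.fasta \
-b sample1.bam sample2.bam sample3.bam \
-o semibin_multi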
```
## Quality Assessment
**Goal:** Evaluate assembly contiguity, bin completeness, and taxonomic composition.
**Approach:** Run seqkit for basic stats, CheckM2 for bin quality, GTDB-Tk for taxonomy, and MetaQUAST for assembly metrics.
```bash
# Assembly stats
seqkit stats assembly.fasta
# CheckM2 for bin completeness
checkm2 predict -i bins/ -o checkm2_out -x fa -t 32
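# GTDB-Tk for taxonomic classification (bins end in .fa; GTDB-Tk defaults to .fna)
# Note: some GTDB-Tk 2.x releases also require --skip_ani_screen or --mash_db
gtdbtk classify_wf --genome_dir bins/ --out_dir gtdbtk_out --extension fa --cpus 32
# MetaQUAST for assembly metrics
metaquast.py -o metaquast_out assembly.fasta -t 32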
```
## Circular Genome Detection
**Goal:** Identify complete circular genomes (e.g., bacterial chromosomes, plasmids) in the assembly.
**Approach:** Parse Flye's assembly_info.txt for circularity flags and extract matching contigs.
```bash
# Flye marks circular contigs with "Y" in the circ. column (column 4) of assembly_info.txt
awk -F'\t' '$4 == "Y" {print $1}' flye_meta/assembly_info.txt > circular_contigs.txt
# Extract circular contigs
seqkit grep -f circular_contigs.txt flye_meta/assembly.fasta > circular_genomes.fasta
```
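The same filter can be written in Python by looking up the `circ.` column by name, which avoids also matching a `Y` in the `repeat` column. This is a minimal sketch; the helper name `circular_contigs` and the example rows are illustrative, and the header follows the layout Flye writes to `assembly_info.txt`.

```python
import csv
import io


def circular_contigs(handle):
    """Return names of contigs whose circ. column is 'Y'."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    name_idx = header.index("seq_name")
    circ_idx = header.index("circ.")
    return [row[name_idx] for row in reader if row and row[circ_idx] == "Y"]


# Illustrative rows (made-up values): contig_2 is a repeat but not circular
example = (
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t4600000\t35\tY\tN\t1\t*\t1\n"
    "contig_2\t52000\t12\tN\tY\t2\t*\t2,-3\n"
)
names = circular_contigs(io.StringIO(example))
print(names)
```

For a real assembly, pass `open("flye_meta/assembly_info.txt")` instead of the in-memory example.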
## Python Pipeline
**Goal:** Provide a reusable Python workflow from metagenome assembly through binning to quality assessment.
**Approach:** Chain metaFlye assembly, MetaBAT2 binning, and CheckM2 quality filtering, returning high-quality MAGs.
```python
import subprocess
from pathlib import Path

import pandas as pd


def run_metaflye(reads, output_dir, read_type='nano-raw', threads=32):
    # Assemble long reads in metagenome mode; returns the contig FASTA path
    cmd = ['flye', f'--{read_type}', reads, '--meta', '--out-dir', output_dir, '--threads', str(threads)]
    subprocess.run(cmd, check=True)
    return Path(output_dir) / 'assembly.fasta'


def run_binning(assembly, bam, output_dir, threads=32):
    # bam must hold sorted alignments of the reads against `assembly`
    bins_dir = Path(output_dir) / 'bins'
    bins_dir.mkdir(parents=True, exist_ok=True)  # also creates output_dir
    depth_file = Path(output_dir) / 'depth.txt'
    subprocess.run(['jgi_summarize_bam_contig_depths', '--outputDepth', str(depth_file), bam], check=True)
    subprocess.run(['metabat2', '-i', assembly, '-a', str(depth_file), '-o', str(bins_dir / 'bin'), '-t', str(threads)], check=True)
    return bins_dir


def assess_bins(bins_dir, output_dir, threads=32):
    # Run CheckM2 and keep bins with >90% completeness and <5% contamination
    subprocess.run(['checkm2', 'predict', '--input', str(bins_dir), '--output-directory', output_dir, '-x', 'fa', '--threads', str(threads)], check=True)
    results = pd.read_csv(Path(output_dir) / 'quality_report.tsv', sep='\t')
    high_quality = results[(results['Completeness'] > 90) & (results['Contamination'] < 5)]
    return high_quality


# Example workflow
assembly = run_metaflye('ont_reads.fq.gz', 'flye_out')
# mapped.bam: reads mapped to flye_out/assembly.fasta (see Binning Workflow, step 1)
bins = run_binning(str(assembly), 'mapped.bam', 'binning_out')
hq_bins = assess_bins(bins, 'checkm2_out')
print(f'High-quality MAGs: {len(hq_bins)}')
```
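The pipeline above keeps only high-quality bins. A small sketch of a three-tier summary follows, using the commonly cited MIMAG completeness and contamination cut-offs (high: >90% and <5%; medium: ≥50% and <10%). It relies only on the `Completeness` and `Contamination` columns of CheckM2's `quality_report.tsv`. Treating everything else as "low" is a simplification for this sketch, and the full MIMAG high-quality definition also requires rRNA and tRNA genes, which this check does not cover. The function names and example rows are illustrative.

```python
import csv
import io


def classify_mag(completeness, contamination):
    """Assign a quality tier from completeness and contamination (percent)."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"  # simplification: everything else, including contaminated bins


def tier_counts(report_handle):
    """Count bins per tier from a CheckM2-style tab-separated report."""
    counts = {"high": 0, "medium": 0, "low": 0}
    for row in csv.DictReader(report_handle, delimiter="\t"):
        tier = classify_mag(float(row["Completeness"]), float(row["Contamination"]))
        counts[tier] += 1
    return counts


# Illustrative report (made-up values)
example = (
    "Name\tCompleteness\tContamination\n"
    "bin.1\t98.2\t1.1\n"
    "bin.2\t72.5\t3.9\n"
    "bin.3\t41.0\t0.5\n"
    "bin.4\t95.0\t12.0\n"
)
counts = tier_counts(io.StringIO(example))
print(counts)
```

Reporting medium-quality bins alongside high-quality ones gives a fuller picture of what the sample yielded when few genomes pass the strict filter.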
## Expected Outputs
| Metric | Good Assembly |
|--------|---------------|
| N50 | >50 kb |
| Largest contig | >1 Mb |
| HQ MAGs (>90% complete, <5% contam) | Varies by sample |
| Circular genomes | Sample dependent |
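N50 in the table above is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation, with hypothetical helper names and a made-up FASTA for illustration:

```python
import io


def contig_lengths(fasta_handle):
    """Return sequence lengths from a FASTA text handle."""
    lengths = []
    current = None
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Made-up contigs of 10, 6, and 4 bases: total 20, half is 10
example = ">c1\nACGTACGTAC\n>c2\nACGTAC\n>c3\nACGT\n"
lengths = contig_lengths(io.StringIO(example))
print(n50(lengths))  # → 10
```

On real data, `seqkit stats -a` or MetaQUAST report the same metric; the sketch is only meant to make the definition concrete.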
## Troubleshooting
| Issue | Solution |
|-------|----------|
| Few long contigs | Increase read depth or length |
| High chimeric rate | Use --keep-haplotypes in Flye |
| Poor binning | Add more samples for differential coverage |
| Missing taxa | Check read QC; consider targeted enrichment |
## Related Skills
- genome-assembly/contamination-detection - CheckM2/GUNC
- metagenomics/taxonomic-profiling - Kraken2/Bracken
- metagenomics/functional-profiling - HUMAnN
- long-read-sequencing/read-qc - Input quality control