bio-genome-assembly-metagenome-assembly

Name: bio-genome-assembly-metagenome-assembly
Author: GPTomics/bioSkills
$ npx mdskill add GPTomics/bioSkills/bio-genome-assembly-metagenome-assembly
Recover microbial genomes from complex metagenome samples.
Reconstruct MAGs and resolve strain variation in mixed communities.
Depends on metaFlye, metaSPAdes, QUAST, and SPAdes tools.
Executes assembly pipelines based on read type and coverage.
Outputs contigs and binning results for downstream analysis.
---
name: bio-genome-assembly-metagenome-assembly
description: Metagenome assembly with metaFlye (long reads) and metaSPAdes (short reads), plus binning strategies. Use when reconstructing genomes from microbial communities, recovering metagenome-assembled genomes (MAGs), or resolving strain-level variation in complex samples.
tool_type: cli
primary_tool: metaFlye
---

## Version Compatibility

Reference examples tested with: QUAST 5.2+, SPAdes 3.15+, minimap2 2.26+, pandas 2.2+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
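
The CLI check above can be scripted. The following is a minimal sketch, assuming each tool prints a dotted version number when called with `--version`; the helper names (`parse_version`, `meets_minimum`, `installed_version`, `check_tools`) are illustrative and not part of any of the tools, and the minimums are copied from the "tested with" line.

```python
import re
import shutil
import subprocess

# Minimum versions copied from the "tested with" line; executable names are assumptions.
MINIMUMS = {
    'minimap2': (2, 26),
    'samtools': (1, 19),
    'spades.py': (3, 15),
    'quast.py': (5, 2),
}

def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r'(\d+(?:\.\d+)+)', text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split('.'))

def meets_minimum(found, minimum):
    """Compare two version tuples after padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    padded_found = tuple(found) + (0,) * (width - len(found))
    padded_minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return padded_found >= padded_minimum

def installed_version(tool, flag='--version'):
    """Run `<tool> --version` and parse its output; None if the tool is not on PATH."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, flag], capture_output=True, text=True)
    # Some tools print the version to stderr, so search both streams.
    return parse_version(result.stdout + result.stderr)

def check_tools(minimums=MINIMUMS):
    """Map each tool to (found_version, is_new_enough)."""
    report = {}
    for tool, minimum in minimums.items():
        found = installed_version(tool)
        report[tool] = (found, found is not None and meets_minimum(found, minimum))
    return report
```

Calling `check_tools()` before a run gives a quick table of which tools are missing or older than the tested versions; a failed check is a prompt to read `<tool> --help`, not proof that the examples will break.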

# Metagenome Assembly

**"Assemble genomes from my metagenome data"** → Reconstruct individual microbial genomes (MAGs) from mixed community sequencing reads using metagenome-aware assemblers and binning.
- CLI: `flye --meta --nano-raw reads.fq --out-dir flye_meta` (long-read), `metaspades.py -1 R1.fq -2 R2.fq -o spades_meta` (short-read)

## Overview

Metagenome assembly reconstructs genomes from mixed microbial communities. Long reads enable recovery of complete circular genomes and resolution of strain-level differences.

## metaFlye (Long Reads)

**Goal:** Assemble metagenome contigs from long reads while handling uneven coverage across species.

**Approach:** Run Flye in `--meta` mode, which accounts for varying coverage depths in mixed communities.

```bash
# ONT metagenome assembly
flye --nano-raw reads.fastq.gz \
    --meta \
    --out-dir flye_meta \
    --threads 32

# PacBio HiFi metagenome
flye --pacbio-hifi reads.hifi.fastq.gz \
    --meta \
    --out-dir flye_meta_hifi \
    --threads 32

# Key output files:
# assembly.fasta - assembled contigs
# assembly_graph.gfa - assembly graph
# assembly_info.txt - contig statistics
```

## metaSPAdes (Short Reads)

**Goal:** Assemble metagenome contigs from Illumina paired-end reads.

**Approach:** Run metaSPAdes, which uses multi-k-mer de Bruijn graph assembly optimized for metagenomes.

```bash
# Illumina paired-end metagenome
metaspades.py -1 R1.fastq.gz -2 R2.fastq.gz \
    -o spades_meta \
    -t 32 \
    -m 500

# With multiple libraries
metaspades.py \
    --pe1-1 lib1_R1.fq.gz --pe1-2 lib1_R2.fq.gz \
    --pe2-1 lib2_R1.fq.gz --pe2-2 lib2_R2.fq.gz \
    -o spades_meta -t 32
```

## Hybrid Assembly

**Goal:** Combine long-read contiguity with short-read accuracy in metagenome assembly.

**Approach:** Assemble with metaFlye from long reads, then polish the assembly with Pilon using short reads.

```bash
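# Step 1: Assemble the long reads with metaFlye
flye --nano-raw ont_reads.fastq.gz \
    --meta \
    --out-dir flye_hybrid \
    --threads 32

# Step 2: Map short reads to the assembly (Pilon needs a sorted, indexed BAM)
minimap2 -ax sr -t 32 flye_hybrid/assembly.fasta R1.fastq.gz R2.fastq.gz | \
    samtools sort -o short_reads.bam -
samtools index short_reads.bam

# Step 3: Polish with short reads (writes polished.fasta)
pilon --genome flye_hybrid/assembly.fasta \
    --frags short_reads.bam \
    --output polished \
    --threads 16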
```

## Key Parameters

### metaFlye

| Parameter | Description |
|-----------|-------------|
| --meta | Metagenome mode (handles uneven coverage) |
| --min-overlap | Minimum overlap for assembly (default: auto) |
| --genome-size | Estimated total size (optional for meta) |
| --iterations | Polishing iterations (default: 1) |
| --keep-haplotypes | Preserve strain variants |

### metaSPAdes

| Parameter | Description |
|-----------|-------------|
| -m | Memory limit in GB |
| --only-assembler | Skip error correction |
| -k | K-mer sizes (auto-selected by default) |
| --phred-offset | Quality encoding (33 or 64) |
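
The parameters in both tables can be assembled into command lines programmatically. The sketch below is one hedged way to do that: the function names `flye_meta_command` and `metaspades_command` are hypothetical, the flags are the ones listed above, and the k-mer check (odd values below 128) reflects SPAdes's documented constraint as I understand it. The functions only build argument lists; nothing is executed.

```python
# Read-type flags accepted by Flye (the leading "--" is added below).
FLYE_READ_TYPES = {
    'nano-raw', 'nano-corr', 'nano-hq',
    'pacbio-raw', 'pacbio-corr', 'pacbio-hifi',
}

def flye_meta_command(reads, out_dir, read_type='nano-raw', threads=32,
                      keep_haplotypes=False, iterations=None):
    """Build the argument list for a metaFlye run."""
    if read_type not in FLYE_READ_TYPES:
        raise ValueError(f'unknown Flye read type: {read_type}')
    cmd = ['flye', f'--{read_type}', reads, '--meta',
           '--out-dir', out_dir, '--threads', str(threads)]
    if iterations is not None:
        cmd += ['--iterations', str(iterations)]
    if keep_haplotypes:
        cmd.append('--keep-haplotypes')
    return cmd

def metaspades_command(r1, r2, out_dir, threads=32, memory_gb=None,
                       kmers=None, only_assembler=False):
    """Build the argument list for a paired-end metaSPAdes run."""
    cmd = ['metaspades.py', '-1', r1, '-2', r2, '-o', out_dir, '-t', str(threads)]
    if memory_gb is not None:
        cmd += ['-m', str(memory_gb)]
    if kmers:
        # SPAdes expects a comma-separated list of odd k-mer sizes below 128.
        if any(k % 2 == 0 or k >= 128 for k in kmers):
            raise ValueError('k-mer sizes must be odd and smaller than 128')
        cmd += ['-k', ','.join(str(k) for k in kmers)]
    if only_assembler:
        cmd.append('--only-assembler')
    return cmd
```

Either list can be passed to `subprocess.run(cmd, check=True)`. Building the list separately makes it easy to log the exact command before a multi-hour assembly starts.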

## Binning Workflow

**Goal:** Recover individual genomes (MAGs) from a metagenome assembly.

**Approach:** Map reads back to the assembly for coverage, compute per-contig depth, bin with MetaBAT2, and assess quality with CheckM2.

**"Bin the contigs from my metagenome assembly into individual genomes"** → Map reads for coverage, cluster contigs by composition and coverage, then evaluate bins.

```bash
# Step 1: Map reads back to assembly
minimap2 -ax map-ont -t 32 assembly.fasta reads.fastq.gz | \
    samtools sort -o mapped.bam -

# Step 2: Generate depth file
jgi_summarize_bam_contig_depths --outputDepth depth.txt mapped.bam

# Step 3: Bin with MetaBAT2
metabat2 -i assembly.fasta -a depth.txt -o bins/bin -t 32

# Step 4: Assess bin quality with CheckM2
checkm2 predict --input bins/ --output-directory checkm2_out -x fa --threads 32
```
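
Before binning, it can help to check how much of the assembly is long enough to be binned at all. The sketch below reads the depth table from Step 2, assuming its usual tab-separated layout with `contigName`, `contigLen`, and `totalAvgDepth` columns. The function name `summarize_depth` is hypothetical, and the 2500 bp cutoff is my assumption about MetaBAT2's default minimum contig length; confirm it with `metabat2 --help`.

```python
import csv
import io

def summarize_depth(handle, min_contig=2500):
    """Summarize a jgi_summarize_bam_contig_depths table read from a file handle.

    Returns contig counts, the bases eligible for binning, and the
    length-weighted mean depth of the eligible contigs.
    """
    reader = csv.DictReader(handle, delimiter='\t')
    total = eligible = eligible_bp = 0
    weighted_depth = 0.0
    for row in reader:
        length = int(float(row['contigLen']))
        depth = float(row['totalAvgDepth'])
        total += 1
        if length >= min_contig:
            eligible += 1
            eligible_bp += length
            weighted_depth += depth * length
    mean_depth = weighted_depth / eligible_bp if eligible_bp else 0.0
    return {
        'contigs': total,
        'eligible_contigs': eligible,
        'eligible_bp': eligible_bp,
        'mean_depth': mean_depth,
    }

# Tiny made-up table, for illustration only.
example = io.StringIO(
    'contigName\tcontigLen\ttotalAvgDepth\tmapped.bam\tmapped.bam-var\n'
    'contig_1\t10000\t20.0\t20.0\t4.0\n'
    'contig_2\t3000\t10.0\t10.0\t2.0\n'
    'contig_3\t1200\t5.0\t5.0\t1.0\n'
)
summary = summarize_depth(example)
```

On real data, call it as `summarize_depth(open('depth.txt'))`. A small eligible fraction usually points to a fragmented assembly, not a binning problem.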

## SemiBin2 (Deep Learning Binning)

**Goal:** Improve MAG recovery using deep learning-based contig binning.

**Approach:** Run SemiBin2, which applies a neural network (pretrained for a chosen environment, or trained on the sample itself) to contig composition and coverage for more accurate bin assignments.

```bash
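# Single-sample binning (uses the pretrained "global" model)
# For long-read assemblies, check `SemiBin2 single_easy_bin --help` for --sequencing-type
SemiBin2 single_easy_bin \
    -i assembly.fasta \
    -b mapped.bam \
    -o semibin_out \
    --environment global

# Multi-sample binning (better for time-series)
# -i must be the combined contig file produced by `SemiBin2 concatenate_fasta`,
# and each BAM holds one sample's reads mapped to that combined file
SemiBin2 multi_easy_bin \
    -i concatenated.fa.gz \
    -b sample1.bam sample2.bam sample3.bam \
    -o semibin_multi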
```

## Quality Assessment

**Goal:** Evaluate assembly contiguity, bin completeness, and taxonomic composition.

**Approach:** Run seqkit for basic stats, CheckM2 for bin quality, GTDB-Tk for taxonomy, and MetaQUAST for assembly metrics.

```bash
# Assembly stats
seqkit stats assembly.fasta

# CheckM2 for bin completeness
checkm2 predict -i bins/ -o checkm2_out -x fa -t 32

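# GTDB-Tk for taxonomic classification
# --extension fa matches the MetaBAT2 output; GTDB-Tk looks for .fna by default.
# GTDB-Tk 2.x may also require --skip_ani_screen or --mash_db; check --help.
gtdbtk classify_wf --genome_dir bins/ --out_dir gtdbtk_out --extension fa --cpus 32

# MetaQUAST for assembly metrics
metaquast.py -o metaquast_out assembly.fasta -t 32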
```

## Circular Genome Detection

**Goal:** Identify complete circular genomes (e.g., bacterial chromosomes, plasmids) in the assembly.

**Approach:** Parse Flye's assembly_info.txt for circularity flags and extract matching contigs.

```bash
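# Flye marks circular contigs with "Y" in column 4 (circ.) of assembly_info.txt.
# Match that column only: column 5 (repeat) also holds Y/N values.
awk -F'\t' '$4 == "Y" {print $1}' flye_meta/assembly_info.txt > circular_contigs.txt

# Extract circular contigs
seqkit grep -f circular_contigs.txt flye_meta/assembly.fasta > circular_genomes.fasta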
```
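
The same selection can be sketched in Python, which makes it easy to add a length filter (for example, to separate likely chromosomes from plasmids). This assumes Flye's tab-separated `assembly_info.txt` layout, with the contig name in column 1, length in column 2, and the circularity flag in column 4; the function name `circular_contigs` and the sample rows are illustrative only.

```python
def circular_contigs(lines, min_length=0):
    """Return names of contigs flagged circular in Flye's assembly_info.txt.

    `lines` is any iterable of text lines (an open file works).
    """
    names = []
    for line in lines:
        # Skip blank lines and the header, which starts with "#seq_name".
        if not line.strip() or line.startswith('#'):
            continue
        fields = line.rstrip('\n').split('\t')
        name, length, circ = fields[0], int(fields[1]), fields[3]
        if circ == 'Y' and length >= min_length:
            names.append(name)
    return names

# Made-up rows in the assembly_info.txt layout, for illustration only.
sample = [
    '#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n',
    'contig_1\t2500000\t35\tY\tN\t1\t*\t1\n',
    'contig_2\t50000\t80\tN\tY\t2\t*\t2\n',   # repeat, but not circular
    'contig_3\t4500\t120\tY\tN\t1\t*\t3\n',
]
all_circular = circular_contigs(sample)
large_circular = circular_contigs(sample, min_length=100_000)
```

Circularity alone does not establish that a contig is a complete genome; checking the circular contigs with CheckM2 or a plasmid classifier is still advisable.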

## Python Pipeline

**Goal:** Provide a reusable Python workflow from metagenome assembly through binning to quality assessment.

**Approach:** Chain metaFlye assembly, MetaBAT2 binning, and CheckM2 quality filtering, returning high-quality MAGs.

```python
import subprocess
from pathlib import Path
import pandas as pd

def run_metaflye(reads, output_dir, read_type='nano-raw', threads=32):
    """Assemble long reads with Flye in --meta mode; return the contig FASTA path."""
    cmd = ['flye', f'--{read_type}', reads, '--meta', '--out-dir', output_dir, '--threads', str(threads)]
    subprocess.run(cmd, check=True)
    return Path(output_dir) / 'assembly.fasta'

def run_binning(assembly, bam, output_dir, threads=32):
    """Compute per-contig depth from a sorted BAM, then bin contigs with MetaBAT2."""
    # Create the output tree first: the depth file is written into it
    bins_dir = Path(output_dir) / 'bins'
    bins_dir.mkdir(parents=True, exist_ok=True)

    depth_file = Path(output_dir) / 'depth.txt'
    subprocess.run(['jgi_summarize_bam_contig_depths', '--outputDepth', str(depth_file), str(bam)], check=True)
    subprocess.run(['metabat2', '-i', str(assembly), '-a', str(depth_file), '-o', str(bins_dir / 'bin'), '-t', str(threads)], check=True)

    return bins_dir

def assess_bins(bins_dir, output_dir, threads=32):
    """Run CheckM2 and return bins with >90% completeness and <5% contamination."""
    subprocess.run(['checkm2', 'predict', '--input', str(bins_dir), '--output-directory', str(output_dir), '-x', 'fa', '--threads', str(threads)], check=True)

    results = pd.read_csv(Path(output_dir) / 'quality_report.tsv', sep='\t')
    high_quality = results[(results['Completeness'] > 90) & (results['Contamination'] < 5)]
    return high_quality

# Example workflow
assembly = run_metaflye('ont_reads.fq.gz', 'flye_out')
# mapped.bam must hold the reads mapped back to this assembly and sorted
# (see Binning Workflow, Step 1); it is not created by the functions above
bins = run_binning(assembly, 'mapped.bam', 'binning_out')
hq_bins = assess_bins(bins, 'checkm2_out')
print(f'High-quality MAGs: {len(hq_bins)}')
```
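
The pipeline keeps only high-quality bins, but medium-quality MAGs are often still worth reporting. The sketch below sorts bins into tiers using the commonly cited MIMAG completeness and contamination cutoffs. It is a simplification, since the full MIMAG high-quality definition also requires rRNA and tRNA genes, which are not checked here. It uses the standard-library `csv` module so it runs without pandas, and assumes the `Name`, `Completeness`, and `Contamination` columns of CheckM2's `quality_report.tsv`. The function names and the "contaminated" label are hypothetical.

```python
import csv
import io

def mimag_tier(completeness, contamination):
    """Classify one bin by completeness and contamination (both in percent)."""
    if completeness > 90 and contamination < 5:
        return 'high'
    if completeness >= 50 and contamination < 10:
        return 'medium'
    if contamination < 10:
        return 'low'
    # Bins at or above 10% contamination fall outside the MIMAG draft tiers.
    return 'contaminated'

def tier_bins(handle):
    """Group bin names from a CheckM2 quality report by quality tier."""
    tiers = {'high': [], 'medium': [], 'low': [], 'contaminated': []}
    for row in csv.DictReader(handle, delimiter='\t'):
        tier = mimag_tier(float(row['Completeness']), float(row['Contamination']))
        tiers[tier].append(row['Name'])
    return tiers

# Made-up report rows, for illustration only.
report = io.StringIO(
    'Name\tCompleteness\tContamination\n'
    'bin.1\t98.5\t1.2\n'
    'bin.2\t72.0\t3.4\n'
    'bin.3\t35.0\t0.5\n'
    'bin.4\t95.0\t12.0\n'
)
tiers = tier_bins(report)
```

On real output, pass `open('checkm2_out/quality_report.tsv')` instead of the in-memory example.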

## Expected Outputs

| Metric | Good Assembly |
|--------|---------------|
| N50 | >50 kb |
| Largest contig | >1 Mb |
| HQ MAGs (>90% complete, <5% contam) | Varies by sample |
| Circular genomes | Sample dependent |
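
For reference, N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. A minimal sketch of that calculation follows; `n50` and `fasta_lengths` are illustrative helper names, and the FASTA reader handles only plain uncompressed text.

```python
import io

def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0

def fasta_lengths(handle):
    """Return sequence lengths from an uncompressed FASTA file handle."""
    lengths = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

# Lengths 10, 9, 8 sum to 27, which is half of the 54 total, so N50 is 8.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
example_lengths = fasta_lengths(io.StringIO('>a\nACGT\nAC\n>b\nGG\n'))
```

Metagenome N50 values depend heavily on community complexity and sequencing depth, so the threshold in the table is best read as a rough guide for long-read data.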

## Troubleshooting

| Issue | Solution |
|-------|----------|
| Few long contigs | Increase read depth or length |
| High chimeric rate | Use --keep-haplotypes in Flye |
| Poor binning | Add more samples for differential coverage |
| Missing taxa | Check read QC; consider targeted enrichment |

## Related Skills

- genome-assembly/contamination-detection - CheckM2/GUNC
- metagenomics/taxonomic-profiling - Kraken2/Bracken
- metagenomics/functional-profiling - HUMAnN
- long-read-sequencing/read-qc - Input quality control