bio-genome-assembly-metagenome-assembly

$ npx mdskill add GPTomics/bioSkills/bio-genome-assembly-metagenome-assembly

Recover microbial genomes from complex metagenome samples.

  • Reconstruct MAGs and resolve strain variation in mixed communities.
  • Depends on metaFlye, metaSPAdes, QUAST, and SPAdes tools.
  • Executes assembly pipelines based on read type and coverage.
  • Outputs contigs and binning results for downstream analysis.
---
name: bio-genome-assembly-metagenome-assembly
description: Metagenome assembly using metaFlye (long reads) and metaSPAdes (short reads) with binning strategies. Use when reconstructing genomes from microbial communities, recovering metagenome-assembled genomes (MAGs), or resolving strain-level variation in complex samples.
tool_type: cli
primary_tool: metaFlye
---

## Version Compatibility

Reference examples tested with: QUAST 5.2+, SPAdes 3.15+, minimap2 2.26+, pandas 2.2+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
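
The CLI check above can be scripted. The following is a minimal sketch, assuming each tool prints a dotted version number in response to `--version`; the helper names (`parse_version`, `meets_minimum`, `check_tool`) are illustrative and not part of any of the tools.

```python
import re
import shutil
import subprocess

def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))

def meets_minimum(found, minimum):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    found = found + (0,) * (width - len(found))
    minimum = minimum + (0,) * (width - len(minimum))
    return found >= minimum

def check_tool(tool, minimum):
    """Run `<tool> --version` and describe how it compares to the tested minimum."""
    if shutil.which(tool) is None:
        return f"{tool}: not found on PATH"
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    found = parse_version(result.stdout + result.stderr)
    if found is None:
        return f"{tool}: could not parse version output"
    status = "ok" if meets_minimum(found, minimum) else "older than tested"
    return f"{tool}: {'.'.join(map(str, found))} ({status})"

# Minimum versions copied from the tested list above.
TESTED_MINIMUMS = {"minimap2": (2, 26), "samtools": (1, 19)}

if __name__ == "__main__":
    for tool, minimum in TESTED_MINIMUMS.items():
        print(check_tool(tool, minimum))
```

A tool that reports its version in another way (for example only via `-v`) would need its own entry point, so treat the output as a hint and confirm with `<tool> --help`.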

# Metagenome Assembly

**"Assemble genomes from my metagenome data"** → Reconstruct individual microbial genomes (MAGs) from mixed community sequencing reads using metagenome-aware assemblers and binning.
- CLI: `flye --meta --nano-raw reads.fq` (long-read), `metaspades.py -1 R1.fq -2 R2.fq` (short-read)

## Overview

Metagenome assembly reconstructs genomes from mixed microbial communities. Long reads enable recovery of complete circular genomes and resolution of strain-level differences.
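
Choosing the assembler by read type can be expressed as a small dispatch function. This is a sketch only: `build_assembly_command` and the `illumina-pe` label are hypothetical names introduced here, while the flags are the ones used in this document's own examples.

```python
def build_assembly_command(read_type, reads, out_dir, threads=32):
    """Return an argument list for the assembler that matches the read type.

    read_type: 'nano-raw' or 'pacbio-hifi' (Flye flags used in this document),
               or 'illumina-pe' (a hypothetical label for paired-end short reads).
    reads:     list with one file for long reads, or two files (R1, R2) for paired-end.
    """
    if read_type in ("nano-raw", "pacbio-hifi"):
        if len(reads) != 1:
            raise ValueError("long-read assembly expects exactly one read file")
        return ["flye", f"--{read_type}", reads[0], "--meta",
                "--out-dir", out_dir, "--threads", str(threads)]
    if read_type == "illumina-pe":
        if len(reads) != 2:
            raise ValueError("paired-end assembly expects R1 and R2 files")
        return ["metaspades.py", "-1", reads[0], "-2", reads[1],
                "-o", out_dir, "-t", str(threads)]
    raise ValueError(f"unsupported read type: {read_type}")

if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(build_assembly_command("nano-raw", ["reads.fastq.gz"], "flye_meta")))
```

Returning a list rather than a shell string lets the result go straight to `subprocess.run(cmd, check=True)` without shell quoting concerns.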

## metaFlye (Long Reads)

**Goal:** Assemble metagenome contigs from long reads while handling uneven coverage across species.

**Approach:** Run Flye in `--meta` mode, which accounts for varying coverage depths in mixed communities.

```bash
# ONT metagenome assembly
flye --nano-raw reads.fastq.gz \
    --meta \
    --out-dir flye_meta \
    --threads 32

# PacBio HiFi metagenome
flye --pacbio-hifi reads.hifi.fastq.gz \
    --meta \
    --out-dir flye_meta_hifi \
    --threads 32

# Key output files:
# assembly.fasta - assembled contigs
# assembly_graph.gfa - assembly graph
# assembly_info.txt - contig statistics
```

## metaSPAdes (Short Reads)

**Goal:** Assemble metagenome contigs from Illumina paired-end reads.

**Approach:** Run metaSPAdes, which uses multi-k-mer de Bruijn graph assembly optimized for metagenomes.

```bash
# Illumina paired-end metagenome
metaspades.py -1 R1.fastq.gz -2 R2.fastq.gz \
    -o spades_meta \
    -t 32 \
    -m 500

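# With multiple libraries: metaSPAdes (at least through SPAdes 3.15) accepts
# only one paired-end library, so pool the libraries first
# (concatenated gzip files are still valid gzip)
cat lib1_R1.fq.gz lib2_R1.fq.gz > pooled_R1.fq.gz
cat lib1_R2.fq.gz lib2_R2.fq.gz > pooled_R2.fq.gz
metaspades.py -1 pooled_R1.fq.gz -2 pooled_R2.fq.gz \
    -o spades_meta_pooled -t 32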
```

## Hybrid Assembly

**Goal:** Combine long-read contiguity with short-read accuracy in metagenome assembly.

**Approach:** Assemble with metaFlye from long reads, then polish the assembly with Pilon using short reads.

```bash
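# Step 1: Assemble long reads with metaFlye
flye --nano-raw ont_reads.fastq.gz \
    --meta \
    --out-dir flye_hybrid \
    --threads 32

# Step 2: Map short reads to the assembly (Pilon needs a sorted, indexed BAM)
minimap2 -ax sr -t 16 flye_hybrid/assembly.fasta R1.fastq.gz R2.fastq.gz | \
    samtools sort -o short_reads.bam -
samtools index short_reads.bam

# Step 3: Polish with short reads (writes polished.fasta)
pilon --genome flye_hybrid/assembly.fasta \
    --frags short_reads.bam \
    --output polished \
    --threads 16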
```

## Key Parameters

### metaFlye

| Parameter | Description |
|-----------|-------------|
| --meta | Metagenome mode (handles uneven coverage) |
| --min-overlap | Minimum overlap for assembly (default: auto) |
| --genome-size | Estimated total size (optional for meta) |
| --iterations | Polishing iterations (default: 1) |
| --keep-haplotypes | Preserve strain variants |

### metaSPAdes

| Parameter | Description |
|-----------|-------------|
| -m | Memory limit in GB |
| --only-assembler | Skip error correction |
| -k | K-mer sizes (auto-selected by default) |
| --phred-offset | Quality encoding (33 or 64) |

## Binning Workflow

**Goal:** Recover individual genomes (MAGs) from a metagenome assembly.

**Approach:** Map reads back to the assembly for coverage, compute per-contig depth, bin with MetaBAT2, and assess quality with CheckM2.

"Bin the contigs from my metagenome assembly into individual genomes" --> Map reads for coverage, cluster contigs by composition and coverage, then evaluate bins.

```bash
# Step 1: Map reads back to assembly
minimap2 -ax map-ont -t 32 assembly.fasta reads.fastq.gz | \
    samtools sort -o mapped.bam -

# Step 2: Generate depth file
jgi_summarize_bam_contig_depths --outputDepth depth.txt mapped.bam

# Step 3: Bin with MetaBAT2
metabat2 -i assembly.fasta -a depth.txt -o bins/bin -t 32

# Step 4: Assess bin quality with CheckM2
checkm2 predict --input bins/ --output-directory checkm2_out -x fa --threads 32
```
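
Before binning, it can help to look at the depth table from Step 2. The sketch below assumes the tab-separated layout that `jgi_summarize_bam_contig_depths` writes, with `contigName` and `totalAvgDepth` columns; `low_depth_contigs` and the default threshold of 1.0 are illustrative choices, not values recommended by MetaBAT2.

```python
import csv

def low_depth_contigs(depth_path, min_depth=1.0):
    """Return names of contigs whose average depth is below min_depth.

    Assumes a tab-separated file with a header row containing
    'contigName' and 'totalAvgDepth' columns.
    """
    with open(depth_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["contigName"] for row in reader
                if float(row["totalAvgDepth"]) < min_depth]

if __name__ == "__main__":
    names = low_depth_contigs("depth.txt")
    print(f"{len(names)} contigs below 1x average depth")
```

A large share of contigs with very low depth may indicate that the wrong read set was mapped or that mapping failed for part of the assembly.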

## SemiBin2 (Deep Learning Binning)

**Goal:** Improve MAG recovery using deep learning-based contig binning.

**Approach:** Run SemiBin2, which applies a neural network to contig composition and coverage for more accurate bin assignments. Passing `--environment` selects a pretrained model instead of training one on the sample.

```bash
# Single-sample binning
SemiBin2 single_easy_bin \
    -i assembly.fasta \
    -b mapped.bam \
    -o semibin_out \
    --environment global

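# Multi-sample binning (better for time-series)
# Here -i is expected to be the contigs of all samples concatenated into one
# FASTA (see `SemiBin2 concatenate_fasta`), with each sample's reads mapped to it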
SemiBin2 multi_easy_bin \
    -i assembly.fasta \
    -b sample1.bam sample2.bam sample3.bam \
    -o semibin_multi
```

## Quality Assessment

**Goal:** Evaluate assembly contiguity, bin completeness, and taxonomic composition.

**Approach:** Run seqkit for basic stats, CheckM2 for bin quality, GTDB-Tk for taxonomy, and MetaQUAST for assembly metrics.

```bash
# Assembly stats
seqkit stats assembly.fasta

# CheckM2 for bin completeness
checkm2 predict -i bins/ -o checkm2_out -x fa -t 32

# GTDB-Tk for taxonomic classification
# (--extension fa matches MetaBAT2's .fa bins; GTDB-Tk defaults to .fna)
gtdbtk classify_wf --genome_dir bins/ --out_dir gtdbtk_out --extension fa --cpus 32

# MetaQUAST for assembly metrics
metaquast.py -o metaquast_out assembly.fasta -t 32
```

## Circular Genome Detection

**Goal:** Identify complete circular genomes (e.g., bacterial chromosomes, plasmids) in the assembly.

**Approach:** Parse Flye's assembly_info.txt for circularity flags and extract matching contigs.

```bash
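# Flye marks circular contigs with "Y" in column 4 (circ.) of assembly_info.txt.
# Match that column only: column 5 (repeat) also holds Y/N values.
awk -F'\t' '$4 == "Y" {print $1}' flye_meta/assembly_info.txt > circular_contigs.txt

# Extract circular contigs
seqkit grep -f circular_contigs.txt flye_meta/assembly.fasta > circular_genomes.fasta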
```
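
The same selection can be done in Python when the contig list feeds a larger script. This sketch assumes Flye's tab-separated `assembly_info.txt` layout, with the contig name in the first column and the circularity flag in the fourth; `circular_contigs` is a hypothetical helper name.

```python
import csv

def circular_contigs(info_path):
    """Return names of contigs flagged as circular in Flye's assembly_info.txt.

    Assumes tab-separated columns where column 1 is the contig name and
    column 4 is the circularity flag ('Y' or 'N'). Header lines start with '#'.
    """
    names = []
    with open(info_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue
            if len(row) > 3 and row[3] == "Y":
                names.append(row[0])
    return names

if __name__ == "__main__":
    for name in circular_contigs("flye_meta/assembly_info.txt"):
        print(name)
```

Checking the column by position avoids false matches on the repeat flag, which uses the same Y/N values.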

## Python Pipeline

**Goal:** Provide a reusable Python workflow from metagenome assembly through binning to quality assessment.

**Approach:** Chain metaFlye assembly, MetaBAT2 binning, and CheckM2 quality filtering, returning high-quality MAGs.

```python
import subprocess
from pathlib import Path
import pandas as pd

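def run_metaflye(reads, output_dir, read_type='nano-raw', threads=32):
    """Assemble long reads with Flye in metagenome mode; return the contig FASTA path."""
    cmd = ['flye', f'--{read_type}', reads, '--meta', '--out-dir', output_dir, '--threads', str(threads)]
    subprocess.run(cmd, check=True)
    return Path(output_dir) / 'assembly.fasta'

def map_reads(assembly, reads, bam, preset='map-ont', threads=32):
    """Map reads back to the assembly with minimap2 and write a sorted BAM."""
    minimap = subprocess.Popen(['minimap2', '-ax', preset, '-t', str(threads), str(assembly), reads],
                               stdout=subprocess.PIPE)
    subprocess.run(['samtools', 'sort', '-o', bam, '-'], stdin=minimap.stdout, check=True)
    minimap.stdout.close()
    if minimap.wait() != 0:
        raise subprocess.CalledProcessError(minimap.returncode, 'minimap2')
    return bam

def run_binning(assembly, bam, output_dir, threads=32):
    """Compute per-contig depth and bin contigs with MetaBAT2; return the bins directory."""
    # Create output_dir and bins/ first so the depth file has somewhere to go
    bins_dir = Path(output_dir) / 'bins'
    bins_dir.mkdir(parents=True, exist_ok=True)

    depth_file = Path(output_dir) / 'depth.txt'
    subprocess.run(['jgi_summarize_bam_contig_depths', '--outputDepth', str(depth_file), bam], check=True)
    subprocess.run(['metabat2', '-i', assembly, '-a', str(depth_file), '-o', str(bins_dir / 'bin'), '-t', str(threads)], check=True)

    return bins_dir

def assess_bins(bins_dir, output_dir, threads=32):
    """Run CheckM2 and return bins with >90% completeness and <5% contamination."""
    subprocess.run(['checkm2', 'predict', '--input', str(bins_dir), '--output-directory', output_dir, '-x', 'fa', '--threads', str(threads)], check=True)

    results = pd.read_csv(Path(output_dir) / 'quality_report.tsv', sep='\t')
    high_quality = results[(results['Completeness'] > 90) & (results['Contamination'] < 5)]
    return high_quality

# Example workflow
assembly = run_metaflye('ont_reads.fq.gz', 'flye_out')
bam = map_reads(assembly, 'ont_reads.fq.gz', 'mapped.bam')
bins = run_binning(str(assembly), bam, 'binning_out')
hq_bins = assess_bins(bins, 'checkm2_out')
print(f'High-quality MAGs: {len(hq_bins)}')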
```
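
The pipeline above keeps only high-quality bins. As a complementary sketch, the function below sorts every bin into a tier using the completeness and contamination cutoffs commonly cited from the MIMAG standard (high: >90% and <5%; medium: at least 50% and <10%). It assumes the CheckM2 report has `Name`, `Completeness`, and `Contamination` columns. The function names are hypothetical, and `low` is used here as a catch-all for everything else.

```python
import csv

def quality_tier(completeness, contamination):
    """Classify a bin by completeness and contamination percentages."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"  # catch-all, including heavily contaminated bins

def tier_bins(report_path):
    """Map each bin name in a CheckM2-style TSV report to its quality tier."""
    with open(report_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Name"]: quality_tier(float(row["Completeness"]),
                                          float(row["Contamination"]))
                for row in reader}

if __name__ == "__main__":
    tiers = tier_bins("checkm2_out/quality_report.tsv")
    for tier in ("high", "medium", "low"):
        print(tier, sum(1 for value in tiers.values() if value == tier))
```

The full MIMAG high-quality definition also asks for rRNA and tRNA genes, which this sketch does not check.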

## Expected Outputs

| Metric | Good Assembly |
|--------|---------------|
| N50 | >50 kb |
| Largest contig | >1 Mb |
| HQ MAGs (>90% complete, <5% contam) | Varies by sample |
| Circular genomes | Sample dependent |
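
To make the N50 row concrete: N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. The sketch below computes it from an uncompressed FASTA file; `contig_lengths` and `n50` are illustrative helpers, and tools such as `seqkit stats -a` or MetaQUAST report the same metric.

```python
def contig_lengths(fasta_path):
    """Return the sequence length of every record in an uncompressed FASTA file."""
    lengths = []
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length

if __name__ == "__main__":
    lengths = contig_lengths("assembly.fasta")
    print(f"contigs: {len(lengths)}, N50: {n50(lengths)}, largest: {max(lengths)}")
```

For example, contigs of 5, 4, 3, 2, and 1 kb total 15 kb; the running sum passes 7.5 kb at the 4 kb contig, so the N50 is 4 kb.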

## Troubleshooting

| Issue | Solution |
|-------|----------|
| Few long contigs | Increase read depth or length |
| High chimeric rate | Use --keep-haplotypes in Flye |
| Poor binning | Add more samples for differential coverage |
| Missing taxa | Check read QC; consider targeted enrichment |

## Related Skills

- genome-assembly/contamination-detection - CheckM2/GUNC
- metagenomics/taxonomic-profiling - Kraken2/Bracken
- metagenomics/functional-profiling - HUMAnN
- long-read-sequencing/read-qc - Input quality control