bio-genome-assembly-long-read-assembly
$
npx mdskill add GPTomics/bioSkills/bio-genome-assembly-long-read-assemblyAssembles bacterial genomes from long-read sequencing data using Flye and Canu
- Solves de novo genome assembly for Oxford Nanopore or PacBio reads
- Uses Flye, Canu, Wtdbg2, and Hifiasm for assembly workflows
- Chooses tools based on read type and assembly quality requirements
- Delivers contiguous assemblies as output files for downstream analysis
SKILL.md
.github/skills/bio-genome-assembly-long-read-assemblyView on GitHub ↗
---
name: bio-genome-assembly-long-read-assembly
description: De novo genome assembly from Oxford Nanopore or PacBio long reads using Flye and Canu. Produces highly contiguous assemblies suitable for complete bacterial genomes and resolving complex regions. Use when assembling genomes from ONT or PacBio reads.
tool_type: cli
primary_tool: Flye
---
## Version Compatibility
Reference examples tested with: Canu 2.2+, Flye 2.9+, hifiasm 0.19+, wtdbg2 2.5+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Long-Read Assembly
**"Assemble a genome from long reads"** → Build a contiguous de novo assembly from ONT or PacBio reads, producing complete or near-complete chromosomes.
- CLI: `flye --nano-raw reads.fq -o output` (ONT), `canu -p asm -d output -nanopore reads.fq` (ONT/PacBio)
## Tool Comparison
| Tool | Speed | Memory | Best For |
|------|-------|--------|----------|
| Flye | Fast | Moderate | General purpose, bacteria, ONT |
| Canu | Slow | High | High accuracy, complex genomes |
| Wtdbg2 | Very fast | Low | Draft assemblies |
> **Note:** For PacBio HiFi data, see the dedicated **hifi-assembly** skill which covers hifiasm.
## Flye
### Installation
```bash
conda install -c bioconda flye
```
### Basic Usage
```bash
# Oxford Nanopore
flye --nano-raw reads.fastq.gz --out-dir flye_output --threads 16
# PacBio CLR
flye --pacbio-raw reads.fastq.gz --out-dir flye_output --threads 16
# PacBio HiFi
flye --pacbio-hifi reads.fastq.gz --out-dir flye_output --threads 16
```
### Read Type Options
| Option | Read Type |
|--------|-----------|
| `--nano-raw` | ONT regular reads |
| `--nano-corr` | ONT corrected reads |
| `--nano-hq` | ONT Q20+ reads (Guppy 5+) |
| `--pacbio-raw` | PacBio CLR |
| `--pacbio-corr` | PacBio corrected |
| `--pacbio-hifi` | PacBio HiFi/CCS |
### Key Options
| Option | Description |
|--------|-------------|
| `--out-dir` | Output directory |
| `--threads` | Number of threads |
| `--genome-size` | Estimated genome size (e.g., 5m, 100m) |
| `--iterations` | Polishing iterations (default: 1) |
| `--meta` | Metagenome mode |
| `--plasmids` | Recover plasmids |
| `--keep-haplotypes` | Don't collapse haplotypes |
| `--scaffold` | Enable scaffolding |
### Genome Size Estimation
```bash
# Estimate if unknown
flye --nano-raw reads.fq.gz --out-dir output --genome-size 5m
# Size formats: 1000, 1k, 1m, 1g
```
### Output Files
```
flye_output/
├── assembly.fasta # Final assembly
├── assembly_graph.gfa # Assembly graph
├── assembly_info.txt # Contig statistics
└── flye.log # Log file
```
### Bacterial Assembly
```bash
flye \
--nano-raw bacteria.fastq.gz \
--out-dir bacteria_assembly \
--genome-size 5m \
--threads 16
```
### Metagenome Assembly
```bash
flye \
--nano-raw metagenome.fastq.gz \
--out-dir meta_assembly \
--meta \
--threads 32
```
### With Plasmid Recovery
```bash
flye \
--nano-raw isolate.fastq.gz \
--out-dir assembly \
--plasmids \
--threads 16
```
## Canu
### Installation
```bash
conda install -c bioconda canu
```
### Basic Usage
```bash
# ONT reads
canu -p assembly -d canu_output genomeSize=5m -nanopore reads.fastq.gz
# PacBio HiFi
canu -p assembly -d canu_output genomeSize=5m -pacbio-hifi reads.fastq.gz
```
### Key Options
| Option | Description |
|--------|-------------|
| `-p` | Assembly prefix |
| `-d` | Output directory |
| `genomeSize=` | Estimated size (required) |
| `maxThreads=` | Max threads |
| `maxMemory=` | Max memory (e.g., 64g) |
| `useGrid=false` | Disable grid execution |
| `correctedErrorRate=` | Expected error rate |
### Read Type Options
| Option | Read Type |
|--------|-----------|
| `-nanopore` | ONT reads |
| `-nanopore-raw` | ONT raw (deprecated) |
| `-pacbio` | PacBio CLR |
| `-pacbio-hifi` | PacBio HiFi/CCS |
### Fast Mode
```bash
canu -p asm -d output genomeSize=5m \
-nanopore reads.fq.gz \
useGrid=false \
maxThreads=16 \
maxMemory=32g
```
### High-Quality Mode (PacBio HiFi)
```bash
canu -p asm -d output genomeSize=5m \
-pacbio-hifi reads.fq.gz \
correctedErrorRate=0.01
```
### Output Files
```
canu_output/
├── assembly.contigs.fasta # Contigs
├── assembly.unassembled.fasta
├── assembly.report
└── assembly.seqStore/
```
## Wtdbg2 (Fast Draft)
### Installation
```bash
conda install -c bioconda wtdbg
```
### Basic Usage
```bash
# Assemble
wtdbg2 -x ont -g 5m -t 16 -i reads.fq.gz -o draft
# Consensus
wtpoa-cns -t 16 -i draft.ctg.lay.gz -o draft.ctg.fa
```
### Platform Presets
| Preset | Platform |
|--------|----------|
| `-x ont` | ONT R9 |
| `-x ccs` | PacBio HiFi |
| `-x rs` | PacBio CLR |
| `-x sq` | ONT R10 |
## Complete Workflows
**Goal:** Run end-to-end long-read assembly pipelines from raw reads to contigs.
**Approach:** Use Flye for initial assembly, optionally followed by short-read polishing.
### ONT Bacterial Assembly
```bash
#!/bin/bash
set -euo pipefail
READS=$1
OUTDIR=$2
SIZE=${3:-5m}
echo "=== ONT Bacterial Assembly ==="
# Flye assembly
flye \
--nano-raw $READS \
--out-dir ${OUTDIR}/flye \
--genome-size $SIZE \
--threads 16
# Stats
echo "Assembly statistics:"
cat ${OUTDIR}/flye/assembly_info.txt
echo "Assembly: ${OUTDIR}/flye/assembly.fasta"
```
### Hybrid Assembly (Long + Short)
```bash
#!/bin/bash
set -euo pipefail
LONG=$1
SHORT_R1=$2
SHORT_R2=$3
OUTDIR=$4
# 1. Long-read assembly with Flye
flye --nano-raw $LONG --out-dir ${OUTDIR}/flye --genome-size 5m --threads 16
# 2. Polish with short reads (Pilon)
# See assembly-polishing skill
```
## Quality Expectations
| Metric | Bacterial | Eukaryotic |
|--------|-----------|------------|
| Contigs | 1-10 | 100-1000+ |
| N50 | >1 Mb | Variable |
| Complete chromosomes | Often | Rare |
## Troubleshooting
### Low Contiguity
- Check coverage (need >30x)
- Try increasing iterations in Flye
- Consider supplementing with short reads
### Memory Issues
- Use Flye (more memory efficient)
- Reduce threads
- Filter reads by length/quality
### Misassemblies
- Polish with Pilon/medaka
- Validate with short reads
- Check for contamination
## Related Skills
- hifi-assembly - PacBio HiFi assembly with hifiasm
- assembly-polishing - Polish long-read assemblies
- assembly-qc - QUAST and BUSCO assessment
- short-read-assembly - Hybrid with Illumina
- long-read-sequencing - Read QC and alignment
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.