bio-genome-assembly-assembly-polishing
$
npx mdskill add GPTomics/bioSkills/bio-genome-assembly-assembly-polishingPolish genome assemblies to reduce errors using short or long reads
- Improves accuracy of draft genome assemblies by correcting base-level errors
- Uses tools like Pilon, Racon, and medaka for polishing with short or long reads
- Selects polishing strategy based on read type and assembly source (e.g., ONT, PacBio)
- Returns polished genome files and alignment logs for validation and downstream analysis
SKILL.md
.github/skills/bio-genome-assembly-assembly-polishingView on GitHub ↗
---
name: bio-genome-assembly-assembly-polishing
description: Polish genome assemblies to reduce errors using short reads (Pilon), long reads (Racon), or ONT-specific tools (medaka). Essential for improving long-read assembly accuracy. Use when improving assembly accuracy with polishing tools.
tool_type: cli
primary_tool: Pilon
---
## Version Compatibility
Reference examples tested with: BWA 0.7.17+, QUAST 5.2+, minimap2 2.26+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Assembly Polishing
**"Polish my genome assembly"** → Iteratively correct base-level errors in a draft assembly using short-read or long-read alignments.
- CLI: `pilon --genome draft.fa --frags short_reads.bam` (short-read), `medaka_polish` (ONT), `racon` (long-read)
## Polishing Strategies
| Tool | Input Reads | Best For |
|------|-------------|----------|
| Pilon | Illumina | Final polishing |
| medaka | ONT | ONT assemblies |
| Racon | Long reads | Quick polishing |
| NextPolish | Both | Combined approach |
## Recommended Workflows
### ONT Assembly
1. Racon (2-3 rounds with ONT)
2. medaka (1 round)
3. Pilon (2-3 rounds with Illumina)
### PacBio CLR Assembly
1. Racon (2-3 rounds)
2. Pilon (2-3 rounds with Illumina)
### PacBio HiFi Assembly
- Often no polishing needed (>99% accuracy)
- Optional Pilon if Illumina available
## Pilon (Illumina Polishing)
### Installation
```bash
conda install -c bioconda pilon
```
### Basic Usage
```bash
# Map short reads to assembly
bwa index assembly.fasta
bwa mem -t 16 assembly.fasta R1.fq.gz R2.fq.gz | samtools sort -o aligned.bam
samtools index aligned.bam
# Run Pilon
pilon --genome assembly.fasta --frags aligned.bam --output polished
```
### Key Options
| Option | Description |
|--------|-------------|
| `--genome` | Input assembly |
| `--frags` | Paired-end BAM |
| `--output` | Output prefix |
| `--changes` | Write changes file |
| `--vcf` | Write VCF of changes |
| `--fix` | What to fix (snps, indels, gaps, all) |
| `--threads` | Threads for alignment |
| `--mindepth` | Min depth for correction |
### Multiple Rounds
```bash
#!/bin/bash
ASSEMBLY=$1
R1=$2
R2=$3
ROUNDS=${4:-3}
current=$ASSEMBLY
for i in $(seq 1 $ROUNDS); do
echo "=== Pilon round $i ==="
bwa index $current
bwa mem -t 16 $current $R1 $R2 | samtools sort -o round${i}.bam
samtools index round${i}.bam
pilon --genome $current --frags round${i}.bam --output pilon_round${i} --changes
current=pilon_round${i}.fasta
changes=$(wc -l < pilon_round${i}.changes)
echo "Changes made: $changes"
if [ $changes -eq 0 ]; then
echo "No more changes, stopping"
break
fi
done
cp $current final_polished.fasta
```
### Fix Specific Issues
```bash
# Only fix SNPs and small indels
pilon --genome assembly.fa --frags aligned.bam --output polished --fix snps,indels
# Only fill gaps
pilon --genome assembly.fa --frags aligned.bam --output polished --fix gaps
```
## medaka (ONT Polishing)
### Installation
```bash
conda install -c bioconda medaka
```
### Basic Usage
```bash
medaka_consensus -i reads.fastq.gz -d assembly.fasta -o medaka_output -t 8
```
### Key Options
| Option | Description |
|--------|-------------|
| `-i` | Input reads |
| `-d` | Draft assembly |
| `-o` | Output directory |
| `-t` | Threads |
| `-m` | Model name |
### Model Selection
```bash
# List available models
medaka tools list_models
# Use specific model (match your basecaller)
medaka_consensus -i reads.fq.gz -d assembly.fa -o output -m r1041_e82_400bps_sup_v5.1.0
```
### Models for Common Chemistries
| Chemistry | Model |
|-----------|-------|
| R10.4.1 + SUP | r1041_e82_400bps_sup_* |
| R10.4.1 + HAC | r1041_e82_400bps_hac_* |
| R9.4.1 + SUP | r941_sup_* |
### Output
```
medaka_output/
├── consensus.fasta # Polished assembly
├── calls_to_draft.bam # Alignments
└── *.hdf # Intermediate files
```
## Racon (Long-Read Polishing)
### Installation
```bash
conda install -c bioconda racon
```
### Basic Usage
```bash
# Map reads to assembly
minimap2 -ax map-ont assembly.fasta reads.fastq.gz > aligned.sam
# Polish
racon -t 16 reads.fastq.gz aligned.sam assembly.fasta > polished.fasta
```
### Multiple Rounds
```bash
#!/bin/bash
ASSEMBLY=$1
READS=$2
ROUNDS=${3:-3}
current=$ASSEMBLY
for i in $(seq 1 $ROUNDS); do
echo "=== Racon round $i ==="
minimap2 -ax map-ont $current $READS > round${i}.sam
racon -t 16 $READS round${i}.sam $current > racon_round${i}.fasta
current=racon_round${i}.fasta
done
cp $current racon_polished.fasta
```
### Key Options
| Option | Description |
|--------|-------------|
| `-t` | Threads |
| `-m` | Match score (default: 3) |
| `-x` | Mismatch score (default: -5) |
| `-g` | Gap penalty (default: -4) |
| `-w` | Window size (default: 500) |
## Complete Polishing Workflow
**Goal:** Maximize assembly accuracy through iterative multi-tool polishing.
**Approach:** Apply Racon with long reads, then medaka for ONT-specific error correction, then Pilon with short reads for final accuracy.
### ONT Assembly Polishing
```bash
#!/bin/bash
set -euo pipefail
ASSEMBLY=$1 # Flye assembly
ONT_READS=$2 # ONT reads
ILLUMINA_R1=$3 # Illumina R1
ILLUMINA_R2=$4 # Illumina R2
OUTDIR=$5
mkdir -p $OUTDIR
# Step 1: Racon polishing (2 rounds)
echo "=== Racon Polishing ==="
current=$ASSEMBLY
for i in 1 2; do
minimap2 -ax map-ont $current $ONT_READS > ${OUTDIR}/racon_${i}.sam
racon -t 16 $ONT_READS ${OUTDIR}/racon_${i}.sam $current > ${OUTDIR}/racon_${i}.fasta
current=${OUTDIR}/racon_${i}.fasta
done
# Step 2: medaka polishing
echo "=== medaka Polishing ==="
medaka_consensus -i $ONT_READS -d $current -o ${OUTDIR}/medaka -t 8
current=${OUTDIR}/medaka/consensus.fasta
# Step 3: Pilon polishing (2 rounds)
echo "=== Pilon Polishing ==="
for i in 1 2; do
bwa index $current
bwa mem -t 16 $current $ILLUMINA_R1 $ILLUMINA_R2 | samtools sort -o ${OUTDIR}/pilon_${i}.bam
samtools index ${OUTDIR}/pilon_${i}.bam
pilon --genome $current --frags ${OUTDIR}/pilon_${i}.bam --output ${OUTDIR}/pilon_${i}
current=${OUTDIR}/pilon_${i}.fasta
done
cp $current ${OUTDIR}/final_polished.fasta
echo "Done: ${OUTDIR}/final_polished.fasta"
```
## NextPolish (Combined Approach)
### Installation
```bash
conda install -c bioconda nextpolish
```
### Usage
```bash
# Create config file
cat > run.cfg << EOF
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 2
multithread_jobs = 8
genome = assembly.fasta
genome_size = auto
workdir = ./01_rundir
[lgs_option]
lgs_fofn = lgs.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont
[sgs_option]
sgs_fofn = sgs.fofn
sgs_options = -max_depth 100
EOF
# File of filenames
ls reads.fastq.gz > lgs.fofn
ls R1.fq.gz R2.fq.gz > sgs.fofn
# Run
nextPolish run.cfg
```
## Quality Assessment
**Goal:** Measure the accuracy improvement from polishing.
**Approach:** Compare original and polished assemblies with QUAST against a reference, and check alignment error rates.
After polishing, assess improvement:
```bash
# Compare to reference (if available)
quast.py -r reference.fa original.fa polished.fa -o quast_comparison
# Check error rate
minimap2 -ax map-ont polished.fa reads.fq.gz | samtools stats | grep "error rate"
```
## Related Skills
- long-read-assembly - Initial assembly
- short-read-assembly - Source of polishing reads
- assembly-qc - Assess polishing improvement
- long-read-sequencing - medaka variant calling
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.