bio-genome-assembly-assembly-polishing

$npx mdskill add GPTomics/bioSkills/bio-genome-assembly-assembly-polishing

Polish genome assemblies to reduce errors using short or long reads

  • Improves accuracy of draft genome assemblies by correcting base-level errors
  • Uses tools like Pilon, Racon, and medaka for polishing with short or long reads
  • Selects polishing strategy based on read type and assembly source (e.g., ONT, PacBio)
  • Returns polished genome files and alignment logs for validation and downstream analysis
SKILL.md
.github/skills/bio-genome-assembly-assembly-polishingView on GitHub ↗
---
name: bio-genome-assembly-assembly-polishing
description: Polish genome assemblies to reduce errors using short reads (Pilon), long reads (Racon), or ONT-specific tools (medaka). Essential for improving long-read assembly accuracy. Use when improving assembly accuracy with polishing tools.
tool_type: cli
primary_tool: Pilon
---

## Version Compatibility

Reference examples tested with: BWA 0.7.17+, QUAST 5.2+, minimap2 2.26+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Assembly Polishing

**"Polish my genome assembly"** → Iteratively correct base-level errors in a draft assembly using short-read or long-read alignments.
- CLI: `pilon --genome draft.fa --frags short_reads.bam` (short-read), `medaka_polish` (ONT), `racon` (long-read)

## Polishing Strategies

| Tool | Input Reads | Best For |
|------|-------------|----------|
| Pilon | Illumina | Final polishing |
| medaka | ONT | ONT assemblies |
| Racon | Long reads | Quick polishing |
| NextPolish | Both | Combined approach |

## Recommended Workflows

### ONT Assembly
1. Racon (2-3 rounds with ONT)
2. medaka (1 round)
3. Pilon (2-3 rounds with Illumina)

### PacBio CLR Assembly
1. Racon (2-3 rounds)
2. Pilon (2-3 rounds with Illumina)

### PacBio HiFi Assembly
- Often no polishing needed (>99% accuracy)
- Optional Pilon if Illumina available

## Pilon (Illumina Polishing)

### Installation

```bash
conda install -c bioconda pilon
```

### Basic Usage

```bash
# Map short reads to assembly
bwa index assembly.fasta
bwa mem -t 16 assembly.fasta R1.fq.gz R2.fq.gz | samtools sort -o aligned.bam
samtools index aligned.bam

# Run Pilon
pilon --genome assembly.fasta --frags aligned.bam --output polished
```

### Key Options

| Option | Description |
|--------|-------------|
| `--genome` | Input assembly |
| `--frags` | Paired-end BAM |
| `--output` | Output prefix |
| `--changes` | Write changes file |
| `--vcf` | Write VCF of changes |
| `--fix` | What to fix (snps, indels, gaps, all) |
| `--threads` | Threads for alignment |
| `--mindepth` | Min depth for correction |

### Multiple Rounds

```bash
#!/bin/bash
ASSEMBLY=$1
R1=$2
R2=$3
ROUNDS=${4:-3}

current=$ASSEMBLY

for i in $(seq 1 $ROUNDS); do
    echo "=== Pilon round $i ==="
    bwa index $current
    bwa mem -t 16 $current $R1 $R2 | samtools sort -o round${i}.bam
    samtools index round${i}.bam

    pilon --genome $current --frags round${i}.bam --output pilon_round${i} --changes

    current=pilon_round${i}.fasta

    changes=$(wc -l < pilon_round${i}.changes)
    echo "Changes made: $changes"

    if [ $changes -eq 0 ]; then
        echo "No more changes, stopping"
        break
    fi
done

cp $current final_polished.fasta
```

### Fix Specific Issues

```bash
# Only fix SNPs and small indels
pilon --genome assembly.fa --frags aligned.bam --output polished --fix snps,indels

# Only fill gaps
pilon --genome assembly.fa --frags aligned.bam --output polished --fix gaps
```

## medaka (ONT Polishing)

### Installation

```bash
conda install -c bioconda medaka
```

### Basic Usage

```bash
medaka_consensus -i reads.fastq.gz -d assembly.fasta -o medaka_output -t 8
```

### Key Options

| Option | Description |
|--------|-------------|
| `-i` | Input reads |
| `-d` | Draft assembly |
| `-o` | Output directory |
| `-t` | Threads |
| `-m` | Model name |

### Model Selection

```bash
# List available models
medaka tools list_models

# Use specific model (match your basecaller)
medaka_consensus -i reads.fq.gz -d assembly.fa -o output -m r1041_e82_400bps_sup_v5.1.0
```

### Models for Common Chemistries

| Chemistry | Model |
|-----------|-------|
| R10.4.1 + SUP | r1041_e82_400bps_sup_* |
| R10.4.1 + HAC | r1041_e82_400bps_hac_* |
| R9.4.1 + SUP | r941_sup_* |

### Output

```
medaka_output/
├── consensus.fasta    # Polished assembly
├── calls_to_draft.bam # Alignments
└── *.hdf              # Intermediate files
```

## Racon (Long-Read Polishing)

### Installation

```bash
conda install -c bioconda racon
```

### Basic Usage

```bash
# Map reads to assembly
minimap2 -ax map-ont assembly.fasta reads.fastq.gz > aligned.sam

# Polish
racon -t 16 reads.fastq.gz aligned.sam assembly.fasta > polished.fasta
```

### Multiple Rounds

```bash
#!/bin/bash
ASSEMBLY=$1
READS=$2
ROUNDS=${3:-3}

current=$ASSEMBLY

for i in $(seq 1 $ROUNDS); do
    echo "=== Racon round $i ==="
    minimap2 -ax map-ont $current $READS > round${i}.sam
    racon -t 16 $READS round${i}.sam $current > racon_round${i}.fasta
    current=racon_round${i}.fasta
done

cp $current racon_polished.fasta
```

### Key Options

| Option | Description |
|--------|-------------|
| `-t` | Threads |
| `-m` | Match score (default: 3) |
| `-x` | Mismatch score (default: -5) |
| `-g` | Gap penalty (default: -4) |
| `-w` | Window size (default: 500) |

## Complete Polishing Workflow

**Goal:** Maximize assembly accuracy through iterative multi-tool polishing.

**Approach:** Apply Racon with long reads, then medaka for ONT-specific error correction, then Pilon with short reads for final accuracy.

### ONT Assembly Polishing

```bash
#!/bin/bash
set -euo pipefail

ASSEMBLY=$1      # Flye assembly
ONT_READS=$2     # ONT reads
ILLUMINA_R1=$3   # Illumina R1
ILLUMINA_R2=$4   # Illumina R2
OUTDIR=$5

mkdir -p $OUTDIR

# Step 1: Racon polishing (2 rounds)
echo "=== Racon Polishing ==="
current=$ASSEMBLY
for i in 1 2; do
    minimap2 -ax map-ont $current $ONT_READS > ${OUTDIR}/racon_${i}.sam
    racon -t 16 $ONT_READS ${OUTDIR}/racon_${i}.sam $current > ${OUTDIR}/racon_${i}.fasta
    current=${OUTDIR}/racon_${i}.fasta
done

# Step 2: medaka polishing
echo "=== medaka Polishing ==="
medaka_consensus -i $ONT_READS -d $current -o ${OUTDIR}/medaka -t 8
current=${OUTDIR}/medaka/consensus.fasta

# Step 3: Pilon polishing (2 rounds)
echo "=== Pilon Polishing ==="
for i in 1 2; do
    bwa index $current
    bwa mem -t 16 $current $ILLUMINA_R1 $ILLUMINA_R2 | samtools sort -o ${OUTDIR}/pilon_${i}.bam
    samtools index ${OUTDIR}/pilon_${i}.bam
    pilon --genome $current --frags ${OUTDIR}/pilon_${i}.bam --output ${OUTDIR}/pilon_${i}
    current=${OUTDIR}/pilon_${i}.fasta
done

cp $current ${OUTDIR}/final_polished.fasta
echo "Done: ${OUTDIR}/final_polished.fasta"
```

## NextPolish (Combined Approach)

### Installation

```bash
conda install -c bioconda nextpolish
```

### Usage

```bash
# Create config file
cat > run.cfg << EOF
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 2
multithread_jobs = 8
genome = assembly.fasta
genome_size = auto
workdir = ./01_rundir
[lgs_option]
lgs_fofn = lgs.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont
[sgs_option]
sgs_fofn = sgs.fofn
sgs_options = -max_depth 100
EOF

# File of filenames
ls reads.fastq.gz > lgs.fofn
ls R1.fq.gz R2.fq.gz > sgs.fofn

# Run
nextPolish run.cfg
```

## Quality Assessment

**Goal:** Measure the accuracy improvement from polishing.

**Approach:** Compare original and polished assemblies with QUAST against a reference, and check alignment error rates.

After polishing, assess improvement:

```bash
# Compare to reference (if available)
quast.py -r reference.fa original.fa polished.fa -o quast_comparison

# Check error rate
minimap2 -ax map-ont polished.fa reads.fq.gz | samtools stats | grep "error rate"
```

## Related Skills

- long-read-assembly - Initial assembly
- short-read-assembly - Source of polishing reads
- assembly-qc - Assess polishing improvement
- long-read-sequencing - medaka variant calling
More from GPTomics/bioSkills