bio-genome-assembly-scaffolding

$npx mdskill add GPTomics/bioSkills/bio-genome-assembly-scaffolding

Scaffolds contigs into chromosome-level assemblies using Hi-C data and validation tools

  • Solves the problem of ordering and orienting contigs into larger genomic structures
  • Uses YaHS, 3D-DNA, SALSA2, BWA, pairtools, BUSCO, and QUAST for scaffolding and validation
  • Analyzes Hi-C contact maps and alignment data to infer correct contig order and orientation
  • Delivers scaffolded assemblies and validation reports for chromosome-level accuracy
SKILL.md
.github/skills/bio-genome-assembly-scaffoldingView on GitHub ↗
---
name: bio-genome-assembly-scaffolding
description: Scaffold contigs into chromosome-level assemblies using Hi-C data with YaHS, 3D-DNA, SALSA2, and validate with BUSCO and contact maps. Use when scaffolding contigs to chromosome-level assemblies.
tool_type: cli
primary_tool: YaHS
---

## Version Compatibility

Reference examples tested with: BUSCO 5.5+, BWA 0.7.17+, QUAST 5.2+, bedtools 2.31+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Genome Scaffolding

**"Scaffold my contigs to chromosome level"** → Order and orient contigs into chromosome-scale scaffolds using Hi-C proximity ligation data.
- CLI: `yahs contigs.fa hic_alignments.bam`, `3d-dna`, `salsa2`

## Hi-C Data Preprocessing

**Goal:** Prepare Hi-C read alignments for scaffolding tools.

**Approach:** Align Hi-C reads with BWA mem using Hi-C-specific flags (-5SP), then filter and deduplicate contacts with pairtools.

```bash
# Align Hi-C reads to draft assembly
bwa index draft_assembly.fa
bwa mem -5SP -t 16 draft_assembly.fa hic_R1.fq.gz hic_R2.fq.gz | \
    samtools view -@ 8 -bhS - > aligned.bam

# Filter for Hi-C contacts (pairtools)
pairtools parse --min-mapq 40 --walks-policy 5unique --max-inter-align-gap 30 \
    --nproc-in 8 --nproc-out 8 --chroms-path draft_assembly.fa.fai aligned.bam | \
    pairtools sort --nproc 8 | \
    pairtools dedup --nproc 8 --mark-dups | \
    pairtools split --output-pairs contacts.pairs.gz
```

## YaHS Scaffolding (Recommended)

**Goal:** Order and orient contigs into chromosome-scale scaffolds using Hi-C contact data.

**Approach:** Convert alignments to BED format and run YaHS, which uses an iterative scaffolding algorithm with contact frequency optimization.

```bash
# Index assembly
samtools faidx draft_assembly.fa

# Convert BAM to BED
bedtools bamtobed -i aligned.bam | sort -k4 > aligned.bed

# Run YaHS
yahs draft_assembly.fa aligned.bed -o scaffolds

# Output files:
# scaffolds_scaffolds_final.fa - final scaffolds
# scaffolds_scaffolds_final.agp - AGP file
# scaffolds.bin - contact matrix
```

## YaHS with Error Correction

**Goal:** Generate scaffolds with contact maps suitable for manual curation in Juicebox.

**Approach:** Run YaHS with error correction, then create a .hic file from the scaffolding output for visualization.

```bash
# Run with error correction
yahs draft_assembly.fa aligned.bed -o scaffolds --no-contig-ec

# Generate contact map for juicebox
juicer pre scaffolds.bin scaffolds_scaffolds_final.agp draft_assembly.fa.fai | \
    sort -k2,2d -k6,6d -T ./ --parallel=8 -S 50G | \
    awk 'NF' > scaffolds.pre.txt

# Create .hic file
java -Xmx48G -jar juicer_tools.jar pre scaffolds.pre.txt scaffolds.hic \
    <(cut -f1,2 scaffolds_scaffolds_final.fa.fai)
```

## 3D-DNA Pipeline

**Goal:** Scaffold contigs using the 3D-DNA automated pipeline with Juicer integration.

**Approach:** Run the 3D-DNA assembly pipeline on Juicer-processed Hi-C data, then optionally refine with Juicebox review.

```bash
# Prepare input (requires Juicer aligned data)
# Run Juicer first to get merged_nodups.txt

# Run 3D-DNA
run-asm-pipeline.sh -r 2 draft_assembly.fa merged_nodups.txt

# Output: draft_assembly.final.fasta

# Generate review assembly for Juicebox
run-asm-pipeline-post-review.sh -r draft_assembly.final.review.assembly \
    draft_assembly.final.fasta merged_nodups.txt
```

## SALSA2 Scaffolding

**Goal:** Scaffold contigs using SALSA2's proximity-guided assembly algorithm.

**Approach:** Run SALSA2 with Hi-C alignments and restriction enzyme sites for scaffolding decisions.

```bash
# Run SALSA2
python run_pipeline.py -a draft_assembly.fa -l draft_assembly.fa.fai \
    -b aligned.bed -e GATC -o salsa_output -m yes

# With multiple restriction enzymes
python run_pipeline.py -a draft_assembly.fa -l draft_assembly.fa.fai \
    -b aligned.bed -e GATC,GANTC -o salsa_output -m yes -p yes
```

## Generate Contact Map

**Goal:** Create multi-resolution contact matrices for visualization and validation.

**Approach:** Load contacts into cooler format, balance the matrix, and generate a multi-resolution .mcool file.

```bash
# Using cooler
cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 \
    draft_assembly.fa.fai:10000 contacts.pairs.gz scaffolds.cool

# Balance matrix
cooler balance scaffolds.cool

# Multi-resolution (mcool)
cooler zoomify scaffolds.cool -o scaffolds.mcool
```

## Visualize with HiGlass

**Goal:** Interactively visualize Hi-C contact maps to verify scaffolding correctness.

**Approach:** Convert data to HiGlass-compatible formats and serve via Docker container.

```bash
# Convert to higlass format
clodius aggregate bedfile --chromsizes-filename chrom.sizes \
    --output-file scaffolds.beddb scaffold_boundaries.bed

# Load into higlass server
docker run --detach --publish 8888:80 \
    --volume ~/hg-data:/data \
    higlass/higlass-docker:latest
```

## Manual Curation (Juicebox)

```bash
# Load .hic file in Juicebox Assembly Tools (JBAT)
# Perform manual corrections:
# - Break misjoins
# - Order/orient scaffolds
# - Merge scaffolds

# Export corrected assembly
# File -> Export Assembly -> FASTA
```

## Post-Scaffolding Gap Filling

```bash
# TGS-GapCloser for long-read gap filling
tgsgapcloser --scaff scaffolds.fa --reads ont_reads.fq.gz \
    --output filled --thread 16 --ne

# LR_Gapcloser alternative
LR_Gapcloser.sh -i scaffolds.fa -l ont_reads.fq.gz -t 16 -o gapclosed.fa
```

## Validate Scaffolding

```bash
# Check chromosome-scale contiguity
seqkit stats scaffolds.fa

# BUSCO on scaffolds
busco -i scaffolds.fa -l eukaryota_odb10 -o busco_scaffolds -m genome -c 16

# N50/L50 statistics
assembly-stats scaffolds.fa

# Compare pre/post scaffolding
quast.py draft_assembly.fa scaffolds.fa -o quast_comparison
```

## Check Telomeres

```bash
# Find telomeric repeats (vertebrate TTAGGG)
seqkit locate -i -p 'TTAGGG{10,}' scaffolds.fa > telomeres_forward.bed
seqkit locate -i -p 'CCCTAA{10,}' scaffolds.fa > telomeres_reverse.bed

# Count chromosomes with telomeres on both ends
awk '$2 < 1000' telomeres_forward.bed | cut -f1 | sort -u > left_telomeres.txt
awk -v OFS='\t' 'NR==FNR{len[$1]=$2;next} $3 > len[$1]-1000' \
    scaffolds.fa.fai telomeres_reverse.bed | cut -f1 | sort -u > right_telomeres.txt
comm -12 left_telomeres.txt right_telomeres.txt > complete_chromosomes.txt
```

## Rename to Chromosomes

```bash
# After manual curation, rename scaffolds to chromosomes
awk '/^>/{print ">chr" ++i; next}{print}' scaffolds.fa > chromosomes.fa

# Or with mapping file
seqkit replace -p '(.+)' -r '{kv}' -k name_mapping.tsv scaffolds.fa > chromosomes.fa
```

## Related Skills

- genome-assembly/long-read-assembly - Generate initial contigs
- genome-assembly/assembly-polishing - Polish before scaffolding
- genome-assembly/assembly-qc - Validate final assembly
- hi-c-analysis/hic-data-io - Hi-C data processing
More from GPTomics/bioSkills