bio-genome-assembly-scaffolding
$
npx mdskill add GPTomics/bioSkills/bio-genome-assembly-scaffoldingScaffolds contigs into chromosome-level assemblies using Hi-C data and validation tools
- Solves the problem of ordering and orienting contigs into larger genomic structures
- Uses YaHS, 3D-DNA, SALSA2, BWA, pairtools, BUSCO, and QUAST for scaffolding and validation
- Analyzes Hi-C contact maps and alignment data to infer correct contig order and orientation
- Delivers scaffolded assemblies and validation reports for chromosome-level accuracy
SKILL.md
.github/skills/bio-genome-assembly-scaffoldingView on GitHub ↗
---
name: bio-genome-assembly-scaffolding
description: Scaffold contigs into chromosome-level assemblies using Hi-C data with YaHS, 3D-DNA, SALSA2, and validate with BUSCO and contact maps. Use when scaffolding contigs to chromosome-level assemblies.
tool_type: cli
primary_tool: YaHS
---
## Version Compatibility
Reference examples tested with: BUSCO 5.5+, BWA 0.7.17+, QUAST 5.2+, bedtools 2.31+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Genome Scaffolding
**"Scaffold my contigs to chromosome level"** → Order and orient contigs into chromosome-scale scaffolds using Hi-C proximity ligation data.
- CLI: `yahs contigs.fa hic_alignments.bam`, `3d-dna`, `salsa2`
## Hi-C Data Preprocessing
**Goal:** Prepare Hi-C read alignments for scaffolding tools.
**Approach:** Align Hi-C reads with BWA mem using Hi-C-specific flags (-5SP), then filter and deduplicate contacts with pairtools.
```bash
# Align Hi-C reads to draft assembly
bwa index draft_assembly.fa
bwa mem -5SP -t 16 draft_assembly.fa hic_R1.fq.gz hic_R2.fq.gz | \
samtools view -@ 8 -bhS - > aligned.bam
# Filter for Hi-C contacts (pairtools)
pairtools parse --min-mapq 40 --walks-policy 5unique --max-inter-align-gap 30 \
--nproc-in 8 --nproc-out 8 --chroms-path draft_assembly.fa.fai aligned.bam | \
pairtools sort --nproc 8 | \
pairtools dedup --nproc 8 --mark-dups | \
pairtools split --output-pairs contacts.pairs.gz
```
## YaHS Scaffolding (Recommended)
**Goal:** Order and orient contigs into chromosome-scale scaffolds using Hi-C contact data.
**Approach:** Convert alignments to BED format and run YaHS, which uses an iterative scaffolding algorithm with contact frequency optimization.
```bash
# Index assembly
samtools faidx draft_assembly.fa
# Convert BAM to BED
bedtools bamtobed -i aligned.bam | sort -k4 > aligned.bed
# Run YaHS
yahs draft_assembly.fa aligned.bed -o scaffolds
# Output files:
# scaffolds_scaffolds_final.fa - final scaffolds
# scaffolds_scaffolds_final.agp - AGP file
# scaffolds.bin - contact matrix
```
## YaHS with Error Correction
**Goal:** Generate scaffolds with contact maps suitable for manual curation in Juicebox.
**Approach:** Run YaHS with error correction, then create a .hic file from the scaffolding output for visualization.
```bash
# Run with error correction
yahs draft_assembly.fa aligned.bed -o scaffolds --no-contig-ec
# Generate contact map for juicebox
juicer pre scaffolds.bin scaffolds_scaffolds_final.agp draft_assembly.fa.fai | \
sort -k2,2d -k6,6d -T ./ --parallel=8 -S 50G | \
awk 'NF' > scaffolds.pre.txt
# Create .hic file
java -Xmx48G -jar juicer_tools.jar pre scaffolds.pre.txt scaffolds.hic \
<(cut -f1,2 scaffolds_scaffolds_final.fa.fai)
```
## 3D-DNA Pipeline
**Goal:** Scaffold contigs using the 3D-DNA automated pipeline with Juicer integration.
**Approach:** Run the 3D-DNA assembly pipeline on Juicer-processed Hi-C data, then optionally refine with Juicebox review.
```bash
# Prepare input (requires Juicer aligned data)
# Run Juicer first to get merged_nodups.txt
# Run 3D-DNA
run-asm-pipeline.sh -r 2 draft_assembly.fa merged_nodups.txt
# Output: draft_assembly.final.fasta
# Generate review assembly for Juicebox
run-asm-pipeline-post-review.sh -r draft_assembly.final.review.assembly \
draft_assembly.final.fasta merged_nodups.txt
```
## SALSA2 Scaffolding
**Goal:** Scaffold contigs using SALSA2's proximity-guided assembly algorithm.
**Approach:** Run SALSA2 with Hi-C alignments and restriction enzyme sites for scaffolding decisions.
```bash
# Run SALSA2
python run_pipeline.py -a draft_assembly.fa -l draft_assembly.fa.fai \
-b aligned.bed -e GATC -o salsa_output -m yes
# With multiple restriction enzymes
python run_pipeline.py -a draft_assembly.fa -l draft_assembly.fa.fai \
-b aligned.bed -e GATC,GANTC -o salsa_output -m yes -p yes
```
## Generate Contact Map
**Goal:** Create multi-resolution contact matrices for visualization and validation.
**Approach:** Load contacts into cooler format, balance the matrix, and generate a multi-resolution .mcool file.
```bash
# Using cooler
cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 \
draft_assembly.fa.fai:10000 contacts.pairs.gz scaffolds.cool
# Balance matrix
cooler balance scaffolds.cool
# Multi-resolution (mcool)
cooler zoomify scaffolds.cool -o scaffolds.mcool
```
## Visualize with HiGlass
**Goal:** Interactively visualize Hi-C contact maps to verify scaffolding correctness.
**Approach:** Convert data to HiGlass-compatible formats and serve via Docker container.
```bash
# Convert to higlass format
clodius aggregate bedfile --chromsizes-filename chrom.sizes \
--output-file scaffolds.beddb scaffold_boundaries.bed
# Load into higlass server
docker run --detach --publish 8888:80 \
--volume ~/hg-data:/data \
higlass/higlass-docker:latest
```
## Manual Curation (Juicebox)
```bash
# Load .hic file in Juicebox Assembly Tools (JBAT)
# Perform manual corrections:
# - Break misjoins
# - Order/orient scaffolds
# - Merge scaffolds
# Export corrected assembly
# File -> Export Assembly -> FASTA
```
## Post-Scaffolding Gap Filling
```bash
# TGS-GapCloser for long-read gap filling
tgsgapcloser --scaff scaffolds.fa --reads ont_reads.fq.gz \
--output filled --thread 16 --ne
# LR_Gapcloser alternative
LR_Gapcloser.sh -i scaffolds.fa -l ont_reads.fq.gz -t 16 -o gapclosed.fa
```
## Validate Scaffolding
```bash
# Check chromosome-scale contiguity
seqkit stats scaffolds.fa
# BUSCO on scaffolds
busco -i scaffolds.fa -l eukaryota_odb10 -o busco_scaffolds -m genome -c 16
# N50/L50 statistics
assembly-stats scaffolds.fa
# Compare pre/post scaffolding
quast.py draft_assembly.fa scaffolds.fa -o quast_comparison
```
## Check Telomeres
```bash
# Find telomeric repeats (vertebrate TTAGGG)
seqkit locate -i -p 'TTAGGG{10,}' scaffolds.fa > telomeres_forward.bed
seqkit locate -i -p 'CCCTAA{10,}' scaffolds.fa > telomeres_reverse.bed
# Count chromosomes with telomeres on both ends
awk '$2 < 1000' telomeres_forward.bed | cut -f1 | sort -u > left_telomeres.txt
awk -v OFS='\t' 'NR==FNR{len[$1]=$2;next} $3 > len[$1]-1000' \
scaffolds.fa.fai telomeres_reverse.bed | cut -f1 | sort -u > right_telomeres.txt
comm -12 left_telomeres.txt right_telomeres.txt > complete_chromosomes.txt
```
## Rename to Chromosomes
```bash
# After manual curation, rename scaffolds to chromosomes
awk '/^>/{print ">chr" ++i; next}{print}' scaffolds.fa > chromosomes.fa
# Or with mapping file
seqkit replace -p '(.+)' -r '{kv}' -k name_mapping.tsv scaffolds.fa > chromosomes.fa
```
## Related Skills
- genome-assembly/long-read-assembly - Generate initial contigs
- genome-assembly/assembly-polishing - Polish before scaffolding
- genome-assembly/assembly-qc - Validate final assembly
- hi-c-analysis/hic-data-io - Hi-C data processing
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.