bio-genome-assembly-hifi-assembly
$
npx mdskill add GPTomics/bioSkills/bio-genome-assembly-hifi-assemblyAssembles high-quality diploid genomes from PacBio HiFi reads using hifiasm
- Solves the task of generating reference-quality diploid genome assemblies
- Relies on hifiasm and supports phasing with trio or Hi-C data
- Uses HiFi read data to produce phased haplotype-resolved assemblies
- Delivers primary and alternate contig files in GFA format for downstream analysis
SKILL.md
.github/skills/bio-genome-assembly-hifi-assemblyView on GitHub ↗
---
name: bio-genome-assembly-hifi-assembly
description: High-quality genome assembly from PacBio HiFi reads using hifiasm with phasing support. Use when building reference-quality diploid assemblies from HiFi data, especially with trio or Hi-C phasing for fully resolved haplotypes.
tool_type: cli
primary_tool: hifiasm
---
## Version Compatibility
Reference examples tested with: BUSCO 5.5+, QUAST 5.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# HiFi Assembly
**"Assemble a genome from HiFi reads"** → Build a phased, reference-quality diploid assembly from PacBio HiFi reads with optional trio or Hi-C phasing for full haplotype resolution.
- CLI: `hifiasm -o output reads.fq.gz`
## Basic Assembly
**Goal:** Produce a primary contig assembly from PacBio HiFi reads.
**Approach:** Run hifiasm with default parameters and convert GFA output to FASTA.
```bash
# Primary assembly (single haplotype consensus)
hifiasm -o output_prefix -t 32 reads.hifi.fastq.gz
# Output files:
# output_prefix.bp.p_ctg.gfa - Primary contigs
# output_prefix.bp.a_ctg.gfa - Alternate contigs
# output_prefix.bp.hap1.p_ctg.gfa - Haplotype 1 (if phased)
# output_prefix.bp.hap2.p_ctg.gfa - Haplotype 2 (if phased)
# Convert GFA to FASTA
awk '/^S/{print ">"$2;print $3}' output_prefix.bp.p_ctg.gfa > assembly.fasta
```
## Trio-Binned Phasing
**Goal:** Generate fully phased haplotype assemblies using parental short-read data.
**Approach:** Build k-mer databases from parental reads with yak, then supply them to hifiasm for trio binning.
```bash
# With parental short reads for trio binning
hifiasm -o trio_asm -t 32 \
-1 paternal.yak \
-2 maternal.yak \
child.hifi.fastq.gz
# Create yak databases from parental Illumina reads first
yak count -b37 -t16 -o paternal.yak paternal_R1.fq.gz paternal_R2.fq.gz
yak count -b37 -t16 -o maternal.yak maternal_R1.fq.gz maternal_R2.fq.gz
```
## Hi-C Phasing
**Goal:** Phase haplotypes without parental data using chromatin proximity.
**Approach:** Supply Hi-C paired-end reads to hifiasm for contact-based haplotype resolution.
```bash
# Use Hi-C reads for phasing (no parents needed)
hifiasm -o hic_asm -t 32 \
--h1 hic_R1.fastq.gz \
--h2 hic_R2.fastq.gz \
reads.hifi.fastq.gz
# Produces fully phased hap1 and hap2 assemblies
```
## Key Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| -t | 1 | Threads |
| -l | 0 | Purge level (0=none, 1=light, 2=aggressive) |
| -s | 0.55 | Similarity threshold for duplicate detection |
| --primary | - | Output primary contigs only (no alternates) |
| --n-hap | 2 | Expected number of haplotypes |
| -D | 5.0 | Drop reads with depth > D*average |
| -N | 100 | Consider up to N overlaps for each read |
## Purge Duplicates
**Goal:** Control haplotype duplication levels in the primary assembly.
**Approach:** Adjust the hifiasm purge level parameter based on sample heterozygosity.
```bash
# Aggressive purging for high heterozygosity
hifiasm -o asm -t 32 -l 2 reads.hifi.fastq.gz
# Minimal purging for inbred samples
hifiasm -o asm -t 32 -l 0 reads.hifi.fastq.gz
```
## Ultra-Long ONT Integration
**Goal:** Improve contiguity across complex repeats by supplementing HiFi with ultra-long ONT reads.
**Approach:** Supply ONT reads via the --ul flag to span regions that HiFi reads alone cannot resolve.
```bash
# Combine HiFi accuracy with ONT length
hifiasm -o hybrid_asm -t 32 \
--ul ont_ultralong.fastq.gz \
hifi_reads.fastq.gz
# UL reads help span complex repeats
```
## Assembly Stats
**Goal:** Evaluate assembly contiguity, completeness, and correctness.
**Approach:** Compute summary statistics with seqkit/assembly-stats, run QUAST for contiguity metrics, and BUSCO for gene completeness.
```bash
# Quick stats with seqkit
seqkit stats assembly.fasta
# Detailed with assembly-stats
assembly-stats assembly.fasta
# QUAST assessment
quast.py -o quast_output assembly.fasta
# BUSCO completeness
busco -i assembly.fasta -l mammalia_odb10 -o busco_out -m genome
```
## Memory and Runtime
| Genome Size | HiFi Coverage | RAM | Time (32 cores) |
|-------------|---------------|-----|-----------------|
| 3 Gb | 30x | ~200 GB | 12-24 hours |
| 3 Gb | 60x | ~400 GB | 24-48 hours |
| 500 Mb | 40x | ~64 GB | 2-4 hours |
## Python Wrapper
**Goal:** Provide a reusable Python interface for running hifiasm with various phasing modes.
**Approach:** Wrap the hifiasm CLI call with optional Hi-C and ultra-long ONT parameters, then convert GFA output to FASTA.
```python
import subprocess
from pathlib import Path
def run_hifiasm(hifi_reads, output_prefix, threads=32, purge_level=0,
hic_r1=None, hic_r2=None, ul_reads=None):
cmd = ['hifiasm', '-o', output_prefix, '-t', str(threads), '-l', str(purge_level)]
if hic_r1 and hic_r2:
cmd.extend(['--h1', hic_r1, '--h2', hic_r2])
if ul_reads:
cmd.extend(['--ul', ul_reads])
cmd.append(hifi_reads)
subprocess.run(cmd, check=True)
gfa = Path(f'{output_prefix}.bp.p_ctg.gfa')
fasta = Path(f'{output_prefix}.fasta')
with open(fasta, 'w') as out:
with open(gfa) as f:
for line in f:
if line.startswith('S'):
parts = line.strip().split('\t')
out.write(f'>{parts[1]}\n{parts[2]}\n')
return fasta
# Example
assembly = run_hifiasm('sample.hifi.fq.gz', 'sample_asm', threads=48, hic_r1='hic_R1.fq.gz', hic_r2='hic_R2.fq.gz')
```
## Troubleshooting
| Issue | Solution |
|-------|----------|
| High duplication | Increase purge level (-l 2) |
| Missing haplotypes | Add Hi-C or trio data for phasing |
| Memory errors | Reduce -D parameter or downsample reads |
| Fragmented assembly | Check read quality; consider UL ONT addition |
## Related Skills
- genome-assembly/assembly-qc - QUAST and BUSCO
- genome-assembly/scaffolding - YaHS Hi-C scaffolding
- genome-assembly/contamination-detection - CheckM2 decontamination
- long-read-sequencing/read-qc - HiFi quality control
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.