bio-genome-assembly-contamination-detection
$ npx mdskill add GPTomics/bioSkills/bio-genome-assembly-contamination-detection
Detects contamination and evaluates genome quality using bioinformatics tools:
- Identifies contamination in metagenome-assembled genomes and isolate assemblies
- Uses CheckM, CheckM2, GTDB-Tk, and GUNC for analysis
- Evaluates completeness, contamination, and coding density via marker genes and chimeric detection
- Generates quality reports with metrics like Completeness and Contamination
---
name: bio-genome-assembly-contamination-detection
description: Detect contamination and assess genome quality using CheckM, CheckM2, GTDB-Tk, and GUNC for metagenome-assembled genomes and isolate assemblies. Use when checking assemblies for contamination.
tool_type: cli
primary_tool: CheckM2
---
## Version Compatibility
Reference examples tested with: pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Contamination Detection
**"Check my assembly for contamination"** → Evaluate genome completeness and detect contaminating sequences using marker gene sets or chimeric contig detection.
- CLI: `checkm2 predict --input assembly.fa`, `gunc run`, `gtdbtk classify_wf`
## CheckM2 (Recommended)
```bash
# Run CheckM2 on single genome
checkm2 predict --input assembly.fa --output-directory checkm2_output --threads 16
# Run on multiple genomes (directory of FASTAs)
checkm2 predict --input genomes/ --output-directory checkm2_output \
--threads 16 --extension fa
# Output: quality_report.tsv with Completeness, Contamination, Coding_Density
```
## Interpret CheckM2 Results
```bash
# quality_report.tsv columns:
# Name, Completeness, Contamination, Completeness_Model_Used,
# Translation_Table_Used, Coding_Density, Contig_N50, Average_Gene_Length,
# Genome_Size, GC_Content, Total_Coding_Sequences
# Filter high-quality genomes (MIMAG standards)
awk -F'\t' 'NR==1 || ($2 > 90 && $3 < 5)' quality_report.tsv > high_quality_mags.tsv
# Medium quality
awk -F'\t' 'NR==1 || ($2 >= 50 && $3 < 10)' quality_report.tsv > medium_quality_mags.tsv
```
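The two awk filters above overlap: every genome that passes the high-quality filter also passes the medium-quality one. Below is a minimal Python sketch of mutually exclusive tiers. It uses only the Completeness and Contamination thresholds quoted above (the full MIMAG high-quality definition also asks for rRNA and tRNA genes, which this report does not contain). The helper names `mimag_tier` and `tier_report` are hypothetical, and the example rows are made up.

```python
import csv
import io


def mimag_tier(completeness: float, contamination: float) -> str:
    """Return a mutually exclusive tier from the thresholds used above."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"


def tier_report(tsv_text: str) -> dict:
    """Map each genome Name in a CheckM2-style TSV to its tier."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["Name"]: mimag_tier(float(row["Completeness"]), float(row["Contamination"]))
        for row in reader
    }


# Made-up example rows, not real CheckM2 output
example = (
    "Name\tCompleteness\tContamination\n"
    "bin_1\t98.2\t0.7\n"
    "bin_2\t72.5\t3.1\n"
    "bin_3\t41.0\t12.4\n"
)
print(tier_report(example))
```

To use it on a real run, pass the text of `checkm2_output/quality_report.tsv` to `tier_report`.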
## CheckM (Original)
```bash
# Run CheckM lineage workflow
checkm lineage_wf -t 16 -x fa genomes/ checkm_output/
# Generate summary (format 1: completeness, contamination, strain heterogeneity)
checkm qa checkm_output/lineage.ms checkm_output/ -o 1 -f checkm_summary.tsv --tab_table
# Extended report (format 2 adds bin statistics such as genome size, GC, and N50)
checkm qa checkm_output/lineage.ms checkm_output/ -o 2 --tab_table \
-f checkm_extended.tsv
```
## CheckM Plots
```bash
# Completeness vs Contamination plot
checkm bin_qa_plot -x fa checkm_output/ genomes/ plots/
# Coding density distribution (the trailing 95 is the required dist_value percentile)
checkm coding_plot -x fa checkm_output/ genomes/ plots/ 95
# Marker gene positions
checkm marker_plot -x fa checkm_output/ genomes/ plots/
```
## GTDB-Tk Taxonomic Classification
```bash
# Classify genomes
gtdbtk classify_wf --genome_dir genomes/ --out_dir gtdbtk_output \
--extension fa --cpus 16
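# Skip the ANI pre-screen and place every genome in the reference tree
# (GTDB-Tk 2.2+; some releases require either --skip_ani_screen or --mash_db,
# so confirm with `gtdbtk classify_wf --help`)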
gtdbtk classify_wf --genome_dir genomes/ --out_dir gtdbtk_output \
--extension fa --cpus 16 --skip_ani_screen
# Output files:
# gtdbtk.bac120.summary.tsv - bacterial classifications
# gtdbtk.ar53.summary.tsv - archaeal classifications
```
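The `classification` column of the summary files holds a semicolon-separated GTDB lineage with rank prefixes (`d__` through `s__`). As a rough aid for QC, the sketch below splits such a string into ranks and reports the most specific rank that was assigned. A genome that resolves only to a high rank, or to an unexpected lineage for an isolate, may deserve a closer look. The function names are hypothetical and the lineage string is an illustrative example, not output from a real run.

```python
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_gtdb_lineage(classification: str) -> dict:
    """Split a GTDB lineage string into a rank -> name mapping.

    Ranks left unassigned (e.g. 's__') map to an empty string.
    """
    ranks = {name: "" for name in RANK_PREFIXES.values()}
    for field in classification.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if sep and prefix in RANK_PREFIXES:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks


def lowest_assigned_rank(classification: str) -> str:
    """Return the most specific rank with a non-empty name, or 'unclassified'."""
    ranks = parse_gtdb_lineage(classification)
    for rank in reversed(list(RANK_PREFIXES.values())):
        if ranks[rank]:
            return rank
    return "unclassified"


# Illustrative lineage with no species assignment
lineage = (
    "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;"
    "f__Enterobacteriaceae;g__Escherichia;s__"
)
print(lowest_assigned_rank(lineage))
```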
## GTDB-Tk De Novo Workflow
```bash
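# Infer a de novo tree when genomes may include novel taxa
# (--outgroup_taxon is required; p__Patescibacteria is only an example choice)
gtdbtk de_novo_wf --genome_dir genomes/ --out_dir gtdbtk_denovo \
--bacteria --outgroup_taxon p__Patescibacteria --extension fa --cpus 16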
```
## GUNC Chimerism Detection
```bash
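# Run GUNC (needs the GUNC database via -r or the GUNC_DB environment variable;
# create the output directory first, since GUNC may not create it)
mkdir -p gunc_output
gunc run -d genomes/ -o gunc_output -t 16 -e .fa
# Output: GUNC.progenomes_2.1.maxCSS_level.tsv
# Key columns: pass.GUNC (True/False), contamination_portion, clade_separation_score
# Filter chimeric genomes: locate the pass.GUNC column by name instead of by position
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i == "pass.GUNC") c=i; print; next} $c == "False"' \
gunc_output/GUNC.progenomes_2.1.maxCSS_level.tsv > chimeric_genomes.tsv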
```
## GUNC Interpretation
```bash
# GUNC flags genomes as chimeric if:
# - clade_separation_score (CSS) > 0.45
# - contamination_portion > 0.05
# - reference_representation_score > 0.5
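# Combine with CheckM2 for full QC
# (drop both header lines and sort on the name column, as join requires;
# the combined file therefore has no header row)
join -t$'\t' -1 1 -2 1 \
<(tail -n +2 checkm2_output/quality_report.tsv | sort -t$'\t' -k1,1) \
<(tail -n +2 gunc_output/GUNC.progenomes_2.1.maxCSS_level.tsv | sort -t$'\t' -k1,1) \
> combined_qc.tsv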
```
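One way to apply the three thresholds listed above to a single genome is sketched here in Python. It is an illustrative triage that reads the thresholds jointly, not GUNC's own logic. The function name `interpret_gunc` and the labels it returns are hypothetical.

```python
def interpret_gunc(css: float, contamination_portion: float, rrs: float) -> str:
    """Label one genome from its GUNC scores using the thresholds listed above.

    css: clade_separation_score
    contamination_portion: fraction of genes outside the dominant clade
    rrs: reference_representation_score
    """
    if css <= 0.45:
        return "pass"
    if contamination_portion > 0.05 and rrs > 0.5:
        # All three thresholds exceeded: treat as chimeric
        return "chimeric"
    # High CSS but weak supporting evidence: inspect manually
    return "review"


# Made-up score triples for illustration
for scores in [(0.20, 0.30, 0.90), (0.80, 0.20, 0.90), (0.80, 0.02, 0.90)]:
    print(scores, interpret_gunc(*scores))
```

The "review" label is a design choice for genomes whose separation score is high while the supporting columns are weak; whether to keep or drop those is left to the analyst.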
## Comprehensive QC Pipeline
**Goal:** Run a multi-tool quality assessment on genome assemblies combining completeness, contamination, chimerism, and taxonomic classification.
**Approach:** Execute CheckM2 for completeness/contamination, GUNC for chimerism detection, and GTDB-Tk for taxonomic assignment in sequence, producing complementary QC reports.
```bash
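#!/bin/bash
# Usage: <script> <genomes_dir> <output_dir> [threads]
# Stop on the first failing step, unset variable, or broken pipe
set -euo pipefail
GENOMES_DIR=${1:?genomes directory required}
OUTPUT_DIR=${2:?output directory required}
THREADS=${3:-16}
mkdir -p "$OUTPUT_DIR"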
# Run CheckM2
echo "Running CheckM2..."
checkm2 predict --input "$GENOMES_DIR" --output-directory "$OUTPUT_DIR/checkm2" \
--threads "$THREADS" --extension fa
# Run GUNC (create its output directory first, since GUNC may not create it)
echo "Running GUNC..."
mkdir -p "$OUTPUT_DIR/gunc"
gunc run -d "$GENOMES_DIR" -o "$OUTPUT_DIR/gunc" -t "$THREADS" -e .fa
# Run GTDB-Tk
echo "Running GTDB-Tk..."
gtdbtk classify_wf --genome_dir "$GENOMES_DIR" --out_dir "$OUTPUT_DIR/gtdbtk" \
--extension fa --cpus "$THREADS"
echo "QC complete!"
```
## Filter by Quality Standards
**Goal:** Classify assembled genomes into MIMAG quality tiers (high/medium) by combining CheckM2 and GUNC results.
**Approach:** Merge CheckM2 completeness/contamination scores with GUNC chimerism flags, then apply MIMAG thresholds (>90% complete, <5% contamination, not chimeric for high quality).
```python
import pandas as pd
checkm = pd.read_csv('checkm2_output/quality_report.tsv', sep='\t')
gunc = pd.read_csv('gunc_output/GUNC.progenomes_2.1.maxCSS_level.tsv', sep='\t')
merged = checkm.merge(gunc, left_on='Name', right_on='genome', how='left')
# MIMAG High Quality: >90% complete, <5% contamination, not chimeric
hq = merged[(merged['Completeness'] > 90) &
(merged['Contamination'] < 5) &
(merged['pass.GUNC'] == True)]
# MIMAG Medium Quality: >=50% complete, <10% contamination (this set also includes the high-quality genomes)
mq = merged[(merged['Completeness'] >= 50) &
(merged['Contamination'] < 10)]
hq.to_csv('high_quality_genomes.tsv', sep='\t', index=False)
mq.to_csv('medium_quality_genomes.tsv', sep='\t', index=False)
```
## Remove Contamination
```bash
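# Use MAGpurify to remove contaminating contigs
# (requires the MAGpurify database, located via the MAGPURIFYDB environment variable)
magpurify phylo-markers genome.fa magpurify_output
magpurify clade-markers genome.fa magpurify_output
# conspecific also takes a Mash sketch of reference genomes; ref_genomes.msh is a placeholder
magpurify conspecific genome.fa magpurify_output ref_genomes.msh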
magpurify tetra-freq genome.fa magpurify_output
magpurify gc-content genome.fa magpurify_output
magpurify known-contam genome.fa magpurify_output
magpurify clean-bin genome.fa magpurify_output cleaned_genome.fa
```
## Detect Foreign Contigs
```bash
# Contig-level taxonomy with CAT
CAT contigs -c assembly.fa -d CAT_database -t CAT_taxonomy \
-o cat_output -n 16
# Parse results
CAT add_names -i cat_output.contig2classification.txt \
-o cat_output.contig2classification.named.txt \
-t CAT_taxonomy --only_official
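# Tally contigs per phylum to spot the majority lineage and any outliers
# (column 7 is phylum in the --only_official layout; confirm against the header line,
# and strip any trailing ": score" so identical names are counted together)
grep -v '^#' cat_output.contig2classification.named.txt | cut -f7 | \
sed 's/: [0-9.]*$//' | sort | uniq -c | sort -rn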
```
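Flagging contigs whose taxonomy differs from the majority can be sketched in Python as below. The sketch assumes the contig-to-taxon labels (for example, phylum names) have already been extracted from the CAT table and cleaned of scores. It counts contigs rather than weighting by contig length, and the labels treated as unclassified are an assumption. `flag_minority_contigs` is a hypothetical helper and the example data is made up.

```python
from collections import Counter


def flag_minority_contigs(contig_taxa: dict, unclassified=("", "no support", "NA")) -> tuple:
    """Return (majority_taxon, sorted list of contigs assigned to another taxon).

    Contigs whose label is in `unclassified` are ignored for both the
    majority vote and the flagged list.
    """
    counts = Counter(t for t in contig_taxa.values() if t not in unclassified)
    if not counts:
        return None, []
    majority, _ = counts.most_common(1)[0]
    flagged = sorted(
        contig for contig, taxon in contig_taxa.items()
        if taxon not in unclassified and taxon != majority
    )
    return majority, flagged


# Made-up contig labels for illustration
contigs = {
    "contig_1": "Pseudomonadota",
    "contig_2": "Pseudomonadota",
    "contig_3": "Bacillota",
    "contig_4": "no support",
}
print(flag_minority_contigs(contigs))
```

Flagged contigs are candidates for inspection, not automatic removal: short contigs and horizontally transferred regions can receive a different label without being contamination.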
## Decontaminate with BlobTools
```bash
# Create BlobDB
blobtools create -i assembly.fa -b aligned.bam -t blast_hits.txt \
-o blobtools_output
# Generate plots
blobtools plot -i blobtools_output.blobDB.json
# Export a per-contig table (all taxonomic ranks) that can then be filtered by taxonomy
blobtools view -i blobtools_output.blobDB.json -r all -o filtered
```
## Related Skills
- genome-assembly/assembly-qc - BUSCO and other QC
- genome-assembly/long-read-assembly - Assembly methods
- metagenomics/taxonomic-profiling - Metagenome analysis