bio-tumor-fraction-estimation
$
npx mdskill add GPTomics/bioSkills/bio-tumor-fraction-estimationEstimates circulating tumor DNA fraction from shallow whole-genome sequencing data
- Quantifies tumor burden from liquid biopsy samples using sWGS data
- Relies on ichorCNA for HMM-based copy number alteration detection
- Analyzes sWGS data with 0.1-1x coverage to estimate ctDNA percentage
- Returns tumor fraction metrics for monitoring treatment response
SKILL.md
.github/skills/bio-tumor-fraction-estimationView on GitHub ↗
---
name: bio-tumor-fraction-estimation
description: Estimates circulating tumor DNA fraction from shallow whole-genome sequencing using ichorCNA. Detects copy number alterations via HMM segmentation and calculates ctDNA percentage. Requires 0.1-1x sWGS coverage. Use when quantifying tumor burden from liquid biopsy or monitoring treatment response.
tool_type: r
primary_tool: ichorCNA
---
## Version Compatibility
Reference examples tested with: CNVkit 0.9+, ichorCNA 0.5+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Tumor Fraction Estimation
**"Estimate tumor fraction from my cfDNA data"** → Calculate the proportion of tumor-derived DNA in a liquid biopsy sample using copy number aberrations from shallow whole-genome sequencing.
- R: `ichorCNA` for tumor fraction and CNA estimation from sWGS
Estimate ctDNA tumor fraction from shallow whole-genome sequencing.
## ichorCNA Overview
ichorCNA (GavinHaLab fork, v0.5.1+) detects copy number alterations and estimates tumor fraction from sWGS (0.1-1x coverage).
**Sensitivity:** 97-100% detection at >= 3% tumor fraction (2024 validation)
## Input Requirements
| Requirement | Specification |
|-------------|---------------|
| Data type | sWGS (NOT targeted panel) |
| Coverage | 0.1-1x (0.5x recommended) |
| Input | BAM files |
| Output | Tumor fraction, ploidy, CNA segments |
## Running ichorCNA
```r
library(ichorCNA)
# Step 1: Generate read counts in bins
# Run from command line or use HMMcopy
# readCounter --window 1000000 --quality 20 sample.bam > sample.wig
# Step 2: Run ichorCNA
runIchorCNA(
WIG = 'sample.wig',
gcWig = 'gc_hg38_1mb.wig',
mapWig = 'mappability_hg38_1mb.wig',
normalPanel = 'pon_median_1mb.rds',
centromere = 'centromeres_hg38.txt',
outDir = 'ichor_results/',
id = 'sample_id',
# Tumor fraction estimation parameters
normal = c(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99),
ploidy = c(2, 3),
maxCN = 5,
# Subclonality
estimateScPrevalence = TRUE,
scStates = c(1, 3),
# Segmentation
txnE = 0.9999,
txnStrength = 10000,
# Chromosomes
chrs = paste0('chr', c(1:22, 'X'))
)
```
## Batch Processing
**Goal:** Run ichorCNA tumor fraction estimation on a cohort of sWGS samples in parallel, collecting results and handling failures gracefully.
**Approach:** Apply the ichorCNA pipeline to each sample's WIG file using mclapply for parallelization, wrapping each call in tryCatch to report per-sample success or failure.
```r
library(ichorCNA)
library(parallel)
process_sample <- function(wig_file, params) {
sample_id <- basename(wig_file)
sample_id <- gsub('.wig$', '', sample_id)
tryCatch({
runIchorCNA(
WIG = wig_file,
gcWig = params$gcWig,
mapWig = params$mapWig,
normalPanel = params$normalPanel,
centromere = params$centromere,
outDir = params$outDir,
id = sample_id,
normal = c(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99),
ploidy = c(2, 3),
maxCN = 5
)
return(list(sample = sample_id, status = 'success'))
}, error = function(e) {
return(list(sample = sample_id, status = 'failed', error = e$message))
})
}
# Run in parallel
wig_files <- list.files('wig/', pattern = '.wig$', full.names = TRUE)
params <- list(
gcWig = 'gc_hg38_1mb.wig',
mapWig = 'mappability_hg38_1mb.wig',
normalPanel = 'pon_median_1mb.rds',
centromere = 'centromeres_hg38.txt',
outDir = 'ichor_results/'
)
results <- mclapply(wig_files, process_sample, params = params, mc.cores = 4)
```
## Parsing Results
```r
parse_ichor_results <- function(results_dir) {
# Find results files
param_files <- list.files(results_dir, pattern = '.params.txt$',
full.names = TRUE, recursive = TRUE)
results <- data.frame()
for (f in param_files) {
params <- read.table(f, header = TRUE, sep = '\t', stringsAsFactors = FALSE)
sample_id <- gsub('.params.txt$', '', basename(f))
results <- rbind(results, data.frame(
sample = sample_id,
tumor_fraction = 1 - params$n[1], # n is normal fraction
ploidy = params$phi[1],
log_likelihood = params$loglik[1]
))
}
return(results)
}
# Parse all results
tf_results <- parse_ichor_results('ichor_results/')
print(tf_results)
```
## Python Wrapper
```python
import subprocess
import pandas as pd
from pathlib import Path
def run_ichorcna(wig_file, output_dir, gc_wig, map_wig, normal_panel, centromere):
'''Run ichorCNA from Python.'''
sample_id = Path(wig_file).stem
cmd = f'''
Rscript -e "
library(ichorCNA)
runIchorCNA(
WIG = '{wig_file}',
gcWig = '{gc_wig}',
mapWig = '{map_wig}',
normalPanel = '{normal_panel}',
centromere = '{centromere}',
outDir = '{output_dir}',
id = '{sample_id}',
normal = c(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99),
ploidy = c(2, 3),
maxCN = 5
)
"
'''
subprocess.run(cmd, shell=True, check=True)
def parse_tumor_fraction(params_file):
'''Parse tumor fraction from ichorCNA output.'''
df = pd.read_csv(params_file, sep='\t')
return {
'tumor_fraction': 1 - df['n'].iloc[0],
'ploidy': df['phi'].iloc[0],
'log_likelihood': df['loglik'].iloc[0]
}
```
## Interpretation
| Tumor Fraction | Interpretation |
|----------------|----------------|
| >= 10% | High ctDNA, reliable detection |
| 3-10% | Moderate ctDNA, detectable |
| < 3% | Low ctDNA, at detection limit |
| 0% | No detectable ctDNA or below LOD |
## Related Skills
- cfdna-preprocessing - Preprocess BAMs before ichorCNA
- fragment-analysis - Complementary fragmentomics analysis
- ctdna-mutation-detection - Mutation detection from panel data
- copy-number/cnvkit-analysis - CNV concepts
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.