bio-cfdna-preprocessing
$
npx mdskill add GPTomics/bioSkills/bio-cfdna-preprocessingPreprocesses cell-free DNA sequencing data for downstream analysis using fgbio
- Solves preprocessing of plasma cfDNA sequencing data with UMI-aware deduplication
- Uses fgbio, BWA, samtools, and Python libraries like numpy and pysam
- Applies cfDNA-specific thresholds, alignment optimization, and fragment filtering
- Delivers cleaned BAM files and quality metrics for variant detection workflows
SKILL.md
.github/skills/bio-cfdna-preprocessingView on GitHub ↗
---
name: bio-cfdna-preprocessing
description: Preprocesses cell-free DNA sequencing data including adapter trimming, alignment optimized for short fragments, and UMI-aware duplicate removal using fgbio. Applies cfDNA-specific quality thresholds and fragment length filtering. Use when processing plasma cfDNA sequencing data before downstream analysis.
tool_type: python
primary_tool: fgbio
---
## Version Compatibility
Reference examples tested with: BWA 0.7.17+, fgbio 2.1+, matplotlib 3.8+, numpy 1.26+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# cfDNA Preprocessing
**"Preprocess my cfDNA sequencing data"** → Process cell-free DNA reads with UMI extraction, consensus calling, and error suppression for sensitive variant detection.
- CLI: `fgbio FastqToBam` → `fgbio GroupReadsByUmi` → `fgbio CallMolecularConsensusReads`
Preprocess cell-free DNA sequencing data with UMI-aware deduplication.
## Pre-Analytical Considerations
| Factor | Requirement | Rationale |
|--------|-------------|-----------|
| Collection tube | Streck (7 days) or EDTA (6 hrs) | Prevents cell lysis |
| Processing time | ASAP or per tube specs | Minimizes genomic DNA contamination |
| Hemolysis | Avoid | Releases cellular DNA |
| Storage | -80C after extraction | Prevents degradation |
## UMI-Aware Pipeline with fgbio
```bash
# fgbio 3.0+ (actively maintained)
# Step 1: Extract UMIs from reads and annotate
fgbio ExtractUmisFromBam \
--input raw.bam \
--output with_umis.bam \
--read-structure 3M2S+T 3M2S+T \
--molecular-index-tags ZA ZB \
--single-tag RX
# Step 2: Align with BWA-MEM
# Use -Y for soft-clipping (preserves UMIs)
bwa mem -t 8 -Y reference.fa with_umis.bam | \
samtools view -bS - > aligned.bam
# Step 3: Group reads by UMI
fgbio GroupReadsByUmi \
--input aligned.bam \
--output grouped.bam \
--strategy adjacency \
--edits 1 \
--min-map-q 20
# Step 4: Call molecular consensus reads
fgbio CallMolecularConsensusReads \
--input grouped.bam \
--output consensus.bam \
--min-reads 2 \
--min-input-base-quality 20
# Step 5: Filter consensus reads
fgbio FilterConsensusReads \
--input consensus.bam \
--output filtered_consensus.bam \
--ref reference.fa \
--min-reads 2 \
--max-read-error-rate 0.05 \
--min-base-quality 30
```
## Python Implementation
**Goal:** Run the complete cfDNA UMI-consensus pipeline from raw BAM to error-suppressed consensus reads in a single Python function call.
**Approach:** Chain fgbio operations (UMI extraction, grouping, consensus calling, filtering) with BWA alignment, handling intermediate files and cleanup within the function.
```python
import subprocess
import pysam
from pathlib import Path
def preprocess_cfdna(input_bam, output_bam, reference, read_structure='3M2S+T 3M2S+T',
min_reads=2, threads=8):
'''
Full cfDNA preprocessing pipeline with fgbio.
Args:
input_bam: Input BAM with UMIs in reads
output_bam: Output consensus BAM
reference: Reference FASTA path
read_structure: UMI read structure
min_reads: Minimum reads per UMI group
threads: CPU threads
'''
work_dir = Path(output_bam).parent
prefix = Path(output_bam).stem
# Extract UMIs
with_umis = work_dir / f'{prefix}_umis.bam'
subprocess.run([
'fgbio', 'ExtractUmisFromBam',
'--input', input_bam,
'--output', str(with_umis),
'--read-structure', read_structure,
'--single-tag', 'RX'
], check=True)
# Align
aligned = work_dir / f'{prefix}_aligned.bam'
cmd = f'bwa mem -t {threads} -Y {reference} {with_umis} | samtools view -bS - > {aligned}'
subprocess.run(cmd, shell=True, check=True)
# Sort
sorted_bam = work_dir / f'{prefix}_sorted.bam'
pysam.sort('-@', str(threads), '-o', str(sorted_bam), str(aligned))
# Group by UMI
grouped = work_dir / f'{prefix}_grouped.bam'
subprocess.run([
'fgbio', 'GroupReadsByUmi',
'--input', str(sorted_bam),
'--output', str(grouped),
'--strategy', 'adjacency',
'--edits', '1'
], check=True)
# Consensus calling
consensus = work_dir / f'{prefix}_consensus.bam'
subprocess.run([
'fgbio', 'CallMolecularConsensusReads',
'--input', str(grouped),
'--output', str(consensus),
'--min-reads', str(min_reads)
], check=True)
# Filter consensus
subprocess.run([
'fgbio', 'FilterConsensusReads',
'--input', str(consensus),
'--output', output_bam,
'--ref', reference,
'--min-reads', str(min_reads)
], check=True)
return output_bam
```
## Fragment Size Analysis
```python
import pysam
import numpy as np
import matplotlib.pyplot as plt
def analyze_fragment_sizes(bam_path, max_size=500):
'''Analyze cfDNA fragment size distribution.'''
bam = pysam.AlignmentFile(bam_path, 'rb')
sizes = []
for read in bam.fetch():
if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
if read.template_length <= max_size:
sizes.append(read.template_length)
bam.close()
# cfDNA signature: peak at ~167bp (mononucleosome)
# Shorter fragments (90-150bp) enriched in ctDNA
sizes = np.array(sizes)
print(f'Fragments analyzed: {len(sizes)}')
print(f'Median size: {np.median(sizes):.0f} bp')
print(f'Mode: {np.bincount(sizes).argmax()} bp')
return sizes
```
## Quality Thresholds
| Metric | Threshold | Notes |
|--------|-----------|-------|
| Modal fragment size | 150-180 bp | Peak ~167 bp indicates good cfDNA |
| UMI families >= 2 reads | > 50% | Sufficient for consensus |
| Mean base quality | >= 30 | After consensus |
| Mapping quality | >= 20 | Exclude multi-mappers |
## Related Skills
- fragment-analysis - Analyze fragmentomics after preprocessing
- tumor-fraction-estimation - Estimate ctDNA from sWGS
- ctdna-mutation-detection - Detect mutations from panel data
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.