bio-metabolomics-xcms-preprocessing
$
npx mdskill add GPTomics/bioSkills/bio-metabolomics-xcms-preprocessingProcesses raw LC-MS data into a feature table for untargeted metabolomics using XCMS3.
- Solves preprocessing of LC-MS/MS data for metabolomics analysis.
- Uses R with xcms 4.0+, MSnbase 2.28+, and Bioconductor.
- Applies peak detection, alignment, grouping, and gap filling workflows.
- Generates a sample-by-feature intensity matrix for downstream analysis.
SKILL.md
.github/skills/bio-metabolomics-xcms-preprocessingView on GitHub ↗
---
name: bio-metabolomics-xcms-preprocessing
description: XCMS3 workflow for LC-MS/MS metabolomics preprocessing. Covers peak detection, retention time alignment, correspondence (grouping), and gap filling. Use when processing raw LC-MS data into a feature table for untargeted metabolomics.
tool_type: r
primary_tool: xcms
---
## Version Compatibility
Reference examples tested with: MSnbase 2.28+, scanpy 1.10+, xcms 4.0+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# XCMS Metabolomics Preprocessing
Requires Bioconductor 3.18+ with xcms 4.0+ and MSnbase 2.28+.
## Load Raw Data
**Goal:** Import raw LC-MS files into R for downstream peak detection and alignment.
**Approach:** Read mzML/mzXML files into an OnDiskMSnExp object using MSnbase for memory-efficient access.
**"Process my raw LC-MS data into a feature table"** → Detect chromatographic peaks, align retention times across samples, group corresponding peaks, and fill missing values to produce a sample-by-feature intensity matrix.
```r
library(xcms)
library(MSnbase)
# Read mzML/mzXML files
raw_files <- list.files('raw_data', pattern = '\\.(mzML|mzXML)$', full.names = TRUE)
# Create OnDiskMSnExp object
raw_data <- readMSData(raw_files, mode = 'onDisk')
# Check data
raw_data
table(msLevel(raw_data))
```
## Define Sample Groups
**Goal:** Attach sample metadata (group labels, injection order) to the raw data object.
**Approach:** Create a data frame of sample information and assign it to the phenoData slot.
```r
# Sample metadata
sample_info <- data.frame(
sample_name = basename(raw_files),
sample_group = c(rep('Control', 5), rep('Treatment', 5), rep('QC', 3)),
injection_order = 1:length(raw_files)
)
# Assign to phenoData
pData(raw_data) <- sample_info
```
## Peak Detection (Centroided)
**Goal:** Identify chromatographic peaks in centroided LC-MS data.
**Approach:** Use the CentWave algorithm which detects peaks by continuous wavelet transform on regions of interest defined by m/z and RT.
```r
# CentWave algorithm for centroided data
cwp <- CentWaveParam(
peakwidth = c(5, 30), # Peak width range in seconds
ppm = 15, # m/z tolerance
snthresh = 10, # Signal-to-noise threshold
prefilter = c(3, 1000), # Min peaks and intensity
mzdiff = 0.01, # Minimum m/z difference
noise = 1000, # Noise level
integrate = 1 # Integration method
)
# Run peak detection
xdata <- findChromPeaks(raw_data, param = cwp)
# Summary
head(chromPeaks(xdata))
cat('Peaks found:', nrow(chromPeaks(xdata)), '\n')
```
## Peak Detection (Profile Data)
**Goal:** Detect peaks in profile (non-centroided) LC-MS data.
**Approach:** Use the MatchedFilter algorithm designed for continuum data, which convolves with a Gaussian model peak.
```r
# MatchedFilter for profile/continuum data
mfp <- MatchedFilterParam(
binSize = 0.1,
fwhm = 30,
snthresh = 10,
step = 0.1,
mzdiff = 0.8
)
xdata_profile <- findChromPeaks(raw_data, param = mfp)
```
## Retention Time Alignment
**Goal:** Correct retention time drift across samples to enable peak correspondence.
**Approach:** Apply Obiwarp alignment which uses dynamic time warping on the TIC profiles to compute sample-wise RT corrections.
```r
# Obiwarp alignment (recommended)
obp <- ObiwarpParam(
binSize = 0.5,
response = 1,
distFun = 'cor_opt',
gapInit = 0.3,
gapExtend = 2.4
)
xdata <- adjustRtime(xdata, param = obp)
# Check alignment
plotAdjustedRtime(xdata)
```
## Peak Correspondence (Grouping)
**Goal:** Group corresponding chromatographic peaks across samples into consensus features.
**Approach:** Use peak density-based grouping which models the RT distribution of peaks in m/z slices to identify features present across samples.
```r
# Group peaks across samples
pdp <- PeakDensityParam(
sampleGroups = pData(xdata)$sample_group,
bw = 5, # RT bandwidth
minFraction = 0.5, # Min fraction of samples
minSamples = 1, # Min samples per group
binSize = 0.025 # m/z bin size
)
xdata <- groupChromPeaks(xdata, param = pdp)
# Check feature definitions
featureDefinitions(xdata)
cat('Features:', nrow(featureDefinitions(xdata)), '\n')
```
## Gap Filling
**Goal:** Recover signal for features that were missed during initial peak detection in some samples.
**Approach:** Integrate intensity in the expected m/z-RT region for features with missing values using ChromPeakAreaParam.
```r
# Fill in missing peaks
fpp <- ChromPeakAreaParam()
xdata <- fillChromPeaks(xdata, param = fpp)
# Alternative: FillChromPeaksParam for more control
fpp2 <- FillChromPeaksParam(
expandMz = 0,
expandRt = 0,
ppm = 0
)
```
## Extract Feature Table
**Goal:** Generate a samples-by-features intensity matrix with m/z and RT annotations for downstream analysis.
**Approach:** Extract feature values and definitions from the processed XCMSnExp object and combine into an exportable table.
```r
# Get feature values (intensity matrix)
feature_values <- featureValues(xdata, method = 'maxint', value = 'into')
# Feature definitions (m/z, RT)
feature_defs <- featureDefinitions(xdata)
feature_defs <- as.data.frame(feature_defs)
feature_defs$feature_id <- rownames(feature_defs)
# Combine
feature_table <- cbind(feature_defs[, c('feature_id', 'mzmed', 'rtmed')], feature_values)
rownames(feature_table) <- feature_table$feature_id
# Save
write.csv(feature_table, 'feature_table.csv', row.names = FALSE)
```
## Quality Control
**Goal:** Assess preprocessing quality through TIC plots, peak counts, RT correction, and PCA.
**Approach:** Visualize total ion chromatograms, per-sample peak counts, RT adjustment, and PCA of the feature matrix.
```r
# TIC for each sample
tic <- chromatogram(raw_data, aggregationFun = 'sum')
plot(tic)
# Peak count per sample
peak_counts <- table(chromPeaks(xdata)[, 'sample'])
barplot(peak_counts, main = 'Peaks per sample')
# Check RT correction
par(mfrow = c(1, 2))
plotAdjustedRtime(xdata, col = pData(xdata)$sample_group)
# PCA of features
library(pcaMethods)
log_values <- log2(feature_values + 1)
log_values[is.na(log_values)] <- 0
pca <- pca(t(log_values), nPcs = 3, method = 'ppca')
plotPcs(pca, col = as.factor(pData(xdata)$sample_group))
```
## CAMERA Annotation (Isotopes/Adducts)
**Goal:** Identify isotope patterns and adduct groups among detected peaks to reduce feature redundancy.
**Approach:** Use CAMERA to group peaks by RT correlation, assign isotope clusters, and annotate adduct types.
```r
library(CAMERA)
# Create CAMERA object
xsa <- xsAnnotate(as(xdata, 'xcmsSet'))
# Group by RT
xsa <- groupFWHM(xsa, perfwhm = 0.6)
# Find isotopes
xsa <- findIsotopes(xsa, mzabs = 0.01, ppm = 10)
# Find adducts
xsa <- findAdducts(xsa, polarity = 'positive')
# Get annotated peak list
camera_results <- getPeaklist(xsa)
```
## Export for MetaboAnalyst
**Goal:** Format the XCMS feature table for import into MetaboAnalyst web or R package.
**Approach:** Transpose the matrix, create M/Z-RT feature names, and prepend sample group information.
```r
# Format for MetaboAnalyst web or R package
export_data <- t(feature_values)
colnames(export_data) <- paste0('M', round(feature_defs$mzmed, 4), 'T', round(feature_defs$rtmed, 1))
# Add sample info
export_df <- data.frame(Sample = rownames(export_data), Group = pData(xdata)$sample_group, export_data)
write.csv(export_df, 'metaboanalyst_input.csv', row.names = FALSE)
```
## Related Skills
- metabolite-annotation - Identify metabolites
- normalization-qc - Normalize feature table
- statistical-analysis - Differential analysis
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.