bio-flow-cytometry-clustering-phenotyping
$ npx mdskill add GPTomics/bioSkills/bio-flow-cytometry-clustering-phenotyping
Clusters high-dimensional cytometry data to identify cell populations without predefined gates
- Discovers cell types in flow/mass cytometry data using unsupervised methods
- Uses R-based tools like FlowSOM, Phenograph, and CATALYST for clustering
- Analyzes marker expression patterns to group similar cells into clusters
- Returns cluster assignments and metacluster visualizations for downstream analysis
---
name: bio-flow-cytometry-clustering-phenotyping
description: Unsupervised clustering and cell type identification for flow/mass cytometry. Covers FlowSOM, Phenograph, and CATALYST workflows. Use when discovering cell populations in high-dimensional cytometry data without predefined gates.
tool_type: r
primary_tool: CATALYST
---
## Version Compatibility
Reference examples tested with: FlowSOM 2.10+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws an error such as `could not find function` or `unused argument` (or, in Python, ImportError, AttributeError, or TypeError), introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Clustering and Phenotyping
**"Cluster my cytometry data to find cell types"** → Discover cell populations in high-dimensional flow/mass cytometry data using unsupervised clustering without predefined gates.
- R: `FlowSOM::FlowSOM()` for self-organizing map clustering
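- R: `CATALYST::cluster()` for FlowSOM clustering followed by consensus metaclustering
- R: `Rphenograph::Rphenograph()` for graph-based Phenograph clustering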
## FlowSOM Clustering
**Goal:** Cluster cytometry events into cell populations using self-organizing maps.
**Approach:** Build a FlowSOM grid on marker channels, then extract metacluster assignments per cell.
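```r
library(flowCore)
library(FlowSOM)
# Select marker columns (assumes marker names such as CD3 or HLA-DR appear in the column names)
marker_cols <- grep('CD|HLA', colnames(fcs), value = TRUE)
# Build SOM (fcs should already be compensated and transformed)
fsom <- FlowSOM(fcs,
                colsToUse = marker_cols,
                xdim = 10, ydim = 10,
                nClus = 20,
                seed = 42)
# Get per-cell metacluster assignments (a factor)
clusters <- GetMetaclusters(fsom)
# Append to the flowFrame as a new numeric column so parameters stay in sync
cluster_col <- matrix(as.numeric(clusters), ncol = 1,
                      dimnames = list(NULL, 'cluster'))
fcs <- fr_append_cols(fcs, cluster_col)
```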
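The two-level idea behind this approach (many small nodes, then a few metaclusters, then a per-cell lookup) can be sketched in base R without FlowSOM installed. This is only an illustration under loud assumptions: the toy data are simulated, k-means stands in for the SOM nodes, and hierarchical clustering stands in for consensus metaclustering. It is not how FlowSOM is implemented.

```r
set.seed(42)
# Toy expression matrix: 300 cells x 2 markers, three well-separated populations
expr <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 6, sd = 0.3), ncol = 2)
)
colnames(expr) <- c('CD3', 'CD19')

# Level 1: many small clusters (k-means is a stand-in for SOM nodes)
nodes <- kmeans(expr, centers = 9, nstart = 10, iter.max = 50)

# Level 2: group node centroids into metaclusters
# (hclust is a stand-in for consensus metaclustering)
node_meta <- cutree(hclust(dist(nodes$centers)), k = 3)

# Per-cell metacluster = metacluster of the node the cell was assigned to
cell_meta <- unname(node_meta[nodes$cluster])
table(cell_meta)
```

The last lookup is the part that carries over: each cell inherits the metacluster of its node, which is why overclustering at the node level is cheap to undo later.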
## CATALYST Workflow (Full Pipeline)
**Goal:** Run the complete CATALYST clustering pipeline from flowSet to annotated cell populations.
**Approach:** Convert flowSet to SingleCellExperiment with prepData, then cluster on type markers with FlowSOM via CATALYST.
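```r
library(CATALYST)
library(SingleCellExperiment)
# Create SCE from a flowSet (fs) plus panel and metadata (md) tables.
# cofactor = 5 is the usual choice for mass cytometry; flow data commonly uses a larger value (e.g. 150)
sce <- prepData(fs, panel, md, transform = TRUE, cofactor = 5)
# Clustering: FlowSOM grid followed by consensus metaclustering up to maxK
sce <- cluster(sce,
               features = 'type',  # Use 'type' markers from panel
               xdim = 10, ydim = 10,
               maxK = 20,
               seed = 42)
# View cluster assignments
table(cluster_ids(sce, 'meta20'))
```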
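The `transform = TRUE, cofactor = 5` arguments request an arcsinh transformation, i.e. `asinh(x / cofactor)`. A minimal base R sketch of that formula is shown below; the helper name `arcsinh_transform` and the raw intensities are made up for illustration, and the cofactor of 150 is only a commonly quoted starting point for fluorescence data, not a recommendation from this skill.

```r
# Arcsinh transform with a cofactor: roughly linear near zero, log-like for large values
arcsinh_transform <- function(x, cofactor = 5) {
  asinh(x / cofactor)
}

# Hypothetical raw intensities spanning several orders of magnitude
raw <- c(0, 5, 50, 500, 5000)

# Smaller cofactor: more of the range is treated as log-like
low_cofactor <- arcsinh_transform(raw, cofactor = 5)

# Larger cofactor: a wider region around zero is compressed
high_cofactor <- arcsinh_transform(raw, cofactor = 150)

round(rbind(raw, low_cofactor, high_cofactor), 2)
```

The cofactor controls how wide the near-linear region around zero is, so it should be chosen before clustering because it changes the distances between cells.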
## Phenograph Clustering
**Goal:** Identify cell populations using graph-based community detection on marker expression.
**Approach:** Build a k-nearest-neighbor graph on type markers, then partition with Louvain community detection via Rphenograph.
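```r
library(Rphenograph)
library(SingleCellExperiment)
# Extract expression matrix (markers x cells) and keep type markers only
expr <- assay(sce, 'exprs')
type_expr <- expr[rowData(sce)$marker_class == 'type', ]
# Run Phenograph (expects cells in rows, so transpose)
pheno_result <- Rphenograph(t(type_expr), k = 30)
# Get clusters from the igraph communities object (second list element)
sce$phenograph <- factor(igraph::membership(pheno_result[[2]]))
```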
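The graph-building half of this approach can be sketched in base R: find each cell's k nearest neighbors, then weight each edge by the Jaccard overlap of the two cells' neighbor sets. This is a toy illustration on simulated data with a brute-force distance matrix, not the Rphenograph implementation; the community-detection step (for example `igraph::cluster_louvain()`) is deliberately left out to keep the example free of dependencies.

```r
set.seed(42)
# Toy data: 40 cells x 2 markers, two well-separated populations of 20 cells each
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.2), ncol = 2)
)
k <- 5

# Brute-force k-nearest neighbors (fine for toy data, too slow for real cytometry files)
d <- as.matrix(dist(x))
diag(d) <- Inf  # a cell is not its own neighbor
knn <- t(apply(d, 1, function(row) order(row)[seq_len(k)]))  # cells x k neighbor indices

# Jaccard similarity between two neighbor sets
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Weighted edge list: each cell is linked to its k neighbors
edges <- do.call(rbind, lapply(seq_len(nrow(knn)), function(i) {
  data.frame(
    from = i,
    to = knn[i, ],
    weight = vapply(knn[i, ], function(j) jaccard(knn[i, ], knn[j, ]), numeric(1))
  )
}))
head(edges)
```

Larger `k` values connect more cells and tend to give fewer, broader communities, which is why `k` is the main tuning parameter in the call above.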
## Dimensionality Reduction
**Goal:** Project high-dimensional cytometry data into 2D for visualization of cell populations.
**Approach:** Run UMAP or tSNE on type marker channels using CATALYST's runDR wrapper, then plot colored by cluster.
```r
# UMAP
sce <- runDR(sce, dr = 'UMAP', features = 'type')
# tSNE
sce <- runDR(sce, dr = 'TSNE', features = 'type')
# Plot
plotDR(sce, 'UMAP', color_by = 'meta20')
```
## Cluster Annotation
**Goal:** Assign cell type labels to clusters based on marker expression profiles.
**Approach:** Visualize median marker expression per cluster with a heatmap, then map cluster IDs to cell type names.
```r
# Heatmap of marker expression by cluster
plotExprHeatmap(sce, features = 'type',
                by = 'cluster_id', k = 'meta20',
                scale = 'first', row_anno = FALSE)
# Manual annotation (example labels; clusters without an entry become NA)
cluster_annotation <- c(
    '1' = 'CD4 T cells',
    '2' = 'CD8 T cells',
    '3' = 'B cells',
    '4' = 'NK cells',
    '5' = 'Monocytes'
)
sce$cell_type <- unname(cluster_annotation[as.character(cluster_ids(sce, 'meta20'))])
```
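The numbers behind such a heatmap are per-cluster medians of each marker. A minimal base R sketch on simulated data is shown below; the two-marker panel, the cluster labels, and the "CD3 higher than CD19" labelling rule are all hypothetical and only serve to show the median-then-label pattern.

```r
set.seed(1)
# Toy data: 100 cells x 2 markers; cells 1-50 are CD3-high, cells 51-100 are CD19-high
expr <- matrix(
  c(rnorm(50, mean = 4), rnorm(50, mean = 0),   # CD3 column
    rnorm(50, mean = 0), rnorm(50, mean = 4)),  # CD19 column
  ncol = 2,
  dimnames = list(NULL, c('CD3', 'CD19'))
)
cluster <- rep(c('1', '2'), each = 50)

# Median expression per cluster (rows) and marker (columns)
medians <- apply(expr, 2, function(marker) tapply(marker, cluster, median))

# Hypothetical labelling rule based on the dominant marker
labels <- ifelse(medians[, 'CD3'] > medians[, 'CD19'], 'T cells', 'B cells')

# Map cluster-level labels back to individual cells
cell_type <- unname(labels[cluster])
table(cell_type)
```

On real panels the labelling is usually done by eye from the heatmap, but the lookup in the last step is the same named-vector indexing used for `cluster_annotation`.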
## Cluster Merging
**Goal:** Reduce overclustering by merging similar clusters into biologically meaningful groups.
**Approach:** Define a mapping table from original to merged cluster IDs, then apply with CATALYST's mergeClusters.
```r
# Merge similar clusters
merging_table <- data.frame(
    original = 1:20,
    merged = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5,
               6, 6, 7, 7, 8, 8, 9, 9, 10, 10)
)
sce <- mergeClusters(sce, k = 'meta20', table = merging_table, id = 'merged')
```
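Conceptually, merging is a table lookup from original cluster ID to merged label. The base R sketch below shows that lookup on made-up cluster IDs and labels; it illustrates the mapping only and does not reproduce what `mergeClusters` stores inside the SCE.

```r
# Hypothetical merging table: six original clusters collapse into four populations
merging_table <- data.frame(
  original = 1:6,
  merged = c('T cells', 'T cells', 'B cells', 'NK cells', 'Monocytes', 'Monocytes'),
  stringsAsFactors = FALSE
)

# Hypothetical per-cell original cluster IDs
cell_cluster <- c(1, 3, 2, 6, 5, 4, 1)

# Look up the merged label for every cell
cell_merged <- merging_table$merged[match(cell_cluster, merging_table$original)]

# Guard against clusters missing from the table, which would silently become NA
stopifnot(!anyNA(cell_merged))
cell_merged
```

Checking for `NA` after the lookup is a cheap way to catch a merging table that does not cover every original cluster.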
## Abundance Analysis (per sample)
**Goal:** Quantify the relative frequency of each cell population across samples and conditions.
**Approach:** Cross-tabulate cluster assignments by sample ID, convert to proportions, and plot grouped by condition.
```r
# Cluster frequencies per sample
abundances <- table(cluster_ids(sce, 'meta20'), sce$sample_id)
freq <- prop.table(abundances, margin = 2)
# Plot
plotAbundances(sce, k = 'meta20', by = 'cluster_id', group_by = 'condition')
```
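The cross-tabulation step can be checked on a toy example in base R. The cluster and sample vectors below are invented; the point is that `margin = 2` normalizes within each sample (column), so every sample's proportions sum to 1, and that the result reshapes easily into a long table for plotting or statistics.

```r
# Hypothetical per-cell cluster and sample labels (12 cells, 3 samples)
cluster_id <- factor(c(1, 1, 2, 3,  1, 2, 2, 2,  3, 3, 1, 2))
sample_id  <- factor(rep(c('s1', 's2', 's3'), each = 4))

# Counts: clusters in rows, samples in columns
abundances <- table(cluster_id, sample_id)

# Proportions within each sample (columns sum to 1)
freq <- prop.table(abundances, margin = 2)

# Long format: one row per cluster-sample pair
freq_df <- as.data.frame(freq, responseName = 'proportion')
freq_df
```

Using `margin = 1` instead would normalize within each cluster, which answers a different question (which samples contribute to a cluster) and is a common source of confusion.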
## Marker Expression Summary
**Goal:** Summarize and compare marker expression levels across clusters and conditions.
**Approach:** Plot per-cluster marker expression distributions with CATALYST's plotClusterExprs, and pseudo-bulk (per-sample median) expression faceted by cluster with plotPbExprs.
```r
# Expression distributions per cluster
plotClusterExprs(sce, k = 'meta20', features = 'type')
# Pseudo-bulk expression by cluster and condition
plotPbExprs(sce, k = 'meta20', features = 'type', facet_by = 'cluster_id')
```
## Export Results
**Goal:** Save clustering results and annotated SCE object for downstream analysis or sharing.
**Approach:** Extract cluster assignments into colData, export as CSV, and serialize the full SCE as RDS.
```r
# Add cluster info to metadata
colData(sce)$cluster <- cluster_ids(sce, 'meta20')
# Export to CSV
results <- as.data.frame(colData(sce))
write.csv(results, 'clustering_results.csv', row.names = FALSE)
# Save SCE
saveRDS(sce, 'sce_clustered.rds')
```
## Choosing Number of Clusters
**Goal:** Determine the optimal number of metaclusters for the dataset.
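**Approach:** Inspect the delta area plot from consensus metaclustering and compare heatmaps at different K values to find where clusters remain distinct. Non-redundancy scores (NRS) rank markers rather than choose K, but help confirm that the clustering markers are informative.
```r
# Delta area plot (look for the K beyond which the curve levels off)
delta_area(sce)
# Non-redundancy scores: ranks markers, does not select K
plotNRS(sce, features = 'type')
# Or visual inspection of heatmap at different K
plotExprHeatmap(sce, features = 'type', by = 'cluster_id', k = 'meta10')
plotExprHeatmap(sce, features = 'type', by = 'cluster_id', k = 'meta20')
```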
## Batch Integration
**Goal:** Remove batch effects from cytometry data before or after clustering.
**Approach:** Detect batch effects by coloring UMAP by batch variable, then apply MNN correction with batchelor if needed.
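```r
library(batchelor)
# Check for batch effects (assumes the sample metadata includes a 'batch' column)
sce <- runDR(sce, dr = 'UMAP', features = 'type')
plotDR(sce, 'UMAP', color_by = 'batch')
# MNN correction if needed. CATALYST stores transformed data in the 'exprs' assay,
# whereas fastMNN defaults to 'logcounts', so name the assay explicitly
sce_corrected <- fastMNN(sce, batch = sce$batch, assay.type = 'exprs',
                         subset.row = rowData(sce)$marker_class == 'type')
```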
## Related Skills
- gating-analysis - Manual alternative
- differential-analysis - Compare clusters between conditions
- single-cell/clustering - Similar concepts for scRNA-seq