bio-single-cell-splicing
$ npx mdskill add GPTomics/bioSkills/bio-single-cell-splicing
Evaluate single-cell splicing feasibility and select optimal analysis tools.
- Determine if library chemistry supports per-cell isoform structure detection.
- Integrates with MARVEL, BRIE2, scQuint, SpliZ, Psix, and Sierra.
- Recommends methods based on Smart-seq3, FLASH-seq, or long-read data.
- Outputs tool selection criteria and version compatibility requirements.
---
name: bio-single-cell-splicing
description: Analyzes alternative splicing at single-cell resolution. The first decision is library chemistry — 10X 3' is fundamentally limited (RT primes from poly-A, R2 falls in 3' UTR, <0.1 junction read per cell per AS event). Plate-based full-length methods (Smart-seq3, FLASH-seq, VASA-seq, STORM-seq) and single-cell long-read (MAS-Iso-seq, scISOr-Seq2) are the chemistries that give per-cell isoform structure. Tools include MARVEL (R, Smart-seq integrated), BRIE2 (Bayesian PSI with regulatory features and ELBO_gain test), scQuint (junction-cluster, plate-based; not for 10X), SpliZ (annotation-free Z-score), Psix (graph-smoothness regulated AS), and Sierra (alternative polyadenylation, often confused with AS). Use when analyzing isoform usage in scRNA-seq, identifying cell-type-specific splicing, or determining whether scRNA-seq chemistry supports splicing analysis at all.
tool_type: python
primary_tool: MARVEL
---
## Version Compatibility
Reference examples tested with: MARVEL 2.0+, BRIE2 0.2.4+, scQuint 0.1+, SpliZ 0.0.1+, Sierra 1.0+, Psix 0.1+, anndata 0.10+, scanpy 1.10+, pandas 2.2+, scipy 1.13+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
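For the Python side of the stack, the version check above can be scripted. The following is a minimal sketch using only the standard library; the `MINIMUMS` table copies the Python package floors from the list above, and the helper names (`parse_version`, `check_versions`) are hypothetical, not part of any of the listed tools.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the compatibility list above (Python packages only).
MINIMUMS = {"anndata": "0.10", "scanpy": "1.10", "pandas": "2.2", "scipy": "1.13"}


def parse_version(text):
    """Turn '1.10.3' into (1, 10, 3); parsing stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_versions(minimums=MINIMUMS):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            report[package] = (None, False)
            continue
        ok = parse_version(installed) >= parse_version(minimum)
        report[package] = (installed, ok)
    return report
```

Calling `check_versions()` before running any example gives a quick table of what is installed and whether it meets the floor; R packages and CLI tools still need the manual checks listed above.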
# Single-Cell Splicing Analysis
The fundamental decision is **chemistry**, not tool. Most droplet 3' scRNA-seq cannot support transcriptome-wide splicing inference because reverse transcription primes from the poly(A) tail and most reads land in the 3' UTR — far from CDS-region splicing events. Plate-based full-length methods and single-cell long-read sequencing are the chemistries that give per-cell isoform structure across the gene body.
## The 10X 3' Problem (Quantified)
Three compounding mechanisms make 10X Chromium 3' (v3.1, GEM-X, v4) hostile to splicing:
1. **3' enrichment**: median fragment <1 kb from poly(A); >70% of unique reads fall within 3' UTR.
2. **Short R2 (~91 nt)**: each read straddles at most one junction; usually none, because R2 lands in 3' UTR.
3. **PCR concatemers and TSO artifacts**: pollute junction detection; UMI collapse is gene-level, not isoform-level.
**Quantitative estimate:** Only a small fraction of cassette exons sit close enough to the polyA site to be sampled by 3' chemistry (empirical estimates from APA/3'-end atlases — see Tian & Manley 2017 *Nat Rev Mol Cell Biol* for the 3' UTR isoform landscape). Effective junction read yield from 10X 3' is **<0.1 per cell per AS event** — vs the 5-10 needed for stable per-cell PSI. Most splicing analyses on 10X 3' data report artifacts.
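The gap between the yield and the requirement can be made concrete with a little arithmetic. The sketch below assumes junction reads per cell are Poisson-distributed around the mean yield (an illustrative assumption, not something the estimate above establishes), and treats 0.1 reads per cell per event as an upper bound, so the pooled cell counts it returns are lower bounds. Function names are hypothetical.

```python
import math


def cells_to_pool(reads_per_cell, target_reads):
    """Cells that must be pooled for the expected junction reads to reach the target."""
    if reads_per_cell <= 0:
        raise ValueError("reads_per_cell must be positive")
    # Small epsilon guards against floating-point noise pushing ceil up by one.
    return math.ceil(target_reads / reads_per_cell - 1e-9)


def prob_single_cell_sufficient(reads_per_cell, target_reads):
    """P(X >= target_reads) for X ~ Poisson(reads_per_cell): one cell on its own."""
    below = sum(
        math.exp(-reads_per_cell) * reads_per_cell**k / math.factorial(k)
        for k in range(target_reads)
    )
    return max(0.0, 1.0 - below)


for target in (5, 10):
    print(
        f"target={target} reads: pool >= {cells_to_pool(0.1, target)} cells; "
        f"P(single cell suffices) = {prob_single_cell_sufficient(0.1, target):.2e}"
    )
```

Under these assumptions, reaching the 5-10 reads needed for a stable PSI requires pooling at least 50-100 cells per event, which is why the 10X 3' route leads to pseudobulk or APA analysis rather than per-cell PSI.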
**The 5' kit (10X 5' GEX) does not solve this**: it shifts capture from the 3' UTR to 5' UTR / TSS-proximal regions, which is a marginal improvement and not a transcriptome-wide solution. Note that V(D)J recovery requires the **10X Chromium Single Cell Immune Profiling kit** (with TCR/BCR-specific enrichment), not 5' GEX alone, so immune-repertoire experiments must be designed around the dedicated V(D)J kit.
## Decision: Does Your Chemistry Support Splicing Analysis?
| Chemistry | Splicing analysis viable? | Recommended tool (or best alternative if not viable) |
|-----------|----------------------------|--------------------------|
| 10X 3' (Chromium v3, GEM-X, v4, Flex) | No (transcriptome-wide); maybe near-3'-end events | Sierra for APA |
| 10X 5' GEX | Limited; near-5'-end events only | Alternative TSS / 5'-end usage analysis (Sierra targets 3' ends; verify 5' support before using it here); switch to MAS-Iso-seq |
| Smart-seq2 | Yes (full transcript) | MARVEL or BRIE2 |
| Smart-seq3 / Smart-seq3xpress | Yes + UMI molecule counting | MARVEL or BRIE2 |
| FLASH-seq | Yes (faster, cheaper Smart-seq3) | MARVEL or BRIE2 |
| VASA-seq | Yes + total RNA (incl. nascent, IR) | MARVEL with IR analysis |
| STORM-seq | Yes + total RNA + ribodepletion | MARVEL with IR analysis |
| MAS-Iso-seq + 10X 5' (PacBio Kinnex) | Yes — full isoforms per cell | FLAMES, scNanoGPS, IsoQuant, see long-read-splicing |
| scISOr-Seq2 (PacBio + 10X) | Yes — full isoforms with cell-typing | FLAMES, IsoQuant |
| ONT direct cDNA scRNA | Yes | FLAMES |
| ONT direct RNA scRNA | Yes + native modifications | FLAMES |
## Tool Selection Matrix
| Tool | Best for | Input | Strengths | Fails when |
|------|----------|-------|-----------|------------|
| MARVEL | Smart-seq plate-based and (v2+) 10X droplet unified workflow | Plate or droplet BAMs + Seurat | SE/A5SS/A3SS/MXE/RI/AFE/ALE; modality classification; native Seurat integration; v2 droplet support | R-only |
| BRIE2 | Plate-based with regulatory feature prior | Plate BAM + GFF3 events | Bayesian variational PSI + ELBO_gain test; principled uncertainty; CLI-driven (`brie-count`, `brie-quant`) | TensorFlow dependency; slow at scale |
| scQuint | Plate-based annotation-free junction-cluster quantification (validated on Smart-seq2) | STAR junctions across cells | Cluster-level junction usage; Dirichlet-multinomial VAE (embedding) and GLM (differential splicing) | Authors recommend AGAINST use on 10X 3'/5' data (3'-bias confounds); plate-based only |
| SpliZ | Annotation-free discovery of cell-state-associated splicing | STAR-aligned BAMs | Per-gene Z-score; no event database needed | Annotation-free = power tradeoff |
| Psix | Regulated AS along trajectories | PSI matrix + kNN graph | Tests graph smoothness; robust to dropout | Needs cell-state graph upstream |
| Sierra | APA in 10X 3' (NOT splicing) | 10X BAM + GTF | Peak-calling 3' ends; DEXSeq DTU on UTR isoforms | APA only; not for cassette exons |
| pseudobulk leafcutter / rMATS | Between-cell-type differential splicing | Aggregated BAMs | Bulk-level statistical power | Loses within-cluster heterogeneity |
| MAS-Iso-seq + FLAMES | Full-length single-cell isoforms | 10X 5' + PacBio Kinnex | Full isoforms per cell at scale | Cost; complex pipeline |
## Decision Tree by Goal
| Goal | Recommended approach |
|------|----------------------|
| "Will my 10X 3' data support splicing?" | No transcriptome-wide; consider Sierra for APA. Note: scQuint authors recommend against use on 10X data |
| Cassette exon analysis in cell types from Smart-seq2 | MARVEL with `ComputePSI` + `AssignModality` + `CompareValues` |
| Discover cell-state-associated splicing without an event database | SpliZ |
| Test regulated AS along developmental pseudotime | Psix |
| Per-cell PSI with uncertainty in low-coverage cells | BRIE2 |
| Differential splicing between two well-defined cell types | Pseudobulk leafcutter or rMATS on aggregated BAMs |
| APA (alternative polyadenylation, often confused with AS) | Sierra |
| Full-length single-cell isoforms at scale | MAS-Iso-seq + FLAMES (long-read) |
| Microexons (3-27 nt) | Long-read or aligner with low overhang (uLTRA, deSALT) |
| snRNA-seq (nuclei) — IR question | Library captures nuclear RNA enriched for incomplete splicing — interpret IR cautiously |
## MARVEL Plate-Based Workflow
**Goal:** Run a unified workflow from STAR junctions to cell-type-specific splicing calls.
**Approach:** Build a wide splice-junction count matrix (rows = junctions keyed by `coord.intron`, columns = cells), assemble per-event feature tables, then construct MARVEL object with named slots (`SpliceJunction`, `SplicePheno`, `SpliceFeature`, `IntronCounts`, `GeneFeature`, `Exp`, `GTF`). Quantify PSI per event class, classify modality, test differential splicing.
```r
library(MARVEL); library(Seurat); library(data.table)
seurat_obj <- readRDS('cells.rds')
# Build wide SJ matrix: first column 'coord.intron' (e.g. 'chr1:100007082:100022621'),
# subsequent columns are per-cell sample IDs with junction counts as values.
# This is constructed from STAR SJ.out.tab files (one per cell) merged on intron coord.
sj_files <- list.files('star_pass2/', pattern='SJ.out.tab$', full.names=TRUE)
sj_long <- rbindlist(lapply(sj_files, function(f) {
  d <- fread(f, sep='\t', header=FALSE,
             col.names=c('chr','start','end','strand','motif','annot','unique','multi','overhang'))
  d$coord.intron <- paste(d$chr, d$start, d$end, sep=':')
  d$sample <- gsub('_SJ.out.tab$', '', basename(f))
  d[, .(coord.intron, sample, unique)]
}))
sj <- dcast(sj_long, coord.intron ~ sample, value.var='unique', fill=0)
# SpliceFeature is a NAMED LIST keyed by event class
df.feature.list <- list(
  SE = read.table('events_SE.txt', header=TRUE, sep='\t'),
  A5SS = read.table('events_A5SS.txt', header=TRUE, sep='\t'),
  A3SS = read.table('events_A3SS.txt', header=TRUE, sep='\t'),
  MXE = read.table('events_MXE.txt', header=TRUE, sep='\t'),
  RI = read.table('events_RI.txt', header=TRUE, sep='\t')
)
# SplicePheno: per-cell metadata; sample.id column maps to SpliceJunction column names
df.pheno <- seurat_obj@meta.data
df.pheno$sample.id <- rownames(df.pheno)
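marvel <- CreateMarvelObject(
  SpliceJunction = sj,
  SplicePheno = df.pheno,
  SpliceFeature = df.feature.list,
  GeneFeature = read.table('gene_features.tsv', header=TRUE, sep='\t'),
  Exp = read.table('tpm.tsv', header=TRUE, sep='\t', row.names=1),
  # MARVEL's tutorial passes the GTF as a plain data.frame rather than a GRanges;
  # confirm the expected type with ?CreateMarvelObject.
  GTF = as.data.frame(fread('annotation.gtf', sep='\t', header=FALSE, quote='"'))
)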
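# neurons / glia are vectors of sample.id values. 'cell_type' is a placeholder
# column name; substitute the annotation column your metadata actually carries.
neurons <- df.pheno$sample.id[df.pheno$cell_type == 'neuron']
glia <- df.pheno$sample.id[df.pheno$cell_type == 'glia']
marvel <- ComputePSI(marvel, CoverageThreshold=10, EventType='SE')
marvel <- AssignModality(marvel, sample.ids=df.pheno$sample.id, min.cells=25, seed=1)
# Argument names follow the MARVEL plate tutorial; confirm with ?CompareValues.
# Apply the delta-PSI cutoff (e.g. 0.1) when filtering the returned DE table.
marvel <- CompareValues(
  marvel,
  cell.group.g1 = neurons, cell.group.g2 = glia,
  min.cells = 25, method = 'wilcox', method.adjust = 'fdr',
  level = 'splicing', event.type = 'SE'
)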
```
For 10X droplet data, MARVEL v2+ provides `CreateMarvelObject.10x()` and `AnnotateSJ.10x()` constructors. Verify the exact API via `?CreateMarvelObject.10x` in installed MARVEL.
MARVEL classifies events into modalities (Song 2017 *Mol Cell*): included (PSI~1), excluded (PSI~0), bimodal (mixture at 0/1), middle (peaked ~0.5), multimodal. Bimodality usually reflects mixed cell states or stochastic monoallelic-like bursting. Mid-modality (peaked at 0.5) can be technical (mixed cells in a droplet) — confirm with full-length data.
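To make the five modality labels tangible, here is a coarse, threshold-based labeller. It is a sketch only: the cut-offs (`low`, `high`, `dominant`, `min_each`) are illustrative assumptions, and neither MARVEL nor Song et al. assign modalities this way (both fit distributions to the PSI values). The function name is hypothetical.

```python
import numpy as np


def coarse_modality(psi, low=0.25, high=0.75, dominant=0.75, min_each=0.25):
    """Label a per-event PSI vector with a coarse modality; NaN cells are ignored."""
    psi = np.asarray(psi, dtype=float)
    psi = psi[~np.isnan(psi)]
    if psi.size == 0:
        return "undetermined"
    frac_low = np.mean(psi < low)            # cells near PSI = 0
    frac_high = np.mean(psi > high)          # cells near PSI = 1
    frac_mid = np.mean((psi >= low) & (psi <= high))
    if frac_high >= dominant:
        return "included"
    if frac_low >= dominant:
        return "excluded"
    if frac_mid >= dominant:
        return "middle"
    if frac_low >= min_each and frac_high >= min_each and frac_mid < min_each:
        return "bimodal"
    return "multimodal"
```

For example, an event with half the cells at PSI 0 and half at PSI 1 is labelled `bimodal`, while one with most cells near 0.5 is labelled `middle` and should be checked for doublets or contamination before being interpreted biologically.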
## BRIE2 Bayesian PSI
**Goal:** Estimate per-cell PSI with informative regulatory-feature prior; test cell-state association via likelihood-ratio testing on covariate effects.
**Approach:** BRIE2 is a CLI-driven workflow (`brie-count` for read counting, `brie-quant` for variational inference + LRT). Prepare a GFF3 of splicing events, count cell-barcoded junction reads, then fit the model with covariate testing.
```bash
# 1. Count splicing events per cell
brie-count \
-a splicing_events.gff3 \
-S sample_list.tsv \
-o brie_counts/ \
-p 16
# 2. Fit BRIE2 with LRT against the cell-type covariate
brie-quant \
-i brie_counts/brie_count.h5ad \
-c cell_metadata.tsv \
-o brie_quant.h5ad \
--interceptMode gene \
--LRTindex All \
--testBase null \
--MCsize 3 \
--batchSize 1000000 \
-p 16
```
`--interceptMode gene` fits a gene-specific intercept (recommended); `--LRTindex All` tests all covariates; `--testBase null` uses the null model as the LRT reference. Verify exact flag set via `brie-quant -h` in installed BRIE2.
```python
import scanpy as sc
adata_splice = sc.read_h5ad('brie_quant.h5ad')
# Per-event covariate effects, ELBO values, and LRT statistics live in
# adata_splice.varm and adata_splice.var; column names depend on BRIE2 version.
# Inspect with: print(adata_splice); print(adata_splice.varm.keys())
# Per-event significance is typically derived from LRT delta-ELBO.
```
BRIE2 (Huang & Sanguinetti 2021 *Genome Biol*) uses a sequence-derived feature prior (exon length, GC content, splice site strength, motif counts) to regularize PSI estimates in low-coverage cells. The LRT-based covariate test answers "is this event associated with cell state?" without requiring per-cell PSI accuracy. Threshold the delta-ELBO at ~3 (analogous to log-Bayes-factor); confirm against version-specific output keys via the brie-tutorials repo.
## SpliZ for Annotation-Free Discovery
**Goal:** Identify splicing-defined cell populations without an event database.
**Approach:** Compute per-gene splicing Z-score across cells; test for cell-state association via permutation.
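```bash
# NOTE: illustrative invocation; the flag names below are placeholders.
# The published SpliZ is distributed as a Nextflow pipeline configured through
# pipeline parameters, so check its README for the actual interface.
spliz \
--bams sample1.bam sample2.bam \
--metadata cell_metadata.tsv \
--gtf annotation.gtf \
--output spliz_output/ \
--threads 8
```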
SpliZ (Olivieri 2022 *Nat Methods*) is robust to dropout because it pools junction information across the gene; particularly useful for discovering splicing diversity in heterogeneous tumor samples.
## Psix for Regulated AS Along Trajectories
**Goal:** Detect AS that varies coherently with cell state along a developmental trajectory, robust to dropout.
**Approach:** Score whether observed PSI is smooth on the cell-cell kNN graph from expression-space embedding.
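```python
import psix
import scanpy as sc
adata = sc.read_h5ad('cells.h5ad')
sc.pp.neighbors(adata, n_neighbors=30, use_rep='X_pca')
# NOTE: the constructor call and result column names below are illustrative and
# may not match your installed Psix. The Psix README builds the object with
# psix.Psix() plus junctions2psi(...) and passes a low-dimensional embedding
# table to run_psix(latent=...). Check help(psix.Psix) before running.
psix_obj = psix.Psix(adata, psi_matrix_path='psi_matrix.tsv')
psix_obj.run_psix()
regulated = psix_obj.psix_results.query('psix_score > 1.5 and pvalue < 0.05')
```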
Psix (Buen Abad Najar 2022 *Genome Res*) is the principled alternative to imputing PSI: do not impute (it obliterates heterogeneity); test for graph smoothness instead.
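The idea of testing smoothness on the cell-cell graph, instead of imputing, can be illustrated with a much simpler statistic than the one Psix uses. The sketch below is explicitly not Psix's probabilistic score. It is a swapped-in neighbour-difference statistic with a permutation p-value, using a brute-force kNN graph. All function names are hypothetical, and unobserved (NaN) PSI values are skipped rather than filled in.

```python
import numpy as np


def knn_indices(embedding, k):
    """Brute-force k nearest neighbours (self excluded) for a 2-D (cells x dims) array."""
    emb = np.asarray(embedding, dtype=float)
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]


def neighbour_smoothness(psi, neighbours):
    """Mean squared gap between each cell's PSI and its observed neighbours' mean PSI.

    Lower values mean PSI varies more smoothly over the graph.
    """
    psi = np.asarray(psi, dtype=float)
    observed = ~np.isnan(psi)
    gaps = []
    for i in np.flatnonzero(observed):
        nb = neighbours[i]
        nb = nb[observed[nb]]          # keep only neighbours with an observed PSI
        if nb.size:
            gaps.append((psi[i] - psi[nb].mean()) ** 2)
    return float(np.mean(gaps)) if gaps else float("nan")


def smoothness_pvalue(psi, neighbours, n_perm=999, seed=0):
    """Permutation test: is observed PSI smoother on the graph than shuffled PSI?"""
    rng = np.random.default_rng(seed)
    psi = np.asarray(psi, dtype=float)
    observed_stat = neighbour_smoothness(psi, neighbours)
    idx = np.flatnonzero(~np.isnan(psi))
    hits = 0
    for _ in range(n_perm):
        shuffled = psi.copy()
        shuffled[idx] = rng.permutation(psi[idx])   # shuffle observed values only
        if neighbour_smoothness(shuffled, neighbours) <= observed_stat:
            hits += 1
    return observed_stat, (hits + 1) / (n_perm + 1)
```

An exon whose PSI switches from 0 to 1 along a trajectory yields a small statistic and a small p-value, whereas the same PSI values scattered at random over the graph do not. Psix itself should be used for real analyses, since it models coverage-dependent noise that this sketch ignores.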
## Sierra for APA (Not Splicing)
**Goal:** Detect alternative polyadenylation in 10X 3' data — frequently confounded with AS.
**Approach:** Peak-call read pile-ups at 3' ends, then DEXSeq-style DTU on 3' UTR isoforms.
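```r
library(Sierra)
# Argument names follow the Sierra vignette; confirm with ?FindPeaks and ?DUTest.
# FindPeaks writes peaks to output.file; junctions.file is a BED of splice
# junctions for the same BAM (e.g. from `regtools junctions extract`).
FindPeaks(
  output.file = 'peaks.txt',
  gtf.file = 'annotation.gtf',
  bamfile = 'possorted_genome_bam.bam',
  junctions.file = 'junctions.bed'
)
# CountPeaks writes the per-cell peak count matrix into output.dir.
CountPeaks(
  peak.sites.file = 'peaks.txt',
  gtf.file = 'annotation.gtf',
  bamfile = 'possorted_genome_bam.bam',
  whitelist.file = 'barcodes.tsv',
  output.dir = 'sierra_counts'
)
counts <- ReadPeakCounts(data.dir = 'sierra_counts')
# peak_annot (from AnnotatePeaksFromGTF) and cell_idents (named cluster labels)
# are placeholders to be built from your own data.
peaks_obj <- NewPeakSCE(peak.data = counts, annot.info = peak_annot, cell.idents = cell_idents)
apa_results <- DUTest(peaks_obj, population.1 = 'ctrl', population.2 = 'trt')
```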
If you only have 10X 3' data, this is often what you actually want. Distinct UTRs change miRNA targeting, RBP binding, and stability — biologically meaningful but not splicing.
## Pseudobulk for Statistical Power
**Goal:** Recover bulk-level statistical power for differential splicing between cell types.
**Approach:** Sum junction counts across cells of the same cluster, then run leafcutter / rMATS on aggregated counts.
```python
import pandas as pd
import numpy as np
def pseudobulk_junctions(junction_counts, cell_metadata, groupby='cell_type'):
    out = {}
    for group, cells in cell_metadata.groupby(groupby).groups.items():
        mask = junction_counts.columns.isin(cells)
        out[group] = junction_counts.loc[:, mask].sum(axis=1)
    return pd.DataFrame(out)
```
Use pseudobulk for differential splicing **between** well-defined cell types; use per-cell methods for **within-population heterogeneity** (graded splicing along pseudotime, bimodal cell-state mixtures).
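A small usage example may help show what the aggregation produces. The sketch repeats the `pseudobulk_junctions` helper so it runs on its own, then applies it to a toy cassette exon. The junction labels (`inc_up`, `inc_down`, `skip`), the cell names, and the PSI formula (mean of the two inclusion junctions over that mean plus the skipping junction) are illustrative assumptions, not the leafcutter or rMATS model.

```python
import pandas as pd


def pseudobulk_junctions(junction_counts, cell_metadata, groupby="cell_type"):
    """Sum junction counts (rows = junctions, columns = cells) within each group."""
    out = {}
    for group, cells in cell_metadata.groupby(groupby).groups.items():
        mask = junction_counts.columns.isin(cells)
        out[group] = junction_counts.loc[:, mask].sum(axis=1)
    return pd.DataFrame(out)


# Toy cassette exon: two inclusion junctions and one skipping junction, four cells.
junction_counts = pd.DataFrame(
    {"c1": [2, 1, 0], "c2": [1, 2, 0], "c3": [0, 0, 3], "c4": [0, 1, 2]},
    index=["inc_up", "inc_down", "skip"],
)
cell_metadata = pd.DataFrame(
    {"cell_type": ["neuron", "neuron", "glia", "glia"]},
    index=["c1", "c2", "c3", "c4"],
)

pooled = pseudobulk_junctions(junction_counts, cell_metadata)

# Pseudobulk PSI per cell type from the pooled counts.
inclusion = pooled.loc[["inc_up", "inc_down"]].mean(axis=0)
skipping = pooled.loc["skip"]
psi = inclusion / (inclusion + skipping)
print(pooled)
print(psi)
```

The pooled table has one column per cell type, which is the shape that cluster-level tools expect once it is written out in their own input format.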
## Single-Cell Long-Read = Future of Single-Cell Splicing
In 2024-2026, full-length single-cell long-read sequencing has become practical and is the recommended chemistry for splicing-focused single-cell experiments:
- **MAS-Iso-seq / PacBio Kinnex**: concatenated full-length cDNA arrays, ~16x throughput vs plain Iso-Seq, compatible with 10X 5' libraries (Al'Khafaji 2024 *Nat Biotech*)
- **scISOr-Seq2**: hybrid 10X + PacBio for cell typing + isoform structure (Joglekar 2024 *Nat Neurosci* mouse cortex atlas)
- **ONT direct cDNA + 10X**: lower cost, similar information content
- **FLAMES**: barcode demultiplexing + isoform quantification + SNV calling for ONT scRNA (Tian 2021 *Genome Biol*)
For splicing-specific full-length single-cell analysis, see `long-read-splicing` skill.
## Per-Tool Failure Modes
### MARVEL: SpliceJunction Matrix Format
**Trigger:** Building the SpliceJunction matrix from STAR SJ.out.tab incorrectly (e.g. long-format instead of wide).
**Mechanism:** MARVEL plate-based `CreateMarvelObject(SpliceJunction = ...)` expects a **wide matrix** with first column `coord.intron` (formatted `chr:start:end`) and subsequent columns being per-cell sample IDs with integer junction counts. Long-format data.frames or missing `coord.intron` column cause runtime errors.
**Symptom:** "no `coord.intron` column found" errors; or empty PSI tables despite junction reads being present.
**Fix:** Verify wide-matrix structure; ensure SJ.out.tabs are merged on the `chr:start:end` key with cells as columns. Use `data.table::dcast` for the long→wide reshape.
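Before handing the matrix to R, the wide-format requirement can be checked on an exported table. This is a minimal pandas sketch under the format described above (first column `coord.intron` as `chr:start:end`, then one integer column per cell). `check_sj_matrix` is a hypothetical helper, not a MARVEL function.

```python
import pandas as pd


def check_sj_matrix(sj):
    """Return a list of format problems; an empty list means the table looks wide."""
    if sj.shape[1] < 2 or sj.columns[0] != "coord.intron":
        return ["first column must be 'coord.intron', followed by one column per cell"]
    problems = []
    coords = sj["coord.intron"].astype(str)
    if not coords.str.fullmatch(r"[^:]+:\d+:\d+").all():
        problems.append("coord.intron values must look like chr:start:end")
    if coords.duplicated().any():
        problems.append("duplicated coord.intron rows (long format, or files not merged)")
    counts = sj.iloc[:, 1:]
    non_numeric = [c for c in counts.columns
                   if not pd.api.types.is_numeric_dtype(counts[c])]
    if non_numeric:
        problems.append(f"non-numeric cell columns: {non_numeric}")
    elif ((counts < 0) | (counts % 1 != 0)).any().any():
        problems.append("junction counts must be non-negative integers")
    return problems
```

A long-format table (one row per junction per cell) fails on both the duplicate-coordinate and non-numeric-column checks, which matches the most common way this matrix goes wrong.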
### BRIE2: TensorFlow Memory
**Trigger:** Large cohort (>10k cells) with deep coverage.
**Mechanism:** Variational inference loads full count matrix; TensorFlow allocates GPU memory aggressively.
**Symptom:** OOM kills; training stalls.
**Fix:** Reduce `--batchSize` from the default (500000) to 100000 or 50000; fit events in per-chromosome batches; fall back to CPU mode for very small cohorts. Note that the flag is camelCase `--batchSize`, not `--batch_size`.
### scQuint: 3' Data Sparsity
**Trigger:** Running scQuint on 10X 3' v3 data hoping for splicing signal.
**Mechanism:** scQuint's Dirichlet-multinomial models need junction counts; 10X 3' yields too few junction reads to fit them robustly.
**Symptom:** All cells assign to one cluster; no informative splicing signal.
**Fix:** Pivot to APA analysis with Sierra; or upgrade chemistry to MAS-Iso-seq.
### Psix: Missing kNN Graph
**Trigger:** Running Psix without precomputed cell-cell graph.
**Mechanism:** Psix tests PSI smoothness on a pre-existing cell-cell graph; without one, no smoothness statistic.
**Symptom:** Empty results or error about missing `connectivities`.
**Fix:** Run `sc.pp.neighbors(adata)` before Psix; ensure `connectivities` is in `adata.obsp`.
### Sierra: Annotation Gaps
**Trigger:** GTF missing 3'UTR annotations.
**Mechanism:** Sierra peak-calls within annotated 3'UTRs; missing annotations mean missed peaks.
**Symptom:** Few peaks detected; gene-level coverage but no APA calls.
**Fix:** Use comprehensive GENCODE annotation; or run de-novo peak calling first.
## Reconciliation: When Single-Cell Tools Disagree
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| MARVEL sig, BRIE2 not | BRIE2 is conservative when per-cell PSI is noisy; MARVEL's group-level comparison behaves more like pseudobulk | Trust MARVEL for cell-type comparisons; BRIE2 for within-cluster |
| BRIE2 sig, MARVEL not | Cell-state effect smoother than cell-type boundary | Test along trajectory with Psix |
| SpliZ sig, MARVEL not | Annotation-free SpliZ catches novel events | Investigate junction structure manually |
| Sierra sig, MARVEL not | Sierra is APA, MARVEL is splicing — different biology | Distinguish in interpretation |
| Pseudobulk sig, per-cell not | Power issue; effect averaged out per-cell | Report at cluster level, not per-cell |
## Quantitative Concepts Unique to Single-Cell
**Per-cell PSI vs pseudobulk PSI:**
- Per-cell PSI: meaningful only when junction coverage exceeds ~10-20 reads per cell per event (plate-based or long-read).
- Pseudobulk PSI: aggregate, recovers bulk-level statistical power, discards within-cluster heterogeneity.
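The contrast between the two quantities can be sketched in a few lines. The code below assumes one inclusion count and one exclusion count per cell for a single event, and uses `min_reads=10` as the coverage floor (the lower end of the range above). Both function names are hypothetical.

```python
import numpy as np


def per_cell_psi(inclusion, exclusion, min_reads=10):
    """Per-cell PSI; cells below the coverage floor stay NaN instead of being guessed."""
    inclusion = np.asarray(inclusion, dtype=float)
    exclusion = np.asarray(exclusion, dtype=float)
    total = inclusion + exclusion
    psi = np.full(total.shape, np.nan)
    covered = total >= min_reads
    psi[covered] = inclusion[covered] / total[covered]
    return psi


def pseudobulk_psi(inclusion, exclusion):
    """One PSI for the whole group: pool reads first, then take the ratio."""
    inc = float(np.sum(inclusion))
    exc = float(np.sum(exclusion))
    total = inc + exc
    return inc / total if total > 0 else float("nan")


inclusion = [12, 0, 1, 8]
exclusion = [0, 15, 0, 4]
print(per_cell_psi(inclusion, exclusion))    # third cell has 1 read, so it stays NaN
print(pseudobulk_psi(inclusion, exclusion))
```

Here the per-cell values expose two cells at opposite extremes, while the pooled value sits in the middle: the pseudobulk estimate is stable but hides exactly the heterogeneity the per-cell view reveals.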
**Modality detection in PSI distributions** (Song 2017 *Mol Cell*):
| Modality | PSI distribution | Biology |
|----------|------------------|---------|
| Included | Peaked at 1 | Constitutive inclusion |
| Excluded | Peaked at 0 | Constitutive skipping |
| Bimodal | Mixture at 0 and 1 | Mixed cell states or monoallelic-like bursting |
| Middle | Peaked ~0.5 | Often technical (well-contamination, doublets, or low-coverage shrinkage to prior); confirm with full-length |
| Multimodal | Multiple peaks | Complex regulation; deserves follow-up |
**Beta-binomial vs binomial models:** with sparse counts, observed junction counts are overdispersed relative to a plain binomial, so binomial PSI understates uncertainty. Overdispersion-aware models (beta-binomial; BRIE2's hierarchical Bayesian model; leafcutter2's Dirichlet-multinomial at cluster level) handle this. For very sparse droplet data, even these fit poorly per cell, so collapse to pseudobulk.
**Imputation pitfalls:** naive imputation (MAGIC, scImpute, ALRA) of expression matrices is **not** appropriate for PSI: imputing missing junction counts averages over neighboring cells and obliterates the very heterogeneity you want to study. Psix's approach — testing smoothness of observed PSI on the kNN graph — is the principled alternative.
## Cell-Type-Specific Splicing Biology
| System | Event | Regulator |
|--------|-------|-----------|
| Neural microexons | 3-27 nt exons enriched in brain | SRRM4 (nSR100); SRRM3 in retina (Irimia 2014 *Cell*) |
| Neural differentiation | PTBP1 → PTBP2 switch | miR-124 represses PTBP1; derepresses neural exons (Boutz 2007 *Genes Dev*) |
| T-cell activation | CD45 (PTPRC) RA → RO via exon 4-6 skipping | hnRNP L, hnRNP LL, PSF (SFPQ) |
| Erythropoiesis | EPB41 exon 16 | Splicing factor switching during maturation |
| Cardiac development | TTN N2BA → N2B | RBM20 (TTN isoform switch); MBNL1/CELF1 antagonism drives the broader postnatal program |
| EMT | FGFR2 IIIb → IIIc, ENAH exon 11a | ESRP1/2 loss in mesenchymal state (Warzecha 2009 *Mol Cell*) |
## Quality Thresholds
| Metric | Recommendation |
|--------|----------------|
| Cells per event with reads | >=50 (per-cell PSI); pseudobulk cluster sizes are covered in the cells-per-cluster row |
| Junction reads per event per cell | >=5 for a usable per-cell PSI (>=10 preferred, matching `CoverageThreshold=10` in the MARVEL workflow); <=1 = unreliable |
| PSI variance for cell-type call | <0.1 within cluster, >0.2 between clusters |
| Library | full-length plate or long-read for transcriptome-wide; 3' for APA only |
| Doublet filtering | Required before splicing analysis (DoubletFinder, Scrublet) |
| Cells per cluster (pseudobulk) | >=100 ideal; >=50 minimum |
| nuclear vs whole-cell | snRNA-seq enriches IR; treat with caution |
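The two coverage rows of the table can be applied as a simple event filter. This is a minimal sketch assuming an events x cells array of junction reads; the defaults mirror the >=5 reads and >=50 cells thresholds above, and the function name is hypothetical.

```python
import numpy as np


def events_passing_qc(junction_reads, min_reads_per_cell=5, min_cells=50):
    """Boolean mask over events: enough cells each carrying enough junction reads."""
    reads = np.asarray(junction_reads)
    cells_with_coverage = (reads >= min_reads_per_cell).sum(axis=1)
    return cells_with_coverage >= min_cells


# Two events x three cells; min_cells is lowered only to suit the toy size.
toy_reads = np.array([[5, 6, 0],
                      [1, 0, 9]])
print(events_passing_qc(toy_reads, min_cells=2))
```

Events that fail the mask are candidates for pseudobulk analysis instead of per-cell PSI, subject to the cells-per-cluster rows in the same table.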
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `MARVEL: ComputePSI returns empty` | STAR SJ.out.tab missing strand info | Re-run STAR with `--outSJtype Standard` |
| `brie.tl.fit: NaN loss` | Insufficient junction reads per cell | Filter cells with `min_reads=20`; raise threshold |
| `scQuint: convergence not reached` | Dirichlet-multinomial model fit on too few junctions | Aggregate by chromosome; or switch chemistry |
| `Psix: missing connectivities` | Neighbors graph not computed | Run `sc.pp.neighbors(adata)` first |
| `Sierra: no peaks called` | GTF missing 3'UTR annotations | Use comprehensive GENCODE; or de-novo peak-call |
| `MARVEL: ggplot error` | Seurat version mismatch | Match MARVEL and Seurat versions |
| `FLAMES: barcode rescue failed` | Short-read 10X output not in expected directory | Verify cellranger output structure |
## Common Pitfalls
- **Treating 10X 3' splicing analysis as legitimate** — the chemistry doesn't support it. Use Sierra for APA or upgrade to MAS-Iso-seq.
- **Imputing PSI matrices** — destroys the heterogeneity you want to detect. Use Psix or BRIE2 instead.
- **Per-cell PSI on droplet data** — typically too sparse for stable estimates. Use pseudobulk first, then drill down to per-cell.
- **Confusing APA with splicing** — Sierra results look like AS but are 3' UTR isoforms. Different machinery, different biology.
- **snRNA-seq IR signal misinterpreted as splicing dysregulation** — nuclear RNA is enriched for incompletely spliced transcripts; baseline IR is high.
- **Trusting per-cell PSI from BRIE2 without ELBO_gain test** — BRIE2's per-cell point estimates are noisy; the principled output is the ELBO_gain cell-state-association statistic.
- **Microexon analysis with default short-read aligners** — anchors >=20 nt miss most microexons; use VAST-TOOLS, MicroExonator, or long-read.
- **Skipping doublet filtering before splicing** — doublets create artificial PSI mid-modality.
## Related Skills
- single-cell/preprocessing - QC and normalization (must run before splicing)
- single-cell/clustering - Cell type annotation prerequisite
- single-cell/doublet-detection - Doublet filtering critical for splicing
- single-cell/data-io - h5ad / Seurat I/O
- splicing-quantification - Bulk RNA-seq comparison context
- long-read-splicing - Full-isoform analysis from MAS-Iso-seq, scISOr-Seq2; future of single-cell splicing
## References
- Huang & Sanguinetti 2021 *Genome Biol* - BRIE2
- Wen et al 2023 *Nucleic Acids Res* - MARVEL
- Benegas, Fischer & Song 2022 *eLife* - scQuint (annotation-free single-cell splicing analysis, validated on Smart-seq2)
- Olivieri et al 2022 *Nat Methods* - SpliZ
- Buen Abad Najar et al 2022 *Genome Res* - Psix
- Patrick et al 2020 *Genome Biol* - Sierra
- Song et al 2017 *Mol Cell* - splicing modality classification
- Picelli et al 2014 *Nat Protoc* - Smart-seq2
- Hagemann-Jensen et al 2020 *Nat Biotech* - Smart-seq3
- Hagemann-Jensen et al 2022 *Nat Biotech* - Smart-seq3xpress
- Hahaut et al 2022 *Nat Biotech* - FLASH-seq
- Salmen et al 2022 *Nat Biotech* - VASA-seq
- Johnson et al 2023 *Nat Commun* - STORM-seq
- Al'Khafaji et al 2024 *Nat Biotech* - MAS-Iso-seq / Kinnex
- Tian et al 2021 *Genome Biol* - FLAMES
- Joglekar et al 2024 *Nat Neurosci* - scISOr-Seq2 mouse cortex atlas
- Irimia et al 2014 *Cell* - neural microexons / SRRM4
- Boutz et al 2007 *Genes Dev* - PTBP1/PTBP2 neural switch
- Tian & Manley 2017 *Nat Rev Mol Cell Biol* - alternative polyadenylation and 3' UTR isoforms