scvelo
$
npx mdskill add K-Dense-AI/scientific-agent-skills/scveloModel cell differentiation trajectories from snapshot RNA-seq data.
- Predicts gene expression changes using unspliced to spliced mRNA ratios.
- Integrates with Scanpy and scVI-tools for trajectory inference.
- Calculates latent time and identifies driver genes automatically.
- Outputs inferred cell states and developmental pathways.
SKILL.md
.github/skills/scveloView on GitHub ↗
---
name: scvelo
description: RNA velocity analysis with scVelo. Estimate cell state transitions from unspliced/spliced mRNA dynamics, infer trajectory directions, compute latent time, and identify driver genes in single-cell RNA-seq data. Complements Scanpy/scVI-tools for trajectory inference.
license: BSD-3-Clause
metadata:
skill-author: Kuan-lin Huang
---
# scVelo — RNA Velocity Analysis
## Overview
scVelo is the leading Python package for RNA velocity analysis in single-cell RNA-seq data. It infers cell state transitions by modeling the kinetics of mRNA splicing — using the ratio of unspliced (pre-mRNA) to spliced (mature mRNA) abundances to determine whether a gene is being upregulated or downregulated in each cell. This allows reconstruction of developmental trajectories and identification of cell fate decisions without requiring time-course data.
**Installation:** `pip install scvelo`
**Key resources:**
- Documentation: https://scvelo.readthedocs.io/
- GitHub: https://github.com/theislab/scvelo
- Paper: Bergen et al. (2020) Nature Biotechnology. PMID: 32747759
## When to Use This Skill
Use scVelo when:
- **Trajectory inference from snapshot data**: Determine which direction cells are differentiating
- **Cell fate prediction**: Identify progenitor cells and their downstream fates
- **Driver gene identification**: Find genes whose dynamics best explain observed trajectories
- **Developmental biology**: Model hematopoiesis, neurogenesis, epithelial-to-mesenchymal transitions
- **Latent time estimation**: Order cells along a pseudotime derived from splicing dynamics
- **Complement to Scanpy**: Add directional information to UMAP embeddings
## Prerequisites
scVelo requires count matrices for both **unspliced** and **spliced** RNA. These are generated by:
1. **STARsolo** or **kallisto|bustools** with `lamanno` mode
2. **velocyto** CLI: `velocyto run10x` / `velocyto run`
3. **alevin-fry** / **simpleaf** with spliced/unspliced output
Data is stored in an `AnnData` object with `layers["spliced"]` and `layers["unspliced"]`.
## Standard RNA Velocity Workflow
### 1. Setup and Data Loading
```python
import scvelo as scv
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
# Configure settings
scv.settings.verbosity = 3 # Show computation steps
scv.settings.presenter_view = True
scv.settings.set_figure_params('scvelo')
# Load data (AnnData with spliced/unspliced layers)
# Option A: Load from loom (velocyto output)
adata = scv.read("cellranger_output.loom", cache=True)
# Option B: Merge velocyto loom with Scanpy-processed AnnData
adata_processed = sc.read_h5ad("processed.h5ad") # Has UMAP, clusters
adata_velocity = scv.read("velocyto.loom")
adata = scv.utils.merge(adata_processed, adata_velocity)
# Verify layers
print(adata)
# obs × var: N × G
# layers: 'spliced', 'unspliced' (required)
# obsm['X_umap'] (required for visualization)
```
### 2. Preprocessing
```python
# Filter and normalize (follows Scanpy conventions)
scv.pp.filter_and_normalize(
adata,
min_shared_counts=20, # Minimum counts in spliced+unspliced
n_top_genes=2000 # Top highly variable genes
)
# Compute first and second order moments (means and variances)
# knn_connectivities must be computed first
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=30)
scv.pp.moments(
adata,
n_pcs=30,
n_neighbors=30
)
```
### 3. Velocity Estimation — Stochastic Model
The stochastic model is fast and suitable for exploratory analysis:
```python
# Stochastic velocity (faster, less accurate)
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)
# Visualize
scv.pl.velocity_embedding_stream(
adata,
basis='umap',
color='leiden',
title="RNA Velocity (Stochastic)"
)
```
### 4. Velocity Estimation — Dynamical Model (Recommended)
The dynamical model fits the full splicing kinetics and is more accurate:
```python
# Recover dynamics (computationally intensive; ~10-30 min for 10K cells)
scv.tl.recover_dynamics(adata, n_jobs=4)
# Compute velocity from dynamical model
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
```
### 5. Latent Time
The dynamical model enables computation of a shared latent time (pseudotime):
```python
# Compute latent time
scv.tl.latent_time(adata)
# Visualize latent time on UMAP
scv.pl.scatter(
adata,
color='latent_time',
color_map='gnuplot',
size=80,
title='Latent time'
)
# Identify top genes ordered by latent time
top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(
adata,
var_names=top_genes,
sortby='latent_time',
col_color='leiden',
n_convolve=100
)
```
### 6. Driver Gene Analysis
```python
# Identify genes with highest velocity fit
scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)
df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
print(df.head(10))
# Speed and coherence
scv.tl.velocity_confidence(adata)
scv.pl.scatter(
adata,
c=['velocity_length', 'velocity_confidence'],
cmap='coolwarm',
perc=[5, 95]
)
# Phase portraits for specific genes
scv.pl.velocity(adata, ['Cpe', 'Gnao1', 'Ins2'],
ncols=3, figsize=(16, 4))
```
### 7. Velocity Arrows and Pseudotime
```python
# Arrow plot on UMAP
scv.pl.velocity_embedding(
adata,
arrow_length=3,
arrow_size=2,
color='leiden',
basis='umap'
)
# Stream plot (cleaner visualization)
scv.pl.velocity_embedding_stream(
adata,
basis='umap',
color='leiden',
smooth=0.8,
min_mass=4
)
# Velocity pseudotime (alternative to latent time)
scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')
```
### 8. PAGA Trajectory Graph
```python
# PAGA graph with velocity-informed transitions
scv.tl.paga(adata, groups='leiden')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
df.style.background_gradient(cmap='Blues').format('{:.2g}')
# Plot PAGA with velocity
scv.pl.paga(
adata,
basis='umap',
size=50,
alpha=0.1,
min_edge_width=2,
node_size_scale=1.5
)
```
## Complete Workflow Script
```python
import scvelo as scv
import scanpy as sc
def run_rna_velocity(adata, n_top_genes=2000, mode='dynamical', n_jobs=4):
"""
Complete RNA velocity workflow.
Args:
adata: AnnData with 'spliced' and 'unspliced' layers, UMAP in obsm
n_top_genes: Number of top HVGs for velocity
mode: 'stochastic' (fast) or 'dynamical' (accurate)
n_jobs: Parallel jobs for dynamical model
Returns:
Processed AnnData with velocity information
"""
scv.settings.verbosity = 2
# 1. Preprocessing
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=n_top_genes)
if 'neighbors' not in adata.uns:
sc.pp.neighbors(adata, n_neighbors=30)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
# 2. Velocity estimation
if mode == 'dynamical':
scv.tl.recover_dynamics(adata, n_jobs=n_jobs)
scv.tl.velocity(adata, mode=mode)
scv.tl.velocity_graph(adata)
# 3. Downstream analyses
if mode == 'dynamical':
scv.tl.latent_time(adata)
scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)
scv.tl.velocity_confidence(adata)
scv.tl.velocity_pseudotime(adata)
return adata
```
## Key Output Fields in AnnData
After running the workflow, the following fields are added:
| Location | Key | Description |
|----------|-----|-------------|
| `adata.layers` | `velocity` | RNA velocity per gene per cell |
| `adata.layers` | `fit_t` | Fitted latent time per gene per cell |
| `adata.obsm` | `velocity_umap` | 2D velocity vectors on UMAP |
| `adata.obs` | `velocity_pseudotime` | Pseudotime from velocity |
| `adata.obs` | `latent_time` | Latent time from dynamical model |
| `adata.obs` | `velocity_length` | Speed of each cell |
| `adata.obs` | `velocity_confidence` | Confidence score per cell |
| `adata.var` | `fit_likelihood` | Gene-level model fit quality |
| `adata.var` | `fit_alpha` | Transcription rate |
| `adata.var` | `fit_beta` | Splicing rate |
| `adata.var` | `fit_gamma` | Degradation rate |
| `adata.uns` | `velocity_graph` | Cell-cell transition probability matrix |
## Velocity Models Comparison
| Model | Speed | Accuracy | When to Use |
|-------|-------|----------|-------------|
| `stochastic` | Fast | Moderate | Exploratory; large datasets |
| `deterministic` | Medium | Moderate | Simple linear kinetics |
| `dynamical` | Slow | High | Publication-quality; identifies driver genes |
## Best Practices
- **Start with stochastic mode** for exploration; switch to dynamical for final analysis
- **Need good coverage of unspliced reads**: Short reads (< 100 bp) may miss intron coverage
- **Minimum 2,000 cells**: RNA velocity is noisy with fewer cells
- **Velocity should be coherent**: Arrows should follow known biology; randomness indicates issues
- **k-NN bandwidth matters**: Too few neighbors → noisy velocity; too many → oversmoothed
- **Sanity check**: Root cells (progenitors) should have high unspliced/spliced ratios for marker genes
- **Dynamical model requires distinct kinetic states**: Works best for clear differentiation processes
## Troubleshooting
| Problem | Solution |
|---------|---------|
| Missing unspliced layer | Re-run velocyto or use STARsolo with `--soloFeatures Gene Velocyto` |
| Very few velocity genes | Lower `min_shared_counts`; check sequencing depth |
| Random-looking arrows | Try different `n_neighbors` or velocity model |
| Memory error with dynamical | Set `n_jobs=1`; reduce `n_top_genes` |
| Negative velocity everywhere | Check that spliced/unspliced layers are not swapped |
## Additional Resources
- **scVelo documentation**: https://scvelo.readthedocs.io/
- **Tutorial notebooks**: https://scvelo.readthedocs.io/tutorials/
- **GitHub**: https://github.com/theislab/scvelo
- **Paper**: Bergen V et al. (2020) Nature Biotechnology. PMID: 32747759
- **velocyto** (preprocessing): http://velocyto.org/
- **CellRank** (fate prediction, extends scVelo): https://cellrank.readthedocs.io/
- **dynamo** (metabolic labeling alternative): https://dynamo-release.readthedocs.io/
More from K-Dense-AI/scientific-agent-skills
- adaptyvHow to use the Adaptyv Bio Foundry API and Python SDK for protein experiment design, submission, and results retrieval. Use this skill whenever the user mentions Adaptyv, Foundry API, protein binding assays, protein screening experiments, BLI/SPR assays, thermostability assays, or wants to submit protein sequences for experimental characterization. Also trigger when code imports `adaptyv`, `adaptyv_sdk`, or `FoundryClient`, or references `foundry-api-public.adaptyvbio.com`.
- aeonThis skill should be used for time series machine learning tasks including classification, regression, clustering, forecasting, anomaly detection, segmentation, and similarity search. Use when working with temporal data, sequential patterns, or time-indexed observations requiring specialized algorithms beyond standard ML approaches. Particularly suited for univariate and multivariate time series analysis with scikit-learn compatible APIs.
- anndataData structure for annotated matrices in single-cell analysis. Use when working with .h5ad files or integrating with the scverse ecosystem. This is the data format skill—for analysis workflows use scanpy; for probabilistic models use scvi-tools; for population-scale queries use cellxgene-census.
- arboretoInfer gene regulatory networks (GRNs) from gene expression data using scalable algorithms (GRNBoost2, GENIE3). Use when analyzing transcriptomics data (bulk RNA-seq, single-cell RNA-seq) to identify transcription factor-target gene relationships and regulatory interactions. Supports distributed computation for large-scale datasets.
- astropyComprehensive Python library for astronomy and astrophysics. This skill should be used when working with astronomical data including celestial coordinates, physical units, FITS files, cosmological calculations, time systems, tables, world coordinate systems (WCS), and astronomical data analysis. Use when tasks involve coordinate transformations, unit conversions, FITS file manipulation, cosmological distance calculations, time scale conversions, or astronomical data processing.
- autoskillObserve the user's screen via screenpipe, detect repeated research workflows, match them against existing scientific-agent-skills, and draft new skills (or composition recipes that chain existing ones) for the patterns not yet covered. Use when the user asks to analyze their recent work and propose skills based on what they actually do. Requires the screenpipe daemon (https://github.com/screenpipe/screenpipe) running locally on port 3030 — the skill has no other data source and will refuse to run if screenpipe is unreachable. All detection runs locally; only redacted cluster summaries reach the LLM.
- benchling-integrationBenchling R&D platform integration. Access registry (DNA, proteins), inventory, ELN entries, workflows via API, build Benchling Apps, query Data Warehouse, for lab data management automation.
- bgpt-paper-searchSearch scientific papers and retrieve structured experimental data extracted from full-text studies via the BGPT MCP server. Returns 25+ fields per paper including methods, results, sample sizes, quality scores, and conclusions. Use for literature reviews, evidence synthesis, and finding experimental details not available in abstracts alone.
- biopythonComprehensive molecular biology toolkit. Use for sequence manipulation, file parsing (FASTA/GenBank/PDB), phylogenetics, and programmatic NCBI/PubMed access (Bio.Entrez). Best for batch processing, custom bioinformatics pipelines, BLAST automation. For quick lookups use gget; for multi-service integration use bioservices.
- bioservicesUnified Python interface to 40+ bioinformatics services. Use when querying multiple databases (UniProt, KEGG, ChEMBL, Reactome) in a single workflow with consistent API. Best for cross-database analysis, ID mapping across services. For quick single-database lookups use gget; for sequence/file manipulation use biopython.