bio-machine-learning-atlas-mapping
$
npx mdskill add GPTomics/bioSkills/bio-machine-learning-atlas-mappingMap single-cell data to reference atlases using pre-trained models.
- Annotates new datasets by transferring labels without retraining.
- Integrates scvi-tools, scanpy, and anndata for single-cell analysis.
- Executes surgical fine-tuning to update only query-specific parameters.
- Outputs shared latent embeddings for downstream cell type identification.
SKILL.md
.github/skills/bio-machine-learning-atlas-mappingView on GitHub ↗
---
name: bio-machine-learning-atlas-mapping
description: Maps query single-cell data to reference atlases using scArches transfer learning with scVI and scANVI models. Transfers cell type labels without retraining on combined data. Use when annotating new single-cell datasets using pre-trained reference models.
tool_type: python
primary_tool: scvi-tools
---
## Version Compatibility
Reference examples tested with: anndata 0.10+, scanpy 1.10+, scvi-tools 1.1+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Transfer Learning for Single-Cell Data
**"Map my scRNA-seq data onto a reference atlas"** → Transfer cell type labels from a pre-trained reference model to query cells using architectural surgery (scArches) without retraining.
- Python: `scvi.model.SCVI.load_query_data()` → `get_latent_representation()` → `scanpy.tl.ingest()`
## scVI Reference Mapping (scArches)
**Goal:** Map query single-cell data onto a pre-trained reference model to obtain a shared latent embedding.
**Approach:** Load a pre-trained scVI model, prepare query data with matching gene sets, then perform surgical fine-tuning that updates only query-specific parameters.
```python
import scvi
import scanpy as sc
# Load pre-trained reference model
adata_ref = sc.read_h5ad('reference.h5ad')
# Model must have been saved with save_anndata=True
scvi.model.SCVI.setup_anndata(adata_ref, layer='counts', batch_key='batch')
ref_model = scvi.model.SCVI.load('reference_model/', adata=adata_ref)
# Prepare query data
adata_query = sc.read_h5ad('query.h5ad')
# Subset to reference genes
adata_query = adata_query[:, adata_ref.var_names].copy()
# Set up query AnnData using reference setup
scvi.model.SCVI.prepare_query_anndata(adata_query, ref_model)
# Load query into model (creates "surgical" fine-tuned model)
query_model = scvi.model.SCVI.load_query_data(adata_query, ref_model)
# Surgical training: update only query-specific parameters
# weight_decay=0.0: Standard for surgery; prevents reference drift
query_model.train(max_epochs=200, plan_kwargs={'weight_decay': 0.0})
# Get latent representation
adata_query.obsm['X_scVI'] = query_model.get_latent_representation()
```
## scANVI for Label Transfer
**Goal:** Transfer cell type labels from a labeled reference atlas to unlabeled query data.
**Approach:** Train a semi-supervised scANVI model on the reference, then map query cells via surgical fine-tuning and predict labels with confidence scores.
```python
import scvi
import scanpy as sc
# Reference with cell type labels
adata_ref = sc.read_h5ad('reference_labeled.h5ad')
scvi.model.SCVI.setup_anndata(adata_ref, layer='counts', batch_key='batch')
ref_vae = scvi.model.SCVI(adata_ref, n_latent=30)
ref_vae.train(max_epochs=100)
# Convert to scANVI (semi-supervised)
scvi.model.SCANVI.setup_anndata(adata_ref, layer='counts', batch_key='batch', labels_key='cell_type', unlabeled_category='Unknown')
ref_scanvi = scvi.model.SCANVI.from_scvi_model(ref_vae, labels_key='cell_type', unlabeled_category='Unknown')
ref_scanvi.train(max_epochs=50)
ref_scanvi.save('reference_scanvi/')
# Map query data
adata_query = sc.read_h5ad('query.h5ad')
adata_query = adata_query[:, adata_ref.var_names].copy()
scvi.model.SCANVI.prepare_query_anndata(adata_query, ref_scanvi)
query_scanvi = scvi.model.SCANVI.load_query_data(adata_query, ref_scanvi)
query_scanvi.train(max_epochs=100, plan_kwargs={'weight_decay': 0.0})
# Transfer labels
adata_query.obs['predicted_cell_type'] = query_scanvi.predict()
adata_query.obsm['X_scANVI'] = query_scanvi.get_latent_representation()
```
## Prediction Confidence
**Goal:** Assess reliability of transferred labels and flag cells that may represent novel types.
**Approach:** Extract soft prediction probabilities from the scANVI model and identify low-confidence assignments below a threshold.
```python
# Get prediction probabilities
soft_predictions = query_scanvi.predict(soft=True)
adata_query.obs['prediction_confidence'] = soft_predictions.max(axis=1)
# Flag low-confidence predictions
# confidence < 0.5: May be novel cell type or poor mapping
low_conf = adata_query.obs['prediction_confidence'] < 0.5
print(f'Low confidence predictions: {low_conf.sum()} ({low_conf.mean():.1%})')
```
## Joint Embedding Visualization
**Goal:** Visualize reference and query cells together to assess integration quality.
**Approach:** Concatenate reference and query datasets, compute UMAP from the shared latent representation, and color by dataset and cell type.
```python
import scanpy as sc
# Combine reference and query for visualization
adata_combined = adata_ref.concatenate(adata_query, batch_key='dataset', batch_categories=['reference', 'query'])
# Use latent space for neighbors/UMAP
sc.pp.neighbors(adata_combined, use_rep='X_scVI')
sc.tl.umap(adata_combined)
sc.pl.umap(adata_combined, color=['dataset', 'cell_type'], save='_transfer.png')
```
## Pre-trained Reference Atlases
| Atlas | Model | URL |
|-------|-------|-----|
| Human Lung Cell Atlas | scANVI | cellxgene.cziscience.com |
| Tabula Sapiens | scVI | tabula-sapiens-portal.ds.czbiohub.org |
| Mouse Cell Atlas | scVI | bis.zju.edu.cn/MCA |
## Training Parameters
| Parameter | Surgical | Full Retrain | Notes |
|-----------|----------|--------------|-------|
| weight_decay | 0.0 | 0.001 | 0.0 preserves reference |
| max_epochs | 100-200 | 200-400 | Less for surgery |
| early_stopping | True | True | Prevents overfitting |
## Related Skills
- single-cell/cell-annotation - Manual annotation methods
- single-cell/batch-integration - Batch effect correction
- single-cell/preprocessing - Data preparation before transfer
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.