bio-single-cell-data-io
$ npx mdskill add GPTomics/bioSkills/bio-single-cell-data-io
Convert and manage single-cell data formats across Python and R.
- Handles loading, saving, and converting 10X Genomics datasets.
- Integrates with Seurat, Scanpy, AnnData, and h5ad/RDS tools.
- Adapts code dynamically when package versions mismatch.
- Outputs processed objects ready for downstream analysis.
---
name: bio-single-cell-data-io
description: Read, write, and create single-cell data objects using Seurat (R) and Scanpy (Python). Use for loading 10X Genomics data, importing/exporting h5ad and RDS files, creating Seurat objects and AnnData objects, and converting between formats. Use when loading, saving, or converting single-cell data formats.
tool_type: mixed
primary_tool: Seurat
---
## Version Compatibility
Reference examples tested with: Cell Ranger 8.0+, anndata 0.10+, numpy 1.26+, pandas 2.2+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
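The version check described above can be sketched with the standard library alone. The minimum versions are copied from the "tested with" line; the helper names (`parse_version`, `meets_minimum`, `check_environment`) are illustrative and not part of any package. Cell Ranger is a command-line tool rather than a Python package, so it is not covered here.

```python
from importlib import metadata

# Minimum versions copied from the "tested with" line above
MINIMUMS = {"anndata": "0.10", "numpy": "1.26", "pandas": "2.2", "scanpy": "1.10"}


def parse_version(text):
    """Turn '1.10.2' or '2.2.0rc1' into a tuple of leading integers, e.g. (1, 10, 2)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare versions numerically, so '1.10' counts as newer than '1.9'."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment(minimums=MINIMUMS):
    """Return {package: status} where status is 'ok', 'too old', or 'missing'."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        report[package] = "ok" if meets_minimum(installed, minimum) else "too old"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

Comparing integer tuples rather than raw strings avoids the common mistake of ranking `1.9` above `1.10`. Any package reported as `too old` or `missing` is a cue to introspect the installed API before reusing the examples.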
# Single-Cell Data I/O
Read, write, and create single-cell data objects for analysis.
## Scanpy (Python)
**Goal:** Load, create, and save single-cell data objects using Scanpy and AnnData.
**Approach:** Read 10X Genomics output, CSV, or Loom formats into AnnData objects, manipulate metadata and layers, and write to h5ad format.
**"Load my 10X data"** → Read Cell Ranger output directory or h5 file into an AnnData object with expression matrix, cell barcodes, and gene annotations.
### Required Imports
```python
import scanpy as sc
import anndata as ad
import pandas as pd
import numpy as np
```
### Reading 10X Genomics Data
```python
# Read Cell Ranger output (filtered_feature_bc_matrix directory)
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/', var_names='gene_symbols', cache=True)
print(f'Loaded {adata.n_obs} cells x {adata.n_vars} genes')
# Read 10X h5 file directly
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
# read_10x_h5 keeps duplicate gene symbols; make them unique before analysis
adata.var_names_make_unique()
```
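Before calling `sc.read_10x_mtx`, it can help to confirm that the directory really contains a complete matrix. The sketch below uses only the standard library and assumes the usual Cell Ranger file names (gzipped `features.tsv.gz` for version 3 and later, plain `genes.tsv` for version 2). The helper names `detect_10x_layout` and `missing_files` are hypothetical.

```python
import tempfile
from pathlib import Path

# File names written by Cell Ranger 3+ (gzipped) and by Cell Ranger 2 (plain text)
LAYOUTS = {
    "v3": ("matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz"),
    "v2": ("matrix.mtx", "barcodes.tsv", "genes.tsv"),
}


def detect_10x_layout(directory):
    """Return 'v3' or 'v2' if the directory holds a complete 10X matrix, else None."""
    directory = Path(directory)
    for layout, filenames in LAYOUTS.items():
        if all((directory / name).is_file() for name in filenames):
            return layout
    return None


def missing_files(directory, layout="v3"):
    """List the files of the given layout that are absent from the directory."""
    directory = Path(directory)
    return [name for name in LAYOUTS[layout] if not (directory / name).is_file()]


# Demo on a temporary directory standing in for a Cell Ranger output folder
with tempfile.TemporaryDirectory() as tmp:
    layout_before = detect_10x_layout(tmp)       # empty folder
    for name in LAYOUTS["v3"]:
        (Path(tmp) / name).touch()               # create empty placeholder files
    layout_after = detect_10x_layout(tmp)

print(layout_before, layout_after)
```

The check only looks at file names, not contents, so a truncated `matrix.mtx.gz` would still pass. It is meant to catch the frequent case of pointing the reader at the parent folder instead of `filtered_feature_bc_matrix/`.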
### AnnData Object Structure
```python
# AnnData stores:
# - adata.X: expression matrix (cells x genes)
# - adata.obs: cell metadata (DataFrame)
# - adata.var: gene metadata (DataFrame)
# - adata.uns: unstructured annotations (dict)
# - adata.obsm: cell embeddings (PCA, UMAP)
# - adata.varm: gene embeddings
# - adata.obsp: cell-cell graphs
# - adata.layers: alternative matrices (raw counts, normalized)
print(f'Shape: {adata.shape}')
print(f'Cell metadata: {adata.obs.columns.tolist()}')
print(f'Gene metadata: {adata.var.columns.tolist()}')
```
### Creating AnnData from Matrix
```python
import anndata as ad
import numpy as np
import pandas as pd
counts = np.random.poisson(1, size=(100, 500)) # 100 cells x 500 genes
cell_ids = [f'cell_{i}' for i in range(100)]
gene_ids = [f'gene_{i}' for i in range(500)]
adata = ad.AnnData(
X=counts,
obs=pd.DataFrame(index=cell_ids),
var=pd.DataFrame(index=gene_ids)
)
```
### Reading/Writing h5ad Files
```python
# h5ad is the native AnnData format
adata = sc.read_h5ad('data.h5ad')
# Write to h5ad
adata.write_h5ad('output.h5ad')
# Write compressed
adata.write_h5ad('output.h5ad', compression='gzip')
```
### Reading Other Formats
```python
# CSV (rows = cells, columns = genes); use adata.T if the file is genes x cells
adata = sc.read_csv('counts.csv')
# Loom format
adata = sc.read_loom('data.loom')
# Text file (whitespace-delimited by default; pass delimiter='\t' for strict TSV)
adata = sc.read_text('counts.txt')
```
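Because `sc.read_csv` treats rows as cells, a genes-by-cells table has to be transposed after loading (for example with `adata.T`). The following heuristic sketch guesses the orientation from the labels alone, using only the standard library. It assumes 10X-style barcodes such as `AAACCTGAGAAACCAT-1`; other protocols would need a different pattern. `guess_orientation` is a hypothetical helper name.

```python
import csv
import io
import itertools
import re

# 10X-style barcodes look like 'AAACCTGAGAAACCAT-1'; other protocols will differ
BARCODE = re.compile(r"^[ACGT]{8,}(-\d+)?$")


def guess_orientation(handle, delimiter=",", max_rows=50):
    """Return 'cells_as_rows', 'genes_as_rows', or 'unknown' for a labelled count table."""
    # Only the header and the first max_rows data rows are inspected
    rows = list(itertools.islice(csv.reader(handle, delimiter=delimiter), max_rows + 1))
    column_labels = rows[0][1:]                      # header row minus the corner cell
    row_labels = [row[0] for row in rows[1:] if row]

    def share(labels):
        # Fraction of labels that look like cell barcodes
        return sum(bool(BARCODE.match(x)) for x in labels) / max(len(labels), 1)

    if share(row_labels) > 0.9:
        return "cells_as_rows"
    if share(column_labels) > 0.9:
        return "genes_as_rows"
    return "unknown"


example = io.StringIO(
    "gene,AAACCTGAGAAACCAT-1,AAACCTGAGAAACCTA-1\n"
    "CD3D,0,4\n"
    "MS4A1,2,0\n"
    "LYZ,7,1\n"
)
orientation = guess_orientation(example)
print(orientation)  # genes_as_rows
```

A result of `genes_as_rows` means the loaded object should be transposed before analysis; `unknown` means the labels did not match the barcode pattern and the file needs a manual look.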
### Adding Metadata
```python
# Add cell metadata
adata.obs['sample'] = 'sample_1'
# A list must have exactly adata.n_obs entries (here: the 100-cell example above)
adata.obs['batch'] = ['batch_1'] * 50 + ['batch_2'] * 50
# Add gene metadata
adata.var['gene_type'] = 'protein_coding'
# Add unstructured data
adata.uns['experiment'] = 'PBMC_3k'
```
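Assigning a plain list relies on the labels being in the same order as the cells. When annotations come from an external table, joining on the barcode index is less error-prone. Since `adata.obs` is a pandas DataFrame, the idea can be sketched with pandas alone; the barcodes and labels below are made up for illustration.

```python
import pandas as pd

# Stand-in for adata.obs: a DataFrame indexed by cell barcode
obs = pd.DataFrame(index=["cell_0", "cell_1", "cell_2", "cell_3"])

# External annotation in a different order, missing cell_3 and with one extra cell
annotation = pd.DataFrame(
    {"batch": ["batch_2", "batch_1", "batch_1", "batch_2"]},
    index=["cell_2", "cell_0", "cell_1", "cell_99"],
)

# A left join aligns on the index: row order follows obs, unknown cells are dropped,
# and cells without an annotation get NaN instead of a neighbour's label
obs = obs.join(annotation, how="left")

unannotated = obs.index[obs["batch"].isna()].tolist()
print(obs)
print("Cells without a batch label:", unannotated)
```

With a real object the same join would be written as `adata.obs = adata.obs.join(annotation, how='left')`, assuming `annotation` is indexed by the same barcodes as `adata.obs_names`. Checking `unannotated` afterwards surfaces barcode mismatches such as a missing `-1` suffix.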
### Subsetting AnnData
```python
# Subset by cells
adata_subset = adata[adata.obs['batch'] == 'batch_1'].copy()
# Subset by genes (requires sc.pp.highly_variable_genes to have been run)
adata_subset = adata[:, adata.var['highly_variable']].copy()
# Subset by a QC threshold (n_genes is added by sc.pp.filter_cells)
adata_subset = adata[adata.obs['n_genes'] > 200, :].copy()
```
### Storing Raw Counts
```python
# Store raw counts before normalization
adata.raw = adata.copy()
# Access raw counts later
raw_counts = adata.raw.X
# Or use layers
adata.layers['counts'] = adata.X.copy()
```
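The `.copy()` calls matter because Scanpy's normalization steps typically modify `adata.X` in place. The NumPy-only sketch below illustrates the difference between an alias and a copy on a toy matrix; it does not use AnnData itself, and the target of 10 counts per cell is arbitrary.

```python
import numpy as np

X = np.array([[1.0, 0.0, 3.0],
              [0.0, 2.0, 2.0]])   # 2 cells x 3 genes, raw counts

alias = X            # second name for the same array, no data copied
snapshot = X.copy()  # independent copy of the counts

# In-place library-size normalization (to 10 counts per cell) followed by log1p
X /= X.sum(axis=1, keepdims=True) / 10.0
np.log1p(X, out=X)

print(np.shares_memory(alias, X))      # True: the alias was normalized too
print(np.shares_memory(snapshot, X))   # False: the snapshot still holds raw counts
print(snapshot)
```

Only the copied array still contains integer-valued counts after normalization, which is what count-based methods downstream expect.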
---
## Seurat (R)
**Goal:** Load, create, and save single-cell data objects using Seurat.
**Approach:** Read 10X Genomics output into Seurat objects, manipulate metadata, merge samples, and serialize with RDS or h5Seurat formats.
### Required Libraries
```r
library(Seurat)
library(Matrix)
```
### Reading 10X Genomics Data
```r
# Read Cell Ranger output directory
counts <- Read10X(data.dir = 'filtered_feature_bc_matrix/')
# Create Seurat object
seurat_obj <- CreateSeuratObject(counts = counts, project = 'PBMC', min.cells = 3, min.features = 200)
print(seurat_obj)
```
### Reading 10X h5 File
```r
# Read h5 file directly (requires the hdf5r package)
counts <- Read10X_h5('filtered_feature_bc_matrix.h5')
seurat_obj <- CreateSeuratObject(counts = counts, project = 'PBMC')
```
### Seurat Object Structure (v5)
```r
# Seurat v5 uses layers instead of slots
# - Layers: counts, data, scale.data
# - Metadata: seurat_obj@meta.data
# - Reductions: seurat_obj@reductions
# - Graphs: seurat_obj@graphs
# Access layers (v5 syntax)
counts <- LayerData(seurat_obj, layer = 'counts')
# Or shorthand
counts <- seurat_obj[['RNA']]$counts
# Access metadata
head(seurat_obj@meta.data)
```
### Creating from Matrix
```r
# Create from sparse matrix
counts <- Matrix(rpois(1000 * 500, 1), nrow = 500, ncol = 1000, sparse = TRUE)
rownames(counts) <- paste0('gene_', 1:500)
colnames(counts) <- paste0('cell_', 1:1000)
seurat_obj <- CreateSeuratObject(counts = counts, project = 'MyProject')
```
### Reading/Writing RDS Files
```r
# Save Seurat object
saveRDS(seurat_obj, file = 'seurat_obj.rds')
# Load Seurat object
seurat_obj <- readRDS('seurat_obj.rds')
```
### Adding Metadata
```r
# Add cell metadata
seurat_obj$sample <- 'sample_1'
# A vector must have one entry per cell (here: the 1000-cell example above)
seurat_obj$batch <- c(rep('batch_1', 500), rep('batch_2', 500))
# Or using AddMetaData
metadata_df <- data.frame(
cell_type = rep('unknown', ncol(seurat_obj)),
row.names = colnames(seurat_obj)
)
seurat_obj <- AddMetaData(seurat_obj, metadata = metadata_df)
```
### Subsetting Seurat Objects
```r
# Subset by metadata
seurat_subset <- subset(seurat_obj, subset = batch == 'batch_1')
# Subset by cells
seurat_subset <- subset(seurat_obj, cells = colnames(seurat_obj)[1:500])
# Subset by features
seurat_subset <- subset(seurat_obj, features = rownames(seurat_obj)[1:100])
```
### Merging Objects
```r
# Merge multiple Seurat objects
merged <- merge(seurat_obj1, y = c(seurat_obj2, seurat_obj3), add.cell.ids = c('S1', 'S2', 'S3'))
# Join layers after merge (v5)
merged <- JoinLayers(merged)
```
---
## Format Conversion
**Goal:** Convert single-cell data objects between Seurat (R) and AnnData (Python) formats.
**Approach:** Use SeuratDisk as an intermediary to convert via h5Seurat/h5ad bridge files.
### Seurat to AnnData
```r
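# In R: save as h5Seurat
# SeuratDisk is not on CRAN; install with remotes::install_github('mojaveazure/seurat-disk')
library(SeuratDisk)
# If SaveH5Seurat fails on a Seurat v5 assay, a common workaround is to convert it first:
# seurat_obj[['RNA']] <- as(seurat_obj[['RNA']], 'Assay')
SaveH5Seurat(seurat_obj, filename = 'data.h5seurat')
Convert('data.h5seurat', dest = 'h5ad')  # writes data.h5ad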
```
```python
# In Python: read converted file
adata = sc.read_h5ad('data.h5ad')
```
### AnnData to Seurat
```python
# In Python: save as h5ad
adata.write_h5ad('data.h5ad')
```
```r
# In R: convert and load
library(SeuratDisk)
Convert('data.h5ad', dest = 'h5seurat')
seurat_obj <- LoadH5Seurat('data.h5seurat')
```
## Common Data Formats
| Format | Extension | Description | Tool |
|--------|-----------|-------------|------|
| 10X MTX | folder | Cell Ranger output | Both |
| 10X h5 | .h5 | Cell Ranger HDF5 | Both |
| h5ad | .h5ad | AnnData native | Scanpy |
| RDS | .rds | R serialized | Seurat |
| Loom | .loom | HDF5-based | Both |
| h5Seurat | .h5seurat | Seurat HDF5 | Seurat |
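
The table can be turned into a small lookup for deciding which reader to reach for. This is a sketch with a hypothetical helper, `identify_format`; it goes by extension only and assumes any existing directory is a 10X MTX folder, which a real pipeline would want to verify.

```python
from pathlib import Path

# Extension -> (format name, tool column), taken from the table above
FORMATS = {
    ".h5": ("10X h5", "Both"),
    ".h5ad": ("h5ad", "Scanpy"),
    ".rds": ("RDS", "Seurat"),
    ".loom": ("Loom", "Both"),
    ".h5seurat": ("h5Seurat", "Seurat"),
}


def identify_format(path):
    """Map a path to a (format, tool) pair from the table, or None if unrecognized."""
    path = Path(path)
    if path.is_dir():
        # Assumption: an existing directory is a Cell Ranger matrix folder
        return ("10X MTX", "Both")
    return FORMATS.get(path.suffix.lower())


for name in ["sample.h5ad", "sample.RDS", "filtered_feature_bc_matrix.h5", "notes.txt"]:
    print(name, "->", identify_format(name))
```

Lower-casing the suffix lets `sample.RDS` and `sample.rds` resolve to the same entry. An `.h5` file is assumed to be Cell Ranger output here, although other tools also write plain HDF5 with that extension.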
## Related Skills
- preprocessing - QC filtering and normalization after loading
- clustering - Dimensionality reduction and clustering
- markers-annotation - Find marker genes and annotate cell types