bio-expression-matrix-sparse-handling

Name: bio-expression-matrix-sparse-handling
Author: GPTomics/bioSkills
$npx mdskill add GPTomics/bioSkills/bio-expression-matrix-sparse-handling
Efficiently handle sparse expression matrices for bioinformatics data
Solve memory issues with large zero-heavy RNA-seq or single-cell datasets
Uses scipy.sparse, numpy, pandas, and anndata for sparse matrix operations
Analyzes sparsity level to determine if sparse format is appropriate
Converts dense data to sparse formats and provides sparsity metrics
SKILL.md
.github/skills/bio-expression-matrix-sparse-handlingView on GitHub ↗
---
name: bio-expression-matrix-sparse-handling
description: Work with sparse matrices for memory-efficient storage of count data. Use when dealing with single-cell data or large bulk RNA-seq datasets where most values are zero.
tool_type: python
primary_tool: scipy.sparse
---

## Version Compatibility

Reference examples tested with: numpy 1.26+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Sparse Matrix Handling

**"Convert counts to sparse matrix"** → Store zero-heavy expression data (especially single-cell) in memory-efficient sparse format.
- Python: `scipy.sparse.csr_matrix(dense_array)`, `anndata.X` stores sparse by default
- Python: `scipy.io.mmread('matrix.mtx')` for Market Matrix format (10x Genomics)

## Check Sparsity

```python
import numpy as np

# Calculate sparsity (proportion of zeros)
def check_sparsity(counts):
    zeros = (counts == 0).sum().sum()
    total = counts.size
    sparsity = zeros / total
    print(f'Sparsity: {sparsity:.1%} ({zeros:,} / {total:,} zeros)')
    return sparsity

# Rule of thumb: use sparse if >50% zeros
```

## Convert Dense to Sparse

```python
import scipy.sparse as sp
import pandas as pd

# From pandas DataFrame
dense_df = pd.read_csv('counts.csv', index_col=0)
sparse_matrix = sp.csr_matrix(dense_df.values)

# Keep row/column names
gene_names = dense_df.index.tolist()
sample_names = dense_df.columns.tolist()

# CSR vs CSC
# CSR (Compressed Sparse Row): efficient row slicing, matrix-vector products
# CSC (Compressed Sparse Column): efficient column slicing
sparse_csr = sp.csr_matrix(dense_df.values)  # Row-oriented
sparse_csc = sp.csc_matrix(dense_df.values)  # Column-oriented
```

## Convert Sparse to Dense

```python
import pandas as pd
import scipy.sparse as sp

# To numpy array
dense_array = sparse_matrix.toarray()

# To pandas DataFrame
dense_df = pd.DataFrame(
    sparse_matrix.toarray(),
    index=gene_names,
    columns=sample_names
)
```

## Memory Comparison

```python
import sys
import scipy.sparse as sp

def compare_memory(dense, sparse):
    dense_mb = dense.nbytes / 1e6
    sparse_mb = (sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes) / 1e6
    ratio = dense_mb / sparse_mb
    print(f'Dense:  {dense_mb:.1f} MB')
    print(f'Sparse: {sparse_mb:.1f} MB')
    print(f'Ratio:  {ratio:.1f}x smaller')
    return ratio

sparse = sp.csr_matrix(counts.values)
compare_memory(counts.values, sparse)
```

## Save/Load Sparse Matrices

```python
import scipy.sparse as sp
import numpy as np

# Save sparse matrix
sp.save_npz('counts_sparse.npz', sparse_matrix)

# Save with metadata
np.savez('counts_with_meta.npz',
    data=sparse_matrix.data,
    indices=sparse_matrix.indices,
    indptr=sparse_matrix.indptr,
    shape=sparse_matrix.shape,
    genes=np.array(gene_names),
    samples=np.array(sample_names))

# Load sparse matrix
sparse_matrix = sp.load_npz('counts_sparse.npz')

# Load with metadata
loaded = np.load('counts_with_meta.npz', allow_pickle=True)
sparse_matrix = sp.csr_matrix(
    (loaded['data'], loaded['indices'], loaded['indptr']),
    shape=tuple(loaded['shape']))
gene_names = loaded['genes'].tolist()
```

## AnnData with Sparse Matrices

```python
import anndata as ad
import scipy.sparse as sp
import pandas as pd

# Create AnnData with sparse matrix
adata = ad.AnnData(
    X=sp.csr_matrix(counts.values),
    obs=pd.DataFrame(index=counts.columns),  # Samples
    var=pd.DataFrame(index=counts.index)     # Genes
)

# Note: AnnData transposes so cells/samples are rows
adata = ad.AnnData(
    X=sp.csr_matrix(counts.T.values),  # Transpose for samples-as-rows
    obs=pd.DataFrame(index=counts.columns),
    var=pd.DataFrame(index=counts.index)
)

# Save (automatically handles sparse)
adata.write_h5ad('counts.h5ad')

# Check if stored sparse
adata = ad.read_h5ad('counts.h5ad')
print(f'Matrix type: {type(adata.X)}')
```

## Sparse Operations

```python
import scipy.sparse as sp
import numpy as np

# Row sums (gene totals)
row_sums = np.array(sparse_matrix.sum(axis=1)).flatten()

# Column sums (sample totals)
col_sums = np.array(sparse_matrix.sum(axis=0)).flatten()

# Filter rows by sum (keep genes with >10 total counts)
keep_mask = row_sums > 10
sparse_filtered = sparse_matrix[keep_mask, :]

# Filter columns (keep samples with >1000 counts)
keep_cols = col_sums > 1000
sparse_filtered = sparse_matrix[:, keep_cols]

# Log transform (add pseudocount)
sparse_log = sparse_matrix.copy()
sparse_log.data = np.log1p(sparse_log.data)
```

## Subsetting Sparse Matrices

```python
# Select specific genes
gene_idx = [gene_names.index(g) for g in ['TP53', 'BRCA1', 'MYC'] if g in gene_names]
subset = sparse_matrix[gene_idx, :]

# Select specific samples
sample_idx = [sample_names.index(s) for s in ['sample1', 'sample2']]
subset = sparse_matrix[:, sample_idx]

# Boolean indexing
expressed = row_sums > 0
sparse_expressed = sparse_matrix[expressed, :]
```

## Normalization on Sparse

**Goal:** Apply CPM normalization and log transformation to a sparse count matrix without converting to dense format.

**Approach:** Compute library sizes from column sums, broadcast scaling factors with sparse multiply for CPM, then transform only the nonzero data array in-place with log1p.

```python
import numpy as np
import scipy.sparse as sp

def normalize_sparse_cpm(sparse_matrix):
    '''CPM normalization for sparse matrix.'''
    lib_sizes = np.array(sparse_matrix.sum(axis=0)).flatten()
    scaling_factors = 1e6 / lib_sizes
    normalized = sparse_matrix.multiply(scaling_factors)  # Broadcasts across columns
    return normalized

def normalize_sparse_log1p(sparse_matrix):
    '''Log1p transform sparse matrix in place.'''
    result = sparse_matrix.copy()
    result.data = np.log1p(result.data)
    return result

cpm = normalize_sparse_cpm(sparse_matrix)
log_cpm = normalize_sparse_log1p(cpm)
```

## 10X Matrix Format

```python
import scipy.io
import pandas as pd

# Read 10X format
matrix = scipy.io.mmread('matrix.mtx').T.tocsr()  # Transpose and convert to CSR
features = pd.read_csv('features.tsv', sep='\t', header=None)
barcodes = pd.read_csv('barcodes.tsv', sep='\t', header=None)

gene_names = features[1].tolist()  # Gene symbols
cell_barcodes = barcodes[0].tolist()

# Write 10X format
scipy.io.mmwrite('output_matrix.mtx', sparse_matrix)
```

## Backed Mode for Large Datasets

**Goal:** Work with H5AD files too large to fit in memory by loading data on demand.

**Approach:** Open the file in backed mode so only accessed rows/columns are loaded from disk.

```python
import scanpy as sc

# Read-only backed mode -- data stays on disk, loaded on demand
adata = sc.read_h5ad('large_dataset.h5ad', backed='r')
print(f'Shape: {adata.shape}, Type: {type(adata.X)}')

# Only the accessed slice is loaded into memory
subset = adata[adata.obs['cell_type'] == 'T_cell', :].to_memory()
```

Backed mode limitations:
- Only the X matrix supports backed access; obs/var are fully loaded
- Very large sparse H5AD files (>35 GB) can still cause memory issues
- Write mode (`backed='r+'`) only supports updates to X, not obs/var
- Many scanpy functions require in-memory data; use `.to_memory()` for subsets

For datasets too large for backed mode, process in chunks:

```python
import anndata as ad
import numpy as np

def process_in_chunks(h5ad_path, chunk_size=10000, func=None):
    '''Process large H5AD file in cell chunks.'''
    adata = sc.read_h5ad(h5ad_path, backed='r')
    n_cells = adata.shape[0]
    results = []
    for start in range(0, n_cells, chunk_size):
        end = min(start + chunk_size, n_cells)
        chunk = adata[start:end].to_memory()
        if func:
            chunk = func(chunk)
        results.append(chunk)
    return ad.concat(results)
```

## Sparsity Changes After Transformation

After log-transformation or normalization, zero values become non-zero (log1p(0) = 0, but CPM normalization moves zeros to non-zero when prior counts are added). This reduces the effective sparsity:

```python
import scipy.sparse as sp
import numpy as np

def check_sparsity_after_transform(sparse_matrix, transform_name='original'):
    nnz = sparse_matrix.nnz
    total = sparse_matrix.shape[0] * sparse_matrix.shape[1]
    sparsity = 1 - (nnz / total)
    print(f'{transform_name}: {sparsity:.1%} sparse ({nnz:,} nonzero / {total:,} total)')
    return sparsity

# After normalization, sparsity decreases
# If <50% sparse after transformation, dense format may be more memory-efficient
check_sparsity_after_transform(adata.X, 'raw')
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
check_sparsity_after_transform(adata.X, 'after log1p')
```

Rule of thumb: if sparsity drops below ~50% after transformation, convert back to dense format. The sparse index overhead outweighs the benefit of skipping non-zeros.

## Sparse-Dense Arithmetic Gotcha

Adding/subtracting a dense numpy array from a scipy sparse matrix can return a `numpy.matrix` (deprecated type), not a regular array. Always convert explicitly:

```python
import numpy as np
import scipy.sparse as sp

# This may return numpy.matrix (deprecated)
result_bad = sparse_matrix + dense_array

# Always convert explicitly
result_good = np.asarray((sparse_matrix + dense_array))
```

## Related Skills

- expression-matrix/counts-ingest - Load count data
- expression-matrix/normalization - Normalization affects sparsity
- single-cell/data-io - Single-cell data loading
- single-cell/preprocessing - Single-cell normalization