bio-expression-matrix-sparse-handling

$npx mdskill add GPTomics/bioSkills/bio-expression-matrix-sparse-handling

Efficiently handle sparse expression matrices for bioinformatics data

  • Solve memory issues with large zero-heavy RNA-seq or single-cell datasets
  • Uses scipy.sparse, numpy, pandas, and anndata for sparse matrix operations
  • Analyzes sparsity level to determine if sparse format is appropriate
  • Converts dense data to sparse formats and provides sparsity metrics
SKILL.md
.github/skills/bio-expression-matrix-sparse-handlingView on GitHub ↗
---
name: bio-expression-matrix-sparse-handling
description: Work with sparse matrices for memory-efficient storage of count data. Use when dealing with single-cell data or large bulk RNA-seq datasets where most values are zero.
tool_type: python
primary_tool: scipy.sparse
---

## Version Compatibility

Reference examples tested with: numpy 1.26+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Sparse Matrix Handling

**"Convert counts to sparse matrix"** → Store zero-heavy expression data (especially single-cell) in memory-efficient sparse format.
- Python: `scipy.sparse.csr_matrix(dense_array)`, `anndata.X` stores sparse by default
- Python: `scipy.io.mmread('matrix.mtx')` for Market Matrix format (10x Genomics)

## Check Sparsity

```python
import numpy as np

# Calculate sparsity (proportion of zeros)
def check_sparsity(counts):
    zeros = (counts == 0).sum().sum()
    total = counts.size
    sparsity = zeros / total
    print(f'Sparsity: {sparsity:.1%} ({zeros:,} / {total:,} zeros)')
    return sparsity

# Rule of thumb: use sparse if >50% zeros
```

## Convert Dense to Sparse

```python
import scipy.sparse as sp
import pandas as pd

# From pandas DataFrame
dense_df = pd.read_csv('counts.csv', index_col=0)
sparse_matrix = sp.csr_matrix(dense_df.values)

# Keep row/column names
gene_names = dense_df.index.tolist()
sample_names = dense_df.columns.tolist()

# CSR vs CSC
# CSR (Compressed Sparse Row): efficient row slicing, matrix-vector products
# CSC (Compressed Sparse Column): efficient column slicing
sparse_csr = sp.csr_matrix(dense_df.values)  # Row-oriented
sparse_csc = sp.csc_matrix(dense_df.values)  # Column-oriented
```

## Convert Sparse to Dense

```python
import pandas as pd
import scipy.sparse as sp

# To numpy array
dense_array = sparse_matrix.toarray()

# To pandas DataFrame
dense_df = pd.DataFrame(
    sparse_matrix.toarray(),
    index=gene_names,
    columns=sample_names
)
```

## Memory Comparison

```python
import sys
import scipy.sparse as sp

def compare_memory(dense, sparse):
    dense_mb = dense.nbytes / 1e6
    sparse_mb = (sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes) / 1e6
    ratio = dense_mb / sparse_mb
    print(f'Dense:  {dense_mb:.1f} MB')
    print(f'Sparse: {sparse_mb:.1f} MB')
    print(f'Ratio:  {ratio:.1f}x smaller')
    return ratio

sparse = sp.csr_matrix(counts.values)
compare_memory(counts.values, sparse)
```

## Save/Load Sparse Matrices

```python
import scipy.sparse as sp
import numpy as np

# Save sparse matrix
sp.save_npz('counts_sparse.npz', sparse_matrix)

# Save with metadata
np.savez('counts_with_meta.npz',
    data=sparse_matrix.data,
    indices=sparse_matrix.indices,
    indptr=sparse_matrix.indptr,
    shape=sparse_matrix.shape,
    genes=np.array(gene_names),
    samples=np.array(sample_names))

# Load sparse matrix
sparse_matrix = sp.load_npz('counts_sparse.npz')

# Load with metadata
loaded = np.load('counts_with_meta.npz', allow_pickle=True)
sparse_matrix = sp.csr_matrix(
    (loaded['data'], loaded['indices'], loaded['indptr']),
    shape=tuple(loaded['shape']))
gene_names = loaded['genes'].tolist()
```

## AnnData with Sparse Matrices

```python
import anndata as ad
import scipy.sparse as sp
import pandas as pd

# Create AnnData with sparse matrix
adata = ad.AnnData(
    X=sp.csr_matrix(counts.values),
    obs=pd.DataFrame(index=counts.columns),  # Samples
    var=pd.DataFrame(index=counts.index)     # Genes
)

# Note: AnnData transposes so cells/samples are rows
adata = ad.AnnData(
    X=sp.csr_matrix(counts.T.values),  # Transpose for samples-as-rows
    obs=pd.DataFrame(index=counts.columns),
    var=pd.DataFrame(index=counts.index)
)

# Save (automatically handles sparse)
adata.write_h5ad('counts.h5ad')

# Check if stored sparse
adata = ad.read_h5ad('counts.h5ad')
print(f'Matrix type: {type(adata.X)}')
```

## Sparse Operations

```python
import scipy.sparse as sp
import numpy as np

# Row sums (gene totals)
row_sums = np.array(sparse_matrix.sum(axis=1)).flatten()

# Column sums (sample totals)
col_sums = np.array(sparse_matrix.sum(axis=0)).flatten()

# Filter rows by sum (keep genes with >10 total counts)
keep_mask = row_sums > 10
sparse_filtered = sparse_matrix[keep_mask, :]

# Filter columns (keep samples with >1000 counts)
keep_cols = col_sums > 1000
sparse_filtered = sparse_matrix[:, keep_cols]

# Log transform (add pseudocount)
sparse_log = sparse_matrix.copy()
sparse_log.data = np.log1p(sparse_log.data)
```

## Subsetting Sparse Matrices

```python
# Select specific genes
gene_idx = [gene_names.index(g) for g in ['TP53', 'BRCA1', 'MYC'] if g in gene_names]
subset = sparse_matrix[gene_idx, :]

# Select specific samples
sample_idx = [sample_names.index(s) for s in ['sample1', 'sample2']]
subset = sparse_matrix[:, sample_idx]

# Boolean indexing
expressed = row_sums > 0
sparse_expressed = sparse_matrix[expressed, :]
```

## Normalization on Sparse

**Goal:** Apply CPM normalization and log transformation to a sparse count matrix without converting to dense format.

**Approach:** Compute library sizes from column sums, broadcast scaling factors with sparse multiply for CPM, then transform only the nonzero data array in-place with log1p.

```python
import numpy as np
import scipy.sparse as sp

def normalize_sparse_cpm(sparse_matrix):
    '''CPM normalization for sparse matrix.'''
    lib_sizes = np.array(sparse_matrix.sum(axis=0)).flatten()
    scaling_factors = 1e6 / lib_sizes
    normalized = sparse_matrix.multiply(scaling_factors)  # Broadcasts across columns
    return normalized

def normalize_sparse_log1p(sparse_matrix):
    '''Log1p transform sparse matrix in place.'''
    result = sparse_matrix.copy()
    result.data = np.log1p(result.data)
    return result

cpm = normalize_sparse_cpm(sparse_matrix)
log_cpm = normalize_sparse_log1p(cpm)
```

## 10X Matrix Format

```python
import scipy.io
import pandas as pd

# Read 10X format
matrix = scipy.io.mmread('matrix.mtx').T.tocsr()  # Transpose and convert to CSR
features = pd.read_csv('features.tsv', sep='\t', header=None)
barcodes = pd.read_csv('barcodes.tsv', sep='\t', header=None)

gene_names = features[1].tolist()  # Gene symbols
cell_barcodes = barcodes[0].tolist()

# Write 10X format
scipy.io.mmwrite('output_matrix.mtx', sparse_matrix)
```

## Backed Mode for Large Datasets

**Goal:** Work with H5AD files too large to fit in memory by loading data on demand.

**Approach:** Open the file in backed mode so only accessed rows/columns are loaded from disk.

```python
import scanpy as sc

# Read-only backed mode -- data stays on disk, loaded on demand
adata = sc.read_h5ad('large_dataset.h5ad', backed='r')
print(f'Shape: {adata.shape}, Type: {type(adata.X)}')

# Only the accessed slice is loaded into memory
subset = adata[adata.obs['cell_type'] == 'T_cell', :].to_memory()
```

Backed mode limitations:
- Only the X matrix supports backed access; obs/var are fully loaded
- Very large sparse H5AD files (>35 GB) can still cause memory issues
- Write mode (`backed='r+'`) only supports updates to X, not obs/var
- Many scanpy functions require in-memory data; use `.to_memory()` for subsets

For datasets too large for backed mode, process in chunks:

```python
import anndata as ad
import numpy as np

def process_in_chunks(h5ad_path, chunk_size=10000, func=None):
    '''Process large H5AD file in cell chunks.'''
    adata = sc.read_h5ad(h5ad_path, backed='r')
    n_cells = adata.shape[0]
    results = []
    for start in range(0, n_cells, chunk_size):
        end = min(start + chunk_size, n_cells)
        chunk = adata[start:end].to_memory()
        if func:
            chunk = func(chunk)
        results.append(chunk)
    return ad.concat(results)
```

## Sparsity Changes After Transformation

After log-transformation or normalization, zero values become non-zero (log1p(0) = 0, but CPM normalization moves zeros to non-zero when prior counts are added). This reduces the effective sparsity:

```python
import scipy.sparse as sp
import numpy as np

def check_sparsity_after_transform(sparse_matrix, transform_name='original'):
    nnz = sparse_matrix.nnz
    total = sparse_matrix.shape[0] * sparse_matrix.shape[1]
    sparsity = 1 - (nnz / total)
    print(f'{transform_name}: {sparsity:.1%} sparse ({nnz:,} nonzero / {total:,} total)')
    return sparsity

# After normalization, sparsity decreases
# If <50% sparse after transformation, dense format may be more memory-efficient
check_sparsity_after_transform(adata.X, 'raw')
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
check_sparsity_after_transform(adata.X, 'after log1p')
```

Rule of thumb: if sparsity drops below ~50% after transformation, convert back to dense format. The sparse index overhead outweighs the benefit of skipping non-zeros.

## Sparse-Dense Arithmetic Gotcha

Adding/subtracting a dense numpy array from a scipy sparse matrix can return a `numpy.matrix` (deprecated type), not a regular array. Always convert explicitly:

```python
import numpy as np
import scipy.sparse as sp

# This may return numpy.matrix (deprecated)
result_bad = sparse_matrix + dense_array

# Always convert explicitly
result_good = np.asarray((sparse_matrix + dense_array))
```

## Related Skills

- expression-matrix/counts-ingest - Load count data
- expression-matrix/normalization - Normalization affects sparsity
- single-cell/data-io - Single-cell data loading
- single-cell/preprocessing - Single-cell normalization
More from GPTomics/bioSkills