bio-experimental-design-sample-size

$npx mdskill add GPTomics/bioSkills/bio-experimental-design-sample-size

Calculate required replicates for statistical significance

  • Estimates sample sizes needed to detect expected effect sizes in RNA-seq and other omics studies.
  • Depends on ssizeRNA, DESeq2, powsimR, and pilot dispersion estimates from biological data.
  • Decides recommendations by comparing target power levels against variability and fold change inputs.
  • Delivers integer replicate counts per group to ensure experiments achieve statistical significance.
SKILL.md
.github/skills/bio-experimental-design-sample-sizeView on GitHub ↗
---
name: bio-experimental-design-sample-size
description: Estimates required sample sizes for differential expression, ChIP-seq, methylation, and proteomics studies. Use when budgeting experiments, writing grant proposals, or determining minimum replicates needed to achieve statistical significance for expected effect sizes.
tool_type: r
primary_tool: ssizeRNA
---

## Version Compatibility

Reference examples tested with: DESeq2 1.42+

Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Sample Size Estimation

**"How many samples do I need for my experiment?"** → Estimate required biological replicates per group for a target power level given expected effect sizes and variability.
- R: `ssizeRNA::ssizeRNA_single()`, `DESeq2` pilot dispersion estimates
- scRNA-seq: `powsimR::simulateDE()`

## RNA-seq Sample Size

```r
library(ssizeRNA)

# Estimate sample size for RNA-seq
# m = total genes, m1 = expected DE genes
# fc = fold change, fdr = target FDR
result <- ssizeRNA_single(nGenes = 20000, pi0 = 0.9, m = 200,
                          mu = 10, disp = 0.1, fc = 2,
                          fdr = 0.05, power = 0.8)
result$ssize  # Required n per group
```

## DESeq2-based Estimation

**Goal:** Derive realistic dispersion estimates from pilot RNA-seq data for use in power and sample size calculations.

**Approach:** Run DESeq2 on pilot count data to estimate per-gene dispersions, then extract the median dispersion as a representative variability parameter for power formulas.

```r
library(DESeq2)

# From pilot data
dds_pilot <- DESeqDataSetFromMatrix(pilot_counts, colData, ~condition)
dds_pilot <- DESeq(dds_pilot)

# Extract dispersion estimates for power calculation
dispersions <- mcols(dds_pilot)$dispGeneEst
median_disp <- median(dispersions, na.rm = TRUE)
# Use median_disp in power calculations
```

## Single-cell Sample Size

```r
library(powsimR)

# Estimate for scRNA-seq
# Accounts for dropout and cell-to-cell variability
params <- estimateParam(pilot_sce)
power <- simulateDE(params, n1 = 100, n2 = 100,
                    p.DE = 0.1, pLFC = 1)
```

## Sample Size by Assay Type

| Assay | Min Recommended | For Small Effects |
|-------|-----------------|-------------------|
| Bulk RNA-seq | 3 | 6-12 |
| scRNA-seq | 3 samples, 1000 cells | 6+ samples |
| ATAC-seq | 2 | 4-6 |
| ChIP-seq | 2 | 3-4 |
| Proteomics | 3 | 6-10 |
| Methylation | 4 | 8-12 |

## Budget Optimization

When resources are limited, prioritize:
1. Biological replicates over technical replicates
2. More samples over deeper sequencing (after ~20M reads for RNA-seq)
3. Balanced designs (equal n per group)

## Related Skills

- experimental-design/power-analysis - Power calculations
- experimental-design/batch-design - Optimal batch assignment
- single-cell/preprocessing - scRNA-seq experimental design
More from GPTomics/bioSkills