bio-atac-seq-deep-learning-atac

Name: bio-atac-seq-deep-learning-atac
Author: GPTomics/bioSkills
$npx mdskill add GPTomics/bioSkills/bio-atac-seq-deep-learning-atac
Predict chromatin accessibility from DNA sequences
Corrects Tn5 bias and predicts per-base accessibility profiles
Depends on chromBPNet, BPNet, scBasset, or EnFormer tools
Decides actions by analyzing sequence windows for variant effects
Delivers corrected predictions via Python scripts using CNN models
SKILL.md
.github/skills/bio-atac-seq-deep-learning-atacView on GitHub ↗
---
name: bio-atac-seq-deep-learning-atac
description: Sequence-based deep learning for ATAC-seq using chromBPNet, BPNet, scBasset, or EnFormer. Use when correcting Tn5 bias with neural networks beyond k-mer models, predicting per-base accessibility profiles, scoring in silico variant effects at GWAS or rare-variant SNPs, discovering motifs via DeepLIFT/TF-MoDISco from a trained model, or generating cell-type-specific accessibility predictions for unobserved cell states.
tool_type: python
primary_tool: chrombpnet
---

## Version Compatibility

Reference examples tested with: chrombpnet 0.1.7+, bpnet-lite 0.6+ (DOI 10.5281/zenodo.7011327), scBasset 0.1.0+ (basenji2 fork), tangermeme 0.1+, tfmodisco-lite 2.2+, DeepLIFT 0.6+, captum 0.7+, tensorflow 2.13+, pytorch 2.1+, kipoi 0.8+.

Verify before use:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws unexpected errors, introspect the installed package and adapt rather than retrying. Deep-learning tooling evolves rapidly; method papers post 2023 may have superseded reference implementations.

# Sequence-Based Deep Learning for ATAC-seq

**"Score the effect of a GWAS SNP on chromatin accessibility"** -> Train (or use pre-trained) sequence-to-accessibility CNNs that take 1-5 kb DNA windows and predict per-base Tn5 cleavage profiles. Outputs include: bias-corrected accessibility, single-base mutation effect predictions, and DeepLIFT contribution scores convertible to motifs via TF-MoDISco.

- CLI: `chrombpnet pipeline --bigwig signal.bw --bigwig-bias bias.bw ...`
- Python: `bpnet-lite` for custom architectures; `tangermeme` for fast scoring
- Python (single-cell): `scBasset` for per-cell sequence-based predictions
- Python (long-context): EnFormer pre-trained models via Kipoi

Sequence models are NOT a replacement for MACS+TOBIAS at every step. They excel at three specific tasks where classical pipelines struggle: (1) Tn5 bias correction in low-complexity sequence contexts, (2) variant effect prediction in non-genic regions, (3) cell-type-specific motif discovery beyond what JASPAR provides.

## Algorithmic Taxonomy

| Tool | Architecture | Training | Output | Strength | Fails when |
|------|-------------|----------|--------|----------|------------|
| chromBPNet (Pampari 2024) | Two-track CNN: bias model + accessibility model; bias trained on naked-DNA control or k-mer baseline, accessibility trained on chromatin signal | Per-cell-type, paired bias track | Bias-corrected per-base profile + total counts | Best-in-class bias correction; established in Kundaje lab pipelines | Requires GPU, ~24h training per cell type; needs >= 50M reads |
| BPNet (Avsec 2021) | Original counts + profile dual-head CNN | TF ChIP-seq or ATAC | Per-base profile prediction | Foundational; widely cited; bpnet-lite reimpl maintained | Less polished than chromBPNet for ATAC; bias correction needs separate model |
| scBasset (Yuan & Kelley 2022) | Basenji2-derived CNN, per-cell projection layer | Single-cell ATAC | Per-cell sequence-derived peak score | First sequence model that predicts per-cell accessibility; outperforms chromVAR for cluster discrimination | Fixed architecture, hard to extend; benchmarks evolving |
| EnFormer (Avsec 2021 Nat Methods) | Long-context Transformer (196 kb input) | Reference epigenome (DNase + histones + CAGE) | Per-bin epigenome prediction | Best for distal regulation modeling; pre-trained available | Pre-trained models cell-line specific; finetuning on custom data is expensive |
| Borzoi (Linder 2025) | EnFormer extension trained on RNA + ATAC | Multi-tissue paired data | Sequence -> RNA + chromatin | Recent SOTA for variant effect on RNA via ATAC linkage | Newer; benchmarks still emerging |
| DeepATAC / Basset (legacy) | Earlier CNN architectures | -- | Binary peak prediction | Historical context; cited in older literature | Superseded by chromBPNet + EnFormer; do not use for new work |
| tangermeme | Inference-only fast wrapper | Use any saved model | Marginal scoring of variants | Speeds up variant effect prediction 100x; works with chromBPNet/BPNet outputs | Inference only; cannot train |

Methodology evolves; verify against current Kundaje lab pipelines (chrombpnet GitHub), Greenleaf lab (scBasset), and Avsec / Linder publications before locking pipelines.

## When Deep Learning Helps vs When Classical Pipelines Suffice

| Task | Classical | Deep Learning |
|------|-----------|---------------|
| Peak calling | MACS3 / Genrich (sufficient) | chromBPNet (overkill unless variant downstream) |
| Tn5 bias correction at TF motifs | TOBIAS ATACorrect (good) | chromBPNet (better at hard cases: low-complexity flanks, deep TF footprints) |
| Differential accessibility | DiffBind / DESeq2 (sufficient) | -- (no clear DL advantage) |
| GWAS variant effect prediction at causal SNPs | Limited (overlap heuristics) | chromBPNet / EnFormer (essential) |
| Motif discovery from de novo data | MEME / HOMER (good) | chromBPNet + TF-MoDISco (better; finds composite + cooperative motifs) |
| Per-cell TF activity | chromVAR (sufficient at the cluster level) | scBasset (better at fine-grained cell states) |
| Cross-cell-type accessibility prediction | -- | EnFormer / Borzoi (only option) |
| Predicting cell-type-specific enhancer activity from sequence | -- | chromBPNet / EnFormer (essential) |

For most standard ATAC analysis, classical pipelines remain primary. Deep learning enters when (a) variant interpretation is the goal, (b) cell-type prediction is needed beyond observed data, or (c) bias correction quality is paramount (low-input, FFPE, transcription factors with weak motifs).

## Per-Tool Failure Modes

### chromBPNet -- Bias model mismatch

**Trigger:** Training the bias model on a dataset different from the accessibility dataset (e.g. K562 bias model used on primary T cells).

**Mechanism:** chromBPNet's bias model captures sequence-specific Tn5 preference, which is mostly cell-type-invariant BUT contributions of chromatin context at cuts can vary. Cross-celltype bias models work but with degraded performance.

**Symptom:** Predicted footprints look correct at known TFs (CTCF) but fail on cell-type-specific regulators.

**Fix:** Train a per-cell-type bias model from naked-DNA control if available, OR use the chromBPNet authors' pre-trained k562 / GM12878 / HepG2 bias as a fallback (acknowledged degradation).

### chromBPNet -- Insufficient training data

**Trigger:** Training on < 50M deduplicated nuclear reads, or < 30k peaks.

**Mechanism:** CNN training needs enough peaks for stable gradient updates and enough background regions for the dual-task loss.

**Fix:** Pool replicates before training; reduce model capacity (`--num-filters`); use pre-trained model on closest cell type and skip retraining.

### BPNet / chromBPNet -- DeepLIFT vs Integrated Gradients confusion

**Trigger:** Computing per-base contributions for motif discovery.

**Mechanism:** DeepLIFT (RevealCancel rule) and Integrated Gradients (50 baseline samples) give different attribution patterns. DeepLIFT preserves additivity; IG is stochastic.

**Fix:** Use DeepLIFT rescale-rule (chromBPNet default) for TF-MoDISco. IG only when DeepLIFT fails on saturating activations. Document the choice.

### scBasset -- Cell projection layer instability

**Trigger:** Few cells per cluster; sparse training data.

**Mechanism:** scBasset learns a per-cell projection vector; with < 100 cells per cluster the projection is noisy.

**Fix:** Aggregate cells to clusters before training, OR use chromBPNet trained on pseudobulks per cluster instead.

### EnFormer -- Pre-trained models lack target cell type

**Trigger:** Using EnFormer for variant effects in a cell type not in its training set (e.g. GTEx tissues are covered; novel primary cell types are not).

**Mechanism:** EnFormer's outputs are per-track predictions; if the target cell type wasn't trained, the agent can use a similar track as proxy but accuracy degrades.

**Fix:** Use a similar tissue track as proxy (HepG2 for liver biology; GM12878 for B-cell-like) OR fine-tune EnFormer on custom data (expensive). Document the proxy.

### tangermeme -- Marginal vs in silico mutagenesis confusion

**Trigger:** Asking for a "variant effect score" without specifying the formula.

**Mechanism:** Marginal effects = ref vs alt at the SNP only. ISM = saturation across all positions in the window (every base mutated). Different magnitudes; different questions.

**Fix:** Define which calculation. For GWAS variant prediction, use marginal at the SNP (matches phenotype-genotype coupling). For motif discovery, use ISM.

## Decision Tree by Goal

| Goal | Recommended approach |
|------|---------------------|
| Score 100 GWAS SNPs for chromatin effects | Pre-trained chromBPNet model on closest cell type; tangermeme for fast scoring |
| Score 1 lead SNP at high resolution | chromBPNet + tangermeme + ISM saturation map |
| Identify TF binding motifs from a new cell type's ATAC | chromBPNet train + DeepLIFT contributions + TF-MoDISco-lite |
| Predict accessibility in a cell type not in training | EnFormer pre-trained (best for ENCODE cell types) or scBasset for sc state interpolation |
| Bias-correct a low-input ATAC library before footprinting | chromBPNet bias model output as `--bias` to TOBIAS or directly use chromBPNet corrected track |
| Cell-type-specific enhancer prediction | chromBPNet trained on each cell type; per-cell-type ISM at candidate loci |
| Replace TOBIAS bias correction | chromBPNet corrected bigWig as input to TOBIAS ScoreBigwig; skip ATACorrect |

## chromBPNet Standard Pipeline

```bash
# 1. Generate train / valid / test chromosome splits (output is a JSON file with chrom assignments)
chrombpnet prep splits \
    -c hg38.chrom.sizes \
    -tcr chr1 chr3 chr6 \
    -vcr chr8 chr20 \
    -op splits/fold_0
# Train chromosomes are auto-inferred (whatever is not in -tcr/-vcr). The `-tecr` flag does NOT exist.

# 2. Train bias model from background regions
chrombpnet bias pipeline \
    -ibam atac.bam \
    -d ATAC \
    -g hg38.fa \
    -c hg38.chrom.sizes \
    -p peaks.narrowPeak \
    -n nonpeaks.bed \
    -fl splits/fold_0.json \
    -b 0.5 \
    -o bias_model/

# 3. Train accessibility model with bias correction
chrombpnet pipeline \
    -ibam atac.bam \
    -d ATAC \
    -g hg38.fa \
    -c hg38.chrom.sizes \
    -p peaks.narrowPeak \
    -n nonpeaks.bed \
    -fl splits/fold_0.json \
    -b bias_model/bias.h5 \
    -o output/

# 4. Variant effect prediction at GWAS SNPs uses the SEPARATE kundajelab/variant-scorer repo
# (the `chrombpnet snp_score` subcommand is commented out in current chrombpnet/parsers.py)
git clone https://github.com/kundajelab/variant-scorer
python variant-scorer/src/variant_scoring.py \
    --model output/chrombpnet_no_bias.h5 \
    --list variants.tsv \
    --genome hg38.fa \
    --chrom_sizes hg38.chrom.sizes \
    --out_prefix variants_predicted
# Output: variants_predicted.variant_scores.tsv with per-SNP log2FC magnitudes
```

`-b 0.5` is the bias scaling factor; chromBPNet docs recommend 0.5-1.0 depending on enrichment. For variant scoring, use the standalone `kundajelab/variant-scorer` companion repo, NOT a chrombpnet subcommand. Verify exact flag names with `python variant_scoring.py --help` because the API evolves.

## DeepLIFT + TF-MoDISco for Motif Discovery

The maintained version is `tfmodisco-lite` (jmschrei/tfmodisco-lite, `pip install modisco-lite`), which exposes a CLI rather than the deprecated v1 `TfModiscoWorkflow` Python API. The original `kundajelab/tfmodisco` package (with `tfmodisco.tfmodisco_workflow.workflow.TfModiscoWorkflow`) is unmaintained and incompatible with `modisco-lite`.

```bash
# Generate one-hot sequence and SHAP / DeepLIFT contribution score arrays from chromBPNet
# (chromBPNet `chrombpnet contribs_bw` writes hypothetical contributions; convert to numpy via shap_to_modisco)

# Run TF-MoDISco-lite via its CLI
modisco motifs \
    -s ohe.npz \
    -a shap.npz \
    -n 2000 \
    -w 500 \
    -o modisco_results.h5

# Generate HTML report with discovered motifs matched to known databases
modisco report \
    -i modisco_results.h5 \
    -o modisco_report/ \
    -m motifs_meme.txt \
    -s modisco_report/
```

`-n 2000` caps seqlets per metacluster; `-w 500` is the sliding-window length. `motifs_meme.txt` (e.g. JASPAR or HOCOMOCO MEME-format) lets `modisco report` annotate clusters against known motifs.

## In Silico Variant Effect Prediction

```python
import numpy as np

# tangermeme's variant-effect API: substitution_effect for SNPs, marginalize for motif insertions
from tangermeme.variant_effect import substitution_effect
from tangermeme.predict import predict

# Load pre-trained chromBPNet model (saved as Keras .h5 or PyTorch state_dict).
# chromBPNet wraps Keras; load with tensorflow.keras.models.load_model and wrap for tangermeme.
# `load_chrombpnet_model` below is pseudocode -- substitute the actual loader for the installed version
# (e.g. tf.keras.models.load_model + tangermeme.io.adapter, or torch.load for PyTorch checkpoints).
model = load_chrombpnet_model('output/chrombpnet_no_bias.h5')

# substitution_effect: per-SNP ref vs alt prediction across a sequence window
# X shape (N, 4, L); substitutions is a sparse-COO tensor of shape (-1, 3) where each row is
# [example_idx, position, new_base_idx] (new_base_idx 0-3 for ACGT)
y_ref, y_alt = substitution_effect(model, X, substitutions)
log2fc = np.log2(y_alt.sum(axis=-1) / y_ref.sum(axis=-1))
```

For motif marginalization (testing a candidate motif's effect by inserting it into background sequences), use `tangermeme.marginalize.marginalize(model, X, motif)`. The `motif` argument is a one-hot tensor of shape `(-1, 4, motif_length)`; convert string motifs via `tangermeme.utils.one_hot_encode`. Verify the exact signatures with `help(tangermeme.marginalize.marginalize)` because tangermeme is actively developed; `marginal_predict` is NOT a real function name.

`log2fc` magnitudes are unitless; |log2fc| > 1 typical for strong-effect SNPs in regulatory regions.

## Reconciliation

| Pattern | Likely cause | Action |
|---------|--------------|--------|
| chromBPNet predicts strong effect; MACS does not call peak | Sequence model captures latent regulatory potential | Trust chromBPNet for variant effect; not for peak calling |
| EnFormer prediction differs from chromBPNet at same locus | Different context windows (196 kb vs 1-2 kb); different cell types | Both can be correct at different scales; report both with their context size |
| TF-MoDISco motifs differ from JASPAR | Different methodology (sequence-based vs ChIP-validated) | TF-MoDISco can find composites and cooperative; check JASPAR for confirmation |
| chromBPNet bias correction differs from TOBIAS ATACorrect | Different bias models (CNN vs k-mer) | chromBPNet is more accurate but slower; TOBIAS still publishable for standard use |

**Operational rule:** For high-confidence variant prediction, agree across two approaches: chromBPNet + EnFormer (or Borzoi). Single-tool calls should be reported as exploratory. For motif discovery, validate TF-MoDISco hits against JASPAR/HOCOMOCO before publication.

## GPU and Compute Considerations

| Task | Hardware | Wall time |
|------|---------|-----------|
| chromBPNet training (per cell type) | 1 A100 GPU, 80 GB RAM | ~24 h |
| chromBPNet inference at 1M variants | 1 A100 | ~4 h |
| EnFormer pre-trained inference | 1 V100+ | ~30 min for 100k variants |
| Borzoi training | 1 A100, ~250 GB RAM | ~7 days |
| scBasset training (10k cells) | 1 V100, 32 GB RAM | ~12 h |
| TF-MoDISco on 1M peaks | CPU 32 cores | ~6 h |

For most labs without sustained GPU access: use pre-trained chromBPNet/EnFormer models for inference; only train custom models when the cell type is not in the public model zoo (encodeproject.org/atac-seq pre-trained chromBPNet).

## Common Errors

| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| chromBPNet `bias.h5` missing | Bias model training failed silently | Re-run `chrombpnet bias pipeline` with verbose; check input BAM size |
| Out of memory during training | Default batch size too large for GPU | `--batch-size 64` or smaller; reduce `--num-filters` |
| Predicted profile is constant | Model collapsed (training too short) | Increase epochs; verify input peaks are non-empty |
| TF-MoDISco produces too many small clusters | `target_seqlet_fdr` too loose | Tighten to 0.01; or increase `flank_size` |
| EnFormer prediction has wrong shape | Pre-trained model expects 196 kb input | Pad input to exactly 196,608 bp |
| Variant effect predictions cluster near zero | SNP outside model's effective window | Predict on window-centered sequences (variant at the center) |
| chromBPNet model not converging | Peaks file contains chrM or blacklist | Pre-filter; chromBPNet does not auto-filter |
| scBasset training crashes on Apple Silicon | TensorFlow Metal incompatible with operations | Use CPU mode or run on Linux GPU |

## References

- Pampari A et al 2024 bioRxiv (chromBPNet; Tn5 bias correction with deep learning)
- Avsec Z et al 2021 Nat Methods 18:1196 (BPNet; foundational sequence-to-profile)
- Avsec Z et al 2021 Nat Methods 18:1224 (EnFormer; long-context Transformer)
- Linder J et al 2025 (Borzoi; multi-tissue sequence-to-RNA+chromatin)
- Yuan H & Kelley DR 2022 Nat Methods 19:1088 (scBasset)
- Shrikumar A et al 2017 ICML (DeepLIFT)
- Schreiber J et al 2024 (tangermeme; fast inference utilities)
- Shrikumar A et al 2018 bioRxiv (TF-MoDISco)
- Kelley DR 2020 Genome Res 30:1133 (Basenji2; precursor)

## Related Skills

- atac-seq/atac-peak-calling - Classical peak calling input
- atac-seq/footprinting - Use chromBPNet bias correction as TOBIAS alternative
- atac-seq/motif-deviation - chromVAR vs scBasset for per-cell motif activity
- atac-seq/single-cell-atac - scBasset integration with sc workflow
- atac-seq/enhancer-gene-linking - Variant effect feeds enhancer scoring
- atac-seq/allele-specific-accessibility - DL-predicted variant effects vs observed allelic imbalance
- causal-genomics/fine-mapping - Downstream use of variant effect scores
- machine-learning/biomarker-discovery - General ML patterns
- gene-regulatory-networks/scenic-regulons - Combine motif discovery with TF networks