bio-atac-seq-co-accessibility
$
npx mdskill add GPTomics/bioSkills/bio-atac-seq-co-accessibilityInfer cis-regulatory links from scATAC-seq data
- Connect enhancers to promoters using chromatin-only signals
- Depends on Cicero, ArchR getCoAccessibility, or SCENIC+ tools
- Analyzes joint peak accessibility patterns across cells
- Outputs thresholded peak-pair graphs with co-accessibility scores
SKILL.md
.github/skills/bio-atac-seq-co-accessibilityView on GitHub ↗
---
name: bio-atac-seq-co-accessibility
description: Infer cis-regulatory connections (peak-to-peak co-accessibility) from scATAC-seq using Cicero, ArchR getCoAccessibility, or SCENIC+. Use when linking enhancer accessibility to promoter accessibility, identifying enhancer-gene pairs from chromatin alone (without paired RNA), running gene-regulatory inference combining ATAC + RNA, or comparing predicted regulatory contacts against Hi-C/Micro-C ground truth.
tool_type: r
primary_tool: cicero
---
## Version Compatibility
Reference examples tested with: Cicero 1.20+, monocle3 1.3+, ArchR 1.0.2+, SCENIC+ 1.0+, pycisTopic 1.0+, Signac 1.13+, GenomicRanges 1.54+, GenomicInteractions 1.36+, BSgenome.Hsapiens.UCSC.hg38 1.4+.
Verify before use:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws unexpected errors, introspect the installed package and adapt rather than retrying.
# Co-accessibility (cis-Regulatory Linkage)
**"Which enhancers connect to which promoters in my scATAC data?"** -> Use cell-to-cell variability in joint accessibility of nearby peaks to infer cis-regulatory connections without explicit RNA expression. Output is a peak-pair graph with co-accessibility scores; thresholding produces enhancer-gene candidate pairs.
- R: `cicero::run_cicero(input_cds, genomic_coords)` -> peak-pair connection scores
- R: `ArchR::addCoAccessibility(proj)` -> ArchR-internal Cicero wrapper
- Python: `pycisTopic` + `SCENIC+` for network-level inference combining ATAC + RNA + motifs
Co-accessibility is NOT 3D contact; it's a statistical association based on cell-to-cell co-variation. Strong co-accessibility correlates with Hi-C/Micro-C contacts (~30-50% concordance) but is not equivalent.
## What Co-accessibility Captures vs What It Doesn't
| Captures | Misses |
|----------|--------|
| Peak pairs that vary together across cell states | 3D physical contacts that don't vary in accessibility |
| Cis-regulatory grammar within a cell type | Trans-chromosomal interactions |
| Active enhancer-promoter pairs | Constitutive structural contacts |
| Lineage-specific regulation | Developmental contacts that opened before scATAC sample |
| Distance-decay biology of enhancer-promoter | Hub enhancers that contact many distal targets |
For physical contact, use Hi-C, Micro-C, or PCHi-C. Co-accessibility is the chromatin-only proxy.
## Algorithmic Taxonomy
| Tool | Method | Input | Output | Strength | Fails when |
|------|--------|-------|--------|----------|------------|
| Cicero (Pliner 2018) | Graphical lasso on aggregated cell metacells | scATAC peak-cell matrix + cell trajectory | Peak-pair connection score (0-1) | Original, well-validated; integrates with Monocle3 | Slow on >50K cells; sensitive to alpha tuning |
| ArchR getCoAccessibility | Cicero-based; uses ArchR's metacell aggregation | ArchR project | Same as Cicero | Built-in to ArchR pipeline; faster on large datasets | Tied to ArchR; same biology as Cicero |
| SCENIC+ (Bravo 2023) | Multi-step: co-accessibility + motif scoring + RNA correlation | Multiome (ATAC + RNA) or paired | TF-driven enhancer-gene networks | Most comprehensive; multi-modal | Multiome data required; computationally heavy |
| LinkPeaks (Signac) | Pearson correlation of accessibility with paired gene expression | Multiome | Peak-gene linkage score | Direct enhancer-gene from RNA correlation | Multiome-only; not pure ATAC |
| GeneHancer / FANTOM5 / EpiMap | Bulk-derived enhancer-gene reference | None (database lookup) | Pre-computed enhancer-gene pairs | Comprehensive; published references | Cell-type-agnostic; may not match the biology of interest |
Methodology evolves; verify against Pliner 2018 (Cicero), Bravo 2023 (SCENIC+), Nasser 2021 (ABC model alternative for enhancer-gene), and current Hi-C concordance benchmarks.
## How Cicero Works (Conceptually)
Cell-to-cell variability is too sparse for direct correlation. Cicero solves this via metacells:
1. Reduce dimensionality (UMAP from input).
2. Build k-NN graph of cells.
3. Aggregate k cells into metacells (default k = 50).
4. Compute correlation in accessibility across metacells, restricted to peak pairs within `genomic_distance_max` (default 500 kb cis).
5. Apply graphical lasso with regularization `alpha` to sparsify the correlation matrix.
6. Output: per-pair connection score; positive = co-variation, negative = anti-co-variation.
Connection thresholds typically 0.05-0.5; > 0.25 is high-confidence.
## Per-Tool Failure Modes
### Cicero -- alpha tuning shifts results
**Trigger:** Default alpha (sometimes computed automatically from data); custom alpha < 0.5 or > 5.
**Mechanism:** Alpha controls graphical lasso regularization. Too low: dense graph with many spurious connections; too high: sparse with biology missing.
**Symptom:** Connection count varies 10-100x across alpha sweeps.
**Fix:** Use Cicero's `estimate_distance_parameter()` to get data-driven alpha; verify connection count is biologically plausible (~10-50% of peaks have at least one strong connection).
### Cicero -- metacell aggregation hides cell-type-specific connections
**Trigger:** Running Cicero on heterogeneous dataset spanning multiple cell types.
**Mechanism:** Metacells aggregate across cell types; connections that exist only in one cell type get diluted.
**Fix:** Run Cicero per-cluster separately; combine results with cluster annotations. Cell-type-specific connections often differ.
### Cicero -- distance assumption
**Trigger:** Default `genomic_distance_max=500000` (500 kb cis only).
**Mechanism:** Distal connections beyond 500 kb cis are excluded; trans-chromosomal entirely missed.
**Fix:** For specific use cases (e.g., gene desertless TADs), increase `genomic_distance_max` to 1 Mb or more. Trans connections require Hi-C, not co-accessibility.
### SCENIC+ -- RNA scaling
**Trigger:** RNA-side dropouts in Multiome data.
**Mechanism:** SCENIC+ requires reasonable RNA quantification per cell. Sparse Multiome RNA with many zero genes causes correlation degradation.
**Fix:** Filter cells with insufficient RNA; aggregate cells if necessary. Multiome RNA should look comparable to standalone scRNA-seq.
### LinkPeaks (Signac) -- Distance default
**Trigger:** Default `LinkPeaks(..., distance=5e+05)`.
**Mechanism:** Same as Cicero; 500 kb cis only by default.
**Fix:** Same; widen if needed but trans not supported.
## Decision Tree by Goal
| Goal | Tool |
|------|------|
| ATAC-only enhancer-promoter inference | Cicero |
| ATAC-only inside ArchR ecosystem | ArchR getCoAccessibility |
| Multiome (RNA + ATAC) enhancer-gene inference | LinkPeaks (Signac) for direct correlation; SCENIC+ for TF network |
| TF-driven regulatory networks | SCENIC+ (requires Multiome) |
| Comparison against Hi-C / Micro-C | Cicero output -> overlap with HiCCUPS loops |
| Published reference enhancer-gene pairs | GeneHancer, FANTOM5, EpiMap (pre-computed lookup) |
| Gene desertless distal regulation | Cicero with widened distance; or H3K27ac HiChIP |
## Cicero Standard Workflow
```r
library(cicero); library(monocle3); library(GenomicRanges)
# Input: peak-cell binary matrix from Signac/ArchR (rows = peaks, cols = cells)
# Convert peaks to "chrN_start_end" format
peak_names <- paste0(seqnames(peaks), '_', start(peaks), '_', end(peaks))
input_cds <- new_cell_data_set(peak_matrix, cell_metadata=metadata,
gene_metadata=peak_metadata)
# Reduce dimensionality (UMAP from input)
input_cds <- detect_genes(input_cds)
input_cds <- estimate_size_factors(input_cds)
input_cds <- preprocess_cds(input_cds, method='LSI')
input_cds <- reduce_dimension(input_cds, reduction_method='UMAP',
preprocess_method='LSI')
# Build metacell-aggregated CDS
umap_coords <- reducedDims(input_cds)$UMAP
cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates=umap_coords, k=50)
# Run Cicero with hg38 chrom sizes
genome_df <- data.frame(chr=seqnames(seqinfo(BSgenome.Hsapiens.UCSC.hg38)),
length=seqlengths(seqinfo(BSgenome.Hsapiens.UCSC.hg38)))
conns <- run_cicero(cicero_cds, genomic_coords=genome_df,
window=500000, sample_num=100)
# Filter to high-confidence connections
strong <- conns[conns$coaccess > 0.25, ]
cat(sprintf('Total conns: %d; strong (>0.25): %d\n', nrow(conns), nrow(strong)))
```
## ArchR getCoAccessibility
```r
library(ArchR)
proj <- loadArchRProject('ArchR_out')
proj <- addCoAccessibility(proj, reducedDims='IterativeLSI',
k=100, knnIteration=500,
maxDist=250000) # 250 kb cis (tighter than default)
co_acc <- getCoAccessibility(proj, corCutOff=0.5, # Default 0.5; lower for more
returnLoops=FALSE)
```
ArchR returns connections as a GRanges object compatible with `GenomicInteractions` for direct overlap with Hi-C loops.
## Visualizing Connections
```r
# As arc plot at a locus of interest
library(Gviz); library(GenomicInteractions)
gi <- GenomicInteractions(anchorOne=GRanges(strong[, 1:3]),
anchorTwo=GRanges(strong[, 4:6]),
counts=as.integer(strong$coaccess * 100))
plotInteractions(gi, view='arc', interactions.color='red')
```
For genome-browser visualization with ArchR: `plotPeak2GeneHeatmap()` shows the peak-gene linkage matrix; `plotBrowserTrack()` overlays connections on tracks.
## SCENIC+ TF-Driven Networks
```python
import scenicplus as sp
# Multi-step: requires preprocessed pycisTopic models + paired RNA AnnData
scplus_obj = sp.create_SCENICPLUS_object(
GEX_anndata=rna_adata,
cisTopic_obj=topic_obj,
menr=motif_enr_dict)
# Calculate eRegulons (TF + target genes + linked enhancers)
sp.run_scenicplus(scplus_obj, ...)
scplus_obj.uns['eRegulon_metadata']
```
SCENIC+ is significantly more complex than Cicero; budget 1-2 days for setup. The benefit is that outputs are TF -> enhancer -> gene triples, not just peak-peak co-accessibility.
## Cicero Alpha Mathematics
**Trigger:** Tuning Cicero's regularization parameter for the graphical lasso step.
**Mechanism:** `estimate_distance_parameter()` fits a regression of pairwise correlation versus genomic distance for nearby peaks; the slope gives the expected co-accessibility decay with distance. Cicero's alpha penalty is set proportional to this slope so the graph sparsifies at biologically appropriate distance scales.
**Implementation:** Cicero internally calls `estimate_distance_parameter(cicero_cds, window=window, maxit=100, sample_num=100)` -- the function bootstraps 100 windows, fits per-window distance-correlation regression, and averages the alpha estimates. Default value is robust at typical peak densities; manual override is rarely needed.
**When manual tuning helps:** Very dense peaksets (>200k peaks) may need higher alpha to control false positives; very sparse (<10k peaks) may need lower alpha to recover signal. Verify by running on technical replicates -- the expected outcome is ~0 strong connections.
## ABC Model Cross-Reference
For enhancer-to-gene linking with paired Hi-C/Micro-C, the canonical method is the ABC model (Fulco 2019, Nasser 2021), not Cicero. ABC computes ABC = (Activity_E * Contact_E,G) / sum_e(Activity_e * Contact_e,G); standardizes on combined ATAC + H3K27ac activity and Hi-C contact frequencies. ENCODE-rE2G (2024) is the modern logistic-regression replacement.
See atac-seq/enhancer-gene-linking for full ABC and ENCODE-rE2G coverage. Cicero is the ATAC-only fallback when no Hi-C is available.
## HiChIP H3K27ac as Orthogonal Anchor
| Decision | Action |
|----------|--------|
| Have Hi-C / Micro-C | Use ABC (atac-seq/enhancer-gene-linking) primary; Cicero as ATAC-only sanity check |
| Have HiChIP H3K27ac | FitHiChIP loops (FDR < 0.05, count >= 5) primary; ABC + HiChIP intersection is high-confidence |
| Have ATAC + H3K27ac, no 3D | ABC with average HiC fallback (Fulco 2019); document degraded performance |
| Have only ATAC | Cicero (this skill); known concordance with Hi-C ~30-50% |
Cicero is appropriate when no 3D data exists; do not use Cicero in lieu of ABC when Hi-C/Micro-C are available.
## Hi-C / Micro-C Concordance
| Hi-C concordance | Action |
|-----------------|--------|
| > 50% of strong Cicero connections overlap Hi-C loops | High-confidence; Cicero captures real 3D structure |
| 30-50% | Standard; some 3D contacts don't vary in accessibility |
| < 20% | Co-accessibility may not reflect contacts; lineage-specific contacts may be missing |
```r
# Compare Cicero against published Hi-C loops
library(GenomicInteractions)
hic_loops <- import('hiccups_loops.bedpe', format='bedpe')
ci <- GenomicInteractions(anchorOne=GRanges(strong$Peak1),
anchorTwo=GRanges(strong$Peak2))
overlap <- countOverlaps(ci, hic_loops, type='equal') > 0
cat(sprintf('Cicero connections overlapping HiCCUPS loops: %.1f%%\n',
100 * mean(overlap)))
```
## Reconciliation
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| Cicero many weak connections; ArchR few strong | Different alpha or aggregation | Standardize parameters |
| LinkPeaks (Multiome) finds connections Cicero misses | LinkPeaks uses RNA expression as the anchor; Cicero is ATAC-only | Both valid; report intersection as high-confidence |
| Co-accessibility doesn't match Hi-C in heterochromatin | Heterochromatic contacts are constitutive; co-accessibility needs variation | Expected; co-accessibility complements Hi-C |
| SCENIC+ network has ENCODE-validated TFs but missing some | Motif database limited or RNA imputation missed | Expand motif database; integrate paired ChIP-seq if available |
**Operational rule:** Co-accessibility is a hypothesis generator. Validate with Hi-C, ChIP-seq, or experimental enhancer-promoter interaction (CRISPRi-FlowFISH).
## Common Errors
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Cicero `make_cicero_cds` slow / crashes | k too high or cell count too large | Reduce k or subsample cells |
| All connections near zero | alpha set too high | Use `estimate_distance_parameter()` |
| Connection score > 1 reported | Bug in older Cicero versions | Update; check `as.numeric(coaccess)` for outliers |
| ArchR getCoAccessibility "TileMatrix" error | Need PeakMatrix not TileMatrix | `addPeakMatrix()` first |
| SCENIC+ install fails | Many heavy dependencies | Use the published Docker image |
| Connection count varies wildly per run | Stochastic metacell aggregation | Set seed; or aggregate at higher k for stability |
| LinkPeaks all NaN | RNA expression has too many zeros | Re-filter cells with sufficient RNA |
| Peak names not matching | format mismatch (chr_start_end vs chr:start-end) | Standardize naming convention |
## References
- Pliner HA et al 2018 Mol Cell 71:858 (Cicero)
- Granja JM et al 2021 Nat Genet 53:403 (ArchR getCoAccessibility)
- Bravo Gonzalez-Blas C et al 2023 Nat Methods 20:1355 (SCENIC+)
- Stuart T et al 2021 Nat Methods 18:1333 (Signac LinkPeaks)
- Nasser J et al 2021 Nature 593:238 (ABC model; alternative enhancer-gene)
- Fulco CP et al 2019 Nat Genet 51:1664 (CRISPRi-FlowFISH; gold-standard validation)
- Mumbach MR et al 2017 Nat Genet 49:1602 (HiChIP H3K27ac for enhancer-promoter)
- Boix CA et al 2021 Nature 590:300 (EpiMap; bulk enhancer-gene reference)
## Related Skills
- atac-seq/single-cell-atac - scATAC preprocessing (input)
- atac-seq/consensus-peakset - Peak set used for connection inference
- atac-seq/motif-deviation - chromVAR for TF activity (complement)
- atac-seq/enhancer-gene-linking - ABC, ENCODE-rE2G, CRISPRi-FlowFISH validation when Hi-C is available
- atac-seq/deep-learning-atac - chromBPNet variant effect at predicted enhancers
- gene-regulatory-networks/scenic-regulons - Standalone SCENIC for TF networks
- hi-c-analysis/loop-calling - Physical contacts from Hi-C
- hi-c-analysis/contact-pairs - Hi-C / Micro-C contact pairs
- single-cell/multimodal-integration - Multiome integration
- chip-seq/peak-annotation - Cross-validate with TF ChIP
- pathway-analysis/gsea - Downstream gene-level enrichment
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.