bio-bedgraph-handling
$
npx mdskill add GPTomics/bioSkills/bio-bedgraph-handlingProcess and convert bedGraph files for genome visualization
- Generate bedGraph tracks from BAM files for genome browser display
- Uses pyBigWig, bedtools, and samtools for manipulation and conversion
- Applies normalization and signal processing based on sequencing data type
- Exports results as bedGraph or converts to bigWig for efficient visualization
SKILL.md
.github/skills/bio-bedgraph-handlingView on GitHub ↗
---
name: bio-bedgraph-handling
description: Create, manipulate, and convert bedGraph files for genome browser visualization. Covers bedGraph format, conversion to/from bigWig, normalization, and signal processing. Use when handling coverage and signal tracks from ChIP-seq, ATAC-seq, or RNA-seq.
tool_type: mixed
primary_tool: pyBigWig
---
## Version Compatibility
Reference examples tested with: bedtools 2.31+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# bedGraph Handling
**"Work with bedGraph signal tracks"** → Create, manipulate, and convert bedGraph files for displaying coverage or signal intensity on genome browsers.
- CLI: `bedtools genomecov -bg` to generate, `bedGraphToBigWig` to convert
- Python: `pyBigWig`, `pybedtools`
bedGraph is a text format for displaying continuous-valued data on genome browsers. Common for coverage, signal intensity, and scores.
## bedGraph Format
```
track type=bedGraph name="Sample" description="Coverage"
chr1 0 100 1.5
chr1 100 200 2.3
chr1 200 300 0.8
```
Four columns: chrom, start, end, value (0-based, half-open)
## Create bedGraph from BAM
### Using bedtools genomecov
```bash
bedtools genomecov -ibam sample.bam -bg > sample.bedgraph
bedtools genomecov -ibam sample.bam -bg -split > sample.bedgraph
bedtools genomecov -ibam sample.bam -bg -scale 1.5 > sample.scaled.bedgraph
```
### Strand-Specific
```bash
bedtools genomecov -ibam sample.bam -bg -strand + > sample.plus.bedgraph
bedtools genomecov -ibam sample.bam -bg -strand - > sample.minus.bedgraph
```
### 5' End Coverage (ChIP-seq)
```bash
bedtools genomecov -ibam sample.bam -bg -5 > sample.5prime.bedgraph
```
### Normalize by Library Size (CPM)
```bash
total_reads=$(samtools view -c -F 260 sample.bam)
scale=$(echo "scale=10; 1000000 / $total_reads" | bc)
bedtools genomecov -ibam sample.bam -bg -scale $scale > sample.cpm.bedgraph
```
## Sort bedGraph
bedGraph must be sorted for conversion to bigWig.
```bash
sort -k1,1 -k2,2n sample.bedgraph > sample.sorted.bedgraph
LC_ALL=C sort -k1,1 -k2,2n sample.bedgraph > sample.sorted.bedgraph
```
## Convert bedGraph to bigWig
### Using UCSC bedGraphToBigWig
```bash
bedGraphToBigWig sample.sorted.bedgraph chrom.sizes sample.bw
fetchChromSizes hg38 > hg38.chrom.sizes
bedGraphToBigWig sample.sorted.bedgraph hg38.chrom.sizes sample.bw
```
### Generate chrom.sizes
```bash
samtools faidx reference.fa
cut -f1,2 reference.fa.fai > chrom.sizes
fetchChromSizes hg38 > hg38.chrom.sizes
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
"select chrom, size from hg38.chromInfo" > hg38.chrom.sizes
```
### Clip to Chromosome Boundaries
```bash
bedClip sample.bedgraph chrom.sizes sample.clipped.bedgraph
bedGraphToBigWig sample.clipped.bedgraph chrom.sizes sample.bw
```
## Convert bigWig to bedGraph
```bash
bigWigToBedGraph sample.bw sample.bedgraph
bigWigToBedGraph sample.bw sample.chr1.bedgraph -chrom=chr1
bigWigToBedGraph sample.bw sample.region.bedgraph -chrom=chr1 -start=1000 -end=2000
```
## Merge bedGraph Files
### Using bedtools unionbedg
```bash
bedtools unionbedg -i sample1.bedgraph sample2.bedgraph sample3.bedgraph \
-header -names sample1 sample2 sample3 > merged.bedgraph
```
### Average Across Samples
```bash
bedtools unionbedg -i sample1.bedgraph sample2.bedgraph sample3.bedgraph | \
awk '{sum=0; for(i=4;i<=NF;i++) sum+=$i; print $1,$2,$3,sum/(NF-3)}' OFS='\t' \
> average.bedgraph
```
## Mathematical Operations
### bedtools map for Region Statistics
```bash
bedtools map -a regions.bed -b sample.bedgraph -c 4 -o mean > region_means.bed
bedtools map -a regions.bed -b sample.bedgraph -c 4 -o sum > region_sums.bed
bedtools map -a regions.bed -b sample.bedgraph -c 4 -o max > region_max.bed
```
### Subtract Background
```bash
bedtools unionbedg -i treatment.bedgraph input.bedgraph | \
awk '{diff=$4-$5; if(diff<0) diff=0; print $1,$2,$3,diff}' OFS='\t' \
> subtracted.bedgraph
```
### Log Transform
```bash
awk '{print $1,$2,$3,log($4+1)/log(2)}' OFS='\t' sample.bedgraph > sample.log2.bedgraph
```
### Smooth Signal
```bash
bedtools slop -i sample.bedgraph -g chrom.sizes -b 50 | \
bedtools merge -i - -c 4 -o mean > smoothed.bedgraph
```
## Python with pyBigWig
### Write bedGraph
```python
import pyBigWig
bw = pyBigWig.open('output.bedgraph', 'w')
bw.addHeader([('chr1', 248956422), ('chr2', 242193529)])
chroms = ['chr1', 'chr1', 'chr1']
starts = [0, 100, 200]
ends = [100, 200, 300]
values = [1.5, 2.3, 0.8]
bw.addEntries(chroms, starts, ends=ends, values=values)
bw.close()
```
### Read bigWig to bedGraph Format
```python
import pyBigWig
bw = pyBigWig.open('sample.bw')
for chrom, size in bw.chroms().items():
intervals = bw.intervals(chrom)
if intervals:
for start, end, value in intervals:
print(f'{chrom}\t{start}\t{end}\t{value}')
bw.close()
```
### Convert bigWig Region to bedGraph
```python
import pyBigWig
bw = pyBigWig.open('sample.bw')
intervals = bw.intervals('chr1', 1000000, 2000000)
with open('region.bedgraph', 'w') as f:
for start, end, value in intervals:
f.write(f'chr1\t{start}\t{end}\t{value}\n')
bw.close()
```
## deepTools for Normalization
### bamCoverage (BAM to bedGraph/bigWig)
```bash
bamCoverage -b sample.bam -o sample.bw --normalizeUsing RPKM
bamCoverage -b sample.bam -o sample.bw --normalizeUsing CPM
bamCoverage -b sample.bam -o sample.bw --normalizeUsing BPM
bamCoverage -b sample.bam -o sample.bedgraph --outFileFormat bedgraph --normalizeUsing CPM
```
### bamCompare (Treatment vs Control)
```bash
bamCompare -b1 treatment.bam -b2 input.bam -o log2ratio.bw --scaleFactorsMethod readCount
bamCompare -b1 treatment.bam -b2 input.bam -o subtracted.bw --ratio subtract
```
### bigwigCompare
```bash
bigwigCompare -b1 treatment.bw -b2 input.bw -o ratio.bw --ratio log2
bigwigCompare -b1 sample1.bw -b2 sample2.bw -o diff.bw --ratio subtract
```
## Filtering and Subsetting
### Filter by Value
```bash
awk '$4 >= 1.0' sample.bedgraph > high_signal.bedgraph
awk '$4 > 0' sample.bedgraph > nonzero.bedgraph
```
### Extract Regions
```bash
bedtools intersect -a sample.bedgraph -b regions.bed > subset.bedgraph
```
### Remove Specific Chromosomes
```bash
grep -v "^chrM" sample.bedgraph | grep -v "_random" > filtered.bedgraph
awk '$1 ~ /^chr[0-9XY]+$/' sample.bedgraph > standard_chroms.bedgraph
```
## Aggregate to Bins
### Fixed-Size Bins
```bash
bedtools makewindows -g chrom.sizes -w 1000 > bins.bed
bedtools map -a bins.bed -b sample.bedgraph -c 4 -o mean > binned.bedgraph
```
### Gene Bodies
```bash
bedtools map -a genes.bed -b sample.bedgraph -c 4 -o mean > gene_signal.bed
```
## Quality Control
### Check for Overlapping Intervals
```bash
bedtools merge -i sample.bedgraph -c 4 -o collapse | \
awk 'index($4,",") > 0' | head
```
### Verify Sorted Order
```bash
sort -c -k1,1 -k2,2n sample.bedgraph && echo "Sorted" || echo "Not sorted"
```
### Check Value Range
```bash
awk 'NR==1 {min=$4; max=$4} {if($4<min) min=$4; if($4>max) max=$4}
END {print "Min:", min, "Max:", max}' sample.bedgraph
```
## Complete Pipeline
**Goal:** Generate a CPM-normalized bigWig track from a BAM file in a single automated workflow.
**Approach:** Count total mapped reads with samtools, compute a CPM scale factor, generate a scaled bedGraph with bedtools genomecov, sort and clip to chromosome boundaries, then convert to bigWig format.
```bash
#!/bin/bash
BAM=$1
NAME=$(basename $BAM .bam)
CHROM_SIZES=$2
total_reads=$(samtools view -c -F 260 $BAM)
scale=$(echo "scale=10; 1000000 / $total_reads" | bc)
bedtools genomecov -ibam $BAM -bg -scale $scale > ${NAME}.bedgraph
sort -k1,1 -k2,2n ${NAME}.bedgraph > ${NAME}.sorted.bedgraph
bedClip ${NAME}.sorted.bedgraph $CHROM_SIZES ${NAME}.clipped.bedgraph
bedGraphToBigWig ${NAME}.clipped.bedgraph $CHROM_SIZES ${NAME}.bw
rm ${NAME}.bedgraph ${NAME}.sorted.bedgraph ${NAME}.clipped.bedgraph
echo "Created ${NAME}.bw (CPM normalized)"
```
## Track Header for UCSC
```bash
echo 'track type=bedGraph name="Sample" description="CPM normalized" visibility=full color=0,0,255 altColor=255,0,0 autoScale=on graphType=bar' > track.bedgraph
cat sample.bedgraph >> track.bedgraph
```
## Related Skills
- coverage-analysis - Generate coverage from alignments
- bigwig-tracks - Work with bigWig format
- chip-seq/chipseq-visualization - Visualize signal tracks
- alignment-files/sam-bam-basics - BAM file processing
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.