bio-workflows-clip-pipeline
$ npx mdskill add GPTomics/bioSkills/bio-workflows-clip-pipeline
Analyze protein-RNA interactions from FASTQ to binding sites and motifs.
- Executes end-to-end CLIP-seq analysis for protein-RNA binding studies.
- Depends on CLIPper, STAR, bedtools, cutadapt, and samtools; the examples also call FastQC, UMI-tools, and HOMER.
- Adapts to specific CLIP methods like HITS-CLIP, PAR-CLIP, iCLIP, or eCLIP.
- Delivers binding sites and motif enrichment results via annotated pipelines.
---
name: bio-workflows-clip-pipeline
description: End-to-end CLIP-seq analysis from FASTQ to binding sites and motif enrichment. Use when analyzing protein-RNA interactions from CLIP-based methods.
tool_type: mixed
primary_tool: CLIPper
---
## Version Compatibility
Reference examples tested with: FastQC 0.12+, STAR 2.7.11+, bedtools 2.31+, cutadapt 4.4+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# CLIP-seq Pipeline
**"Analyze my CLIP-seq data from FASTQ to binding sites and motifs"** → Orchestrate UMI extraction, adapter trimming, STAR alignment, PCR deduplication, CLIPper/PureCLIP peak calling, binding site annotation, and HOMER motif enrichment.
## Pipeline Overview
```
FASTQ → QC → UMI extract → Trim adapters → Align → Filter → Dedup → Peak call → Annotate → Motifs
```
## CLIP Method Variants
| Method | UMI | Crosslink Site | Adapter |
|--------|-----|----------------|---------|
| HITS-CLIP | Optional | Deletions | 3' adapter |
| PAR-CLIP | Optional | T→C mutations | 3' adapter |
| iCLIP | Required | 5' of read | 3' adapter |
| eCLIP | Required | 5' of read | 3' adapter |
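
If later steps need to branch on the CLIP method, the table above can be encoded as a small lookup. This is a minimal sketch: `clip_method_info` is a hypothetical helper name, and its values are copied from the table rather than from any tool.

```bash
# Hypothetical helper: print "UMI<TAB>crosslink signal" for a CLIP method,
# using the values from the table above.
clip_method_info() {
    case "$1" in
        HITS-CLIP)   printf 'optional\tdeletions\n' ;;
        PAR-CLIP)    printf 'optional\tT>C mutations\n' ;;
        iCLIP|eCLIP) printf 'required\tread 5-prime end\n' ;;
        *) echo "Unknown CLIP method: $1" >&2; return 1 ;;
    esac
}
```

For example, `clip_method_info eCLIP | cut -f1` reports whether UMI extraction has to run before alignment.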
## Step 1: Quality Control
```bash
# Initial QC
fastqc reads.fastq.gz -o qc_pre/
# Check for adapter contamination and UMI structure
# For eCLIP: expect 10nt UMI at read start
# Print the first 15 nt of the first 25 reads (sequence lines only)
zcat reads.fastq.gz | awk 'NR % 4 == 2' | head -n 25 | cut -c1-15
```
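
Eyeballing read prefixes only goes so far. A per-position base tally makes the UMI structure easier to judge, since random UMI positions should show a roughly even base mix. The sketch below assumes plain four-line FASTQ records on stdin; `umi_base_composition` is a hypothetical helper name.

```bash
# Per-position base composition of the first N bases (default 10) of each read.
# Reads FASTQ on stdin; prints tab-separated: pos, A, C, G, T, N counts.
umi_base_composition() {
    awk -v n="${1:-10}" '
        NR % 4 == 2 {                       # sequence lines only
            for (i = 1; i <= n && i <= length($0); i++)
                count[i, substr($0, i, 1)]++
        }
        END {
            print "pos\tA\tC\tG\tT\tN"
            for (i = 1; i <= n; i++)
                printf "%d\t%d\t%d\t%d\t%d\t%d\n", i,
                    count[i, "A"], count[i, "C"], count[i, "G"],
                    count[i, "T"], count[i, "N"]
        }'
}
# Usage: zcat reads.fastq.gz | umi_base_composition 10
```

A position dominated by one base within the expected UMI window may indicate a fixed barcode or linker rather than a random UMI.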
## Step 2: UMI Extraction
```bash
# eCLIP (10nt UMI at 5' end)
umi_tools extract \
--stdin=reads.fastq.gz \
--bc-pattern=NNNNNNNNNN \
--stdout=extracted.fastq.gz \
--log=umi_extract.log
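# iCLIP (5nt experimental barcode + 5nt UMI)
# In --bc-pattern, N marks a UMI base (moved to the read name) and X marks a base
# that stays on the read. As written, the first 5 nt are taken as the UMI and the
# next 5 nt are kept; reorder N and X to match your library layout.
umi_tools extract \
    --stdin=reads.fastq.gz \
    --bc-pattern=NNNNNXXXXX \
    --stdout=extracted.fastq.gz \
    --log=umi_extract_iclip.log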
```
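
To make the effect of the eCLIP pattern concrete, the following sketch mimics it on single-end FASTQ: the first N bases are cut from the sequence and quality lines and appended to the read ID. It is an illustration only, not a replacement for `umi_tools extract`; `demo_umi_extract` is a hypothetical name.

```bash
# Illustration only: mimic `umi_tools extract --bc-pattern=NNNNNNNNNN`
# for single-end reads. Reads FASTQ on stdin, writes FASTQ on stdout.
demo_umi_extract() {
    awk -v n="${1:-10}" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { umi = substr($0, 1, n); seq = substr($0, n + 1) }
        NR % 4 == 0 {
            # Append the UMI to the read ID (the first space-delimited field).
            split(header, parts, " ")
            rest = substr(header, length(parts[1]) + 1)
            print parts[1] "_" umi rest
            print seq
            print "+"
            print substr($0, n + 1)         # quality string, UMI bases removed
        }'
}
# Usage: zcat reads.fastq.gz | head -n 8 | demo_umi_extract 10
```

Keeping the UMI in the read name is what lets the deduplication step see it after alignment.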
## Step 3: Adapter Trimming
```bash
# Trim 3' adapter (common eCLIP adapter)
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
--minimum-length 20 \
--quality-cutoff 20 \
-o trimmed.fastq.gz \
extracted.fastq.gz
# For paired-end reads: -a trims the adapter from R1, -A from R2
cutadapt -a AGATCGGAAGAGCACACGTCT \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
--minimum-length 20 \
-o trimmed_R1.fq.gz -p trimmed_R2.fq.gz \
extracted_R1.fq.gz extracted_R2.fq.gz
```
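
One way to check trimming against the quality checkpoints is to compare read counts before and after `cutadapt`. The sketch below assumes gzipped, four-line FASTQ; `count_fastq_reads` and `trim_retention` are hypothetical helper names.

```bash
# Count FASTQ records in a gzipped file (4 lines per record).
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Percentage of reads that survived trimming, to one decimal place.
trim_retention() {
    local before after
    before=$(count_fastq_reads "$1")
    after=$(count_fastq_reads "$2")
    awk -v b="$before" -v a="$after" \
        'BEGIN { if (b == 0) { print "NA"; exit } printf "%.1f\n", 100 * a / b }'
}
# Usage: trim_retention extracted.fastq.gz trimmed.fastq.gz
```

Because `--minimum-length 20` discards shorter reads, the retained percentage corresponds to the "Reads >20bp" checkpoint.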
## Step 4: Alignment
```bash
# Build STAR index (once)
STAR --runMode genomeGenerate \
--genomeDir star_index \
--genomeFastaFiles genome.fa \
--sjdbGTFfile genes.gtf \
--sjdbOverhang 100
# Align with STAR (optimized for short CLIP reads)
STAR --genomeDir star_index \
--readFilesIn trimmed.fastq.gz \
--readFilesCommand zcat \
--outFilterMismatchNmax 2 \
--outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes All \
--alignEndsType EndToEnd \
--outFileNamePrefix clip_
```
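
STAR writes summary statistics to `<prefix>Log.final.out` (here `clip_Log.final.out`), so the mapping rate can be read from that file. A minimal sketch, assuming the usual `label | value` layout of that log; `star_metric` is a hypothetical helper name.

```bash
# Extract a value from STAR's Log.final.out by its label.
# Percent signs and surrounding whitespace are stripped from the value.
star_metric() {
    awk -F'|' -v label="$2" '
        {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            if (key == label) {
                val = $2
                gsub(/[ \t%]/, "", val)
                print val
                exit
            }
        }' "$1"
}
# Usage: star_metric clip_Log.final.out "Uniquely mapped reads %"
```

With `--outFilterMultimapNmax 1`, the uniquely mapped percentage is the number to compare against the mapping-rate checkpoint.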
## Step 5: Alignment Filtering
```bash
# Remove unmapped and low-quality reads
samtools view -b -F 4 -q 10 clip_Aligned.sortedByCoord.out.bam > filtered.bam
samtools index filtered.bam
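# Optional: remove reads overlapping rRNA/tRNA annotations.
# If you use this, pass filtered_norRNA.bam to the deduplication step instead of filtered.bam.
bedtools intersect -v -a filtered.bam -b rrna_trna.bed > filtered_norRNA.bam
samtools index filtered_norRNA.bam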
```
## Step 6: PCR Deduplication
```bash
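# UMI-aware deduplication (--log writes the run log to a file)
umi_tools dedup \
    -I filtered.bam \
    -S dedup.bam \
    --output-stats=dedup_stats \
    --log=dedup.log
samtools index dedup.bam
# Check deduplication: fraction of reads kept (the "Unique rate" checkpoint)
reads_before=$(samtools view -c filtered.bam)
reads_after=$(samtools view -c dedup.bam)
echo "Unique rate: $(echo "scale=4; $reads_after / $reads_before" | bc)"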
```
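
The idea behind UMI-aware deduplication is that reads sharing a mapping position, strand, and UMI are treated as PCR copies of one molecule. The sketch below shows exact-match collapsing only. By default `umi_tools dedup` also merges UMIs that differ by sequencing errors (its directional method), which this does not attempt. The four-column input layout and the name `demo_umi_collapse` are assumptions for illustration.

```bash
# Illustration only: exact-match UMI collapsing.
# Input (stdin): tab-separated chrom, 5' position, strand, UMI.
# Output: the first record seen for each unique combination.
demo_umi_collapse() {
    awk -F'\t' '!seen[$1, $2, $3, $4]++'
}
```

Reads at the same position with different UMIs are kept, which is why UMIs recover more unique molecules than position-only duplicate removal.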
## Step 7: Peak Calling
```bash
# CLIPper (recommended)
clipper -b dedup.bam -s hg38 -o peaks.bed --FDR 0.05 --superlocal
# Alternative: Piranha
Piranha -s dedup.bam -o piranha_peaks.bed -p 0.01
# For PAR-CLIP with T→C mutations
PARalyzer settings.ini
# Strand-specific calling: split by alignment strand (flag 16 = reverse strand)
samtools view -b -F 16 -o plus.bam dedup.bam
samtools view -b -f 16 -o minus.bam dedup.bam
samtools index plus.bam
samtools index minus.bam
clipper -b plus.bam -s hg38 -o peaks_plus.bed
clipper -b minus.bam -s hg38 -o peaks_minus.bed
cat peaks_plus.bed peaks_minus.bed | sort -k1,1 -k2,2n > peaks_stranded.bed
```
## Step 8: Peak Annotation
```bash
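# Annotate with gene features (-s keeps same-strand overlaps only, since CLIP signal is strand-specific)
bedtools intersect -s -a peaks.bed -b genes.gtf -wo > peaks_annotated.txt
# Or use HOMER
annotatePeaks.pl peaks.bed hg38 > peaks_homer_annotated.txt
# Feature distribution: column 8 is HOMER's Annotation column. Skip the header and
# drop the parenthesized transcript details so that features group together.
awk -F'\t' 'NR > 1 {sub(/ *\(.*/, "", $8); print $8}' peaks_homer_annotated.txt | sort | uniq -c | sort -rn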
```
## Step 9: Motif Analysis
```bash
# Extract peak sequences
bedtools getfasta -fi genome.fa -bed peaks.bed -s -fo peaks.fa
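# HOMER motif finding (RNA mode). With FASTA input, HOMER expects background
# sequences via -fasta; background.fa is a placeholder for a matched set of
# unbound sequences (for example, shuffled or flanking regions).
findMotifs.pl peaks.fa fasta motif_output -fasta background.fa -rna -len 5,6,7,8 -p 8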
# MEME-ChIP
meme-chip -oc meme_output -dna peaks.fa -meme-mod zoops -meme-nmotifs 10
```
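
A plain k-mer tally is a quick sanity check to run alongside the motif finders: if one hexamer dominates the peak sequences, HOMER or MEME should report something similar. This is simple frequency counting with no background model, not the algorithm either tool uses. The sketch assumes each FASTA sequence sits on a single line; `top_kmers` is a hypothetical helper name.

```bash
# Count k-mers in a FASTA file and print the most frequent ones.
# Arguments: FASTA file, k (default 6), number of k-mers to print (default 10).
top_kmers() {
    awk -v k="${2:-6}" '
        /^>/ { next }                        # skip FASTA headers
        {
            s = toupper($0)
            for (i = 1; i + k - 1 <= length(s); i++) {
                kmer = substr(s, i, k)
                if (kmer !~ /N/) count[kmer]++
            }
        }
        END { for (kmer in count) print count[kmer] "\t" kmer }
    ' "$1" | sort -k1,1nr -k2,2 | awk -v n="${3:-10}" 'NR <= n'
}
# Usage: top_kmers peaks.fa 6 20
```

Raw counts favor low-complexity sequence, so treat the result as a rough cross-check rather than as evidence of enrichment.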
## Step 10: Cross-link Site Analysis
```bash
# For iCLIP/eCLIP: identify crosslink sites (read 5' ends)
bedtools genomecov -ibam dedup.bam -bg -5 -strand + > crosslinks_plus.bg
bedtools genomecov -ibam dedup.bam -bg -5 -strand - > crosslinks_minus.bg
# For PAR-CLIP: identify T→C conversion sites
# Requires specialized tools like PARpipe
```
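
When per-read crosslink positions are needed rather than coverage tracks, the read 5' ends can also be taken from BED6 intervals. A minimal sketch, assuming BED6 input such as the output of `bedtools bamtobed -i dedup.bam`; `bed_five_prime_ends` is a hypothetical helper name. It reports the 5' end itself, matching `genomecov -5`; some iCLIP analyses instead use the nucleotide immediately upstream.

```bash
# Convert BED6 read intervals to 1-nt intervals at each read's 5' end.
# Plus strand: the 5' end is the interval start; minus strand: the interval end.
bed_five_prime_ends() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        $6 == "+" { print $1, $2, $2 + 1, $4, $5, $6 }
        $6 == "-" { print $1, $3 - 1, $3, $4, $5, $6 }'
}
# Usage: bedtools bamtobed -i dedup.bam | bed_five_prime_ends > crosslinks.bed
```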
## Quality Checkpoints
| Step | Metric | Expected |
|------|--------|----------|
| Raw | Read count | >10M |
| Trimmed | Reads >20bp | >80% |
| Aligned | Mapping rate | >50% |
| Dedup | Unique rate | >20% |
| Peaks | Peak count | 1,000-50,000 |
| Peaks | Median width | 20-100 nt |
| FRiP | Reads in peaks | >10% |
```bash
# Calculate FRiP
reads_in_peaks=$(bedtools intersect -a dedup.bam -b peaks.bed -u | samtools view -c -)
total_reads=$(samtools view -c dedup.bam)
frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc)
echo "FRiP: $frip"
```
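
The two peak-level checkpoints in the table (peak count and median width) can be computed directly from the BED file. A minimal sketch, assuming standard BED columns with start in column 2 and end in column 3; `peak_summary` is a hypothetical helper name.

```bash
# Print peak count and median peak width (nt) for a BED file.
peak_summary() {
    awk -F'\t' '$0 !~ /^(#|track|browser)/ { print $3 - $2 }' "$1" |
        sort -n |
        awk '{ w[NR] = $1 }
             END {
                 if (NR == 0) { print "peaks=0 median_width=NA"; exit }
                 m = (NR % 2) ? w[(NR + 1) / 2] : (w[NR / 2] + w[NR / 2 + 1]) / 2
                 print "peaks=" NR " median_width=" m
             }'
}
# Usage: peak_summary peaks.bed
```

Compare the output with the expected ranges in the table: 1,000-50,000 peaks and a median width of 20-100 nt.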
## Complete Pipeline Script
```bash
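#!/bin/bash
set -euo pipefail
if [ "$#" -ne 4 ]; then
    echo "Usage: $0 <sample> <reads.fastq.gz> <star_index_dir> <genome.fa>" >&2
    exit 1
fi
SAMPLE=$1
READS=$2
GENOME_DIR=$3
GENOME_FA=$4
mkdir -p qc trimmed aligned peaks motifs
# QC
fastqc "$READS" -o qc/
# UMI extract (eCLIP-style 10 nt UMI at the 5' end)
umi_tools extract --stdin="$READS" --bc-pattern=NNNNNNNNNN \
    --stdout="trimmed/${SAMPLE}_extracted.fq.gz" \
    --log="trimmed/${SAMPLE}_umi_extract.log"
# Trim
cutadapt -a AGATCGGAAGAGCACACGTCT --minimum-length 20 \
    -o "trimmed/${SAMPLE}_trimmed.fq.gz" "trimmed/${SAMPLE}_extracted.fq.gz"
# Align
STAR --genomeDir "$GENOME_DIR" --readFilesIn "trimmed/${SAMPLE}_trimmed.fq.gz" \
    --readFilesCommand zcat --outFilterMismatchNmax 2 --outFilterMultimapNmax 1 \
    --outSAMtype BAM SortedByCoordinate --outFileNamePrefix "aligned/${SAMPLE}_"
# Filter and dedup (STAR output is already coordinate-sorted)
samtools view -b -F 4 -q 10 -o "aligned/${SAMPLE}_filtered.bam" \
    "aligned/${SAMPLE}_Aligned.sortedByCoord.out.bam"
samtools index "aligned/${SAMPLE}_filtered.bam"
umi_tools dedup -I "aligned/${SAMPLE}_filtered.bam" -S "aligned/${SAMPLE}_dedup.bam" \
    --log="aligned/${SAMPLE}_dedup.log"
samtools index "aligned/${SAMPLE}_dedup.bam"
# Peaks
clipper -b "aligned/${SAMPLE}_dedup.bam" -s hg38 -o "peaks/${SAMPLE}_peaks.bed"
# Motifs
bedtools getfasta -fi "$GENOME_FA" -bed "peaks/${SAMPLE}_peaks.bed" -s -fo "peaks/${SAMPLE}.fa"
# Background for HOMER's FASTA mode. Assumption: shuffled regions on the same
# chromosome are a crude background; regions from expressed transcripts are a closer match.
bedtools shuffle -i "peaks/${SAMPLE}_peaks.bed" -g "${GENOME_FA}.fai" -chrom -seed 1 \
    > "peaks/${SAMPLE}_background.bed"
bedtools getfasta -fi "$GENOME_FA" -bed "peaks/${SAMPLE}_background.bed" -s \
    -fo "peaks/${SAMPLE}_background.fa"
findMotifs.pl "peaks/${SAMPLE}.fa" fasta "motifs/${SAMPLE}" \
    -fasta "peaks/${SAMPLE}_background.fa" -rna -len 5,6,7 -p 4
echo "Pipeline complete for $SAMPLE"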
```
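
To run the script over several samples, a small wrapper can read a sample sheet and either print or execute one command per row. This is a sketch under assumptions: `clip_pipeline.sh` is a hypothetical file name for the script above, and the two-column tab-separated sample sheet is an invented convention.

```bash
# Print (default) or execute one pipeline command per sample-sheet row.
# Sample sheet: tab-separated "sample<TAB>reads.fastq.gz"; blank lines and
# lines starting with # are skipped. Pass "run" as the 4th argument to execute.
run_samples() {
    local sheet=$1 genome_dir=$2 genome_fa=$3 mode=${4:-dry-run}
    local sample reads
    while IFS=$'\t' read -r sample reads; do
        case "$sample" in ''|'#'*) continue ;; esac
        if [ "$mode" = "run" ]; then
            # </dev/null keeps the pipeline from consuming the sample sheet
            bash clip_pipeline.sh "$sample" "$reads" "$genome_dir" "$genome_fa" < /dev/null
        else
            echo "bash clip_pipeline.sh $sample $reads $genome_dir $genome_fa"
        fi
    done < "$sheet"
}
# Usage: run_samples samples.tsv star_index genome.fa        # dry run
#        run_samples samples.tsv star_index genome.fa run    # execute
```

The dry-run default makes it easy to review the commands before committing compute time.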
## Related Skills
- clip-seq/clip-preprocessing - Detailed preprocessing
- clip-seq/clip-alignment - Alignment optimization
- clip-seq/clip-peak-calling - Peak caller comparison
- clip-seq/binding-site-annotation - Feature annotation
- clip-seq/clip-motif-analysis - Motif discovery