bio-clip-seq-clip-preprocessing

$npx mdskill add GPTomics/bioSkills/bio-clip-seq-clip-preprocessing

Trim adapters, extract UMIs, and remove PCR duplicates from CLIP-seq reads.

  • Prepares raw CLIP, iCLIP, and eCLIP reads for peak calling analysis.
  • Depends on umi_tools, cutadapt, and pysam for data processing.
  • Adapts code patterns to match installed package versions and APIs.
  • Outputs processed FASTQ files ready for downstream alignment tasks.
SKILL.md
.github/skills/bio-clip-seq-clip-preprocessingView on GitHub ↗
---
name: bio-clip-seq-clip-preprocessing
description: Preprocess CLIP-seq data including adapter trimming, UMI extraction, and PCR duplicate removal. Use when preparing raw CLIP, iCLIP, or eCLIP reads for peak calling.
tool_type: cli
primary_tool: umi_tools
---

## Version Compatibility

Reference examples tested with: cutadapt 4.4+, pysam 0.22+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# CLIP-seq Preprocessing

**"Preprocess my CLIP-seq reads"** → Extract UMIs, trim adapters, and remove PCR duplicates from raw CLIP/iCLIP/eCLIP reads to prepare for alignment and peak calling.
- CLI: `umi_tools extract` for UMI handling, `cutadapt` for adapter trimming

## UMI Extraction (eCLIP/iCLIP)

**Goal:** Extract UMI barcodes from CLIP-seq reads and append them to read names for downstream deduplication.

**Approach:** Run umi_tools extract with a barcode pattern matching the UMI length and position in the read.

```bash
# Extract UMI from read 1
umi_tools extract \
    --stdin=reads_R1.fastq.gz \
    --read2-in=reads_R2.fastq.gz \
    --bc-pattern=NNNNNNNNNN \
    --stdout=R1_umi.fastq.gz \
    --read2-out=R2_umi.fastq.gz

# bc-pattern: UMI barcode pattern
# N = UMI base
# For eCLIP: typically 10-nt UMI in read 1
```

## Adapter Trimming

```bash
# Trim adapters after UMI extraction
cutadapt \
    -a AGATCGGAAGAGCACACGTCT \
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    -m 18 \
    -o trimmed_R1.fastq.gz \
    -p trimmed_R2.fastq.gz \
    R1_umi.fastq.gz R2_umi.fastq.gz
```

## Two-Pass Trimming (eCLIP)

**Goal:** Remove inline adapters from eCLIP reads that appear at both 3' and 5' ends.

**Approach:** Run two sequential cutadapt passes: first trim the 3' adapter, then trim any remaining 5' adapter from read-through events.

```bash
# eCLIP protocol has inline adapters
# First pass: trim 3' adapter
cutadapt -a AGATCGGAAGAGC -m 18 -o pass1.fq.gz input.fq.gz

# Second pass: trim 5' adapter (read-through)
cutadapt -g AGATCGGAAGAGC -m 18 -o pass2.fq.gz pass1.fq.gz
```

## PCR Duplicate Removal

```bash
# After alignment, deduplicate using UMIs
umi_tools dedup \
    --stdin=aligned.bam \
    --stdout=deduped.bam \
    --paired \
    --method=unique

# Methods:
# unique: Exact UMI match
# cluster: Allow UMI mismatches (default)
# adjacency: Network-based clustering
```

## Python Preprocessing

```python
from umi_tools import UMIClusterer
import pysam

def count_umis_per_position(bam_path):
    '''Count unique UMIs at each genomic position'''
    from collections import defaultdict

    position_umis = defaultdict(set)

    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for read in bam:
            if read.is_unmapped:
                continue

            # Extract UMI from read name (added by umi_tools extract)
            umi = read.query_name.split('_')[-1]
            pos = (read.reference_name, read.reference_start)
            position_umis[pos].add(umi)

    return {pos: len(umis) for pos, umis in position_umis.items()}
```

## Quality Control

```python
def clip_qc(bam_path):
    '''CLIP-seq specific QC metrics'''
    import pysam

    total = 0
    unique_positions = set()
    read_lengths = []

    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for read in bam:
            if read.is_unmapped:
                continue
            total += 1
            unique_positions.add((read.reference_name, read.reference_start))
            read_lengths.append(read.query_length)

    return {
        'total_reads': total,
        'unique_positions': len(unique_positions),
        'mean_read_length': sum(read_lengths) / len(read_lengths),
        'complexity': len(unique_positions) / total
    }
```

## Related Skills

- clip-alignment - Align preprocessed reads
- read-qc/umi-processing - General UMI handling
- clip-peak-calling - Call peaks from aligned reads
More from GPTomics/bioSkills