---
name: bio-genome-engineering-off-target-prediction
description: Predict CRISPR off-target sites using Cas-OFFinder and CFD scoring algorithms. Identify potential unintended cleavage sites genome-wide and assess guide specificity. Use when evaluating guide RNA specificity or selecting guides with minimal off-target risk.
tool_type: cli
primary_tool: Cas-OFFinder
---
## Version Compatibility
Reference examples tested with: pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
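The checks above can also be scripted. The following is a minimal sketch using only the standard library; `installed_version` and `cli_available` are hypothetical helper names, and it assumes the CLI is installed under the name `cas-offinder` used elsewhere in this document.

```python
import shutil
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def cli_available(tool):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(tool) is not None


print('pandas:', installed_version('pandas'))
print('cas-offinder on PATH:', cli_available('cas-offinder'))
```

A `None` version or a `False` result means the corresponding example will not run until the dependency is installed.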
# Off-Target Prediction
**"Check my guide RNA for off-target sites"** → Search the genome for potential unintended cleavage sites allowing mismatches, then score each off-target by cutting frequency determination (CFD) to assess guide specificity.
- CLI: `cas-offinder` for genome-wide off-target search
- Python: CFD scoring with mismatch penalty matrices
## Cas-OFFinder (CLI)
Cas-OFFinder searches genomes for potential off-target sites allowing mismatches.
```bash
# Input file format (input.txt):
# Line 1: Path to genome directory (FASTA or 2bit files)
# Line 2: Search pattern including the PAM (N = any, R = A/G, Y = C/T)
# Line 3+: Guide sequence padded with N over the PAM, then max mismatches

# Example input.txt:
# /path/to/genome
# NNNNNNNNNNNNNNNNNNNNNGG
# ATCGATCGATCGATCGATCGNNN 4

# Run Cas-OFFinder
cas-offinder input.txt C output.txt  # C = use CPU
cas-offinder input.txt G output.txt  # G = use GPU (faster)
```
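The degenerate codes in the pattern line (N, R, Y) follow IUPAC nucleotide notation. As an illustration of how such a pattern matches a concrete PAM, here is a small sketch; `matches_pattern` is a hypothetical helper, not part of Cas-OFFinder, and the code table covers only the codes mentioned above.

```python
# IUPAC nucleotide codes relevant to PAM patterns (subset; extend as needed)
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'N': 'ACGT',
}


def matches_pattern(pattern, sequence):
    """Return True if sequence matches an IUPAC pattern of the same length."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(pattern.upper(), sequence.upper()))


print(matches_pattern('NGG', 'AGG'))  # True
print(matches_pattern('NGG', 'AGA'))  # False
print(matches_pattern('NRG', 'TAG'))  # True
```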
## Cas-OFFinder Input Preparation
```python
def prepare_cas_offinder_input(guides, genome_path, max_mismatches=4, pam='NGG'):
    '''Build the contents of a Cas-OFFinder input file.

    Args:
        guides: List of 20-nt guide sequences (without PAM)
        genome_path: Path to genome directory with .2bit or FASTA files
        max_mismatches: Maximum mismatches to search (0-6 typical).
            More mismatches = slower but more comprehensive;
            4 is a good balance of speed and sensitivity
        pam: PAM sequence (NGG for SpCas9)

    Returns:
        The input file contents as a single string
    '''
    lines = [genome_path]

    # Search pattern: 20 N's for the protospacer followed by the PAM
    lines.append('N' * 20 + pam)

    # Pad each guide with N's over the PAM positions (not counted as mismatches)
    pam_padding = 'N' * len(pam)
    for guide in guides:
        lines.append(f'{guide}{pam_padding} {max_mismatches}')

    return '\n'.join(lines) + '\n'
```
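Cas-OFFinder reads its input from a file, so the generated text has to be written to disk before the run. A self-contained sketch of that step follows; `write_cas_offinder_input` and its 20-nt ACGT validation are assumptions added for illustration, not part of the tool.

```python
import tempfile
from pathlib import Path


def write_cas_offinder_input(path, genome_path, guides, pam='NGG', max_mismatches=4):
    """Write a Cas-OFFinder input file and return its path."""
    lines = [str(genome_path), 'N' * 20 + pam]
    for guide in guides:
        guide = guide.upper()
        # Assumption: guides are 20-nt DNA sequences without the PAM
        if len(guide) != 20 or set(guide) - set('ACGT'):
            raise ValueError(f'expected a 20-nt ACGT guide, got {guide!r}')
        lines.append(f"{guide}{'N' * len(pam)} {max_mismatches}")
    path = Path(path)
    path.write_text('\n'.join(lines) + '\n')
    return path


# Usage: write to a temporary directory and show the result
with tempfile.TemporaryDirectory() as tmp:
    written = write_cas_offinder_input(
        Path(tmp) / 'input.txt', '/path/to/genome', ['ATCGATCGATCGATCGATCG']
    )
    content = written.read_text()
print(content)
```

Rejecting malformed guides before the search avoids spending a genome-wide scan on input that cannot produce meaningful hits.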
## Parse Cas-OFFinder Output
```python
import pandas as pd


def parse_cas_offinder_output(output_file):
    '''Parse Cas-OFFinder results.

    Output columns (tab-separated, no header):
    - guide: Query guide sequence
    - chrom: Target chromosome
    - position: Genomic position (0-based)
    - sequence: Off-target sequence found (mismatched bases in lowercase)
    - strand: + or -
    - mismatches: Number of mismatches
    '''
    columns = ['guide', 'chrom', 'position', 'sequence', 'strand', 'mismatches']
    df = pd.read_csv(output_file, sep='\t', header=None, names=columns)

    # Sort by mismatches (fewer = more concerning)
    df = df.sort_values('mismatches')
    return df


def summarize_off_targets(df):
    '''Summarize off-target counts by mismatch number.'''
    summary = df.groupby(['guide', 'mismatches']).size().unstack(fill_value=0)

    # Heuristic specificity score: sites with fewer mismatches weigh more.
    # Note: the 0-mismatch count includes the on-target site itself.
    summary['specificity'] = 1 / (1 + summary.get(0, 0) * 100 +
                                  summary.get(1, 0) * 10 +
                                  summary.get(2, 0))
    return summary
```
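Besides the mismatch count, it is often useful to know where in the protospacer the mismatches fall. The sketch below derives 1-based mismatch positions from a guide and a reported site; `mismatch_positions` is a hypothetical helper, and the example site is made up for illustration.

```python
def mismatch_positions(guide, site):
    """Return 1-based positions (5' to 3') where the site differs from the guide.

    The comparison is case-insensitive and skips 'N' in the guide,
    which marks the PAM placeholder positions.
    """
    return [i for i, (g, s) in enumerate(zip(guide.upper(), site.upper()), 1)
            if g != 'N' and g != s]


guide = 'ATCGATCGATCGATCGATCGNNN'
site = 'ATCGATCGAaCGATCGATtGAGG'  # made-up hit with two mismatches
print(mismatch_positions(guide, site))  # [10, 19]
```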
## CFD Score Calculation
```python
# Cutting Frequency Determination (CFD) score
# Predicts cleavage probability at off-target sites
# From Doench et al. 2016

# Position-specific mismatch penalties
CFD_MISMATCH_SCORES = {
    # (position, guide_nt, target_nt): penalty_multiplier
    # Positions count from the 5' end of the guide, as in calculate_cfd_score:
    # position 1 = PAM-distal, position 20 = PAM-proximal
    # Values < 1 indicate reduced cutting
    # Entries shown are illustrative; verify against the published matrix before use
    (1, 'C', 'A'): 0.5, (1, 'C', 'G'): 0.7, (1, 'C', 'T'): 0.3,
    (1, 'G', 'A'): 0.4, (1, 'G', 'C'): 0.6, (1, 'G', 'T'): 0.3,
    # ... (full matrix has all 20 positions x 12 mismatch types)
    (20, 'C', 'A'): 0.9, (20, 'C', 'G'): 0.95, (20, 'C', 'T'): 0.85,
}

# PAM penalties, keyed by the last two PAM nucleotides (the first is N).
# Canonical GG is unpenalized. Approximate values; verify against the published table.
CFD_PAM_SCORES = {
    'GG': 1.0,
    'AG': 0.26, 'CG': 0.11, 'TG': 0.02,  # NAG, NCG, NTG
    'GA': 0.07, 'GC': 0.03, 'GT': 0.02,  # NGA, NGC, NGT
}


def calculate_cfd_score(guide, off_target, pam='NGG'):
    '''Calculate CFD score for an off-target site.

    Args:
        guide: 20-nt guide sequence
        off_target: 23-nt site (protospacer + PAM), e.g. from Cas-OFFinder
        pam: PAM pattern used for the search (currently unused)

    CFD score interpretation:
    - 1.0: Perfect match (on-target)
    - >0.5: High probability of cleavage (concerning)
    - 0.1-0.5: Moderate probability
    - <0.1: Low probability (likely acceptable)
    '''
    # Cas-OFFinder reports mismatched bases in lowercase, so normalize case
    guide = guide.upper()
    off_target = off_target.upper()
    score = 1.0

    # Apply mismatch penalties
    for i, (g, t) in enumerate(zip(guide[:20], off_target[:20]), 1):
        if g != t:
            penalty = CFD_MISMATCH_SCORES.get((i, g, t), 0.5)  # Placeholder default
            score *= penalty

    # Apply PAM penalty from the last two PAM nucleotides
    # (PAMs absent from this partial table are left unpenalized)
    pam_key = off_target[21:23]
    if pam_key in CFD_PAM_SCORES:
        score *= CFD_PAM_SCORES[pam_key]
    return score
```
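The score is a running product: each mismatch multiplies the score by its position-specific penalty, so several tolerated mismatches can still add up to a low score. A toy calculation makes the arithmetic concrete. The penalties below are hypothetical values chosen for easy arithmetic, not published CFD values, and `toy_cfd` is an illustrative name.

```python
import math

# Hypothetical penalties for illustration only (not published CFD values)
toy_penalties = {(3, 'C', 'T'): 0.8, (18, 'T', 'A'): 0.5}


def toy_cfd(guide, site, penalties, default=0.5):
    """Multiply the penalties of all mismatched positions (1-based, 5' to 3')."""
    score = 1.0
    pairs = zip(guide.upper(), site.upper()[:len(guide)])
    for pos, (g, s) in enumerate(pairs, 1):
        if g != s:
            score *= penalties.get((pos, g, s), default)
    return score


guide = 'ATCGATCGATCGATCGATCG'
site = 'ATTGATCGATCGATCGAACG'  # mismatches at positions 3 (C>T) and 18 (T>A)
score = toy_cfd(guide, site, toy_penalties)
print(score)  # 0.8 * 0.5 = 0.4
```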
## Aggregate Off-Target Score
```python
def calculate_guide_specificity(guide, off_targets):
    '''Calculate aggregate specificity score for a guide.

    Args:
        guide: 20-nt guide sequence
        off_targets: Iterable of dicts with a 'sequence' key
            (e.g. df.to_dict('records')), excluding the on-target site

    Uses the sum of CFD scores over all off-targets.
    Lower CFD sum = higher specificity score = more specific guide.

    Specificity score interpretation:
    - >0.9: Highly specific (excellent)
    - 0.7-0.9: Specific (good)
    - 0.5-0.7: Moderate specificity (acceptable)
    - <0.5: Poor specificity (consider alternatives)
    '''
    cfd_sum = sum(calculate_cfd_score(guide, ot['sequence'], ot.get('pam', 'NGG'))
                  for ot in off_targets)

    # Specificity = 1 / (1 + sum of off-target CFD scores)
    specificity = 1 / (1 + cfd_sum)
    return specificity
```
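To see how the formula behaves, consider it in isolation from the scoring step. The sketch below applies `1 / (1 + sum)` to made-up CFD values; `specificity_from_cfd` is a hypothetical name used only for this illustration.

```python
def specificity_from_cfd(cfd_scores):
    """Aggregate specificity: 1 / (1 + sum of off-target CFD scores)."""
    return 1 / (1 + sum(cfd_scores))


# No off-targets: the guide is maximally specific
print(specificity_from_cfd([]))  # 1.0

# Off-target CFD scores summing to 1.0 halve the specificity
print(specificity_from_cfd([0.5, 0.25, 0.25]))  # 0.5
```

Note that passing the on-target site (CFD 1.0) along with the off-targets would cap the result at 0.5, which is why it should be filtered out first.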
## CRISPOR-style Analysis
**Goal:** Run a complete off-target analysis pipeline from a single guide sequence to a scored, ranked list of potential off-target sites.
**Approach:** Prepare a Cas-OFFinder input file, run the genome-wide mismatch search, parse the tabular output, calculate CFD scores for each hit, and flag high-risk off-targets by cleavage probability.
```python
def analyze_guide_specificity(guide_seq, genome_fasta, max_mm=4):
    '''Full off-target analysis workflow.

    1. Run Cas-OFFinder to find potential sites
    2. Calculate CFD scores for each site
    3. Flag high-CFD off-targets and rank by CFD score

    Aggregate specificity can then be computed from the returned table
    with calculate_guide_specificity.
    '''
    import subprocess
    import tempfile

    # Prepare input
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write(f'{genome_fasta}\n')
        f.write('N' * 20 + 'NGG\n')
        f.write(f'{guide_seq}NNN {max_mm}\n')
        input_file = f.name

    # Replace only the trailing suffix, not '.txt' elsewhere in the path
    output_file = input_file[:-len('.txt')] + '_out.txt'

    # Run Cas-OFFinder on CPU
    subprocess.run(['cas-offinder', input_file, 'C', output_file], check=True)

    # Parse results
    off_targets = parse_cas_offinder_output(output_file)

    # Calculate CFD for each
    off_targets['cfd_score'] = off_targets.apply(
        lambda row: calculate_cfd_score(guide_seq, row['sequence']), axis=1
    )

    # Flag by cleavage probability (CFD > 0.5); restricting the flag to
    # exonic sites would additionally need exon annotations
    off_targets['high_cfd'] = off_targets['cfd_score'] > 0.5
    return off_targets.sort_values('cfd_score', ascending=False)
```
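The exon check mentioned in the workflow comment can be sketched as a plain interval overlap test. This is an illustration under stated assumptions: exons are given as 0-based half-open `(chrom, start, end)` tuples (BED-style), each site spans 23 nt from its reported 0-based position, and `overlaps_exon` and `flag_high_risk` are hypothetical helpers. For real annotations, an interval tool such as the one covered in the related genome-intervals skill would scale better than this linear scan.

```python
def overlaps_exon(chrom, start, end, exons):
    """Return True if [start, end) overlaps any exon on the same chromosome."""
    return any(c == chrom and start < e_end and e_start < end
               for c, e_start, e_end in exons)


def flag_high_risk(hits, exons, cfd_threshold=0.5, site_len=23):
    """Mark hits that are both exonic and above the CFD threshold."""
    flagged = []
    for hit in hits:
        end = hit['position'] + site_len
        in_exon = overlaps_exon(hit['chrom'], hit['position'], end, exons)
        flagged.append({**hit, 'in_exon': in_exon,
                        'high_risk': in_exon and hit['cfd_score'] > cfd_threshold})
    return flagged


# Made-up annotation and hits for illustration
exons = [('chr1', 1000, 1200)]
hits = [
    {'chrom': 'chr1', 'position': 990, 'cfd_score': 0.8},   # overlaps the exon
    {'chrom': 'chr1', 'position': 5000, 'cfd_score': 0.9},  # outside any exon
    {'chrom': 'chr1', 'position': 1100, 'cfd_score': 0.2},  # exonic, low CFD
]
result = flag_high_risk(hits, exons)
print([h['high_risk'] for h in result])  # [True, False, False]
```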
## Related Skills
- genome-engineering/grna-design - Design guides before off-target check
- variant-calling/variant-annotation - Check if off-targets overlap known variants
- genome-intervals/bed-file-basics - Intersect off-targets with genomic features