bio-rna-structure-ncrna-search
$
npx mdskill add GPTomics/bioSkills/bio-rna-structure-ncrna-searchSearches for non-coding RNA homologs using Infernal and Rfam
- Identifies structured RNAs by sequence and secondary structure conservation
- Uses Infernal CLI tools like cmscan, cmbuild, and cmcalibrate
- Matches input sequences against Rfam covariance models for classification
- Returns RNA family classifications and homolog detection results
SKILL.md
.github/skills/bio-rna-structure-ncrna-searchView on GitHub ↗
---
name: bio-rna-structure-ncrna-search
description: Searches for non-coding RNA homologs and classifies RNA families using Infernal covariance model searches against the Rfam database. Identifies structured RNAs by sequence and secondary structure conservation. Use when querying sequences against Rfam, building custom covariance models for novel RNA families, or classifying non-coding transcripts by family.
tool_type: cli
primary_tool: Infernal
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, Infernal 1.1+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# ncRNA Search
**"Search my sequences for known non-coding RNA families"** → Query sequences against the Rfam database using covariance models that score both sequence and secondary structure conservation, or build custom CMs for novel RNA families.
- CLI: `cmscan` for searching against Rfam CMs
- CLI: `cmbuild` + `cmcalibrate` for building custom covariance models
Search for non-coding RNA homologs and classify RNA families using covariance models (CMs). Infernal scores both sequence and secondary structure conservation, making it more sensitive than sequence-only methods for structured RNAs.
## Rfam Database Setup
```bash
# Download current Rfam covariance models (~500 MB compressed)
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
gunzip Rfam.cm.gz
# Press the CM database (required before searching)
cmpress Rfam.cm
# Download clan information (for resolving overlapping hits)
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin
```
## cmscan: Query Sequences Against Rfam
Search one or more query sequences against the full Rfam database to classify ncRNAs by family.
```bash
# Basic cmscan against Rfam
cmscan --cpu 8 --tblout results.tbl --fmt 2 Rfam.cm query.fa > results.out
# With clan overlap resolution (removes redundant hits from same clan)
cmscan --cpu 8 --tblout results.tbl --fmt 2 --clanin Rfam.clanin --oclan Rfam.cm query.fa > results.out
# Strict E-value threshold for high-confidence hits
cmscan --cpu 8 --tblout results.tbl --fmt 2 --clanin Rfam.clanin --oclan \
-E 1e-5 --incE 1e-5 Rfam.cm query.fa > results.out
```
### cmscan Key Options
| Option | Description |
|--------|-------------|
| `--tblout` | Tabular output (easier to parse) |
| `--fmt 2` | Format 2 tabular output (includes model coordinates) |
| `-E` | Report E-value threshold (default: 10.0) |
| `--incE` | Inclusion E-value for significant hits (default: 0.01) |
| `--clanin` | Clan file for resolving overlapping family hits |
| `--oclan` | Enable clan overlap resolution |
| `--cut_ga` | Use Rfam gathering threshold (curated per-family cutoff) |
| `--cpu` | Number of threads |
| `--noali` | Skip alignment output (faster for large searches) |
### Gathering Threshold vs E-value
```bash
# Gathering threshold: curated per-family cutoff, recommended for Rfam
# Provides consistent sensitivity per family as calibrated by Rfam curators
cmscan --cut_ga --tblout results.tbl --fmt 2 --clanin Rfam.clanin --oclan \
Rfam.cm query.fa > results.out
# E-value based: unified threshold, useful for custom databases
cmscan -E 1e-3 --tblout results.tbl Rfam.cm query.fa > results.out
```
## cmsearch: Search Specific CM Against Sequence Database
Search a specific covariance model against a sequence database (inverse of cmscan).
```bash
# Extract a single family CM from Rfam
cmfetch Rfam.cm RF00005 > tRNA.cm
cmpress tRNA.cm
# Search for tRNAs in a genome
cmsearch --cpu 8 --tblout trna_hits.tbl tRNA.cm genome.fa > trna_hits.out
# Search with bit score threshold instead of E-value
cmsearch --cpu 8 -T 30.0 --tblout hits.tbl tRNA.cm genome.fa > hits.out
```
## Building Custom Covariance Models
**Goal:** Create a covariance model for a novel RNA family not represented in Rfam, enabling sensitive homolog searches that leverage both sequence and structure conservation.
**Approach:** Prepare a Stockholm-format alignment with secondary structure annotation, build and calibrate a covariance model from the alignment, then search target sequences with the custom CM.
For novel RNA families not in Rfam, build a custom CM from a structure-annotated alignment.
### Step 1: Prepare Stockholm Alignment
```
# STOCKHOLM 1.0
#=GF AC MYFAM00001
#=GF DE My novel RNA family
seq1 GGGCUAUUAGCUCAGUUGGUUAGAGC
seq2 GGGCUAUAAGCUCAGUUGGAUAGAGC
seq3 GGGCUAUUAGCUCAGUUGGUUAGAGC
#=GC SS_cons ((((....((((......))))))))
//
```
### Step 2: Build and Calibrate
```bash
# Build CM from alignment
cmbuild my_family.cm alignment.sto
# Calibrate E-value statistics (compute-intensive but essential for accurate E-values)
cmcalibrate --cpu 8 my_family.cm
# Press for searching
cmpress my_family.cm
```
### Step 3: Search
```bash
# Search genome with custom CM
cmsearch --cpu 8 --tblout custom_hits.tbl my_family.cm target_sequences.fa > custom_hits.out
# Iterative search: use hits to refine alignment and rebuild CM
cmsearch -A new_hits.sto my_family.cm target_sequences.fa
# Then manually curate new_hits.sto and rebuild
```
### cmbuild Options
| Option | Description |
|--------|-------------|
| `--hand` | Use reference annotation for consensus (trust SS_cons exactly) |
| `--enone` | Turn off entropy weighting |
| `-n` | Name the CM |
| `--ere` | Target mean match state relative entropy |
## Parsing Infernal Output
### Tabular Output (--tblout --fmt 2)
```python
import pandas as pd
def parse_cmscan_tblout(tblout_file):
'''Parse Infernal cmscan --fmt 2 tabular output.'''
rows = []
with open(tblout_file) as f:
for line in f:
if line.startswith('#'):
continue
fields = line.strip().split()
if len(fields) < 18:
continue
rows.append({
'target_name': fields[0],
'target_accession': fields[1],
'query_name': fields[2],
'query_accession': fields[3],
'mdl_type': fields[4],
'mdl_from': int(fields[5]),
'mdl_to': int(fields[6]),
'seq_from': int(fields[7]),
'seq_to': int(fields[8]),
'strand': fields[9],
'trunc': fields[10],
'pass': fields[11],
'gc': float(fields[12]),
'bias': float(fields[13]),
'score': float(fields[14]),
'evalue': float(fields[15]),
'inc': fields[16],
'description': ' '.join(fields[17:])
})
df = pd.DataFrame(rows)
return df
def filter_significant_hits(df, evalue_threshold=1e-5):
'''Filter for significant hits and sort by score.'''
significant = df[df['evalue'] <= evalue_threshold].copy()
significant = significant.sort_values('score', ascending=False)
return significant
def summarize_families(df):
'''Summarize ncRNA family assignments.'''
summary = df.groupby('target_name').agg(
count=('query_name', 'count'),
best_score=('score', 'max'),
best_evalue=('evalue', 'min')
).sort_values('count', ascending=False)
return summary
```
### Extract Hit Sequences
```python
from Bio import SeqIO
def extract_hit_sequences(fasta_file, hits_df, output_file):
'''Extract sequences for cmscan/cmsearch hits.'''
seqs = SeqIO.to_dict(SeqIO.parse(fasta_file, 'fasta'))
records = []
for _, hit in hits_df.iterrows():
seq_record = seqs[hit['query_name']]
start, end = sorted([hit['seq_from'], hit['seq_to']])
subseq = seq_record[start-1:end]
if hit['strand'] == '-':
subseq = subseq.reverse_complement()
subseq.id = f'{hit["query_name"]}_{start}_{end}_{hit["target_name"]}'
subseq.description = f'family={hit["target_name"]} score={hit["score"]:.1f} E={hit["evalue"]:.1e}'
records.append(subseq)
SeqIO.write(records, output_file, 'fasta')
print(f'Extracted {len(records)} hit sequences to {output_file}')
```
## Quality Thresholds
| Metric | Threshold | Rationale |
|--------|-----------|-----------|
| E-value (Rfam scan) | < 1e-5 | High-confidence family assignment |
| Gathering threshold | --cut_ga | Rfam-curated per-family cutoffs, recommended default |
| Bit score | > 20 | Minimum for reportable hits in custom searches |
| Truncation | != 5'/3' | Hits at sequence edges may be truncated; check completeness |
## Related Skills
- secondary-structure-prediction - Predict structures for novel ncRNA candidates
- genome-annotation/ncrna-annotation - Genome-wide ncRNA annotation pipelines
- alignment/msa-statistics - Evaluate alignment quality for CM building
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.