bio-read-qc-quality-reports
$ npx mdskill add GPTomics/bioSkills/bio-read-qc-quality-reports
Generates and interprets sequencing data quality reports using FastQC and MultiQC
- Assesses FASTQ file quality for per-base scores, adapter content, and GC bias
- Uses FastQC for individual files and MultiQC for aggregation across samples
- Evaluates metrics like duplication levels and overrepresented sequences
- Delivers HTML reports with visualizations and summary statistics
---
name: bio-read-qc-quality-reports
description: Generate and interpret quality reports from FASTQ files using FastQC and MultiQC. Assess per-base quality, adapter content, GC bias, duplication levels, and overrepresented sequences. Use when performing initial QC on raw sequencing data or validating preprocessing results.
tool_type: cli
primary_tool: fastqc
---
## Version Compatibility
Reference examples tested with: pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Quality Reports
Generate quality reports for FASTQ files using FastQC and aggregate multiple reports with MultiQC.
**"Run quality control on FASTQ files"** → Generate per-base quality, adapter content, and duplication plots, then aggregate across samples.
- CLI: `fastqc *.fastq.gz` then `multiqc .`
## FastQC - Single Sample Reports
### Basic Usage
```bash
# Single file
fastqc sample.fastq.gz
# Multiple files
fastqc *.fastq.gz
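# Specify output directory (it must already exist; FastQC will not create it)
mkdir -p qc_reports
fastqc -o qc_reports/ sample_R1.fastq.gz sample_R2.fastq.gz
# Process up to 4 files in parallel (one file per thread)
fastqc -t 4 *.fastq.gz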
```
### Output Files
FastQC produces two files per input:
- `sample_fastqc.html` - Interactive HTML report
- `sample_fastqc.zip` - Data files and images
### Key Modules
| Module | What It Shows | Warning Signs |
|--------|---------------|---------------|
| Per base sequence quality | Quality scores across read | Drop below Q20 at 3' end |
| Per sequence quality scores | Distribution of mean read quality | Bimodal distribution |
| Per base sequence content | Nucleotide composition | Imbalance at read start (expected for random-hexamer-primed RNA-seq) |
| Per sequence GC content | GC distribution | Secondary peak (contamination) |
| Per base N content | Unknown bases | High N content |
| Sequence length distribution | Read lengths | Unexpected variation |
| Sequence duplication levels | Duplicate reads | High duplication (PCR) |
| Overrepresented sequences | Common sequences | Adapter contamination |
| Adapter content | Adapter sequences | Visible adapter curves |
### Extract Data from ZIP
```bash
# Unzip to access raw data
unzip sample_fastqc.zip
# View summary
cat sample_fastqc/summary.txt
# Get per-base quality (print the module from its header to its >>END_MODULE line)
sed -n '/^>>Per base sequence quality/,/^>>END_MODULE/p' sample_fastqc/fastqc_data.txt
```
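The same data can be read without unpacking the archive. The following is a minimal Python sketch, not part of FastQC itself. It assumes the archive layout shown above (`sample_fastqc/summary.txt` and `sample_fastqc/fastqc_data.txt`), that `summary.txt` holds tab-separated status, module, and filename columns, and that each module in `fastqc_data.txt` starts with a `>>Module name` line and ends with `>>END_MODULE`. The function names are illustrative.

```python
import zipfile


def parse_summary(text):
    """Map module name -> status (PASS/WARN/FAIL) from summary.txt contents."""
    statuses = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            status, module = parts[0], parts[1]
            statuses[module] = status
    return statuses


def parse_module(text, module_name):
    """Return (header, rows) for one module of fastqc_data.txt.

    If a module has several '#' lines, the last one is kept as the header.
    """
    header, rows, inside = [], [], False
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            if inside:
                break  # reached the end of the requested module
            continue
        if line.startswith(">>"):
            # Module start line: ">>Module name<TAB>status"
            inside = line[2:].split("\t")[0] == module_name
            continue
        if not inside:
            continue
        if line.startswith("#"):
            header = line[1:].split("\t")
        else:
            rows.append(line.split("\t"))
    return header, rows


def read_member(zip_path, suffix):
    """Read the first archive member whose path ends with `suffix` as text."""
    with zipfile.ZipFile(zip_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith(suffix))
        return archive.read(name).decode("utf-8")
```

For example, `parse_summary(read_member("sample_fastqc.zip", "/summary.txt"))` would give a module-to-status dictionary, and `parse_module(read_member("sample_fastqc.zip", "/fastqc_data.txt"), "Per base sequence quality")` the per-base table as lists of strings.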
## MultiQC - Aggregate Reports
### Basic Usage
```bash
# Aggregate all FastQC reports in current directory
multiqc .
# Specify input and output
multiqc qc_reports/ -o multiqc_output/
# Custom report name
multiqc . -n my_project_qc
# Force overwrite
multiqc . -f
```
### Common Options
```bash
# Use only flat (static image) plots instead of interactive ones
multiqc --flat .
# Export plots as static images in addition to the report
multiqc . --export
# Only specific modules
multiqc . -m fastqc
# Exclude patterns
multiqc . --ignore '*_trimmed*'
# Exclude samples whose names match a pattern
multiqc . --ignore-samples '*negative*'
```
### Output Files
- `multiqc_report.html` - Interactive HTML report
- `multiqc_data/` - Directory with data tables
- `multiqc_fastqc.txt` - FastQC metrics
- `multiqc_general_stats.txt` - Summary statistics
- `multiqc_sources.txt` - Source files used
### Extract Data Programmatically
```python
import pandas as pd
general_stats = pd.read_csv('multiqc_data/multiqc_general_stats.txt', sep='\t')
print(general_stats.columns)
fastqc_data = pd.read_csv('multiqc_data/multiqc_fastqc.txt', sep='\t')
```
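Once the table is loaded, a common next step is to flag samples that exceed a threshold. The sketch below uses only the standard library so it runs without pandas. It assumes the general stats file has a `Sample` column and that the FastQC column of interest can be found by its suffix (for example `percent_duplicates`); exact column names differ between MultiQC versions, so check the header first. The function name and the 50% threshold in the usage note are hypothetical.

```python
import csv
import io


def flag_samples(tsv_text, column_suffix, threshold):
    """Return sample names whose value in the matching column exceeds threshold.

    The column is located by suffix because full column names vary
    between MultiQC versions (an assumption of this sketch).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = [c for c in (reader.fieldnames or []) if c.endswith(column_suffix)]
    if not columns:
        raise KeyError(f"no column ending with {column_suffix!r}")
    column = columns[0]
    flagged = []
    for row in reader:
        value = row[column]
        # Skip samples with no value for this metric
        if value and float(value) > threshold:
            flagged.append(row["Sample"])
    return flagged
```

A call might look like `flag_samples(open('multiqc_data/multiqc_general_stats.txt').read(), 'percent_duplicates', 50.0)`.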
## Batch Processing
### Process Multiple Samples
```bash
# All FASTQ files in parallel (one file per thread; the output directory must exist)
mkdir -p qc_reports
fastqc -t 8 -o qc_reports/ raw_data/*.fastq.gz
# Then aggregate
multiqc qc_reports/ -o multiqc_output/
```
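When the two steps need to run from a Python pipeline, they can be wrapped with `subprocess`. This is a sketch under stated assumptions: `fastqc` and `multiqc` are on `PATH`, inputs end in `.fastq.gz`, and the helper names (`build_fastqc_cmd`, `build_multiqc_cmd`, `run_qc`) are hypothetical. Only the flags already used above (`-t`, `-o`) appear.

```python
import subprocess
from pathlib import Path


def build_fastqc_cmd(fastq_files, outdir, threads=1):
    """Build the FastQC argument list (no shell, so no glob expansion needed)."""
    return ["fastqc", "-t", str(threads), "-o", str(outdir), *map(str, fastq_files)]


def build_multiqc_cmd(report_dir, outdir):
    """Build the MultiQC argument list for aggregating one report directory."""
    return ["multiqc", str(report_dir), "-o", str(outdir)]


def run_qc(raw_dir, report_dir, multiqc_dir, threads=8):
    """Run FastQC on every *.fastq.gz in raw_dir, then aggregate with MultiQC."""
    files = sorted(Path(raw_dir).glob("*.fastq.gz"))
    if not files:
        raise FileNotFoundError(f"no *.fastq.gz files in {raw_dir}")
    # FastQC does not create its output directory, so create it first
    Path(report_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if either tool exits non-zero
    subprocess.run(build_fastqc_cmd(files, report_dir, threads), check=True)
    subprocess.run(build_multiqc_cmd(report_dir, multiqc_dir), check=True)
```

Calling `run_qc("raw_data", "qc_reports", "multiqc_output")` would mirror the shell commands above. Keeping command construction separate from execution makes the argument lists easy to inspect or log before anything runs.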
### Before and After Trimming
```bash
# Create separate directories
mkdir -p qc_reports/raw qc_reports/trimmed
# QC raw reads
fastqc -o qc_reports/raw/ raw_data/*.fastq.gz
# After trimming (using fastp, cutadapt, etc.)
fastqc -o qc_reports/trimmed/ trimmed_data/*.fastq.gz
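# Compare with MultiQC (-d prepends the directory to sample names,
# so raw and trimmed reports with identical file names stay distinct)
multiqc -d qc_reports/ -o qc_comparison/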
```
## Interpretation Guide
### Quality Scores
| Phred Score | Error Rate | Interpretation |
|-------------|------------|----------------|
| Q40 | 0.0001 | Excellent |
| Q30 | 0.001 | Good (Illumina target) |
| Q20 | 0.01 | Acceptable |
| Q10 | 0.1 | Poor |
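The error rates in this table follow from the Phred definition, Q = -10 · log10(P). A small Python sketch of the conversion is shown below; it also decodes a FASTQ quality string assuming the Phred+33 (Sanger / Illumina 1.8+) encoding. The function names are illustrative.

```python
import math


def phred_to_error(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score corresponding to error probability p (p must be > 0)."""
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Convert a FASTQ quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(char) - 33 for char in quality_string]
```

For instance, `phred_to_error(20)` gives an error rate of about 0.01, matching the Q20 row.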
### Common Issues
| Issue | Likely Cause | Action |
|-------|--------------|--------|
| Low quality at 3' end | Normal degradation | Trim 3' end |
| Adapter contamination | Short inserts | Trim adapters |
| GC bias | Library prep | Consider correction |
| High duplication | Low complexity, PCR | Mark/remove duplicates |
| Overrepresented seqs | Adapters, primers | Check sequences |
## Configuration
### Custom Adapters
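Create an adapter list such as `~/.fastqc/Configuration/adapter_list.txt`, with one adapter name and sequence per line separated by a tab. Pass it with `-a`/`--adapters` to make sure FastQC uses it instead of its built-in list:
```
Custom_Adapter_Name	ACGTACGTACGT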
```
### Custom Limits
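Create a limits file such as `~/.fastqc/Configuration/limits.txt` to customize thresholds, and pass it with `-l`/`--limits` to make sure FastQC uses it instead of its defaults: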
```
# Warn if mean quality below 25
quality_sequence warn 25
quality_sequence error 20
```
## Related Skills
- adapter-trimming - Remove adapters detected by FastQC
- fastp-workflow - All-in-one QC and trimming
- sequence-io/read-sequences - FASTQ file reading/writing