bio-population-genetics-scikit-allel-analysis
$
npx mdskill add GPTomics/bioSkills/bio-population-genetics-scikit-allel-analysisAnalyzes population genetics data using scikit-allel in Python
- Reads and processes VCF files for population genetic analysis
- Uses scikit-allel, numpy, and matplotlib for computations and visualization
- Computes allele frequencies, diversity, PCA, and selection scans from genotype data
- Returns structured genetic statistics and visualizations for downstream analysis
SKILL.md
.github/skills/bio-population-genetics-scikit-allel-analysisView on GitHub ↗
---
name: bio-population-genetics-scikit-allel-analysis
description: Python population genetics with scikit-allel. Read VCF files, compute allele frequencies, calculate diversity statistics, perform PCA, and run selection scans using GenotypeArray and HaplotypeArray data structures. Use when analyzing population genetics in Python.
tool_type: python
primary_tool: scikit-allel
---
## Version Compatibility
Reference examples tested with: bcftools 1.19+, matplotlib 3.8+, numpy 1.26+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# scikit-allel Analysis
**"Analyze population genetics in Python"** → Read VCF files into efficient array structures, compute allele frequencies, diversity statistics, PCA, and selection scans using scikit-allel.
- Python: `allel.read_vcf()`, `allel.GenotypeArray()`, `allel.mean_pairwise_difference()`
Python library for population genetics analysis with efficient array data structures.
## Installation
```bash
pip install scikit-allel
# Optional: zarr for chunked storage
pip install zarr
```
## Reading VCF Files
### Load VCF
```python
import allel
callset = allel.read_vcf('data.vcf.gz')
print(callset.keys())
# dict_keys(['samples', 'calldata/GT', 'variants/CHROM', 'variants/POS', 'variants/REF', 'variants/ALT', ...])
samples = callset['samples']
genotypes = callset['calldata/GT']
positions = callset['variants/POS']
chroms = callset['variants/CHROM']
```
### Specify Fields
```python
callset = allel.read_vcf('data.vcf.gz',
fields=['samples', 'calldata/GT', 'variants/POS', 'variants/CHROM', 'variants/QUAL'])
callset = allel.read_vcf('data.vcf.gz', fields='*') # All fields
callset = allel.read_vcf('data.vcf.gz',
region='chr1:1000000-2000000',
samples=['sample1', 'sample2'])
```
### Large Files (Chunked)
```python
import zarr
allel.vcf_to_zarr('large.vcf.gz', 'data.zarr', fields='*', overwrite=True)
callset = zarr.open('data.zarr', mode='r')
gt = allel.GenotypeArray(callset['calldata/GT'])
```
## Genotype Arrays
### GenotypeArray
```python
gt = allel.GenotypeArray(callset['calldata/GT'])
print(gt.shape) # (n_variants, n_samples, ploidy)
print(gt.n_variants)
print(gt.n_samples)
print(gt[0]) # Genotypes at first variant
print(gt[:, 0]) # All variants for first sample
```
### Basic Operations
```python
ac = gt.count_alleles()
print(ac.shape) # (n_variants, n_alleles)
af = ac.to_frequencies()
is_segregating = ac.is_segregating()
gt_filtered = gt.compress(is_segregating, axis=0)
```
### Missing Data
```python
is_called = gt.is_called()
is_missing = gt.is_missing()
miss_per_variant = (~is_called).sum(axis=1)
miss_per_sample = (~is_called).sum(axis=0)
call_rate_variant = is_called.mean(axis=1)
call_rate_sample = is_called.mean(axis=0)
```
## Allele Counts and Frequencies
```python
ac = gt.count_alleles()
ac_ref = ac[:, 0]
ac_alt = ac[:, 1]
af = ac.to_frequencies()
maf = af.min(axis=1)
n_singletons = (ac[:, 1] == 1).sum()
n_doubletons = (ac[:, 1] == 2).sum()
```
### By Population
```python
subpops = {
'pop1': [0, 1, 2, 3, 4],
'pop2': [5, 6, 7, 8, 9]
}
ac_subpops = gt.count_alleles_subpops(subpops)
ac_pop1 = ac_subpops['pop1']
ac_pop2 = ac_subpops['pop2']
```
## Haplotype Arrays
```python
h = gt.to_haplotypes()
print(h.shape) # (n_variants, n_haplotypes)
print(h.n_haplotypes)
ac_hap = h.count_alleles()
```
## PCA
```python
import allel
import numpy as np
gn = gt.to_n_alt(fill=-1)
gn_filtered = gn[is_segregating]
gn_imputed = np.where(gn_filtered < 0, 0, gn_filtered)
coords, model = allel.pca(gn_imputed, n_components=10, scaler='patterson')
print(coords.shape) # (n_samples, n_components)
```
### Plot PCA
```python
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.scatter(coords[:, 0], coords[:, 1], c=population_labels)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.savefig('pca.png')
```
## Diversity Statistics
### Heterozygosity
```python
ho = allel.heterozygosity_observed(gt)
he = allel.heterozygosity_expected(ac, ploidy=2)
mean_ho = np.mean(ho)
mean_he = np.mean(he)
```
### Nucleotide Diversity (Pi)
```python
pi = allel.sequence_diversity(positions, ac)
print(f'Pi = {pi:.6f}')
windows = allel.moving_statistic(positions, statistic=lambda x: allel.sequence_diversity(x, ac), size=10000, step=5000)
```
### Watterson's Theta
```python
theta_w = allel.watterson_theta(positions, ac)
print(f'Theta_W = {theta_w:.6f}')
```
## Site Frequency Spectrum
```python
sfs = allel.sfs(ac[:, 1])
plt.figure(figsize=(10, 5))
allel.plot_sfs(sfs)
plt.savefig('sfs.png')
```
### Folded SFS
```python
sfs_folded = allel.sfs_folded(ac)
plt.figure(figsize=(10, 5))
allel.plot_sfs_folded(sfs_folded)
plt.savefig('sfs_folded.png')
```
## Windowed Statistics
```python
pos = np.array(positions)
windows = np.arange(0, pos.max(), 100000)
pi_windowed, windows_used, n_bases, counts = allel.windowed_diversity(pos, ac, size=100000, step=50000)
plt.figure(figsize=(14, 4))
plt.plot(windows_used[:, 0], pi_windowed)
plt.xlabel('Position')
plt.ylabel('Pi')
plt.savefig('pi_windows.png')
```
## Sample Subsetting
```python
pop1_idx = np.array([0, 1, 2, 3, 4])
pop2_idx = np.array([5, 6, 7, 8, 9])
gt_pop1 = gt.take(pop1_idx, axis=1)
gt_pop2 = gt.take(pop2_idx, axis=1)
ac_pop1 = gt_pop1.count_alleles()
ac_pop2 = gt_pop2.count_alleles()
```
## Filter Variants
```python
is_snp = callset['variants/is_snp']
is_biallelic = ac.max_allele() == 1
is_segregating = ac.is_segregating()
qual = callset['variants/QUAL']
is_high_qual = qual > 30
flt = is_snp & is_biallelic & is_segregating & is_high_qual
gt_filtered = gt.compress(flt, axis=0)
pos_filtered = positions[flt]
```
## Complete Workflow Example
**Goal:** Load VCF data, filter to segregating biallelic variants, compute summary diversity statistics, and run PCA in a single Python workflow.
**Approach:** Read VCF into GenotypeArray, apply segregating and biallelic filters, calculate nucleotide diversity and heterozygosity from allele counts, then perform Patterson PCA on the alt-allele count matrix.
```python
import allel
import numpy as np
callset = allel.read_vcf('data.vcf.gz', fields=['samples', 'calldata/GT', 'variants/POS'])
gt = allel.GenotypeArray(callset['calldata/GT'])
pos = callset['variants/POS']
samples = callset['samples']
ac = gt.count_alleles()
flt = ac.is_segregating() & (ac.max_allele() == 1)
gt = gt.compress(flt, axis=0)
pos = pos[flt]
ac = gt.count_alleles()
print(f'Variants after filtering: {gt.n_variants}')
print(f'Samples: {gt.n_samples}')
print(f'Nucleotide diversity: {allel.sequence_diversity(pos, ac):.6f}')
print(f'Mean Het observed: {allel.heterozygosity_observed(gt).mean():.4f}')
gn = gt.to_n_alt(fill=-1)
gn = np.where(gn < 0, 0, gn)
coords, model = allel.pca(gn, n_components=10, scaler='patterson')
```
## Related Skills
- selection-statistics - Fst, Tajima's D, iHS with scikit-allel
- linkage-disequilibrium - LD calculations in Python
- variant-calling/vcf-basics - VCF format and bcftools
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.