bio-vcf-manipulation

$npx mdskill add GPTomics/bioSkills/bio-vcf-manipulation

Manipulate VCF files using bcftools for merging, sorting, and comparing variant data

  • Combine, sort, and compare genomic variant data from multiple VCF files
  • Relies on bcftools CLI for VCF operations and GATK for compatibility
  • Chooses merge, concat, sort, isec, or view based on input structure and task
  • Delivers processed VCF files or command outputs for downstream analysis
SKILL.md
.github/skills/bio-vcf-manipulationView on GitHub ↗
---
name: bio-vcf-manipulation
description: Merge, concatenate, sort, intersect, and subset VCF files using bcftools. Use when combining variant files, comparing call sets, or restructuring VCF data.
tool_type: cli
primary_tool: bcftools
---

## Version Compatibility

Reference examples tested with: GATK 4.5+, bcftools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# VCF Manipulation

Merge, concat, sort, and compare VCF files using bcftools.

## Operations Overview

| Operation | Command | Use Case |
|-----------|---------|----------|
| Merge | `bcftools merge` | Combine samples from multiple VCFs |
| Concat | `bcftools concat` | Combine regions from multiple VCFs |
| Sort | `bcftools sort` | Sort unsorted VCF |
| Intersect | `bcftools isec` | Compare/intersect call sets |
| Subset | `bcftools view` | Extract samples or regions |

## Merge vs Concat Decision

| Operation | Input structure | Result | Requires index |
|-----------|----------------|--------|----------------|
| `bcftools merge` | Different **samples** at (potentially) same positions | Multi-sample VCF | Yes |
| `bcftools concat` | Same **samples** from different **regions** | Combined-region VCF | No (unless `-a`) |

Common mistake: using concat when merge is needed (or vice versa). `concat --allow-overlaps` (`-a`) can handle some overlap cases but does not merge genotypes across samples -- it only resolves duplicate records from the same sample.

## bcftools merge

**Goal:** Combine VCF files from different samples into a single multi-sample VCF.

**Approach:** Use bcftools merge to join files with different sample columns at shared genomic positions.

**"Merge my per-sample VCFs into one file"** → Combine variant records from multiple samples into a single multi-sample VCF.

Combine multiple VCF files with **different samples** at the same positions.

### Basic Merge

```bash
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
```

### Merge Multiple Files

```bash
bcftools merge *.vcf.gz -Oz -o all_samples.vcf.gz
```

### Merge from File List

```bash
# files.txt: one VCF path per line
bcftools merge -l files.txt -Oz -o merged.vcf.gz
```

### Handle Missing Genotypes

```bash
# Output missing genotypes as ./. (default)
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

# Output missing as reference (0/0)
bcftools merge --missing-to-ref sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
```

### Force Sample Names

When sample names conflict:

```bash
bcftools merge --force-samples sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
```

### Multiallelic Handling During Merge

When merging VCFs, if the same site has different ALT alleles across files, bcftools merge creates a multiallelic record by default. This is usually correct behavior. Use `--merge none` to keep records separate, or `--merge snps`/`--merge indels` to control merging granularity.

```bash
# Default: merge all variant types at same position
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

# Keep SNPs and indels at same position as separate records
bcftools merge --merge snps sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

# No merging -- every record stays separate
bcftools merge --merge none sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
```

### Merge Specific Regions

```bash
bcftools merge -r chr1:1000000-2000000 sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
```

## bcftools concat

**Goal:** Concatenate VCF files that cover different genomic regions for the same samples.

**Approach:** Use bcftools concat to join region-split files (e.g., per-chromosome VCFs) in order.

Combine VCF files with **same samples** from different regions.

### Concatenate Chromosomes

```bash
bcftools concat chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz -Oz -o genome.vcf.gz
```

### Concatenate All Chromosomes

```bash
bcftools concat chr*.vcf.gz -Oz -o genome.vcf.gz
```

### From File List

```bash
# files.txt: one VCF path per line (in order)
bcftools concat -f files.txt -Oz -o concatenated.vcf.gz
```

### Allow Overlapping Regions

```bash
bcftools concat -a chr1_part1.vcf.gz chr1_part2.vcf.gz -Oz -o chr1.vcf.gz
```

### Remove Duplicates

```bash
bcftools concat -a -d all file1.vcf.gz file2.vcf.gz -Oz -o merged.vcf.gz
```

Options for `-d`:
- `snps` - Remove duplicate SNPs
- `indels` - Remove duplicate indels
- `both` - Remove duplicate SNPs and indels
- `all` - Remove all duplicates
- `exact` - Remove exact duplicates only

## bcftools sort

**Goal:** Sort a VCF file by chromosome and position.

**Approach:** Use bcftools sort with optional temp directory and memory limits for large files.

Sort VCF by chromosome and position.

### Basic Sort

```bash
bcftools sort input.vcf -Oz -o sorted.vcf.gz
```

### With Temporary Directory

For large files:

```bash
bcftools sort -T /tmp input.vcf.gz -Oz -o sorted.vcf.gz
```

### Memory Limit

```bash
bcftools sort -m 4G input.vcf.gz -Oz -o sorted.vcf.gz
```

## bcftools isec

**Goal:** Identify shared and private variants between two or more VCF files.

**Approach:** Use bcftools isec to partition variants into private-to-each-file and shared subsets.

**"Find variants called by both GATK and bcftools"** → Intersect two call sets to identify concordant and discordant variants.

Intersect and compare VCF files.

**Normalize before comparing:** Before intersecting VCF files from different callers, ALWAYS normalize both files first. Different callers may represent the same variant differently (e.g., different indel left-alignment, multiallelic vs biallelic representation). Without normalization, bcftools isec will report false discordance. See variant-normalization for details.

```bash
bcftools norm -m-any -f ref.fa gatk.vcf.gz -Oz -o gatk_norm.vcf.gz
bcftools norm -m-any -f ref.fa bcftools_calls.vcf.gz -Oz -o bcftools_norm.vcf.gz
bcftools isec -p comparison gatk_norm.vcf.gz bcftools_norm.vcf.gz
```

### Find Shared Variants

```bash
bcftools isec -p output_dir sample1.vcf.gz sample2.vcf.gz
```

Creates:
- `0000.vcf` - Private to sample1
- `0001.vcf` - Private to sample2
- `0002.vcf` - Shared (sample1 records)
- `0003.vcf` - Shared (sample2 records)

### Output Compressed

```bash
bcftools isec -p output_dir -Oz sample1.vcf.gz sample2.vcf.gz
```

### Intersection Only

```bash
bcftools isec -p output_dir -n=2 sample1.vcf.gz sample2.vcf.gz
# Only outputs variants present in exactly 2 files
```

### Comparison Options

| Flag | Description |
|------|-------------|
| `-n=2` | Present in exactly 2 files |
| `-n+2` | Present in 2 or more files |
| `-n-2` | Present in fewer than 2 files |
| `-n~11` | Boolean: file1 AND file2 |
| `-n~10` | Boolean: file1 AND NOT file2 |

### Two-File Intersection

The `-w` flag selects which file's records appear in the output. `-w1` outputs records from the first file, `-w2` from the second. This matters when the two files carry different INFO/FORMAT annotations -- the output inherits whichever file's records are selected.

```bash
# Variants in both files, output records from file 1 (with file 1 annotations)
bcftools isec -n=2 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o shared.vcf.gz

# Same intersection, but output records from file 2
bcftools isec -n=2 -w2 sample1.vcf.gz sample2.vcf.gz -Oz -o shared_file2_annot.vcf.gz

# Variants only in sample1
bcftools isec -n~10 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o only_sample1.vcf.gz
```

### Complement Mode

```bash
# Variants in file1 not in file2
bcftools isec -C sample1.vcf.gz sample2.vcf.gz -Oz -o unique.vcf.gz
```

## Subsetting VCF Files

**Goal:** Extract a subset of samples or regions from a multi-sample VCF.

**Approach:** Use bcftools view with -s (samples) or -r/-R (regions) flags to create targeted subsets.

### Extract Samples

```bash
bcftools view -s sample1,sample2 input.vcf.gz -Oz -o subset.vcf.gz
```

### Exclude Samples

```bash
bcftools view -s ^sample3 input.vcf.gz -Oz -o without_sample3.vcf.gz
```

### From Sample List File

```bash
# samples.txt: one sample name per line
bcftools view -S samples.txt input.vcf.gz -Oz -o subset.vcf.gz
```

### Extract Region

```bash
bcftools view -r chr1:1000000-2000000 input.vcf.gz -Oz -o region.vcf.gz
```

### Extract Multiple Regions

```bash
bcftools view -R regions.bed input.vcf.gz -Oz -o targets.vcf.gz
```

## Renaming Samples

**Goal:** Rename sample columns in a VCF header.

**Approach:** Use bcftools reheader with a mapping file of old-to-new sample names.

### Single Sample

```bash
echo "old_name new_name" > rename.txt
bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz
```

### Multiple Samples

```bash
# rename.txt format: old_name new_name
cat > rename.txt << EOF
sample1 patient_001
sample2 patient_002
sample3 patient_003
EOF

bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz
```

## Splitting VCF Files

**Goal:** Split a multi-sample or multi-chromosome VCF into separate files.

**Approach:** Iterate over samples or chromosomes and extract each with bcftools view.

### Split by Sample

```bash
for sample in $(bcftools query -l input.vcf.gz); do
    bcftools view -s "$sample" input.vcf.gz -Oz -o "${sample}.vcf.gz"
done
```

### Split by Chromosome

```bash
for chr in $(bcftools view -h input.vcf.gz | grep "^##contig" | sed 's/.*ID=\([^,]*\).*/\1/'); do
    bcftools view -r "$chr" input.vcf.gz -Oz -o "${chr}.vcf.gz"
done
```

### Split Multiallelic Sites

```bash
bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz
```

## Common Workflows

**Goal:** Execute typical multi-step VCF manipulation tasks.

**Approach:** Chain merge, concat, isec, and view operations for cohort assembly, caller comparison, and filtering.

### Merge Cohort VCFs

```bash
# Create file list
ls *.vcf.gz > files.txt

# Merge all samples
bcftools merge -l files.txt -Oz -o cohort.vcf.gz
bcftools index cohort.vcf.gz
```

### Combine Chromosome VCFs

```bash
# After parallel variant calling by chromosome
bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz -Oz -o genome.vcf.gz
bcftools index genome.vcf.gz
```

### Compare Two Callers

```bash
# Find variants called by both GATK and bcftools
bcftools isec -p comparison gatk.vcf.gz bcftools.vcf.gz

# Count results
wc -l comparison/*.vcf
```

### Extract Passing Variants

```bash
bcftools view -f PASS input.vcf.gz -Oz -o pass_only.vcf.gz
bcftools index pass_only.vcf.gz
```

## cyvcf2 Python Operations

**Goal:** Perform VCF set operations programmatically in Python.

**Approach:** Use cyvcf2 for position-based comparisons and record concatenation; use bcftools merge for true multi-sample merging.

**Note:** True VCF merging (combining samples at matching positions) is complex.
Use `bcftools merge` for production work. cyvcf2 is better for filtering/querying.

### Concatenate Records (Not True Merge)

```python
from cyvcf2 import VCF, Writer

# WARNING: This concatenates records, not a true merge
# For actual merging of samples, use bcftools merge
vcf1 = VCF('file1.vcf.gz')
writer = Writer('combined.vcf', vcf1)

for variant in vcf1:
    writer.write_record(variant)

writer.close()
vcf1.close()
```

### Find Shared Positions

```python
from cyvcf2 import VCF

# Load positions from first VCF
vcf1_positions = set()
for variant in VCF('sample1.vcf.gz'):
    vcf1_positions.add((variant.CHROM, variant.POS))

# Check second VCF
shared = 0
unique = 0
for variant in VCF('sample2.vcf.gz'):
    if (variant.CHROM, variant.POS) in vcf1_positions:
        shared += 1
    else:
        unique += 1

print(f'Shared: {shared}')
print(f'Unique to sample2: {unique}')
```

## Quick Reference

| Task | Command |
|------|---------|
| Merge samples | `bcftools merge s1.vcf.gz s2.vcf.gz -Oz -o merged.vcf.gz` |
| Concat regions | `bcftools concat chr1.vcf.gz chr2.vcf.gz -Oz -o all.vcf.gz` |
| Sort VCF | `bcftools sort input.vcf -Oz -o sorted.vcf.gz` |
| Intersect | `bcftools isec -p dir a.vcf.gz b.vcf.gz` |
| Extract samples | `bcftools view -s sample1 input.vcf.gz` |
| Rename samples | `bcftools reheader -s names.txt input.vcf.gz` |

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `different samples` | merge vs concat confusion | Use merge for samples, concat for regions |
| `not sorted` | Unsorted input to concat | Sort first or use `-a` flag |
| `sample name conflict` | Duplicate sample names | Use `--force-samples` |
| `index required` | Missing index for merge/isec | Run `bcftools index` first |

## Related Skills

- variant-calling/vcf-basics - View and query VCF files
- variant-calling/filtering-best-practices - Filter variants before manipulation
- variant-calling/variant-normalization - Normalize before comparing (critical for isec accuracy)
- variant-calling/vcf-statistics - Compare statistics after manipulation
- variant-calling/variant-calling - Upstream variant discovery that produces input VCFs
More from GPTomics/bioSkills