Automating Variant Filtering with VCFTools: Step-by-Step Examples
Variant Call Format (VCF) files are central to genome analysis pipelines. VCFTools is a widely used suite for filtering, summarizing, and manipulating VCFs. This article shows practical, repeatable steps to automate variant filtering with VCFTools using command-line examples and short scripts so you can integrate them into pipelines.
Prerequisites
- VCFTools installed (vcftools CLI). Install via package manager or from source.
- Input VCF (compressed .vcf.gz recommended) and its index (.tbi) if using bgzip/Tabix.
- Basic shell (bash) familiarity.
Goals and assumptions
- Remove low-quality variants.
- Exclude variants with low call rate (high missingness).
- Filter by minor allele frequency (MAF).
- Separate SNPs and indels.
- Produce a final compressed VCF ready for downstream tools.
Assumptions: input file is sample.vcf.gz. Adjust filenames and thresholds as needed.
Step 1 — Prepare files
Ensure VCF is bgzipped and indexed (Tabix).
bash
bgzip -c sample.vcf > sample.vcf.gz
tabix -p vcf sample.vcf.gz
Step 2 — Basic quality filtering
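Filter by Phred-scaled site quality (QUAL) and genotype quality (GQ). VCFTools applies both site-level and genotype-level filters: --minQ drops sites whose QUAL is below the threshold, while --minGQ sets individual genotypes with GQ below the threshold to missing rather than removing the site.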
bash
vcftools --gzvcf sample.vcf.gz --minQ 30 --minGQ 20 --recode --recode-INFO-all --out filtered_q
bgzip -c filtered_q.recode.vcf > filtered_q.recode.vcf.gz
Step 3 — Filter by missingness (call rate)
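Remove sites with too much missing data. The --max-missing value is the minimum fraction of called genotypes a site must have (0 allows fully missing sites, 1 allows none), so --max-missing 0.9 keeps sites with a call rate of at least 90%.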
bash
vcftools --gzvcf filtered_q.recode.vcf.gz --max-missing 0.9 --recode --recode-INFO-all --out filtered_q_miss
bgzip -c filtered_q_miss.recode.vcf > filtered_q_miss.recode.vcf.gz
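To make the call-rate threshold concrete, here is a minimal sketch that computes the per-site call rate directly from genotype fields with awk. The function name site_call_rate and the toy records are illustrative and not part of VCFTools; the sketch assumes tab-separated records with GT as the first FORMAT subfield, and it treats any genotype containing "." as missing.

```shell
# Print CHROM, POS and call rate (fraction of non-missing genotypes) per site.
# Assumes GT is the first FORMAT subfield; any GT containing "." is counted
# as missing, which is a simplification for half-called genotypes.
site_call_rate() {
  awk -F'\t' '
    /^#/ { next }                        # skip header lines
    NF > 9 {
      called = 0
      for (i = 10; i <= NF; i++) {       # sample columns start at field 10
        split($i, f, ":")                # f[1] holds the GT subfield
        if (f[1] !~ /\./) called++
      }
      printf "%s\t%s\t%.2f\n", $1, $2, called / (NF - 9)
    }' "$@"
}

# Toy records (hypothetical): four samples, one missing call at the first site.
{
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t./.\t1/1\n'
  printf 'chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT\t0/1\t0/0\t0/0\t1/1\n'
} | site_call_rate
```

In this toy input the first site has a call rate of 0.75 and would be dropped by a 0.9 threshold, while the second site (1.00) would be kept.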
Step 4 — Filter by minor allele frequency (MAF)
Exclude rare variants below desired frequency, e.g., MAF < 0.01:
bash
vcftools --gzvcf filtered_q_miss.recode.vcf.gz --maf 0.01 --recode --recode-INFO-all --out filtered_q_miss_maf
bgzip -c filtered_q_miss_maf.recode.vcf > filtered_q_miss_maf.recode.vcf.gz
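A quick worked example helps when choosing a MAF cutoff for a given cohort size. The sketch below computes a minor allele frequency from an alternate allele count and the number of called alleles; the helper name maf_from_counts is hypothetical, and it assumes biallelic sites with the frequency taken over called alleles only.

```shell
# Minor allele frequency from an alternate allele count (AC) and the number
# of called alleles (AN). Assumes a biallelic site; prints four decimals,
# or "NA" when no alleles were called.
maf_from_counts() {
  awk -v ac="$1" -v an="$2" 'BEGIN {
    if (an <= 0) { print "NA"; exit }
    minor = (ac < an - ac) ? ac : an - ac   # the rarer of ALT and REF counts
    printf "%.4f\n", minor / an
  }'
}

# 100 diploid samples with no missing calls give AN = 200.
maf_from_counts 1 200     # singleton
maf_from_counts 3 200     # three ALT alleles
maf_from_counts 190 200   # ALT is the major allele, so REF is the minor one
```

With 100 fully called diploid samples, a singleton has a MAF of 0.005 and falls below a 0.01 cutoff, whereas three alternate alleles give 0.015 and pass. In small cohorts a 0.01 cutoff therefore mostly acts as a singleton filter.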
Step 5 — Separate SNPs and indels
VCFTools keeps only SNPs with --remove-indels and only indels with --keep-only-indels.
bash
# SNPs only
vcftools --gzvcf filtered_q_miss_maf.recode.vcf.gz --remove-indels --recode --recode-INFO-all --out final_snps
bgzip -c final_snps.recode.vcf > final_snps.recode.vcf.gz
# Indels only
vcftools --gzvcf filtered_q_miss_maf.recode.vcf.gz --keep-only-indels --recode --recode-INFO-all --out final_indels
bgzip -c final_indels.recode.vcf > final_indels.recode.vcf.gz
Step 6 — Filtering by depth (DP)
Filter sites by mean depth across samples, e.g., --min-meanDP 10 and --max-meanDP 200.
bash
vcftools --gzvcf final_snps.recode.vcf.gz --min-meanDP 10 --max-meanDP 200 --recode --recode-INFO-all --out final_snps_dp
bgzip -c final_snps_dp.recode.vcf > final_snps_dp.recode.vcf.gz
Step 7 — Combine filters in one run (example)
You can combine many filters in a single vcftools invocation to speed up processing:
bash
vcftools --gzvcf sample.vcf.gz --minQ 30 --minGQ 20 --max-missing 0.9 --maf 0.01 --remove-indels --min-meanDP 10 --max-meanDP 200 --recode --recode-INFO-all --out combined_filtered_snps
bgzip -c combined_filtered_snps.recode.vcf > combined_filtered_snps.recode.vcf.gz
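When several filters run in one pass, it is easy to lose track of how many records survived. The following sketch counts the non-header records in a plain or gzip-compatible VCF so that a script can log the figure before and after filtering. The function name count_variants is an assumption for illustration; it relies on bgzip output being readable by gzip -cd.

```shell
# Count variant records (non-header lines) in a VCF.
# Files ending in .gz are decompressed with gzip -cd, which also reads
# bgzip-compressed files; anything else is read as plain text.
count_variants() {
  case "$1" in
    *.gz) gzip -cd "$1" ;;
    *)    cat "$1" ;;
  esac | awk '!/^#/ { n++ } END { print n + 0 }'
}

# Toy demonstration with a temporary two-record file (hypothetical data).
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nchr1\t100\nchr1\t200\n' > "$demo_vcf"
count_variants "$demo_vcf"
rm -f "$demo_vcf"
```

In a pipeline, calling it on sample.vcf.gz and on the compressed filtered output gives a simple before-and-after count for the log.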
Step 8 — Automate with a shell script
Create a reusable script that takes input VCF and parameters.
bash
#!/usr/bin/env bash
set -euo pipefail

# Positional arguments with defaults
IN_VCF=${1:-sample.vcf.gz}
OUT_PREFIX=${2:-filtered}
MINQ=${3:-30}
MINGQ=${4:-20}
MAXMISS=${5:-0.9}
MAF=${6:-0.01}
MINDP=${7:-10}
MAXDP=${8:-200}

# Apply all site and genotype filters in one pass, keeping SNPs only
vcftools --gzvcf "${IN_VCF}" \
  --minQ "${MINQ}" --minGQ "${MINGQ}" \
  --max-missing "${MAXMISS}" --maf "${MAF}" \
  --remove-indels \
  --min-meanDP "${MINDP}" --max-meanDP "${MAXDP}" \
  --recode --recode-INFO-all \
  --out "${OUT_PREFIX}_snps"

# Compress and index the result
bgzip -c "${OUT_PREFIX}_snps.recode.vcf" > "${OUT_PREFIX}_snps.recode.vcf.gz"
tabix -p vcf "${OUT_PREFIX}_snps.recode.vcf.gz"
Save the script as filter_vcftools.sh, make it executable, and run:
bash
chmod +x filter_vcftools.sh
./filter_vcftools.sh sample.vcf.gz myfiltered 30 20 0.9 0.01 10 200
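Because the thresholds arrive as positional arguments, a typo such as swapping the depth bounds can silently produce an empty output. One possible guard, sketched below, validates the values before vcftools is called. The helper names (is_fraction, is_nonneg_number, validate_params) are hypothetical additions, not part of the script above or of VCFTools.

```shell
# True if $1 is a decimal number in the closed interval [0, 1].
is_fraction() {
  awk -v x="$1" 'BEGIN { exit !(x ~ /^(0(\.[0-9]+)?|1(\.0+)?)$/) }'
}

# True if $1 is a non-negative integer or decimal number.
is_nonneg_number() {
  awk -v x="$1" 'BEGIN { exit !(x ~ /^[0-9]+(\.[0-9]+)?$/) }'
}

# Usage: validate_params MAXMISS MAF MINDP MAXDP
# Returns 0 if all values look sane, 1 (with a message on stderr) otherwise.
validate_params() {
  is_fraction "$1" || { echo "max-missing must be in [0,1]: $1" >&2; return 1; }
  is_fraction "$2" || { echo "maf must be in [0,1]: $2" >&2; return 1; }
  is_nonneg_number "$3" || { echo "min depth is not a number: $3" >&2; return 1; }
  is_nonneg_number "$4" || { echo "max depth is not a number: $4" >&2; return 1; }
  awk -v lo="$3" -v hi="$4" 'BEGIN { exit !(lo + 0 <= hi + 0) }' ||
    { echo "min depth $3 exceeds max depth $4" >&2; return 1; }
}

validate_params 0.9 0.01 10 200 && echo "parameters look valid"
```

In the script, a call such as validate_params "${MAXMISS}" "${MAF}" "${MINDP}" "${MAXDP}" placed before the vcftools command would stop the run early under set -e if any value is out of range.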
Step 9 — Reporting and QC
VCFTools can produce summary statistics useful for QC:
bash
vcftools --gzvcf sample.vcf.gz --TsTv-summary --out ts_tv
vcftools --gzvcf sample.vcf.gz --depth --out depth_stats
vcftools --gzvcf sample.vcf.gz --missing-site --out missing_site
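Each command writes a plain-text table (ts_tv.TsTv.summary, depth_stats.idepth, and missing_site.lmiss, respectively); inspect them directly or parse them into plots with R/Python.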
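For a quick look without leaving the shell, the summary tables can be reduced with awk. The sketch below averages a named column, looking it up in the header row instead of hard-coding its position. The function name column_mean and the toy table are illustrative; the column name MEAN_DEPTH is assumed to match the header of the .idepth file.

```shell
# Usage: column_mean FILE COLUMN_NAME
# Prints the mean of the named column (three decimals) from a
# whitespace-separated table with a header row. Exits with status 2
# if the column is not present in the header.
column_mean() {
  awk -v col="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) c = i
      if (!c) { print "column not found: " col > "/dev/stderr"; exit 2 }
      next
    }
    { sum += $c; n++ }
    END { if (n) printf "%.3f\n", sum / n }
  ' "$1"
}

# Toy table in the shape of a per-individual depth report (hypothetical data).
demo=$(mktemp)
printf 'INDV\tN_SITES\tMEAN_DEPTH\nS1\t1000\t12.5\nS2\t1000\t30.0\nS3\t1000\t17.5\n' > "$demo"
column_mean "$demo" MEAN_DEPTH
rm -f "$demo"
```

The same function applied to the F_MISS column of missing_site.lmiss gives the average per-site missingness, which is a useful sanity check before settling on a missingness threshold.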
Tips and best practices
- Keep intermediate files for reproducibility or use a workflow manager (Snakemake/Nextflow).
- Tune thresholds to cohort size and sequencing platform.
- For per-sample filters (e.g., sample missingness), run --missing-indv and drop samples with high missingness using --remove or --remove-indv.
- Consider complementary tools (bcftools, GATK) for complex filters or annotation.
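The per-sample missingness tip can be automated with a short awk filter. The sketch below lists the samples whose missing fraction exceeds a cutoff, reading the INDV and F_MISS columns by header name. The function name high_missing_samples and the toy table are illustrative; the column names are assumed to match the .imiss report written by --missing-indv.

```shell
# Usage: high_missing_samples IMISS_FILE MAX_F_MISS
# Prints one sample ID per line for samples whose F_MISS is strictly
# greater than MAX_F_MISS. Columns are located by header name.
high_missing_samples() {
  awk -v max="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "INDV")   id = i
        if ($i == "F_MISS") fm = i
      }
      next
    }
    id && fm && ($fm + 0) > (max + 0) { print $id }
  ' "$1"
}

# Toy report (hypothetical data): only S2 exceeds a 10% cutoff.
demo=$(mktemp)
cat > "$demo" <<'EOF'
INDV	N_DATA	N_GENOTYPES_FILTERED	N_MISS	F_MISS
S1	1000	0	50	0.05
S2	1000	0	300	0.3
S3	1000	0	100	0.1
EOF
high_missing_samples "$demo" 0.1
rm -f "$demo"
```

Redirecting the output to a file gives a sample list that can be passed to vcftools with --remove in the next filtering run.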
Example pipeline outline
- bgzip + tabix input
- Combined vcftools filtering (quality, missingness, MAF, depth)
- Separate SNPs/indels
- Index outputs
- Run QC summaries
- Archive filtered VCFs
This sequence gives a reproducible, automatable approach to variant filtering with VCFTools. Adjust flags and thresholds to match your study design and sequencing characteristics.