Automating Variant Filtering with VCFTools: Step-by-Step Examples

Variant Call Format (VCF) files are central to genome analysis pipelines. VCFTools is a widely used suite for filtering, summarizing, and manipulating VCFs. This article shows practical, repeatable steps to automate variant filtering with VCFTools using command-line examples and short scripts so you can integrate them into pipelines.

Prerequisites

  • VCFTools installed (vcftools CLI). Install via package manager or from source.
  • Input VCF (compressed .vcf.gz recommended) and its index (.tbi) if using bgzip/Tabix.
  • Basic shell (bash) familiarity.

Goals and assumptions

  • Remove low-quality variants.
  • Exclude variants with low call rate (high missingness).
  • Filter by minor allele frequency (MAF).
  • Separate SNPs and indels.
  • Produce a final compressed VCF ready for downstream tools.

Assumptions: input file is sample.vcf.gz. Adjust filenames and thresholds as needed.

Step 1 — Prepare files

Ensure the VCF is bgzip-compressed and indexed with Tabix.

bash

bgzip -c sample.vcf > sample.vcf.gz
tabix -p vcf sample.vcf.gz
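To make this step safe to re-run inside a pipeline, the preparation can be wrapped in a small function. The following is a minimal sketch rather than anything VCFTools provides: `prepare_vcf` is a hypothetical helper name, it assumes `bgzip` and `tabix` are on the PATH, and it assumes any existing `.vcf.gz` input was compressed with bgzip rather than plain gzip.

```bash
# prepare_vcf (hypothetical helper): ensure a VCF is compressed and indexed.
# Prints the path of the compressed file on success.
prepare_vcf() {
  local vcf="$1"
  case "$vcf" in
    *.vcf.gz) ;;                                   # assumed already bgzipped
    *.vcf)    bgzip -c "$vcf" > "$vcf.gz" || return 1
              vcf="$vcf.gz" ;;
    *)        echo "unsupported input: $vcf" >&2
              return 1 ;;
  esac
  # Build the Tabix index only if it is missing.
  if [ ! -f "$vcf.tbi" ]; then
    tabix -p vcf "$vcf" || return 1
  fi
  printf '%s\n' "$vcf"
}

# Usage: IN_VCF=$(prepare_vcf sample.vcf)
```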

Step 2 — Basic quality filtering

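Filter by Phred-scaled site quality (QUAL) and genotype quality (GQ). VCFTools applies both site-level and genotype-level filters: use --minQ for site QUAL and --minGQ for genotype GQ. Note that --minGQ does not drop sites; genotypes below the threshold are excluded and treated as missing, which affects any missingness filter applied afterwards.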

bash

vcftools --gzvcf sample.vcf.gz --minQ 30 --minGQ 20 --recode --recode-INFO-all --out filtered_q
bgzip -c filtered_q.recode.vcf > filtered_q.recode.vcf.gz
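A quick sanity check after any filtering step is to compare record counts before and after. The sketch below assumes only `gzip` and `awk` are available; `count_variants` is an illustrative name, not a VCFTools command.

```bash
# count_variants (illustrative helper): number of non-header records
# in a gzip- or bgzip-compressed VCF.
count_variants() {
  gzip -cd "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}

# Usage:
#   before=$(count_variants sample.vcf.gz)
#   after=$(count_variants filtered_q.recode.vcf.gz)
#   echo "kept ${after} of ${before} records"
```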

Step 3 — Filter by missingness (call rate)

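Remove sites with too much missing data. The --max-missing value runs from 0 (sites that are completely missing are allowed) to 1 (no missing data allowed), so --max-missing 0.9 keeps sites with at least a 90% call rate.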

bash

vcftools --gzvcf filtered_q.recode.vcf.gz --max-missing 0.9 --recode --recode-INFO-all --out filtered_q_miss
bgzip -c filtered_q_miss.recode.vcf > filtered_q_miss.recode.vcf.gz
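To see what a call-rate threshold means for a given file, the per-site call rate can be computed directly from the genotype columns. This is a rough sketch for inspection only, assuming GT is the first FORMAT field and treating a genotype as missing when it starts with "."; VCFTools' own accounting may differ in edge cases such as half-calls. The function name `site_call_rate` is made up for illustration.

```bash
# site_call_rate (illustrative): print CHROM, POS and the fraction of
# samples with a called genotype for each record of an uncompressed VCF
# read from stdin.
site_call_rate() {
  awk -F'\t' '
    /^#/ { next }
    {
      called = 0; n = NF - 9            # sample columns start at field 10
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")               # f[1] is the GT value
        if (f[1] !~ /^\./) called++
      }
      if (n > 0) printf "%s\t%s\t%.2f\n", $1, $2, called / n
    }'
}

# Usage: gzip -cd filtered_q.recode.vcf.gz | site_call_rate | head
```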

Step 4 — Filter by minor allele frequency (MAF)

Exclude rare variants below a chosen frequency threshold; for example, --maf 0.01 keeps only sites with a MAF of at least 0.01:

bash

vcftools --gzvcf filtered_q_miss.recode.vcf.gz --maf 0.01 --recode --recode-INFO-all --out filtered_q_miss_maf
bgzip -c filtered_q_miss_maf.recode.vcf > filtered_q_miss_maf.recode.vcf.gz
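For intuition about the MAF threshold, the frequency can be recomputed from the genotypes: count reference and alternate alleles among called genotypes, then take the smaller of the two frequencies. The sketch below handles biallelic sites only and is meant for spot checks, not as a replacement for --maf; `site_maf` is an illustrative name.

```bash
# site_maf (illustrative): minor allele frequency per biallelic record,
# computed from the GT fields of an uncompressed VCF on stdin.
site_maf() {
  awk -F'\t' '
    /^#/ || $5 ~ /,/ { next }           # skip headers and multiallelic sites
    {
      ref = 0; alt = 0
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")               # f[1] is GT, e.g. 0/1 or 1|1
        n = split(f[1], a, "[/|]")
        for (j = 1; j <= n; j++) {
          if (a[j] == "0") ref++
          else if (a[j] == "1") alt++   # missing alleles (.) are ignored
        }
      }
      if (ref + alt > 0) {
        af = alt / (ref + alt)
        printf "%s\t%s\t%.3f\n", $1, $2, (af < 0.5 ? af : 1 - af)
      }
    }'
}

# Usage: gzip -cd filtered_q_miss.recode.vcf.gz | site_maf | head
```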

Step 5 — Separate SNPs and indels

VCFTools can keep only SNPs with --remove-indels, or keep only indels with --keep-only-indels.

bash

# SNPs only
vcftools --gzvcf filtered_q_miss_maf.recode.vcf.gz --remove-indels --recode --recode-INFO-all --out final_snps
bgzip -c final_snps.recode.vcf > final_snps.recode.vcf.gz
# Indels only
vcftools --gzvcf filtered_q_miss_maf.recode.vcf.gz --keep-only-indels --recode --recode-INFO-all --out final_indels
bgzip -c final_indels.recode.vcf > final_indels.recode.vcf.gz
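To check how the split will fall before running it, records can be labeled by comparing REF and ALT allele lengths. This is a simplified sketch: any length difference is called an indel, same-length substitutions are all labeled SNP, and symbolic alleles such as `<DEL>` are not handled. `variant_type` is an illustrative name.

```bash
# variant_type (illustrative): label each record SNP or INDEL by comparing
# REF and ALT allele lengths in an uncompressed VCF read from stdin.
variant_type() {
  awk -F'\t' '
    /^#/ { next }
    {
      type = "SNP"
      n = split($5, alt, ",")           # ALT may list several alleles
      for (i = 1; i <= n; i++) {
        if (length(alt[i]) != length($4)) type = "INDEL"
      }
      printf "%s\t%s\t%s\n", $1, $2, type
    }'
}

# Usage: gzip -cd filtered_q_miss_maf.recode.vcf.gz | variant_type | cut -f3 | sort | uniq -c
```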

Step 6 — Filtering by depth (DP)

Filter sites by mean depth across samples (e.g., --min-meanDP 10 and --max-meanDP 200). These filters rely on the DP tag in the FORMAT field.

bash

vcftools --gzvcf final_snps.recode.vcf.gz --min-meanDP 10 --max-meanDP 200 --recode --recode-INFO-all --out final_snps_dp
bgzip -c final_snps_dp.recode.vcf > final_snps_dp.recode.vcf.gz
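When choosing depth bounds, it helps to look at the distribution of per-site mean depth first. The sketch below averages the DP FORMAT value across samples for each record. It skips samples whose DP is missing, which is an assumption of this example and may not match how VCFTools treats them. `site_mean_dp` is an illustrative name.

```bash
# site_mean_dp (illustrative): mean of the per-sample DP FORMAT value for
# each record of an uncompressed VCF read from stdin.
site_mean_dp() {
  awk -F'\t' '
    /^#/ { next }
    {
      nf = split($9, fmt, ":"); dp = 0
      for (k = 1; k <= nf; k++) if (fmt[k] == "DP") dp = k
      if (dp == 0) next                 # record has no DP FORMAT tag
      sum = 0; n = 0
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")
        if (f[dp] != "" && f[dp] != ".") { sum += f[dp]; n++ }
      }
      if (n > 0) printf "%s\t%s\t%.1f\n", $1, $2, sum / n
    }'
}

# Usage: gzip -cd final_snps.recode.vcf.gz | site_mean_dp | cut -f3 | sort -n | uniq -c
```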

Step 7 — Combine filters in one run (example)

You can combine many filters in a single vcftools invocation to speed up processing:

bash

vcftools --gzvcf sample.vcf.gz --minQ 30 --minGQ 20 --max-missing 0.9 --maf 0.01 --remove-indels --min-meanDP 10 --max-meanDP 200 --recode --recode-INFO-all --out combined_filtered_snps
bgzip -c combined_filtered_snps.recode.vcf > combined_filtered_snps.recode.vcf.gz
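When thresholds change between runs, it can be useful to print the exact combined command for logging or review before executing it. The following sketch assembles the command string from environment variables, falling back to the thresholds used above; `build_filter_cmd` and the variable names are assumptions of this example, and the output is meant for display, so paths containing spaces are not quoted.

```bash
# build_filter_cmd (hypothetical helper): print the combined vcftools
# command for an input VCF ($1) and output prefix ($2) without running it.
build_filter_cmd() {
  echo "vcftools --gzvcf $1" \
    "--minQ ${MINQ:-30} --minGQ ${MINGQ:-20}" \
    "--max-missing ${MAXMISS:-0.9} --maf ${MAF:-0.01}" \
    "--remove-indels" \
    "--min-meanDP ${MINDP:-10} --max-meanDP ${MAXDP:-200}" \
    "--recode --recode-INFO-all" \
    "--out $2"
}

# Usage:
#   build_filter_cmd sample.vcf.gz combined_filtered_snps
#   MAF=0.05 build_filter_cmd sample.vcf.gz combined_maf05
```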

Step 8 — Automate with a shell script

Create a reusable script that takes the input VCF, an output prefix, and the filter thresholds as positional arguments, each with a default value.

bash

#!/usr/bin/env bash
set -euo pipefail

# Positional arguments with defaults
IN_VCF="${1:-sample.vcf.gz}"
OUT_PREFIX="${2:-filtered}"
MINQ="${3:-30}"
MINGQ="${4:-20}"
MAXMISS="${5:-0.9}"
MAF="${6:-0.01}"
MINDP="${7:-10}"
MAXDP="${8:-200}"

vcftools --gzvcf "${IN_VCF}" \
  --minQ "${MINQ}" --minGQ "${MINGQ}" \
  --max-missing "${MAXMISS}" --maf "${MAF}" \
  --remove-indels \
  --min-meanDP "${MINDP}" --max-meanDP "${MAXDP}" \
  --recode --recode-INFO-all \
  --out "${OUT_PREFIX}_snps"

# Compress and index the filtered output
bgzip -c "${OUT_PREFIX}_snps.recode.vcf" > "${OUT_PREFIX}_snps.recode.vcf.gz"
tabix -p vcf "${OUT_PREFIX}_snps.recode.vcf.gz"

Save the script as filter_vcftools.sh, make it executable, and run:

bash

chmod +x filter_vcftools.sh
./filter_vcftools.sh sample.vcf.gz myfiltered 30 20 0.9 0.01 10 200
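To run the script over several inputs, each file needs its own output prefix. One possible approach, sketched here with a hypothetical helper `prefix_for`, is to derive the prefix from the file name; the `data/` directory in the usage comment is likewise only an example.

```bash
# prefix_for (hypothetical helper): derive an output prefix from a VCF path,
# e.g. /data/cohortA.vcf.gz -> cohortA_filtered
prefix_for() {
  local name
  name=$(basename "$1")
  name=${name%.gz}                      # strip .gz if present
  name=${name%.vcf}                     # strip .vcf if present
  printf '%s_filtered\n' "$name"
}

# Usage (assumes filter_vcftools.sh is in the current directory):
#   for vcf in data/*.vcf.gz; do
#     ./filter_vcftools.sh "$vcf" "$(prefix_for "$vcf")" 30 20 0.9 0.01 10 200
#   done
```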

Step 9 — Reporting and QC

VCFTools can produce summary statistics useful for QC:

bash

vcftools --gzvcf sample.vcf.gz --TsTv-summary --out ts_tv
vcftools --gzvcf sample.vcf.gz --depth --out depth_stats
vcftools --gzvcf sample.vcf.gz --missing-site --out missing_site

Inspect outputs (text files) or parse them into plots with R/Python.
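As a small example of parsing these reports on the command line, the sketch below counts sites whose missing fraction exceeds a threshold in the per-site missingness report. It assumes the report (written with a `.lmiss` suffix) has a header row containing an `F_MISS` column, and it locates that column by name rather than position; `count_high_missing` is an illustrative name.

```bash
# count_high_missing (illustrative): count sites in a per-site missingness
# report ($1) whose F_MISS value exceeds a threshold ($2).
count_high_missing() {
  awk -v thr="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "F_MISS") col = i; next }
    col && $col + 0 > thr + 0 { n++ }
    END { print n + 0 }' "$1"
}

# Usage: count_high_missing missing_site.lmiss 0.1
```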

Tips and best practices

  • Keep intermediate files for reproducibility or use a workflow manager (Snakemake/Nextflow).
  • Tune thresholds to cohort size and sequencing platform.
  • For per-sample filters (e.g., sample missingness), use --missing-indv to report missingness per sample, then exclude samples with high missingness (for example with --remove).
  • Consider complementary tools (bcftools, GATK) for complex filters or annotation.
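The per-sample missingness tip can be sketched as follows. The example assumes the per-individual report (written with a `.imiss` suffix) has a header row with `INDV` and `F_MISS` columns; the helper name `high_missing_samples`, the 0.2 threshold, and the output file names are all illustrative choices.

```bash
# high_missing_samples (illustrative): print sample IDs from a per-individual
# missingness report ($1) whose F_MISS exceeds a threshold ($2).
high_missing_samples() {
  awk -v thr="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "INDV") id = i
        if ($i == "F_MISS") fm = i
      }
      next
    }
    id && fm && $fm + 0 > thr + 0 { print $id }' "$1"
}

# Usage:
#   vcftools --gzvcf sample.vcf.gz --missing-indv --out missing_indv
#   high_missing_samples missing_indv.imiss 0.2 > remove_samples.txt
#   vcftools --gzvcf sample.vcf.gz --remove remove_samples.txt \
#     --recode --recode-INFO-all --out sample_kept
```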

Example pipeline outline

  1. bgzip + tabix input
  2. Combined vcftools filtering (quality, missingness, MAF, depth)
  3. Separate SNPs/indels
  4. Index outputs
  5. Run QC summaries
  6. Archive filtered VCFs
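When wiring these stages together in a plain shell script, a small wrapper that logs each step and supports a dry run makes the pipeline easier to debug. This is one possible pattern, not a VCFTools feature; `run_step` and the `DRY_RUN` variable are hypothetical names.

```bash
# run_step (hypothetical helper): log a pipeline step to stderr, then run
# the command unless DRY_RUN=1 is set.
run_step() {
  local label="$1"
  shift
  echo "[step] $label: $*" >&2
  if [ "${DRY_RUN:-0}" = "1" ]; then
    return 0
  fi
  "$@"
}

# Usage:
#   run_step "index input" tabix -p vcf sample.vcf.gz
#   DRY_RUN=1 run_step "filter" ./filter_vcftools.sh sample.vcf.gz myfiltered
```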

This sequence gives a reproducible, automatable approach to variant filtering with VCFTools. Adjust flags and thresholds to match your study design and sequencing characteristics.
