πŸ€– BCH709 AI Assistant: Ask questions about this class using NotebookLM

BCH709 Introduction to Bioinformatics: Resequencing and Variant Calling Tutorial

Paper Reading

Please read this paper before class: Van der Auwera & O’Connor (2020) Genomics in the Cloud (GATK Best Practices). O’Reilly Media

Also see the GATK Best Practices: GATK Best Practices for Germline SNPs & Indels


Overview: Resequencing and Variant Calling

Resequencing (also called whole-genome resequencing, WGS, or whole-exome sequencing, WES) aligns short reads from an individual’s genome back to a reference genome to discover genetic variants β€” differences between the sample and the reference.

Types of Variants

Variant Type Size Description
SNP 1 bp Single nucleotide polymorphism
InDel 1–50 bp Insertion or deletion
CNV kb–Mb Copy number variation
SV > 50 bp Structural variant (inversion, translocation)

Applications

Resequencing vs. RNA-Seq

Feature Resequencing (WGS) RNA-Seq
Input Genomic DNA RNA (cDNA)
Coverage Uniform (genome) Non-uniform (expressed genes)
Aligner BWA-MEM, Bowtie2 HISAT2, STAR (splice-aware)
Output Variants (VCF) Expression counts

The Variant Calling Workflow

Raw FASTQ (DNA reads)
    β”‚
    β”œβ”€β”€ FastQC β†’ MultiQC (QC)
    β”œβ”€β”€ Trim Galore / fastp (trimming)
    β”‚
    β”œβ”€β”€ BWA-MEM2 (alignment to reference genome)
    β”‚       β”‚
    β”‚       └── SAMtools sort + index
    β”‚
    β”œβ”€β”€ Picard MarkDuplicates (remove PCR duplicates)
    β”‚
    β”œβ”€β”€ GATK BaseRecalibrator + ApplyBQSR (BQSR)
    β”‚
    β”œβ”€β”€ GATK HaplotypeCaller (variant calling β†’ GVCF)
    β”‚
    β”œβ”€β”€ GATK GenotypeGVCFs (joint genotyping)
    β”‚
    β”œβ”€β”€ GATK VariantFiltration (hard filtering)
    β”‚
    └── SnpEff / VEP (variant annotation)

1. Environment Setup

Create Conda Environment

conda create -n reseq python=3.11
conda activate reseq

Install Tools

conda install -c bioconda -c conda-forge \
  fastqc trim-galore bwa-mem2 samtools picard gatk4 \
  snpeff multiqc bcftools plink

Verify Installations

bwa-mem2 version
samtools --version
gatk --version

Create Working Directory

mkdir -p ~/bch709/reseq
cd ~/bch709/reseq

2. Quality Control

Download Example Data

For this tutorial, we use a publicly available Arabidopsis thaliana WGS dataset from the 1001 Genomes Project. Download using SRA tools:

conda install -c bioconda sra-tools
cd ~/bch709/reseq

# Arabidopsis accession Col-0 re-sequencing (SRR519585)
fasterq-dump --split-files SRR519585 -O .
mv SRR519585_1.fastq wgs_R1.fastq
mv SRR519585_2.fastq wgs_R2.fastq
gzip wgs_R1.fastq wgs_R2.fastq
ls -lh

Replace SRR519585 with any SRA accession from your experiment. Browse datasets at NCBI SRA.

Run FastQC

fastqc -t 4 wgs_R1.fastq.gz wgs_R2.fastq.gz
multiqc .

Key Metrics for WGS

Metric What to Check
Per-base quality Should be Q30+ across most of the read
GC content Should match species genome GC%
Duplication High duplication may indicate low-complexity library
Adapter content Remove if present

3. Read Trimming

trim_galore \
  --paired \
  --cores 4 \
  --fastqc \
  --gzip \
  -o trim \
  wgs_R1.fastq.gz wgs_R2.fastq.gz

multiqc --dirs ~/bch709/reseq --filename trim

4. Reference Genome Preparation

Download Reference

# Arabidopsis TAIR10 reference (example)
wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas.gz
gunzip TAIR10_chr_all.fas.gz
mv TAIR10_chr_all.fas reference.fasta

Create BWA-MEM2 Index

BWA-MEM2 is the faster successor to BWA-MEM:

bwa-mem2 index reference.fasta

Create FASTA Index and Sequence Dictionary (required by GATK)

samtools faidx reference.fasta
picard CreateSequenceDictionary R=reference.fasta O=reference.dict

5. Alignment with BWA-MEM2

BWA-MEM2 uses the Burrows-Wheeler Transform (BWT) + FM-index for short-read alignment.

Why BWA-MEM2 for DNA (not HISAT2)?

Feature BWA-MEM2 HISAT2/STAR
Splice-aware No Yes
DNA reads Optimal Not designed for
RNA reads Suboptimal Optimal

Align Reads

The @RG (read group) tag is required by GATK:

bwa-mem2 mem \
  -t 8 \
  -R "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1\tPU:unit1" \
  reference.fasta \
  trim/wgs_R1_val_1.fq.gz \
  trim/wgs_R2_val_2.fq.gz \
  | samtools sort -@ 4 -o sample1.bam

samtools index sample1.bam
samtools flagstat sample1.bam

Alignment Statistics

samtools stats sample1.bam > sample1.stats
multiqc .
Metric What to Expect
Mapping rate > 95% for matched reference
Properly paired > 90%
Average depth 10–30x for population WGS; 30–60x for clinical

6. Mark Duplicates

PCR amplification creates duplicate reads. For variant calling, these must be marked (not just discarded β€” GATK handles them).

Run Picard MarkDuplicates

picard MarkDuplicates \
  I=sample1.bam \
  O=sample1.markdup.bam \
  M=sample1.markdup.metrics \
  VALIDATION_STRINGENCY=SILENT

samtools index sample1.markdup.bam

Check Duplication Rate

cat sample1.markdup.metrics
multiqc .

Note: Duplication rates > 30–40% may indicate problems with input DNA quality or library complexity. For very high duplication, use a PCR-free library prep.


7. Base Quality Score Recalibration (BQSR)

GATK BQSR corrects systematic errors in base quality scores from the sequencer. It requires a known variants VCF (e.g., dbSNP).

Download Known Variants

BQSR requires a VCF of known polymorphic sites to distinguish true variants from sequencing errors.

# For Arabidopsis: download from the 1001 Genomes Project
wget https://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_only_ACGTN.vcf.gz
mv 1001genomes_snp-short-indel_only_ACGTN.vcf.gz known_variants.vcf.gz
bgzip -d known_variants.vcf.gz
gatk IndexFeatureFile -I known_variants.vcf

# For human (hg38): download dbSNP from GATK resource bundle
# wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf

No known variants available? For less-characterized species, skip BQSR or use an iterative β€œbootstrap” approach: call variants with HaplotypeCaller, use those as known sites, then re-run BQSR.

Step 1: Compute Recalibration Table

gatk BaseRecalibrator \
  -I sample1.markdup.bam \
  -R reference.fasta \
  --known-sites known_variants.vcf \
  -O sample1.recal.table \
  --verbosity ERROR

Step 2: Apply Recalibration

gatk ApplyBQSR \
  -I sample1.markdup.bam \
  -R reference.fasta \
  --bqsr-recal-file sample1.recal.table \
  -O sample1.recal.bam

samtools index sample1.recal.bam

Note for model organisms without known variants: The BQSR β€œbootstrap” approach: (1) Call raw variants with HaplotypeCaller β†’ hard-filter β†’ use as known sites β†’ re-run BQSR β†’ call final variants. Two rounds are usually sufficient.


8. Variant Calling with GATK HaplotypeCaller

GATK HaplotypeCaller performs local assembly of haplotypes to call SNPs and indels simultaneously.

Call Variants (per-sample GVCF mode)

Using GVCF mode allows joint genotyping across multiple samples β€” recommended for multi-sample projects:

gatk HaplotypeCaller \
  -R reference.fasta \
  -I sample1.recal.bam \
  -O sample1.g.vcf.gz \
  -ERC GVCF \
  --native-pair-hmm-threads 4

Joint Genotyping (for multiple samples)

# Combine GVCFs (if multiple samples)
gatk CombineGVCFs \
  -R reference.fasta \
  -V sample1.g.vcf.gz \
  -V sample2.g.vcf.gz \
  -O cohort.g.vcf.gz

# Genotype
gatk GenotypeGVCFs \
  -R reference.fasta \
  -V cohort.g.vcf.gz \
  -O cohort.vcf.gz

9. Variant Filtering

Separate SNPs and Indels

# Extract SNPs
gatk SelectVariants \
  -R reference.fasta \
  -V cohort.vcf.gz \
  --select-type-to-include SNP \
  -O cohort.snps.vcf.gz

# Extract Indels
gatk SelectVariants \
  -R reference.fasta \
  -V cohort.vcf.gz \
  --select-type-to-include INDEL \
  -O cohort.indels.vcf.gz

Hard Filter SNPs (GATK Best Practices)

gatk VariantFiltration \
  -R reference.fasta \
  -V cohort.snps.vcf.gz \
  --filter-expression "QD < 2.0" --filter-name "QD2" \
  --filter-expression "FS > 60.0" --filter-name "FS60" \
  --filter-expression "MQ < 40.0" --filter-name "MQ40" \
  --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
  --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
  -O cohort.snps.filtered.vcf.gz

Hard Filter Indels

gatk VariantFiltration \
  -R reference.fasta \
  -V cohort.indels.vcf.gz \
  --filter-expression "QD < 2.0" --filter-name "QD2" \
  --filter-expression "FS > 200.0" --filter-name "FS200" \
  --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
  -O cohort.indels.filtered.vcf.gz

Key GATK Filters Explained

Filter Field Threshold (SNP) Meaning
QD Quality by Depth < 2.0 Normalized variant quality
FS Fisher Strand Bias > 60.0 Strand bias test
MQ RMS Mapping Quality < 40.0 Average mapping quality
MQRankSum MQ Rank Sum < -12.5 Comparison of MQ for ref/alt
ReadPosRankSum Read Position RS < -8.0 Position bias within reads

10. VCF Format

VCF File Structure

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
#CHROM  POS     ID      REF  ALT  QUAL  FILTER  INFO              FORMAT     SAMPLE1
Chr1    12345   .       A    T    220   PASS    DP=45;AF=0.5;...  GT:AD:DP   0/1:22,23:45

Key VCF Fields

Field Description
CHROM Chromosome
POS Position (1-based)
REF Reference allele
ALT Alternate allele(s)
QUAL Variant quality score
FILTER PASS or filter names
INFO Variant annotations
FORMAT Sample field format
GT Genotype (0/0=hom-ref, 0/1=het, 1/1=hom-alt)

Basic VCF Manipulation with BCFtools

# Count variants
bcftools stats cohort.snps.filtered.vcf.gz | grep "^SN"

# Filter PASS variants only
bcftools view -f PASS cohort.snps.filtered.vcf.gz -O z -o cohort.snps.pass.vcf.gz

# Extract specific region
bcftools view cohort.snps.pass.vcf.gz Chr1:100000-200000

# Get variant summary
bcftools stats cohort.snps.pass.vcf.gz > stats.txt
multiqc .

11. Variant Annotation with SnpEff

SnpEff predicts the functional effect of each variant (missense, nonsense, synonymous, etc.).

Setup SnpEff Database

# List available databases
snpEff databases | grep -i arabidopsis

# Download database (example: Arabidopsis TAIR10)
snpEff download athalianaTair10

Annotate Variants

snpEff \
  -v athalianaTair10 \
  cohort.snps.pass.vcf.gz \
  > cohort.snps.annotated.vcf

# View annotation summary
cat snpEff_summary.html

Variant Effect Categories

Effect Description
HIGH Frameshift, stop gained/lost, splice site
MODERATE Missense, in-frame InDel
LOW Synonymous, splice region
MODIFIER Intergenic, intronic, upstream/downstream

12. Population-Level Analysis

For multiple samples, use population genomics tools:

PCA (Principal Component Analysis)

# Convert VCF to PLINK format
plink --vcf cohort.snps.pass.vcf.gz \
  --make-bed \
  --out cohort \
  --allow-extra-chr

# Run PCA
plink --bfile cohort \
  --pca 10 \
  --out cohort.pca \
  --allow-extra-chr

Linkage Disequilibrium Pruning

Before running GWAS or population structure analyses, prune variants in LD to avoid biasing results:

plink --bfile cohort \
  --indep-pairwise 50 10 0.2 \
  --out cohort.ld \
  --allow-extra-chr

plink --bfile cohort \
  --extract cohort.ld.prune.in \
  --make-bed \
  --out cohort.pruned \
  --allow-extra-chr

GWAS

For GWAS workflows, see the GWAS tutorial.


13. Full Workflow Summary

Step Tool Input Output
QC FastQC + MultiQC FASTQ HTML report
Trim Trim Galore FASTQ Trimmed FASTQ
Align BWA-MEM2 FASTQ + Reference BAM
Sort/Index SAMtools BAM Sorted BAM
Mark duplicates Picard BAM Markdup BAM
BQSR GATK BAM + known VCF Recal BAM
Call variants GATK HaplotypeCaller BAM GVCF
Joint genotype GATK GenotypeGVCFs GVCFs VCF
Filter GATK VariantFiltration VCF Filtered VCF
Annotate SnpEff VCF Annotated VCF

14. Cleanup

conda deactivate
# Optional: remove environment when done
conda env remove --name reseq

References

Resource Link
GATK Best Practices broadinstitute.github.io/gatk
BWA-MEM2 paper Md et al. 2019, iScience
SAM flag decoder Picard SAM Flags
VCF specification samtools.github.io/hts-specs
SnpEff documentation pcingola.github.io/SnpEff
BCFtools manual samtools.github.io/bcftools