Resequencing and Variant Calling Tutorial – BCH709 Introduction to Bioinformatics

Paper Reading

Please read this paper before class: Van der Auwera & O’Connor (2020) Genomics in the Cloud (GATK Best Practices). O’Reilly Media

Also see the GATK Best Practices: GATK Best Practices for Germline SNPs & Indels

Overview: Resequencing and Variant Calling

Resequencing (also called whole-genome resequencing, WGS, or whole-exome sequencing, WES) aligns short reads from an individual’s genome back to a reference genome to discover genetic variants — differences between the sample and the reference.

Types of Variants

Variant Type	Size	Description
SNP	1 bp	Single nucleotide polymorphism
InDel	1–50 bp	Insertion or deletion
CNV	kb–Mb	Copy number variation
SV	> 50 bp	Structural variant (inversion, translocation)

Applications

Population genetics / GWAS
Disease gene discovery (human genetics)
Crop breeding (QTL, marker-assisted selection)
Cancer genomics (somatic mutation detection)
Phylogenomics (SNP-based trees)

Resequencing vs. RNA-Seq

Feature	Resequencing (WGS)	RNA-Seq
Input	Genomic DNA	RNA (cDNA)
Coverage	Uniform (genome)	Non-uniform (expressed genes)
Aligner	BWA-MEM2, Bowtie2	HISAT2, STAR (splice-aware)
Output	Variants (VCF)	Expression counts

The Variant Calling Workflow

Resequencing and Variant Calling Workflow

0. Prerequisites

Disk Space: This tutorial requires at least 30 GB of free disk space. Each sample generates ~5 GB of intermediate files (BAM, markdup BAM, recal BAM, GVCF). Check your available space before starting:
df -h .
If disk space is limited, follow the cleanup tips marked with the label [Cleanup] after each major step.

1. Environment Setup

Create Conda Environment

conda create -n reseq -c bioconda -c conda-forge python=3.11
conda activate reseq

conda install -c bioconda -c conda-forge fastqc fastp bwa-mem2 samtools
conda install -c bioconda -c conda-forge openjdk=17 picard gatk4 bcftools
conda install -c bioconda -c conda-forge snpeff plink tabix
pip install multiqc

# Clean conda package cache to free disk space
conda clean --all -y

Verify Installations

fastp --version
bwa-mem2 version
samtools --version
gatk --version
bcftools --version

Common Error — libcrypto.so.1.0.0: cannot open shared object file
samtools: error while loading shared libraries: libcrypto.so.1.0.0:
cannot open shared object file: No such file or directory
Cause: OpenSSL version mismatch between samtools/bcftools and conda environment. Fix: Check which libcrypto file exists and create a symlink:
ls ${CONDA_PREFIX}/lib/libcrypto.so.*
# Use the actual version you see (often libcrypto.so.3 or libcrypto.so.1.1)
ln -s ${CONDA_PREFIX}/lib/libcrypto.so.3 ${CONDA_PREFIX}/lib/libcrypto.so.1.0.0

Common Error — Conda install hangs or fails with “Solving environment”

Cause: Default conda solver is slow and can get stuck resolving dependencies. Fix: Use mamba or the libmamba solver:
conda install -n base -c conda-forge mamba
# Then use `mamba install` instead of `conda install`

Create Working Directory

mkdir -p ~/bch709/reseq
cd ~/bch709/reseq

2. Quality Control

Download Example Data

For this tutorial, we use publicly available Arabidopsis thaliana WGS datasets from the 1001 Genomes Project. We download two samples so that we can demonstrate joint genotyping later.

Note: The SRA Toolkit (fasterq-dump, fastq-dump) is known to cause segmentation faults on WSL (Windows Subsystem for Linux). Downloading FASTQ files directly from ENA avoids this issue entirely.

Option A — Download from ENA (European Nucleotide Archive):

cd ~/bch709/reseq

# Sample 1: Col-0 re-sequencing (SRR519585)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519585/SRR519585_1.fastq.gz -O sample1_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519585/SRR519585_2.fastq.gz -O sample1_R2.fastq.gz

# Sample 2: TBO-01 re-sequencing (SRR519586)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519586/SRR519586_1.fastq.gz -O sample2_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519586/SRR519586_2.fastq.gz -O sample2_R2.fastq.gz

ls -lh

Option B — Download from Dropbox mirror (if ENA is slow or fails):

cd ~/bch709/reseq

# Sample 1
wget "https://www.dropbox.com/scl/fi/8fy6hrhczh7v23ojqox5r/wgs_R1.fastq.gz?rlkey=nmanzksfbtm76xer4jesq2vuy&dl=1" -O sample1_R1.fastq.gz
wget "https://www.dropbox.com/scl/fi/kt29bt8c29vf9i294xeek/wgs_R2.fastq.gz?rlkey=8xm8lk839jqcz2qfjlzx4gqle&dl=1" -O sample1_R2.fastq.gz

# Sample 2 — download from ENA (no Dropbox mirror)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519586/SRR519586_1.fastq.gz -O sample2_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519586/SRR519586_2.fastq.gz -O sample2_R2.fastq.gz

ls -lh

Dropbox direct download tip: Add &dl=1 to the end of a Dropbox share URL, and wrap the URL in double quotes to prevent the shell from interpreting & and ? characters.

Common Error — Dropbox download gives tiny HTML file instead of FASTQ

Symptom: downloaded file is only a few KB and running file on it shows HTML document. Cause: Missed the &dl=1 at the end of the URL, or forgot to quote the URL. Fix:
# WRONG — shell interprets &
wget https://www.dropbox.com/.../file.fastq.gz?rlkey=...&dl=1
# RIGHT — quote the URL
wget "https://www.dropbox.com/.../file.fastq.gz?rlkey=...&dl=1" -O file.fastq.gz

Common Error — fasterq-dump: segmentation fault
fasterq-dump: Segmentation fault (core dumped)
Cause: Known SRA Toolkit bug on WSL (Windows Subsystem for Linux). Fix: Skip the SRA Toolkit — download FASTQ directly from ENA using wget as shown above.

You can find ENA download links for any SRA accession at ENA Browser or NCBI SRA.

Run FastQC

# -t 4 : use 4 threads
fastqc -t 4 sample1_R1.fastq.gz sample1_R2.fastq.gz sample2_R1.fastq.gz sample2_R2.fastq.gz

# Aggregate all QC reports into one
multiqc .

Key Metrics for WGS

Metric	What to Check
Per-base quality	Should be Q30+ across most of the read
GC content	Should match species genome GC%
Duplication	High duplication may indicate low-complexity library
Adapter content	Remove if present

3. Read Trimming

fastp performs adapter removal, quality trimming, and QC in a single pass — faster and simpler than running separate tools.

Option	Description
`--in1` / `--in2`	Forward / reverse input reads
`--out1` / `--out2`	Forward / reverse output reads
`--detect_adapter_for_pe`	Auto-detect adapters for paired-end data
`--qualified_quality_phred 20`	Minimum base quality threshold (Q20)
`--length_required 50`	Discard reads shorter than 50 bp
`--thread 4`	Number of threads
`--html` / `--json`	QC report outputs (JSON is used by MultiQC)

mkdir -p trim

# Trim sample 1
fastp \
  --in1 sample1_R1.fastq.gz \
  --in2 sample1_R2.fastq.gz \
  --out1 trim/sample1_R1_trimmed.fq.gz \
  --out2 trim/sample1_R2_trimmed.fq.gz \
  --detect_adapter_for_pe \
  --qualified_quality_phred 20 \
  --length_required 50 \
  --thread 4 \
  --html trim/sample1_fastp_report.html \
  --json trim/sample1_fastp_report.json

# Trim sample 2
fastp \
  --in1 sample2_R1.fastq.gz \
  --in2 sample2_R2.fastq.gz \
  --out1 trim/sample2_R1_trimmed.fq.gz \
  --out2 trim/sample2_R2_trimmed.fq.gz \
  --detect_adapter_for_pe \
  --qualified_quality_phred 20 \
  --length_required 50 \
  --thread 4 \
  --html trim/sample2_fastp_report.html \
  --json trim/sample2_fastp_report.json

multiqc trim/ -n trim_report

[Cleanup] After trimming, you can remove the original FASTQ files to save disk space:
rm -f sample1_R1.fastq.gz sample1_R2.fastq.gz sample2_R1.fastq.gz sample2_R2.fastq.gz

4. Reference Genome Preparation

Download Reference

# Arabidopsis TAIR10 reference from Ensembl Plants
wget https://ftp.ensemblgenomes.org/pub/plants/release-60/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz --no-check-certificate
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
mv Arabidopsis_thaliana.TAIR10.dna.toplevel.fa reference.fasta

Note: The Ensembl TAIR10 reference uses chromosome names 1, 2, 3, 4, 5, Mt, Pt (not Chr1, Chr2, etc.). Keep this in mind when specifying genomic regions in later steps.

Create BWA-MEM2 Index

# Generates .0123, .amb, .ann, .bwt.2bit.64, .pac
bwa-mem2 index reference.fasta

Create FASTA Index and Sequence Dictionary (required by GATK)

# .fai index for samtools/GATK
samtools faidx reference.fasta

# .dict sequence dictionary for GATK
picard CreateSequenceDictionary R=reference.fasta O=reference.dict

Common Error — reference.dict already exists
Exception in thread "main" picard.PicardException: /path/reference.dict already exists.
Delete this file and try again, or specify a different output file.
Cause: Running CreateSequenceDictionary a second time without removing the old .dict. Fix:
rm -f reference.dict
picard CreateSequenceDictionary R=reference.fasta O=reference.dict

Common Error — BWA-MEM2 index files missing
ERROR! Can't open index file reference.fasta.0123
Cause: Index not built, or index was interrupted (disk full, killed process). Fix: Delete any partial index files and re-run:
rm -f reference.fasta.0123 reference.fasta.amb reference.fasta.ann \
      reference.fasta.bwt.2bit.64 reference.fasta.pac
bwa-mem2 index reference.fasta

5. Alignment with BWA-MEM2

BWA-MEM2 uses the Burrows-Wheeler Transform (BWT) + FM-index for short-read alignment.

Why BWA-MEM2 for DNA (not HISAT2)?

Feature	BWA-MEM2	HISAT2/STAR
Splice-aware	No	Yes
DNA reads	Optimal	Not designed for
RNA reads	Suboptimal	Optimal

Align Reads

The @RG (read group) tag is required by GATK.

@RG Field	Description
`ID`	Read group ID
`SM`	Sample name (GATK uses this to identify samples)
`PL`	Sequencing platform (e.g., ILLUMINA)
`LB`	Library name
`PU`	Platform unit (e.g., flowcell-barcode.lane)

# Align sample 1
bwa-mem2 mem \
  -t 4 \
  -R "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1\tPU:unit1" \
  reference.fasta \
  trim/sample1_R1_trimmed.fq.gz \
  trim/sample1_R2_trimmed.fq.gz \
  | samtools sort -@ 4 -o sample1.bam

samtools index sample1.bam

# Align sample 2
bwa-mem2 mem \
  -t 4 \
  -R "@RG\tID:sample2\tSM:sample2\tPL:ILLUMINA\tLB:lib2\tPU:unit2" \
  reference.fasta \
  trim/sample2_R1_trimmed.fq.gz \
  trim/sample2_R2_trimmed.fq.gz \
  | samtools sort -@ 4 -o sample2.bam

samtools index sample2.bam

Common Error — BAM file ... does not have any read groups
A USER ERROR has occurred: SAM/BAM/CRAM file <file> does not have any read groups defined
Cause: Forgot the -R "@RG\t..." flag in bwa-mem2 mem. GATK requires a read group. Fix: Re-run alignment with the -R flag. Use literal \t (not actual Tab) — the shell converts it correctly when wrapped in double quotes.

Common Error — samtools sort fails with “No space left on device”
samtools sort: couldn't allocate memory for bam_mem
[E::bgzf_flush] File write failed (wrong size)
Cause: Temporary sort files fill the disk. BWA-MEM2 + sort can use 2-3x the final BAM size in temp space. Fix: Free up space first (df -h ., delete old BAMs), or redirect temp files to a larger disk:
bwa-mem2 mem ... | samtools sort -T /path/to/larger/disk/tmp -@ 4 -o sample1.bam

Common Error — Pipe fails silently, resulting BAM is truncated

Symptom: samtools quickcheck sample1.bam reports EOF missing, but no error was shown during alignment. Cause: bwa-mem2 crashed inside the pipe; samtools sort still produces a (truncated) BAM. Fix: Add set -o pipefail before the command, or check exit codes:
set -o pipefail
bwa-mem2 mem ... | samtools sort -@ 4 -o sample1.bam
echo "Pipeline exit: $?"  # Should be 0
samtools quickcheck sample1.bam && echo OK

Alignment Statistics

samtools flagstat sample1.bam
samtools flagstat sample2.bam

samtools stats sample1.bam > sample1.stats
samtools stats sample2.bam > sample2.stats
multiqc . -n alignment_report

Metric	What to Expect
Mapping rate	> 95% for matched reference
Properly paired	> 90%
Average depth	10–30x for population WGS; 30–60x for clinical

[Cleanup] After alignment, you can remove the trimmed FASTQ files:
rm -rf trim/

6. Mark Duplicates

PCR amplification creates duplicate reads. For variant calling, these must be marked (not just discarded — GATK handles them).

Run Picard MarkDuplicates

Option	Description
`I=`	Input BAM
`O=`	Output BAM with duplicates flagged
`M=`	Duplication metrics file
`VALIDATION_STRINGENCY=SILENT`	Suppress warnings on BAM format

# Mark duplicates for sample 1
picard MarkDuplicates \
  I=sample1.bam \
  O=sample1.markdup.bam \
  M=sample1.markdup.metrics \
  VALIDATION_STRINGENCY=SILENT

samtools index sample1.markdup.bam

# Mark duplicates for sample 2
picard MarkDuplicates \
  I=sample2.bam \
  O=sample2.markdup.bam \
  M=sample2.markdup.metrics \
  VALIDATION_STRINGENCY=SILENT

samtools index sample2.markdup.bam

Common Error — Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:...)
Cause: Picard’s default heap size (~2 GB) is too small for large BAMs. Fix: Increase Java heap memory:
picard -Xmx8g MarkDuplicates I=sample1.bam O=sample1.markdup.bam M=sample1.markdup.metrics

Common Error — MarkDuplicates fails with SAM validation error
SAM validation error: ERROR: Record ..., MAPQ should be 0 for unmapped read
Cause: BWA-MEM2 occasionally emits non-standard fields that strict validators reject. Fix: We already use VALIDATION_STRINGENCY=SILENT to suppress these warnings. Ensure that option is in your command.

Check Duplication Rate

cat sample1.markdup.metrics
cat sample2.markdup.metrics
multiqc . -n markdup_report

Note: Duplication rates > 30–40% may indicate problems with input DNA quality or library complexity. For very high duplication, use a PCR-free library prep.

[Cleanup] After marking duplicates, remove the original sorted BAM files:
rm -f sample1.bam sample1.bam.bai sample2.bam sample2.bam.bai

7. Base Quality Score Recalibration (BQSR)

GATK BQSR corrects systematic errors in base quality scores from the sequencer. It requires a known variants VCF (e.g., dbSNP).

Download Known Variants

BQSR requires a VCF of known polymorphic sites to distinguish true variants from sequencing errors.

Important: The 1001 Genomes multi-sample VCF is very large (~5 GB compressed, ~80 GB decompressed) and contains formatting inconsistencies that cause GATK errors. For BQSR, we only need the variant positions (sites-only), not the genotype data. The commands below extract a sites-only VCF that is much smaller and works correctly with GATK.

# For Arabidopsis: download from the 1001 Genomes Project
wget https://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_only_ACGTN.vcf.gz

# Extract sites-only VCF (removes genotype columns — BQSR only needs positions)
# This also fixes malformed lines in the original VCF.
zcat 1001genomes_snp-short-indel_only_ACGTN.vcf.gz \
  | awk 'BEGIN{OFS="\t"} /^##/{print; next} /^#CHROM/{print $1,$2,$3,$4,$5,$6,$7,$8; next} {print $1,$2,$3,$4,$5,$6,$7,$8}' \
  | bgzip > known_sites.vcf.gz

tabix -p vcf known_sites.vcf.gz

# Remove the large original file
rm -f 1001genomes_snp-short-indel_only_ACGTN.vcf.gz

# For human (hg38): download dbSNP from GATK resource bundle
# wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf

No known variants available? For less-characterized species, skip BQSR or use an iterative “bootstrap” approach: call variants with HaplotypeCaller, use those as known sites, then re-run BQSR. Two rounds are usually sufficient.

Common Error — VCF file is malformed: there are N genotypes while the header requires M genotypes
A USER ERROR has occurred: Error while trying to create index for known_variants.vcf.
Error was: htsjdk.tribble.TribbleException: The provided VCF file is malformed at
approximately line number 1202192: there are 460 genotypes while the header requires
that 1135 genotypes be present for all records at 1:13410579
Cause: The 1001 Genomes multi-sample VCF has inconsistent genotype columns across rows. Fix: Use the sites-only extraction shown above (strips genotype columns and reheaders).

Common Error — A USER ERROR has occurred: Bad input: ... Contig 'X' is not defined
A USER ERROR has occurred: Bad input: The contig 'Chr1' is not defined in the
sequence dictionary for reference reference.fasta
Cause: Chromosome naming mismatch — the VCF has Chr1 but the reference has 1. Fix: Rename chromosomes in the VCF to match the reference:
# Check names in both
head reference.fasta.fai
zcat known_sites.vcf.gz | grep -v '^#' | cut -f1 | sort -u
# Rename with bcftools if needed
echo "Chr1 1" > chr_map.txt
bcftools annotate --rename-chrs chr_map.txt known_sites.vcf.gz -Oz -o known_sites.renamed.vcf.gz

Step 1: Compute Recalibration Table

Option	Description
`-I`	Input BAM (with duplicates marked)
`-R`	Reference genome
`--known-sites`	Known polymorphic sites VCF
`-O`	Output recalibration table

# Sample 1
gatk BaseRecalibrator \
  -I sample1.markdup.bam \
  -R reference.fasta \
  --known-sites known_sites.vcf.gz \
  -O sample1.recal.table

# Sample 2
gatk BaseRecalibrator \
  -I sample2.markdup.bam \
  -R reference.fasta \
  --known-sites known_sites.vcf.gz \
  -O sample2.recal.table

Step 2: Apply Recalibration

# Sample 1
gatk ApplyBQSR \
  -I sample1.markdup.bam \
  -R reference.fasta \
  --bqsr-recal-file sample1.recal.table \
  -O sample1.recal.bam

samtools index sample1.recal.bam

# Sample 2
gatk ApplyBQSR \
  -I sample2.markdup.bam \
  -R reference.fasta \
  --bqsr-recal-file sample2.recal.table \
  -O sample2.recal.bam

samtools index sample2.recal.bam

Common Error — ApplyBQSR crashes with No space left on device
htsjdk.samtools.SAMException: Exception writing BAM index file
    Caused by: java.io.IOException: No space left on device
Cause: Disk filled up during BAM writing. BQSR doubles the size of the markdup BAM temporarily. Fix: Free up space, then re-run. Check with df -h . — you need at least 2x the markdup BAM size free.
# Remove failed output and retry
rm -f sample1.recal.bam sample1.recal.bai
gatk ApplyBQSR -I sample1.markdup.bam -R reference.fasta \
    --bqsr-recal-file sample1.recal.table -O sample1.recal.bam

Common Error — samtools quickcheck reports EOF missing
sample1.recal.bam was missing EOF block when one should be present.
Cause: BAM file is truncated — usually the writing process was killed or disk filled up. Fix: Always validate BAMs after writing; re-run if truncated:
samtools quickcheck sample1.recal.bam && echo OK || echo TRUNCATED

[Cleanup] After BQSR, remove the markdup BAM files:
rm -f sample1.markdup.bam sample1.markdup.bam.bai sample2.markdup.bam sample2.markdup.bam.bai

8. Variant Calling with GATK HaplotypeCaller

GATK HaplotypeCaller performs local assembly of haplotypes to call SNPs and indels simultaneously.

Call Variants (per-sample GVCF mode)

Using GVCF mode allows joint genotyping across multiple samples — recommended for multi-sample projects.

Option	Description
`-R`	Reference genome
`-I`	Input recalibrated BAM
`-O`	Output GVCF (genomic VCF)
`-ERC GVCF`	Emit reference confidence — produces GVCF instead of plain VCF
`--native-pair-hmm-threads`	Threads for PairHMM calculation

# Sample 1
gatk HaplotypeCaller \
  -R reference.fasta \
  -I sample1.recal.bam \
  -O sample1.g.vcf.gz \
  -ERC GVCF \
  --native-pair-hmm-threads 4

# Sample 2
gatk HaplotypeCaller \
  -R reference.fasta \
  -I sample2.recal.bam \
  -O sample2.g.vcf.gz \
  -ERC GVCF \
  --native-pair-hmm-threads 4

Common Error — An index is required but was not found
A USER ERROR has occurred: An index is required but was not found for file
sample1.g.vcf.gz. Support for unindexed block-compressed files has been
temporarily disabled. Try running IndexFeatureFile on the input.
Cause: HaplotypeCaller was interrupted before writing the .tbi index file. Fix:
gatk IndexFeatureFile -I sample1.g.vcf.gz

Common Error — Premature end of file on GVCF
htsjdk.samtools.FileTruncatedException: Premature end of file: sample1.g.vcf.gz
Cause: HaplotypeCaller was killed or crashed before finishing; the GVCF is truncated. Fix: Delete the truncated file and re-run HaplotypeCaller:
rm -f sample1.g.vcf.gz sample1.g.vcf.gz.tbi
gatk HaplotypeCaller -R reference.fasta -I sample1.recal.bam \
    -O sample1.g.vcf.gz -ERC GVCF --native-pair-hmm-threads 4
Tip: HaplotypeCaller is slow (30-60 min for full genome). To test the pipeline on a small region first, use -L 1:1-5000000.

Joint Genotyping

# Combine GVCFs from both samples
gatk CombineGVCFs \
  -R reference.fasta \
  -V sample1.g.vcf.gz \
  -V sample2.g.vcf.gz \
  -O cohort.g.vcf.gz

# Joint genotyping across all samples
gatk GenotypeGVCFs \
  -R reference.fasta \
  -V cohort.g.vcf.gz \
  -O cohort.vcf.gz

Single-sample alternative: If you only have one sample, skip CombineGVCFs and run GenotypeGVCFs directly on the single GVCF:
gatk GenotypeGVCFs -R reference.fasta -V sample1.g.vcf.gz -O cohort.vcf.gz
Note: population-level analyses (PCA, LD pruning in Section 12) require 2+ samples.

Common Error — CombineGVCFs fails with Input files ... have incompatible contigs
A USER ERROR has occurred: Input files sample1.g.vcf.gz and reference.fasta have
incompatible contigs: No overlapping contigs found.
Cause: The GVCF and reference were built against different assemblies. Fix: Ensure all GVCFs used the same reference used in the current step. Rebuild GVCFs if necessary.

[Cleanup] After joint genotyping, you can remove individual GVCFs and recal BAMs:

rm -f sample1.g.vcf.gz sample1.g.vcf.gz.tbi sample2.g.vcf.gz sample2.g.vcf.gz.tbi
rm -f cohort.g.vcf.gz cohort.g.vcf.gz.tbi
rm -f sample1.recal.bam sample1.recal.bai sample2.recal.bam sample2.recal.bai

9. Variant Filtering

Separate SNPs and Indels

# Extract SNPs only
gatk SelectVariants \
  -R reference.fasta \
  -V cohort.vcf.gz \
  --select-type-to-include SNP \
  -O cohort.snps.vcf.gz

# Extract Indels only
gatk SelectVariants \
  -R reference.fasta \
  -V cohort.vcf.gz \
  --select-type-to-include INDEL \
  -O cohort.indels.vcf.gz

Hard Filter SNPs (GATK Best Practices)

gatk VariantFiltration \
  -R reference.fasta \
  -V cohort.snps.vcf.gz \
  --filter-expression "QD < 2.0" --filter-name "QD2" \
  --filter-expression "FS > 60.0" --filter-name "FS60" \
  --filter-expression "MQ < 40.0" --filter-name "MQ40" \
  --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
  --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
  -O cohort.snps.filtered.vcf.gz

Hard Filter Indels

gatk VariantFiltration \
  -R reference.fasta \
  -V cohort.indels.vcf.gz \
  --filter-expression "QD < 2.0" --filter-name "QD2" \
  --filter-expression "FS > 200.0" --filter-name "FS200" \
  --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
  -O cohort.indels.filtered.vcf.gz

Key GATK Filters Explained

Filter	Field	Threshold (SNP)	Threshold (Indel)	Meaning
QD	Quality by Depth	< 2.0	< 2.0	Normalized variant quality
FS	Fisher Strand Bias	> 60.0	> 200.0	Strand bias test
MQ	RMS Mapping Quality	< 40.0	—	Average mapping quality
MQRankSum	MQ Rank Sum	< -12.5	—	Comparison of MQ for ref/alt
ReadPosRankSum	Read Position RS	< -8.0	< -20.0	Position bias within reads

10. VCF Format

VCF File Structure

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
#CHROM  POS     ID      REF  ALT  QUAL  FILTER  INFO              FORMAT     SAMPLE1
1       12345   .       A    T    220   PASS    DP=45;AF=0.5;...  GT:AD:DP   0/1:22,23:45

Key VCF Fields

Field	Description
CHROM	Chromosome
POS	Position (1-based)
REF	Reference allele
ALT	Alternate allele(s)
QUAL	Variant quality score
FILTER	PASS or filter names
INFO	Variant annotations
FORMAT	Sample field format
GT	Genotype (0/0=hom-ref, 0/1=het, 1/1=hom-alt)

Basic VCF Manipulation with BCFtools

Option	Description
`-f PASS`	Keep only variants that passed all filters
`-O z`	Output as compressed VCF (`.vcf.gz`)
`-o`	Output file name

# Count variants
bcftools stats cohort.snps.filtered.vcf.gz | grep "^SN"

# Keep only PASS variants
bcftools view -f PASS -O z -o cohort.snps.pass.vcf.gz cohort.snps.filtered.vcf.gz

# Index the filtered VCF (required for region queries)
tabix -p vcf cohort.snps.pass.vcf.gz

# Extract specific genomic region
# Note: Ensembl TAIR10 uses "1", "2", ... (not "Chr1", "Chr2")
bcftools view cohort.snps.pass.vcf.gz 1:100000-200000

# Get variant summary statistics
bcftools stats cohort.snps.pass.vcf.gz > stats.txt
multiqc . -n vcf_report

Common Error — Could not retrieve index file
[E::idx_find_and_load] Could not retrieve index file for 'cohort.snps.pass.vcf.gz'
Failed to read from cohort.snps.pass.vcf.gz: could not load index
Cause: Region queries (e.g., 1:100000-200000) require a .tbi tabix index. Fix:
tabix -p vcf cohort.snps.pass.vcf.gz

Common Error — Region query returns nothing
$ bcftools view cohort.snps.pass.vcf.gz Chr1:100000-200000
# Only header shown, no variants
Cause: Chromosome name mismatch — Ensembl TAIR10 uses 1, not Chr1. Fix: Check names in the VCF and use the exact name:
zcat cohort.snps.pass.vcf.gz | grep -v '^#' | cut -f1 | sort -u
bcftools view cohort.snps.pass.vcf.gz 1:100000-200000

11. Variant Annotation with SnpEff

SnpEff predicts the functional effect of each variant (missense, nonsense, synonymous, etc.).

Setup SnpEff Database

# List available databases
snpEff databases | grep -i arabidopsis

# Download database
# For SnpEff 5.x, the database name is "Arabidopsis_thaliana"
# Use the output of the command above to confirm the exact name.
snpEff download Arabidopsis_thaliana

Annotate Variants

snpEff Arabidopsis_thaliana cohort.snps.pass.vcf.gz > cohort.snps.annotated.vcf

# snpEff also generates snpEff_summary.html and snpEff_genes.txt

Common Error — ERROR: Genome 'athalianaTair10' not found
java.lang.RuntimeException: ERROR: Genome 'athalianaTair10' not found.
    Please check your data directory or config file.
Cause: Database name differs between SnpEff versions. Older docs use athalianaTair10, but SnpEff 5.x uses Arabidopsis_thaliana. Fix: Check the exact name with:
snpEff databases | grep -i "arabidopsis"

Common Error — stderr mixed into annotated VCF

Symptom: cohort.snps.annotated.vcf starts with timestamps and log messages instead of ##fileformat=VCFv4.2. Cause: Using -v (verbose) flag prints log messages to stderr, and if you redirect both streams (&>) they get mixed into the output. Fix: Use plain > to redirect only stdout, or redirect stderr separately:
snpEff Arabidopsis_thaliana cohort.snps.pass.vcf.gz > cohort.snps.annotated.vcf 2> snpeff.log

Variant Effect Categories

Effect	Description
HIGH	Frameshift, stop gained/lost, splice site
MODERATE	Missense, in-frame InDel
LOW	Synonymous, splice region
MODIFIER	Intergenic, intronic, upstream/downstream

12. Population-Level Analysis

For multiple samples, use population genomics tools:

PCA (Principal Component Analysis)

Option	Description
`--vcf`	Input VCF file
`--make-bed`	Output PLINK binary format (`.bed`/`.bim`/`.fam`)
`--pca 10`	Compute top 10 principal components
`--allow-extra-chr`	Allow non-human chromosome names (required for plants: Arabidopsis uses `1`–`5`, `Mt`, `Pt`)

# Convert VCF to PLINK binary format
plink --vcf cohort.snps.pass.vcf.gz --make-bed --out cohort --allow-extra-chr

# Run PCA (top 10 components)
plink --bfile cohort --pca 10 --out cohort.pca --allow-extra-chr

Common Error — Error: Invalid chromosome code
Error: Invalid chromosome code 'Mt' on line 1 of .bim file.
(This is disallowed for human analyses; use --allow-extra-chr to permit it.)
Cause: Arabidopsis uses chromosome names (Mt, Pt) that don’t exist in PLINK’s default human chromosome list. Fix: Add --allow-extra-chr to every PLINK command.

Common Error — At least 2 people required for pairwise analysis
Error: At least 2 people required for pairwise analysis.
Cause: PCA and LD analyses need multiple samples (founders). Single-sample data cannot be used for population analyses. Fix: Process at least 2 samples (e.g., sample1 + sample2 in this tutorial) before running PCA.

Linkage Disequilibrium Pruning

Before running GWAS or population structure analyses, prune variants in LD to avoid biasing results.

`--indep-pairwise` Parameter	Description
`50`	Window size (number of SNPs)
`10`	Step size (SNPs to shift window)
`0.2`	r² threshold (remove one of a pair if r² > 0.2)

# Identify LD-pruned variant set
plink --bfile cohort --indep-pairwise 50 10 0.2 --out cohort.ld --allow-extra-chr

# Extract pruned variants
plink --bfile cohort --extract cohort.ld.prune.in --make-bed --out cohort.pruned --allow-extra-chr

Common Error — Skipping --indep-pairwise since there are less than two founders
Warning: Skipping --indep-pairwise since there are less than two founders.
(--make-founders may come in handy here.)
Cause: Same as PCA — requires at least 2 samples. With a single sample, LD pruning is meaningless. Fix: Add a second sample, or (for testing) declare the single sample as founder: plink --bfile cohort --make-founders ... — but the pruning output will be uninformative.

GWAS

For GWAS workflows, see the GWAS tutorial.

13. Full Workflow Summary

Step	Tool	Input	Output
QC	FastQC + MultiQC	FASTQ	HTML report
Trim	fastp	FASTQ	Trimmed FASTQ
Align	BWA-MEM2	FASTQ + Reference	BAM
Sort/Index	SAMtools	BAM	Sorted BAM
Mark duplicates	Picard	BAM	Markdup BAM
BQSR	GATK	BAM + known VCF	Recal BAM
Call variants	GATK HaplotypeCaller	BAM	GVCF
Joint genotype	GATK GenotypeGVCFs	GVCFs	VCF
Filter	GATK VariantFiltration	VCF	Filtered VCF
Annotate	SnpEff	VCF	Annotated VCF

14. Cleanup

conda deactivate
# Optional: remove environment when done
conda env remove --name reseq

Troubleshooting — Quick Reference

Each section above has detailed error boxes. This table is a quick index; see the relevant section for the full fix.

Problem	Section	Quick Fix
`No space left on device`	All	`df -h .`; delete intermediate files following `[Cleanup]` tips
`libcrypto.so.1.0.0: cannot open`	1. Setup	`ln -s ${CONDA_PREFIX}/lib/libcrypto.so.3 ${CONDA_PREFIX}/lib/libcrypto.so.1.0.0`
Dropbox download gives HTML file	2. QC	Quote URL and add `&dl=1`
`fasterq-dump: Segmentation fault`	2. QC	Use ENA download via `wget`
`reference.dict already exists`	4. Reference	`rm reference.dict` and re-run
BWA-MEM2 `Can't open index file`	4–5. Index/Align	Delete partial index and re-run `bwa-mem2 index`
`BAM does not have read groups`	5. Alignment	Add `-R "@RG\tID:...\tSM:..."` to `bwa-mem2 mem`
`samtools quickcheck` reports truncation	5. Alignment, 7. BQSR	File write was interrupted — delete and re-run
`OutOfMemoryError: Java heap space`	6. MarkDup, 8. HC	Increase heap: `picard -Xmx8g` / `gatk --java-options "-Xmx8g"`
VCF `malformed: N genotypes while header requires M`	7. BQSR	Extract sites-only VCF (see Section 7)
`Contig 'X' is not defined`	7. BQSR, 10. bcftools	Match chromosome names: `head reference.fasta.fai` vs VCF chromosomes
GVCF `Premature end of file`	8. HC	Delete truncated GVCF; re-run HaplotypeCaller
`An index is required but was not found`	8. HC	`gatk IndexFeatureFile -I sample1.g.vcf.gz`
`Could not retrieve index file` (bcftools)	10. bcftools	`tabix -p vcf file.vcf.gz`
Region query returns no results	10. bcftools	Check chromosome name: Ensembl TAIR10 uses `1`, not `Chr1`
SnpEff `Genome 'X' not found`	11. SnpEff	`snpEff databases \\| grep -i arabidopsis` to find correct name
PLINK `Invalid chromosome code 'Mt'`	12. PLINK	Add `--allow-extra-chr` to every PLINK command
PLINK `At least 2 people required`	12. PLINK	Need 2+ samples for PCA / LD analyses

References

Resource	Link
GATK Best Practices	broadinstitute.github.io/gatk
BWA-MEM2 paper	Md et al. 2019, iScience
fastp paper	Chen et al. 2018, Bioinformatics
SAM flag decoder	Picard SAM Flags
VCF specification	samtools.github.io/hts-specs
SnpEff documentation	pcingola.github.io/SnpEff
BCFtools manual	samtools.github.io/bcftools