Paper Reading
Please read this paper before class: Van der Auwera & O’Connor (2020) Genomics in the Cloud (GATK Best Practices). O’Reilly Media
Also see the GATK Best Practices: GATK Best Practices for Germline SNPs & Indels
Overview: Resequencing and Variant Calling
Resequencing, whether whole-genome (WGS) or whole-exome (WES), aligns short reads from an individual's genome back to a reference genome to discover genetic variants: differences between the sample and the reference.
Types of Variants
| Variant Type | Size | Description |
|---|---|---|
| SNP | 1 bp | Single nucleotide polymorphism |
| InDel | 1–50 bp | Insertion or deletion |
| CNV | kb–Mb | Copy number variation |
| SV | > 50 bp | Structural variant (inversion, translocation) |
Applications
- Population genetics / GWAS
- Disease gene discovery (human genetics)
- Crop breeding (QTL, marker-assisted selection)
- Cancer genomics (somatic mutation detection)
- Phylogenomics (SNP-based trees)
Resequencing vs. RNA-Seq
| Feature | Resequencing (WGS) | RNA-Seq |
|---|---|---|
| Input | Genomic DNA | RNA (cDNA) |
| Coverage | Uniform (genome) | Non-uniform (expressed genes) |
| Aligner | BWA-MEM2, Bowtie2 | HISAT2, STAR (splice-aware) |
| Output | Variants (VCF) | Expression counts |
The Variant Calling Workflow
0. Prerequisites
Disk Space: This tutorial requires at least 30 GB of free disk space. Each sample generates ~5 GB of intermediate files (BAM, markdup BAM, recal BAM, GVCF). Check your available space before starting:
df -h .
If disk space is limited, follow the cleanup tips marked with the label [Cleanup] after each major step.
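If you would rather fail loudly than eyeball the df output, the check can be scripted. A minimal sketch; the 30 GB threshold is the tutorial's stated requirement, and `df -Pk` (POSIX format, 1 KiB blocks) is an assumption for portable parsing:

```shell
# Abort-early disk check: 30 GB ~= 31457280 KiB.
# df -Pk guarantees one non-wrapped data line; column 4 is available space.
avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt 31457280 ]; then
    echo "WARNING: only $((avail_kb / 1024 / 1024)) GB free; 30 GB recommended"
else
    echo "Disk space OK: $((avail_kb / 1024 / 1024)) GB free"
fi
```

Put this at the top of any wrapper script so a half-written BAM never surprises you mid-pipeline.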
1. Environment Setup
Create Conda Environment
conda create -n reseq -c bioconda -c conda-forge python=3.11
conda activate reseq
conda install -c bioconda -c conda-forge fastqc fastp bwa-mem2 samtools
conda install -c bioconda -c conda-forge openjdk=17 picard gatk4 bcftools
conda install -c bioconda -c conda-forge snpeff plink tabix
pip install multiqc
# Clean conda package cache to free disk space
conda clean --all -y
Verify Installations
fastp --version
bwa-mem2 version
samtools --version
gatk --version
bcftools --version
Common Error — libcrypto.so.1.0.0: cannot open shared object file
samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
Cause: OpenSSL version mismatch between samtools/bcftools and the conda environment.
Fix: Check which libcrypto file exists and create a symlink:
ls ${CONDA_PREFIX}/lib/libcrypto.so.*
# Use the actual version you see (often libcrypto.so.3 or libcrypto.so.1.1)
ln -s ${CONDA_PREFIX}/lib/libcrypto.so.3 ${CONDA_PREFIX}/lib/libcrypto.so.1.0.0
Common Error — Conda install hangs or fails with “Solving environment”
Cause: Default conda solver is slow and can get stuck resolving dependencies. Fix: Use mamba or the libmamba solver:
conda install -n base -c conda-forge mamba
# Then use `mamba install` instead of `conda install`
Create Working Directory
mkdir -p ~/bch709/reseq
cd ~/bch709/reseq
2. Quality Control
Download Example Data
For this tutorial, we use publicly available Arabidopsis thaliana WGS datasets from the 1001 Genomes Project. We download two samples so that we can demonstrate joint genotyping later.
Note: The SRA Toolkit (fasterq-dump, fastq-dump) is known to cause segmentation faults on WSL (Windows Subsystem for Linux). Downloading FASTQ files directly from ENA avoids this issue entirely.
Option A — Download from ENA (European Nucleotide Archive):
cd ~/bch709/reseq
# Sample 1: Col-0 re-sequencing (SRR519585)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519585/SRR519585_1.fastq.gz -O sample1_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519585/SRR519585_2.fastq.gz -O sample1_R2.fastq.gz
# Sample 2: TBO-01 re-sequencing (SRR519586)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519586/SRR519586_1.fastq.gz -O sample2_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519586/SRR519586_2.fastq.gz -O sample2_R2.fastq.gz
ls -lh
Option B — Download from Dropbox mirror (if ENA is slow or fails):
cd ~/bch709/reseq
# Sample 1
wget "https://www.dropbox.com/scl/fi/8fy6hrhczh7v23ojqox5r/wgs_R1.fastq.gz?rlkey=nmanzksfbtm76xer4jesq2vuy&dl=1" -O sample1_R1.fastq.gz
wget "https://www.dropbox.com/scl/fi/kt29bt8c29vf9i294xeek/wgs_R2.fastq.gz?rlkey=8xm8lk839jqcz2qfjlzx4gqle&dl=1" -O sample1_R2.fastq.gz
# Sample 2 — download from ENA (no Dropbox mirror)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519586/SRR519586_1.fastq.gz -O sample2_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519586/SRR519586_2.fastq.gz -O sample2_R2.fastq.gz
ls -lh
Dropbox direct download tip: Add &dl=1 to the end of a Dropbox share URL, and wrap the URL in double quotes to prevent the shell from interpreting the & and ? characters.
Common Error — Dropbox download gives tiny HTML file instead of FASTQ
Symptom: the downloaded file is only a few KB, and running file on it shows HTML document.
Cause: Missed the &dl=1 at the end of the URL, or forgot to quote the URL.
Fix:
# WRONG — shell interprets &
wget https://www.dropbox.com/.../file.fastq.gz?rlkey=...&dl=1
# RIGHT — quote the URL
wget "https://www.dropbox.com/.../file.fastq.gz?rlkey=...&dl=1" -O file.fastq.gz
Common Error — fasterq-dump: segmentation fault
fasterq-dump: Segmentation fault (core dumped)
Cause: A known SRA Toolkit bug on WSL (Windows Subsystem for Linux).
Fix: Skip the SRA Toolkit and download FASTQ directly from ENA using wget as shown above.
You can find ENA download links for any SRA accession at ENA Browser or NCBI SRA.
Run FastQC
# -t 4 : use 4 threads
fastqc -t 4 sample1_R1.fastq.gz sample1_R2.fastq.gz sample2_R1.fastq.gz sample2_R2.fastq.gz
# Aggregate all QC reports into one
multiqc .
Key Metrics for WGS
| Metric | What to Check |
|---|---|
| Per-base quality | Should be Q30+ across most of the read |
| GC content | Should match species genome GC% |
| Duplication | High duplication may indicate low-complexity library |
| Adapter content | Remove if present |
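GC content can also be sanity-checked outside FastQC with a one-liner, since FASTQ sequence lines are every 2nd line of each 4-line record. The two-read input below is a made-up example so the command runs without downloading anything:

```shell
# GC% straight from FASTQ: count G/C bases on sequence lines only.
# Take length($0) BEFORE gsub, because gsub deletes the matched bases.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' \
  | awk 'NR % 4 == 2 { total += length($0); gc += gsub(/[GCgc]/, "") }
         END { printf "GC%%: %.1f\n", 100 * gc / total }'
# GC%: 75.0
```

On real data, replace the printf with `zcat sample1_R1.fastq.gz` and compare the result against the expected genome GC% (roughly 36% for Arabidopsis).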
3. Read Trimming
fastp performs adapter removal, quality trimming, and QC in a single pass — faster and simpler than running separate tools.
| Option | Description |
|---|---|
| --in1 / --in2 | Forward / reverse input reads |
| --out1 / --out2 | Forward / reverse output reads |
| --detect_adapter_for_pe | Auto-detect adapters for paired-end data |
| --qualified_quality_phred 20 | Minimum base quality threshold (Q20) |
| --length_required 50 | Discard reads shorter than 50 bp |
| --thread 4 | Number of threads |
| --html / --json | QC report outputs (JSON is used by MultiQC) |
mkdir -p trim
# Trim sample 1
fastp \
--in1 sample1_R1.fastq.gz \
--in2 sample1_R2.fastq.gz \
--out1 trim/sample1_R1_trimmed.fq.gz \
--out2 trim/sample1_R2_trimmed.fq.gz \
--detect_adapter_for_pe \
--qualified_quality_phred 20 \
--length_required 50 \
--thread 4 \
--html trim/sample1_fastp_report.html \
--json trim/sample1_fastp_report.json
# Trim sample 2
fastp \
--in1 sample2_R1.fastq.gz \
--in2 sample2_R2.fastq.gz \
--out1 trim/sample2_R1_trimmed.fq.gz \
--out2 trim/sample2_R2_trimmed.fq.gz \
--detect_adapter_for_pe \
--qualified_quality_phred 20 \
--length_required 50 \
--thread 4 \
--html trim/sample2_fastp_report.html \
--json trim/sample2_fastp_report.json
multiqc trim/ -n trim_report
[Cleanup] After trimming, you can remove the original FASTQ files to save disk space:
rm -f sample1_R1.fastq.gz sample1_R2.fastq.gz sample2_R1.fastq.gz sample2_R2.fastq.gz
4. Reference Genome Preparation
Download Reference
# Arabidopsis TAIR10 reference from Ensembl Plants
wget https://ftp.ensemblgenomes.org/pub/plants/release-60/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz --no-check-certificate
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
mv Arabidopsis_thaliana.TAIR10.dna.toplevel.fa reference.fasta
Note: The Ensembl TAIR10 reference uses chromosome names 1, 2, 3, 4, 5, Mt, Pt (not Chr1, Chr2, etc.). Keep this in mind when specifying genomic regions in later steps.
Create BWA-MEM2 Index
# Generates .0123, .amb, .ann, .bwt.2bit.64, .pac
bwa-mem2 index reference.fasta
Create FASTA Index and Sequence Dictionary (required by GATK)
# .fai index for samtools/GATK
samtools faidx reference.fasta
# .dict sequence dictionary for GATK
picard CreateSequenceDictionary R=reference.fasta O=reference.dict
Common Error — reference.dict already exists
Exception in thread "main" picard.PicardException: /path/reference.dict already exists. Delete this file and try again, or specify a different output file.
Cause: Running CreateSequenceDictionary a second time without removing the old .dict.
Fix:
rm -f reference.dict
picard CreateSequenceDictionary R=reference.fasta O=reference.dict
Common Error — BWA-MEM2 index files missing
ERROR! Can't open index file reference.fasta.0123
Cause: Index not built, or the index build was interrupted (disk full, killed process).
Fix: Delete any partial index files and re-run:
rm -f reference.fasta.0123 reference.fasta.amb reference.fasta.ann \
  reference.fasta.bwt.2bit.64 reference.fasta.pac
bwa-mem2 index reference.fasta
5. Alignment with BWA-MEM2
BWA-MEM2 uses the Burrows-Wheeler Transform (BWT) + FM-index for short-read alignment.
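To make the BWT concrete: it is the last column of the lexicographically sorted rotations of a string. A toy sketch for the textbook example BANANA$ (real aligners build this via suffix arrays, not explicit rotations, so this is purely illustrative):

```shell
# Generate every rotation of "BANANA$", sort them byte-wise (LC_ALL=C so
# '$' sorts before letters), then keep the last character of each row.
# That final column is the Burrows-Wheeler Transform.
s='BANANA$'
n=${#s}
for ((i = 0; i < n; i++)); do
    echo "${s:i}${s:0:i}"
done | LC_ALL=C sort | sed 's/.*\(.\)$/\1/' | tr -d '\n'
echo
# ANNB$AA
```

The transform groups identical characters together, which is what makes the FM-index able to count and locate read substrings in the reference quickly.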
Why BWA-MEM2 for DNA (not HISAT2)?
| Feature | BWA-MEM2 | HISAT2/STAR |
|---|---|---|
| Splice-aware | No | Yes |
| DNA reads | Optimal | Not designed for |
| RNA reads | Suboptimal | Optimal |
Align Reads
The @RG (read group) tag is required by GATK.
| @RG Field | Description |
|---|---|
| ID | Read group ID |
| SM | Sample name (GATK uses this to identify samples) |
| PL | Sequencing platform (e.g., ILLUMINA) |
| LB | Library name |
| PU | Platform unit (e.g., flowcell-barcode.lane) |
# Align sample 1
bwa-mem2 mem \
-t 4 \
-R "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1\tPU:unit1" \
reference.fasta \
trim/sample1_R1_trimmed.fq.gz \
trim/sample1_R2_trimmed.fq.gz \
| samtools sort -@ 4 -o sample1.bam
samtools index sample1.bam
# Align sample 2
bwa-mem2 mem \
-t 4 \
-R "@RG\tID:sample2\tSM:sample2\tPL:ILLUMINA\tLB:lib2\tPU:unit2" \
reference.fasta \
trim/sample2_R1_trimmed.fq.gz \
trim/sample2_R2_trimmed.fq.gz \
| samtools sort -@ 4 -o sample2.bam
samtools index sample2.bam
Common Error — BAM file ... does not have any read groups
A USER ERROR has occurred: SAM/BAM/CRAM file <file> does not have any read groups defined
Cause: Forgot the -R "@RG\t..." flag in bwa-mem2 mem. GATK requires a read group.
Fix: Re-run the alignment with the -R flag. Use a literal \t (not an actual Tab); wrap the string in double quotes so the shell passes it through unchanged, and bwa-mem2 converts \t itself.
Common Error — samtools sort fails with "No space left on device"
samtools sort: couldn't allocate memory for bam_mem
[E::bgzf_flush] File write failed (wrong size)
Cause: Temporary sort files fill the disk. BWA-MEM2 + sort can use 2-3x the final BAM size in temp space.
Fix: Free up space first (df -h ., delete old BAMs), or redirect temp files to a larger disk:
bwa-mem2 mem ... | samtools sort -T /path/to/larger/disk/tmp -@ 4 -o sample1.bam
Common Error — Pipe fails silently, resulting BAM is truncated
Symptom: samtools quickcheck sample1.bam reports a missing EOF block, but no error was shown during alignment.
Cause: bwa-mem2 crashed inside the pipe; samtools sort still produced a (truncated) BAM.
Fix: Add set -o pipefail before the command, or check exit codes:
set -o pipefail
bwa-mem2 mem ... | samtools sort -@ 4 -o sample1.bam
echo "Pipeline exit: $?"   # Should be 0
samtools quickcheck sample1.bam && echo OK
Alignment Statistics
samtools flagstat sample1.bam
samtools flagstat sample2.bam
samtools stats sample1.bam > sample1.stats
samtools stats sample2.bam > sample2.stats
multiqc . -n alignment_report
| Metric | What to Expect |
|---|---|
| Mapping rate | > 95% for matched reference |
| Properly paired | > 90% |
| Average depth | 10–30x for population WGS; 30–60x for clinical |
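Average depth can be summarized from `samtools depth` output, which prints one chromosome/position/depth line per base. The three-position input piped in below is fabricated so the awk step runs without a BAM:

```shell
# Mean depth from `samtools depth`-style output (column 3 = depth).
# In the real pipeline: samtools depth -a sample1.bam | awk '...'
printf '1\t100\t10\n1\t101\t20\n1\t102\t30\n' \
  | awk '{ sum += $3; n++ } END { printf "mean depth: %.1fx\n", sum / n }'
# mean depth: 20.0x
```

Compare the result against the 10-30x (population WGS) or 30-60x (clinical) expectations in the table above.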
[Cleanup] After alignment, you can remove the trimmed FASTQ files:
rm -rf trim/
6. Mark Duplicates
PCR amplification creates duplicate reads. For variant calling, these must be marked (not just discarded — GATK handles them).
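"Marked, not discarded" means MarkDuplicates sets bit 0x400 (1024) in the SAM FLAG field, and downstream GATK tools skip reads carrying that bit. A quick sketch of how the bit is tested; the two flag values are illustrative:

```shell
# SAM FLAG bit 0x400 marks a PCR/optical duplicate.
# 99 = paired, properly paired, mate reverse, first in pair;
# 1123 = 99 + 1024, i.e. the same read after being flagged as a duplicate.
for flag in 99 1123; do
    if (( flag & 0x400 )); then
        echo "FLAG $flag: duplicate"
    else
        echo "FLAG $flag: not a duplicate"
    fi
done
# FLAG 99: not a duplicate
# FLAG 1123: duplicate
```

The same test is what `samtools view -f 1024` (keep duplicates) and `-F 1024` (drop duplicates) apply internally.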
Run Picard MarkDuplicates
| Option | Description |
|---|---|
| I= | Input BAM |
| O= | Output BAM with duplicates flagged |
| M= | Duplication metrics file |
| VALIDATION_STRINGENCY=SILENT | Suppress warnings on BAM format |
# Mark duplicates for sample 1
picard MarkDuplicates \
I=sample1.bam \
O=sample1.markdup.bam \
M=sample1.markdup.metrics \
VALIDATION_STRINGENCY=SILENT
samtools index sample1.markdup.bam
# Mark duplicates for sample 2
picard MarkDuplicates \
I=sample2.bam \
O=sample2.markdup.bam \
M=sample2.markdup.metrics \
VALIDATION_STRINGENCY=SILENT
samtools index sample2.markdup.bam
Common Error — java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:...)
Cause: Picard's default heap size (~2 GB) is too small for large BAMs.
Fix: Increase the Java heap memory:
picard -Xmx8g MarkDuplicates I=sample1.bam O=sample1.markdup.bam M=sample1.markdup.metrics
Common Error — MarkDuplicates fails with SAM validation error
SAM validation error: ERROR: Record ..., MAPQ should be 0 for unmapped read
Cause: BWA-MEM2 occasionally emits non-standard fields that strict validators reject.
Fix: We already use VALIDATION_STRINGENCY=SILENT to suppress these warnings. Ensure that option is in your command.
Check Duplication Rate
cat sample1.markdup.metrics
cat sample2.markdup.metrics
multiqc . -n markdup_report
Note: Duplication rates > 30–40% may indicate problems with input DNA quality or library complexity. For very high duplication, use a PCR-free library prep.
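You can pull the duplication rate out of the Picard metrics table directly instead of reading the whole file. The two-line table written below is a stand-in for sample1.markdup.metrics (real files have a few comment lines starting with # before the table, which the awk skips):

```shell
# Locate the PERCENT_DUPLICATION column by name, then print it as a percent.
printf 'LIBRARY\tPERCENT_DUPLICATION\nlib1\t0.1234\n' > fake.metrics
awk -F'\t' '
    /^#/ { next }                                  # skip Picard comment lines
    !col { for (i = 1; i <= NF; i++)
               if ($i == "PERCENT_DUPLICATION") col = i
           next }                                  # first non-comment line = header
    col  { printf "duplication rate: %.1f%%\n", $col * 100 }
' fake.metrics
# duplication rate: 12.3%
```

Looking the column up by name rather than by position keeps the snippet working even if Picard reorders the metrics columns between versions.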
[Cleanup] After marking duplicates, remove the original sorted BAM files:
rm -f sample1.bam sample1.bam.bai sample2.bam sample2.bam.bai
7. Base Quality Score Recalibration (BQSR)
GATK BQSR corrects systematic errors in base quality scores from the sequencer. It requires a known variants VCF (e.g., dbSNP).
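BQSR works on Phred-scaled qualities, where Q = -10·log10(p) for base-call error probability p; a machine-reported Q30 that empirically behaves like Q20 gets recalibrated down. The scale itself, computed for reference:

```shell
# Phred quality Q vs. implied base-call error probability p = 10^(-Q/10).
awk 'BEGIN {
    for (q = 10; q <= 40; q += 10)
        printf "Q%d -> error probability %.4f\n", q, 10^(-q/10)
}'
# Q10 -> error probability 0.1000
# Q20 -> error probability 0.0100
# Q30 -> error probability 0.0010
# Q40 -> error probability 0.0001
```

This is why the fastp threshold of Q20 used earlier corresponds to tolerating at most a 1% per-base error rate.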
Download Known Variants
BQSR requires a VCF of known polymorphic sites to distinguish true variants from sequencing errors.
Important: The 1001 Genomes multi-sample VCF is very large (~5 GB compressed, ~80 GB decompressed) and contains formatting inconsistencies that cause GATK errors. For BQSR, we only need the variant positions (sites-only), not the genotype data. The commands below extract a sites-only VCF that is much smaller and works correctly with GATK.
# For Arabidopsis: download from the 1001 Genomes Project
wget https://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_only_ACGTN.vcf.gz
# Extract sites-only VCF (removes genotype columns — BQSR only needs positions)
# This also fixes malformed lines in the original VCF.
zcat 1001genomes_snp-short-indel_only_ACGTN.vcf.gz \
| awk 'BEGIN{OFS="\t"} /^##/{print; next} /^#CHROM/{print $1,$2,$3,$4,$5,$6,$7,$8; next} {print $1,$2,$3,$4,$5,$6,$7,$8}' \
| bgzip > known_sites.vcf.gz
tabix -p vcf known_sites.vcf.gz
# Remove the large original file
rm -f 1001genomes_snp-short-indel_only_ACGTN.vcf.gz
# For human (hg38): download dbSNP from GATK resource bundle
# wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
No known variants available? For less-characterized species, skip BQSR or use an iterative “bootstrap” approach: call variants with HaplotypeCaller, use those as known sites, then re-run BQSR. Two rounds are usually sufficient.
Common Error — VCF file is malformed: there are N genotypes while the header requires M genotypes
A USER ERROR has occurred: Error while trying to create index for known_variants.vcf. Error was: htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 1202192: there are 460 genotypes while the header requires that 1135 genotypes be present for all records at 1:13410579
Cause: The 1001 Genomes multi-sample VCF has inconsistent genotype columns across rows.
Fix: Use the sites-only extraction shown above (it strips the genotype columns and reheaders).
Common Error — A USER ERROR has occurred: Bad input: ... Contig 'X' is not defined
A USER ERROR has occurred: Bad input: The contig 'Chr1' is not defined in the sequence dictionary for reference reference.fasta
Cause: Chromosome naming mismatch; the VCF has Chr1 but the reference has 1.
Fix: Rename chromosomes in the VCF to match the reference:
# Check names in both
head reference.fasta.fai
zcat known_sites.vcf.gz | grep -v '^#' | cut -f1 | sort -u
# Rename with bcftools if needed
echo "Chr1 1" > chr_map.txt
bcftools annotate --rename-chrs chr_map.txt known_sites.vcf.gz -Oz -o known_sites.renamed.vcf.gz
Step 1: Compute Recalibration Table
| Option | Description |
|---|---|
| -I | Input BAM (with duplicates marked) |
| -R | Reference genome |
| --known-sites | Known polymorphic sites VCF |
| -O | Output recalibration table |
# Sample 1
gatk BaseRecalibrator \
-I sample1.markdup.bam \
-R reference.fasta \
--known-sites known_sites.vcf.gz \
-O sample1.recal.table
# Sample 2
gatk BaseRecalibrator \
-I sample2.markdup.bam \
-R reference.fasta \
--known-sites known_sites.vcf.gz \
-O sample2.recal.table
Step 2: Apply Recalibration
# Sample 1
gatk ApplyBQSR \
-I sample1.markdup.bam \
-R reference.fasta \
--bqsr-recal-file sample1.recal.table \
-O sample1.recal.bam
samtools index sample1.recal.bam
# Sample 2
gatk ApplyBQSR \
-I sample2.markdup.bam \
-R reference.fasta \
--bqsr-recal-file sample2.recal.table \
-O sample2.recal.bam
samtools index sample2.recal.bam
Common Error — ApplyBQSR crashes with No space left on device
htsjdk.samtools.SAMException: Exception writing BAM index file
Caused by: java.io.IOException: No space left on device
Cause: The disk filled up during BAM writing. BQSR temporarily doubles the size of the markdup BAM.
Fix: Free up space, then re-run. Check with df -h .; you need at least 2x the markdup BAM size free.
# Remove failed output and retry
rm -f sample1.recal.bam sample1.recal.bai
gatk ApplyBQSR -I sample1.markdup.bam -R reference.fasta \
  --bqsr-recal-file sample1.recal.table -O sample1.recal.bam
Common Error — samtools quickcheck reports EOF missing
sample1.recal.bam was missing EOF block when one should be present.
Cause: The BAM file is truncated, usually because the writing process was killed or the disk filled up.
Fix: Always validate BAMs after writing; re-run if truncated:
samtools quickcheck sample1.recal.bam && echo OK || echo TRUNCATED
[Cleanup] After BQSR, remove the markdup BAM files:
rm -f sample1.markdup.bam sample1.markdup.bam.bai sample2.markdup.bam sample2.markdup.bam.bai
8. Variant Calling with GATK HaplotypeCaller
GATK HaplotypeCaller performs local assembly of haplotypes to call SNPs and indels simultaneously.
Call Variants (per-sample GVCF mode)
Using GVCF mode allows joint genotyping across multiple samples — recommended for multi-sample projects.
| Option | Description |
|---|---|
| -R | Reference genome |
| -I | Input recalibrated BAM |
| -O | Output GVCF (genomic VCF) |
| -ERC GVCF | Emit reference confidence; produces a GVCF instead of a plain VCF |
| --native-pair-hmm-threads | Threads for the PairHMM calculation |
# Sample 1
gatk HaplotypeCaller \
-R reference.fasta \
-I sample1.recal.bam \
-O sample1.g.vcf.gz \
-ERC GVCF \
--native-pair-hmm-threads 4
# Sample 2
gatk HaplotypeCaller \
-R reference.fasta \
-I sample2.recal.bam \
-O sample2.g.vcf.gz \
-ERC GVCF \
--native-pair-hmm-threads 4
Common Error — An index is required but was not found
A USER ERROR has occurred: An index is required but was not found for file sample1.g.vcf.gz. Support for unindexed block-compressed files has been temporarily disabled. Try running IndexFeatureFile on the input.
Cause: HaplotypeCaller was interrupted before writing the .tbi index file.
Fix:
gatk IndexFeatureFile -I sample1.g.vcf.gz
Common Error — Premature end of file on GVCF
htsjdk.samtools.FileTruncatedException: Premature end of file: sample1.g.vcf.gz
Cause: HaplotypeCaller was killed or crashed before finishing; the GVCF is truncated.
Fix: Delete the truncated file and re-run HaplotypeCaller:
rm -f sample1.g.vcf.gz sample1.g.vcf.gz.tbi
gatk HaplotypeCaller -R reference.fasta -I sample1.recal.bam \
  -O sample1.g.vcf.gz -ERC GVCF --native-pair-hmm-threads 4
Tip: HaplotypeCaller is slow (30-60 min for a full genome). To test the pipeline on a small region first, use -L 1:1-5000000.
Joint Genotyping
# Combine GVCFs from both samples
gatk CombineGVCFs \
-R reference.fasta \
-V sample1.g.vcf.gz \
-V sample2.g.vcf.gz \
-O cohort.g.vcf.gz
# Joint genotyping across all samples
gatk GenotypeGVCFs \
-R reference.fasta \
-V cohort.g.vcf.gz \
-O cohort.vcf.gz
Single-sample alternative: If you only have one sample, skip CombineGVCFs and run GenotypeGVCFs directly on the single GVCF:
gatk GenotypeGVCFs -R reference.fasta -V sample1.g.vcf.gz -O cohort.vcf.gz
Note: population-level analyses (PCA, LD pruning in Section 12) require 2+ samples.
Common Error — CombineGVCFs fails with Input files ... have incompatible contigs
A USER ERROR has occurred: Input files sample1.g.vcf.gz and reference.fasta have incompatible contigs: No overlapping contigs found.
Cause: The GVCFs and the reference were built against different assemblies.
Fix: Ensure all GVCFs were called against the same reference used in the current step. Rebuild the GVCFs if necessary.
[Cleanup] After joint genotyping, you can remove individual GVCFs and recal BAMs:
rm -f sample1.g.vcf.gz sample1.g.vcf.gz.tbi sample2.g.vcf.gz sample2.g.vcf.gz.tbi
rm -f cohort.g.vcf.gz cohort.g.vcf.gz.tbi
rm -f sample1.recal.bam sample1.recal.bai sample2.recal.bam sample2.recal.bai
9. Variant Filtering
Separate SNPs and Indels
# Extract SNPs only
gatk SelectVariants \
-R reference.fasta \
-V cohort.vcf.gz \
--select-type-to-include SNP \
-O cohort.snps.vcf.gz
# Extract Indels only
gatk SelectVariants \
-R reference.fasta \
-V cohort.vcf.gz \
--select-type-to-include INDEL \
-O cohort.indels.vcf.gz
Hard Filter SNPs (GATK Best Practices)
gatk VariantFiltration \
-R reference.fasta \
-V cohort.snps.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O cohort.snps.filtered.vcf.gz
Hard Filter Indels
gatk VariantFiltration \
-R reference.fasta \
-V cohort.indels.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 200.0" --filter-name "FS200" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
-O cohort.indels.filtered.vcf.gz
Key GATK Filters Explained
| Filter | Field | Threshold (SNP) | Threshold (Indel) | Meaning |
|---|---|---|---|---|
| QD | Quality by Depth | < 2.0 | < 2.0 | Normalized variant quality |
| FS | Fisher Strand Bias | > 60.0 | > 200.0 | Strand bias test |
| MQ | RMS Mapping Quality | < 40.0 | — | Average mapping quality |
| MQRankSum | MQ Rank Sum | < -12.5 | — | Comparison of MQ for ref/alt |
| ReadPosRankSum | Read Position RS | < -8.0 | < -20.0 | Position bias within reads |
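To see why QD catches dubious calls that raw QUAL misses: QD is QUAL divided by the depth of reads supporting the variant, so an inflated QUAL at a deep site still fails. A sketch of the QD2 test with hypothetical numbers:

```shell
# QD = QUAL / depth. QUAL 60 looks respectable in isolation, but spread
# over 40 reads it is weak evidence and fails the "QD < 2.0" hard filter.
awk 'BEGIN {
    qual = 60; dp = 40
    qd = qual / dp
    printf "QD = %.1f -> %s\n", qd, (qd < 2.0 ? "FILTERED (QD2)" : "PASS")
}'
# QD = 1.5 -> FILTERED (QD2)
```

VariantFiltration applies the same comparison to the QD annotation GATK stores in the INFO column, flagging the record rather than deleting it.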
10. VCF Format
VCF File Structure
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
1 12345 . A T 220 PASS DP=45;AF=0.5;... GT:AD:DP 0/1:22,23:45
Key VCF Fields
| Field | Description |
|---|---|
| CHROM | Chromosome |
| POS | Position (1-based) |
| REF | Reference allele |
| ALT | Alternate allele(s) |
| QUAL | Variant quality score |
| FILTER | PASS or filter names |
| INFO | Variant annotations |
| FORMAT | Sample field format |
| GT | Genotype (0/0=hom-ref, 0/1=het, 1/1=hom-alt) |
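The FORMAT column is a colon-separated key list that applies to every sample column to its right. A small awk sketch that pairs keys with values for the example record shown above:

```shell
# Pair FORMAT keys (column 9) with one sample's values (column 10).
printf '1\t12345\t.\tA\tT\t220\tPASS\tDP=45\tGT:AD:DP\t0/1:22,23:45\n' \
  | awk -F'\t' '{
        n = split($9, keys, ":"); split($10, vals, ":")
        for (i = 1; i <= n; i++) kv[keys[i]] = vals[i]
        print "GT=" kv["GT"], "AD=" kv["AD"], "DP=" kv["DP"]
    }'
# GT=0/1 AD=22,23 DP=45
```

In practice `bcftools query -f '[%GT\t%AD\t%DP]\n'` does this for you, but the sketch shows exactly what that query is extracting.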
Basic VCF Manipulation with BCFtools
| Option | Description |
|---|---|
| -f PASS | Keep only variants that passed all filters |
| -O z | Output as compressed VCF (.vcf.gz) |
| -o | Output file name |
# Count variants
bcftools stats cohort.snps.filtered.vcf.gz | grep "^SN"
# Keep only PASS variants
bcftools view -f PASS -O z -o cohort.snps.pass.vcf.gz cohort.snps.filtered.vcf.gz
# Index the filtered VCF (required for region queries)
tabix -p vcf cohort.snps.pass.vcf.gz
# Extract specific genomic region
# Note: Ensembl TAIR10 uses "1", "2", ... (not "Chr1", "Chr2")
bcftools view cohort.snps.pass.vcf.gz 1:100000-200000
# Get variant summary statistics
bcftools stats cohort.snps.pass.vcf.gz > stats.txt
multiqc . -n vcf_report
Common Error — Could not retrieve index file
[E::idx_find_and_load] Could not retrieve index file for 'cohort.snps.pass.vcf.gz'
Failed to read from cohort.snps.pass.vcf.gz: could not load index
Cause: Region queries (e.g., 1:100000-200000) require a .tbi tabix index.
Fix:
tabix -p vcf cohort.snps.pass.vcf.gz
Common Error — Region query returns nothing
$ bcftools view cohort.snps.pass.vcf.gz Chr1:100000-200000
# Only header shown, no variants
Cause: Chromosome name mismatch; Ensembl TAIR10 uses 1, not Chr1.
Fix: Check the names actually present in the VCF and use the exact name:
zcat cohort.snps.pass.vcf.gz | grep -v '^#' | cut -f1 | sort -u
bcftools view cohort.snps.pass.vcf.gz 1:100000-200000
11. Variant Annotation with SnpEff
SnpEff predicts the functional effect of each variant (missense, nonsense, synonymous, etc.).
Setup SnpEff Database
# List available databases
snpEff databases | grep -i arabidopsis
# Download database
# For SnpEff 5.x, the database name is "Arabidopsis_thaliana"
# Use the output of the command above to confirm the exact name.
snpEff download Arabidopsis_thaliana
Annotate Variants
snpEff Arabidopsis_thaliana cohort.snps.pass.vcf.gz > cohort.snps.annotated.vcf
# snpEff also generates snpEff_summary.html and snpEff_genes.txt
Common Error — ERROR: Genome 'athalianaTair10' not found
java.lang.RuntimeException: ERROR: Genome 'athalianaTair10' not found. Please check your data directory or config file.
Cause: The database name differs between SnpEff versions. Older docs use athalianaTair10, but SnpEff 5.x uses Arabidopsis_thaliana.
Fix: Check the exact name with:
snpEff databases | grep -i "arabidopsis"
Common Error — stderr mixed into annotated VCF
Symptom: cohort.snps.annotated.vcf starts with timestamps and log messages instead of ##fileformat=VCFv4.2.
Cause: The -v (verbose) flag prints log messages to stderr, and if you redirect both streams (&>), they get mixed into the output.
Fix: Use plain > to redirect only stdout, or redirect stderr separately:
snpEff Arabidopsis_thaliana cohort.snps.pass.vcf.gz > cohort.snps.annotated.vcf 2> snpeff.log
Variant Effect Categories
| Effect | Description |
|---|---|
| HIGH | Frameshift, stop gained/lost, splice site |
| MODERATE | Missense, in-frame InDel |
| LOW | Synonymous, splice region |
| MODIFIER | Intergenic, intronic, upstream/downstream |
12. Population-Level Analysis
For multiple samples, use population genomics tools:
PCA (Principal Component Analysis)
| Option | Description |
|---|---|
| --vcf | Input VCF file |
| --make-bed | Output PLINK binary format (.bed/.bim/.fam) |
| --pca 10 | Compute top 10 principal components |
| --allow-extra-chr | Allow non-human chromosome names (required for plants: Arabidopsis uses 1–5, Mt, Pt) |
# Convert VCF to PLINK binary format
plink --vcf cohort.snps.pass.vcf.gz --make-bed --out cohort --allow-extra-chr
# Run PCA (top 10 components)
plink --bfile cohort --pca 10 --out cohort.pca --allow-extra-chr
Common Error — Error: Invalid chromosome code
Error: Invalid chromosome code 'Mt' on line 1 of .bim file. (This is disallowed for human analyses; use --allow-extra-chr to permit it.)
Cause: Arabidopsis uses chromosome names (Mt, Pt) that don't exist in PLINK's default human chromosome list.
Fix: Add --allow-extra-chr to every PLINK command.
Common Error — At least 2 people required for pairwise analysis
Error: At least 2 people required for pairwise analysis.
Cause: PCA and LD analyses need multiple samples (founders). Single-sample data cannot be used for population analyses.
Fix: Process at least 2 samples (e.g., sample1 + sample2 in this tutorial) before running PCA.
Linkage Disequilibrium Pruning
Before running GWAS or population structure analyses, prune variants in LD to avoid biasing results.
| --indep-pairwise Parameter | Description |
|---|---|
| 50 | Window size (number of SNPs) |
| 10 | Step size (SNPs to shift the window) |
| 0.2 | r² threshold (remove one of a pair if r² > 0.2) |
# Identify LD-pruned variant set
plink --bfile cohort --indep-pairwise 50 10 0.2 --out cohort.ld --allow-extra-chr
# Extract pruned variants
plink --bfile cohort --extract cohort.ld.prune.in --make-bed --out cohort.pruned --allow-extra-chr
Common Error — Skipping --indep-pairwise since there are less than two founders
Warning: Skipping --indep-pairwise since there are less than two founders. (--make-founders may come in handy here.)
Cause: Same as PCA; at least 2 samples are required. With a single sample, LD pruning is meaningless.
Fix: Add a second sample, or (for testing only) declare the single sample a founder with plink --bfile cohort --make-founders ... — the pruning output will still be uninformative.
GWAS
For GWAS workflows, see the GWAS tutorial.
13. Full Workflow Summary
| Step | Tool | Input | Output |
|---|---|---|---|
| QC | FastQC + MultiQC | FASTQ | HTML report |
| Trim | fastp | FASTQ | Trimmed FASTQ |
| Align | BWA-MEM2 | FASTQ + Reference | BAM |
| Sort/Index | SAMtools | BAM | Sorted BAM |
| Mark duplicates | Picard | BAM | Markdup BAM |
| BQSR | GATK | BAM + known VCF | Recal BAM |
| Call variants | GATK HaplotypeCaller | BAM | GVCF |
| Joint genotype | GATK GenotypeGVCFs | GVCFs | VCF |
| Filter | GATK VariantFiltration | VCF | Filtered VCF |
| Annotate | SnpEff | VCF | Annotated VCF |
14. Cleanup
conda deactivate
# Optional: remove environment when done
conda env remove --name reseq
Troubleshooting — Quick Reference
Each section above has detailed error boxes. This table is a quick index; see the relevant section for the full fix.
| Problem | Section | Quick Fix |
|---|---|---|
| No space left on device | All | df -h .; delete intermediate files following the [Cleanup] tips |
| libcrypto.so.1.0.0: cannot open | 1. Setup | ln -s ${CONDA_PREFIX}/lib/libcrypto.so.3 ${CONDA_PREFIX}/lib/libcrypto.so.1.0.0 |
| Dropbox download gives HTML file | 2. QC | Quote the URL and add &dl=1 |
| fasterq-dump: Segmentation fault | 2. QC | Use the ENA download via wget |
| reference.dict already exists | 4. Reference | rm reference.dict and re-run |
| BWA-MEM2 Can't open index file | 4–5. Index/Align | Delete the partial index and re-run bwa-mem2 index |
| BAM does not have read groups | 5. Alignment | Add -R "@RG\tID:...\tSM:..." to bwa-mem2 mem |
| samtools quickcheck reports truncation | 5. Alignment, 7. BQSR | File write was interrupted; delete and re-run |
| OutOfMemoryError: Java heap space | 6. MarkDup, 8. HC | Increase heap: picard -Xmx8g / gatk --java-options "-Xmx8g" |
| VCF malformed: N genotypes while header requires M | 7. BQSR | Extract a sites-only VCF (see Section 7) |
| Contig 'X' is not defined | 7. BQSR, 10. bcftools | Match chromosome names: head reference.fasta.fai vs. VCF chromosomes |
| GVCF Premature end of file | 8. HC | Delete the truncated GVCF; re-run HaplotypeCaller |
| An index is required but was not found | 8. HC | gatk IndexFeatureFile -I sample1.g.vcf.gz |
| Could not retrieve index file (bcftools) | 10. bcftools | tabix -p vcf file.vcf.gz |
| Region query returns no results | 10. bcftools | Check the chromosome name: Ensembl TAIR10 uses 1, not Chr1 |
| SnpEff Genome 'X' not found | 11. SnpEff | snpEff databases \| grep -i arabidopsis to find the correct name |
| PLINK Invalid chromosome code 'Mt' | 12. PLINK | Add --allow-extra-chr to every PLINK command |
| PLINK At least 2 people required | 12. PLINK | Need 2+ samples for PCA / LD analyses |
References
| Resource | Link |
|---|---|
| GATK Best Practices | broadinstitute.github.io/gatk |
| BWA-MEM2 paper | Md et al. 2019, iScience |
| fastp paper | Chen et al. 2018, Bioinformatics |
| SAM flag decoder | Picard SAM Flags |
| VCF specification | samtools.github.io/hts-specs |
| SnpEff documentation | pcingola.github.io/SnpEff |
| BCFtools manual | samtools.github.io/bcftools |