HPC series - you are here
1. HPC Cluster basics - SSH, file transfer, Micromamba, Slurm, dependencies
2. Resequencing pipeline (this page) - BWA-MEM2 + GATK, variant calling
3. ChIP-Seq pipeline on HPC - minimap2 + MACS3, peak calling
4. RNA-Seq pipeline on HPC - STAR + featureCounts, DE analysis
✅ Before you start - pre-class checklist
Skim this once before running any code below:
- You can `ssh <netid>@pronghorn.rc.unr.edu` and see the Pronghorn prompt
- `sacctmgr show user $USER withassoc` shows account `cpu-s5-bch709-6` / partition `cpu-core-0`
- `~/scratch` symlink exists and points under `/data/gpfs/assoc/bch709-6/<netid>`
- `micromamba --version` works on the login node (shell hook in `~/.bashrc`)
- You ran the laptop version of the Resequencing Tutorial at least once so the biology makes sense
Any box unchecked? Go back to the HPC Cluster lesson first.
Overview: Why Parallelize on HPC?
On your laptop, the basic resequencing tutorial runs serially, one step at a time, for each sample:
sample1 → align → markdup → BQSR → HaplotypeCaller → ...
sample2 → align → markdup → BQSR → HaplotypeCaller → ...
For 2 samples this takes 4-6 hours. For 100 samples on one laptop, it would take weeks. HPC lets you run all samples simultaneously, so 100 samples take about the same wall-clock time as 1 sample.
Three HPC parallelization strategies
| Strategy | What it parallelizes | Slurm feature |
|---|---|---|
| Array jobs | Same command, different samples | #SBATCH --array=0-N |
| Scatter-gather | One sample, split by chromosome | Multiple jobs + one merge job |
| Job dependencies | Pipeline stages (align → markdup → ...) | sbatch --dependency=afterok:JOBID |
We use all three in this tutorial.
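To see the three mechanisms in isolation before the real scripts below, here is a hedged toy chain (the `--wrap` payloads are placeholders; on Pronghorn, add the `-A`/`-p` flags from Section 0):

# Array job: 3 tasks of the same command, distinguished by $SLURM_ARRAY_TASK_ID
A=$(sbatch --parsable --array=1-3 --wrap='echo "sample task $SLURM_ARRAY_TASK_ID"')
# Scatter: another array (think one task per chromosome), gated on the first
B=$(sbatch --parsable --array=1-7 --dependency=afterok:${A} \
    --wrap='echo "chromosome chunk $SLURM_ARRAY_TASK_ID"')
# Gather: a single dependent job that waits for the whole scatter array
sbatch --dependency=afterok:${B} --wrap='echo "merge step"'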
Pipeline DAG (Directed Acyclic Graph)
[align array]                    ← parallel across N samples
        ↓
[markdup array]                  ← parallel across N samples
        ↓
[BQSR array]                     ← parallel across N samples
        ↓
[HaplotypeCaller scatter]        ← parallel across N × 7 chromosomes
        ↓
[gather per sample]              ← merge chr GVCFs → per-sample GVCF
        ↓
[CombineGVCFs + GenotypeGVCFs]   ← single job (serial)
        ↓
[Filter + Annotate + PLINK]      ← single job
0. Setup on Pronghorn
Connect and check your account
ssh <YOUR_NET_ID>@pronghorn.rc.unr.edu
# Confirm your Slurm account/partition (required for sbatch)
sacctmgr show user $USER format=user,account,partition,qos%-30 \
withassoc where user=$USER
You should see: cpu-s5-bch709-6 / cpu-core-0 / student. Use those values in every #SBATCH directive below.
Create the resequencing environment
micromamba create -n reseq_bch709 -c conda-forge -c bioconda python=3.11 -y
micromamba activate reseq_bch709
micromamba install -c conda-forge -c bioconda \
fastqc 'fastp>=0.24' bwa-mem2 \
'samtools>=1.20' 'bcftools>=1.20' 'tabix>=1.11' \
openjdk=21 'picard>=3' gatk4 snpeff plink -y
# Note: we deliberately do NOT install sra-tools. Bioconda's sra-tools 3.x is
# built against GLIBC 2.27+, which is newer than Pronghorn's system libc, so the
# binary fails with "GLIBC_2.27 not found" on the compute nodes. The download
# script in Step 1 pulls FASTQ directly from ENA over HTTPS, so sra-tools is
# not needed.
# Upgrade pip first: older pip can't find the prebuilt `tiktoken`
# manylinux wheel (a transitive multiqc dep), tries to build it from
# Rust source, fails on Pronghorn (no Rust compiler).
pip install --upgrade pip
# MultiQC + pinned deps. `tiktoken<0.8` is the safety pin: older
# tiktoken has stable cp311 linux wheels.
# IMPORTANT: don't skip this - step 8 needs multiqc.
pip install --prefer-binary \
'numpy<2.0' 'pyarrow<17' 'tiktoken<0.8' 'multiqc<1.34'
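The sra-tools note above mentions the system glibc. If you want to see the version a node actually provides (the number the "GLIBC_2.27 not found" error is comparing against), one quick check:

ldd --version | head -1   # first line reports the GNU libc version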
Java versions and snpEff (read this if step 7 ever fails with `class file version 65.0`)

Recent bioconda `snpeff` (5.2+) is compiled with Java 21 (class file 65). GATK 4.6 and Picard 3 are compiled for Java 17 (class file 61), but Java is forward-compatible: a Java 21 JVM runs Java 17 jars fine. The reverse breaks: a Java 17 JVM trying to load a Java 21 jar exits with `UnsupportedClassVersionError: class file version 65.0`.

The recipe above already pins `openjdk=21`, so this Just Works. But if you (or a classmate) built `reseq_bch709` earlier with `openjdk=17`, follow ALL THREE steps below. A single `micromamba install openjdk=21` often "completes" without actually replacing the old JVM, because the conda solver may keep the old openjdk if other packages have soft pins on it.

Step 1 - verify the current Java:

micromamba run -n reseq_bch709 java -version 2>&1 | head -1

If you see `openjdk version "17."`, the patch is needed. If `openjdk version "21."`, you are already fixed - skip the rest.

Step 2 - force-upgrade openjdk + snpeff together (single command, both channels, no soft-pin):

micromamba install -n reseq_bch709 -c conda-forge -c bioconda \
    'openjdk=21' snpeff -y

If that still leaves Java 17 in place after Step 3, run the more aggressive form:

micromamba remove -n reseq_bch709 openjdk -y 2>/dev/null
micromamba install -n reseq_bch709 -c conda-forge -c bioconda \
    'openjdk=21' snpeff -y

Step 3 - verify it actually changed:

micromamba run -n reseq_bch709 java -version 2>&1 | head -1
micromamba run -n reseq_bch709 snpEff -version 2>&1 | head -1

Both lines must report Java 21 and snpEff 5.x respectively. If `java -version` still says 17, the install above failed silently - re-run Step 2's aggressive form.
MultiQC version (read this if step 8 ever fails with `rich.panel AttributeError` or RSEM `int('#')` ValueError)

MultiQC 1.34 (released 2026) has two crash bugs in its module loader. The recipe above pins `'multiqc<1.34'`, so a fresh env is fine. Existing envs built earlier may still have 1.34 - patch with the verify → install → re-verify pattern (`micromamba install` first; fall back to `pip` only if the conda solver refuses to downgrade):

# Step 1 - verify current
micromamba run -n reseq_bch709 multiqc --version
# Step 2 - preferred: conda-side install (overrides any pip-installed copy)
micromamba install -n reseq_bch709 -c bioconda 'multiqc<1.34' -y
# Step 3 - re-verify. If still 1.34, the solver kept the old version
# (soft-pin from another package). Force it with pip:
micromamba run -n reseq_bch709 multiqc --version
micromamba run -n reseq_bch709 pip install -U --force-reinstall 'multiqc<1.34'
# Step 4 - final verify; should print 1.33.x or earlier.
micromamba run -n reseq_bch709 multiqc --version
Patch libcrypto so samtools / bcftools run (do this now, not after they crash):
Bioconda's samtools / bcftools / tabix are linked against libcrypto.so.1.0.0, but the current OpenSSL package in the env ships libcrypto.so.3 (or .1.1). Without a symlink, you'll see:
samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file
Create the symlink once, right after activating:
# Must be run with the env ACTIVE β $CONDA_PREFIX points at the env folder
cd "$CONDA_PREFIX/lib"
if [ -f libcrypto.so.1.1 ]; then ln -sf libcrypto.so.1.1 libcrypto.so.1.0.0
elif [ -f libcrypto.so.3 ]; then ln -sf libcrypto.so.3 libcrypto.so.1.0.0
fi
cd - > /dev/null
# Verify the three most-used HTS tools all run
samtools --version | head -1   # → samtools 1.xx
bcftools --version | head -1   # → bcftools 1.xx
tabix --version | head -1      # → tabix (htslib) 1.xx
Expected output:
samtools 1.23.1
bcftools 1.23.1
tabix (htslib) 1.23.1
If all three print a version number (no libcrypto error), your environment is ready. If any still error, ask the instructor before moving on.
📌 Activate once in your login shell - every `sbatch` inherits the environment

Do this ONCE in the login shell before running any `sbatch` command:

micromamba activate reseq_bch709
which bwa-mem2   # confirm: should print a path inside ~/micromamba/envs/reseq_bch709/
which samtools
which gatk

By default, `sbatch` submits jobs with `--export=ALL`, which means each Slurm job inherits your current shell's environment - including `PATH` pointing at the activated env. So you don't need to put `micromamba activate` inside every batch script.

Sanity check: submit a tiny test job and confirm the tool is found on the compute node too:

sbatch -A cpu-s5-bch709-6 -p cpu-core-0 --time=00:05:00 --wrap="which bwa-mem2 && bwa-mem2 version"
# check the slurm-<jobid>.out log - should show the same path + version

Expected slurm-<jobid>.out:

~/micromamba/envs/reseq_bch709/bin/bwa-mem2
Looking to launch executable ".../bwa-mem2.avx2", simd = .avx2
Launching executable ".../bwa-mem2.avx2"
2.2.1

If you open a new SSH session, the activation is lost - just run `micromamba activate reseq_bch709` again before submitting.
Create the project directory on scratch
Never put big BAM/VCF files in $HOME; it has tight quotas and slow I/O. Use ~/scratch.
mkdir -p ~/scratch/reseq/{raw,trim,bam,vcf,logs,scripts}
cd ~/scratch/reseq
| Folder | Contents |
|---|---|
| `raw/` | Downloaded FASTQ files |
| `trim/` | Adapter-trimmed FASTQ |
| `bam/` | Aligned / markdup / recalibrated BAMs |
| `vcf/` | GVCFs and final VCFs |
| `logs/` | Slurm `.out` / `.err` log files |
| `scripts/` | All the `.sh` batch scripts below |
Define your sample sheet
Each row is TAB-separated (NOT spaces): sample_id <TAB> SRR_accession. Downstream scripts use cut -f1 and awk -F'\t', both of which expect real tabs; if you accidentally use spaces, every script breaks.
Use printf so the \t is unambiguously a tab character (a heredoc with literal tabs often gets converted to spaces during copy-paste from a browser):
printf 'sample1\tSRR519585\nsample2\tSRR519586\n' > ~/scratch/reseq/samples.tsv
Verify it's actually tab-separated (this should print just 2, meaning every row has exactly 2 tab-separated fields):
awk -F'\t' '{print NF}' ~/scratch/reseq/samples.tsv | sort -u
# Expected output: 2
# If you see 1, you have spaces instead of tabs - re-run the printf above.
cat ~/scratch/reseq/samples.tsv # quick visual check
Adding more samples later is easy: append rows with `printf 'sampleN\tSRR_NNN\n' >> samples.tsv` (use `>>` to append, not `>`, which overwrites!), then update the `--array=` range in each script to match the new row count - one number per script. A sketch for deriving that number automatically follows in Step 1.
1. Download FASTQ in parallel (Array Job)
One Slurm array task per sample. Each task downloads its own R1/R2 independently and in parallel.
scripts/01_download.sh
#!/bin/bash
#SBATCH --job-name=download
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=2
#SBATCH --mem=4g
#SBATCH --time=12:00:00
#SBATCH --array=1-2 # one task per sample (rows in samples.tsv)
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/01_download_%A_%a.out # %A=array job id, %a=task index
set -euo pipefail
cd ~/scratch/reseq
# Pick the N-th line from samples.tsv using the array task ID
LINE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.tsv)
SAMPLE=$(echo "$LINE" | cut -f1)
SRR=$(echo "$LINE" | cut -f2)
echo "[task ${SLURM_ARRAY_TASK_ID}] Downloading ${SAMPLE} (${SRR}) from ENA"
# Download paired-end FASTQ from ENA (mirrors NCBI SRA, gzipped FASTQ ready-to-use).
# We deliberately avoid `prefetch` / `fasterq-dump`: bioconda's sra-tools 3.x is
# built against GLIBC 2.27+, which is newer than Pronghorn's system libc, so the
# binary fails with "GLIBC_2.27 not found" on the compute nodes.
mkdir -p raw
# Ask ENA for the exact fastq URLs (handles paired-end / single-end / multi-file).
META_URL="https://www.ebi.ac.uk/ena/portal/api/filereport?accession=${SRR}&result=read_run&fields=fastq_ftp&format=tsv"
URLS=$(curl -fsSL --retry 3 --max-time 60 "${META_URL}" \
| tail -n +2 \
| awk -F'\t' '{print $NF}' \
| tr ';' '\n' \
| sed '/^$/d')
[ -n "${URLS}" ] || { echo "ERROR: ENA returned no fastq URLs for ${SRR}"; exit 1; }
i=1
# --retry-all-errors + -C -: ENA's CDN occasionally drops mid-transfer with an
# SSL eof; --retry-all-errors retries on transport errors and -C - resumes from
# the partial file instead of restarting at byte 0 (see issue #64).
for U in ${URLS}; do
OUT="raw/${SAMPLE}_R${i}.fastq.gz"
echo "[task ${SLURM_ARRAY_TASK_ID}] -> https://${U}"
curl -fsSL --retry 5 --retry-all-errors --retry-delay 30 --max-time 3600 -C - -o "${OUT}" "https://${U}"
# Don't use a `[ -s "${OUT}" ]` partial-file skip-check here: with -C - a
# short file from a prior failed attempt would falsely pass. curl -f exits
# non-zero on real failures, which set -e turns into a hard stop.
i=$((i+1))
done
ls -lh raw/${SAMPLE}_R*.fastq.gz
Submit (capture the job ID so downstream steps can depend on it):
# Make sure the env is active in THIS shell (once per login session)
micromamba activate reseq_bch709
cd ~/scratch/reseq
DL_JID=$(sbatch --parsable scripts/01_download.sh)
echo "Download job: ${DL_JID}"
squeue -u $USER # you should see 2 tasks (download_1, download_2) running in parallel
Expected output of logs/01_download_<jobid>_1.out (sample1 task; sample2's log is similar with SRR519586):
[task 1] Downloading sample1 (SRR519585) from ENA
[task 1] -> https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519585/SRR519585_1.fastq.gz
[task 1] -> https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR519/SRR519585/SRR519585_2.fastq.gz
-rw-r--r-- 1 <netid> rc-bch709-6 2.0G <date> raw/sample1_R1.fastq.gz
-rw-r--r-- 1 <netid> rc-bch709-6 2.0G <date> raw/sample1_R2.fastq.gz
Sample2 ends up around 2.6 G per mate (SRR519586 is a slightly deeper run). If either file shows 0 bytes, ENA hiccupped; re-submit just that array task with sbatch --array=2 scripts/01_download.sh.
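Beyond eyeballing file sizes, `gzip -t` is a cheap end-to-end integrity check worth running before step 3 (an optional extra, not part of the scripts); a truncated transfer fails it immediately:

for F in ~/scratch/reseq/raw/*.fastq.gz; do
    gzip -t "$F" && echo "OK  $F" || echo "BAD $F (re-download this task)"
done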
| #SBATCH option | Purpose |
|---|---|
| --array=1-2 | Launches 2 tasks, each with its own $SLURM_ARRAY_TASK_ID (1 and 2) |
| %A_%a in log name | %A = array job ID, %a = task index; keeps each task's log separate |
Scaling up: for 100 samples, change `--array=1-2` to `--array=1-100`. Slurm will run as many in parallel as your QoS allows.
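A convenient variant (an assumption, not something run_all.sh does): derive the range from the sample sheet at submit time. An `--array` given on the command line overrides the `#SBATCH --array` line inside the script:

N=$(wc -l < samples.tsv)                      # rows in the sample sheet
sbatch --array=1-${N} scripts/01_download.sh  # exactly one task per row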
2. Download and Index Reference (Single Job)
The reference is shared across all samples, so we only build it once.
scripts/02_reference.sh
#!/bin/bash
#SBATCH --job-name=reference
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=4
#SBATCH --mem=16g
#SBATCH --time=06:00:00
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/02_reference_%j.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called;
# the PATH (with bwa-mem2, samtools, gatk, ...) is inherited automatically.
cd ~/scratch/reseq
# ---- Download TAIR10 from TAIR (www.arabidopsis.org) ----
# Primary: TAIR (canonical source) - uses a self-signed cert, so curl needs -k
# Fallback: EBI Ensembl Plants mirror (most stable mirror; NCBI is last resort)
TAIR_URL="https://www.arabidopsis.org/api/download-files/download?filePath=Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas.gz"
EBI_URL="https://ftp.ebi.ac.uk/ensemblgenomes/pub/plants/release-60/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"
NCBI_URL="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz"
OUT=reference.fasta.gz
rm -f "$OUT"
for URL in "$TAIR_URL" "$EBI_URL" "$NCBI_URL"; do
echo "[ref] trying $URL"
if curl -kfsSL --retry 3 --max-time 900 -o "$OUT" "$URL" && [ -s "$OUT" ]; then
echo "[ref] downloaded from $URL"
break
fi
rm -f "$OUT"
done
[ -s "$OUT" ] || { echo "ERROR: TAIR10 download failed from all sources"; exit 1; }
gunzip -f "$OUT"
# Normalize chromosome names to match the 1001genomes VCF (1..5, Mt, Pt):
# - TAIR fasta uses Chr1..Chr5, ChrM, ChrC
# - NCBI RefSeq uses NC_003070.9 ...
# - EBI Ensembl already uses 1..5, Mt, Pt (no rename needed)
if grep -q '^>Chr' reference.fasta; then
sed -i -E 's/^>Chr([0-9]+).*/>\1/; s/^>ChrM.*/>Mt/; s/^>ChrC.*/>Pt/' reference.fasta
elif grep -q '^>NC_' reference.fasta; then
awk 'BEGIN{
m["NC_003070.9"]="1"; m["NC_003071.7"]="2"; m["NC_003074.8"]="3"
m["NC_003075.7"]="4"; m["NC_003076.8"]="5"
m["NC_037304.1"]="Mt"; m["NC_000932.1"]="Pt"
}
/^>/{
name=substr($1,2); print ">" (name in m ? m[name] : name); next
}
{print}' reference.fasta > reference.renamed.fa && mv -f reference.renamed.fa reference.fasta
fi
# Build all indices. bwa-mem2 index / samtools faidx overwrite their outputs by
# default, but Picard CreateSequenceDictionary refuses to overwrite an existing
# .dict and aborts with `file:///.../reference.dict already exists. Delete this
# file and try again`. rm -f makes the whole step idempotent so a re-run of
# 02_reference.sh after a partial failure just works.
bwa-mem2 index reference.fasta
samtools faidx reference.fasta
rm -f reference.dict
picard CreateSequenceDictionary R=reference.fasta O=reference.dict
# Download and prepare known variants (sites-only to avoid malformed VCF).
# 1001genomes.org is sometimes slow; the curl below retries 5x and resumes.
KSITES_URL="https://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_only_ACGTN.vcf.gz"
# The 1001genomes VCF is ~19 GB; on Pronghorn outbound (~3 MB/s) the download
# takes ~3 hours. We use `-C -` to resume on retry instead of restarting at
# byte 0, and drop --max-time so a single attempt isn't capped (Slurm --time
# already bounds the whole job).
# --retry-all-errors + -C -: 1001genomes occasionally drops the connection
# with an SSL eof on a 19 GB download; --retry-all-errors retries on transport
# errors and -C - resumes from the partial file (see issue #64).
curl -fsSL --retry 5 --retry-all-errors --retry-delay 30 -C - \
-o 1001genomes_snp-short-indel_only_ACGTN.vcf.gz "${KSITES_URL}"
# Atomic rename: write to a .tmp file, only promote to known_sites.vcf.gz once
# bgzip + tabix BOTH succeed and outputs are non-empty. If anything in this
# block fails, the original 1001genomes file is preserved so the student can
# re-run step 02 without re-downloading 19 GB.
zcat 1001genomes_snp-short-indel_only_ACGTN.vcf.gz \
| awk 'BEGIN{OFS="\t"} /^##/{print; next} /^#CHROM/{print $1,$2,$3,$4,$5,$6,$7,$8; next} {print $1,$2,$3,$4,$5,$6,$7,$8}' \
| bgzip > known_sites.vcf.gz.tmp
tabix -p vcf known_sites.vcf.gz.tmp
[ -s known_sites.vcf.gz.tmp ] && [ -s known_sites.vcf.gz.tmp.tbi ] || {
echo "ERROR: known_sites.vcf.gz build failed (empty output). Re-run step 02." >&2
rm -f known_sites.vcf.gz.tmp known_sites.vcf.gz.tmp.tbi
exit 1
}
mv -f known_sites.vcf.gz.tmp known_sites.vcf.gz
mv -f known_sites.vcf.gz.tmp.tbi known_sites.vcf.gz.tbi
rm -f 1001genomes_snp-short-indel_only_ACGTN.vcf.gz
echo "Reference prep done."
Submit (independent of download, so it can run in parallel with Step 1):
REF_JID=$(sbatch --parsable scripts/02_reference.sh)
echo "Reference job: ${REF_JID}"
Expected output of logs/02_reference_<jobid>.out (truncated; first lines + tail):
[ref] trying https://www.arabidopsis.org/api/download-files/download?filePath=Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas.gz
[ref] downloaded from https://www.arabidopsis.org/api/download-files/download?filePath=...TAIR10_chr_all.fas.gz
Looking to launch executable ".../bwa-mem2.avx2", simd = .avx2
[bwa_index] Pack FASTA... 0.55 sec
* Entering FMI_search
ref seq len = 239337268
build suffix-array ticks = 62781215970
build fm-index ticks = 13953625410
Total time taken: 41.8652
INFO <date> CreateSequenceDictionary
...
[<date>] CreateSequenceDictionary OUTPUT=reference.dict REFERENCE=reference.fasta ...
[<date>] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.02 minutes.
Reference prep done.
Quick sanity check after the job finishes - you should see the FASTA, the bwa-mem2 index files, the .fai, and the Picard .dict:
ls -lh ~/scratch/reseq/reference.*
Expected output:
-rw-r--r-- 1 <netid> rc-bch709-6 116M <date> reference.fasta
-rw-r--r-- 1 <netid> rc-bch709-6 229M <date> reference.fasta.0123
-rw-r--r-- 1 <netid> rc-bch709-6 7.4K <date> reference.fasta.amb
-rw-r--r-- 1 <netid> rc-bch709-6 230 <date> reference.fasta.ann
-rw-r--r-- 1 <netid> rc-bch709-6 371M <date> reference.fasta.bwt.2bit.64
-rw-r--r-- 1 <netid> rc-bch709-6 175 <date> reference.fasta.fai
-rw-r--r-- 1 <netid> rc-bch709-6 29M <date> reference.fasta.pac
-rw-r--r-- 1 <netid> rc-bch709-6 1.0K <date> reference.dict
After known_sites.vcf.gz finishes building (the long curl + bgzip + tabix step at the end of the script), confirm both the VCF and its tabix index exist:
ls -lh ~/scratch/reseq/known_sites.vcf.gz*
Expected output:
-rw-r--r-- 1 <netid> rc-bch709-6 52M <date> known_sites.vcf.gz
-rw-r--r-- 1 <netid> rc-bch709-6 220K <date> known_sites.vcf.gz.tbi
If the .tbi is missing, re-run step 02; BaseRecalibrator in step 04 will fail without it.
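Step 04 also requires the contig names in known_sites.vcf.gz to match reference.fasta exactly (GATK aborts with an incompatible-contigs error otherwise). An optional cross-check - `tabix -l` lists the contigs recorded in the index:

cut -f1 ~/scratch/reseq/reference.fasta.fai   # contig names in the reference
tabix -l ~/scratch/reseq/known_sites.vcf.gz   # contig names in the known-sites VCF
# every contig in the VCF must appear in the reference list (1..5, Mt, Pt)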
3. Trim and Align in parallel (Array Job with Dependency)
fastp and bwa-mem2 each accept multiple threads. We give each array task 8 cores to cut wall time.
scripts/03_align.sh
#!/bin/bash
#SBATCH --job-name=align
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=8
#SBATCH --mem=32g
#SBATCH --time=12:00:00
#SBATCH --array=1-2
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/03_align_%A_%a.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called;
# the PATH (with bwa-mem2, samtools, gatk, ...) is inherited automatically.
cd ~/scratch/reseq
LINE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.tsv)
SAMPLE=$(echo "$LINE" | cut -f1)
# ---- Trim ----
fastp \
--in1 raw/${SAMPLE}_R1.fastq.gz \
--in2 raw/${SAMPLE}_R2.fastq.gz \
--out1 trim/${SAMPLE}_R1.trimmed.fq.gz \
--out2 trim/${SAMPLE}_R2.trimmed.fq.gz \
--detect_adapter_for_pe \
--qualified_quality_phred 20 \
--length_required 50 \
--thread ${SLURM_CPUS_PER_TASK} \
--html trim/${SAMPLE}_fastp.html \
--json trim/${SAMPLE}_fastp.json
# ---- Align + sort ----
# pipefail (already enabled by `set -euo pipefail` above) makes a failure
# anywhere in the bwa-mem2 | samtools pipe kill the job
bwa-mem2 mem \
-t ${SLURM_CPUS_PER_TASK} \
-R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:lib_${SAMPLE}\tPU:unit_${SAMPLE}" \
reference.fasta \
trim/${SAMPLE}_R1.trimmed.fq.gz \
trim/${SAMPLE}_R2.trimmed.fq.gz \
| samtools sort -@ ${SLURM_CPUS_PER_TASK} -T /tmp/${SAMPLE}_sort \
-o bam/${SAMPLE}.bam
samtools index -@ ${SLURM_CPUS_PER_TASK} bam/${SAMPLE}.bam
samtools quickcheck bam/${SAMPLE}.bam && echo "BAM OK"
Expected output of logs/03_align_<jobid>_1.out (sample1 - fastp summary at the top, bwa-mem2 + samtools sort at the bottom; truncated):
Detecting adapter sequence for read1...
>TruSeq2_PE_r
AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
Read1 before filtering:
total reads: 29418122
total bases: 2235777272
Q20 bases: 2090582549(93.5059%)
Q30 bases: 1935916821(86.5881%)
Read1 after filtering:
total reads: 26525643
total bases: 2003802159
Q20 bases: 1942580727(96.9447%)
Q30 bases: 1815798338(90.6176%)
Filtering result:
reads passed filter: 53051286
reads failed due to low quality: 4575222
reads failed due to too many N: 64736
reads failed due to too short: 1145000
# ...
fastp v1.3.3, time used: 137 seconds
Looking to launch executable ".../bwa-mem2.avx2", simd = .avx2
Time taken for main_mem function: 1797.12 sec
[bam_sort_core] merging from 2 files and 8 in-memory blocks...
BAM OK
The trim/sample1_fastp.json summary captures the same numbers in a parseable form (the HTML report is the human-friendly view of the same data):
ls -lh ~/scratch/reseq/trim/sample1_fastp.* ~/scratch/reseq/bam/sample1.bam
Expected output:
-rw-r--r-- 1 <netid> rc-bch709-6 444K <date> trim/sample1_fastp.html
-rw-r--r-- 1 <netid> rc-bch709-6 109K <date> trim/sample1_fastp.json
-rw-r--r-- 1 <netid> rc-bch709-6 2.9G <date> bam/sample1.bam
A quick samtools flagstat confirms the BAM looks reasonable (>98% of reads mapped, ~91% properly paired):
samtools flagstat ~/scratch/reseq/bam/sample1.bam
Expected output:
53076354 + 0 in total (QC-passed reads + QC-failed reads)
53051286 + 0 primary
0 + 0 secondary
25068 + 0 supplementary
0 + 0 duplicates
52354774 + 0 mapped (98.64% : N/A)
52329706 + 0 primary mapped (98.64% : N/A)
53051286 + 0 paired in sequencing
26525643 + 0 read1
26525643 + 0 read2
48411760 + 0 properly paired (91.25% : N/A)
52276894 + 0 with itself and mate mapped
52812 + 0 singletons (0.10% : N/A)
490848 + 0 with mate mapped to a different chr
129225 + 0 with mate mapped to a different chr (mapQ>=5)
Submit with dependencies - wait for both download (raw FASTQ) and reference prep (index) to succeed before aligning:
ALIGN_JID=$(sbatch --parsable --dependency=afterok:${DL_JID}:${REF_JID} scripts/03_align.sh)
echo "Align: ${ALIGN_JID}"
squeue -u $USER # ALIGN_JID should show state PD with reason (Dependency)
If you skipped capturing DL_JID or REF_JID earlier, you can look them up with squeue -u $USER or just depend on whichever you know; Slurm only needs valid predecessor IDs.
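One hedged way to recover a lost job ID is to ask accounting for today's jobs by name (the names below match the `#SBATCH --job-name` values in these scripts):

sacct -u $USER --starttime=today --name=download,reference \
      --format=JobID,JobName,State --noheader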
`${SLURM_CPUS_PER_TASK}` is automatically set by Slurm to match `--cpus-per-task`. Using it everywhere means changing one `#SBATCH` line automatically updates every thread count below.

`-T /tmp/${SAMPLE}_sort` puts `samtools sort` temp files on each node's local disk (fast) instead of shared scratch (slower, and can fill up).
4. Mark Duplicates + BQSR (Array Job)
MarkDuplicates and BQSR aren't heavily multi-threaded, but we still get per-sample parallelism.
scripts/04_markdup_bqsr.sh
#!/bin/bash
#SBATCH --job-name=markdup_bqsr
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=4
#SBATCH --mem=16g
#SBATCH --time=08:00:00
#SBATCH --array=1-2
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/04_markdup_bqsr_%A_%a.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called;
# the PATH (with bwa-mem2, samtools, gatk, ...) is inherited automatically.
cd ~/scratch/reseq
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.tsv | cut -f1)
# ---- MarkDuplicates ----
picard -Xmx12g MarkDuplicates \
I=bam/${SAMPLE}.bam \
O=bam/${SAMPLE}.markdup.bam \
M=bam/${SAMPLE}.markdup.metrics \
VALIDATION_STRINGENCY=SILENT
samtools index -@ ${SLURM_CPUS_PER_TASK} bam/${SAMPLE}.markdup.bam
# ---- Preflight: known_sites.vcf.gz must exist (built by step 02) ----
if [ ! -s known_sites.vcf.gz ] || [ ! -s known_sites.vcf.gz.tbi ]; then
echo "ERROR: known_sites.vcf.gz (and .tbi) not found in $(pwd)." >&2
echo " Step 02 (02_reference.sh) did not finish β re-run it before step 04." >&2
ls -lh 1001genomes_*.vcf.gz known_sites.vcf.gz* 2>/dev/null || true
exit 1
fi
# ---- BaseRecalibrator ----
gatk --java-options "-Xmx12g" BaseRecalibrator \
-I bam/${SAMPLE}.markdup.bam \
-R reference.fasta \
--known-sites known_sites.vcf.gz \
-O bam/${SAMPLE}.recal.table
# ---- ApplyBQSR ----
gatk --java-options "-Xmx12g" ApplyBQSR \
-I bam/${SAMPLE}.markdup.bam \
-R reference.fasta \
--bqsr-recal-file bam/${SAMPLE}.recal.table \
-O bam/${SAMPLE}.recal.bam
samtools index -@ ${SLURM_CPUS_PER_TASK} bam/${SAMPLE}.recal.bam
# Clean up intermediate markdup BAM
rm -f bam/${SAMPLE}.bam bam/${SAMPLE}.bam.bai
rm -f bam/${SAMPLE}.markdup.bam bam/${SAMPLE}.markdup.bam.bai
Expected tail of logs/04_markdup_bqsr_<jobid>_1.out - Picard wraps up MarkDuplicates, then GATK starts BaseRecalibrator and ApplyBQSR (truncated):
INFO <date> MarkDuplicates Marking 14734338 records as duplicates.
INFO <date> MarkDuplicates Found 0 optical duplicate clusters.
INFO <date> MarkDuplicates Written 50,000,000 records. Elapsed time: 00:08:35s.
INFO <date> MarkDuplicates Writing complete. Closing input iterator.
[<date>] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 13.21 minutes.
Using GATK jar .../gatk-package-4.6.2.0-local.jar
Running:
java ... -Xmx12g ... BaseRecalibrator -I bam/sample1.markdup.bam -R reference.fasta --known-sites known_sites.vcf.gz -O bam/sample1.recal.table
INFO BaseRecalibrator - The Genome Analysis Toolkit (GATK) v4.6.2.0
INFO BaseRecalibrationEngine - The covariates being used here:
INFO BaseRecalibrationEngine - ReadGroupCovariate
INFO BaseRecalibrationEngine - QualityScoreCovariate
INFO BaseRecalibrationEngine - ContextCovariate
INFO BaseRecalibrationEngine - CycleCovariate
INFO ProgressMeter - Starting traversal
INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
INFO ProgressMeter - 1:1523668 0.2 371000 2221335.2
# ...
INFO ProgressMeter - done. <X.X> <NNNNNNNN> <rate>
INFO BaseRecalibrator - Finished. Elapsed time: <NN.NN> minutes.
INFO ApplyBQSR - Done. Elapsed time: <NN.NN> minutes.
Picard writes a metrics file; open bam/sample1.markdup.metrics to see the duplication rate per library (the ## METRICS CLASS block is the part MultiQC parses):
head -10 ~/scratch/reseq/bam/sample1.markdup.metrics
Expected output:
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[bam/sample1.bam] OUTPUT=bam/sample1.markdup.bam METRICS_FILE=bam/sample1.markdup.metrics ...
## htsjdk.samtools.metrics.StringHeader
# Started on: <date>
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
lib_sample1 52812 26138447 25068 721580 16888 7358725 0 0.281567 37224722
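If you only want that headline duplication number, a small awk sketch over the metrics file shown above (PERCENT_DUPLICATION is the 9th column of the METRICS CLASS block):

awk '/^## METRICS CLASS/{f=1; next}   # start at the metrics block
     f && /^LIBRARY/{next}            # skip the header row
     f && NF {print $1, $9; exit}' \
    ~/scratch/reseq/bam/sample1.markdup.metrics
# -> lib_sample1 0.281567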
A PERCENT_DUPLICATION of ~0.28 (28%) is on the high side; these public SRA libraries were sequenced deeply on a small genome, so some duplication is expected. After BQSR finishes, the recalibrated BAM is what every downstream step uses:
ls -lh ~/scratch/reseq/bam/sample1.recal.bam*
Expected output:
# example output (your numbers will differ)
-rw-r--r-- 1 <netid> rc-bch709-6 2.7G <date> bam/sample1.recal.bam
-rw-r--r-- 1 <netid> rc-bch709-6 350K <date> bam/sample1.recal.bam.bai
Submit:
MARKDUP_JID=$(sbatch --parsable --dependency=afterok:${ALIGN_JID} scripts/04_markdup_bqsr.sh)
echo "MarkDup+BQSR: ${MARKDUP_JID}"
5. HaplotypeCaller - Scatter by Chromosome (2-D Array)
HaplotypeCaller is the slowest step. For Arabidopsis (7 chromosomes × N samples), we use a 2-dimensional array: each task handles one (sample, chromosome) pair. This is the biggest wall-time win.
For N=2 samples and 7 chromosomes (1, 2, 3, 4, 5, Mt, Pt) → 14 parallel tasks instead of 2 serial ones.
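To convince yourself the index arithmetic in the script below is right, you can enumerate the mapping on the login node - this loop mirrors the same `/` and `%` math (a sanity check, not a pipeline step):

CHROMS=(1 2 3 4 5 Mt Pt); N_CHROMS=${#CHROMS[@]}
for TASK in $(seq 1 14); do
    IDX=$((TASK - 1))
    echo "task ${TASK} -> samples.tsv line $((IDX / N_CHROMS + 1)), chr ${CHROMS[$((IDX % N_CHROMS))]}"
done
# task 1..7 -> line 1, chr 1..Pt; task 8..14 -> line 2, chr 1..Pt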
scripts/05a_haplotypecaller.sh
#!/bin/bash
#SBATCH --job-name=hc_scatter
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=4
#SBATCH --mem=16g
#SBATCH --time=12:00:00
#SBATCH --array=1-14 # N_SAMPLES * N_CHROMS = 2 * 7
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/05a_hc_%A_%a.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called;
# the PATH (with bwa-mem2, samtools, gatk, ...) is inherited automatically.
cd ~/scratch/reseq
# Map the linear task ID to (sample_index, chr_index)
CHROMS=(1 2 3 4 5 Mt Pt)
N_CHROMS=${#CHROMS[@]}
# 0-indexed within the array
IDX=$((SLURM_ARRAY_TASK_ID - 1))
SAMPLE_IDX=$((IDX / N_CHROMS + 1)) # 1-indexed line in samples.tsv
CHR_IDX=$((IDX % N_CHROMS))
SAMPLE=$(sed -n "${SAMPLE_IDX}p" samples.tsv | cut -f1)
CHR=${CHROMS[$CHR_IDX]}
echo "Task ${SLURM_ARRAY_TASK_ID}: sample=${SAMPLE}, chr=${CHR}"
mkdir -p vcf/scatter
gatk --java-options "-Xmx12g" HaplotypeCaller \
-R reference.fasta \
-I bam/${SAMPLE}.recal.bam \
-O vcf/scatter/${SAMPLE}.${CHR}.g.vcf.gz \
-L ${CHR} \
-ERC GVCF \
--native-pair-hmm-threads ${SLURM_CPUS_PER_TASK}
Expected tail of logs/05a_hc_<jobid>_1.out (sample1, chromosome 1 - the GVCF mode is verbose; the key lines are the ProgressMeter and the final summary):
# example output (your numbers will differ)
INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.6.2.0
INFO HaplotypeCaller - HTSJDK Version: 4.2.0
INFO HaplotypeCaller - Picard Version: 3.4.0
INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
INFO HaplotypeCaller - Defragmenting 100 events for 1 samples
INFO ProgressMeter - Starting traversal
INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
INFO ProgressMeter - 1:1523668 0.5 53000 106000.0
# ...
INFO ProgressMeter - 1:30425192 14.8 2920000 197297.3
INFO ProgressMeter - traversal complete. Processed <N> total regions in <M.M> minutes.
INFO HaplotypeCaller - Shutting down engine
INFO HaplotypeCaller - Done. Elapsed time: <NN.NN> minutes.
After the array finishes, you should see 14 per-(sample, chromosome) GVCFs (2 samples × 7 chromosomes):
ls -lh ~/scratch/reseq/vcf/scatter/
Expected output:
# example output (your numbers will differ)
-rw-r--r-- 1 <netid> rc-bch709-6 12M <date> sample1.1.g.vcf.gz
-rw-r--r-- 1 <netid> rc-bch709-6 280K <date> sample1.1.g.vcf.gz.tbi
-rw-r--r-- 1 <netid> rc-bch709-6 8.0M <date> sample1.2.g.vcf.gz
# ... one pair per (sample, chr): 14 GVCFs + 14 .tbi total
-rw-r--r-- 1 <netid> rc-bch709-6 1.1M <date> sample2.Pt.g.vcf.gz
-rw-r--r-- 1 <netid> rc-bch709-6 18K <date> sample2.Pt.g.vcf.gz.tbi
scripts/05b_gather_gvcf.sh - Gather per-chromosome GVCFs into one per-sample GVCF
#!/bin/bash
#SBATCH --job-name=hc_gather
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=2
#SBATCH --mem=8g
#SBATCH --time=02:00:00
#SBATCH --array=1-2 # one task per sample
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/05b_gather_%A_%a.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called;
# the PATH (with bwa-mem2, samtools, gatk, ...) is inherited automatically.
cd ~/scratch/reseq
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.tsv | cut -f1)
# Merge the 7 chromosome GVCFs for this sample into one
CHROMS=(1 2 3 4 5 Mt Pt)
INPUTS=""
for CHR in "${CHROMS[@]}"; do
INPUTS+=" -I vcf/scatter/${SAMPLE}.${CHR}.g.vcf.gz"
done
# MergeVcfs sorts inputs against the reference sequence dictionary, so
# the order of -I arguments does not matter. (GatherVcfs is faster but
# requires inputs to be in strict dict order; TAIR10 has Pt before Mt,
# which is easy to get wrong.)
gatk --java-options "-Xmx6g" MergeVcfs \
${INPUTS} \
-O vcf/${SAMPLE}.g.vcf.gz
# Verify and clean up scatter files
ls -lh vcf/${SAMPLE}.g.vcf.gz
rm -rf vcf/scatter/${SAMPLE}.*.g.vcf.gz vcf/scatter/${SAMPLE}.*.g.vcf.gz.tbi
Expected tail of logs/05b_gather_<jobid>_1.out - MergeVcfs sorts the per-chromosome GVCFs against the reference dict and writes a .tbi index in one step:
# example output (your numbers will differ)
INFO MergeVcfs - Checking inputs.
[<date>] picard.vcf.MergeVcfs done. Elapsed time: 0.10 minutes.
-rw-r--r-- 1 <netid> rc-bch709-6 90M <date> vcf/sample1.g.vcf.gz
Submit with chained dependencies:
HC_JID=$(sbatch --parsable --dependency=afterok:${MARKDUP_JID} scripts/05a_haplotypecaller.sh)
GATHER_JID=$(sbatch --parsable --dependency=afterok:${HC_JID} scripts/05b_gather_gvcf.sh)
echo "HaplotypeCaller: ${HC_JID} | Gather: ${GATHER_JID}"
Array size formula: N_SAMPLES × N_CHROMS. For 10 samples × 7 chroms, use `--array=1-70`. For human (25 chroms) × 10 samples, `--array=1-250`.
Why scatter-gather beats per-sample: HaplotypeCaller on a whole Arabidopsis genome takes ~60 min. Per chromosome it takes 5-15 min. 7 chromosomes in parallel finish in ~15 min (limited by the slowest chromosome). For humans the speedup is much larger.
6. Joint Genotyping (Single Job)
Once all per-sample GVCFs exist, combine them and run joint genotyping. This step is serial but fast (minutes).
scripts/06_joint_genotype.sh
#!/bin/bash
#SBATCH --job-name=joint_geno
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=4
#SBATCH --mem=32g
#SBATCH --time=04:00:00
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/06_joint_%j.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called;
# the PATH (with bwa-mem2, samtools, gatk, ...) is inherited automatically.
cd ~/scratch/reseq
# Build the -V argument list from samples.tsv (auto-scales with sample count)
INPUTS=""
while read SAMPLE SRR; do
INPUTS+=" -V vcf/${SAMPLE}.g.vcf.gz"
done < samples.tsv
gatk --java-options "-Xmx24g" CombineGVCFs \
-R reference.fasta \
${INPUTS} \
-O vcf/cohort.g.vcf.gz
gatk --java-options "-Xmx24g" GenotypeGVCFs \
-R reference.fasta \
-V vcf/cohort.g.vcf.gz \
-O vcf/cohort.vcf.gz
echo "Joint genotyping done. Variants:"
bcftools stats vcf/cohort.vcf.gz | grep "^SN"
Expected tail of logs/06_joint_<jobid>.out - CombineGVCFs then GenotypeGVCFs, ending with the bcftools stats summary:
# example output (your numbers will differ)
INFO CombineGVCFs - The Genome Analysis Toolkit (GATK) v4.6.2.0
INFO ProgressMeter - Starting traversal
INFO ProgressMeter - traversal complete. Processed <N> total variants in <M.M> minutes.
INFO CombineGVCFs - Shutting down engine
INFO CombineGVCFs - Done. Elapsed time: <NN.NN> minutes.
INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.6.2.0
INFO ProgressMeter - Starting traversal
INFO ProgressMeter - traversal complete. Processed <N> total variants in <M.M> minutes.
INFO GenotypeGVCFs - Done. Elapsed time: <NN.NN> minutes.
Joint genotyping done. Variants:
SN 0 number of samples: 2
SN 0 number of records: 1219245
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 1005193
SN 0 number of MNPs: 0
SN 0 number of indels: 215941
SN 0 number of others: 0
SN 0 number of multiallelic sites: 15281
SN 0 number of multiallelic SNP sites: 3032
For Arabidopsis with 2 deeply-sequenced samples, expect on the order of ~1-2 M raw cohort variants before filtering. You can confirm with a one-liner once the job completes:
bcftools view -H ~/scratch/reseq/vcf/cohort.vcf.gz | wc -l
Submit:
JOINT_JID=$(sbatch --parsable --dependency=afterok:${GATHER_JID} scripts/06_joint_genotype.sh)
For >100 samples use `GenomicsDBImport` instead of `CombineGVCFs`; it's dramatically faster for large cohorts. See the GATK docs on joint calling, and the sketch below.
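For orientation, a hedged sketch of what that route looks like (the workspace path and interval are illustrative; for hundreds of GVCFs use the `--sample-name-map` form described in the GATK docs):

# Import one interval into a GenomicsDB workspace (GenomicsDBImport requires -L)
gatk --java-options "-Xmx24g" GenomicsDBImport \
    -V vcf/sample1.g.vcf.gz -V vcf/sample2.g.vcf.gz \
    --genomicsdb-workspace-path gdb_chr1 \
    -L 1
# Genotype straight out of the workspace via the gendb:// URI
gatk --java-options "-Xmx24g" GenotypeGVCFs \
    -R reference.fasta -V gendb://gdb_chr1 -O vcf/cohort.1.vcf.gz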
7. Filter, Annotate, and Population Analysis (Single Job)
Everything from here is fast and has no further parallelism to exploit.
scripts/07_filter_annotate.sh
#!/bin/bash
#SBATCH --job-name=filter_ann
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=4
#SBATCH --mem=16g
#SBATCH --time=04:00:00
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/07_filter_%j.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called;
# the PATH (with bwa-mem2, samtools, gatk, ...) is inherited automatically.
cd ~/scratch/reseq
# Pre-download the snpEff Arabidopsis database if it's not already present.
# snpEff would auto-download on first use, but doing it explicitly here makes
# the failure mode obvious if the compute node has no outbound HTTPS, and
# prevents the download from silently adding minutes to every fresh
# student's first step-7 run.
if [ ! -d "${HOME}/snpeff_data/Arabidopsis_thaliana" ]; then
snpEff download -dataDir "${HOME}/snpeff_data" Arabidopsis_thaliana
fi
# ---- Split SNPs / Indels ----
gatk SelectVariants -R reference.fasta -V vcf/cohort.vcf.gz \
--select-type-to-include SNP -O vcf/cohort.snps.vcf.gz
gatk SelectVariants -R reference.fasta -V vcf/cohort.vcf.gz \
--select-type-to-include INDEL -O vcf/cohort.indels.vcf.gz
# ---- Hard filter ----
gatk VariantFiltration -R reference.fasta -V vcf/cohort.snps.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O vcf/cohort.snps.filtered.vcf.gz
gatk VariantFiltration -R reference.fasta -V vcf/cohort.indels.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 200.0" --filter-name "FS200" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
-O vcf/cohort.indels.filtered.vcf.gz
# ---- PASS-only VCF ----
bcftools view -f PASS -Oz -o vcf/cohort.snps.pass.vcf.gz vcf/cohort.snps.filtered.vcf.gz
tabix -p vcf vcf/cohort.snps.pass.vcf.gz
# ---- SnpEff annotation ----
# -csvStats writes vcf/snpEff_summary.csv - REQUIRED for MultiQC's snpeff
# module (it parses the CSV, not the HTML summary).
snpEff -csvStats vcf/snpEff_summary.csv \
-dataDir ${HOME}/snpeff_data Arabidopsis_thaliana \
vcf/cohort.snps.pass.vcf.gz > vcf/cohort.snps.annotated.vcf 2> logs/snpeff.log
# ---- PLINK (requires 2+ samples) ----
cd vcf
plink --vcf cohort.snps.pass.vcf.gz --make-bed --out cohort --allow-extra-chr
plink --bfile cohort --pca 10 --out cohort.pca --allow-extra-chr
plink --bfile cohort --indep-pairwise 50 10 0.2 --out cohort.ld --allow-extra-chr
echo "Pipeline complete."
Expected tail of logs/07_filter_<jobid>.out - VariantFiltration annotates the FILTER column, bcftools view -f PASS keeps only the survivors, then snpEff and PLINK run:
# example output (your numbers will differ)
INFO VariantFiltration - The Genome Analysis Toolkit (GATK) v4.6.2.0
INFO ProgressMeter - Starting traversal
INFO ProgressMeter - traversal complete. Processed <N> total variants in <M.M> minutes.
INFO VariantFiltration - Done. Elapsed time: <N.NN> minutes.
[snpEff] Reading database for genome: 'Arabidopsis_thaliana'
[snpEff] Analysis done.
Pipeline complete.
snpEff also writes a summary HTML/CSV in the working directory (snpEff_summary.html, snpEff_genes.txt); open the HTML in a browser, or peek at the summary table that snpEff prints to stderr / logs/snpeff.log:
head -25 ~/scratch/reseq/logs/snpeff.log
Expected output:
# example output (your numbers will differ; these are the section headings snpEff always prints)
Number_of_variants_processed <NNNNNN>
Number_of_effects <NNNNNNN>
# Effects by impact
HIGH <NNNN>
LOW <NNNNNN>
MODERATE <NNNNN>
MODIFIER <NNNNNNN>
# Effects by functional class
MISSENSE <NNNNN>
NONSENSE <NNN>
SILENT <NNNNN>
# Effects by type
intron_variant <NNNNNN>
synonymous_variant <NNNNN>
missense_variant <NNNNN>
stop_gained <NNN>
# ...
The PLINK PCA writes cohort.pca.eigenvec (per-sample PC scores) and cohort.pca.eigenval (variance per PC). With only 2 samples PCA is degenerate; for real cohorts (8+ samples) the first 2 PCs already separate populations:
cat ~/scratch/reseq/vcf/cohort.pca.eigenvec
Expected output:
# example output - columns are: FID IID PC1 PC2 ... PC10
sample1 sample1 -0.7071 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
sample2 sample2 0.7071 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Finally, count the PASS-filter SNPs in your cohort VCF - this is the "headline number" your downstream analysis (GWAS, popgen, annotation) starts from:
bcftools view -H ~/scratch/reseq/vcf/cohort.snps.pass.vcf.gz | wc -l
bcftools stats ~/scratch/reseq/vcf/cohort.snps.pass.vcf.gz | grep "^SN"
Expected output:
# example output (your numbers will differ; for 2 deep Arabidopsis samples expect ~0.5-1.5 M PASS SNPs)
898373
SN 0 number of samples: 2
SN 0 number of records: 898373
SN 0 number of SNPs: 898373
SN 0 number of indels: 0
SN 0 number of multiallelic sites: 6659
SN 0 number of multiallelic SNP sites: 6659
Submit:
FILTER_JID=$(sbatch --parsable --dependency=afterok:${JOINT_JID} scripts/07_filter_annotate.sh)
echo "Filter+Annotate: ${FILTER_JID}"
8. Aggregate QC with MultiQC
Each upstream step writes a MultiQC-parseable artifact (fastp JSON, Picard MarkDuplicates metrics, GATK BQSR table, snpEff summary, bcftools stats). MultiQC walks the working dir and rolls them all into one HTML report - the single file you can open to see how the cohort behaved end to end.
scripts/08_multiqc.sh
#!/bin/bash
#SBATCH --job-name=multiqc
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=2
#SBATCH --mem=8g
#SBATCH --time=01:00:00
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/08_multiqc_%j.out
set -euo pipefail
cd ~/scratch/reseq
mkdir -p qc
# bcftools stats over the final cohort VCF - MultiQC parses the .vchk
mkdir -p qc/bcftools
# Step 7 emits SNPs and INDELs as two separate filtered files. Concat them
# into one cohort.filtered.vcf.gz so bcftools stats (and MultiQC's bcftools
# module) sees a single record set covering both variant classes. The
# guard makes the step idempotent on re-runs.
if [ ! -s vcf/cohort.filtered.vcf.gz ]; then
bcftools concat -a -Oz \
-o vcf/cohort.filtered.vcf.gz \
vcf/cohort.snps.filtered.vcf.gz \
vcf/cohort.indels.filtered.vcf.gz
bcftools index -t vcf/cohort.filtered.vcf.gz
fi
bcftools stats vcf/cohort.filtered.vcf.gz > qc/bcftools/cohort.filtered.vchk
# Aggregate everything multiqc can find under the working dir.
# NOTE: --module gatk and --module rsem are intentionally OMITTED, and
# also explicitly excluded with --exclude as belt-and-suspenders.
# MultiQC 1.34's GATK base_recalibrator parser hits a pydantic
# ValidationError on the BQSR scatter (points.X.name is None) and
# crashes the whole report β the BQSR table is human-readable, inspect
# bam/*.recal.table directly (see Section 8 below).
# MultiQC 1.34's RSEM parser also misreads stray '#'-prefixed files in
# sibling dirs (e.g. ~/scratch/rnaseq/) and ValueErrors out. We don't
# run RSEM in this pipeline at all, so excluding it is free insurance.
# Re-enable both modules once the upstream MultiQC bugs are fixed.
multiqc . -o qc/ -n reseq_report --force \
--module fastp \
--module picard \
--module snpeff \
--module bcftools \
--exclude rsem \
--exclude gatk
echo "MultiQC report: qc/reseq_report.html"
⚠️ Do NOT run a bare `multiqc .` from `~/scratch/` or any parent directory. MultiQC 1.34 will scan everything below the cwd and try to parse files from sibling pipelines (e.g. `~/scratch/rnaseq/`), tripping known module bugs (`rsem.py` ValueError on stray `#` lines, the GATK BQSR `rich.panel` AttributeError). The safest path is just:

sbatch scripts/08_multiqc.sh

If you must run it interactively, mirror the script's flag set exactly - `cd` into the resequencing workspace and pass the same module whitelist + excludes:

cd ~/scratch/reseq
multiqc . -o qc/ -n reseq_report --force \
    --module fastp --module picard --module snpeff --module bcftools \
    --exclude rsem --exclude gatk
What this picks up:

| Module | Source files | What you see |
|---|---|---|
| `fastp` | `trim/*_fastp.json` | Q20/Q30 rates, duplication %, adapter trimming per sample |
| `picard` | `bam/*.markdup.metrics` | Optical / PCR duplication rate per library |
| `gatk` | `bam/*.recal.table` is human-readable: `head bam/sample1.recal.table` shows BQSR before/after | (MultiQC 1.34's GATK module crashes on this report; see the comment in 08_multiqc.sh) |
| `snpeff` | `vcf/snpEff_summary.csv` (the HTML/genes.txt files in CWD are for human inspection only; MultiQC parses the CSV) | HIGH/MODERATE/LOW/MODIFIER variant impact distribution |
| `bcftools` | `qc/bcftools/cohort.filtered.vchk` | SNP/indel counts, Ts/Tv, singleton stats |
Submit:
MQC_JID=$(sbatch --parsable --dependency=afterok:${FILTER_JID} scripts/08_multiqc.sh)
echo "MultiQC: ${MQC_JID}"
Once it finishes, copy the report to your laptop with scp and open it in a browser:
scp <netid>@pronghorn.rc.unr.edu:~/scratch/reseq/qc/reseq_report.html ./
open reseq_report.html # or: double-click in your file browser
Expected qc/reseq_report.html (key sections you should see in the rendered report):
General Statistics - one row per sample: mapped %, dup %, Q30, mean coverage
fastp - adapter trimming, read filtering, Q20/Q30 distributions
Picard MarkDuplicates - % Duplication, Estimated Library Size
snpEff - variant impact pie + summary table per sample
bcftools stats - SNPs / indels / Ts:Tv per sample and cohort-wide
# GATK BQSR is omitted from the MultiQC report (MultiQC 1.34 GATK module bug);
# inspect bam/*.recal.table directly to see empirical vs reported quality.
If a module section is missing, the underlying file wasn't written; re-check that step's log before continuing.
9. Submit the Entire Pipeline with One Script
Instead of manually chaining each step, use a driver script that submits everything with correct dependencies in one go.
scripts/run_all.sh
#!/bin/bash
set -euo pipefail
cd ~/scratch/reseq
# Activate the env in THIS shell so every sbatch below inherits the PATH
# (sbatch --export=ALL is the default; the submitted jobs see the same tools)
export MAMBA_ROOT_PREFIX="${MAMBA_ROOT_PREFIX:-$HOME/micromamba}"
eval "$(micromamba shell hook --shell=bash)"
micromamba activate reseq_bch709
# --parsable returns just the job ID, perfect for chaining
DL_JID=$(sbatch --parsable scripts/01_download.sh)
REF_JID=$(sbatch --parsable scripts/02_reference.sh)
ALIGN_JID=$(sbatch --parsable --dependency=afterok:${DL_JID}:${REF_JID} scripts/03_align.sh)
MARK_JID=$(sbatch --parsable --dependency=afterok:${ALIGN_JID} scripts/04_markdup_bqsr.sh)
HC_JID=$(sbatch --parsable --dependency=afterok:${MARK_JID} scripts/05a_haplotypecaller.sh)
GATHER_JID=$(sbatch --parsable --dependency=afterok:${HC_JID} scripts/05b_gather_gvcf.sh)
JOINT_JID=$(sbatch --parsable --dependency=afterok:${GATHER_JID} scripts/06_joint_genotype.sh)
FILT_JID=$(sbatch --parsable --dependency=afterok:${JOINT_JID} scripts/07_filter_annotate.sh)
MQC_JID=$(sbatch --parsable --dependency=afterany:${FILT_JID} scripts/08_multiqc.sh)
echo "Submitted pipeline:"
printf ' %-20s %s\n' "download" "${DL_JID}"
printf ' %-20s %s\n' "reference" "${REF_JID}"
printf ' %-20s %s\n' "align" "${ALIGN_JID}"
printf ' %-20s %s\n' "markdup+bqsr" "${MARK_JID}"
printf ' %-20s %s\n' "haplotypecaller" "${HC_JID}"
printf ' %-20s %s\n' "gather" "${GATHER_JID}"
printf ' %-20s %s\n' "joint_genotype" "${JOINT_JID}"
printf ' %-20s %s\n' "filter+annotate" "${FILT_JID}"
printf ' %-20s %s\n' "multiqc" "${MQC_JID}"
echo ""
echo "Monitor with: squeue -u \$USER"
echo "Cancel all: scancel ${DL_JID} ${REF_JID} ${ALIGN_JID} ${MARK_JID} ${HC_JID} ${GATHER_JID} ${JOINT_JID} ${FILT_JID} ${MQC_JID}"
echo ""
echo "Final report (after pipeline finishes): ~/scratch/reseq/qc/reseq_report.html"
Run it:
chmod +x scripts/*.sh
bash scripts/run_all.sh
You'll see 9 job IDs printed. Slurm queues them; each waits for its predecessor to succeed before starting.
| Dependency type | Meaning |
|---|---|
| `afterok:JID` | Start only if JID finished successfully |
| `afterany:JID` | Start after JID finishes, regardless of exit status |
| `afternotok:JID` | Start only if JID failed (for error-handling scripts) |
| `singleton` | Run after all other jobs with the same name finish |
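afternotok is the natural hook for on-failure actions. A hedged sketch (the logging payload is illustrative; run it from ~/scratch/reseq so the relative logs/ path resolves): submit a watcher that only fires if the align step fails:

ALIGN_JID=$(sbatch --parsable scripts/03_align.sh)
sbatch -A cpu-s5-bch709-6 -p cpu-core-0 --time=00:05:00 \
    --dependency=afternotok:${ALIGN_JID} \
    --wrap="echo \"align ${ALIGN_JID} failed at \$(date)\" >> logs/failures.log"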
🧑‍💻 Hands-on walkthrough - submit the pipeline step-by-step
If you want to see exactly what run_all.sh does (or debug one step), you can submit each stage manually. Every sbatch returns a job ID that the next step depends on.
Do this first (login shell, one time):
micromamba activate reseq_bch709
cd ~/scratch/reseq
Then submit each step; each line is one command:
# --- Step 1: download FASTQs (no prerequisites) ---
DL_JID=$(sbatch --parsable scripts/01_download.sh)
echo "download β $DL_JID"
# --- Step 2: reference prep (no prerequisites, runs in parallel with Step 1) ---
REF_JID=$(sbatch --parsable scripts/02_reference.sh)
echo "reference β $REF_JID"
# --- Step 3: align (waits for BOTH download and reference) ---
ALIGN_JID=$(sbatch --parsable --dependency=afterok:${DL_JID}:${REF_JID} scripts/03_align.sh)
echo "align β $ALIGN_JID"
# --- Step 4: markdup + BQSR (waits for align) ---
MARK_JID=$(sbatch --parsable --dependency=afterok:${ALIGN_JID} scripts/04_markdup_bqsr.sh)
echo "markdup+bqsr β $MARK_JID"
# --- Step 5a: HaplotypeCaller (waits for markdup) ---
HC_JID=$(sbatch --parsable --dependency=afterok:${MARK_JID} scripts/05a_haplotypecaller.sh)
echo "haplocaller β $HC_JID"
# --- Step 5b: gather per-sample GVCFs (waits for HaplotypeCaller) ---
GATHER_JID=$(sbatch --parsable --dependency=afterok:${HC_JID} scripts/05b_gather_gvcf.sh)
echo "gather β $GATHER_JID"
# --- Step 6: joint genotyping (waits for gather) ---
JOINT_JID=$(sbatch --parsable --dependency=afterok:${GATHER_JID} scripts/06_joint_genotype.sh)
echo "joint_geno β $JOINT_JID"
# --- Step 7: filter + annotate (waits for joint genotyping) ---
FILT_JID=$(sbatch --parsable --dependency=afterok:${JOINT_JID} scripts/07_filter_annotate.sh)
echo "filter+ann β $FILT_JID"
# Check that everything is queued
squeue -u $USER
# Steps 3-7 should show state PD with reason (Dependency)
After pasting the block, squeue shows all 8 jobs: some running, most pending. Slurm takes care of the ordering.
Why copy the commands into your terminal, not a script?

The hands-on walkthrough is literally what `run_all.sh` does, but by typing each line you see each job ID appear and can inspect things in between. Once you're comfortable, just run `bash scripts/run_all.sh` next time.
10. Monitoring the Pipeline
Live status
squeue -u $USER # see all your jobs
squeue -u $USER --start # see expected start times for pending jobs
sacct -u $USER --starttime=today --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode
Watch logs in real time
# Latest log from any step (full path so this works from any cwd)
tail -f ~/scratch/reseq/logs/03_align_*.out
# Specifically task 2 of the align array job:
tail -f ~/scratch/reseq/logs/03_align_${ALIGN_JID}_2.out
Kill the whole pipeline if something is wrong
# Cancel all your jobs at once
scancel -u $USER
# Or cancel just the downstream ones (if e.g. alignment is still fine)
scancel ${MARK_JID} ${HC_JID} ${GATHER_JID} ${JOINT_JID} ${FILT_JID}
11. Resource Tuning
After your first successful run, check actual usage with sacct and right-size your requests:
sacct -u $USER --starttime=today \
--format=JobID,JobName,State,Elapsed,MaxRSS,ReqMem,ReqCPUs
| Observed | Action |
|---|---|
| MaxRSS « ReqMem | Lower `--mem`; reserves less, starts sooner |
| MaxRSS ≈ ReqMem | Leave ~20% headroom to avoid OUT_OF_MEMORY |
| Elapsed « ReqTime | Lower `--time` |
| TIMEOUT | Raise `--time`, or check for stuck loops |
| OUT_OF_MEMORY | Raise `--mem` (suggest 2× MaxRSS) AND the `-Xmx` flag inside the script |
GATK / Picard heap tip: always set `-Xmx` 2-4 GB below your `#SBATCH --mem`. E.g., `--mem=16g` → `-Xmx12g`. Java needs headroom for off-heap stack/GC; if `-Xmx` equals `--mem`, jobs die with `Out of memory`.
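If you'd rather not keep the two numbers in sync by hand, one hedged pattern (not used in the scripts above) is to derive the heap from Slurm's allocation inside the job script. When a job is submitted with `--mem`, Slurm exports `SLURM_MEM_PER_NODE` in MB:

# Reserve ~4 GB of the allocation for off-heap/GC overhead (assumes --mem >= 8g)
HEAP_MB=$(( ${SLURM_MEM_PER_NODE:-16384} - 4096 ))
gatk --java-options "-Xmx${HEAP_MB}m" BaseRecalibrator \
    -I bam/sample1.markdup.bam -R reference.fasta \
    --known-sites known_sites.vcf.gz -O bam/sample1.recal.table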
12. Expected Wall Time
For 2 Arabidopsis samples (~10-15x coverage each):
| Step | CPU cores × array size | Wall time |
|---|---|---|
| Download | 2 × 2 | ~10 min |
| Reference prep | 4 | ~15 min (plus the one-time ~3 h known_sites download; see step 2) |
| Align + trim | 8 × 2 | ~30 min |
| MarkDup + BQSR | 4 × 2 | ~20 min |
| HaplotypeCaller (scatter) | 4 × 14 | ~15 min |
| MergeVcfs | 2 × 2 | ~2 min |
| Joint genotyping | 4 | ~5 min |
| Filter + annotate | 4 | ~5 min |
| Total (pipelined) | | ~90 min end-to-end |
Compared to ~8 hours on a laptop for the same 2 samples. For 100 samples, HPC finishes in about the same 90 minutes; the laptop would need ~1 week.
13. Common HPC-Specific Errors
Batch job submission failed: Invalid account or account/partition combination

Cause: typo in `--account` or `--partition`.
Fix: re-check with `sacctmgr show user $USER format=user,account,partition,qos%-30 withassoc where user=$USER`.
Array task fails with `sed: no input files` or empty SAMPLE variable

Cause: `samples.tsv` is missing, empty, or has trailing blank lines.
Fix:

cat -A samples.tsv   # show hidden chars; should have one $ per line, no blanks
wc -l samples.tsv    # line count must match --array range
CANCELLED by 0 ExitCode 0:15 in a dependent job

Cause: a parent job in the dependency chain failed, so Slurm auto-cancels all downstream jobs.
Fix: find the first FAILED job with `sacct -u $USER`, fix it, then resubmit from that step onward (don't re-run successful steps). See the next callout for the exact per-step commands.
Re-running a single failed step (drop --dependency=)

When only one step failed and every upstream step is already in COMPLETED state, submit just the failed script with no `--dependency=` flag. The chained driver in Section 9 only needs `--dependency=...` because it submits everything at once, with `sbatch --parsable` capturing job IDs that don't yet exist as completed jobs.

Use these commands. Each is independent; pick the one for the step that failed and drop the rest:

sbatch scripts/01_download.sh          # array job: all samples
sbatch scripts/02_reference.sh         # download + index TAIR10 + known_sites
sbatch scripts/03_align.sh             # array: fastp + bwa-mem2
sbatch scripts/04_markdup_bqsr.sh      # array: Picard MarkDup + GATK BQSR
sbatch scripts/05a_haplotypecaller.sh  # 2-D array: sample × chromosome
sbatch scripts/05b_gather_gvcf.sh      # array: MergeVcfs per sample
sbatch scripts/06_joint_genotype.sh    # GenotypeGVCFs across cohort
sbatch scripts/07_filter_annotate.sh   # VariantFiltration + snpEff + PLINK
sbatch scripts/08_multiqc.sh           # final aggregated report

Re-run ONLY a subset of array tasks (e.g. sample 2 failed but sample 1 succeeded in step 5a):

sbatch --array=2 scripts/05a_haplotypecaller.sh

This overrides the `--array=` line in the script header and runs only the listed task IDs. (A sketch for listing the failed task IDs follows below.)

Special cases that need extra setup before re-running:

- Step 7 (snpEff failure with `class file version 65.0`): patch the env once before resubmitting (or upgrade to `openjdk=21` as described in Section 0):

  micromamba install -n reseq_bch709 -c bioconda 'snpeff<5.2' -y
  sbatch scripts/07_filter_annotate.sh

- Step 5b (`GatherVcfs ... is not after first record`): re-copy `scripts/05b_gather_gvcf.sh` from the lesson (it now uses MergeVcfs), then resubmit.
- Step 2 (`reference.dict already exists` or `curl --max-time 1800` timeout): re-copy `scripts/02_reference.sh` from the lesson (the current version has `rm -f reference.dict` and no `--max-time` cap), then resubmit.

Sanity check before resubmitting - verify upstream outputs exist:

sacct -u $USER --format=JobID,JobName%-25,State,ExitCode --starttime today
ls -lh ~/scratch/reseq/{bam,vcf,reference.fasta*}
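To list which array task IDs actually failed (so you know what to pass to `--array=`), a hedged sacct one-liner - replace <jobid> with the failed array job's ID:

sacct -j <jobid> --format=JobID%-20,State,ExitCode --noheader \
    | awk '$2 == "FAILED" && $1 !~ /\./ {print $1}'   # e.g. prints 12345678_2
# feed the _N suffixes back in: sbatch --array=2 scripts/05a_haplotypecaller.sh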
Jobs stuck in (Dependency) state after a parent job succeeded

Cause: the parent had a non-zero exit (even a warning); `afterok` treats it as failed.
Fix: if the "failure" was a harmless stderr warning, use `--dependency=afterany:${JID}` instead of `afterok`. But first check: was the job actually successful?
Disk quota exceeded in $HOME

Cause: you accidentally wrote BAMs/VCFs to `~` instead of `~/scratch`.
Fix: `du -sh ~/* | sort -h` to find the offender, then move it to scratch.
samtools sort runs out of /tmp on compute nodes

Cause: local `/tmp` is small (often <20 GB); large BAM sorts can exceed it.
Fix: use Slurm's job-local scratch variable instead:

samtools sort -T ${TMPDIR:-/tmp}/${SAMPLE}_sort ...
14. What's Next

- Scaling to 1000s of samples: switch to `GenomicsDBImport` + VQSR instead of hard filters.
- Split big BAMs further: for human WGS, scatter HaplotypeCaller into 25-50 chunks per sample using interval lists.
- GWAS: once your cohort VCF is PASS-only and PLINK-converted, proceed to the GWAS tutorial.
- Workflow managers: for production pipelines, consider Nextflow or Snakemake; they handle all this dependency bookkeeping automatically.
References
| Resource | Link |
|---|---|
| Slurm Array Job guide | slurm.schedmd.com/job_array.html |
| Slurm Job Dependencies | slurm.schedmd.com/sbatch.html#OPT_dependency |
| GATK parallelism docs | gatk.broadinstitute.org/hc/en-us/articles/360035890411 |
| Pronghorn user guide | UNR Research Computing |
| Basic resequencing (laptop) | Resequencing Tutorial |
| HPC Cluster fundamentals | HPC Cluster Lesson |