🗺️ HPC series — you are here
1. HPC Cluster basics — SSH, file transfer, Micromamba, Slurm, dependencies
2. Resequencing pipeline on HPC — BWA-MEM2 + GATK, variant calling
3. 🔵 ChIP-Seq pipeline (this page) — minimap2 + MACS3, peak calling
4. RNA-Seq pipeline on HPC — STAR + featureCounts, DE analysis
✅ Before you start — pre-class checklist
Skim this once before running any code below:
- You can `ssh <netid>@pronghorn.rc.unr.edu` and see the Pronghorn prompt
- `sacctmgr show user $USER withassoc` shows account `cpu-s5-bch709-6` / partition `cpu-core-0`
- `~/scratch` symlink exists and points under `/data/gpfs/assoc/bch709-6/<netid>`
- `micromamba --version` works on the login node (shell hook in `~/.bashrc`)
- You ran the laptop version of ChIP-Seq Tutorial at least once so the biology makes sense
Any box unchecked? → go back to the HPC Cluster lesson first.
Overview — Why Parallelize ChIP-Seq on HPC?
ChIP-Seq experiments usually have multiple conditions × multiple replicates × 1 input per condition. A 2-condition × 3-replicate study already means 8 samples (6 ChIP + 2 Input), and each needs its own alignment, dedup, peak call, and coverage track.
Running this serially on a laptop takes 12–24 hours. With Slurm array jobs on Pronghorn, all 8 samples are processed in parallel and the whole pipeline finishes in roughly the time of a single sample — about 45 minutes for small references, 1–2 hours for the full human genome.
Three HPC parallelization strategies used here
| Strategy | What it parallelizes | Slurm feature |
|---|---|---|
| Array jobs | Same command, different samples | `#SBATCH --array=1-N` |
| Job dependencies | Pipeline stages (align → dedup → peak call) | `sbatch --dependency=afterok:JOBID` |
| Paired treatment/control | MACS3 takes (ChIP_rep, matching Input) from one sample sheet | awk / join at submission time |
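The third strategy is the least familiar, so here is a self-contained sketch of the lookup it performs. It builds a throwaway sheet under `/tmp` (a stand-in for the real `samples.tsv` defined later) and runs the same `awk` + parameter-expansion pairing that the MACS3 step uses:

```shell
# Build a throwaway two-row sheet, then pair each ChIP row with its control —
# the same lookup scripts/06_macs3.sh performs against the real samples.tsv.
printf 'sample\taccession\ttype\tcontrol\n'         >  /tmp/demo_samples.tsv
printf 'ctcf_rep1\tENCFF001HTO\tchip\tinput_k562\n' >> /tmp/demo_samples.tsv
printf 'input_k562\tENCFF001HTT\tinput\t-\n'        >> /tmp/demo_samples.tsv

# One "sample:control" string per chip row (header and input rows skipped):
PAIRS=($(awk -F'\t' 'NR>1 && $3=="chip" {print $1":"$4}' /tmp/demo_samples.tsv))
PAIR=${PAIRS[0]}
echo "ChIP=${PAIR%:*} Control=${PAIR#*:}"   # → ChIP=ctcf_rep1 Control=input_k562
```

Because the pairing is resolved from the sheet at submission time, adding a condition never means editing a script — only adding rows.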
Pipeline DAG
[download array]              ← parallel across all FASTQs
        ↓
[trim+align array]            ← parallel across all samples
        ↓
[filter+dedup array]          ← parallel across all samples
        ↓
[bamCoverage array]           ← per-sample BigWig
        ↓
[MACS3 peak call array]       ← ChIP vs Input pairs in parallel
        ↓
[QC + motif + IDR single job] ← aggregate
0. Setup on Pronghorn
Connect and check your Slurm account
ssh <YOUR_NET_ID>@pronghorn.rc.unr.edu
sacctmgr show user $USER format=user,account,partition,qos%-30 \
withassoc where user=$USER
Use the values shown there (typically cpu-s5-bch709-6 / cpu-core-0 / student) in every #SBATCH line below.
Create the ChIP-Seq environment
# Use python 3.10 β bioconda has prebuilt wheels of macs3 / deeptools /
# idr / pysam for py3.10 that are linked against the conda sysroot's
# glibc 2.17 and run on Pronghorn's CentOS 7 compute nodes.
# Building these from source under py3.11 hits gcc 15's C23 default
# (`__isoc23_strtol@GLIBC_2.38`) and the cluster cannot load them.
micromamba create -n chipseq_bch709 -c conda-forge -c bioconda python=3.10 -y
micromamba activate chipseq_bch709
# Everything from bioconda β single install, no compilers, no pip builds.
# `numpy<1.24` because IDR 2.0.3 still uses numpy.int (alias removed in
# NumPy 1.24); deeptools also isn't NumPy-2 ready. Tightest working
# window is 1.20 β€ numpy < 1.24.
micromamba install -c conda-forge -c bioconda \
fastqc 'fastp>=0.24' minimap2 \
'samtools>=1.20' bedtools 'tabix>=1.11' \
openjdk=17 'picard>=3' homer \
'pysam>=0.22' macs3 deeptools idr \
'numpy<1.24' -y
# Upgrade pip first β older pip can't find the prebuilt `tiktoken`
# manylinux wheel (a transitive multiqc dep), tries to build from Rust
# source, and fails on Pronghorn (no Rust compiler).
pip install --upgrade pip
# Pip-only tools: multiqc + tiktoken. `tiktoken<0.8` pin avoids the Rust build.
pip install --prefer-binary 'tiktoken<0.8' 'multiqc<1.34'
MultiQC version (read this if the post-processing MultiQC step ever fails with a `rich.panel` `AttributeError` or an RSEM `int('#')` `ValueError`)

MultiQC 1.34 (released 2026) has two crash bugs in its module loader. The recipe above pins `'multiqc<1.34'`, so a fresh env is fine. Existing envs built earlier may still have 1.34 — patch with the verify→install→re-verify pattern (`micromamba install` first; fall back to `pip` only if the conda solver refuses to downgrade):

```
# Step 1 — verify current version
micromamba run -n chipseq_bch709 multiqc --version

# Step 2 — preferred: conda-side install (overrides any pip-installed copy)
micromamba install -n chipseq_bch709 -c bioconda 'multiqc<1.34' -y

# Step 3 — re-verify. If still 1.34, the solver kept the old version
#          (soft-pin from another package). Force it with pip:
micromamba run -n chipseq_bch709 multiqc --version
micromamba run -n chipseq_bch709 pip install -U --force-reinstall 'multiqc<1.34'

# Step 4 — final verify; should print 1.33.x or earlier.
micromamba run -n chipseq_bch709 multiqc --version
```
Patch libcrypto so samtools runs (do this now, not after it crashes):
Bioconda's samtools is linked against `libcrypto.so.1.0.0`, but the current OpenSSL package ships `libcrypto.so.3` (or `.1.1`). Without a symlink, you get:
samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file
Create the symlink once, right after activating:
# Must be run with the env ACTIVE β $CONDA_PREFIX points at the env folder
cd "$CONDA_PREFIX/lib"
if [ -f libcrypto.so.1.1 ]; then ln -sf libcrypto.so.1.1 libcrypto.so.1.0.0
elif [ -f libcrypto.so.3 ]; then ln -sf libcrypto.so.3 libcrypto.so.1.0.0
fi
cd - > /dev/null
# Verify samtools works
samtools --version | head -1
# β should print: samtools 1.xx (no error about libcrypto)
If you see `samtools 1.xx` you're done. If `samtools --version` still errors, ask the instructor before moving on.
📌 Activate once in your login shell — every `sbatch` inherits the environment

Do this ONCE in the login shell before running any `sbatch` command:

```
micromamba activate chipseq_bch709
which fastp   # confirm: should print a path inside ~/micromamba/envs/chipseq_bch709/
```

By default, `sbatch` submits jobs with `--export=ALL`, which means each Slurm job inherits your current shell's environment — including `PATH` pointing at the activated env. So you don't need to put `micromamba activate` inside every batch script.

Sanity check: submit a tiny test job and confirm the tool is found on the compute node too:

```
sbatch -A cpu-s5-bch709-6 -p cpu-core-0 --time=00:05:00 --wrap="which fastp && fastp --version"
# check the slurm-<jobid>.out log — should show the same fastp path + version
```

If you open a new SSH session, the activation is lost — just run `micromamba activate chipseq_bch709` again before submitting.
Create the project directory on scratch
mkdir -p ~/scratch/chipseq/{raw,trim,bam,bw,peaks,qc,logs,scripts}
cd ~/scratch/chipseq
| Folder | Contents |
|---|---|
| `raw/` | Downloaded FASTQ files |
| `trim/` | Adapter-trimmed FASTQ |
| `bam/` | Aligned / filtered / deduplicated BAMs |
| `bw/` | BigWig signal tracks |
| `peaks/` | MACS3 narrowPeak / broadPeak results |
| `qc/` | fingerprint, correlation, motif outputs |
| `logs/` | Slurm log files |
| `scripts/` | All batch scripts below |
Define your sample sheet
samples.tsv lists every sample (ChIP and Input) with its ENCODE accession. A separate column names the matching input so MACS3 can look it up at submission time.
Each row is TAB-separated (NOT spaces). Downstream scripts use `cut -f1` and `awk -F'\t' '$3=="chip"'` — they break if you accidentally use spaces. Use `printf` so the `\t` is unambiguously a tab:
printf 'sample\taccession\ttype\tcontrol\n' > ~/scratch/chipseq/samples.tsv
printf 'ctcf_rep1\tENCFF001HTO\tchip\tinput_k562\n' >> ~/scratch/chipseq/samples.tsv
printf 'ctcf_rep2\tENCFF001HTP\tchip\tinput_k562\n' >> ~/scratch/chipseq/samples.tsv
printf 'input_k562\tENCFF001HTT\tinput\t-\n' >> ~/scratch/chipseq/samples.tsv
Verify it's tab-separated (every row should have exactly 4 fields):
awk -F'\t' '{print NF}' ~/scratch/chipseq/samples.tsv | sort -u
# Expected output: 4
# If you see 1, you pasted spaces instead of tabs — re-run the printf lines above.
column -t -s $'\t' ~/scratch/chipseq/samples.tsv # pretty-print to verify content
| Column | Description |
|---|---|
| `sample` | Short name used in output filenames |
| `accession` | ENCODE file accession used to build the download URL |
| `type` | `chip` or `input` — MACS3 runs only on `chip` rows |
| `control` | Which sample row to use as the `-c` input for MACS3; `-` for `input` rows |
Scaling up: add more rows for additional replicates or conditions. The submission commands in `run_all.sh` derive every `--array` range from `samples.tsv`, so everything scales automatically (the hard-coded `#SBATCH --array` lines in the scripts are just defaults that the driver overrides).
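To see that derivation in isolation, here is the same counting logic `run_all.sh` uses, sketched against a throwaway copy of the sheet so it runs anywhere:

```shell
# Throwaway sheet: header + 2 chip rows + 1 input row
printf 'sample\taccession\ttype\tcontrol\n'         >  /tmp/sheet.tsv
printf 'ctcf_rep1\tENCFF001HTO\tchip\tinput_k562\n' >> /tmp/sheet.tsv
printf 'ctcf_rep2\tENCFF001HTP\tchip\tinput_k562\n' >> /tmp/sheet.tsv
printf 'input_k562\tENCFF001HTT\tinput\t-\n'        >> /tmp/sheet.tsv

N_SAMPLES=$(awk 'NR>1' /tmp/sheet.tsv | wc -l)                    # data rows (skip header)
N_CHIP=$(awk -F'\t' 'NR>1 && $3=="chip"' /tmp/sheet.tsv | wc -l)  # chip rows only
echo "--array=1-${N_SAMPLES} for per-sample steps, --array=1-${N_CHIP} for MACS3"
```

Adding a fourth data row would silently bump `N_SAMPLES` to 4 on the next submission — no script edits needed.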
1. Download FASTQ in parallel (Array Job)
scripts/01_download.sh
#!/bin/bash
#SBATCH --job-name=chip_download
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=2
#SBATCH --mem=4g
#SBATCH --time=12:00:00
#SBATCH --array=1-3 # = number of samples (data rows in samples.tsv)
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/01_download_%A_%a.out
set -euo pipefail
cd ~/scratch/chipseq
# Row 1 is header — add 1 to task ID
ROW=$((SLURM_ARRAY_TASK_ID + 1))
LINE=$(sed -n "${ROW}p" samples.tsv)
SAMPLE=$(echo "$LINE" | cut -f1)
ACC=$(echo "$LINE" | cut -f2)
URL="https://www.encodeproject.org/files/${ACC}/@@download/${ACC}.fastq.gz"
echo "[task ${SLURM_ARRAY_TASK_ID}] Downloading ${SAMPLE} (${ACC})"
# --retry-all-errors + -C -: harden against SSL eof drops on large
# transfers; resume partials in place. See HPC_RNA_SEQ.md fastq-dump
# (#64) for the underlying issue.
curl -fsSL --retry 5 --retry-all-errors --retry-delay 10 --max-time 3600 \
-C - -o raw/${SAMPLE}.fastq.gz "${URL}"
# Validate the gzip immediately β prevents silent truncation
gunzip -t raw/${SAMPLE}.fastq.gz
ls -lh raw/${SAMPLE}.fastq.gz
echo "${SAMPLE} OK"
Submit (capture the job ID so downstream steps can depend on it):
# Make sure the env is active in THIS shell (once per login session)
micromamba activate chipseq_bch709
cd ~/scratch/chipseq
DL_JID=$(sbatch --parsable scripts/01_download.sh)
echo "Download job: ${DL_JID}"
squeue -u $USER
2. Reference Preparation (Single Job)
#!/bin/bash
#SBATCH --job-name=chip_ref
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=4
#SBATCH --mem=16g
#SBATCH --time=02:00:00
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/02_reference_%j.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called —
# the PATH (with fastp, minimap2, samtools, …) is inherited automatically.
cd ~/scratch/chipseq
# hg19 genome (example data is ENCODE-legacy aligned to hg19)
# Use HTTPS + curl with retries — UCSC redirects HTTP to HTTPS and a
# plain wget can fail on the redirect. Two mirror choices:
#   primary: hgdownload.soe.ucsc.edu      (US west)
#   mirror : hgdownload-euro.soe.ucsc.edu (Europe — faster from EU)
UCSC_URL="https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz"
UCSC_MIRROR="https://hgdownload-euro.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz"
# --retry-all-errors + -C -: harden against SSL eof drops on large
# transfers; resume partials in place. See HPC_RNA_SEQ.md fastq-dump
# (#64) for the underlying issue.
curl -fsSL --retry 5 --retry-all-errors --retry-delay 10 --max-time 3600 \
-C - -o hg19.fa.gz "${UCSC_URL}" \
|| curl -fsSL --retry 5 --retry-all-errors --max-time 3600 \
-C - -o hg19.fa.gz "${UCSC_MIRROR}"
gunzip -f hg19.fa.gz
mv -f hg19.fa reference.fasta
samtools faidx reference.fasta
minimap2 -d reference.mmi reference.fasta
# TSS BED for later heatmaps
REFG_URL="https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz"
REFG_MIRROR="https://hgdownload-euro.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz"
curl -fsSL --retry 3 --retry-delay 10 --max-time 600 -o refGene.txt.gz "${REFG_URL}" \
|| curl -fsSL --retry 3 --max-time 600 -o refGene.txt.gz "${REFG_MIRROR}"
gunzip -f refGene.txt.gz
awk 'BEGIN{OFS="\t"} {
if ($4=="+") print $3, $5, $5+1, $2, "0", $4;
else print $3, $6-1, $6, $2, "0", $4
}' refGene.txt | sort -k1,1 -k2,2n | uniq > TSS.bed
echo "Reference prep done."
Save as `scripts/02_reference.sh` and submit (independent of download — runs in parallel with Step 1):
REF_JID=$(sbatch --parsable scripts/02_reference.sh)
echo "Reference job: ${REF_JID}"
3. Trim + Align (Array Job)
`scripts/03_align.sh` — trims adapters with fastp, aligns with `minimap2 -ax sr`, sorts+indexes in one pipe.
#!/bin/bash
#SBATCH --job-name=chip_align
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=8
#SBATCH --mem=32g
#SBATCH --time=12:00:00
#SBATCH --array=1-3
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/03_align_%A_%a.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called —
# the PATH (with fastp, minimap2, samtools, …) is inherited automatically.
cd ~/scratch/chipseq
ROW=$((SLURM_ARRAY_TASK_ID + 1))
SAMPLE=$(sed -n "${ROW}p" samples.tsv | cut -f1)
# ---- fastp ----
fastp \
--in1 raw/${SAMPLE}.fastq.gz \
--out1 trim/${SAMPLE}.trimmed.fq.gz \
--qualified_quality_phred 20 \
--length_required 25 \
--thread ${SLURM_CPUS_PER_TASK} \
--html trim/${SAMPLE}_fastp.html \
--json trim/${SAMPLE}_fastp.json
# ---- minimap2 align + sort ----
minimap2 -ax sr -t ${SLURM_CPUS_PER_TASK} reference.mmi trim/${SAMPLE}.trimmed.fq.gz \
2> logs/${SAMPLE}.minimap2.log \
| samtools sort -@ ${SLURM_CPUS_PER_TASK} -T /tmp/${SAMPLE}_sort \
-o bam/${SAMPLE}.bam
samtools index -@ ${SLURM_CPUS_PER_TASK} bam/${SAMPLE}.bam
samtools quickcheck bam/${SAMPLE}.bam && echo "${SAMPLE}.bam OK"
samtools flagstat bam/${SAMPLE}.bam > bam/${SAMPLE}.flagstat
Submit with dependencies — wait for both download (raw FASTQ) and reference prep (index) to succeed:
ALIGN_JID=$(sbatch --parsable --dependency=afterok:${DL_JID}:${REF_JID} scripts/03_align.sh)
echo "Align: ${ALIGN_JID}"
4. Filter MAPQ + Remove Duplicates (Array Job)
#!/bin/bash
#SBATCH --job-name=chip_dedup
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=4
#SBATCH --mem=16g
#SBATCH --time=06:00:00
#SBATCH --array=1-3
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/04_dedup_%A_%a.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called —
# the PATH (with fastp, minimap2, samtools, …) is inherited automatically.
cd ~/scratch/chipseq
ROW=$((SLURM_ARRAY_TASK_ID + 1))
SAMPLE=$(sed -n "${ROW}p" samples.tsv | cut -f1)
# ---- Filter by MAPQ>=30 and drop unmapped ----
samtools view -b -q 30 -F 4 -@ ${SLURM_CPUS_PER_TASK} \
bam/${SAMPLE}.bam > bam/${SAMPLE}.filt.bam
samtools index -@ ${SLURM_CPUS_PER_TASK} bam/${SAMPLE}.filt.bam
# ---- Remove PCR duplicates ----
picard -Xmx12g MarkDuplicates \
I=bam/${SAMPLE}.filt.bam \
O=bam/${SAMPLE}.dedup.bam \
M=bam/${SAMPLE}.dedup.metrics \
REMOVE_DUPLICATES=true \
VALIDATION_STRINGENCY=SILENT
samtools index -@ ${SLURM_CPUS_PER_TASK} bam/${SAMPLE}.dedup.bam
# ---- Cleanup ----
rm -f bam/${SAMPLE}.bam bam/${SAMPLE}.bam.bai
rm -f bam/${SAMPLE}.filt.bam bam/${SAMPLE}.filt.bam.bai
Submit:
DEDUP_JID=$(sbatch --parsable --dependency=afterok:${ALIGN_JID} scripts/04_dedup.sh)
5. BigWig Signal Tracks (Array Job)
Every sample — including inputs — gets its own RPKM-normalized BigWig.
#!/bin/bash
#SBATCH --job-name=chip_bw
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=8
#SBATCH --mem=16g
#SBATCH --time=04:00:00
#SBATCH --array=1-3
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/05_bw_%A_%a.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called —
# the PATH (with fastp, minimap2, samtools, …) is inherited automatically.
cd ~/scratch/chipseq
ROW=$((SLURM_ARRAY_TASK_ID + 1))
SAMPLE=$(sed -n "${ROW}p" samples.tsv | cut -f1)
bamCoverage \
--bam bam/${SAMPLE}.dedup.bam \
--outFileName bw/${SAMPLE}.bw \
--normalizeUsing RPKM \
--binSize 10 \
--numberOfProcessors ${SLURM_CPUS_PER_TASK}
Submit:
BW_JID=$(sbatch --parsable --dependency=afterok:${DEDUP_JID} scripts/05_bigwig.sh)
6. MACS3 Peak Calling (ChIP-only Array Job)
Only the chip rows in samples.tsv get peak-called. Each task looks up its control column to find the matching Input BAM.
#!/bin/bash
#SBATCH --job-name=chip_macs3
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=4
#SBATCH --mem=16g
#SBATCH --time=04:00:00
#SBATCH --array=1-2 # number of `chip`-type rows (ctcf_rep1, ctcf_rep2)
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/06_macs3_%A_%a.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called —
# the PATH (with fastp, minimap2, samtools, …) is inherited automatically.
cd ~/scratch/chipseq
# Pull only the `chip`-type rows (skip header and input rows)
CHIP_ROWS=($(awk -F'\t' 'NR>1 && $3=="chip" {print $1":"$4}' samples.tsv))
IDX=$((SLURM_ARRAY_TASK_ID - 1))   # array IDs are 1-based; bash arrays are 0-based
PAIR=${CHIP_ROWS[$IDX]}            # e.g. "ctcf_rep1:input_k562"
SAMPLE=${PAIR%:*}                  # text before the ':' — the ChIP sample
CONTROL=${PAIR#*:}                 # text after the ':' — its matching input
echo "Task ${SLURM_ARRAY_TASK_ID}: ChIP=${SAMPLE} Control=${CONTROL}"
macs3 callpeak \
-t bam/${SAMPLE}.dedup.bam \
-c bam/${CONTROL}.dedup.bam \
--format BAM \
--gsize hs \
--name ${SAMPLE} \
--outdir peaks/${SAMPLE} \
--qvalue 0.05 \
--nomodel --extsize 147 # skip model building; use for low-depth / small-reference data
echo "Peaks written to peaks/${SAMPLE}/"
wc -l peaks/${SAMPLE}/${SAMPLE}_peaks.narrowPeak
Dynamic array size tip: if you don't want to manually count ChIP rows, set the array size at submission time:
N_CHIP=$(awk -F'\t' 'NR>1 && $3=="chip"' samples.tsv | wc -l)
sbatch --array=1-${N_CHIP} --dependency=afterok:${DEDUP_JID} scripts/06_macs3.sh
Submit:
N_CHIP=$(awk -F'\t' 'NR>1 && $3=="chip"' samples.tsv | wc -l)
MACS_JID=$(sbatch --parsable --array=1-${N_CHIP} --dependency=afterok:${DEDUP_JID} scripts/06_macs3.sh)
7. QC Aggregation (Single Job — runs after everything else)
Combines fingerprint, correlation, IDR, and MultiQC into one post-processing job.
#!/bin/bash
#SBATCH --job-name=chip_qc
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=8
#SBATCH --mem=16g
#SBATCH --time=04:00:00
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/07_qc_%j.out
set -euo pipefail
# Environment is activated in the login shell before `sbatch` is called —
# the PATH (with fastp, minimap2, samtools, …) is inherited automatically.
cd ~/scratch/chipseq
# ---- deepTools fingerprint (global IP quality) ----
BAMS=$(awk -F'\t' 'NR>1 {print "bam/"$1".dedup.bam"}' samples.tsv | tr '\n' ' ')
LABELS=$(awk -F'\t' 'NR>1 {print $1}' samples.tsv | tr '\n' ' ')
plotFingerprint \
--bamfiles ${BAMS} \
--labels ${LABELS} \
--plotFile qc/fingerprint.png \
--outRawCounts qc/fingerprint.tab \
--numberOfProcessors ${SLURM_CPUS_PER_TASK}
# ---- Replicate correlation ----
multiBamSummary bins \
--bamfiles ${BAMS} \
--labels ${LABELS} \
--outFileName qc/multibam.npz \
--numberOfProcessors ${SLURM_CPUS_PER_TASK}
plotCorrelation \
--corData qc/multibam.npz \
--corMethod pearson \
--skipZeros \
--whatToPlot heatmap \
--colorMap RdYlBu \
--plotFile qc/correlation.png \
--outFileCorMatrix qc/correlation.txt
# ---- TSS heatmap for every ChIP sample ----
CHIP_BWS=$(awk -F'\t' 'NR>1 && $3=="chip" {print "bw/"$1".bw"}' samples.tsv | tr '\n' ' ')
CHIP_LABELS=$(awk -F'\t' 'NR>1 && $3=="chip" {print $1}' samples.tsv | tr '\n' ' ')
computeMatrix reference-point \
--scoreFileName ${CHIP_BWS} \
--regionsFileName TSS.bed \
--referencePoint center \
--upstream 3000 --downstream 3000 \
--binSize 50 \
--numberOfProcessors ${SLURM_CPUS_PER_TASK} \
--samplesLabel ${CHIP_LABELS} \
-o qc/tss_matrix.gz
plotHeatmap \
--matrixFile qc/tss_matrix.gz \
--outFileName qc/tss_heatmap.png \
--colorMap Blues
# ---- IDR (only if exactly 2 replicates per condition) ----
# Generalize this loop for your conditions.
if [ -f peaks/ctcf_rep1/ctcf_rep1_peaks.narrowPeak ] && \
[ -f peaks/ctcf_rep2/ctcf_rep2_peaks.narrowPeak ]; then
sort -k5,5rn peaks/ctcf_rep1/ctcf_rep1_peaks.narrowPeak > qc/rep1_sorted.np
sort -k5,5rn peaks/ctcf_rep2/ctcf_rep2_peaks.narrowPeak > qc/rep2_sorted.np
idr \
--samples qc/rep1_sorted.np qc/rep2_sorted.np \
--input-file-type narrowPeak \
--rank p.value \
--output-file qc/idr_peaks.txt \
--plot \
--log-output-file qc/idr.log
fi
# ---- MultiQC aggregates fastp + flagstat + MarkDup + MACS3 + deepTools fingerprint ----
# NOTE: Explicit --module whitelist matches what this pipeline actually emits
# (no rsem, no gatk). We *also* pass --exclude rsem --exclude gatk as
# belt-and-suspenders: MultiQC 1.34's rsem parser ValueErrors on stray
# '#'-prefixed files in sibling dirs (e.g. stale ~/scratch/rnaseq/ outputs
# from RNA-Seq lessons), and the gatk module hits a rich.panel
# AttributeError on BQSR scatters — neither is relevant here, but we don't
# want either to load even if the whitelist parser changes upstream.
# Re-evaluate once the upstream MultiQC bugs are fixed.
multiqc . -o qc/ -n chipseq_report --force \
--module fastp \
--module samtools \
--module picard \
--module deeptools \
--module macs2 \
--exclude rsem \
--exclude gatk
⚠️ Do NOT run a bare `multiqc .` from `~/scratch/` or any parent directory. MultiQC 1.34 will scan everything below the cwd and try to parse files from sibling pipelines (e.g. `~/scratch/rnaseq/`), tripping known module bugs (`rsem.py` ValueError on stray `#` lines, the GATK BQSR `rich.panel` AttributeError). The safest path is just:

```
sbatch scripts/07_qc.sh
```

If you must run it interactively, mirror the script's flag set exactly — `cd` into the chipseq workspace and pass the same module whitelist + excludes:

```
cd ~/scratch/chipseq
multiqc . -o qc/ -n chipseq_report --force \
    --module fastp --module samtools --module picard \
    --module deeptools --module macs2 \
    --exclude rsem --exclude gatk
```
Submit:
QC_JID=$(sbatch --parsable --dependency=afterok:${MACS_JID}:${BW_JID}:${REF_JID}:${ALIGN_JID} scripts/07_qc.sh)
8. Submit the Entire Pipeline with One Script
#!/bin/bash
# scripts/run_all.sh
set -euo pipefail
cd ~/scratch/chipseq
# Activate the env in THIS shell so every sbatch below inherits the PATH
# (sbatch --export=ALL is the default β the submitted jobs see the same tools)
export MAMBA_ROOT_PREFIX="${MAMBA_ROOT_PREFIX:-$HOME/micromamba}"
eval "$(micromamba shell hook --shell=bash)"
micromamba activate chipseq_bch709
N_SAMPLES=$(awk 'NR>1' samples.tsv | wc -l)
N_CHIP=$(awk -F'\t' 'NR>1 && $3=="chip"' samples.tsv | wc -l)
DL_JID=$(sbatch --parsable --array=1-${N_SAMPLES} scripts/01_download.sh)
REF_JID=$(sbatch --parsable scripts/02_reference.sh)
ALIGN_JID=$(sbatch --parsable --array=1-${N_SAMPLES} --dependency=afterok:${DL_JID}:${REF_JID} scripts/03_align.sh)
DEDUP_JID=$(sbatch --parsable --array=1-${N_SAMPLES} --dependency=afterok:${ALIGN_JID} scripts/04_dedup.sh)
BW_JID=$(sbatch --parsable --array=1-${N_SAMPLES} --dependency=afterok:${DEDUP_JID} scripts/05_bigwig.sh)
MACS_JID=$(sbatch --parsable --array=1-${N_CHIP} --dependency=afterok:${DEDUP_JID} scripts/06_macs3.sh)
QC_JID=$(sbatch --parsable --dependency=afterok:${MACS_JID}:${BW_JID}:${REF_JID}:${ALIGN_JID} scripts/07_qc.sh)
cat <<EOF
Submitted ChIP-Seq pipeline (${N_SAMPLES} samples, ${N_CHIP} ChIP):
download ${DL_JID}
reference ${REF_JID}
trim+align ${ALIGN_JID}
filter+dedup ${DEDUP_JID}
bigwig ${BW_JID}
macs3 ${MACS_JID}
qc ${QC_JID}
Monitor: squeue -u \$USER
Cancel all: scancel ${DL_JID} ${REF_JID} ${ALIGN_JID} ${DEDUP_JID} ${BW_JID} ${MACS_JID} ${QC_JID}
EOF
Run:
chmod +x scripts/*.sh
bash scripts/run_all.sh
🧑‍💻 Hands-on walkthrough — submit the pipeline step-by-step
If you want to see exactly what run_all.sh does (or debug one step), submit each stage manually. Every sbatch returns a job ID that the next step depends on.
Do this first (login shell — one time):
micromamba activate chipseq_bch709
cd ~/scratch/chipseq
N_SAMPLES=$(awk 'NR>1' samples.tsv | wc -l)
N_CHIP=$(awk -F'\t' 'NR>1 && $3=="chip"' samples.tsv | wc -l)
echo "Samples: $N_SAMPLES | ChIP: $N_CHIP"
Then submit each step — each line is one command:
# --- Step 1: download FASTQs (no prerequisites) ---
DL_JID=$(sbatch --parsable --array=1-${N_SAMPLES} scripts/01_download.sh)
echo "download → $DL_JID"
# --- Step 2: reference prep (independent β runs in parallel with Step 1) ---
REF_JID=$(sbatch --parsable scripts/02_reference.sh)
echo "reference → $REF_JID"
# --- Step 3: trim + align (waits for BOTH download and reference) ---
ALIGN_JID=$(sbatch --parsable --array=1-${N_SAMPLES} \
--dependency=afterok:${DL_JID}:${REF_JID} scripts/03_align.sh)
echo "align → $ALIGN_JID"
# --- Step 4: filter MAPQ + dedup (waits for align) ---
DEDUP_JID=$(sbatch --parsable --array=1-${N_SAMPLES} \
--dependency=afterok:${ALIGN_JID} scripts/04_dedup.sh)
echo "dedup → $DEDUP_JID"
# --- Step 5: bigwig (waits for dedup) ---
BW_JID=$(sbatch --parsable --array=1-${N_SAMPLES} \
--dependency=afterok:${DEDUP_JID} scripts/05_bigwig.sh)
echo "bigwig → $BW_JID"
# --- Step 6: MACS3 peak calling (ChIP only, waits for dedup) ---
MACS_JID=$(sbatch --parsable --array=1-${N_CHIP} \
--dependency=afterok:${DEDUP_JID} scripts/06_macs3.sh)
echo "macs3 → $MACS_JID"
# --- Step 7: QC aggregation (waits for bigwig, MACS3, reference TSS.bed, and align outputs) ---
QC_JID=$(sbatch --parsable --dependency=afterok:${MACS_JID}:${BW_JID}:${REF_JID}:${ALIGN_JID} scripts/07_qc.sh)
echo "qc → $QC_JID"
# Check that everything is queued
squeue -u $USER
# Steps 3-7 should show state PD with reason (Dependency)
After pasting the block, `squeue` shows all 7 jobs — some running, most pending. Slurm takes care of the ordering.
Why copy the commands into your terminal, not a script?

The hands-on walkthrough is literally what `run_all.sh` does — but by typing each line you see each job ID appear and can inspect things in between. Once you're comfortable, just run `bash scripts/run_all.sh` next time.
9. Monitoring
squeue -u $USER # live queue
sacct -u $USER --starttime=today --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode
tail -f logs/06_macs3_${MACS_JID}_1.out # follow one task's output
Cancel the whole pipeline if something's wrong:
scancel -u $USER
10. Resource Tuning
After the first successful run, review actual usage and shrink the requests:
sacct -u $USER --starttime=today \
--format=JobID,JobName,State,Elapsed,MaxRSS,ReqMem,ReqCPUs
| If you see | Then | Why |
|---|---|---|
| `MaxRSS` ≪ `ReqMem` | Lower `--mem` | Job queues faster on smaller requests |
| `Elapsed` ≪ `ReqTime` | Lower `--time` | Same |
| `OUT_OF_MEMORY` | Raise `--mem` to ~2× `MaxRSS` | Leave headroom for Java GC |
| `TIMEOUT` on align | Raise `--time`, check FASTQ size | Very deep BAMs take longer |
Heap vs Slurm memory: Picard and GATK obey `-Xmx` (Java heap), not `--mem`. Always set `-Xmx` 2–4 GB below `--mem` (e.g. `--mem=16g` → `-Xmx12g`) so the JVM has room for off-heap stack/GC.
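One way to keep the two numbers from drifting apart is to derive the heap from a single variable. This is a hypothetical helper pattern (the scripts above hard-code both values), sketched here with the same 4 GB headroom:

```shell
# Single source of truth for memory: MEM_GB matches what you request from
# Slurm; the JVM heap is capped 4 GB below it so off-heap allocations
# (thread stacks, GC metadata) still fit inside the cgroup limit.
MEM_GB=16                    # matches #SBATCH --mem=16g
XMX_GB=$(( MEM_GB - 4 ))     # heap = 12 GB
echo "picard -Xmx${XMX_GB}g MarkDuplicates ..."   # → picard -Xmx12g MarkDuplicates ...
```

Bump `MEM_GB` after an `OUT_OF_MEMORY` and the heap follows automatically, keeping the headroom intact.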
11. Expected Wall Time
For 3 samples (2 ChIP + 1 Input, ~35M reads each, full hg19):
| Step | Cores × array | Wall time |
|---|---|---|
| Download | 2 × 3 | ~10 min |
| Reference | 4 | ~10 min |
| Trim + align | 8 × 3 | ~25 min |
| Filter + dedup | 4 × 3 | ~10 min |
| BigWig | 8 × 3 | ~5 min |
| MACS3 | 4 × 2 | ~5 min |
| QC aggregation | 8 | ~5 min |
| End-to-end |  | ~45 min |
Versus ~10 hours on a laptop for the same 3 samples.
12. HPC-Specific Errors
`Batch job submission failed: Invalid account or account/partition combination`

Fix: re-run `sacctmgr show user $USER ... withassoc` and copy the values exactly into each script.
MACS3 array task: `sed: -e expression: unmatched ...` on `samples.tsv`

Cause: `samples.tsv` has blank trailing lines or the task index doesn't match a real row. Fix:

```
cat -A samples.tsv   # look for stray $ or tabs
awk 'NF' samples.tsv > samples.tsv.clean && mv samples.tsv.clean samples.tsv
```
MACS3 hangs or fails with `peaks: 0`

Cause: ChIP too shallow to build the fragment model, or input far deeper than ChIP. Fix: our script already passes `--nomodel --extsize 147`. If peaks are still zero, check FRiP and confirm the ChIP/Input pair is correct.
IDR: `AttributeError: module 'numpy' has no attribute 'int'`

Cause: IDR 2.0.3 still uses `numpy.int` (alias removed in NumPy 1.24+).

Fix: the env recipe above pins `'numpy<1.24'`, so a fresh env is fine. Existing envs built earlier (with the old `numpy<2.0` pin that let 1.24+ slip in) — patch with `micromamba install` first (keeps the conda metadata consistent); only fall back to `pip install` if the conda solver refuses to downgrade.

```
# Step 1 — verify current numpy
micromamba run -n chipseq_bch709 python -c "import numpy; print(numpy.__version__)"

# Step 2 — preferred: conda-side downgrade
micromamba install -n chipseq_bch709 -c conda-forge 'numpy<1.24' -y

# Step 3 — re-verify; if still >=1.24, the solver kept the old version
#          (soft-pin from another package). Force it with pip:
micromamba run -n chipseq_bch709 pip install -U --force-reinstall 'numpy<1.24'

# Step 4 — final verify (should print 1.23.x or earlier)
micromamba run -n chipseq_bch709 python -c "import numpy; print(numpy.__version__)"
```
Pipeline queued but downstream jobs stuck in `(Dependency)` state forever

Cause: a parent array task exited non-zero. `afterok` treats the whole array as failed if any single task fails. Fix: find the failing task with `sacct -u $USER --starttime=today | grep FAILED`, fix it, resubmit only that step, and update the downstream `--dependency` to the new job ID. See the next callout for the exact per-step commands when every upstream step is already `COMPLETED`.
Re-running a single failed step (drop `--dependency=`)

When only one step failed and every upstream step is already in `COMPLETED` state, submit just the failed script with no `--dependency=` flag. The chained driver in Section 8 only needs `--dependency=...` because it submits everything at once, with `sbatch --parsable` capturing job IDs that don't yet exist as completed jobs.

Use these commands. Each is independent — pick the one for the step that failed, drop the rest:

```
sbatch scripts/01_download.sh    # array — FASTQ download (one task per row in samples.tsv)
sbatch scripts/02_reference.sh   # download + minimap2 index + TSS.bed
sbatch scripts/03_align.sh       # array — fastp + minimap2 -ax sr
sbatch scripts/04_dedup.sh       # array — MAPQ filter + Picard MarkDuplicates
sbatch scripts/05_bigwig.sh      # array — deepTools bamCoverage → bw/
sbatch scripts/06_macs3.sh       # array (ChIP rows only) — MACS3 callpeak
sbatch scripts/07_qc.sh          # aggregated QC — fingerprint + IDR + MultiQC
```

Re-run ONLY a subset of array tasks (e.g. only the 2nd ChIP replicate failed in step 6):

```
sbatch --array=2 scripts/06_macs3.sh
```

This overrides the `--array=` line in the script header and runs only the listed task IDs.

Special cases that need extra setup before re-running:

- Step 6 (`06_macs3.sh`) — needs both the ChIP and the matching Input dedup BAM (`bam/${SAMPLE}.dedup.bam` and `bam/${CONTROL}.dedup.bam`). If only one ChIP replicate failed, confirm the input BAM (e.g. `bam/input_k562.dedup.bam`) still exists before resubmitting.
- Step 7 (`07_qc.sh`) — the IDR step is gated on `peaks/ctcf_rep1/ctcf_rep1_peaks.narrowPeak` and `peaks/ctcf_rep2/...narrowPeak`. If MACS3 was rerun, those files exist → IDR will run. If you want the QC report even when MACS3 partly failed, resubmit with `--dependency=afterany` instead of `afterok`.

Sanity check before resubmitting — verify upstream outputs exist:

```
sacct -u $USER --format=JobID,JobName%-25,State,ExitCode --starttime today
ls -lh ~/scratch/chipseq/{bam,bw,peaks}
```
13. Next Steps
- Differential binding: with multiple conditions, run DiffBind in R on the dedup BAMs + narrowPeak files.
- Motif discovery: after the pipeline, run `findMotifsGenome.pl` (HOMER) on the PASS peaks.
- Peak annotation: use `ChIPseeker` in R to annotate peaks relative to genes/TSS.
- Scaling to 100+ samples: add more rows to `samples.tsv` — the `N_SAMPLES`/`N_CHIP` variables in `run_all.sh` pick up the new count automatically.
- Workflow managers: for production use, look at Snakemake or Nextflow — they handle this dependency bookkeeping natively.
References
| Resource | Link |
|---|---|
| Slurm Array Jobs | slurm.schedmd.com/job_array.html |
| Slurm Job Dependencies | slurm.schedmd.com/sbatch.html#OPT_dependency |
| MACS3 | macs3-project.github.io/MACS |
| deepTools | deeptools.readthedocs.io |
| Basic ChIP-Seq (laptop) | ChIP-Seq Tutorial |
| HPC Cluster fundamentals | HPC Cluster Lesson |
| Basic Resequencing HPC | Resequencing HPC |