🤖 BCH709 AI Assistant: Ask questions about this class using NotebookLM

BCH709 Introduction to Bioinformatics: HPC RNA-Seq

🗺️ HPC series — you are here

1. HPC Cluster basics — SSH, file transfer, Micromamba, Slurm, dependencies
2. Resequencing pipeline on HPC — BWA-MEM2 + GATK, variant calling
3. ChIP-Seq pipeline on HPC — minimap2 + MACS3, peak calling
4. 🔵 RNA-Seq pipeline (this page) — STAR + featureCounts, DE analysis

✅ Before you start — pre-class checklist

  • You can ssh <netid>@pronghorn.rc.unr.edu and see the Pronghorn prompt
  • sacctmgr show user $USER withassoc shows account cpu-s5-bch709-6 / partition cpu-core-0
  • ~/scratch symlink exists and points under /data/gpfs/assoc/bch709-6/<netid>
  • micromamba --version works on the login node (shell hook in ~/.bashrc)
  • You ran the laptop version of the RNA-Seq tutorial at least once so the biology makes sense

Any box unchecked? → go back to the HPC Cluster lesson first.


Using Pronghorn (High-Performance Computing)

Pronghorn is the University of Nevada, Reno's High-Performance Computing (HPC) cluster. The GPU-accelerated system is designed, built, and maintained by the Office of Information Technology's HPC Team. Pronghorn and the HPC Team support general research across the Nevada System of Higher Education (NSHE).

Pronghorn is composed of CPU, GPU, and Storage subsystems interconnected by a 100Gb/s non-blocking Intel Omni-Path fabric. The CPU partition features 93 nodes, 2,976 CPU cores, and 21TiB of memory. The GPU partition features 44 NVIDIA Tesla P100 GPUs, 352 CPU cores, and 2.75TiB of memory. The storage system uses the IBM SpectrumScale file system to provide 1PB of high-performance storage. The computational and storage capabilities of Pronghorn will regularly expand to meet NSHE computing demands.

Pronghorn is co-located at the Switch Citadel Campus, 25 miles east of the University of Nevada, Reno. Switch specializes in sustainable data center design and operation; the Citadel campus is rated Tier 5 Platinum and is among the largest, most advanced data center campuses in the world.

Pronghorn system map

Slurm Start Tutorial

Resource sharing on a supercomputer dedicated to technical and/or scientific computing is often organized by a piece of software called a resource manager or job scheduler. Users submit jobs, which are scheduled and allocated resources (CPU time, memory, etc.) by the resource manager.

Slurm is a resource manager and job scheduler designed to do just that, and much more. It was originally created at Lawrence Livermore National Laboratory and has grown into full-fledged open-source software backed by a large community, commercially supported by the original developers, and installed on many of the Top500 supercomputers.

Gathering information

Slurm offers many commands you can use to interact with the system. For instance, the sinfo command gives an overview of the resources offered by the cluster, while the squeue command shows to which jobs those resources are currently allocated.

By default, sinfo lists the partitions that are available. A partition is a set of compute nodes (computers dedicated to… computing) grouped logically. Typical examples include partitions dedicated to batch processing, debugging, post processing, or visualization.

sinfo

sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu-s2-core-0     up 14-00:00:0      2    mix cpu-[8-9]
cpu-s2-core-0     up 14-00:00:0      7  alloc cpu-[1-2,4-6,78-79]
cpu-s2-core-0     up 14-00:00:0     44   idle cpu-[0,3,7,10-47,64,76-77]
cpu-s3-core-0*    up    2:00:00      2    mix cpu-[8-9]
cpu-s3-core-0*    up    2:00:00      7  alloc cpu-[1-2,4-6,78-79]
cpu-s3-core-0*    up    2:00:00     44   idle cpu-[0,3,7,10-47,64,76-77]
gpu-s2-core-0     up 14-00:00:0     11   idle gpu-[0-10]
cpu-s6-core-0     up      15:00      2   idle cpu-[65-66]
cpu-s1-pgl-0      up 14-00:00:0      1    mix cpu-49
cpu-s1-pgl-0      up 14-00:00:0      1  alloc cpu-48
cpu-s1-pgl-0      up 14-00:00:0      2   idle cpu-[50-51]

In the output above, we see several partitions; cpu-s3-core-0 is the default partition, as it is marked with an asterisk. Nodes in a partition can be idle (free), alloc (fully allocated), or mix (partially allocated): for example, all nodes of gpu-s2-core-0 are idle, while cpu-s1-pgl-0 has a mix of allocated and idle nodes.

The sinfo command also lists the time limit (column TIMELIMIT) to which jobs are subject. On every cluster, jobs are limited to a maximum run time, to allow job rotation and give every user a chance to see their job start. Generally, the larger the cluster, the shorter the maximum allowed time. You can find the details on the cluster page.

You can specify precisely what information you would like sinfo to output by using its --format option (for example, sinfo --format="%P %l %D %t" prints partition, time limit, node count, and state). For more details, have a look at the command manpage with man sinfo.

squeue

The squeue command shows the list of jobs which are currently running (they are in the RUNNING state, noted as ‘R’) or waiting for resources (noted as ‘PD’, short for PENDING).

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            983204 cpu-s2-co    neb_K jzhang23  R 6-09:05:47      1 cpu-6
            983660 cpu-s2-co   RT3.sl yinghanc  R   12:56:17      1 cpu-9
            983659 cpu-s2-co   RT4.sl yinghanc  R   12:56:21      1 cpu-8
            983068 cpu-s2-co Gd-bound   dcantu  R 7-06:16:01      2 cpu-[78-79]
            983067 cpu-s2-co Gd-unbou   dcantu  R 1-17:41:56      2 cpu-[1-2]
            983472 cpu-s2-co   ub-all   dcantu  R 3-10:05:01      2 cpu-[4-5]
            982604 cpu-s1-pg     wrap     wyim  R 12-14:35:23      1 cpu-49
            983585 cpu-s1-pg     wrap     wyim  R 1-06:28:29      1 cpu-48
            983628 cpu-s1-pg     wrap     wyim  R   13:44:46      1 cpu-49

Text editor

SBATCH

Now the question is: How do you create a job?

A job consists of two parts: resource requests and job steps. Resource requests specify the number of CPUs, the expected duration, the amount of RAM or disk space, and so on. Job steps describe the tasks that must be done and the software that must be run.

The typical way of creating a job is to write a submission script. A submission script is a shell script, e.g. a Bash script, whose comments, if they are prefixed with #SBATCH, are understood by Slurm as parameters describing resource requests and other submission options. You can get the complete list of parameters from the sbatch manpage: man sbatch.

Important

The #SBATCH directives must appear at the top of the submission file, before any other line except the shebang (e.g. #!/bin/bash), which must be the very first line. The script body itself is a job step; other job steps are created with the srun command. For instance, the following script, hypothetically named submit.sh,

nano submit.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --mail-type=all
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=8:10:00
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH -o test_%j.out

for i in {1..1000}; 
do 
    echo $i;
    sleep 1; 
done

would request one task (one CPU) with 1 GB of RAM per CPU for up to 8 hours and 10 minutes, on the cpu-core-0 partition under the class account. When started, the job executes the script body as a single job step: the loop prints a counter and sleeps one second, 1,000 times. Note that the --job-name parameter gives the job a meaningful name and -o (short for --output) defines the file to which the job's output is written; %j is replaced by the job ID.

Once the submission script is written, submit it to Slurm with the sbatch command, which, upon success, responds with the job ID assigned to the job:

chmod 775 submit.sh
sbatch submit.sh
Submitted batch job 99999999
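
When scripting submissions, capture the job ID instead of copying it by eye. sbatch --parsable prints just the bare ID; if you only have the default "Submitted batch job N" message, the ID is the fourth whitespace-separated field. A minimal sketch (the ID below is the placeholder from above, not a real job):

```shell
# Preferred on the cluster: JID=$(sbatch --parsable submit.sh)
# Fallback: parse the default sbatch message (fourth field).
MSG="Submitted batch job 99999999"     # placeholder sbatch output
JID=$(echo "$MSG" | awk '{print $4}')
echo "$JID"                            # → 99999999
```

You can then reuse $JID with scancel "$JID" or squeue -j "$JID".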

How to cancel the job?

scancel <JOB ID>

Scratch disk space — already set up

You already created ~/scratch (symlink to /data/gpfs/assoc/bch709-6/<your_netid>) in the HPC Cluster lesson. All paths in this lesson use ~/scratch directly. If ls -la ~/scratch doesn’t show a symlink to your scratch directory, go back and create it first.


Importing Data from the NCBI Sequence Read Archive (SRA)

Working path

cd ~/scratch
pwd

Micromamba environment

We use Micromamba for package management on Pronghorn — see the HPC Cluster lesson for installation.

micromamba create -n RNASEQ_bch709 -c conda-forge -c bioconda python=3.11 -y
micromamba activate RNASEQ_bch709

micromamba install -c conda-forge -c bioconda \
    minimap2 star 'samtools>=1.20' subread \
    openjdk=17 'trinity>=2.15' gffread seqkit kraken2 'fastp>=0.24' \
    perl-dbi perl-dbd-sqlite perl-html-parser \
    pandas numpy -y
# NOTE 1: `perl-bioperl` is intentionally NOT installed. Its current bioconda
# build pins libzlib<1.3, which conflicts with modern samtools/Trinity/kraken2.
# Trinity assembly itself does not need BioPerl — it is only required by a
# few legacy auxiliary scripts. If you ever need BioPerl, install it later
# in a SEPARATE env: `micromamba create -n bioperl -c bioconda perl-bioperl`
# NOTE 2: we deliberately do NOT install `sra-tools`. Bioconda's sra-tools 3.x
# is built against GLIBC 2.27+, newer than Pronghorn's system libc — so
# `prefetch` / `fastq-dump` crash on the compute nodes with
# "GLIBC_2.27 not found". The download steps below pull FASTQ from ENA over
# HTTPS with `curl`, which works regardless of the system GLIBC.

# Upgrade pip first — older pip can't find the prebuilt `tiktoken`
# manylinux wheel (a transitive multiqc dep), tries to build it from
# Rust source, fails on Pronghorn (no Rust compiler).
pip install --upgrade pip

# MultiQC + pinned deps. `tiktoken<0.8` is the safety pin — older
# tiktoken has stable cp311 linux wheels.
pip install --prefer-binary \
    'numpy<2.0' 'pyarrow<17' 'tiktoken<0.8' multiqc

Patch libcrypto so samtools runs (do this now, not after it crashes):

Bioconda’s samtools is linked against libcrypto.so.1.0.0, but the current OpenSSL package in the env ships libcrypto.so.3 (or .1.1). Without a symlink you get:

samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file

Create the symlink once, right after activating:

# Env must be ACTIVE so $CONDA_PREFIX points at the env folder
cd "$CONDA_PREFIX/lib"
if   [ -f libcrypto.so.1.1 ]; then ln -sf libcrypto.so.1.1 libcrypto.so.1.0.0
elif [ -f libcrypto.so.3   ]; then ln -sf libcrypto.so.3   libcrypto.so.1.0.0
fi
cd - > /dev/null

# Verify
samtools --version | head -1     # → samtools 1.xx (no libcrypto error)

If samtools --version prints a version, you’re done. If it still errors, ask the instructor before moving on.
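
The if/elif fallback above is plain shell, so you can sanity-check the logic anywhere. Here it is replayed in a throwaway directory with a dummy file standing in for the env's OpenSSL library (file names mimic the real ones; nothing touches your env):

```shell
# Replay of the symlink fallback in a scratch directory (dummy files only).
OLD=$PWD
TMP=$(mktemp -d)
cd "$TMP"
touch libcrypto.so.3                 # pretend the env ships OpenSSL 3
if   [ -f libcrypto.so.1.1 ]; then ln -sf libcrypto.so.1.1 libcrypto.so.1.0.0
elif [ -f libcrypto.so.3   ]; then ln -sf libcrypto.so.3   libcrypto.so.1.0.0
fi
readlink libcrypto.so.1.0.0          # → libcrypto.so.3
cd "$OLD"
rm -rf "$TMP"
```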

SRA

Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys.

Searching the SRA: Searching the SRA can be complicated. Often a paper or reference specifies the accession number(s) connected to a dataset. You can also search flexibly using a number of terms (such as the organism name) or filters (e.g. DNA vs. RNA). The SRA Help Manual provides several useful explanations. It is important to know that projects are organized and related at several levels, and some important terms include:

  • BioProject: a collection of biological data related to a single initiative, originating from a single organization or from a consortium of coordinating organizations; see for example BioProject 272719
  • BioSample: a description of the source materials for a project
  • Run: an actual sequencing run (accessions usually start with SRR); see for example SRR1761506
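
A quick way to keep the record types straight is by their accession prefixes. The tiny helper below is only an illustration (not an official tool) mapping a prefix to the record type:

```shell
# Illustrative helper: classify an SRA-style accession by its prefix.
# PRJ* = BioProject, SAMN* = BioSample, SRP = Study, SRX = Experiment,
# SRS = Sample, SRR = Run.
sra_kind() {
  case "$1" in
    PRJ*)  echo BioProject ;;
    SAMN*) echo BioSample  ;;
    SRP*)  echo Study      ;;
    SRX*)  echo Experiment ;;
    SRS*)  echo Sample     ;;
    SRR*)  echo Run        ;;
    *)     echo unknown    ;;
  esac
}
sra_kind SRR1761506     # → Run
sra_kind PRJNA272719    # → BioProject
```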

Publication (Arabidopsis)

Kim JS et al., “ROS1-Dependent DNA Demethylation Is Required for ABA-Inducible NIC3 Expression.”, Plant Physiol, 2019 Apr;179(4):1810-1821

SRA Bioproject site

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA272719

Runinfo

Run ReleaseDate LoadDate spots bases spots_with_mates avgLength size_MB AssemblyName download_path Experiment LibraryName LibraryStrategy LibrarySelection LibrarySource LibraryLayout InsertSize InsertDev Platform Model SRAStudy BioProject Study_Pubmed_id ProjectID Sample BioSample SampleType TaxID ScientificName SampleName g1k_pop_code source g1k_analysis_group Subject_ID Sex Disease Tumor Affection_Status Analyte_Type Histological_Type Body_Site CenterName Submission dbgap_study_accession Consent RunHash ReadHash
SRR1761506 1/15/2016 15:51 1/15/2015 12:43 7379945 1490748890 7379945 202 899   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761506/SRR1761506.1 SRX844600   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820503 SAMN03285048 simple 3702 Arabidopsis thaliana GSM1585887             no         GEO SRA232612   public F335FB96DDD730AC6D3AE4F6683BF234 12818EB5275BCB7BCB815E147BFD0619
SRR1761507 1/15/2016 15:51 1/15/2015 12:43 9182965 1854958930 9182965 202 1123   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761507/SRR1761507.1 SRX844601   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820504 SAMN03285045 simple 3702 Arabidopsis thaliana GSM1585888             no         GEO SRA232612   public 00FD62759BF7BBAEF123BF5960B2A616 A61DCD3B96AB0796AB5E969F24F81B76
SRR1761508 1/15/2016 15:51 1/15/2015 12:47 19060611 3850243422 19060611 202 2324   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761508/SRR1761508.1 SRX844602   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820505 SAMN03285046 simple 3702 Arabidopsis thaliana GSM1585889             no         GEO SRA232612   public B75A3E64E88B1900102264522D2281CB 657987ABC8043768E99BD82947608CAC
SRR1761509 1/15/2016 15:51 1/15/2015 12:51 16555739 3344259278 16555739 202 2016   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761509/SRR1761509.1 SRX844603   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820506 SAMN03285049 simple 3702 Arabidopsis thaliana GSM1585890             no         GEO SRA232612   public 27CA2B82B69EEF56EAF53D3F464EEB7B 2B56CA09F3655F4BBB412FD2EE8D956C
SRR1761510 1/15/2016 15:51 1/15/2015 12:46 12700942 2565590284 12700942 202 1552   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761510/SRR1761510.1 SRX844604   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820508 SAMN03285050 simple 3702 Arabidopsis thaliana GSM1585891             no         GEO SRA232612   public D3901795C7ED74B8850480132F4688DA 476A9484DCFCF9FFFDAADAAF4CE5D0EA
SRR1761511 1/15/2016 15:51 1/15/2015 12:44 13353992 2697506384 13353992 202 1639   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761511/SRR1761511.1 SRX844605   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820507 SAMN03285047 simple 3702 Arabidopsis thaliana GSM1585892             no         GEO SRA232612   public 5078379601081319FCBF67C7465C404A E3B4195AFEA115ACDA6DEF6E4AA7D8DF
SRR1761512 1/15/2016 15:51 1/15/2015 12:44 8134575 1643184150 8134575 202 1067   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761512/SRR1761512.1 SRX844606   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820509 SAMN03285051 simple 3702 Arabidopsis thaliana GSM1585893             no         GEO SRA232612   public DDB8F763B71B1E29CC9C1F4C53D88D07 8F31604D3A4120A50B2E49329A786FA6
SRR1761513 1/15/2016 15:51 1/15/2015 12:43 7333641 1481395482 7333641 202 960   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761513/SRR1761513.1 SRX844607   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820510 SAMN03285053 simple 3702 Arabidopsis thaliana GSM1585894             no         GEO SRA232612   public 4068AE245EB0A81DFF02889D35864AF2 8E05C4BC316FBDFEBAA3099C54E7517B
SRR1761514 1/15/2016 15:51 1/15/2015 12:44 6160111 1244342422 6160111 202 807   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761514/SRR1761514.1 SRX844608   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820511 SAMN03285059 simple 3702 Arabidopsis thaliana GSM1585895             no         GEO SRA232612   public 0A1F3E9192E7F9F4B3758B1CE514D264 81BFDB94C797624B34AFFEB554CE4D98
SRR1761515 1/15/2016 15:51 1/15/2015 12:44 7988876 1613752952 7988876 202 1048   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761515/SRR1761515.1 SRX844609   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820512 SAMN03285054 simple 3702 Arabidopsis thaliana GSM1585896             no         GEO SRA232612   public 39B37A0BD484C736616C5B0A45194525 85B031D74DF90AD1815AA1BBBF1F12BD
SRR1761516 1/15/2016 15:51 1/15/2015 12:44 8770090 1771558180 8770090 202 1152   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761516/SRR1761516.1 SRX844610   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820514 SAMN03285055 simple 3702 Arabidopsis thaliana GSM1585897             no         GEO SRA232612   public E4728DFBF0F9F04B89A5B041FA570EB3 B96545CB9C4C3EE1C9F1E8B3D4CE9D24
SRR1761517 1/15/2016 15:51 1/15/2015 12:44 8229157 1662289714 8229157 202 1075   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761517/SRR1761517.1 SRX844611   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820513 SAMN03285058 simple 3702 Arabidopsis thaliana GSM1585898             no         GEO SRA232612   public C05BC519960B075038834458514473EB 4EF7877FC59FF5214DBF2E2FE36D67C5
SRR1761518 1/15/2016 15:51 1/15/2015 12:44 8760931 1769708062 8760931 202 1072   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761518/SRR1761518.1 SRX844612   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820515 SAMN03285052 simple 3702 Arabidopsis thaliana GSM1585899             no         GEO SRA232612   public 7D8333182062545CECD5308A222FF506 382F586C4BF74E474D8F9282E36BE4EC
SRR1761519 1/15/2016 15:51 1/15/2015 12:44 6643107 1341907614 6643107 202 811   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761519/SRR1761519.1 SRX844613   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820516 SAMN03285056 simple 3702 Arabidopsis thaliana GSM1585900             no         GEO SRA232612   public 163BD8073D7E128D8AD1B253A722DD08 DFBCC891EB5FA97490E32935E54C9E14
SRR1761520 1/15/2016 15:51 1/15/2015 12:44 8506472 1718307344 8506472 202 1040   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761520/SRR1761520.1 SRX844614   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820517 SAMN03285062 simple 3702 Arabidopsis thaliana GSM1585901             no         GEO SRA232612   public 791BD0D8840AA5F1D74E396668638DA1 AF4694425D34F84095F6CFD6F4A09936
SRR1761521 1/15/2016 15:51 1/15/2015 12:46 13166085 2659549170 13166085 202 1609   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761521/SRR1761521.1 SRX844615   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820518 SAMN03285057 simple 3702 Arabidopsis thaliana GSM1585902             no         GEO SRA232612   public 47C40480E9B7DB62B4BEE0F2193D16B3 1443C58A943C07D3275AB12DC31644A9
SRR1761522 1/15/2016 15:51 1/15/2015 12:49 9496483 1918289566 9496483 202 1162   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761522/SRR1761522.1 SRX844616   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820519 SAMN03285061 simple 3702 Arabidopsis thaliana GSM1585903             no         GEO SRA232612   public BB05DF11E1F95427530D69DB5E0FA667 7706862FB2DF957E4041D2064A691CF6
SRR1761523 1/15/2016 15:51 1/15/2015 12:46 14999315 3029861630 14999315 202 1832   https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761523/SRR1761523.1 SRX844617   RNA-Seq cDNA TRANSCRIPTOMIC PAIRED 0 0 ILLUMINA Illumina HiSeq 2500 SRP052302 PRJNA272719 3 272719 SRS820520 SAMN03285060 simple 3702 Arabidopsis thaliana GSM1585904             no         GEO SRA232612   public 101D3A151E632224C09A702BD2F59CF5 0AC99FAA6B8941F89FFCBB8B1910696E

Subset of data

Sample    Run
WT_rep1 SRR1761506
WT_rep2 SRR1761507
WT_rep3 SRR1761508
ABA_rep1 SRR1761509
ABA_rep2 SRR1761510
ABA_rep3 SRR1761511
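
If you later script over these samples, one portable pattern (a sketch, not part of the required pipeline) is to keep the sample-to-run table in one place and read it in a loop, so SRR numbers are never retyped:

```shell
# Sample/run table from above, read line by line.
while read -r SAMPLE RUN; do
  echo "${SAMPLE} -> ${RUN}"
done <<'EOF'
WT_rep1 SRR1761506
WT_rep2 SRR1761507
WT_rep3 SRR1761508
ABA_rep1 SRR1761509
ABA_rep2 SRR1761510
ABA_rep3 SRR1761511
EOF
```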

🔁 Already ran fastq-dump + trim in HPC_cluster?

The HPC Cluster lesson (Workflow Steps 1–2) downloads the same Arabidopsis SRR1761506–SRR1761511 dataset and trims it with fastp into the shared workspace at ~/scratch/rnaseq/raw_data/ and ~/scratch/rnaseq/trim/. To reuse those results here (no need to re-download or re-trim):

# Sanity-check the shared workspace already has the trimmed reads
ls ~/scratch/rnaseq/trim/SRR1761506_1.trimmed.fq.gz \
   ~/scratch/rnaseq/trim/SRR1761506_2.trimmed.fq.gz

# Set up the ATH project sub-directory under the shared parent
mkdir -p ~/scratch/rnaseq/ATH
cd ~/scratch/rnaseq/ATH
ln -s ~/scratch/rnaseq/raw_data raw_data    # reuse existing FASTQ
ln -s ~/scratch/rnaseq/trim     trim        # reuse trimmed reads
mkdir -p reference bam logs                  # only the new directories (logs/ for Slurm stdout)

If the ls above prints both files, skip to “Reference downloads” below — STAR index + alignment is where this lesson really starts. Otherwise (fresh start, no HPC_cluster prerequisites done), follow the standard setup below.

mkdir -p ~/scratch/rnaseq/ATH
cd ~/scratch/rnaseq/ATH
mkdir -p raw_data trim reference bam logs
pwd

FASTQ download submission (from ENA)

Bioconda’s sra-tools 3.x crashes on Pronghorn (built against GLIBC 2.27+, newer than the system libc). We pull the same FASTQ from ENA, which mirrors every SRA run as ready-to-use .fastq.gz over HTTPS — no SRA toolkit needed.

cd ~/scratch/rnaseq/ATH
nano fastq-dump.sh
#!/bin/bash
#SBATCH --job-name=fastqdump_ATH
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o fastq-dump.out # STDOUT & STDERR
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0

set -euo pipefail
mkdir -p ./raw_data

for SRR in SRR1761506 SRR1761507 SRR1761508 SRR1761509 SRR1761510 SRR1761511; do
  URLS=$(curl -fsSL --retry 3 --max-time 60 \
          "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=${SRR}&result=read_run&fields=fastq_ftp&format=tsv" \
          | tail -n +2 | awk -F'\t' '{print $NF}' | tr ';' '\n' | sed '/^$/d')
  [ -n "${URLS}" ] || { echo "ERROR: ENA returned no fastq URLs for ${SRR}"; exit 1; }
  for U in ${URLS}; do
    OUT=./raw_data/$(basename "${U}")
    [ -s "${OUT}" ] && { echo "[fastq] ${OUT} already present, skipping"; continue; }
    echo "[fastq] ${SRR} -> https://${U}"
    curl -fsSL --retry 3 --retry-delay 30 --max-time 3600 -o "${OUT}" "https://${U}"
  done
done
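
The curl | tail | awk | tr | sed chain inside the loop turns ENA's two-line TSV report into one download URL per line. Here is the same pipeline run on a canned response (the paths are representative of ENA's layout, not fetched live):

```shell
# Same URL-extraction pipeline as fastq-dump.sh, on a canned ENA TSV reply.
printf 'fastq_ftp\nftp.sra.ebi.ac.uk/vol1/fastq/SRR176/006/SRR1761506/SRR1761506_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR176/006/SRR1761506/SRR1761506_2.fastq.gz\n' \
  | tail -n +2 \
  | awk -F'\t' '{print $NF}' \
  | tr ';' '\n' \
  | sed '/^$/d'
# → two lines: one URL for the _1 reads, one for the _2 reads
```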

Read Trimming with fastp

cd  ~/scratch/rnaseq/ATH
nano trim.sh
#!/bin/bash
#SBATCH --job-name=trim_ATH
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o trim.out # STDOUT & STDERR
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0

fastp --in1 raw_data/SRR1761506_1.fastq.gz --in2 raw_data/SRR1761506_2.fastq.gz --out1 trim/SRR1761506_1.trimmed.fq.gz --out2 trim/SRR1761506_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR1761506_fastp.html --json trim/SRR1761506_fastp.json
fastp --in1 raw_data/SRR1761507_1.fastq.gz --in2 raw_data/SRR1761507_2.fastq.gz --out1 trim/SRR1761507_1.trimmed.fq.gz --out2 trim/SRR1761507_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR1761507_fastp.html --json trim/SRR1761507_fastp.json
fastp --in1 raw_data/SRR1761508_1.fastq.gz --in2 raw_data/SRR1761508_2.fastq.gz --out1 trim/SRR1761508_1.trimmed.fq.gz --out2 trim/SRR1761508_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR1761508_fastp.html --json trim/SRR1761508_fastp.json
fastp --in1 raw_data/SRR1761509_1.fastq.gz --in2 raw_data/SRR1761509_2.fastq.gz --out1 trim/SRR1761509_1.trimmed.fq.gz --out2 trim/SRR1761509_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR1761509_fastp.html --json trim/SRR1761509_fastp.json
fastp --in1 raw_data/SRR1761510_1.fastq.gz --in2 raw_data/SRR1761510_2.fastq.gz --out1 trim/SRR1761510_1.trimmed.fq.gz --out2 trim/SRR1761510_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR1761510_fastp.html --json trim/SRR1761510_fastp.json
fastp --in1 raw_data/SRR1761511_1.fastq.gz --in2 raw_data/SRR1761511_2.fastq.gz --out1 trim/SRR1761511_1.trimmed.fq.gz --out2 trim/SRR1761511_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR1761511_fastp.html --json trim/SRR1761511_fastp.json

Reference downloads

We download the Arabidopsis TAIR10 genome and annotation directly from TAIR (www.arabidopsis.org). No JGI/Phytozome account required, no zip-bundle to unpack.

cd ~/scratch/rnaseq/ATH
mkdir -p bam reference logs
cd reference
pwd

Download Arabidopsis thaliana TAIR10 from TAIR

TAIR’s API serves the canonical TAIR10 chromosome FASTA and GFF3. The host uses a self-signed certificate, so we pass -k to curl (same as wget --no-check-certificate).

cd ~/scratch/rnaseq/ATH/reference

# Genome FASTA (gzipped, ~35 MB)
curl -kfsSL --retry 3 --max-time 600 \
    -o TAIR10_chr_all.fas.gz \
    "https://www.arabidopsis.org/api/download-files/download?filePath=Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas.gz"
gunzip -f TAIR10_chr_all.fas.gz

# Gene annotation (GFF3)
curl -kfsSL --retry 3 --max-time 600 \
    -o TAIR10_GFF3_genes.gff \
    "https://www.arabidopsis.org/api/download-files/download?filePath=Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff"

ls -lh TAIR10_chr_all.fas TAIR10_GFF3_genes.gff
seqkit stats TAIR10_chr_all.fas

Convert GFF to GTF

STAR’s --sjdbGTFfile expects GTF, so we convert the GFF3 with gffread:

cd ~/scratch/rnaseq/ATH/reference
gffread TAIR10_GFF3_genes.gff -T -F --keep-exon-attrs -o TAIR10_GFF3_genes.gtf
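
featureCounts later groups exons by the gene_id attribute, so it is worth spot-checking that the converted GTF carries it. The attribute lives in column 9; this one-liner pulls it out (the sample record is illustrative of gffread output, not copied from the real file):

```shell
# Extract gene_id from GTF attribute column 9 (sample record made up for the demo).
printf 'Chr1\tTAIR10\texon\t3631\t3913\t.\t+\t.\ttranscript_id "AT1G01010.1"; gene_id "AT1G01010";\n' \
  | sed -n 's/.*gene_id "\([^"]*\)".*/\1/p'
# → AT1G01010
```

On the real file, something like awk -F'\t' '$3=="exon"' TAIR10_GFF3_genes.gtf | head piped into the same sed gives a quick sanity check.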

Create reference index

cd  ~/scratch/rnaseq/ATH/reference
ls -algh
nano index.sh
#!/bin/bash
#SBATCH --job-name=index_ATH
#SBATCH --cpus-per-task=12
#SBATCH --time=2-15:00:00
#SBATCH --mem=48g
#SBATCH --mail-type=all
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o index.out # STDOUT & STDERR
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0

STAR --runThreadN 12 --runMode genomeGenerate --genomeDir . --genomeFastaFiles TAIR10_chr_all.fas --sjdbGTFfile TAIR10_GFF3_genes.gtf --sjdbOverhang 99 --genomeSAindexNbases 12
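
The value --genomeSAindexNbases 12 is not arbitrary: for small genomes the STAR manual recommends min(14, log2(GenomeLength)/2 - 1). TAIR10's nuclear genome is roughly 119.7 Mb, which works out to 12:

```shell
# STAR's small-genome rule: min(14, log2(GenomeLength)/2 - 1).
awk 'BEGIN {
  g = 119667750                      # approx. TAIR10 genome length (bp)
  v = int(log(g) / log(2) / 2 - 1)   # awk only has natural log
  if (v > 14) v = 14
  print v
}'
# → 12
```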

Mapping the reads to genome index

cd  ~/scratch/rnaseq/ATH/
ls -algh
nano align.sh
#!/bin/bash
#SBATCH --job-name=align_ATH
#SBATCH --cpus-per-task=8
#SBATCH --time=2-15:00:00
#SBATCH --mem=32g
#SBATCH --mail-type=all
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o align.out # STDOUT & STDERR
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
# NOTE: do NOT hard-code --dependency here. Pass it on the `sbatch` command line
# so the job ID is filled in automatically — see the "Submit the pipeline with
# dependency chaining" section below.

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir ~/scratch/rnaseq/ATH/reference/ --readFilesIn ~/scratch/rnaseq/ATH/trim/SRR1761506_1.trimmed.fq.gz ~/scratch/rnaseq/ATH/trim/SRR1761506_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/ATH/bam/SRR1761506.bam

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir ~/scratch/rnaseq/ATH/reference/ --readFilesIn ~/scratch/rnaseq/ATH/trim/SRR1761507_1.trimmed.fq.gz ~/scratch/rnaseq/ATH/trim/SRR1761507_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/ATH/bam/SRR1761507.bam

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir ~/scratch/rnaseq/ATH/reference/ --readFilesIn ~/scratch/rnaseq/ATH/trim/SRR1761508_1.trimmed.fq.gz ~/scratch/rnaseq/ATH/trim/SRR1761508_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/ATH/bam/SRR1761508.bam

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir ~/scratch/rnaseq/ATH/reference/ --readFilesIn ~/scratch/rnaseq/ATH/trim/SRR1761509_1.trimmed.fq.gz ~/scratch/rnaseq/ATH/trim/SRR1761509_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/ATH/bam/SRR1761509.bam

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir ~/scratch/rnaseq/ATH/reference/ --readFilesIn ~/scratch/rnaseq/ATH/trim/SRR1761510_1.trimmed.fq.gz ~/scratch/rnaseq/ATH/trim/SRR1761510_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/ATH/bam/SRR1761510.bam

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir ~/scratch/rnaseq/ATH/reference/ --readFilesIn ~/scratch/rnaseq/ATH/trim/SRR1761511_1.trimmed.fq.gz ~/scratch/rnaseq/ATH/trim/SRR1761511_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/ATH/bam/SRR1761511.bam

Counting reads with featureCounts (featureCounts.sh)

Once every BAM is sorted by coordinate, count read pairs against the TAIR10 GTF. With subread ≥ 2.0.2, paired-end fragment counting requires both -p (paired-end) and --countReadPairs (count pairs as 1 rather than 2). The output ATH.featureCount.cnt is what the MultiQC and DESeq2/EdgeR steps below consume.

cd ~/scratch/rnaseq/ATH
nano featureCounts.sh
#!/bin/bash
#SBATCH --job-name=featurecounts_ATH
#SBATCH --cpus-per-task=8
#SBATCH --time=06:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/featurecounts_%j.out
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0

set -euo pipefail
PROJECT=~/scratch/rnaseq/ATH
cd "$PROJECT/bam"

# Hard-coded SRR list — matches the cohort used in fastq-dump.sh / align.sh above.
BAMS="SRR1761506.bamAligned.sortedByCoord.out.bam \
      SRR1761507.bamAligned.sortedByCoord.out.bam \
      SRR1761508.bamAligned.sortedByCoord.out.bam \
      SRR1761509.bamAligned.sortedByCoord.out.bam \
      SRR1761510.bamAligned.sortedByCoord.out.bam \
      SRR1761511.bamAligned.sortedByCoord.out.bam"

featureCounts \
    -T 8 \
    -p --countReadPairs \
    -a "$PROJECT/reference/TAIR10_GFF3_genes.gtf" \
    -o ATH.featureCount.cnt \
    ${BAMS}

Submit & inspect (standalone):

sbatch featureCounts.sh
# when done:
cat ~/scratch/rnaseq/ATH/bam/ATH.featureCount.cnt.summary

(run_all.sh below wires this in automatically — you don’t need to submit it by hand.)
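
For the downstream DE step it helps to know the .cnt layout: line 1 is a comment recording the featureCounts command, line 2 is the header, and columns 2-6 (Chr/Start/End/Strand/Length) are annotation you usually drop, keeping Geneid plus one count column per BAM. A sketch of that trim, demoed on a made-up miniature file:

```shell
# Build a miniature stand-in for ATH.featureCount.cnt (values invented),
# then keep only Geneid (col 1) + counts (cols 7+), skipping the comment line.
printf '# Program:featureCounts\n'                             >  mini.cnt
printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tSRR1761506\n' >> mini.cnt
printf 'AT1G01010\tChr1\t3631\t5899\t+\t1688\t153\n'           >> mini.cnt
tail -n +2 mini.cnt | cut -f1,7-
rm mini.cnt
```

On the real output the same idea applies: tail -n +2 ~/scratch/rnaseq/ATH/bam/ATH.featureCount.cnt | cut -f1,7- > counts.tsv (count columns appear in the BAM order given to featureCounts).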

Submit the entire pipeline with one script — run_all.sh

Rather than running each step by hand (submit → wait → submit → wait…), put every step into a driver script that submits them all at once. Slurm queues each job in the right order using --dependency; the whole pipeline runs unattended.

Pipeline DAG:

  fastq-dump ──┐
               ├─→ trim ─→ align ─→ featureCounts ─→ multiqc
  index   ─────┘

(Download + index run in parallel; trim waits on download; align waits on both trim and index; featureCounts waits on align; multiqc waits on featureCounts.)

Save as run_all.sh:

#!/bin/bash
# run_all.sh — submit the entire Arabidopsis RNA-Seq pipeline with one command.
# Slurm enforces the correct order via --dependency; you can walk away.
set -euo pipefail

PROJECT=~/scratch/rnaseq/ATH
cd "$PROJECT"

# Activate the env in THIS shell so every sbatch below inherits the PATH
# (sbatch --export=ALL is the default — the submitted jobs see the same tools)
export MAMBA_ROOT_PREFIX="${MAMBA_ROOT_PREFIX:-$HOME/micromamba}"
eval "$(micromamba shell hook --shell=bash)"
micromamba activate RNASEQ_bch709

# 1. Download FASTQs (no prerequisites)
DUMP_JID=$(sbatch --parsable fastq-dump.sh)

# 2. Build STAR index (independent of download — runs in parallel)
IDX_JID=$(cd "$PROJECT/reference" && sbatch --parsable index.sh)

# 3. Trim reads (waits for download)
TRIM_JID=$(sbatch --parsable --dependency=afterok:${DUMP_JID} trim.sh)

# 4. Align to genome (waits for BOTH trim and index)
ALIGN_JID=$(sbatch --parsable --dependency=afterok:${TRIM_JID}:${IDX_JID} align.sh)

# 5. featureCounts (waits for align)
FC_JID=$(sbatch --parsable --dependency=afterok:${ALIGN_JID} featureCounts.sh)

# 6. MultiQC aggregation (waits for featureCounts; afterany lets it run even if FC partially failed)
MQC_JID=$(sbatch --parsable --dependency=afterany:${FC_JID} multiqc.sh)

cat <<EOF
Submitted RNA-Seq pipeline (Arabidopsis):
  fastq-dump      ${DUMP_JID}
  index           ${IDX_JID}
  trim            ${TRIM_JID}
  align           ${ALIGN_JID}
  featureCounts   ${FC_JID}
  multiqc         ${MQC_JID}

Monitor with:  squeue -u \$USER
Cancel all:    scancel ${DUMP_JID} ${IDX_JID} ${TRIM_JID} ${ALIGN_JID} ${FC_JID} ${MQC_JID}
Final report (after pipeline finishes): ~/scratch/rnaseq/ATH/qc/ATH_report.html
EOF

Run it:

chmod +x run_all.sh
bash run_all.sh
squeue -u $USER   # later jobs show state PD with reason (Dependency)

You’ll see 6 job IDs printed immediately. Close your laptop — Slurm takes over. When everything finishes, check ls bam/ for sorted BAM outputs and the STAR logs for finished successfully.

🧑‍💻 Hands-on walkthrough — submit the pipeline step-by-step

If you want to see exactly what run_all.sh does (or debug one step), submit each stage manually. Every sbatch returns a job ID that the next step depends on.

Do this first (login shell — one time):

micromamba activate RNASEQ_bch709
cd ~/scratch/rnaseq/ATH

Then submit each step — each line is one command:

# --- Step 1: download FASTQs (no prerequisites) ---
DUMP_JID=$(sbatch --parsable fastq-dump.sh)
echo "fastq-dump → $DUMP_JID"

# --- Step 2: build STAR index (no prerequisites, runs in parallel with Step 1) ---
IDX_JID=$(cd reference && sbatch --parsable index.sh)
echo "index      → $IDX_JID"

# --- Step 3: trim (waits for fastq-dump) ---
TRIM_JID=$(sbatch --parsable --dependency=afterok:${DUMP_JID} trim.sh)
echo "trim       → $TRIM_JID"

# --- Step 4: align (waits for BOTH trim and index) ---
ALIGN_JID=$(sbatch --parsable --dependency=afterok:${TRIM_JID}:${IDX_JID} align.sh)
echo "align      → $ALIGN_JID"

# --- Step 5: featureCounts (waits for align) ---
FC_JID=$(sbatch --parsable --dependency=afterok:${ALIGN_JID} featureCounts.sh)
echo "featureCounts → $FC_JID"

# --- Step 6: MultiQC (waits for featureCounts) ---
MQC_JID=$(sbatch --parsable --dependency=afterany:${FC_JID} multiqc.sh)
echo "multiqc    → $MQC_JID"

# Check that everything is queued
squeue -u $USER
# Steps 3-6 should show state PD with reason (Dependency)

Why copy the commands into your terminal, not a script?

The hands-on walkthrough is literally what run_all.sh does — but by typing each line you see each job ID appear and can inspect things in between. Once you’re comfortable, just run bash run_all.sh next time.

Same pattern for every other organism

For Drosophila, Mouse, Tomato, Mosquito, etc., copy run_all.sh into that organism’s project directory, update PROJECT=~/scratch/rnaseq/<ORG>, and run it. The script structure doesn’t change — only the path.

Don’t hard-code dependencies inside the #SBATCH block

Some older examples had #SBATCH --dependency=afterok:<PREVIOUS_JOBID(trim_ATH)> inside align.sh. That’s fragile — you’d have to edit the file and paste the previous job’s ID every single time. Instead, pass --dependency on the sbatch command line (as shown in run_all.sh above). If you see a #SBATCH --dependency=... line inside any script, delete it.

For a full explanation of --dependency, afterok vs afterany, and the --parsable flag, see the Job dependencies section in the HPC Cluster lesson.


Drosophila

Publication (Drosophila)

Not formally published — data deposited at NCBI as PRJNA770108 (Gene expression profiling of D. melanogaster larval brains after chronic alcohol exposure).

✅ Before you start the Drosophila walkthrough — pre-class checklist

  • You finished the Arabidopsis section above (or at least understand sbatch, --dependency, and run_all.sh)
  • micromamba activate RNASEQ_bch709 works in your login shell (STAR, fastp, subread/featureCounts, multiqc all on PATH)
  • ~/scratch/rnaseq/ already exists from the Arabidopsis run; we’ll add Drosophila/ next to ATH/
  • You have ~8 GB free under ~/scratch (6 PE samples × ~30 M reads + STAR index + BAMs)
  • You replaced <YOUR_EMAIL> in your previous SBATCH headers — keep doing that here

Any box unchecked? → re-read the Arabidopsis run_all.sh walkthrough before continuing.

🪰 Drosophila-specific notes (vs. Arabidopsis)

  • Longer introns → --alignIntronMax 100000 (Arabidopsis used 10,000). The longest D. melanogaster introns reach ~70 kb.
  • Smaller genome (~143 Mb) → --genomeSAindexNbases 12 (same value used for ATH; smaller-than-default 14 keeps the SA index in-RAM)
  • Cohort: 6 paired-end samples (~25–35 M read pairs each), 3 ethanol-treated vs 3 controls — perfect for a 2-group DESeq2 contrast
  • Reference: FlyBase r6.42 (FB2021_05). Genome FASTA + GTF come straight from ftp.flybase.net; Ensembl BDGP6.32 (release 104) serves as a fallback mirror, fetched with the same curl-with-retries pattern as fastq-dump.sh.
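The --genomeSAindexNbases 12 value is not arbitrary: the STAR manual recommends min(14, log2(GenomeLength)/2 - 1) for small genomes. A quick awk sketch of that formula, using the dmel assembly length (~143.7 Mb, the figure seqkit stats reports later in this page):

```shell
# STAR manual guidance for small genomes:
#   genomeSAindexNbases = min(14, log2(GenomeLength)/2 - 1)
GENOME_LEN=143726002
SA_N=$(awk -v n="${GENOME_LEN}" 'BEGIN{
  v = log(n)/log(2)/2 - 1        # log2(n)/2 - 1  ~= 12.5 for ~144 Mb
  if (v > 14) v = 14
  printf "%d", v                 # truncates to 12
}')
echo "recommended genomeSAindexNbases: ${SA_N}"
```

For the Arabidopsis genome (~135 Mb) the same formula also lands on 12, which is why both walkthroughs use that value.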

SRA Bioproject site

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA770108

Gene expression profiling of Drosophila melanogaster larval brains after chronic alcohol exposure (fruit fly)

We sequenced mRNA extracted from brains of (1) D. melanogaster larvae exposed to food containing 5% ethanol (v/v) for 6 consecutive days, and (2) age-matched untreated control larvae that grew in regular food. Differential gene expression between the two groups was calculated and reported. Each group consisted of 3 biological replicates of 30 brains each. Overall design: examination of mRNA levels in brains of D. melanogaster larvae after chronic ethanol exposure was performed using next generation sequencing (RNA-seq).

Subset of data

Sample information Run
Control SRR16287545
Control SRR16287546
Control SRR16287547
Ethanol treatment SRR16287548
Ethanol treatment SRR16287549
Ethanol treatment SRR16287550

Project layout

mkdir -p ~/scratch/rnaseq
cd ~/scratch/rnaseq
mkdir -p Drosophila && cd Drosophila
mkdir -p raw_data trim bam reference logs qc
pwd
# example output (your <netid> will differ)
/data/gpfs/assoc/bch709-6/<netid>/scratch/rnaseq/Drosophila

samples.txt — one place to list the cohort

Every script below reads samples.txt so you only edit the cohort once. Tab-delimited: sample_name<TAB>SRR<TAB>condition. The header row exists so the awk loops can simply use NR>1 to skip it.

cd ~/scratch/rnaseq/Drosophila
nano samples.txt

Paste exactly (real tab characters between columns — nano writes them literally):

sample	srr	condition
ctrl_rep1	SRR16287545	Control
ctrl_rep2	SRR16287546	Control
ctrl_rep3	SRR16287547	Control
etoh_rep1	SRR16287548	Ethanol
etoh_rep2	SRR16287549	Ethanol
etoh_rep3	SRR16287550	Ethanol

Verify:

cat samples.txt
awk -F'\t' 'NR>1{print $2}' samples.txt   # SRR list the loops will iterate over
sample	srr	condition
ctrl_rep1	SRR16287545	Control
ctrl_rep2	SRR16287546	Control
ctrl_rep3	SRR16287547	Control
etoh_rep1	SRR16287548	Ethanol
etoh_rep2	SRR16287549	Ethanol
etoh_rep3	SRR16287550	Ethanol

SRR16287545
SRR16287546
SRR16287547
SRR16287548
SRR16287549
SRR16287550
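Before wiring samples.txt into the loops, check that every row really has three tab-separated fields; pasting spaces instead of tabs is the most common failure mode. A small sketch, run here against a throwaway copy under /tmp so it is self-contained (point the awk line at your real samples.txt in the project dir):

```shell
# Validate the cohort table: header + every data row must have exactly
# 3 tab-separated fields. Demo copy under /tmp; use your real samples.txt.
printf 'sample\tsrr\tcondition\nctrl_rep1\tSRR16287545\tControl\n' > /tmp/samples_check.txt

if awk -F'\t' 'NF!=3{print "bad line " NR ": " $0; bad=1} END{exit bad}' /tmp/samples_check.txt; then
  echo "samples.txt format OK"
else
  echo "fix the tabs before submitting" >&2
fi
```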

fastq download (from ENA)

Same ENA-over-HTTPS approach as Arabidopsis (no broken sra-tools GLIBC dependency). The loop reads SRR IDs from samples.txt instead of being hard-coded.

cd ~/scratch/rnaseq/Drosophila
nano fastq-dump.sh
#!/bin/bash
#SBATCH --job-name=fastqdump_Drosophila
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/fastq-dump_%j.out  # STDOUT & STDERR
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0

set -euo pipefail
PROJECT=~/scratch/rnaseq/Drosophila
cd "$PROJECT"
mkdir -p raw_data logs

# Drive the loop from samples.txt (column 2 = SRR; NR>1 skips the header)
SRRS=$(awk -F'\t' 'NR>1{print $2}' samples.txt)

for SRR in ${SRRS}; do
  URLS=$(curl -fsSL --retry 3 --max-time 60 \
          "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=${SRR}&result=read_run&fields=fastq_ftp&format=tsv" \
          | tail -n +2 | awk -F'\t' '{print $NF}' | tr ';' '\n' | sed '/^$/d')
  [ -n "${URLS}" ] || { echo "ERROR: ENA returned no fastq URLs for ${SRR}"; exit 1; }
  for U in ${URLS}; do
    OUT=raw_data/$(basename "${U}")
    [ -s "${OUT}" ] && { echo "[fastq] ${OUT} already present, skipping"; continue; }
    echo "[fastq] ${SRR} -> https://${U}"
    curl -fsSL --retry 3 --retry-delay 30 --max-time 3600 -o "${OUT}" "https://${U}"
  done
done

ls -lh raw_data/*.fastq.gz

Submit & inspect:

sbatch fastq-dump.sh
# wait for completion, then:
tail -n 20 logs/fastq-dump_<jobid>.out
ls -lh raw_data/
# example tail (your numbers will differ)
[fastq] SRR16287545 -> https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/045/SRR16287545/SRR16287545_1.fastq.gz
[fastq] SRR16287545 -> https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/045/SRR16287545/SRR16287545_2.fastq.gz
[fastq] SRR16287546 -> https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/046/SRR16287546/SRR16287546_1.fastq.gz
...
[fastq] SRR16287550 -> https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/050/SRR16287550/SRR16287550_2.fastq.gz
-rw-r--r-- 1 <netid> users 1.6G ... raw_data/SRR16287545_1.fastq.gz
-rw-r--r-- 1 <netid> users 1.7G ... raw_data/SRR16287545_2.fastq.gz
-rw-r--r-- 1 <netid> users 1.5G ... raw_data/SRR16287550_2.fastq.gz
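The URL-extraction pipeline inside fastq-dump.sh (tail | awk | tr | sed) can be replayed offline. The TSV below is a canned, fictitious example shaped like ENA's filereport response (header row, then run accession plus ';'-separated FTP paths in the last column); no network is involved:

```shell
# Offline replay of the fastq-dump.sh URL extraction. The TSV is a
# fabricated example in the shape of ENA's filereport output.
TSV=$'run_accession\tfastq_ftp\nSRRX00001\tftp.sra.ebi.ac.uk/x/SRRX00001_1.fastq.gz;ftp.sra.ebi.ac.uk/x/SRRX00001_2.fastq.gz'

URLS=$(printf '%s\n' "${TSV}" | tail -n +2 | awk -F'\t' '{print $NF}' | tr ';' '\n' | sed '/^$/d')
printf '%s\n' "${URLS}"
# each resulting line becomes  curl ... "https://${U}"  in the real loop
```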

Read trimming with fastp — loop driven by samples.txt

Instead of 6 hard-coded fastp lines, loop over the SRR column of samples.txt. The behaviour is identical (same Q20, same length filter, same paired-end adapter detection), but adding/removing a sample is a one-line edit to samples.txt.

cd ~/scratch/rnaseq/Drosophila
nano trim.sh
#!/bin/bash
#SBATCH --job-name=trim_Drosophila
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/trim_%j.out  # STDOUT & STDERR
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0

set -euo pipefail
PROJECT=~/scratch/rnaseq/Drosophila
cd "$PROJECT"
mkdir -p trim logs

# Loop over every SRR listed in samples.txt (skip header)
while IFS=$'\t' read -r SAMPLE SRR COND; do
  echo "[trim] ${SAMPLE} (${SRR}, ${COND})"
  fastp \
      --in1  raw_data/${SRR}_1.fastq.gz \
      --in2  raw_data/${SRR}_2.fastq.gz \
      --out1 trim/${SRR}_1.trimmed.fq.gz \
      --out2 trim/${SRR}_2.trimmed.fq.gz \
      --detect_adapter_for_pe \
      --qualified_quality_phred 20 \
      --length_required 50 \
      --thread 2 \
      --html trim/${SRR}_fastp.html \
      --json trim/${SRR}_fastp.json
done < <(awk -F'\t' 'NR>1' samples.txt)
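The done < <(awk …) at the bottom is deliberate: piping awk into while would run the loop body in a subshell, so anything set inside it (counters, flags) vanishes afterwards. A throwaway demonstration with a fictitious two-sample table:

```shell
# Why trim.sh feeds the loop via process substitution, not a pipe.
printf 'sample\tsrr\tcondition\nctrl_rep1\tSRRX00001\tControl\netoh_rep1\tSRRX00002\tEthanol\n' > /tmp/demo_samples.txt

# Pipe version: the while body runs in a subshell, so COUNT resets afterwards.
COUNT=0
awk -F'\t' 'NR>1' /tmp/demo_samples.txt | while IFS=$'\t' read -r S R C; do
  COUNT=$((COUNT+1))
done
echo "pipe count: ${COUNT}"        # still 0 (updates made in the subshell are lost)

# Process-substitution version (what trim.sh uses): same shell, COUNT survives.
COUNT=0
while IFS=$'\t' read -r S R C; do
  COUNT=$((COUNT+1))
done < <(awk -F'\t' 'NR>1' /tmp/demo_samples.txt)
echo "subst count: ${COUNT}"       # 2
```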

Expected per-sample fastp summary (printed to STDERR/STDOUT for each iteration):

# example fastp summary (your numbers will differ)
Read1 before filtering:
total reads: 28,432,117
total bases: 4,264,817,550
Q20 bases: 4,164,128,001 (97.64%)
Q30 bases: 3,981,724,902 (93.36%)

Read1 after filtering:
total reads: 27,946,201
Q20 rate: 98.61%
Q30 rate: 95.04%

Filtering result:
reads passed filter: 55,612,408
reads failed due to low quality: 488,802
reads failed due to too short: 122,306
reads with adapter trimmed: 1,884,109

Duplication rate: 6.83%
JSON report: trim/SRR16287545_fastp.json
HTML report: trim/SRR16287545_fastp.html
Click to see the equivalent expanded form (one fastp call per sample) — useful for understanding what the loop unrolls to

The loop above produces exactly the same six commands as this expanded version. It is shown here because seeing the unrolled form helps connect "what the loop runs" with "what `fastp` actually executes" the first time you read the script.

```bash
fastp --in1 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287545_1.fastq.gz --in2 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287545_2.fastq.gz --out1 trim/SRR16287545_1.trimmed.fq.gz --out2 trim/SRR16287545_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR16287545_fastp.html --json trim/SRR16287545_fastp.json

fastp --in1 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287546_1.fastq.gz --in2 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287546_2.fastq.gz --out1 trim/SRR16287546_1.trimmed.fq.gz --out2 trim/SRR16287546_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR16287546_fastp.html --json trim/SRR16287546_fastp.json

fastp --in1 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287547_1.fastq.gz --in2 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287547_2.fastq.gz --out1 trim/SRR16287547_1.trimmed.fq.gz --out2 trim/SRR16287547_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR16287547_fastp.html --json trim/SRR16287547_fastp.json

fastp --in1 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287548_1.fastq.gz --in2 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287548_2.fastq.gz --out1 trim/SRR16287548_1.trimmed.fq.gz --out2 trim/SRR16287548_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR16287548_fastp.html --json trim/SRR16287548_fastp.json

fastp --in1 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287549_1.fastq.gz --in2 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287549_2.fastq.gz --out1 trim/SRR16287549_1.trimmed.fq.gz --out2 trim/SRR16287549_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR16287549_fastp.html --json trim/SRR16287549_fastp.json

fastp --in1 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287550_1.fastq.gz --in2 ~/scratch/rnaseq/Drosophila/raw_data/SRR16287550_2.fastq.gz --out1 trim/SRR16287550_1.trimmed.fq.gz --out2 trim/SRR16287550_2.trimmed.fq.gz --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --thread 2 --html trim/SRR16287550_fastp.html --json trim/SRR16287550_fastp.json
```

Reference download

cd ~/scratch/rnaseq/Drosophila/reference

# FlyBase r6.42 (FB2021_05) — pinned for reproducibility. The dmel_r6.42
# directory is still hosted by FlyBase but only via HTTPS in newer releases;
# the legacy http:// URL sometimes 301-redirects in a way that wget mishandles.
# Use HTTPS + curl with retries; add an Ensembl mirror as fallback.
FLY_FA="https://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.42_FB2021_05/fasta/dmel-all-chromosome-r6.42.fasta.gz"
FLY_GTF="https://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.42_FB2021_05/gtf/dmel-all-r6.42.gtf.gz"
ENS_FA="https://ftp.ensembl.org/pub/release-104/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.32.dna.toplevel.fa.gz"
ENS_GTF="https://ftp.ensembl.org/pub/release-104/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.32.104.gtf.gz"

curl -fsSL --retry 3 --max-time 1800 -o dmel.fasta.gz  "${FLY_FA}"  || curl -fsSL --retry 3 --max-time 1800 -o dmel.fasta.gz  "${ENS_FA}"
curl -fsSL --retry 3 --max-time 600  -o dmel.gtf.gz    "${FLY_GTF}" || curl -fsSL --retry 3 --max-time 600  -o dmel.gtf.gz    "${ENS_GTF}"
gunzip -f dmel.fasta.gz dmel.gtf.gz
ls -lh dmel.fasta dmel.gtf
seqkit stats dmel.fasta
# example output (your numbers will differ slightly between releases)
-rw-r--r-- 1 <netid> users 145M ... dmel.fasta
-rw-r--r-- 1 <netid> users  41M ... dmel.gtf

file        format  type  num_seqs      sum_len  min_len     avg_len     max_len
dmel.fasta  FASTA   DNA      1,870  143,726,002       54   76,858.8  32,079,331

Reference index (STAR genomeGenerate)

cd ~/scratch/rnaseq/Drosophila/reference
nano index.sh
#!/bin/bash
#SBATCH --job-name=index_Drosophila
#SBATCH --cpus-per-task=12
#SBATCH --time=2-15:00:00
#SBATCH --mem=48g
#SBATCH --mail-type=all
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o ../logs/index_%j.out  # STDOUT & STDERR
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0

set -euo pipefail

STAR --runThreadN 12 \
     --runMode genomeGenerate \
     --genomeDir . \
     --genomeFastaFiles dmel.fasta \
     --sjdbGTFfile dmel.gtf \
     --sjdbOverhang 99 \
     --genomeSAindexNbases 12

Submit & monitor:

cd ~/scratch/rnaseq/Drosophila/reference
sbatch index.sh
# when done:
tail -n 20 Log.out
ls -lh SA SAindex Genome
# example STAR Log.out tail (your timestamps will differ)
Apr 28 <date> ..... started STAR run
Apr 28 <date> ... starting to generate Genome files
Apr 28 <date> ... starting to sort Suffix Array. This may take a long time...
Apr 28 <date> ... loading chunks from disk, packing SA...
Apr 28 <date> ... finished generating suffix array
Apr 28 <date> ... finished generating Suffix Array index
Apr 28 <date> ..... processing annotations GTF
Apr 28 <date> ..... inserting junctions into the genome indices
Apr 28 <date> ... writing Genome to disk ...
Apr 28 <date> ... writing Suffix Array to disk ...
Apr 28 <date> ... writing SAindex to disk
Apr 28 <date> ..... finished successfully
DONE: Genome generation, EXITING

Mapping the reads to genome index — loop driven by samples.txt

Same loop pattern as trim.sh: read SRR from samples.txt, call STAR once per sample. The Drosophila-specific flag is --alignIntronMax 100000 (introns up to 100 kb).

cd ~/scratch/rnaseq/Drosophila
nano mapping.sh
#!/bin/bash
#SBATCH --job-name=align_Drosophila
#SBATCH --cpus-per-task=8
#SBATCH --time=2-15:00:00
#SBATCH --mem=32g
#SBATCH --mail-type=all
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/align_%j.out  # STDOUT & STDERR
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
# NOTE: do NOT hard-code --dependency here. Pass it on the `sbatch` command line,
# e.g.  ALIGN=$(sbatch --parsable --dependency=afterok:${TRIM_JID}:${IDX_JID} mapping.sh)

set -euo pipefail
PROJECT=~/scratch/rnaseq/Drosophila
cd "$PROJECT"
mkdir -p bam logs

while IFS=$'\t' read -r SAMPLE SRR COND; do
  echo "[align] ${SAMPLE} (${SRR}, ${COND})"
  STAR --runMode alignReads \
       --runThreadN 8 \
       --readFilesCommand zcat \
       --outFilterMultimapNmax 10 \
       --alignIntronMin 25 \
       --alignIntronMax 100000 \
       --genomeDir   "$PROJECT/reference/" \
       --readFilesIn "$PROJECT/trim/${SRR}_1.trimmed.fq.gz" \
                     "$PROJECT/trim/${SRR}_2.trimmed.fq.gz" \
       --outSAMtype BAM SortedByCoordinate \
       --outFileNamePrefix "$PROJECT/bam/${SRR}.bam"
done < <(awk -F'\t' 'NR>1' samples.txt)

Expected Log.final.out excerpt (per sample, written to bam/<SRR>.bamLog.final.out):

# example STAR Log.final.out (your numbers will differ)
Number of input reads |	27,946,201
Average input read length |	300
                          UNIQUE READS:
Uniquely mapped reads number |	24,011,203
Uniquely mapped reads % |	85.92%
                          MULTI-MAPPING READS:
Number of reads mapped to multiple loci |	2,210,884
% of reads mapped to multiple loci |	7.91%
...
Number of splices: Total |	14,902,331
Number of splices: GT/AG |	14,752,019
% of reads unmapped: too short |	5.62%
% of reads unmapped: other |	0.45%
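Rather than opening each bam/<SRR>.bamLog.final.out by hand, you can grab the headline number with awk. The log written below is a tiny fake with invented numbers, just to make the sketch self-contained; on the cluster, point the awk line at the real logs:

```shell
# Pull "Uniquely mapped reads %" out of a STAR Log.final.out.
# Demo file with fabricated numbers; use bam/*Log.final.out for real.
{
  printf 'Number of input reads |\t27946201\n'
  printf 'Uniquely mapped reads %% |\t85.92%%\n'
} > /tmp/demo.Log.final.out

RATE=$(awk -F'|' '/Uniquely mapped reads %/ {gsub(/[ \t%]/, "", $2); print $2}' /tmp/demo.Log.final.out)
echo "unique mapping rate: ${RATE}%"
```

Looping this over bam/*Log.final.out gives a one-screen mapping-rate table before MultiQC even runs.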
Click to see the equivalent expanded form (one STAR call per sample)

```bash
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir ~/scratch/rnaseq/Drosophila/reference/ --readFilesIn ~/scratch/rnaseq/Drosophila/trim/SRR16287545_1.trimmed.fq.gz ~/scratch/rnaseq/Drosophila/trim/SRR16287545_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/Drosophila/bam/SRR16287545.bam

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir ~/scratch/rnaseq/Drosophila/reference/ --readFilesIn ~/scratch/rnaseq/Drosophila/trim/SRR16287546_1.trimmed.fq.gz ~/scratch/rnaseq/Drosophila/trim/SRR16287546_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/Drosophila/bam/SRR16287546.bam

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir ~/scratch/rnaseq/Drosophila/reference/ --readFilesIn ~/scratch/rnaseq/Drosophila/trim/SRR16287547_1.trimmed.fq.gz ~/scratch/rnaseq/Drosophila/trim/SRR16287547_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/Drosophila/bam/SRR16287547.bam

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir ~/scratch/rnaseq/Drosophila/reference/ --readFilesIn ~/scratch/rnaseq/Drosophila/trim/SRR16287548_1.trimmed.fq.gz ~/scratch/rnaseq/Drosophila/trim/SRR16287548_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/Drosophila/bam/SRR16287548.bam

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir ~/scratch/rnaseq/Drosophila/reference/ --readFilesIn ~/scratch/rnaseq/Drosophila/trim/SRR16287549_1.trimmed.fq.gz ~/scratch/rnaseq/Drosophila/trim/SRR16287549_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/Drosophila/bam/SRR16287549.bam

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir ~/scratch/rnaseq/Drosophila/reference/ --readFilesIn ~/scratch/rnaseq/Drosophila/trim/SRR16287550_1.trimmed.fq.gz ~/scratch/rnaseq/Drosophila/trim/SRR16287550_2.trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ~/scratch/rnaseq/Drosophila/bam/SRR16287550.bam
```

Counting reads with featureCounts — featureCounts.sh

Once every BAM is sorted by coordinate, count read pairs against the FlyBase GTF. With subread ≥ 2.0.2, paired-end fragment counting requires both -p (paired-end) and --countReadPairs (count pairs as 1 rather than 2).

cd ~/scratch/rnaseq/Drosophila
nano featureCounts.sh
#!/bin/bash
#SBATCH --job-name=featurecounts_Drosophila
#SBATCH --cpus-per-task=8
#SBATCH --time=06:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/featurecounts_%j.out
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0

set -euo pipefail
PROJECT=~/scratch/rnaseq/Drosophila
cd "$PROJECT/bam"

# Build the BAM list from samples.txt so the file order matches the cohort definition
BAMS=$(awk -F'\t' 'NR>1{printf "%s.bamAligned.sortedByCoord.out.bam ", $2}' "$PROJECT/samples.txt")

featureCounts \
    -T 8 \
    -p --countReadPairs \
    -a "$PROJECT/reference/dmel.gtf" \
    -o Drosophila.featureCount.cnt \
    ${BAMS}

Submit & inspect:

sbatch featureCounts.sh
# when done (featureCounts.sh runs from inside bam/, so its outputs land there):
cat bam/Drosophila.featureCount.cnt.summary
head -3 bam/Drosophila.featureCount.cnt | cut -f1-8
# example summary (your numbers will differ)
Status                     SRR16287545.bam...  SRR16287546.bam...  SRR16287547.bam...  SRR16287548.bam...  SRR16287549.bam...  SRR16287550.bam...
Assigned                   19842310            21055812            20018736            21349204            20177102            19998841
Unassigned_NoFeatures       2381204             2412017             2354611             2466093             2390411             2331109
Unassigned_Ambiguity         901844              918310              894217              927519              908744              879618
Unassigned_MultiMapping     1882104             1922001             1880411             1933214             1900328             1855402
...
# Headers + first gene row (your formatting will differ)
Geneid	Chr	Start	End	Strand	Length	SRR16287545.bamAligned.sortedByCoord.out.bam	SRR16287546.bam...
FBgn0031208	2L	7529	9484	+	1955	412	441

% Assigned typically lands around 78–84 % for this dataset; the dominant unassigned class is NoFeatures (intergenic) followed by MultiMapping (rRNA loci, mostly).
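A one-liner turns the .summary into that headline "fraction assigned" figure. The summary written below is a fabricated single-sample example so the sketch runs anywhere; on the cluster, point the awk line at bam/Drosophila.featureCount.cnt.summary (real files have one column per BAM):

```shell
# Compute % Assigned from a featureCounts .summary.
# Demo summary with invented counts; use the real file on the cluster.
printf 'Status\tSRRX.bam\nAssigned\t19842310\nUnassigned_NoFeatures\t2381204\nUnassigned_Ambiguity\t901844\nUnassigned_MultiMapping\t1882104\n' > /tmp/demo.summary

PCT=$(awk -F'\t' 'NR>1 {total+=$2; if ($1=="Assigned") assigned=$2}
                  END  {printf "%.1f", 100*assigned/total}' /tmp/demo.summary)
echo "${PCT}% of fragments assigned"
```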

MultiQC summary — multiqc.sh

Same one-stop QC report idea as Arabidopsis: walk the project, parse fastp JSON / STAR Log.final.out / featureCounts .summary / FastQC, and render a single HTML.

cd ~/scratch/rnaseq/Drosophila
nano multiqc.sh
#!/bin/bash
#SBATCH --job-name=multiqc_Drosophila
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=2
#SBATCH --mem=8g
#SBATCH --time=01:00:00
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/multiqc_%j.out

set -euo pipefail
cd ~/scratch/rnaseq/Drosophila

mkdir -p qc

# Pull together everything multiqc can parse under the project dir
multiqc . -o qc/ -n Drosophila_report --force \
    --module fastp \
    --module star \
    --module featureCounts

Run it (after featureCounts has finished):

sbatch multiqc.sh
tail -n 15 logs/multiqc_<jobid>.out
# example multiqc log tail (your numbers will differ)
[INFO   ]         multiqc : This is MultiQC v1.21
[INFO   ]         multiqc : Search path : /data/gpfs/assoc/bch709-6/<netid>/scratch/rnaseq/Drosophila
[INFO   ]           fastp : Found 6 reports
[INFO   ]            star : Found 6 reports
[INFO   ]  featureCounts : Found 1 reports
...
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : qc/Drosophila_report.html
[INFO   ]         multiqc : Data        : qc/Drosophila_report_data
[INFO   ]         multiqc : MultiQC complete

Copy the report to your laptop and open it in a browser:

scp <netid>@pronghorn.rc.unr.edu:~/scratch/rnaseq/Drosophila/qc/Drosophila_report.html ./
open Drosophila_report.html

Submit the entire Drosophila pipeline with one script — run_all.sh

Same DAG as the Arabidopsis pipeline; the only changes are the project path and the alignment script's name (mapping.sh instead of align.sh).

Pipeline DAG:

  fastq-dump ──┐
               ├─→ trim ─→ align ─→ featureCounts ─→ multiqc
  index   ─────┘

Save as run_all.sh:

#!/bin/bash
# run_all.sh — submit the entire Drosophila RNA-Seq pipeline with one command.
# Slurm enforces the correct order via --dependency; you can walk away.
set -euo pipefail

PROJECT=~/scratch/rnaseq/Drosophila
cd "$PROJECT"
mkdir -p logs qc

# Activate the env in THIS shell so every sbatch below inherits the PATH
export MAMBA_ROOT_PREFIX="${MAMBA_ROOT_PREFIX:-$HOME/micromamba}"
eval "$(micromamba shell hook --shell=bash)"
micromamba activate RNASEQ_bch709

# 1. Download FASTQs (no prerequisites)
DUMP_JID=$(sbatch --parsable fastq-dump.sh)

# 2. Build STAR index (independent of download — runs in parallel)
IDX_JID=$(cd "$PROJECT/reference" && sbatch --parsable index.sh)

# 3. Trim reads (waits for download)
TRIM_JID=$(sbatch --parsable --dependency=afterok:${DUMP_JID} trim.sh)

# 4. Align to genome (waits for BOTH trim and index)
ALIGN_JID=$(sbatch --parsable --dependency=afterok:${TRIM_JID}:${IDX_JID} mapping.sh)

# 5. Count reads (waits for align)
FC_JID=$(sbatch --parsable --dependency=afterok:${ALIGN_JID} featureCounts.sh)

# 6. MultiQC aggregation (waits for featureCounts; afterany lets it run even if FC partially failed)
MQC_JID=$(sbatch --parsable --dependency=afterany:${FC_JID} multiqc.sh)

cat <<EOF
Submitted RNA-Seq pipeline (Drosophila):
  fastq-dump     ${DUMP_JID}
  index          ${IDX_JID}
  trim           ${TRIM_JID}
  align          ${ALIGN_JID}
  featurecounts  ${FC_JID}
  multiqc        ${MQC_JID}

Monitor with:  squeue -u \$USER
Cancel all:    scancel ${DUMP_JID} ${IDX_JID} ${TRIM_JID} ${ALIGN_JID} ${FC_JID} ${MQC_JID}
Final report (after pipeline finishes): ~/scratch/rnaseq/Drosophila/qc/Drosophila_report.html
EOF

Run it:

chmod +x run_all.sh
bash run_all.sh
squeue -u $USER
# example output (your job IDs will differ)
Submitted RNA-Seq pipeline (Drosophila):
  fastq-dump     <jobid>
  index          <jobid>
  trim           <jobid>
  align          <jobid>
  featurecounts  <jobid>
  multiqc        <jobid>

Monitor with:  squeue -u $USER
Cancel all:    scancel <jobid> <jobid> <jobid> <jobid> <jobid> <jobid>
Final report (after pipeline finishes): ~/scratch/rnaseq/Drosophila/qc/Drosophila_report.html

🧑‍💻 Hands-on walkthrough — submit the Drosophila pipeline step-by-step

If you want to see exactly what run_all.sh does (or debug one step), submit each stage manually. Every sbatch returns a job ID that the next step depends on.

Do this first (login shell — one time):

micromamba activate RNASEQ_bch709
cd ~/scratch/rnaseq/Drosophila
mkdir -p logs qc

Then submit each step — each line is one command:

# --- Step 1: download FASTQs (no prerequisites) ---
DUMP_JID=$(sbatch --parsable fastq-dump.sh)
echo "fastq-dump    -> $DUMP_JID"

# --- Step 2: build STAR index (parallel with Step 1) ---
IDX_JID=$(cd reference && sbatch --parsable index.sh)
echo "index         -> $IDX_JID"

# --- Step 3: trim (waits for fastq-dump) ---
TRIM_JID=$(sbatch --parsable --dependency=afterok:${DUMP_JID} trim.sh)
echo "trim          -> $TRIM_JID"

# --- Step 4: align (waits for BOTH trim and index) ---
ALIGN_JID=$(sbatch --parsable --dependency=afterok:${TRIM_JID}:${IDX_JID} mapping.sh)
echo "align         -> $ALIGN_JID"

# --- Step 5: featureCounts (waits for align) ---
FC_JID=$(sbatch --parsable --dependency=afterok:${ALIGN_JID} featureCounts.sh)
echo "featurecounts -> $FC_JID"

# --- Step 6: MultiQC (waits for featureCounts) ---
MQC_JID=$(sbatch --parsable --dependency=afterany:${FC_JID} multiqc.sh)
echo "multiqc       -> $MQC_JID"

# Check that everything is queued
squeue -u $USER
# Steps 3-6 should show state PD with reason (Dependency)
# example squeue (your job IDs and times will differ)
JOBID   PARTITION    NAME                  USER      ST  TIME  NODES NODELIST(REASON)
<jobid> cpu-core-0   fastqdump_Drosophi    <netid>   R   0:42  1     cpu-12
<jobid> cpu-core-0   index_Drosophila      <netid>   R   0:42  1     cpu-13
<jobid> cpu-core-0   trim_Drosophila       <netid>   PD  0:00  1     (Dependency)
<jobid> cpu-core-0   align_Drosophila      <netid>   PD  0:00  1     (Dependency)
<jobid> cpu-core-0   featurecounts_Drosop  <netid>   PD  0:00  1     (Dependency)
<jobid> cpu-core-0   multiqc_Drosophila    <netid>   PD  0:00  1     (Dependency)

Why type each step instead of just running run_all.sh?

Both produce the same dependency chain. The hands-on walkthrough lets you see each ${JID} appear and inspect outputs/logs in between. Once you’re comfortable, just run bash run_all.sh next time.

➡️ You now have a counts matrix — head over to differential expression

bam/Drosophila.featureCount.cnt is the only file the DESeq2 / EdgeR analysis needs. Continue with the Drosophila DEG subsection below the ATH DEG walkthrough — same samples.txt pattern, just point --matrix at Drosophila.featureCount_count_only.cnt.

Re-running a single failed step (drop --dependency=)

When only one step failed and every upstream step is already in COMPLETED state, submit just the failed script with no --dependency= flag. The run_all.sh driver only needs --dependency=... because it submits everything at once with sbatch --parsable capturing job IDs that don’t yet exist as completed jobs. If the upstream outputs (FASTQs in raw_data/, trimmed reads in trim/, STAR index in reference/, BAMs in bam/) are already on disk, you don’t need that wiring.

Arabidopsis pipeline: cd ~/scratch/rnaseq/ATH first, then run only the line for the step that failed:

sbatch fastq-dump.sh                       # download FASTQs
(cd reference && sbatch index.sh)          # build STAR index
sbatch trim.sh                             # fastp trimming
sbatch align.sh                            # STAR alignment
sbatch multiqc.sh                          # final aggregated report

Drosophila pipeline: cd ~/scratch/rnaseq/Drosophila first, then run only the line for the step that failed:

sbatch fastq-dump.sh                       # download FASTQs
(cd reference && sbatch index.sh)          # build STAR index
sbatch trim.sh                             # fastp trimming
sbatch mapping.sh                          # STAR alignment
sbatch featureCounts.sh                    # gene-level read counting
sbatch multiqc.sh                          # final aggregated report

These pipelines are loop-driven (one script processes every sample listed in samples.txt), not Slurm --array= jobs — so to re-run a single sample, edit samples.txt (or comment out the other lines in the script’s loop) before resubmitting. There is no sbatch --array=N shortcut here.
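A minimal sketch of that samples.txt trick, using stand-in Drosophila accessions in a throwaway directory (on Pronghorn, work inside the project directory and then sbatch the failed step):

```shell
# Demo in a scratch directory; the real samples.txt lives in the project dir.
cd "$(mktemp -d)"
printf 'SRR16287545\nSRR16287546\nSRR16287550\n' > samples.txt   # stand-in list
cp samples.txt samples.txt.full                    # keep the full list safe
grep 'SRR16287550' samples.txt.full > samples.txt  # keep only the failed sample
cat samples.txt                                    # now lists one accession
# sbatch trim.sh                                   # resubmit just the failed step
mv samples.txt.full samples.txt                    # restore once the job finishes
```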

Sanity check before resubmitting — verify upstream outputs exist:

sacct -u $USER --format=JobID,JobName%-25,State,ExitCode --starttime today
ls -lh ~/scratch/rnaseq/ATH/{raw_data,trim,bam}        # Arabidopsis
ls -lh ~/scratch/rnaseq/Drosophila/{raw_data,trim,bam} # Drosophila

MultiQC special case: multiqc.sh is the only step submitted with --dependency=afterany:... (not afterok) inside run_all.sh, so the report still renders even if an upstream step exited non-zero. If you re-run MultiQC standalone (no --dependency=), keep its script unchanged; do not swap afterany back to afterok in run_all.sh.

Mus musculus

Data Download

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA773499

CCR2-dependent monocyte-derived cells restrict SARS-CoV-2 infection (house mouse)

SARS-CoV-2 has caused a historic pandemic of respiratory disease (COVID-19) and current evidence suggests severe disease is associated with dysregulated immunity within the respiratory tract. However, the innate immune mechanisms that mediate protection during COVID-19 are not well defined. Here we characterize a mouse model of SARS-CoV-2 infection and find that early CCR2-dependent infiltration of monocytes restricts viral burden in the lung. We find that a recently developed mouse-adapted MA-SARS-CoV-2 strain, as well as the emerging B.1.351 variant, trigger an inflammatory response in the lung characterized by expression of pro-inflammatory cytokines and interferon-stimulated genes. Using intravital antibody labeling, we demonstrate that MA-SARS-CoV-2 infection leads to increases in circulating monocytes and an influx of CD45+ cells into the lung parenchyma that is dominated by monocyte-derived cells. scRNA-seq analysis of lung homogenates identified a hyper-inflammatory monocyte profile. We utilize this model to demonstrate that mechanistically, CCR2 signaling promotes infiltration of classical monocytes into the lung and expansion of monocyte-derived cells. Parenchymal monocyte-derived cells appear to play a protective role against MA-SARS-CoV-2, as mice lacking CCR2 showed higher viral loads in the lungs, increased lung viral dissemination, and elevated inflammatory cytokine responses. These studies have identified a CCR2-monocyte axis that is critical for promoting viral control and restricting inflammation within the respiratory tract during SARS-CoV-2 infection. Overall design: 8 samples in total corresponding to different mice. 4 samples are from mock, control mice. 4 samples are from SARS-CoV-2 infected mice.

mkdir -p ~/scratch/rnaseq
cd ~/scratch/rnaseq/
mkdir Mmusculus && cd Mmusculus
mkdir raw_data trim bam reference
pwd
Run ID LibraryName
SRR16526489 Mock 1; Mus musculus; RNA-Seq
SRR16526488 Mock 2; Mus musculus; RNA-Seq
SRR16526486 Mock 3; Mus musculus; RNA-Seq
SRR16526483 Mock 4; Mus musculus; RNA-Seq
SRR16526477 CoV2 3; Mus musculus; RNA-Seq
SRR16526479 CoV2 2; Mus musculus; RNA-Seq
SRR16526481 CoV2 1; Mus musculus; RNA-Seq
SRR16526475 CoV2 4; Mus musculus; RNA-Seq

Reference download

Browse: https://www.ncbi.nlm.nih.gov/genome/?term=Mus+musculus

Download files (NCBI RefSeq GRCm39)

mkdir -p ~/scratch/rnaseq/Mmusculus/reference && cd ~/scratch/rnaseq/Mmusculus/reference

NCBI_BASE="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39"
curl -fsSL --retry 3 --max-time 3600 -o GRCm39_genomic.fna.gz \
    "${NCBI_BASE}/GCF_000001635.27_GRCm39_genomic.fna.gz"
curl -fsSL --retry 3 --max-time 600  -o GRCm39_genomic.gff.gz \
    "${NCBI_BASE}/GCF_000001635.27_GRCm39_genomic.gff.gz"
gunzip -f GRCm39_genomic.fna.gz GRCm39_genomic.gff.gz

# featureCounts expects GTF — convert with gffread (already in RNASEQ_bch709)
gffread GRCm39_genomic.gff -T -F --keep-exon-attrs -o GRCm39_genomic.gtf
ls -lh
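The Mus musculus featureCounts command later in this page uses -g gene_name, which only works if the converted GTF actually carries that attribute. A quick grep confirms it; the snippet fabricates a one-line GTF so it runs anywhere — on the cluster, point the same grep at GRCm39_genomic.gtf.

```shell
# Synthetic GTF line standing in for GRCm39_genomic.gtf (hypothetical gene)
printf 'chr1\tRefSeq\texon\t1\t100\t.\t+\t.\tgene_id "LOC1"; gene_name "Xkr4";\n' > demo.gtf
grep -c 'gene_name "' demo.gtf   # any non-zero count means -g gene_name will work
```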

Solanum lycopersicum

Project site

Whole genome sequencing and transcriptome sequencing of Solanum lycopersicum, M82 https://www.ncbi.nlm.nih.gov/bioproject/PRJNA753098

mkdir -p ~/scratch/rnaseq
cd ~/scratch/rnaseq/
mkdir Slycopersium && cd Slycopersium
mkdir raw_data trim bam reference
pwd
Run ID LibraryName
SRR15607542 Root control Rep1
SRR15607543 Root control Rep2
SRR15607544 Root control Rep3
SRR15607552 Root Salt treatment Rep1
SRR15607553 Root Salt treatment Rep2
SRR15607554 Root Salt treatment Rep3

Reference Download

https://phytozome-next.jgi.doe.gov/info/Slycopersicum_ITAG4_0

Download files

Slycopersicum_691_ITAG4.0.gene.gff3.gz
Slycopersicum_691_SL4.0.fa.gz

Mosquito (Anopheles stephensi)

RNAseq from adult male and female Anopheles stephensi https://www.ncbi.nlm.nih.gov/bioproject/PRJNA277477

Folder preparation

mkdir -p ~/scratch/rnaseq
cd ~/scratch/rnaseq/  
mkdir Astephensi && cd Astephensi  
mkdir raw_data trim bam reference  
pwd 

SRA read download

Run ID LibraryName
SRR1851022 Anopheles stephensi male RNAseq replicate 1
SRR1851024 Anopheles stephensi male RNAseq replicate 2
SRR1851026 Anopheles stephensi male RNAseq replicate 3
SRR1851027 Anopheles stephensi female RNAseq replicate 1
SRR1851028 Anopheles stephensi female RNAseq replicate 2
SRR1851030 Anopheles stephensi female RNAseq replicate 3

Reference genome (VectorBase)

Browse: https://vectorbase.org/vectorbase/app/record/dataset/TMPTX_asteIndian

Reference download

mkdir -p ~/scratch/rnaseq/Astephensi/reference && cd ~/scratch/rnaseq/Astephensi/reference

VB_FA="https://vectorbase.org/common/downloads/release-68/AstephensiSDA-500/fasta/data/VectorBase-68_AstephensiSDA-500_Genome.fasta"
VB_GFF="https://vectorbase.org/common/downloads/release-68/AstephensiSDA-500/gff/data/VectorBase-68_AstephensiSDA-500.gff"
curl -fsSL --retry 3 --max-time 1800 -o AstephensiSDA-500.fasta "${VB_FA}"
curl -fsSL --retry 3 --max-time 600  -o AstephensiSDA-500.gff   "${VB_GFF}"
ls -lh

Expression values and Normalization

CPM, RPKM, FPKM, TPM, RLE, MRN, Q, UQ, TMM, VST, RLOG, VOOM … Too many…

CPM: Controls for sequencing depth when dividing by total count. Not for within-sample comparison or DE.

Counts per million (CPM) mapped reads are counts divided by the total number of sequenced fragments (N) and scaled by one million. This unit is FPKM without the length normalization, with a scale factor of 10^6:

CPM = (X / N) * 10^6, where X = reads mapped to the gene and N = total mapped reads

RPKM/FPKM: Controls for sequencing depth and gene length. Good for technical replicates, not good for sample-sample due to compositional bias. Assumes total RNA output is same in all samples. Not for DE.

TPM: Similar to RPKM/FPKM. Corrects for sequencing depth and gene length. Also comparable between samples but no correction for compositional bias.

TMM/RLE/MRN: Improved assumption: the output between samples is similar for a core set of genes only. Corrects for compositional bias. Used for DE. RLE and MRN are very similar and correlate well with sequencing depth. edgeR::calcNormFactors() implements TMM, TMMwzp, RLE & UQ. DESeq2::estimateSizeFactors() implements the median ratio method (RLE). Does not correct for gene length.

VST/RLOG/VOOM: Variance is stabilised across the range of mean values. For use in exploratory analyses. Not for DE. vst() and rlog() functions from DESeq2. voom() function from Limma converts data to normal distribution.

geTMM: Gene length corrected TMM.

For DE analysis with R packages (DESeq2, edgeR, limma, etc.), use raw counts
For visualisation (PCA, clustering, heatmaps etc), use TPM or TMM
For own analysis with gene length correction, use TPM (maybe geTMM?)
Other solutions: spike-ins/house-keeping genes
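To make the simplest of these units concrete, here is CPM computed with a two-pass awk over a synthetic (gene, raw count) table; the numbers are made up:

```shell
printf 'geneA\t100\ngeneB\t300\ngeneC\t600\n' > counts_demo.tsv
# Pass 1 sums the library; pass 2 rescales each gene to reads per million.
awk 'NR==FNR { total += $2; next }
     { printf "%s\t%.1f\n", $1, $2 / total * 1e6 }' counts_demo.tsv counts_demo.tsv
# geneA -> 100000.0, geneB -> 300000.0, geneC -> 600000.0
```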

Featurecount

featureCounts -p  -a <GENOME>.gtf <SAMPLE1>.bam <SAMPLE2>.bam <SAMPLE3>.bam  ...... -o counts.txt
micromamba activate RNASEQ_bch709
cd ~/scratch/rnaseq/ATH/bam
featureCounts -o ATH.featureCount.cnt -p --countReadPairs -a ~/scratch/rnaseq/ATH/reference/TAIR10_GFF3_genes.gtf SRR1761506.bamAligned.sortedByCoord.out.bam  SRR1761509.bamAligned.sortedByCoord.out.bam SRR1761507.bamAligned.sortedByCoord.out.bam  SRR1761510.bamAligned.sortedByCoord.out.bam SRR1761508.bamAligned.sortedByCoord.out.bam  SRR1761511.bamAligned.sortedByCoord.out.bam
micromamba activate RNASEQ_bch709
cd ~/scratch/rnaseq/Mmusculus/bam
featureCounts -o Mmusculus.featureCount.cnt -p --countReadPairs -a ~/scratch/rnaseq/Mmusculus/reference/GRCm39_genomic.gtf -g "gene_name"  <YOUR BAM FILES>

FPKM

FPKM = X / (N * L) * 10^9

X = mapped read count, N = total number of mapped reads, L = transcript length

‘length’ is this transcript’s sequence length (poly(A) tail is not counted). ‘effective_length’ counts only the positions that can generate a valid fragment.

FPKM

Fragments per Kilobase of transcript per million mapped reads

X = 3752
Number_Reads_mapped = 559192
Length = 651.04
fpkm= X*(1000/Length)*(1000000/Number_Reads_mapped)
fpkm

The two scale factors (1,000 and 1,000,000) combine to ten to the ninth power: 10**9.

TPM

Transcripts Per Million

TPM_i = (X_i / L_i) / sum_j (X_j / L_j) * 10^6

Equivalently, rescaling FPKM: TPM = FPKM / sum(FPKM) * 10^6

Paper read

Li et al., 2010, RSEM

Dillies et al., 2013

FPKM

Fragments per Kilobase of transcript per million mapped reads

awk 'FNR > 2 { sum+=$7 } END {print sum}' ATH.featureCount.cnt

Example AT1G01060.TAIR10

Length = 976 
X = 500
Number_Reads_mapped = 5949384
fpkm= X*(1000/Length)*(1000000/Number_Reads_mapped)
fpkm

The two scale factors combine to ten to the ninth power (10**9), so FPKM can be computed in one step:

fpkm=X/(Number_Reads_mapped*Length)*10**9
fpkm
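The step-by-step form and the one-step 10**9 form are algebraically identical; an awk check using the worked numbers above:

```shell
awk 'BEGIN { X = 500; L = 976; N = 5949384
             f1 = X * (1000 / L) * (1000000 / N)   # step-by-step FPKM
             f2 = X / (N * L) * 10^9               # one-step FPKM
             printf "%.4f %.4f\n", f1, f2 }'
# both values print as 86.1089
```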

TPM

sum_count_per_length

awk 'FNR > 2 { sum+=$7/$6 } END {print sum}' ATH.featureCount.cnt
egrep AT1G01060 ATH.featureCount.cnt

TPM calculation from reads count


sum_count_per_length =  4747.27
X = 500
Length = 976
TPM = (X/Length)*(1/sum_count_per_length )*10**6
TPM
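The same TPM arithmetic, generalized to every gene with a two-pass awk over a synthetic (gene, length, count) table:

```shell
printf 'geneA\t1000\t100\ngeneB\t500\t100\ngeneC\t2000\t400\n' > tpm_demo.tsv
# Pass 1: sum of count/length over all genes; pass 2: rescale to per-million.
awk 'NR==FNR { s += $3 / $2; next }
     { printf "%s\t%.1f\n", $1, ($3 / $2) / s * 1e6 }' tpm_demo.tsv tpm_demo.tsv
# geneA -> 200000.0, geneB -> 400000.0, geneC -> 400000.0
```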

TPM and FPKM calculation

cut -f1,6- ATH.featureCount.cnt | egrep -v "#" | sed 's/Aligned\.sortedByCoord\.out\.bam//g; s/\.bam//g' > ATH.featureCount_count_length.cnt

python /data/gpfs/assoc/bch709-6/Course_material/script/tpm_raw_exp_calculator.py -count ATH.featureCount_count_length.cnt

TPM calculation from FPKM

FPKM = 86.10892858272605
SUM_FPKM = 797942
TPM=(FPKM/SUM_FPKM)*10**6
TPM
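Both routes (from raw counts, or by rescaling FPKM) give the same TPM for AT1G01060, plugging in the values from the two worked examples above:

```shell
awk 'BEGIN { tpm_counts = (500 / 976) / 4747.27 * 1e6          # from raw counts
             tpm_fpkm   = 86.10892858272605 / 797942 * 1e6     # from FPKM
             printf "%.2f %.2f\n", tpm_counts, tpm_fpkm }'
# both values print as 107.91
```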

Featurecount calculation

Use GTF and BAM file under reference and bam folder, respectively.

Drosophila

Mus musculus

Solanum lycopersicum

Mosquito (Anopheles stephensi)

MultiQC summary

MultiQC walks a directory and stitches every QC artifact (fastp JSON, STAR Log.final.out, featureCounts .summary, FastQC, etc.) into a single HTML report — the one file you open to see whether every sample of the cohort behaved.

Arabidopsis

Save as ~/scratch/rnaseq/ATH/multiqc.sh:

#!/bin/bash
#SBATCH --job-name=multiqc_ATH
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=2
#SBATCH --mem=8g
#SBATCH --time=01:00:00
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o logs/multiqc_%j.out

set -euo pipefail
cd ~/scratch/rnaseq/ATH

mkdir -p qc

# Pull together everything multiqc can parse under the project dir
multiqc . -o qc/ -n ATH_report --force \
    --module fastp \
    --module star \
    --module featurecounts

What this picks up:

Module Source files What you see
fastp trim/*_fastp.json Q20/Q30 rates, duplication %, adapter trimming per sample
star bam/*Log.final.out Uniquely mapped %, multi-mapped %, splicing rates
featureCounts ATH.featureCount.cnt.summary Assigned vs unassigned reads (ambiguity, no-feature, multi-mapping)

Submit (after align.sh and featureCounts have finished):

MQC_JID=$(sbatch --parsable --dependency=afterany:${FC_JID} multiqc.sh)
echo "MultiQC: ${MQC_JID}"

Once it finishes, copy the report to your laptop and open it in a browser:

scp <netid>@pronghorn.rc.unr.edu:~/scratch/rnaseq/ATH/qc/ATH_report.html ./
open ATH_report.html      # or: double-click in your file browser

Drosophila / Mus musculus / Solanum lycopersicum / Mosquito (Anopheles stephensi)

Same script, only the project path and report name change: copy multiqc.sh into the species dir and update them with sed (the Drosophila case is shown; substitute your species):

cd ~/scratch/rnaseq/Drosophila    # or Mmusculus / Slycopersium / Astephensi
cp ../ATH/multiqc.sh .
sed -i 's|~/scratch/rnaseq/ATH|~/scratch/rnaseq/Drosophila|g; s|ATH_report|Drosophila_report|g' multiqc.sh
sbatch multiqc.sh

The report will show the same modules as above. If a section is missing, the upstream step’s artifact wasn’t written — check that step’s log before continuing to DE analysis.

DESeq2 vs EdgeR Normalization method

DESeq and EdgeR are very similar, and both assume that most genes are not differentially expressed. DESeq uses a “geometric” normalisation strategy, whereas EdgeR uses a weighted mean of log ratios. Both normalise the data initially via the calculation of size / normalisation factors.

Here is further information (important parts in bold):

DESeq

DESeq: This normalization method is included in the DESeq Bioconductor package (version 1.6.0) and is based on the hypothesis that most genes are not DE. A DESeq scaling factor for a given lane is computed as the median of the ratio, for each gene, of its read count over its geometric mean across all lanes. The underlying idea is that non-DE genes should have similar read counts across samples, leading to a ratio of 1. Assuming most genes are not DE, the median of this ratio for the lane provides an estimate of the correction factor that should be applied to all read counts of this lane to fulfill the hypothesis. By calling the estimateSizeFactors() and sizeFactors() functions in the DESeq Bioconductor package, this factor is computed for each lane, and raw read counts are divided by the factor associated with their sequencing lane.
DESeq2

ϕ was assumed to be a function of μ determined by nonparametric regression. The recent version used in this paper follows a more versatile procedure. Firstly, for each transcript, an estimate of the dispersion is made, presumably using maximum likelihood. Secondly, the estimated dispersions for all transcripts are fitted to the functional form:
ϕ = a + b/μ (DESeq2 parametric fit), using a gamma-family generalised linear model (i.e. regression)


EdgeR

Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR Bioconductor package (version 2.4.0). It is also based on the hypothesis that most genes are not DE. The TMM factor is computed for each lane, with one lane being considered as a reference sample and the others as test samples. For each test sample, TMM is computed as the weighted mean of log ratios between this test and the reference, after exclusion of the most expressed genes and the genes with the largest log ratios. According to the hypothesis of low DE, this TMM should be close to 1. If it is not, its value provides an estimate of the correction factor that must be applied to the library sizes (and not the raw counts) in order to fulfill the hypothesis. The calcNormFactors() function in the edgeR Bioconductor package provides these scaling factors. To obtain normalized read counts, these normalization factors are re-scaled by the mean of the normalized library sizes. Normalized read counts are obtained by dividing raw read counts by these re-scaled normalization factors.
EdgeR

edgeR recommends a “tagwise dispersion” function, which estimates the dispersion on a gene-by-gene basis, and implements an empirical Bayes strategy for squeezing the estimated dispersions towards the common dispersion. Under the default setting, the degree of squeezing is adjusted to suit the number of biological replicates within each condition: more biological replicates will need to borrow less information from the complete set of transcripts and require less squeezing.


DESeq2 vs EdgeR Statistical tests for differential expression

DESeq2

DESeq2 uses raw counts, rather than normalized count data, and models the normalization to fit the counts within a Generalized Linear Model (GLM) of the negative binomial family with a logarithmic link. Statistical tests are then performed to assess differential expression, if any.

EdgeR

Data are normalized to account for sample size differences and variance among samples. The normalized count data are used to estimate per-gene fold changes and to perform statistical tests of whether each gene is likely to be differentially expressed.
EdgeR uses an exact test under a negative binomial distribution (Robinson and Smyth, 2008). The statistical test is related to Fisher’s exact test, though Fisher uses a different distribution.

Major difference

The major differences between the two methods are in some of the defaults. DESeq2 by default does a couple of things (which can all optionally be turned off): it finds an optimal value at which to filter low count genes, flags genes with large outlier counts or removes these outlier values when there are sufficient samples per group (n>6), excludes from the estimation of the dispersion prior and dispersion moderation those genes with very high within-group variance, and moderates log fold changes which have small statistical support (e.g. from low count genes). edgeR offers similar functionality: for example, it offers a robust dispersion estimation function, estimateGLMRobustDisp, which reduces the effect of individual outlier counts, and a robust argument to estimateDisp so that hyperparameters are not overly affected by genes with very high within-group variance. And the default steps in the edgeR User Guide for filtering low-count genes both increase power by reducing the multiple-testing burden and remove genes with uninformative log fold changes.

DEG software comparison paper

Fold change

Fold change (FC) is a measure describing the degree of quantity change between a control and a treatment value. For instance, for a data set with a control value of 20 and a treatment value of 80, the corresponding fold change is 3 under this definition, or in common terms, a three-fold increase. Fold change is computed simply as the ratio of the change between the treatment and control values to the control value. Thus, if the control value is X and the treatment value is Y, the fold change is (Y - X)/X, or equivalently Y/X - 1. As another example, a change from 60 to 30 would be a fold change of -0.5, while a change from 30 to 60 would be a fold change of 1 (a change of 2 times the original).

Likely because of this ambiguity, many researchers use both “fold” and “fold change” as synonyms of “times,” as in “2-fold larger” = “2 times larger.” Among some experts in this field, the relative-change usage persists, as in “40 is 1-fold greater than 20.” Therefore, one could argue that the use of fold change, as in “X is 3-fold greater than 15,” should be avoided altogether, since some will interpret this to mean X is 45 whereas others will understand it to mean X is 60.

In DESeq2, fold change is typically calculated simply as the average of group 2 over the average of group 1:

(average in group2)/(average in group1)
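A quick awk illustration of the ratio fold change versus the log2 fold change (the scale DESeq2 reports), with hypothetical group means of 20 and 80:

```shell
awk 'BEGIN { control = 20; treatment = 80
             printf "ratio FC = %.2f\n", treatment / control               # 4.00
             printf "log2  FC = %.2f\n", log(treatment / control) / log(2) }'  # 2.00
```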

Why compute raw fold changes yourself? There are good Bioconductor packages that do this for you. For example, DESeq2 applies shrinkage methods to the fold changes. A raw fold change is not informative in bioinformatic statistical analysis because it ignores the expression level (and variance) of the gene: highly and lowly expressed genes can give you the same fold change, and you don’t want this to happen.

Hypothesis testing using the Wald test

The first step in hypothesis testing is to set up a null hypothesis for each gene. In our case, the null hypothesis is that there is no differential expression across the two sample groups (LFC == 0). Notice that we can do this without observing any data, because it is based on a thought experiment. Second, we use a statistical test to determine whether, based on the observed data, the null hypothesis is true. With DESeq2, the Wald test is commonly used for hypothesis testing when comparing two groups. A Wald test statistic is computed, along with the probability of observing a test statistic at least as extreme as the observed value by chance. This probability is called the p-value of the test. If the p-value is small, we reject the null hypothesis and state that there is evidence against the null (i.e. the gene is differentially expressed).

Multiple test correction

Note that we have pvalues and p-adjusted values in the output. Which should we use to identify significantly differentially expressed genes?

If we used the p-value directly from the Wald test with a significance cut-off of p < 0.05, that means there is a 5% chance each call is a false positive. Each p-value is the result of a single test (single gene). The more genes we test, the more we inflate the false positive rate. This is the multiple testing problem. For example, if we test 20,000 genes for differential expression, at p < 0.05 we would expect to find 1,000 genes significant by chance alone. If we found 3,000 genes to be differentially expressed in total, roughly one third of them could be false positives. We would not want to sift through our “significant” genes trying to identify which ones are true positives.

DESeq2 helps reduce the number of genes tested by removing those genes unlikely to be significantly DE prior to testing, such as those with low number of counts and outlier samples (gene-level QC). However, we still need to correct for multiple testing to reduce the number of false positives, and there are a few common approaches:

Bonferroni

The adjusted p-value is calculated by: p-value * m (m = total number of tests). This is a very conservative approach with a high probability of false negatives, so is generally not recommended.

FDR/Benjamini-Hochberg

Benjamini and Hochberg (1995) defined the concept of FDR and created an algorithm to control the expected FDR below a specified level given a list of independent p-values. An interpretation of the BH method for controlling the FDR is implemented in DESeq2 in which we rank the genes by p-value, then multiply each ranked p-value by m/rank.
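A self-contained awk sketch of the BH procedure (rank p-values ascending, multiply each by m/rank, then enforce monotonicity from the largest p-value down); the four p-values are made up:

```shell
printf '0.005\n0.010\n0.030\n0.800\n' > pvals.txt   # must be sorted ascending
awk '{ p[NR] = $1 } END {
       m = NR; running = 1
       for (i = m; i >= 1; i--) {                 # walk from the largest p down
           adj = p[i] * m / i
           if (adj < running) running = adj       # adjusted p must stay monotone
           padj[i] = (running > 1) ? 1 : running
       }
       for (i = 1; i <= m; i++) printf "%.4f\t%.4f\n", p[i], padj[i]
     }' pvals.txt
# matches R: p.adjust(c(0.005, 0.010, 0.030, 0.800), method = "BH")
```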

Q-value / Storey method

The minimum FDR that can be attained when calling that feature significant. For example, if gene X has a q-value of 0.013, it means that 1.3% of genes that show p-values at least as small as gene X are false positives.

Default test

In DESeq2, the p-values attained by the Wald test are corrected for multiple testing using the Benjamini and Hochberg method by default. There are options to use other methods in the results() function. The p-adjusted values should be used to determine significant genes. The significant genes can be output for visualization and/or functional analysis.

So what does FDR < 0.05 mean? By setting the FDR cutoff to < 0.05, we’re saying that the proportion of false positives we expect amongst our differentially expressed genes is 5%. For example, if you call 500 genes as differentially expressed with an FDR cutoff of 0.05, you expect 25 of them to be false positives.

Environment

micromamba create -n DEG_bch709 -c conda-forge -c bioconda -y \
    r-gplots r-fastcluster=1.1.25 \
    bioconductor-ctc bioconductor-deseq2 bioconductor-qvalue \
    bioconductor-limma bioconductor-edger bioconductor-genomeinfodb \
    bioconductor-topgo bioconductor-org.at.tair.db \
    bioconductor-org.mm.eg.db bioconductor-org.hs.eg.db \
    r-rcurl bedtools intervene r-upsetr r-corrplot r-cairo

micromamba activate DEG_bch709

Arabidopsis

Sample information Run
WT_rep1 SRR1761506
WT_rep2 SRR1761507
WT_rep3 SRR1761508
ABA_rep1 SRR1761509
ABA_rep2 SRR1761510
ABA_rep3 SRR1761511

Slycopersium

Run ID LibraryName
SRR15607542 Root control Rep1
SRR15607543 Root control Rep2
SRR15607544 Root control Rep3
SRR15607552 Root Salt treatment Rep1
SRR15607553 Root Salt treatment Rep2
SRR15607554 Root Salt treatment Rep3

Astephensi

Run ID LibraryName
SRR1851022 Anopheles stephensi male RNAseq replicate 1
SRR1851024 Anopheles stephensi male RNAseq replicate 2
SRR1851026 Anopheles stephensi male RNAseq replicate 3
SRR1851027 Anopheles stephensi female RNAseq replicate 1
SRR1851028 Anopheles stephensi female RNAseq replicate 2
SRR1851030 Anopheles stephensi female RNAseq replicate 3

Mmusculus

Run ID LibraryName
SRR16526489 Mock 1; Mus musculus; RNA-Seq
SRR16526488 Mock 2; Mus musculus; RNA-Seq
SRR16526486 Mock 3; Mus musculus; RNA-Seq
SRR16526483 Mock 4; Mus musculus; RNA-Seq
SRR16526477 CoV2 3; Mus musculus; RNA-Seq
SRR16526479 CoV2 2; Mus musculus; RNA-Seq
SRR16526481 CoV2 1; Mus musculus; RNA-Seq
SRR16526475 CoV2 4; Mus musculus; RNA-Seq

Drosophila

Sample information Run
Control SRR16287545
Control SRR16287546
Control SRR16287547
Ethanol treatment SRR16287549
Ethanol treatment SRR16287548
Ethanol treatment SRR16287550

ATH DEG


cd ~/scratch/rnaseq/ATH
mkdir DEG
cd DEG
cp ~/scratch/rnaseq/ATH/bam/ATH.featureCount* .

cut -f1,7- ATH.featureCount.cnt | egrep -v "#" | sed 's/\.bamAligned\.sortedByCoord\.out\.bam//g; s/\.TAIR10//g' > ATH.featureCount_count_only.cnt 
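To see exactly what that pipeline does, here it is run on a synthetic three-line featureCounts file (comment line, header, one gene row):

```shell
# Synthetic featureCounts output standing in for ATH.featureCount.cnt
printf '# Program:featureCounts\nGeneid\tChr\tStart\tEnd\tStrand\tLength\tSRR1761506.bamAligned.sortedByCoord.out.bam\nAT1G01010.TAIR10\t1\t1\t100\t+\t100\t42\n' > demo.cnt
cut -f1,7- demo.cnt | egrep -v "#" \
    | sed 's/\.bamAligned\.sortedByCoord\.out\.bam//g; s/\.TAIR10//g'
# header becomes "Geneid<TAB>SRR1761506"; the gene row becomes "AT1G01010<TAB>42"
```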

Sample file

nano samples.txt
Control<TAB>SRR1761506
Control<TAB>SRR1761507
Control<TAB>SRR1761508
ABA<TAB>SRR1761509
ABA<TAB>SRR1761510
ABA<TAB>SRR1761511

PtR (Quality Check Your Samples and Biological Replicates)

Once you’ve performed transcript quantification for each of your biological replicates, it’s good to examine the data to ensure that your biological replicates are well correlated, and also to investigate relationships among your samples. If there are any obvious discrepancies among your sample and replicate relationships such as due to accidental mis-labeling of sample replicates, or strong outliers or batch effects, you’ll want to identify them before proceeding to subsequent data analyses (such as differential expression).

PtR  --matrix ATH.featureCount_count_only.cnt  --samples samples.txt --CPM  --log2 --min_rowSums 10   --sample_cor_matrix --compare_replicates

Control.rep_compare.pdf
ABA.rep_compare.pdf

DEG calculation

run_DE_analysis.pl --matrix ATH.featureCount_count_only.cnt --method DESeq2 --samples_file samples.txt --output rnaseq

Slycopersium DEG


cd ~/scratch/rnaseq/Slycopersium
mkdir DEG
cd DEG
cp ~/scratch/rnaseq/Slycopersium/bam/Slycopersium.featureCount* .

cut -f1,7- Slycopersium.featureCount.cnt | egrep -v "#" | sed 's/\.bamAligned\.sortedByCoord\.out\.bam//g; s/\.ITAG4\.0//g' > Slycopersium.featureCount_count_only.cnt 

Sample file

nano samples.txt
Control SRR15607542
Control SRR15607543
Control SRR15607544
Salt    SRR15607552
Salt    SRR15607553
Salt    SRR15607554

PtR (Quality Check Your Samples and Biological Replicates)

PtR --matrix Slycopersium.featureCount_count_only.cnt --samples samples.txt --CPM --log2 --min_rowSums 10 --sample_cor_matrix --compare_replicates

DEG calculation

run_DE_analysis.pl --matrix Slycopersium.featureCount_count_only.cnt  --method DESeq2 --samples_file samples.txt --output rnaseq

Astephensi DEG


cd ~/scratch/rnaseq/Astephensi
mkdir DEG
cd DEG
cp ~/scratch/rnaseq/Astephensi/bam/*.featureCount* .

cut -f1,7- Astephensi.featureCount.cnt | egrep -v "#" | sed 's/\.bamAligned\.sortedByCoord\.out\.bam//g;' > Astephensi.featureCount_count_only.cnt 

Sample file

nano samples.txt
Male    SRR1851022
Male    SRR1851024
Male    SRR1851026
Female  SRR1851027
Female  SRR1851028
Female  SRR1851030

PtR (Quality Check Your Samples and Biological Replicates)

PtR --matrix Astephensi.featureCount_count_only.cnt --samples samples.txt --CPM --log2 --min_rowSums 10 --sample_cor_matrix --compare_replicates

DEG calculation

run_DE_analysis.pl --matrix Astephensi.featureCount_count_only.cnt  --method DESeq2 --samples_file samples.txt --output rnaseq

Mmusculus DEG


cd ~/scratch/rnaseq/Mmusculus
mkdir DEG
cd DEG
cp ~/scratch/rnaseq/Mmusculus/bam/*.featureCount* .

cut -f1,7- Mmusculus.featureCount.cnt | egrep -v "#" | sed 's/\.bamAligned\.sortedByCoord\.out\.bam//g' > Mmusculus.featureCount_count_only.cnt 

Sample file

nano samples.txt
Mock    SRR16526489
Mock    SRR16526488
Mock    SRR16526486
Mock    SRR16526483
CoV SRR16526477
CoV SRR16526479
CoV SRR16526481
CoV SRR16526475

PtR (Quality Check Your Samples and Biological Replicates)

PtR --matrix Mmusculus.featureCount_count_only.cnt --samples samples.txt --CPM --log2 --min_rowSums 10 --sample_cor_matrix --compare_replicates

DEG calculation

run_DE_analysis.pl --matrix Mmusculus.featureCount_count_only.cnt  --method DESeq2 --samples_file samples.txt --output rnaseq

Drosophila DEG


cd ~/scratch/rnaseq/Drosophila
mkdir DEG
cd DEG
cp ~/scratch/rnaseq/Drosophila/bam/*.featureCount* .

cut -f1,7- Drosophila.featureCount.cnt | egrep -v "#" | sed 's/\.bamAligned\.sortedByCoord\.out\.bam//g' > Drosophila.featureCount_count_only.cnt 

Sample file

nano samples.txt
Control SRR16287545
Control SRR16287546
Control SRR16287547
Ethanol SRR16287550
Ethanol SRR16287548
Ethanol SRR16287549

PtR (Quality Check Your Samples and Biological Replicates)

PtR --matrix Drosophila.featureCount_count_only.cnt --samples samples.txt --CPM --log2 --min_rowSums 10 --sample_cor_matrix --compare_replicates

DEG calculation

run_DE_analysis.pl --matrix Drosophila.featureCount_count_only.cnt  --method DESeq2 --samples_file samples.txt --output rnaseq

RNA-Seq subset

DEG subset

cd rnaseq
## 4-fold and p-value 0.01
analyze_diff_expr.pl --samples ~/scratch/rnaseq/ATH/DEG/samples.txt  --matrix ~/scratch/rnaseq/ATH/DEG/ATH.featureCount_count_length.cnt.tpm.tab -P 0.01 -C 2 --output ATH

## 2-fold and p-value 0.01
analyze_diff_expr.pl --samples  ~/scratch/rnaseq/ATH/DEG/samples.txt   --matrix ~/scratch/rnaseq/ATH/DEG/ATH.featureCount_count_length.cnt.tpm.tab -P 0.01 -C 1 --output ATH

DEG output

ATH.matrix.log2.centered.sample_cor_matrix.pdf
ATH.matrix.log2.centered.genes_vs_samples_heatmap.pdf

ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results.P0.01_C2.ABA-UP.subset
ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results.P0.01_C2.Control-UP.subset
ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results.P0.01_C2.DE.subset

ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results.P0.01_C1.ABA-UP.subset
ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results.P0.01_C1.Control-UP.subset
ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results.P0.01_C1.DE.subset

Venn diagram

Intervene installation

mamba install -c bioconda bedtools intervene r-UpSetR=1.4.0 r-corrplot r-Cairo
cd ~/scratch/rnaseq/ATH/DEG/rnaseq
cut -f 1 ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results.P0.01_C2.ABA-UP.subset |  grep -v sample > DESeq.UP_4fold.subset
cut -f 1 ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results.P0.01_C2.Control-UP.subset  |  grep -v sample > DESeq.DOWN_4fold.subset 

cut -f 1 ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results.P0.01_C1.ABA-UP.subset |  grep -v sample > DESeq.UP_2fold.subset
cut -f 1 ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results.P0.01_C1.Control-UP.subset  |  grep -v sample >DESeq.DOWN_2fold.subset
 wc -l DESeq*subset
  701 DESeq.DOWN_2fold.subset
  227 DESeq.DOWN_4fold.subset
 1218 DESeq.UP_2fold.subset
  463 DESeq.UP_4fold.subset
 2609 total
intervene venn --type list --save-overlaps -i DESeq.DOWN_2fold.subset DESeq.DOWN_4fold.subset DESeq.UP_2fold.subset DESeq.UP_4fold.subset
intervene upset --type list --save-overlaps -i DESeq.DOWN_2fold.subset DESeq.DOWN_4fold.subset DESeq.UP_2fold.subset DESeq.UP_4fold.subset 

Result

cd Intervene_results
ls 
Intervene_upset_combinations.txt
Intervene_upset.pdf
Intervene_upset.R
Intervene_venn.pdf
sets
cd sets
0010_DESeq.UP_2fold.txt
0011_DESeq.UP_2fold_DESeq.UP_4fold.txt
1000_DESeq.DOWN_2fold.txt
1100_DESeq.DOWN_2fold_DESeq.DOWN_4fold.txt

Gene Ontology

The Gene Ontology (GO) project is a major bioinformatics initiative, and gene ontology is an annotation system. The project provides a controlled, consistent vocabulary of terms and gene-product annotations: terms occur only once, and there is a dictionary of allowed words. GO describes how gene products behave in a cellular context, giving a consistent description of gene-product attributes in terms of their associated biological processes, cellular components, and molecular functions in a species-independent manner. Each GO term consists of a unique alphanumeric identifier, a common name, synonyms (if applicable), and a textual definition, and each term is assigned to one of the three ontologies. When a term has multiple meanings depending on the species, GO uses a "sensu" tag to differentiate among them, e.g. trichome differentiation (sensu Magnoliophyta).

http://geneontology.org/docs/ontology-documentation/

GO

kegg

hypergeometric test

The hypergeometric distribution is the lesser-known cousin of the binomial distribution, which describes the probability of k successes in n draws with replacement. The hypergeometric distribution instead describes the probability of drawing successes without replacement, like pulling marbles from a jar and not putting them back after each draw. Using the variable convention below (N = population size, k = successes in the population, n = draws, x = observed successes), the hypergeometric probability mass function is

h(x; N, n, k) = [ kCx ] [ N-kCn-x ] / [ NCn ]

hyper_geo combination FWER

FWER

The FWER for the other tests is computed in the same way: the gene-associated variables (scores or counts) are permuted while the annotations of genes to GO-categories stay fixed. Then the statistical tests are evaluated again for every GO-category.

Hypergeometric Test Example 1

Suppose we randomly select 2 cards without replacement from an ordinary deck of playing cards. What is the probability of getting exactly 2 cards you want (i.e., Ace or 10)?

Solution: This is a hypergeometric experiment in which we know the following:

N = 52; there are 52 cards in a deck.
k = 16; 16 of them are an Ace or a 10.
n = 2; we randomly draw 2 cards.
x = 2; both drawn cards must be an Ace or a 10.

We plug these values into the hypergeometric formula as follows:

h(x; N, n, k) = [ kCx ] [ N-kCn-x ] / [ NCn ]

h(2; 52, 2, 16) = [ 16C2 ] [ 36C0 ] / [ 52C2 ]

h(2; 52, 2, 16) = [ 120 ] [ 1 ] / [ 1,326 ]

h(2; 52, 2, 16) = 0.0904977

Thus, the probability of randomly drawing 2 Ace-or-10 cards is about 9%.

category             probability
probability mass f   0.0904977
lower cumulative P   1
upper cumulative Q   0.0904977
Expectation (nk/N)   0.6153846

Hypergeometric Test Example 2

Suppose we have 30 DEGs in a (toy) human genome of 200 genes, 10 of which are oncogenes. What is the probability that exactly 5 of the DEGs are oncogenes?

An oncogene is a gene that has the potential to cause cancer.

Solution: This is a hypergeometric experiment in which we know the following:

N = 200; the genome contains 200 genes.
k = 10; 10 of them are oncogenes.
n = 30; 30 genes are DEGs.
x = 5; 5 of the DEGs are oncogenes.

We plug these values into the hypergeometric formula as follows:

h(x; N, n, k) = [ kCx ] [ N-kCn-x ] / [ NCn ]

h(5; 200, 30, 10) = [ 10C5 ] [ 190C25 ] / [ 200C30 ]

h(5; 200, 30, 10) = [ 252 ] [ 11506192278177947613740456466942 ] / [ 409681705022127773530866523638950880 ]

h(5; 200, 30, 10) = 0.007078

Thus, the probability that exactly 5 of the DEGs are oncogenes is about 0.7%.


hypergeometric distribution value

category             probability
probability mass f   0.0070776
lower cumulative P   0.9990349
upper cumulative Q   0.0080426
Expectation (nk/N)   1.5
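Both worked examples can be reproduced exactly with integer combinatorics. A minimal check in Python (standard library only; `math.comb` needs Python 3.8+; the function name `hypergeom_pmf` is just for illustration):

```python
# Verify both hypergeometric examples with exact integer combinatorics.
from math import comb

def hypergeom_pmf(x, N, n, k):
    """P(X = x): x successes in n draws, without replacement,
    from a population of N items containing k successes."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Example 1: both of 2 drawn cards are an Ace or a 10 (16 such cards in 52)
p1 = hypergeom_pmf(2, 52, 2, 16)
print(round(p1, 7))   # 0.0904977

# Example 2: exactly 5 of 30 DEGs are oncogenes (10 oncogenes in 200 genes)
p2 = hypergeom_pmf(5, 200, 30, 10)
print(round(p2, 6))   # 0.007078

# Expectation is n*k/N in both cases
print(2 * 16 / 52)    # ~0.6154
print(30 * 10 / 200)  # 1.5
```

The same values come from R's dhyper/phyper or from the online calculators shown above.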

False Discovery Rate (FDR) q-value

The false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the expected proportion of “discoveries” (rejected null hypotheses) that are false (incorrect rejections).
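The most common FDR-controlling procedure is Benjamini-Hochberg, where each p-value is multiplied by m/rank and monotonicity is enforced from the largest p-value downward. A minimal sketch (the p-values are made up for illustration):

```python
# Benjamini-Hochberg FDR adjustment: a minimal sketch.
def bh_adjust(pvals):
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(pvals)
    # Walk from the largest p-value down, enforcing monotonicity
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    q = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top                    # 1-based ascending rank
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print([round(x, 4) for x in bh_adjust(pvals)])
# [0.008, 0.032, 0.0672, 0.0672, 0.0672, 0.08, 0.0846, 0.205]
```

This is what DESeq2 reports in its padj column (via R's p.adjust with method "BH").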

Gene ontology

http://geneontology.org/

cleverGO

http://s.tartaglialab.com/page/clever_suite

MetaScape

http://metascape.org/gp/index.html

DAVID

https://david.ncifcrf.gov/

Araport

https://bar.utoronto.ca/thalemine/begin.do

REViGO

http://revigo.irb.hr/

Arabidopsis

cd ~/scratch/rnaseq/ATH/DEG/rnaseq

cat DESeq.DOWN_4fold.subset
cat DESeq.UP_4fold.subset

Mouse

~/scratch/rnaseq/Mmusculus/DEG/rnaseq
 cut -f 1 Mmusculus.featureCount_count_only.cnt.CoV_vs_Mock.DESeq2.DE_results.P0.01_C2.Mock-UP.subset | egrep -v sample

https://reactome.org/PathwayBrowser/#/DTAB=AN&ANALYSIS=MjAyMTExMTcwNjE3MjNfNTU5MTM%253D

Tomato

~/scratch/rnaseq/Slycopersium/DEG/rnaseq

If analyze_diff_expr.pl fails with an error like

Error, no counts from matrix for Solyc12g017350 at /data/gpfs/home/wyim/scratch/bin/miniconda3/envs/DEG_bch709/bin/analyze_diff_expr.pl line 363, <$fh> line 2.

the gene IDs in the TPM matrix still carry the .ITAG4.0 suffix while the DE results do not. Strip the suffix and rerun:

sed -i 's/\.ITAG4\.0//g' Slycopersium.featureCount_count_length.cnt.tpm.tab

analyze_diff_expr.pl --samples ../samples.txt --matrix ../Slycopersium.featureCount_count_length.cnt.tpm.tab -P 0.01 -C 2 --output Slycopersium

BLAST

BLAST (Basic Local Alignment Search Tool) is a popular program for searching biosequences against databases. BLAST was developed and is maintained by a group at the National Center for Biotechnology Information (NCBI). Salient characteristics of BLAST are:

Local alignments

BLAST tries to find patches of regional similarity, rather than trying to find the best alignment between your entire query and an entire database sequence.

Ungapped alignments

Alignments generated by the original BLAST algorithm do not contain gaps; BLAST's speed and statistical model depend on this, though in theory it reduces sensitivity. (Gapped BLAST, introduced in 1997, lifts this restriction.) BLAST will, however, report multiple local alignments between your query and a database sequence.

Explicit statistical theory

BLAST is based on an explicit statistical theory developed by Samuel Karlin and Stephen Altschul (PNAS 87:2264-2268, 1990). The original theory was later extended to cover multiple weak matches between query and database entry (PNAS 90:5873-5877, 1993).

CAUTION: the repetitive nature of many biological sequences (particularly naive translations of DNA/RNA) violates assumptions made in the Karlin & Altschul theory. While the P values provided by BLAST are a good rule-of-thumb for initial identification of promising matches, care should be taken to ensure that matches are not due simply to biased amino acid composition.

CAUTION: The databases are contaminated with numerous artifacts. The intelligent use of filters can reduce problems from these sources. Remember that the statistical theory only covers the likelihood of finding a match by chance under particular assumptions; it does not guarantee biological importance.

Heuristic

BLAST is not guaranteed to find the best alignment between your query and the database; it may miss matches. It uses a strategy that is expected to find most matches but sacrifices complete sensitivity for speed. In practice, however, few biologically significant matches that other sequence-search programs can find are missed by BLAST.

BLAST searches the database in two phases: first it looks for short subsequences (words) that are likely to produce significant matches, and then it tries to extend these subsequences into longer alignments. In protein searches (BLASTP, BLASTX, TBLASTN), both phases (scanning and extension) use a substitution matrix to score matches. This contrasts with FASTA, which uses a substitution matrix only during the extension phase; substitution matrices greatly improve sensitivity.
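The two-phase seed-and-extend strategy can be illustrated with a toy sketch. Simple nucleotide match/mismatch scoring stands in for a substitution matrix, and the word size, scores, and X-drop cutoff below are illustrative, not BLAST's defaults:

```python
# Toy seed-and-extend: find shared words (seeds), then extend each seed
# ungapped in both directions, keeping the best-scoring segment (an "HSP").
WORD, MATCH, MISMATCH, XDROP = 4, 1, -2, 4   # illustrative parameters

def seeds(query, subject, w=WORD):
    """All (qpos, spos) pairs where query and subject share a length-w word."""
    index = {}
    for i in range(len(subject) - w + 1):
        index.setdefault(subject[i:i + w], []).append(i)
    for j in range(len(query) - w + 1):
        for i in index.get(query[j:j + w], []):
            yield j, i

def extend(query, subject, j, i, w=WORD):
    """Ungapped extension of a seed with an X-drop cutoff.
    Returns (score, qstart, sstart, length) of the best segment found."""
    score = best = w * MATCH
    best_left = left = 0
    while j - left > 0 and i - left > 0:               # extend to the left
        left += 1
        score += MATCH if query[j - left] == subject[i - left] else MISMATCH
        if score > best:
            best, best_left = score, left
        elif best - score > XDROP:                     # fell too far below best
            break
    score = best
    best_right = right = 0
    while j + w + right < len(query) and i + w + right < len(subject):
        score += MATCH if query[j + w + right] == subject[i + w + right] else MISMATCH
        right += 1
        if score > best:
            best, best_right = score, right
        elif best - score > XDROP:
            break
    return best, j - best_left, i - best_left, w + best_left + best_right

query, subject = "ACGTACGTTT", "GGACGTACGTAA"
hsp = max(extend(query, subject, j, i) for j, i in seeds(query, subject))
print(hsp)   # (score, qstart, sstart, length) -> (8, 0, 2, 8)
```

Real BLAST adds neighborhood words (for proteins), two-hit triggering, gapped extension, and Karlin-Altschul statistics on top of this skeleton.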

BLASTP

search a Protein Sequence against a Protein Database.

BLASTN

search a Nucleotide Sequence against a Nucleotide Database.

TBLASTN

search a Protein Sequence against a Nucleotide Database, by translating each database Nucleotide sequence in all 6 reading frames.

BLASTX

search a Nucleotide Sequence against a Protein Database, by first translating the query Nucleotide sequence in all 6 reading frames.

BLAST site

https://blast.ncbi.nlm.nih.gov/Blast.cgi https://www.uniprot.org/

Rapidly compare a sequence Q to a database to find all sequences in the database with a score above some cutoff S.

Homologous sequence are likely to contain a short high scoring word pair, a seed.

– Unlike Baeza-Yates, BLAST doesn’t make explicit guarantees

BLAST then tries to extend high scoring word pairs to compute maximal high scoring segment pairs (HSPs).

– Heuristic algorithm but evaluates the result statistically.


E-value

The Expect value (E) is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise.

E-value = the number of HSPs having score S (or higher) expected to occur by chance.

The smaller the E-value, the more statistically significant the match; the larger the E-value, the more likely the match arose by chance.

E[# occurrences of a specific string of length m in a reference of length L] ≈ L/4^m
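Both rules of thumb are easy to check numerically. For bit scores, the standard relation is E = m · n · 2^(−S′), with m the query length and n the database length (the numbers below are purely illustrative):

```python
# E-value from a bit score: E = m * n * 2**(-bits), where m is the query
# length and n the database (search-space) length. Numbers are illustrative.
def evalue(bits, query_len, db_len):
    return query_len * db_len * 2.0 ** (-bits)

print(evalue(50, 500, 1_000_000_000))   # a 50-bit hit vs a 1 Gb database

# Expected occurrences of one specific m-mer in an L-bp random reference:
# E ~ L / 4**m
L, m = 3_000_000_000, 10
print(L / 4 ** m)                        # ~2861 occurrences of a given 10-mer
```

Note how doubling the database size doubles the E-value of the same hit, which is why thresholds must be interpreted relative to the database searched.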

PAM and BLOSUM Matrices

Two different kinds of amino acid scoring matrices, PAM (Percent Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix), are in wide use. The PAM matrices were created by Margaret Dayhoff and coworkers and are thus sometimes referred to as the Dayhoff matrices. These scoring matrices have a strong theoretical component and make a few evolutionary assumptions. The BLOSUM matrices, on the other hand, are more empirical and derive from a larger data set. Most researchers today prefer to use BLOSUM matrices because in silico experiments indicate that searches employing BLOSUM matrices have higher sensitivity.

There are several PAM matrices, each one with a numeric suffix. The PAM1 matrix was constructed with a set of proteins that were all 85 percent or more identical to one another. The other matrices in the PAM set were then constructed by multiplying the PAM1 matrix by itself: 100 times for the PAM100; 160 times for the PAM160; and so on, in an attempt to model the course of sequence evolution. Though highly theoretical (and somewhat suspect), it is certainly a reasonable approach. There was little protein sequence data in the 1970s when these matrices were created, so this approach was a good way to extrapolate to larger distances.

Protein databases contained many more sequences by the 1990s so a more empirical approach was possible. The BLOSUM matrices were constructed by extracting ungapped segments, or blocks, from a set of multiply aligned protein families, and then further clustering these blocks on the basis of their percent identity. The blocks used to derive the BLOSUM62 matrix, for example, all have at least 62 percent identity to some other member of the block.

PAM-250-and-Blosum-62-matrices

codon

BLAST has a number of possible programs to run depending on whether you have nucleotide or protein sequences:

nucleotide query and nucleotide db - blastn
nucleotide query and nucleotide db - tblastx (includes six-frame translation of query and db sequences)
nucleotide query and protein db - blastx (includes six-frame translation of query sequences)
protein query and nucleotide db - tblastn (includes six-frame translation of db sequences)
protein query and protein db - blastp
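This mapping can be captured in a tiny lookup table (a convenience sketch for choosing a program; not part of BLAST itself):

```python
# Pick the appropriate BLAST program from query/database molecule types
# (a convenience lookup, not part of BLAST itself).
PROGRAM = {
    ("nucl", "nucl"): "blastn",    # or tblastx for translated-vs-translated
    ("nucl", "prot"): "blastx",    # query translated in six frames
    ("prot", "nucl"): "tblastn",   # database translated in six frames
    ("prot", "prot"): "blastp",
}
print(PROGRAM[("nucl", "prot")])   # blastx
```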

blasttype

BLAST Process

(figures: BLAST process, steps 1-4)

NCBI BLAST

https://blast.ncbi.nlm.nih.gov/Blast.cgi

Uniprot

https://www.uniprot.org/

BLASTN example

Run blastn against the nt database.


ATGAAAGCGAAGGTTAGCCGTGGTGGCGGTTTTCGCGGTGCGCTGAACTA
CGTTTTTGACGTTGGCAAGGAAGCCACGCACACGAAAAACGCGGAGCGAG
TCGGCGGCAACATGGCCGGGAATGACCCCCGCGAACTGTCGCGGGAGTTC
TCAGCCGTGCGCCAGTTGCGCCCGGACATCGGCAAGCCCGTCTGGCATTG
CTCGCTGTCACTGCCTCCCGGCGAGCGCCTGAGCGCCGAGAAGTGGGAAG
CCGTCGCGGCTGACTTCATGCAGCGCATGGGCTTTGACCAGACCAATACG
CCGTGGGTGGCCGTGCGCCACCAGGACACGGACAAGGATCACATCCACAT
CGTGGCCAGCCGGGTAGGGCTGGACGGGAAAGTGTGGCTGGGCCAGTGGG
AAGCCCGCCGCGCCATCGAGGCGACCCAAGAGCTTGAGCATACCCACGGC
CTGACCCTGACGCCGGGGCTGGGCGATGCGCGGGCCGAGCGCCGGAAGCT
GACCGACAAGGAGATCAACATGGCCGTGAGAACGGGCGATGAACCGCCGC
GCCAGCGTCTGCAACGGCTGCTGGATGAGGCGGTGAAGGACAAGCCGACC
GCGCTAGAACTGGCCGAGCGGCTACAGGCCGCAGGCGTAGGCGTCCGGGC
AAACCTCGCCAGCACCGGGCGCATGAACGGCTTTTCCTTCGAGGTGGCCG
GAGTGCCGTTCAAAGGCAGCGACTTGGGCAAGGGCTACACATGGGCGGGG
CTACAGAAAGCAGGGGTGACTTATGACGAAGCTAGAGACCGTGCGGGCCT
TGAACGATTCAGGCCCACAGTTGCAGATCGTGGAGAGCGTCAGGACGTTG
CAGCAGTCCGTGAGCCTGATGCACGAGGACTTGAAGCGCCTACCGGGCGC
AGTCTCGACCGAGACGGCGCAGACCTTGGAACCGCTGGCCCGACTCCGGC
AGGACGTGACGCAGGTTCTGGAAGCCTACGACAAGGTGACGGCCATTCAG
CGCAAGACGCTGGACGAGCTGACGCAGCAGATGAGCGCGAGCGCGGCGCA
GGCCTTCGAGCAGAAGGCCGGGAAGCTGGACGCGACCATCTCCGACCTGT
CGCGCAGCCTGTCAGGGCTGAAAACGAGCCTCAGCAGCATGGAGCAGACC
GCGCAGCAGGTGGCGACCTTGCCGGGCAAGCTGGCGAGCGCACAGCAGGG
CATGACGAAAGCCGCCGACCAACTGACCGAGGCAGCGAACGAGACGCGCC
CGCGCCTTTGGCGGCAGGCGCTGGGGCTGATTCTGGCCGGGGCCGTGGGC
GCGATGCTGGTAGCGACTGGGCAAGTCGCTTTAAACAGGCTAGTGCCGCC
AAGCGACGTGCAGCAGACGGCAGACTGGGCCAACGCGATTTGGAACAAGG
CCACGCCCACGGAGCGCGAGTTGCTGAAACAGATCGCCAATCGGCCCGCG
AACTAGACCCGACCGCCTACCTTGAGGCCAGCGGCTACACCGTGAAGCGA
GAAGGGCGGCACCTGTCCGTCAGGGCGGGCGGTGATGAGGCGTACCGCGT
GACCCGGCAGCAGGACGGGCGCTGGCTCTGGTGCGACCGCTACGGCAACG
ACGGCGGGGACAATATCGACCTGGTGCGCGAGATCGAACCCGGCACCGGC
TACGCCGAGGCCGTCTATCGGCTTTCAGGTGCGCCGACAGTCCGGCAGCA
ACCGCGCCCGAGCGAGCCGAAGCGCCAACCGCCGCAGCTACCGGCGCAAG
GGCTGGCAGCCCGCGAGCATGGCCGCGACTACCTCAAGGGCCGGGGCATC
AGCCAGGACACCATCGAGCACGCCGAGAAGGCGGGCATGGTGCGCTATGC
AGACGGTGGAGTGCTGTTCGTCGGCTACGACCGTGCAGGCACCGCGCAGA
ACGCCACACGCCGCGCCATTGCCCCCGCTGACCCGGTGCAGAAGCGCGAC
CTACGCGGCAGCGACAAGAGCTATCCGCCGATCCTGCCGGGCGACCCGGC
AAAGGTCTGGATCGTGGAAGGTGGCCCGGATGCGCTGGCCCTGCACGACA
TCGCCAAGCGCAGCGGCCAGCAGCCGCCCACCGTCATCGTGTCAGGCGGG
GCGAACGTGCGCAGCTTCTTGGAGCGGGCCGACGTGCAAGCGATCCTGAA
GCGGGCCGAGCGCGTCACCGTGGCCGGGGAAAACGAGAAGAACCCCGAGG
CGCAGGCAAAGGCCGACGCCGGGCACCAGAAGCAGGCGCAGCGGGTGGCC
AAAATCACCGGGCGCGAGGTGCGCCAATGGACGCCGAAGCCCGAGCACGG
CAAGGACTTGGCCGACATGAACGCCCGGCAGGTGGCAGAGATCGAGCGCA
AGCGACAGGCCGAGATCGAGGCCGAAAGAGCACGAAACCGCGAGCTTTCA
CGCAAGAGCCGGAGGTATGATGGCCCCAGCTTCGGCAGATAA

BLASTP Query

Do a BLASTP on the NCBI website with the following protein against nr, but limit the organism to Cetartiodactyla, using default parameters:

MASGPGGWLGPAFALRLLLAAVLQPVSAFRAEFSSESCRELGFSSNLLCSSCDLLGQFSL
LQLDPDCRGCCQEEAQFETKKYVRGSDPVLKLLDDNGNIAEELSILKWNTDSVEEFLSEK
LERI

Have a look at the multiple sequence alignment, can you explain the results?

Do a similar blastp vs UniProtKB (UniProt) without post filtering.

Running a standalone BLAST program

location

mkdir -p ~/scratch/rnaseq/BLAST
cd ~/scratch/rnaseq/BLAST

ENV

micromamba create -n blast -c conda-forge -c bioconda \
    perl-path-tiny blast perl-data-dumper perl-config-tiny seqkit -y
micromamba activate blast

Running a standalone BLAST program

  1. Create the index for the target database using makeblastdb.
  2. Choose the task program: blastn, blastp, blastx, tblastx, psiblast, or deltablast.
  3. Set the match/mismatch scores, gap-open and gap-extension penalties, or the scoring matrix.
  4. Set the word size.
  5. Set the E-value threshold.
  6. Set the output format and the number of output results.

Standalone BLAST

In addition to providing BLAST sequence-alignment services on the web, NCBI also makes these sequence-alignment utilities available for download through FTP. This allows BLAST searches to be performed on local platforms against databases downloaded from NCBI or created locally. These utilities run from the command line and accept input through text-based switches; there is no graphical user interface.

https://www.ncbi.nlm.nih.gov/books/NBK52640/

ftp://ftp.ncbi.nlm.nih.gov/blast/db/

NR vs NT

At NCBI, 'nr' and 'nt' are two different databases: 'nr' contains protein sequences (amino acids), while 'nt' contains nucleotide sequences. 'nr' originally stood for non-redundant, but it stopped being strictly non-redundant some time ago.

Standalone BLAST

  1. Download the database.
  2. Use makeblastdb to build the index.
  3. Change the scoring matrix, record the changes in the alignment results and interpret the results.

How many sequences are in plant.1.protein.faa.gz?

Subsampling by SeqKit

FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and protein sequences. Common manipulations of FASTA/Q file include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools only implement some of these manipulations, and not particularly efficiently, and some are only available for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly.

This project describes a cross-platform ultrafast comprehensive toolkit for FASTA/Q processing. SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OS X, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations.

https://bioinf.shenwei.me/seqkit/

https://bioinf.shenwei.me/seqkit/tutorial/

Download Database

mkdir -p ~/scratch/rnaseq/BLAST
cd ~/scratch/rnaseq/BLAST
curl -fsSL --retry 3 --max-time 600 -O https://ftp.ncbi.nlm.nih.gov/refseq/release/plant/plant.1.protein.faa.gz

Run BLASTX

cd ~/scratch/rnaseq/BLAST
gunzip plant.1.protein.faa.gz
makeblastdb -in plant.1.protein.faa -dbtype prot
seqkit sample -n 100 /data/gpfs/assoc/bch709-6/Course_material/test_mrna.fna > test_mrna.fasta
seqkit sample -n 100 plant.1.protein.faa > test_protein.fasta
blastx -query test_mrna.fasta  -db plant.1.protein.faa 
blastx -query test_mrna.fasta  -db plant.1.protein.faa -outfmt 7

Run BLASTP

blastp -query test_protein.fasta -db plant.1.protein.faa -outfmt 7

Run BLASTN

makeblastdb -in test_mrna.fasta -dbtype nucl
blastn -query test_mrna.fasta  -db test_mrna.fasta -outfmt 7 -out blastn.output

Tab output

qseqid      Query sequence ID
sseqid      Subject (ie DB) sequence ID
pident      Percent Identity across the alignment
length      Alignment length
mismatch    # of mismatches
gapopen     Number of gap openings
qstart      Start of alignment in query
qend        End of alignment in query 
sstart      Start of alignment in subject
send        End of alignment in subject
evalue      E-value
bitscore    Bit score
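Tabular output (-outfmt 6, or -outfmt 7 with comment lines) is easy to post-process. For example, keeping only the best hit per query by bit score, assuming the default 12-column layout listed above (blastn.output is the file produced by the BLASTN step earlier):

```python
# Best hit per query from BLAST tabular output (-outfmt 6/7), ranked by
# bit score (column 12). Assumes the default 12-column layout listed above.
import csv
import os

def best_hits(path):
    best = {}
    with open(path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if not row or row[0].startswith("#"):   # skip -outfmt 7 comments
                continue
            qseqid, bitscore = row[0], float(row[11])
            if qseqid not in best or bitscore > float(best[qseqid][11]):
                best[qseqid] = row
    return best

if os.path.exists("blastn.output"):                  # produced by blastn above
    for qseqid, row in best_hits("blastn.output").items():
        print(qseqid, row[1], row[2], row[11])       # query, subject, %id, bits
```

The same filtering is often done with `sort -k12,12gr | awk '!seen[$1]++'`; the script version just makes the logic explicit.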

DCBLAST

The Basic Local Alignment Search Tool (BLAST) is by far the most widely used tool for rapid sequence-similarity searching among nucleic acid or amino acid sequences. Cluster, HPC, grid, and cloud environments are increasingly accessible as high-performance computing systems. Divide and Conquer BLAST (DCBLAST) was designed to run NCBI BLAST searches in cluster, grid, and cloud environments by splitting the query into chunks and distributing them across nodes. It dramatically accelerates BLAST searches using a simple, accessible, robust, and practical approach.

blast

Citation

Won C. Yim and John C. Cushman (2017) Divide and Conquer BLAST: using grid engines to accelerate BLAST and other sequence analysis tools. PeerJ 10.7717/peerj.3486 https://peerj.com/articles/3486/

Ortholog

Example 1

Example 2

Synteny

Genome Evolution

Chromosomal Evolution


GO Enrichment with topGO (on Pronghorn)

After you have the up/down gene lists from DESeq2, run GO enrichment directly on Pronghorn. topGO is already available in the DEG_bch709 environment used earlier in this lesson.

Prepare the gene lists

# Working directory: your DEG output folder
cd ~/scratch/rnaseq/ATH/DEG/rnaseq/venn

# Universe = every gene tested (from the TPM matrix)
cut -f 1 ../ATH.featureCount_count_length.cnt.tpm.tab | grep -v sample > universe.txt

# Interesting gene set (change file as needed)
cp DESeq.UP_4fold.subset interesting_genes.txt

wc -l universe.txt interesting_genes.txt

Run topGO

Create go_enrichment.R:

# go_enrichment.R
library(topGO)
library(org.At.tair.db)   # change to org.Hs.eg.db / org.Mm.eg.db for human / mouse

args <- commandArgs(trailingOnly = TRUE)
universe_file     <- args[1]   # all genes tested
interesting_file  <- args[2]   # DEGs
out_prefix        <- args[3]   # output prefix

universe    <- readLines(universe_file)
interesting <- readLines(interesting_file)

gene_list <- factor(as.integer(universe %in% interesting))
names(gene_list) <- universe

run_ontology <- function(ont) {
    godata <- new("topGOdata",
                   ontology = ont,
                   allGenes = gene_list,
                   annot = annFUN.org,
                   mapping = "org.At.tair.db",
                   ID = "tair")
    result <- runTest(godata, algorithm = "classic", statistic = "fisher")
    table  <- GenTable(godata,
                       classicFisher = result,
                       topNodes = 30,
                       orderBy = "classicFisher")
    write.table(table,
                file = paste0(out_prefix, "_", ont, ".tsv"),
                sep = "\t", quote = FALSE, row.names = FALSE)
    cat("Wrote", paste0(out_prefix, "_", ont, ".tsv"), "\n")
}

for (ont in c("BP", "MF", "CC")) run_ontology(ont)

Submit as a Slurm job

go_enrichment.sh:

#!/bin/bash
#SBATCH --job-name=go_enrich
#SBATCH --account=cpu-s5-bch709-6
#SBATCH --partition=cpu-core-0
#SBATCH --cpus-per-task=2
#SBATCH --mem=8g
#SBATCH --time=01:00:00
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH -o go_enrich_%j.out

# Activate `DEG_bch709` in your login shell BEFORE running `sbatch go_enrichment.sh`
cd ~/scratch/rnaseq/ATH/DEG/rnaseq/venn

Rscript go_enrichment.R universe.txt interesting_genes.txt ATH_UP4fold

Submit:

sbatch go_enrichment.sh

The output is three tab-separated files per input list — <prefix>_BP.tsv, _MF.tsv, _CC.tsv — with the top 30 GO terms, Fisher p-values, and counts. Copy them to your laptop with rsync and open in Excel or a spreadsheet.


Visualize DEGs — Heatmap and Volcano Plot

Heatmap of top 50 DEGs

heatmap.R:

library(pheatmap)

counts  <- read.table("ATH.featureCount_count_length.cnt.tpm.tab",
                      header = TRUE, row.names = 1, sep = "\t")
deg_ids <- read.table("DEG_ids.txt", header = FALSE)[,1]

mat <- log2(as.matrix(counts[deg_ids, ]) + 1)
mat <- mat[order(-rowMeans(mat)), ][1:min(50, nrow(mat)), ]

pdf("heatmap_top50.pdf", height = 10, width = 7)
pheatmap(mat, scale = "row", show_rownames = TRUE,
         main = "Top 50 DEGs (log2 TPM, row-scaled)")
dev.off()

Volcano plot

volcano.R:

library(ggplot2)

d <- read.table("ATH.featureCount_count_only.cnt.ABA_vs_Control.DESeq2.DE_results",
                 header = TRUE, sep = "\t")
d$sig <- with(d, ifelse(!is.na(padj) & padj < 0.01 & abs(log2FoldChange) >= 2,
                         ifelse(log2FoldChange > 0, "UP", "DOWN"), "NS"))

ggplot(d, aes(log2FoldChange, -log10(pvalue), color = sig)) +
    geom_point(alpha = 0.6, size = 1) +
    scale_color_manual(values = c(UP = "firebrick", DOWN = "steelblue", NS = "grey70")) +
    geom_vline(xintercept = c(-2, 2), linetype = "dashed") +
    geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
    theme_classic() +
    labs(x = "log2 FC (ABA vs Control)", y = "-log10 p-value")

ggsave("volcano.pdf", width = 6, height = 5)

Run either script interactively on Pronghorn:

micromamba activate DEG_bch709
Rscript heatmap.R
Rscript volcano.R

Then copy the PDFs back to your laptop with rsync.


Full RNA-Seq Workflow Summary

Step             Tool                                 Script / Section          Output
Download FASTQ   curl from ENA                        fastq-dump.sh             *.fastq.gz
QC               FastQC + MultiQC                     built-in                  HTML report
Trim             fastp                                trim.sh                   Trimmed FASTQ
Index            STAR                                 index.sh                  STAR index dir
Align            STAR                                 align.sh                  Aligned BAM
Count            featureCounts                        built-in (align.sh)       *.cnt matrix
Normalize        perl (TPM/FPKM)                      built-in                  TPM/FPKM table
DEG              DESeq2 / edgeR (Trinity wrappers)    DEG calculation           *.DE_results
Subset DEGs      analyze_diff_expr.pl                 built-in                  .subset files
Venn overlap     intervene                            Draw Venn Diagram         Intervene_*
GO enrichment    topGO                                go_enrichment.R (above)   *_BP/MF/CC.tsv
Visualization    pheatmap / ggplot2                   heatmap.R / volcano.R     .pdf

Cleanup

# Inside ~/scratch/rnaseq you can remove:
#   - Raw FASTQ (once alignment + trim QC are reviewed)
#   - STAR alignment intermediates (_STARtmp, Log.out, ReadsPerGene.out.tab)
# Keep:
#   - Final BAMs, count matrices, DESeq2/edgeR results, DEG subsets, GO tables, plots

# Example: drop raw FASTQ and STAR tmp (limit scope to ~/scratch/rnaseq)
find ~/scratch/rnaseq -name "*_STARtmp" -type d -exec rm -rf {} +
find ~/scratch/rnaseq -name "*.fastq.gz" -path "*/raw_data/*" -delete

Leaving the Pronghorn session:

micromamba deactivate
exit

Next Steps