🤖 BCH709 AI Assistant: Ask questions about this class using NotebookLM

BCH709 Introduction to Bioinformatics: RNA-Seq Tutorial

Paper Reading

Please read this paper before class: Conesa et al. (2016) A survey of best practices for RNA-seq data analysis. Genome Biology 17:13


Overview: RNA Sequencing

RNA Sequencing

RNA-Seq (RNA Sequencing) is the standard method for measuring genome-wide gene expression. Key characteristics:

  1. The transcriptome is spatially and temporally dynamic
  2. Data comes from functional units (coding regions)
  3. Only a small fraction of the genome is transcribed at any given time

Introduction

Sequence-based assays of transcriptomes (RNA-seq) are in wide use because of their favorable properties for quantification, transcript discovery, and splice isoform identification, as well as adaptability for numerous more specialized measurements. RNA-Seq studies present challenges shared with prior methods such as microarrays and SAGE tagging, and also new ones specific to high-throughput sequencing platforms.

RNA-Seq experiments are diverse in their aims and design goals, currently including multiple types of RNA isolated from whole cells or from specific sub-cellular compartments or biochemical classes, such as total polyA+ RNA, polysomal RNA, nuclear ribosome-depleted RNA, various size fractions of RNA and a host of others.


The RNA-Seq Workflow

RNA Sequencing Workflow

Seven Stages of a Data Science Project

  1. Define the question of interest
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results
  7. Make your analysis reproducible

1. Experimental Design

Sample Information

Document the following for each sample:

Item Description
Material type Tissue, cell line, primary cell type, etc.
Treatments TALENs, CRISPR, drug treatment, etc.
Subcellular fraction Whole cell, nuclear, cytoplasmic, etc.
Input amount Bulk vs. low-input (affects reproducibility)
Lot/catalog # For commercial cell lines
Culture protocol If cells were propagated in vitro
QC/phenotyping Purity confirmation methods

RNA Information

Key properties to report:

Library Preparation Protocol

Document all steps:

Replicate Strategy

Reading Materials


2. Sequencing Parameters

Key Parameters to Report

Parameter Examples
Platform Illumina NovaSeq, PacBio, Oxford Nanopore
Format Single-end (SE) or Paired-end (PE)
Read length 100 bp, 150 bp, 250 bp
Barcode placement Standard or custom
Custom primers Sequence and position

RNA Library Prep

Library Type Minimum Aligned Read Pairs
Long RNA-Seq (polyA+) 30 million
RAMPAGE 20 million
Small RNA-Seq 30 million

Quantitative Standards (Spike-ins)

ERCC spike-in controls are highly recommended for calibrating quantification, sensitivity, and linearity.

Report:


3. Data Files and FASTQ Format

What is FASTQ?

FASTQ is a text-based format storing both nucleotide sequences and per-base quality scores.

Each record has exactly 4 lines:

@A00261:180:HL7GCDSXX:2:1101:30572:1047/2          # Line 1: header
AAAATACATTGATGACCATCTAAAGTCTACGGCGTAT...            # Line 2: sequence
+                                                   # Line 3: separator
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF...              # Line 4: quality scores

Illumina Read Header Format

@Instrument:RunID:FlowcellID:Lane:Tile:X:Y/ReadNum
Field Example Meaning
Instrument A00261 Instrument name
RunID 180 Run ID
FlowcellID HL7GCDSXX Flowcell ID
Lane 2 Flowcell lane
Tile 1101 Tile number within lane
X 30572 X-coordinate of cluster
Y 1047 Y-coordinate of cluster
ReadNum /2 Pair member (1 or 2)

Phred Quality Scores

FASTQ Quality

Quality scores represent the probability of an incorrect base call:

Phred Score Error Probability Accuracy
Q10 1 in 10 90%
Q20 1 in 100 99%
Q30 1 in 1,000 99.9%
Q40 1 in 10,000 99.99%

4. Hands-On: Setting Up the Environment

Create Working Directory

mkdir -p ~/bch709/rnaseq
cd ~/bch709/rnaseq

Create Conda Environment

conda create -n rnaseq python=3.11 -y
conda activate rnaseq

If your conda command is micromamba-based, use:

micromamba activate rnaseq

Install Tools

conda install -n rnaseq -c conda-forge -c bioconda \
  fastqc trim-galore hisat2 star samtools subread rsem multiqc -y

If your default channel_priority is strict, keeping conda-forge before bioconda helps avoid dependency conflicts (for example, with multiqc).

Download Example Data

wget https://www.dropbox.com/s/y7yehmfze1l6cgz/pair1.fastq.gz
wget https://www.dropbox.com/s/xsrth6icapyr4p0/pair2.fastq.gz
ls -lh

Inspect a FASTQ File

zcat pair2.fastq.gz | head -12

5. Quality Control

What to Check

Metric Tool
Number of reads FastQC
Per-base sequence quality FastQC
Per-sequence GC content FastQC
Per-base N content FastQC
Sequence length distribution FastQC
Adapter content FastQC
Duplication levels FastQC

FastQC Documentation

Run FastQC

fastqc -t 4 pair1.fastq.gz pair2.fastq.gz

Aggregate QC Reports with MultiQC

MultiQC

MultiQC aggregates results from multiple tools into a single interactive report.

multiqc .

Open Reports in Your Browser

MultiQC generates an HTML report in your working directory. Open it directly:

macOS:

open multiqc_report.html

Linux:

xdg-open multiqc_report.html

Windows (WSL):

explorer.exe multiqc_report.html

6. Read Trimming

Why Trim?

Trimming

Common Trimming Tools

Tool Link
Trim Galore link
fastp link
Cutadapt link
Trimmomatic link
Skewer link

Run Trim Galore

cd ~/bch709/rnaseq

trim_galore \
  --paired \
  --three_prime_clip_R1 5 \
  --three_prime_clip_R2 5 \
  --cores 4 \
  --max_n 40 \
  --fastqc \
  --gzip \
  -o trim \
  pair1.fastq.gz pair2.fastq.gz

Post-Trimming QC Report

multiqc --dirs ~/bch709/rnaseq --filename trim

7. Alignment (Mapping)

Mapping

For RNA-Seq, reads must be aligned to a reference genome using a splice-aware aligner that can span intron-exon junctions.

Common RNA-Seq Aligners

Aligner Algorithm Link
HISAT2 Graph FM index (BWT-based) paper
STAR Suffix arrays paper
GSNAP SNP/indel-aware paper

Baruzzo et al. (2017) Simulation-based comprehensive benchmarking of RNA-seq aligners. Nature Methods 14:135

Algorithm Overview

Download Reference Files

cd ~/bch709/rnaseq

wget https://www.dropbox.com/s/851ob9e3ktxhyxz/bch709.fasta
wget https://www.dropbox.com/s/e9dvdkrl9dta4qg/bch709.gtf

Build HISAT2 Index

hisat2-build bch709.fasta bch709

Align Reads

The --dta flag is required when using HISAT2 output with featureCounts or StringTie for downstream quantification. Pipe directly to samtools sort to skip writing the intermediate SAM file to disk.

hisat2 \
  -x bch709 \
  --threads 4 \
  --dta \
  -1 trim/pair1_val_1.fq.gz \
  -2 trim/pair2_val_2.fq.gz \
  --summary-file alignment.txt \
  | samtools sort -@ 4 -o align_sort.bam

samtools index align_sort.bam
cat alignment.txt

8. SAM/BAM Format

SAM Record Format

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [TAGS]

SAM Format

SAM Flags

SAM flags are bitwise integers encoding alignment properties (strand, pairing, etc.):

SAM Flags

Convert a decimal flag to binary to interpret it:

python3 -c "print(bin(163))"
# or
echo 'obase=2; 163' | bc

Check what a flag means: SAM Flag Decoder

SAM Optional Tags

SAM Tags

Full SAM specification: samtools.github.io/hts-specs


9. BAM Processing with SAMtools

SAMtools provides utilities for manipulating SAM/BAM files: sorting, indexing, filtering, and statistics.

Coordinate-sorted and indexed BAM files are the standard input for most downstream tools (featureCounts, IGV, many QC utilities).

If you aligned without piping (produced a SAM file), convert and sort it:

samtools view -Sb align.sam | samtools sort -@ 4 -o align_sort.bam
samtools index align_sort.bam

What These Commands Do

Command Purpose Typical use
samtools view Convert/filter alignments Convert SAM to BAM; filter by flags or regions
samtools sort Coordinate-sort BAM Required before indexing and most visualization
samtools index Create .bai index Enables fast random access to genomic regions
samtools flagstat Quick mapping summary Read-level QC (mapped %, paired %, duplicates)
samtools idxstats Per-reference counts Check chromosome/contig-level mapping balance
samtools stats Detailed metrics Insert size, mismatch profile, coverage summaries

Practical QC Commands

# Quick integrity check (silent if OK)
samtools quickcheck -v align_sort.bam

# Read-level summary
samtools flagstat -@ 4 align_sort.bam > align_sort.flagstat.txt

# Per-reference mapped/unmapped counts
samtools idxstats align_sort.bam > align_sort.idxstats.txt

# Detailed alignment statistics
samtools stats -@ 4 align_sort.bam > align_sort.bam.stat

# Show the first summary lines (SN = Summary Number)
grep '^SN' align_sort.bam.stat | head -20

Useful Flag Filters

# Total aligned records
samtools view -c align_sort.bam

# Mapped reads only (-F 4 removes unmapped)
samtools view -c -F 4 align_sort.bam

# Properly paired reads only (for paired-end data)
samtools view -c -f 2 align_sort.bam

File Size Comparison

Format Size Notes
SAM (align.sam) ~903 MB Text format; avoid writing to disk if possible
BAM (align_sort.bam) ~166 MB Binary, ~5× smaller

Visualize Alignments

# Terminal viewer
COLUMNS=150 samtools tview -d t align_sort.bam bch709.fasta

GUI Viewers:

Alignment QC

# MultiQC aggregates flagstat/stats outputs into one report
multiqc --dirs ~/bch709/rnaseq --filename align

Key Alignment Metrics

Metric Tool
Total/mapped/properly paired reads samtools flagstat
Mapped reads per chromosome/contig samtools idxstats
Mapping rate, pairing samtools stats
Insert size distribution samtools stats
Gene body coverage RSeQC
Strand specificity RSeQC
Biotype counts QoRTs
Sequencing saturation QoRTs

Tools: QoRTs, RSeQC, Qualimap


10. Read Quantification

Gene Counts

Reads overlapping annotated gene features are counted as a proxy for gene expression.

Count Methods

Quantification Tools

Tool Approach
featureCounts Fast, summarizes reads per gene/exon
HTSeq Python-based, flexible
RSEM Transcript-level, EM-based
Salmon Quasi-mapping, very fast
Kallisto Pseudoalignment

EM-Based Quantification (STAR + RSEM)

EM (Expectation-Maximization) is useful when a read can map to multiple isoforms or genes.

This is why EM-based tools (for example, RSEM) are commonly used for transcript-level quantification.

STAR --quantMode GeneCounts gives simple gene counts and is not EM-based.
For EM quantification, use STAR transcriptome BAM + RSEM.

# 1) Build STAR genome index (one-time)
mkdir -p star_index
STAR \
  --runThreadN 4 \
  --runMode genomeGenerate \
  --genomeDir star_index \
  --genomeFastaFiles bch709.fasta \
  --sjdbGTFfile bch709.gtf \
  --sjdbOverhang 99

# 2) Build RSEM reference (one-time)
mkdir -p rsem_ref
rsem-prepare-reference --gtf bch709.gtf bch709.fasta rsem_ref/bch709

# 3) STAR alignment with transcriptome BAM output
STAR \
  --runThreadN 4 \
  --genomeDir star_index \
  --readFilesIn trim/pair1_val_1.fq.gz trim/pair2_val_2.fq.gz \
  --readFilesCommand zcat \
  --quantMode TranscriptomeSAM \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix star_

# 4) EM-based expression estimation by RSEM
rsem-calculate-expression \
  --alignments \
  --paired-end \
  --num-threads 4 \
  star_Aligned.toTranscriptome.out.bam \
  rsem_ref/bch709 \
  sample1

Key output files:

Useful columns:

Run featureCounts

The -p flag is for paired-end reads. Add -T 4 to use multiple threads. The -s flag sets strandedness (0=unstranded, 1=forward, 2=reverse).

featureCounts \
  -p \
  -T 4 \
  -a bch709.gtf \
  -o counts.txt \
  align_sort.bam

# View the count matrix (skip the first commented header line)
grep -v "^#" counts.txt | head

PCR Duplicates

For RNA-Seq, PCR duplicates are generally not removed because many identical reads reflect true high-abundance transcripts.

Multi-Mapping Reads

Strategy Tool
Discard multi-mappers featureCounts, HTSeq
Distribute counts Cufflinks
Probabilistic (EM) RSEM
Prioritize features Rcount

Reference

  • Fu et al. (2018) Elimination of PCR duplicates in RNA-seq using UMIs. BMC Genomics 19:531
  • Parekh et al. (2016) The impact of amplification on differential expression. Scientific Reports 6:25533

11. Differential Expression Analysis

After generating a count matrix, the next step is testing for statistically significant differences between conditions.

Common Tools

Tool Model Notes
DESeq2 Negative binomial Recommended for small n
edgeR Negative binomial GLM-based
Limma-Voom Normal (after voom) Good for large studies

Downstream: Functional Analysis


12. Full Workflow Summary

Raw FASTQ
    │
    ├── FastQC → MultiQC (QC report)
    │
    ├── Trim Galore (adapter/quality trimming)
    │       │
    │       └── FastQC → MultiQC (post-trim QC)
    │
    ├── HISAT2 (alignment to genome)
    │       │
    │       └── SAMtools (sort, index, stats)
    │
    ├── featureCounts (read quantification)
    │
    └── DESeq2 / edgeR (differential expression)
            │
            └── GO / GSEA (functional enrichment)

13. Cleanup

conda deactivate
# Optional: remove the environment when done
conda env remove --name rnaseq

If you are using micromamba:

micromamba deactivate
micromamba env remove -n rnaseq

References

Resource Link
Conesa et al. (2016) Best practices for RNA-seq Genome Biology
ENCODE RNA-Seq Standards Hong et al. 2016
QC Fail blog sequencing.qcfail.com
Conda documentation docs.conda.io
BioConda bioconda.github.io