🤖 BCH709 AI Assistant: Ask questions about this class using NotebookLM

BCH709 Introduction to Bioinformatics: RNA-Seq Tutorial

Paper Reading

Please read this paper before class: Conesa et al. (2016) A survey of best practices for RNA-seq data analysis. Genome Biology 17:13


Overview: RNA Sequencing

RNA Sequencing

RNA-Seq (RNA Sequencing) is the standard method for measuring genome-wide gene expression. Key characteristics:

  1. The transcriptome is spatially and temporally dynamic
  2. Data comes from functional units (coding regions)
  3. Only a small fraction of the genome is transcribed at any given time

Introduction

Sequence-based assays of transcriptomes (RNA-seq) are in wide use because of their favorable properties for quantification, transcript discovery, and splice isoform identification, as well as adaptability for numerous more specialized measurements. RNA-Seq studies present challenges shared with prior methods such as microarrays and SAGE tagging, and also new ones specific to high-throughput sequencing platforms.

RNA-Seq experiments are diverse in their aims and design goals, currently including multiple types of RNA isolated from whole cells or from specific sub-cellular compartments or biochemical classes, such as total polyA+ RNA, polysomal RNA, nuclear ribosome-depleted RNA, various size fractions of RNA and a host of others.


The RNA-Seq Workflow

RNA Sequencing Workflow

Seven Stages of a Data Science Project

  1. Define the question of interest
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results
  7. Make your analysis reproducible

1. Experimental Design

Sample Information

Document the following for each sample:

Item Description
Material type Tissue, cell line, primary cell type, etc.
Treatments TALENs, CRISPR, drug treatment, etc.
Subcellular fraction Whole cell, nuclear, cytoplasmic, etc.
Input amount Bulk vs. low-input (affects reproducibility)
Lot/catalog # For commercial cell lines
Culture protocol If cells were propagated in vitro
QC/phenotyping Purity confirmation methods

RNA Information

Key properties to report:

Library Preparation Protocol

Document all steps:

Replicate Strategy

Reading Materials


2. Sequencing Parameters

Key Parameters to Report

Parameter Examples
Platform Illumina NovaSeq, PacBio, Oxford Nanopore
Format Single-end (SE) or Paired-end (PE)
Read length 100 bp, 150 bp, 250 bp
Barcode placement Standard or custom
Custom primers Sequence and position

RNA Library Prep

Library Type Minimum Aligned Read Pairs
Long RNA-Seq (polyA+) 30 million
RAMPAGE 20 million
Small RNA-Seq 30 million

Quantitative Standards (Spike-ins)

ERCC spike-in controls are highly recommended for calibrating quantification, sensitivity, and linearity.

Report:


3. Data Files and FASTQ Format

What is FASTQ?

FASTQ is a text-based format storing both nucleotide sequences and per-base quality scores.

Each record has exactly 4 lines:

@A00261:180:HL7GCDSXX:2:1101:30572:1047/2          # Line 1: header
AAAATACATTGATGACCATCTAAAGTCTACGGCGTAT...            # Line 2: sequence
+                                                   # Line 3: separator
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF...              # Line 4: quality scores

Illumina Read Header Format

@Instrument:RunID:FlowcellID:Lane:Tile:X:Y/ReadNum
Field Example Meaning
Instrument A00261 Instrument name
RunID 180 Run ID
FlowcellID HL7GCDSXX Flowcell ID
Lane 2 Flowcell lane
Tile 1101 Tile number within lane
X 30572 X-coordinate of cluster
Y 1047 Y-coordinate of cluster
ReadNum /2 Pair member (1 or 2)

Phred Quality Scores

FASTQ Quality

Quality scores represent the probability of an incorrect base call:

Phred Score Error Probability Accuracy
Q10 1 in 10 90%
Q20 1 in 100 99%
Q30 1 in 1,000 99.9%
Q40 1 in 10,000 99.99%

4. Hands-On: Setting Up the Environment

Create Conda Environment

conda create -n rnaseq python=3.11
conda activate rnaseq

Install Tools

conda install -c bioconda -c conda-forge \
  fastqc trim-galore hisat2 samtools subread multiqc

Create Working Directory

mkdir -p ~/bch709/rnaseq
cd ~/bch709/rnaseq

Download Example Data

wget https://www.dropbox.com/s/y7yehmfze1l6cgz/pair1.fastq.gz
wget https://www.dropbox.com/s/xsrth6icapyr4p0/pair2.fastq.gz
ls -lh

Inspect a FASTQ File

zcat pair2.fastq.gz | head -12

5. Quality Control

What to Check

Metric Tool
Number of reads FastQC
Per-base sequence quality FastQC
Per-sequence GC content FastQC
Per-base N content FastQC
Sequence length distribution FastQC
Adapter content FastQC
Duplication levels FastQC

FastQC Documentation

Run FastQC

fastqc -t 4 pair1.fastq.gz pair2.fastq.gz

Aggregate QC Reports with MultiQC

MultiQC

MultiQC aggregates results from multiple tools into a single interactive report.

multiqc .

Open Reports in Your Browser

MultiQC generates an HTML report in your working directory. Open it directly:

macOS / Linux:

open multiqc_report.html

Windows (WSL):

explorer.exe multiqc_report.html

6. Read Trimming

Why Trim?

Trimming

Common Trimming Tools

Tool Link
Trim Galore link
fastp link
Cutadapt link
Trimmomatic link
Skewer link

Run Trim Galore

cd ~/bch709/rnaseq

trim_galore \
  --paired \
  --three_prime_clip_R1 5 \
  --three_prime_clip_R2 5 \
  --cores 4 \
  --max_n 40 \
  --fastqc \
  --gzip \
  -o trim \
  pair1.fastq.gz pair2.fastq.gz

Post-Trimming QC Report

multiqc --dirs ~/bch709/rnaseq --filename trim

7. Alignment (Mapping)

Mapping

For RNA-Seq, reads must be aligned to a reference genome using a splice-aware aligner that can span intron-exon junctions.

Common RNA-Seq Aligners

Aligner Algorithm Link
HISAT2 Graph FM index (BWT-based) paper
STAR Suffix arrays paper
GSNAP SNP/indel-aware paper

Baruzzo et al. (2017) Simulation-based comprehensive benchmarking of RNA-seq aligners. Nature Methods 14:135

Algorithm Overview

Download Reference Files

cd ~/bch709/rnaseq

wget https://www.dropbox.com/s/851ob9e3ktxhyxz/bch709.fasta
wget https://www.dropbox.com/s/e9dvdkrl9dta4qg/bch709.gtf

Build HISAT2 Index

hisat2-build bch709.fasta bch709

Align Reads

The --dta flag is required when using HISAT2 output with featureCounts or StringTie for downstream quantification. Pipe directly to samtools sort to skip writing the intermediate SAM file to disk.

hisat2 \
  -x bch709 \
  --threads 4 \
  --dta \
  -1 trim/pair1_val_1.fq.gz \
  -2 trim/pair2_val_2.fq.gz \
  --summary-file alignment.txt \
  | samtools sort -@ 4 -o align_sort.bam

samtools index align_sort.bam
cat alignment.txt

8. SAM/BAM Format

SAM Record Format

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [TAGS]

SAM Format

SAM Flags

SAM flags are bitwise integers encoding alignment properties (strand, pairing, etc.):

SAM Flags

Convert a decimal flag to binary to interpret it:

python3 -c "print(bin(163))"
# or
echo 'obase=2; 163' | bc

Check what a flag means: SAM Flag Decoder

SAM Optional Tags

SAM Tags

Full SAM specification: samtools.github.io/hts-specs


9. BAM Processing with SAMtools

SAMtools provides utilities for manipulating SAM/BAM files: sorting, indexing, merging, and statistics.

If you aligned without piping (produced a SAM file), convert and sort it:

samtools view -Sb align.sam | samtools sort -@ 4 -o align_sort.bam
samtools index align_sort.bam
# Compute alignment statistics
samtools stats align_sort.bam > align_sort.bam.stat
cat align_sort.bam.stat

File Size Comparison

Format Size Notes
SAM (align.sam) ~903 MB Text format; avoid writing to disk if possible
BAM (align.bam) ~166 MB Binary, ~5× smaller

Visualize Alignments

# Terminal viewer
COLUMNS=150 samtools tview -d t align_sort.bam bch709.fasta

GUI Viewers:

Alignment QC

multiqc --dirs ~/bch709/rnaseq --filename align

Key Alignment Metrics

Metric Tool
Mapping rate, pairing samtools stats
Insert size distribution samtools stats
Gene body coverage RSeQC
Strand specificity RSeQC
Biotype counts QoRTs
Sequencing saturation QoRTs

Tools: QoRTs, RSeQC, Qualimap


10. Read Quantification

Gene Counts

Reads overlapping annotated gene features are counted as a proxy for gene expression.

Count Methods

Quantification Tools

Tool Approach
featureCounts Fast, summarizes reads per gene/exon
HTSeq Python-based, flexible
RSEM Transcript-level, EM-based
Salmon Quasi-mapping, very fast
Kallisto Pseudoalignment

Run featureCounts

The -p flag is for paired-end reads. Add -T 4 to use multiple threads. The -s flag sets strandedness (0=unstranded, 1=forward, 2=reverse).

featureCounts \
  -p \
  -T 4 \
  -a bch709.gtf \
  -o counts.txt \
  align_sort.bam

# View the count matrix (skip the first commented header line)
grep -v "^#" counts.txt | head

PCR Duplicates

For RNA-Seq, PCR duplicates are generally not removed because many identical reads reflect true high-abundance transcripts.

Multi-Mapping Reads

Strategy Tool
Discard multi-mappers featureCounts, HTSeq
Distribute counts Cufflinks
Probabilistic (EM) RSEM
Prioritize features Rcount

Reference

  • Fu et al. (2018) Elimination of PCR duplicates in RNA-seq using UMIs. BMC Genomics 19:531
  • Parekh et al. (2016) The impact of amplification on differential expression. Scientific Reports 6:25533

11. Differential Expression Analysis

After generating a count matrix, the next step is testing for statistically significant differences between conditions.

Common Tools

Tool Model Notes
DESeq2 Negative binomial Recommended for small n
edgeR Negative binomial GLM-based
Limma-Voom Normal (after voom) Good for large studies

Downstream: Functional Analysis


12. Full Workflow Summary

Raw FASTQ
    │
    ├── FastQC → MultiQC (QC report)
    │
    ├── Trim Galore (adapter/quality trimming)
    │       │
    │       └── FastQC → MultiQC (post-trim QC)
    │
    ├── HISAT2 (alignment to genome)
    │       │
    │       └── SAMtools (sort, index, stats)
    │
    ├── featureCounts (read quantification)
    │
    └── DESeq2 / edgeR (differential expression)
            │
            └── GO / GSEA (functional enrichment)

13. Cleanup

conda deactivate
# Optional: remove the environment when done
conda env remove --name rnaseq

References

Resource Link
Conesa et al. (2016) Best practices for RNA-seq Genome Biology
ENCODE RNA-Seq Standards Hong et al. 2016
QC Fail blog sequencing.qcfail.com
Conda documentation docs.conda.io
BioConda bioconda.github.io