BCH709 Introduction to Bioinformatics: DESeq2 / EdgeR

RNA-Seq reads Count Analysis

pwd
### Your current location is 
## /data/gpfs/assoc/bch709/<YOUR_ID>/rnaseq/DEG

nano <JOBNAME>.sh

RNA-Seq reads Count Analysis job script

#!/bin/bash
#SBATCH --job-name=<JOBNAME>
#SBATCH --cpus-per-task=16
#SBATCH --time=2:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --mail-user=<YOUR ID>@unr.edu
#SBATCH -o <JOBNAME>.out # STDOUT
#SBATCH -e <JOBNAME>.err # STDERR

align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fq --left /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/rawdata_rnaseq/fastq/pair1.fastq.gz --right /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/rawdata_rnaseq/fastq/pair2.fastq.gz  --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir rsem_outdir  --thread_count  16

Job submission

sbatch <JOBNAME>.sh  

Job check

squeue

Job running check

## do ```ls``` first
ls

cat <YOUR_JOB>.err  

cat <YOUR_JOB>.out  

or

cat slurm_<JOBID>.out

RSEM results check

less rsem_outdir/RSEM.genes.results

Your current location is* /data/gpfs/assoc/bch709/spiderman/rnaseq/DEG

FPKM

FPKM

X = mapped reads count N = number of reads L = Length of transcripts

head -n 2 rsem_outdir/RSEM.genes.results

Reads count

samtools flagstat rsem_outdir/bowtie2.bam

Call Python

python
X = 861
Number_Reads_mapped = 1485483
Length = 3475
fpkm= X*(1000/Length)*(1000000/Number_Reads_mapped)
fpkm

ten to the ninth power = 10**9

fpkm=X/(Number_Reads_mapped*Length)*10**9
fpkm

FPKM

Fragments per Kilobase of transcript per million mapped reads

TPM

Transcripts Per Million

TPM

TPM2

Sum of FPKM

cat rsem_outdir/RSEM.genes.results | egrep -v FPKM | awk '{ sum+=$7} END {print sum}'

TPM calculation from FPKM

FPKM = 180.61
SUM_FPKM = 646089
TPM=(FPKM/SUM_FPKM)*10**6
TPM

TPM calculation from reads count

cat rsem_outdir/RSEM.genes.results | egrep -v FPKM | awk '{ sum+=$5/$3} END {print sum}'
sum_count_per_length = 714.037

TPM = (X/Length)*(1/sum_count_per_length )*10**6

Paper read

Li et al., 2010, RSEM Dillies et al., 2013


Home work solve

File download

$ mkdir -p /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework2

$ cd /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework2

Download below files

mkdir fastq
cd fastq
pwd
#current directory should be /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/fastq

Run trimming

$ cd /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework2/fastq
$ nano <JOB_NAME>.sh

Submit trimming job with following criteria

Mock SBATCH example is below

#!/bin/bash
#SBATCH --job-name=<JOB_NAME>
#SBATCH --cpus-per-task=1
#SBATCH --time=2:00:00
#SBATCH --mem-per-cpu=1g
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --mail-user=<YOUR ID>@unr.edu



trim_galore --paired   --three_prime_clip_R1 20 --three_prime_clip_R2 20 --cores 16  --max_n 40  --gzip -o  /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/trim /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/fastq/KRWTD1_1.fastq.gz /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/fastq/KRWTD1_2.fastq.gz 

Job submission Requirement

chmod 775 <JOB_NAME>.sh
sbatch <JOB_NAME>.sh

Need to do all of samples

KRWTD1_1.fastq.gz KRWTD1_2.fastq.gz

KRWTD2_1.fastq.gz KRWTD2_2.fastq.gz

KRWTD3_1.fastq.gz KRWTD3_2.fastq.gz

KRWTW1_1.fastq.gz KRWTW1_2.fastq.gz

KRWTW2_1.fastq.gz KRWTW2_2.fastq.gz

KRWTW3_1.fastq.gz KRWTW3_2.fastq.gz

###Batch example for i in $(ls -1 .gz | sed ‘s/_.//g’ | sort -u); do echo $i;

Run FastQC on Pronghorn by using SBATCH

PLEASE CHECK YOUR MULTIQC

Merge gz file (Pair1 and Pair2)

cd /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/trim

For example

zcat /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/trim/*.1.fq.gz  >> merged_R1.fastq

zcat /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/trim/*.2.fq.gz  >> merged_R2.fastq

Trinity run

mkdir /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/Trinity
cd /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/Trinity

Submit Trinity job with following criteria

#!/bin/bash
#SBATCH --job-name=<JOB_NAME>
#SBATCH --cpus-per-task=64
#SBATCH --time=2:00:00
#SBATCH --mem=100g
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --mail-user=<YOUR ID>@unr.edu
#SBATCH -o <JOB_NAME>.out # STDOUT
#SBATCH -e <JOB_NAME>.err # STDERR

Trinity --seqType fq  --CPU 64 --max_memory 100G --left <PAIR1_MERGED_FILE_LOCATION> --right <PAIR2_MERGED_FILE_LOCATION>

MultiQC

cd /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/
multiqc . -n <YOUR_NAME>

Please send me the results

cd /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/Trinity/

egrep -c ">" trinity_out_dir/Trinity.fasta >> <YOUR_NAME>.trinity.stat

TrinityStats.pl trinity_out_dir/Trinity.fasta >> <YOUR_NAME>.trinity.stat

Use SCP for downloading and send me the file
***.trinity.stat & .html***

** When you have an ERROR, please send me your history. “histoty » myhistory.txt” **

Activating Enviroment

conda activate rnaseq

Starting Alignment

cd /data/gpfs/assoc/bch709/<YOUR_ID>/rnaseq/

mkdir DEG2

## move to DEG2 folder
cd DEG2

## Making folder
mkdir trimmed_fastq  
mkdir reference

## Change Directory
cd trimmed_fastq

## link files
ln -s /data/gpfs/assoc/bch709/spiderman/rnaseq/homework/trim/*.gz .
ls -algh
zcat  KRWTD1_1_val_1.fq.gz | head

## Change Directory
cd ../
## Your directory should be /data/gpfs/assoc/bch709/spiderman/rnaseq/DEG2/

## Link reference
cd reference

ln -s  /data/gpfs/assoc/bch709/spiderman/rnaseq/homework/Trinity/trinity_out_dir/Trinity.fasta.gene_trans_map .

ln -s  /data/gpfs/assoc/bch709/spiderman/rnaseq/homework/Trinity/trinity_out_dir/Trinity.fasta .

trinity_tree

Type sample file

conda install nano

nano sample.txt

###^ means CTRL key
###M- means ALT key

WT<TAB>WT_REP1<TAB>trimmed_fastq/KRWTW1_1_val_1.fq.gz<TAB>trimmed_fastq/KRWTW1_2_val_2.fq.gz
WT<TAB>WT_REP2<TAB>trimmed_fastq/KRWTW2_1_val_1.fq.gz<TAB>trimmed_fastq/KRWTW2_2_val_2.fq.gz
WT<TAB>WT_REP3<TAB>trimmed_fastq/KRWTW3_1_val_1.fq.gz<TAB>trimmed_fastq/KRWTW3_2_val_2.fq.gz
DT<TAB>DT_REP1<TAB>trimmed_fastq/KRWTD1_1_val_1.fq.gz<TAB>trimmed_fastq/KRWTD1_2_val_2.fq.gz
DT<TAB>DT_REP2<TAB>trimmed_fastq/KRWTD2_1_val_1.fq.gz<TAB>trimmed_fastq/KRWTD2_2_val_2.fq.gz
DT<TAB>DT_REP3<TAB>trimmed_fastq/KRWTD3_1_val_1.fq.gz<TAB>trimmed_fastq/KRWTD3_2_val_2.fq.gz

Run alignment

#!/bin/bash
#SBATCH --job-name=<JOB_NAME>
#SBATCH --cpus-per-task=32
#SBATCH --time=2:00:00
#SBATCH --mem=64g
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --mail-user=<YOUR ID>@nevada.unr.edu
#SBATCH -o <JOB_NAME>.out # STDOUT
#SBATCH -e <JOB_NAME>.err # STDERR

align_and_estimate_abundance.pl --thread_count 32 --transcripts reference/Trinity.fasta --seqType fq  --est_method RSEM --aln_method bowtie2  --trinity_mode --prep_reference --output_dir rsem_outdir --samples_file sample.txt

Install R-packages

conda install -c bioconda bioconductor-ctc bioconductor-deseq2 bioconductor-edger bioconductor-biobase  bioconductor-qvalue r-ape  r-gplots  r-fastcluster

abundance_estimates_to_matrix

 abundance_estimates_to_matrix.pl  --est_method RSEM --gene_trans_map none --name_sample_by_basedir  --cross_sample_norm TMM WT_REP1/RSEM.isoforms.results WT_REP2/RSEM.isoforms.results WT_REP3/RSEM.isoforms.results   DT_REP1/RSEM.isoforms.results DT_REP2/RSEM.isoforms.results DT_REP3/RSEM.isoforms.results

RSEM

Paper read

Li et al., 2010, RSEM Dillies et al., 2013

Normalization

CPM, RPKM, FPKM, TPM, RLE, MRN, Q, UQ, TMM, VST, RLOG, VOOM … Too many…

CPM: Controls for sequencing depth when dividing by total count. Not for within-sample comparison or DE.

Counts per million (CPM) mapped reads are counts scaled by the number of fragments you sequenced (N) times one million. This unit is related to the FPKM without length normalization and a factor of 10^3:
CPM

RPKM/FPKM: Controls for sequencing depth and gene length. Good for technical replicates, not good for sample-sample due to compositional bias. Assumes total RNA output is same in all samples. Not for DE.

TPM: Similar to RPKM/FPKM. Corrects for sequencing depth and gene length. Also comparable between samples but no correction for compositional bias.

TMM/RLE/MRN: Improved assumption: The output between samples for a core set only of genes is similar. Corrects for compositional bias. Used for DE. RLE and MRN are very similar and correlates well with sequencing depth. edgeR::calcNormFactors() implements TMM, TMMwzp, RLE & UQ. DESeq2::estimateSizeFactors implements median ratio method (RLE). Does not correct for gene length.

VST/RLOG/VOOM: Variance is stabilised across the range of mean values. For use in exploratory analyses. Not for DE. vst() and rlog() functions from DESeq2. voom() function from Limma converts data to normal distribution.

geTMM: Gene length corrected TMM.

For DGE using DGE R packages (DESeq2, edgeR, Limma etc), use raw counts
For visualisation (PCA, clustering, heatmaps etc), use TPM or TMM
For own analysis with gene length correction, use TPM (maybe geTMM?)
Other solutions: spike-ins/house-keeping genes

PtR (Quality Check Your Samples and Biological Replicates)

Once you’ve performed transcript quantification for each of your biological replicates, it’s good to examine the data to ensure that your biological replicates are well correlated, and also to investigate relationships among your samples. If there are any obvious discrepancies among your sample and replicate relationships such as due to accidental mis-labeling of sample replicates, or strong outliers or batch effects, you’ll want to identify them before proceeding to subsequent data analyses (such as differential expression).

cut -f 1,2 sample.txt >> samples_ptr.txt

PtR  --matrix RSEM.isoform.counts.matrix --samples samples_ptr.txt --CPM --log2 --min_rowSums 10  --compare_replicates

PtR  --matrix RSEM.isoform.counts.matrix --samples samples_ptr.txt --CPM  --log2 --min_rowSums 10   --sample_cor_matrix

PtR  --matrix RSEM.isoform.counts.matrix --samples samples_ptr.txt --CPM  --log2 --min_rowSums 10   --center_rows --prin_comp 3

Please transfer results to your local computer

DEG calculation

run_DE_analysis.pl --matrix RSEM.isoform.counts.matrix --samples_file samples_ptr.txt --method DESeq2 
run_DE_analysis.pl --matrix RSEM.isoform.counts.matrix --samples_file samples_ptr.txt --method edgeR


cd DESeq2.XXXXX.dir
analyze_diff_expr.pl --matrix ../RSEM.isoform.TMM.EXPR.matrix  -P 0.001 -C 1  --samples ../samples_ptr.txt
wc -l RSEM.isoform.counts.matrix.DT_vs_WT.DESeq2.DE_results.P0.001_C1.DE.subset
cd ../

cd edgeR.XXXXX.dir
analyze_diff_expr.pl --matrix ../RSEM.isoform.TMM.EXPR.matrix  -P 0.001 -C 1  --samples ../samples_ptr.txt
wc -l RSEM.isoform.counts.matrix.DT_vs_WT.edgeR.DE_results.P0.001_C1.DE.subset

DESeq2 vs EdgeR Normalization method

DESeq and EdgeR are very similar and both assume that no genes are differentially expressed. DEseq uses a “geometric” normalisation strategy, whereas EdgeR is a weighted mean of log ratios-based method. Both normalise data initially via the calculation of size / normalisation factors.

Here is further information (important parts in bold):

DESeq

DESeq: This normalization method is included in the DESeq Bioconductor package (version 1.6.0) and is based on the hypothesis that most genes are not DE. A DESeq scaling factor for a given lane is computed as the median of the ratio, for each gene, of its read count over its geometric mean across all lanes. The underlying idea is that non-DE genes should have similar read counts across samples, leading to a ratio of 1. Assuming most genes are not DE, the median of this ratio for the lane provides an estimate of the correction factor that should be applied to all read counts of this lane to fulfill the hypothesis. By calling the estimateSizeFactors() and sizeFactors() functions in the DESeq Bioconductor package, this factor is computed for each lane, and raw read counts are divided by the factor associated with their sequencing lane.
DESeq2

EdgeR

Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR Bioconductor package (version 2.4.0). It is also based on the hypothesis that most genes are not DE. The TMM factor is computed for each lane, with one lane being considered as a reference sample and the others as test samples. For each test sample, TMM is computed as the weighted mean of log ratios between this test and the reference, after exclusion of the most expressed genes and the genes with the largest log ratios. According to the hypothesis of low DE, this TMM should be close to 1. If it is not, its value provides an estimate of the correction factor that must be applied to the library sizes (and not the raw counts) in order to fulfill the hypothesis. The calcNormFactors() function in the edgeR Bioconductor package provides these scaling factors. To obtain normalized read counts, these normalization factors are re-scaled by the mean of the normalized library sizes. Normalized read counts are obtained by dividing raw read counts by these re-scaled normalization factors.
EdgeR

DESeq2 vs EdgeR Statistical tests for differential expression

DESeq2

DESeq2 uses raw counts, rather than normalized count data, and models the normalization to fit the counts within a Generalized Linear Model (GLM) of the negative binomial family with a logarithmic link. Statistical tests are then performed to assess differential expression, if any.

EdgeR

Data are normalized to account for sample size differences and variance among samples. The normalized count data are used to estimate per-gene fold changes and to perform statistical tests of whether each gene is likely to be differentially expressed.
EdgeR uses an exact test under a negative binomial distribution (Robinson and Smyth, 2008). The statistical test is related to Fisher’s exact test, though Fisher uses a different distribution.

Negative binormal

DESeq2

ϕ was assumed to be a function of μ determined by nonparametric regression. The recent version used in this paper follows a more versatile procedure. Firstly, for each transcript, an estimate of the dispersion is made, presumably using maximum likelihood. Secondly, the estimated dispersions for all transcripts are fitted to the functional form:
ϕ=a+bμ(DESeq parametric fit), using a gamma-family generalised linear model (Using regression)

EdgeR

edgeR recommends a “tagwise dispersion” function, which estimates the dispersion on a gene-by-gene basis, and implements an empirical Bayes strategy for squeezing the estimated dispersions towards the common dispersion. Under the default setting, the degree of squeezing is adjusted to suit the number of biological replicates within each condition: more biological replicates will need to borrow less information from the complete set of transcripts and require less squeezing.

DEG software comparison paper

Draw Venn Diagram

conda create -n venn python=2.7
conda activate venn
conda install -c bioconda bedtools intervene r-UpSetR=1.4.0 r-corrplot r-Cairo
cd ../
pwd
# /data/gpfs/assoc/bch709/spiderman/rnaseq/DEG2
mkdir Venn


###DESeq2
cut -f 1 ../DESeq2.91008.dir/RSEM.isoform.counts.matrix.DT_vs_WT.DESeq2.DE_results.P0.001_C1.DT-UP.subset  | grep -v sample > DESeq.UP.subset
cut -f 1 ../DESeq2.91008.dir/RSEM.isoform.counts.matrix.DT_vs_WT.DESeq2.DE_results.P0.001_C1.WT-UP.subset  | grep -v sample > DESeq.DOWN.subset

###edgeR
cut -f 1 ../edgeR.91693.dir/RSEM.isoform.counts.matrix.DT_vs_WT.edgeR.DE_results.P0.001_C1.DT-UP.subset   | grep -v sample > edgeR.UP.subset
cut -f 1 ../edgeR.91693.dir/RSEM.isoform.counts.matrix.DT_vs_WT.edgeR.DE_results.P0.001_C1.WT-UP.subset   | grep -v sample > edgeR.DOWN.subset


### Drawing
intervene venn -i DESeq.DOWN.subset DESeq.UP.subset edgeR.DOWN.subset edgeR.UP.subset  --type list --save-overlaps
intervene upset -i DESeq.DOWN.subset DESeq.UP.subset edgeR.DOWN.subset edgeR.UP.subset  --type list --save-overlaps
intervene pairwise  -i DESeq.DOWN.subset DESeq.UP.subset edgeR.DOWN.subset edgeR.UP.subset  --type list

Assignment

File download

$ mkdir -p /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/assignment1014

$ cd /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/assignment1014

Download below files

mkdir fastq
cd fastq
pwd
#current directory should be /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/fastq

Run trimming

$ cd /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/assignment1014/fastq
$ nano <JOB_NAME>.sh

Submit trimming job with following criteria

Mock SBATCH example is below

#!/bin/bash
#SBATCH --job-name=<JOB_NAME>
#SBATCH --cpus-per-task=16
#SBATCH --time=2:00:00
#SBATCH --mem-per-cpu=4g
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --mail-user=<YOUR ID>@unr.edu



trim_galore --paired   --three_prime_clip_R1 20 --three_prime_clip_R2 20 --cores 16  --max_n 40  --gzip -o  /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/homework/trim /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/assignment1014/fastq/WT1_R1.fastq.gz /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/assignment1014/fastq/WT1_R2.fastq.gz

Job submission Requirement

chmod 775 <JOB_NAME>.sh
sbatch <JOB_NAME>.sh

Need to do all of samples

DT1_R1.fastq.gz DT1_R2.fastq.gz

DT2_R1.fastq.gz DT2_R2.fastq.gz

DT3_R1.fastq.gz DT3_R2.fastq.gz

WT1_R1.fastq.gz WT1_R2.fastq.gz

WT2_R1.fastq.gz WT2_R2.fastq.gz

WT3_R1.fastq.gz WT3_R2.fastq.gz

fastqc

You can do everything at once.

fastqc 
fastqc

Type sample file

nano sample.txt

Run alignment

align_and_estimate_abundance.pl  

PtR (Quality Check Your Samples and Biological Replicates)

PtR

abundance_estimates_to_matrix

 abundance_estimates_to_matrix.pl

DEG calculation by DESeq2

run_DE_analysis.pl 

analyze_diff_expr.pl 

Draw Venn Diagram

conda activate venn
intervene venn

MultiQC

cd /data/gpfs/assoc/bch709/<YOUR_NAME>/rnaseq/assignment1014/
multiqc . -n <YOUR_NAME>

PLEASE CHECK YOUR MULTIQC

Please send me the whole PDF results

DT.rep_compare
WT.rep_compare
RSEM.isoform.counts.matrix.minRow10.CPM.log2.centered.prcomp.principal_components
RSEM.isoform.counts.matrix.minRow10.log2.sample_cor_matrix
RSEM.isoform.counts.matrix.minRow10.CPM.log2.sample_cor_matrix
diffExpr.P0.001_C1.matrix.log2.centered.sample_cor_matrix.pdf
diffExpr.P0.001_C1.matrix.log2.centered.genes_vs_samples_heatmap.pdf
RSEM.isoform.counts.matrix.DT_vs_WT.DESeq2.DE_results.MA_n_Volcano.pdf
Intervene_venn.pdf

Office Hour

Fri Hackathon Mon 10am-noon