BCH709 Introduction to Bioinformatics: RNA-Seq

Human DEG

Activate environment

conda activate BCH709_RNASeq

Remove folder

rm -rf /data/gpfs/assoc/bch709-4/${USER}/rnaseq_assembly

Copy human featurecount results

cp /data/gpfs/assoc/bch709-4/Course_materials/human/human_featurecount.txt  /data/gpfs/assoc/bch709-4/${USER}/human/readcount/
cd /data/gpfs/assoc/bch709-4/${USER}/human/readcount/

Clean sample name

cut -f1,7-  human_featurecount.txt |  egrep -v "#" | sed 's/\Aligned\.sortedByCoord\.out\.bam//g; s/\.bam//g' > human_featurecount_only.cnt
cut -f1,6-  human_featurecount.txt |  egrep -v "#" | sed 's/\Aligned\.sortedByCoord\.out\.bam//g; s/\.bam//g' > human_featurecount_length.cnt

Copy read count to DEG folder

cp /data/gpfs/assoc/bch709-4/${USER}/human/readcount/human_featurecount_* /data/gpfs/assoc/bch709-4/${USER}/human/DEG

Go to DEG folder

cd /data/gpfs/assoc/bch709-4/${USER}/human/DEG

sample files

cp /data/gpfs/assoc/bch709-4/Course_materials/human/human_sample_association /data/gpfs/assoc/bch709-4/${USER}/human/DEG

cp /data/gpfs/assoc/bch709-4/Course_materials/human/human_contrast /data/gpfs/assoc/bch709-4/${USER}/human/DEG

cd /data/gpfs/assoc/bch709-4/${USER}/human/DEG

PtR (Quality Check Your Samples and Biological Replicates)

Once you’ve performed transcript quantification for each of your biological replicates, it’s good to examine the data to ensure that your biological replicates are well correlated, and also to investigate relationships among your samples. If there are any obvious discrepancies among your sample and replicate relationships such as due to accidental mis-labeling of sample replicates, or strong outliers or batch effects, you’ll want to identify them before proceeding to subsequent data analyses (such as differential expression).

cd /data/gpfs/assoc/bch709-4/${USER}/human/DEG
PtR  --matrix human_featurecount_only.cnt  --samples human_sample_association --CPM  --log2 --min_rowSums 10   --sample_cor_matrix --compare_replicates

PtR download on local

scp ${USER}@pronghorn.rc.unr.edu:/data/gpfs/assoc/bch709-4/${USER}/human/DEG/*.pdf .

If this is not working on local Mac

echo "setopt nonomatch" >> ~/.zshrc

DEG analysis on Pronghorn

cd /data/gpfs/assoc/bch709-4/${USER}/human/DEG

run_DE_analysis.pl --matrix human_featurecount_only.cnt --method DESeq2 --samples_file human_sample_association --contrasts human_contrast --output human_rnaseq

DEG analysis results

ls human_rnaseq

TPM/FPKM calculation

python /data/gpfs/assoc/bch709-4/Course_materials/script/tpm_raw_exp_calculator.py -count human_featurecount_length.cnt

TPM and FPKM calculation output

human_featurecount_length.cnt.fpkm.xls
human_featurecount_length.cnt.fpkm.tab
human_featurecount_length.cnt.tpm.xls
human_featurecount_length.cnt.tpm.tab

DEG subset

cd /data/gpfs/assoc/bch709-4/${USER}/human/DEG/human_rnaseq
analyze_diff_expr.pl --samples /data/gpfs/assoc/bch709-4/${USER}/human/DEG/human_sample_association  --matrix /data/gpfs/assoc/bch709-4/${USER}/human/DEG/human_featurecount_length.cnt.tpm.tab -P 0.01 -C 2 --output human_RNASEQ_P001_C2
analyze_diff_expr.pl --samples /data/gpfs/assoc/bch709-4/${USER}/human/DEG/human_sample_association  --matrix /data/gpfs/assoc/bch709-4/${USER}/human/DEG/human_featurecount_length.cnt.tpm.tab -P 0.01 -C 1 --output human_RNASEQ_P001_C1

DEG download

scp -r [YOURID]@pronghorn.rc.unr.edu:/data/gpfs/assoc/bch709-4/[YOURID]/human/DEG/ .



# Mouse DEG
## Activate environment
```bash
conda activate BCH709_RNASeq

Copy Mouse featurecount results

cp /data/gpfs/assoc/bch709-4/Course_materials/mouse/mouse_featurecount.txt  /data/gpfs/assoc/bch709-4/${USER}/mouse/readcount/
cd /data/gpfs/assoc/bch709-4/${USER}/mouse/readcount/

Clean sample name

cut -f1,7-  mouse_featurecount.txt |  egrep -v "#" | sed 's/\Aligned\.sortedByCoord\.out\.bam//g; s/\.bam//g' >> mouse_featurecount_only.cnt
cut -f1,6-  mouse_featurecount.txt |  egrep -v "#" | sed 's/\Aligned\.sortedByCoord\.out\.bam//g; s/\.bam//g' >> mouse_featurecount_length.cnt

Copy read count to DEG folder

cp /data/gpfs/assoc/bch709-4/${USER}/mouse/readcount/mouse_featurecount_* /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG

Go to DEG folder

cd /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG

sample files

cp /data/gpfs/assoc/bch709-4/Course_materials/mouse/mouse_sample_association /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG

cp /data/gpfs/assoc/bch709-4/Course_materials/mouse/mouse_contrast /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG

cd /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG

PtR (Quality Check Your Samples and Biological Replicates)

cd /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG
PtR  --matrix mouse_featurecount_only.cnt  --samples mouse_sample_association --CPM  --log2 --min_rowSums 10   --sample_cor_matrix --compare_replicates

PtR download on local

scp [YOURID]@pronghorn.rc.unr.edu:/data/gpfs/assoc/bch709-4/[YOURID]/mouse/DEG/*.pdf .

If this is not working on local Mac

echo "setopt nonomatch" >> ~/.zshrc

DEG analysis on Pronghorn

cd /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG

run_DE_analysis.pl --matrix mouse_featurecount_only.cnt --method DESeq2 --samples_file mouse_sample_association --contrasts mouse_contrast --output mouse_rnaseq

DEG analysis results

ls mouse_rnaseq

TPM/FPKM calculation

python /data/gpfs/assoc/bch709-4/Course_materials/script/tpm_raw_exp_calculator.py -count mouse_featurecount_length.cnt

TPM and FPKM calculation output

mouse_featurecount_length.cnt.fpkm.xls
mouse_featurecount_length.cnt.fpkm.tab
mouse_featurecount_length.cnt.tpm.xls
mouse_featurecount_length.cnt.tpm.tab

DEG subset

cd /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG/mouse_rnaseq
analyze_diff_expr.pl --samples /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG/mouse_sample_association  --matrix /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG/mouse_featurecount_length.cnt.tpm.tab -P 0.01 -C 2 --output mouse_RNASEQ_P001_C2
analyze_diff_expr.pl --samples /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG/mouse_sample_association  --matrix /data/gpfs/assoc/bch709-4/${USER}/mouse/DEG/mouse_featurecount_length.cnt.tpm.tab -P 0.01 -C 1 --output mouse_RNASEQ_P001_C1

DEG download

scp -r [YOURID]@pronghorn.rc.unr.edu:/data/gpfs/assoc/bch709-4/wyim/mouse/DEG/ .

Functional analysis • GO

Gene enrichment analysis (Hypergeometric test) Gene set enrichment analysis (GSEA) Gene ontology / Reactome databases

Gene Ontology

Gene Ontology project is a major bioinformatics initiative Gene ontology is an annotation system The project provides the controlled and consistent vocabulary of terms and gene product annotations, i.e. terms occur only once, and there is a dictionary of allowed words GO describes how gene products behave in a cellular context A consistent description of gene products attributes in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner Each GO term consists of a unique alphanumerical identifier, a common name, synonyms (if applicable), and a definition Each term is assigned to one of the three ontologies Terms have a textual definition When a term has multiple meanings depending on species, the GO uses a “sensu” tag to differentiate among them (trichome differentiation (sensu Magnoliophyta)

GO

kegg

hypergeometric test

The hypergeometric distribution is the lesser-known cousin of the binomial distribution, which describes the probability of k successes in n draws with replacement. The hypergeometric distribution describes probabilities of drawing marbles from the jar without putting them back in the jar after each draw. The hypergeometric probability mass function is given by (using the original variable convention)

hyper_geo combination FWER

FWER

The FWER for the other tests is computed in the same way: the gene-associated variables (scores or counts) are permuted while the annotations of genes to GO-categories stay fixed. Then the statistical tests are evaluated again for every GO-category.

Hypergeometric Test Example 1

Suppose we randomly select 2 cards without replacement from an ordinary deck of playing cards. What is the probability of getting exactly 2 cards you want (i.e., Ace or 10)?

Solution: This is a hypergeometric experiment in which we know the following:

N = 52; since there are 52 cards in a deck. k = 16; since there are 16 Ace or 10 cards in a deck. n = 2; since we randomly select cards from the deck. x = 2; since 2 of the cards we select are red. We plug these values into the hypergeometric formula as follows:

h(x; N, n, k) = [ kCx ] [ N-kCn-x ] / [ NCn ]

h(2; 52, 2, 16) = [ 16C2 ] [ 48C1 ] / [ 52C2 ]

h(2; 52, 2, 16) = [ 325 ] [ 1 ] / [ 1,326 ]

h(2; 52, 2, 16) = 0.0904977

Thus, the probability of randomly selecting 2 Ace or 10 cards is 9%

category probability
probability mass f 0.09049773755656108597285
lower cumulative P 1
upper cumulative Q 0.09049773755656108597285
Expectation 0.6153846153846153846154

Hypergeometric Test Example 2

Suppose we have 30 DEGs in human genome (200). What is the probability of getting 10 oncogene?

An oncogene is a gene that has the potential to cause cancer.

Solution: This is a hypergeometric experiment in which we know the following:

N = 200; since there are 200 genes in human genome k = 10; since there are 10 oncogenes in human n = 30; since 30 DEGs x = 5; since 5 of the oncogenes in DEGs.

We plug these values into the hypergeometric formula as follows:

h(x; N, n, k) = [ kCx ] [ N-kCn-x ] / [ NCn ]

h(5; 200, 30, 10) = [ 10C5 ] [ 190C25 ] / [ 200C30 ]

h(5; 200, 30, 10) = [ 252 ] [ 11506192278177947613740456466942 ] / [ 409681705022127773530866523638950880 ]

h(5; 200, 30, 10) = 0.007078

Thus, the probability of oncogene 0.7%.

hypergeometry.png

hypergeometric distribution value

category probability
probability mass f 0.0070775932109153651831923063371216961166297
lower cumulative P 0.99903494867072865323201131115533112651846
upper cumulative Q 0.0080426445401867119511809951817905695981658
Expectation 1.5

False Discovery Rate (FDR) q-value

The false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the expected proportion of “discoveries” (rejected null hypotheses) that are false (incorrect rejections).

MetaScape

http://metascape.org/gp/index.html

REViGO

http://revigo.irb.hr/revigo.jsp

cleverGO

http://www.tartaglialab.com/GO_analyser/tutorial

DAVID

https://david.ncifcrf.gov/

Araport

http://araport.org

Paper read

Fu, Yu, et al. “Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers.” BMC genomics 19.1 (2018): 531 Parekh, Swati, et al. “The impact of amplification on differential expression analyses by RNA-seq.” Scientific reports 6 (2016): 25533 Klepikova, Anna V., et al. “Effect of method of deduplication on estimation of differential gene expression using RNA-seq.” PeerJ 5 (2017): e3091