Course Evaluation
Students will have access to course evaluation You can log in with your NetID to http://www.unr.edu/evaluate and check live updating response rates for your course evaluations. Our institutional goal is to achieve an 85% response rate for all evaluations, and to help us achieve that, we rely on you as well as the students. Students will have access to them until 11:59 PM on Wed, May 6, 2020 PDT.
If we can achieve 100% response rate for evaluation, I will give you additional points for all of you.
Location
mkdir /data/gpfs/assoc/bch709/<YOURID>/genomedescription
cd /data/gpfs/assoc/bch709/<YOURID>/genomedescription
Environment
conda activate genomeannotation
BLAST comparison
cp ../genomeannotation/bch709.all.maker.proteins.fasta .
### If you don't have this bch709.all.maker.proteins.fasta, please use /data/gpfs/assoc/bch709/Course_material/2020/Genome_description_and_enrichment_test/bch709.all.maker.proteins.fasta file.
cp -r /data/gpfs/assoc/bch709/Course_material/2020/Genome_description_and_enrichment_test/* .
blast.sh
#!/bin/bash
#SBATCH --job-name=blast
#SBATCH --cpus-per-task=8
#SBATCH --time=3-12:00:00
#SBATCH --mem=20g
#SBATCH --mail-type=all
#SBATCH --mail-user=wyim@unr.edu
#SBATCH -o blast.out # STDOUT
#SBATCH -e blast.err # STDERR
#SBATCH -p cpu-s2-core-0
#SBATCH -A cpu-s2-bch709-0
makeblastdb -in uniprot.faa -dbtype prot
blastp -query bch709.all.maker.proteins.fasta -db uniprot.faa -out genome_vs_uniprot_blastp.txt -num_threads 8 -outfmt 6
makeblastdb -in arabidopsis.faa -dbtype prot
blastp -query bch709.all.maker.proteins.fasta -db arabidopsis.faa -out genome_vs_arabidopsis_blastp.txt -num_threads 8 -outfmt 6
BLAST (Basic Local Alignment Search Tool)
BLAST is a popular program for searching biosequences against databases. BLAST was developed and is maintained by a group at the National Center for Biotechnology Information (NCBI). Salient characteristics of BLAST are:
Local alignments
BLAST tries to find patches of regional similarity, rather than trying to find the best alignment between your entire query and an entire database sequence.
Ungapped alignments
Alignments generated with BLAST do not contain gaps. BLAST’s speed and statistical model depend on this, but in theory it reduces sensitivity. However, BLAST will report multiple local alignments between your query and a database sequence.
Explicit statistical theory
BLAST is based on an explicit statistical theory developed by Samuel Karlin and Steven Altschul (PNAS 87:2284-2268. 1990) The original theory was later extended to cover multiple weak matches between query and database entry PNAS 90:5873. 1993).
CAUTION: the repetitive nature of many biological sequences (particularly naive translations of DNA/RNA) violates assumptions made in the Karlin & Altschul theory. While the P values provided by BLAST are a good rule-of-thumb for initial identification of promising matches, care should be taken to ensure that matches are not due simply to biased amino acid composition.
CAUTION: The databases are contaminated with numerous artifacts. The intelligent use of filters can reduce problems from these sources. Remember that the statistical theory only covers the likelihood of finding a match by chance under particular assumptions; it does not guarantee biological importance.
Heuristic
BLAST is not guaranteed to find the best alignment between your query and the database; it may miss matches. This is because it uses a strategy which is expected to find most matches, but sacrifices complete sensitivity in order to gain speed. However, in practice few biologically significant matches are missed by BLAST which can be found with other sequence search programs. BLAST searches the database in two phases. First it looks for short subsequences which are likely to produce significant matches, and then it tries to extend these subsequences. A substitution matrix is used during all phases of protein searches (BLASTP, BLASTX, TBLASTN) Both phases of the alignment process (scanning & extension) use a substitution matrix to score matches. This is in contrast to FASTA, which uses a substitution matrix only for the extension phase. Substitution matrices greatly improve sensitivity.
Popular BLAST software
BLASTP
search a Protein Sequence against a Protein Database.
BLASTN
search a Nucleotide Sequence against a Nucleotide Database.
TBLASTN
search a Protein Sequence against a Nucleotide Database, by translating each database Nucleotide sequence in all 6 reading frames.
BLASTX
search a Nucleotide Sequence against a Protein Database, by first translating the query Nucleotide sequence in all 6 reading frames.
AHRD
edit bch709.yml
ahrd.sh
#!/bin/bash
#SBATCH --job-name=ahrd
#SBATCH --cpus-per-task=8
#SBATCH --time=12:00:00
#SBATCH --mem=20g
#SBATCH --mail-type=all
#SBATCH --mail-user=wyim@unr.edu
#SBATCH -o blast.out # STDOUT
#SBATCH -e blast.err # STDERR
#SBATCH -p cpu-s2-core-0
#SBATCH -A cpu-s2-bch709-0
java -Xmx8g -jar ahrd.jar bch709.yml
AHRD
High throughput protein function annotation with Human Readable Description (HRDs) and Gene Ontology (GO) Terms. https://www.cropbio.uni-bonn.de/
- Choose best scoring blast results, 200 from each database searched
- Filter description lines of above blast-results using regular expressions:
- Reject those matched by any regex given in e.g. ./test/resources/blacklist_descline.txt,
- Delete those parts of each description line, matching any regex in e.g. ./test/resources/filter_descline_sprot.txt.
- Divide each description line into tokens (characters of collective meaning)
- In terms of score ignore any tokens matching regexs given e.g. in ./test/resources/blacklist_token.txt.
- Token score (calculated from: bitscore, database weight, overlap score)
- Lexical score (calculated from: Token score, High score factor, Pattern factor, Correction factor)
- Description score (calculated from: Lexical score and Blast score)
- Choose best scoring description line
Find best BLAST match
perl blast_parse.pl -q bch709.all.maker.proteins.fasta -i genome_vs_arabidopsis_blastp.txt -o genome_vs_arabidopsis_blastp_filtered
Replace Name of Data
cp ../genomeannotation/bch709.all.maker.transcripts.fasta .
cp ../genomeannotation/bch709.all.gff .
cut -f 1,2 genome_vs_arabidopsis_blastp_filtered | sed 's/-mRNA-1//g' > id_map
cut -f 1,2 genome_vs_arabidopsis_blastp_filtered >> id_map
map_fasta_ids id_map bch709.all.maker.transcripts.fasta
map_fasta_ids id_map bch709.all.maker.proteins.fasta
map_gff_ids id_map bch709.all.gff
cp ../genomernaseq/rnaseq/BCH709.featureCount_count_only.cnt.DT_vs_WT.DESeq2.DE_results* .
map_data_ids id_map BCH709.featureCount_count_only.cnt.DT_vs_WT.DESeq2.DE_results.P0.001_C2.DT-UP.subset
map_data_ids id_map BCH709.featureCount_count_only.cnt.DT_vs_WT.DESeq2.DE_results.P0.001_C2.WT-UP.subset
map_data_ids id_map BCH709.featureCount_count_only.cnt.DT_vs_WT.DESeq2.DE_results.P0.001_C2.DE.subset
Gene Ontology
Gene Ontology project is a major bioinformatics initiative Gene ontology is an annotation system The project provides the controlled and consistent vocabulary of terms and gene product annotations, i.e. terms occur only once, and there is a dictionary of allowed words GO describes how gene products behave in a cellular context A consistent description of gene products attributes in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner Each GO term consists of a unique alphanumerical identifier, a common name, synonyms (if applicable), and a definition Each term is assigned to one of the three ontologies Terms have a textual definition When a term has multiple meanings depending on species, the GO uses a “sensu” tag to differentiate among them (trichome differentiation (sensu Magnoliophyta)
hypergeometric test
The hypergeometric distribution is the lesser-known cousin of the binomial distribution, which describes the probability of k successes in n draws with replacement. The hypergeometric distribution describes probabilities of drawing marbles from the jar without putting them back in the jar after each draw. The hypergeometric probability mass function is given by (using the original variable convention)
FWER
The FWER for the other tests is computed in the same way: the gene-associated variables (scores or counts) are permuted while the annotations of genes to GO-categories stay fixed. Then the statistical tests are evaluated again for every GO-category.
Hypergeometric Test Example 1
Suppose we randomly select 2 cards without replacement from an ordinary deck of playing cards. What is the probability of getting exactly 2 cards you want (i.e., Ace or 10)?
Solution: This is a hypergeometric experiment in which we know the following:
N = 52; since there are 52 cards in a deck. k = 16; since there are 16 Ace or 10 cards in a deck. n = 2; since we randomly select cards from the deck. x = 2; since 2 of the cards we select are red. We plug these values into the hypergeometric formula as follows:
h(x; N, n, k) = [ kCx ] [ N-kCn-x ] / [ NCn ]
h(2; 52, 2, 16) = [ 16C2 ] [ 48C1 ] / [ 52C2 ]
h(2; 52, 2, 16) = [ 325 ] [ 1 ] / [ 1,326 ]
h(2; 52, 2, 16) = 0.0904977
Thus, the probability of randomly selecting 2 Ace or 10 cards is 9%
category | probability |
---|---|
probability mass f | 0.09049773755656108597285 |
lower cumulative P | 1 |
upper cumulative Q | 0.09049773755656108597285 |
Expectation | 0.6153846153846153846154 |
Hypergeometric Test Example 2
Suppose we have 30 DEGs in human genome (200). What is the probability of getting 10 oncogene?
An oncogene is a gene that has the potential to cause cancer.
Solution: This is a hypergeometric experiment in which we know the following:
N = 200; since there are 200 genes in human genome k = 10; since there are 10 oncogenes in human n = 30; since 30 DEGs x = 5; since 5 of the oncogenes in DEGs.
We plug these values into the hypergeometric formula as follows:
h(x; N, n, k) = [ kCx ] [ N-kCn-x ] / [ NCn ]
h(5; 200, 30, 10) = [ 10C5 ] [ 190C25 ] / [ 200C30 ]
h(5; 200, 30, 10) = [ 252 ] [ 11506192278177947613740456466942 ] / [ 409681705022127773530866523638950880 ]
h(5; 200, 30, 10) = 0.007078
Thus, the probability of oncogene 0.7%.
hypergeometry.png
hypergeometric distribution value
category | probability |
---|---|
probability mass f | 0.0070775932109153651831923063371216961166297 |
lower cumulative P | 0.99903494867072865323201131115533112651846 |
upper cumulative Q | 0.0080426445401867119511809951817905695981658 |
Expectation | 1.5 |
False Discovery Rate (FDR) q-value
The false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the expected proportion of “discoveries” (rejected null hypotheses) that are false (incorrect rejections).
- Benjamini–Yekutieli
- Benjamini–Hochberg
- Bonferroni-Selected–Bonferroni
- Bonferroni and Sidak
REViGO
http://revigo.irb.hr/revigo.jsp
cleverGO
http://www.tartaglialab.com/GO_analyser/tutorial
MetaScape
http://metascape.org/gp/index.html#/main/step1
DAVID
https://david.ncifcrf.gov/
AHRD References
High quality genome projects that used AHRD:
-
Young, Nevin D., Frédéric Debellé, Giles E. D. Oldroyd, Rene Geurts, Steven B. Cannon, Michael K. Udvardi, Vagner A. Benedito, et al. “The Medicago Genome Provides Insight into the Evolution of Rhizobial Symbioses.” Nature 480, no. 7378 (December 22, 2011): 520–24. doi:10.1038/nature10625.
-
The Tomato Genome Consortium. “The Tomato Genome Sequence Provides Insights into Fleshy Fruit Evolution.” Nature 485, no. 7400 (May 31, 2012): 635–41. doi:10.1038/nature11119.
-
International Wheat Genome Sequencing Consortium (IWGSC). “A Chromosome-Based Draft Sequence of the Hexaploid Bread Wheat (Triticum Aestivum) Genome.” Science (New York, N.Y.) 345, no. 6194 (July 18, 2014): 1251788. doi:10.1126/science.1251788.
-
International Barley Genome Sequencing Consortium, Klaus F. X. Mayer, Robbie Waugh, John W. S. Brown, Alan Schulman, Peter Langridge, Matthias Platzer, et al. “A Physical, Genetic and Functional Sequence Assembly of the Barley Genome.” Nature 491, no. 7426 (November 29, 2012): 711–16. doi:10.1038/nature11543.
-
Wang, W., G. Haberer, H. Gundlach, C. Gläßer, T. Nussbaumer, M. C. Luo, A. Lomsadze, et al. “The Spirodela Polyrhiza Genome Reveals Insights into Its Neotenous Reduction Fast Growth and Aquatic Lifestyle.” Nature Communications 5 (2014): 3311. doi:10.1038/ncomms4311.
-
Guo, Shaogui, Jianguo Zhang, Honghe Sun, Jerome Salse, William J. Lucas, Haiying Zhang, Yi Zheng, et al. “The Draft Genome of Watermelon (Citrullus Lanatus) and Resequencing of 20 Diverse Accessions.” Nature Genetics 45, no. 1 (January 2013): 51–58. doi:10.1038/ng.2470.
AHRD was applied on all plant genomes present in the public database PlantsDB:
- Spannagl, Manuel, Thomas Nussbaumer, Kai C. Bader, Mihaela M. Martis, Michael Seidel, Karl G. Kugler, Heidrun Gundlach, and Klaus F. X. Mayer. “PGSB PlantsDB: Updates to the Database Framework for Comparative Plant Genome Research.” Nucleic Acids Research 44, no. D1 (January 4, 2016): D1141–47. doi:10.1093/nar/gkv1130.
Meeting Schedule
Please mark time at the below link and send me email before meeting time.
https://docs.google.com/spreadsheets/d/1YxNb6Gfu_LymZ_8r_xQYk4VhK9ZY9eeMtuQqeeNAcVY/edit?usp=sharing
Final
- Seven questions.
- RNA-Seq, Venn diagram, Gene enrichments, etcs.
- Will open 5/10 7:00AM to 5/12 7:00PM (No extension will be allowed) on Webcanvas.
- 200 points for final grade