Midterm

Six questions
Thursday (10:00AM) to Tuesday (10:00AM)

Pronghorn

Connect to Pronghorn

ssh <yourID>@pronghorn.rc.unr.edu

What are file permissions?

 -    (rw-)   (rw-)   (r--) 1 john sap
|      |       |       |
type  owner  group   others
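
As a quick check of the layout above, the following sketch creates a scratch file, sets it to rw-rw-r-- with chmod, and prints the resulting permission string. The file name example.txt and the temporary directory are assumptions used only for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
tmpdir=$(mktemp -d)
touch "$tmpdir/example.txt"
# 664 = owner rw- (6), group rw- (6), others r-- (4).
chmod 664 "$tmpdir/example.txt"
# The first 10 characters of `ls -l` are the file type plus the nine permission bits.
perms=$(ls -l "$tmpdir/example.txt" | cut -c1-10)
echo "$perms"
rm -r "$tmpdir"
```

The leading `-` marks a regular file; a directory would show `d` in that position.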

How to download a file from Pronghorn to your local computer

scp <yourID>@pronghorn.rc.unr.edu:~/[FOLDER AND FILE] [LOCAL directory]

Conda

Conda installation

Search Google for the software name plus "conda", for example: 'hisat2 conda'

conda install [package name]

Conda export

Export your environment to a YAML file (for example, rnaseq.yaml)

conda env export > [Name].yaml

Sequencing

Illumina sequencing

PacBio sequencing

File format

Fasta file

Fastq file

[Figure: FASTQ file base quality]
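
A minimal sketch of how base quality can be read from a FASTQ quality line, assuming the Phred+33 encoding used by current Illumina data: the ASCII code of each character minus 33 gives the quality score Q. The quality string below is made up for illustration.

```shell
# Hypothetical quality string for a 4-base read (Phred+33 assumed).
qual='II5#'
scores=""
i=1
while [ "$i" -le "${#qual}" ]; do
    # Take the i-th character of the quality string.
    ch=$(printf '%s' "$qual" | cut -c"$i")
    # A leading single quote makes printf return the character's ASCII code.
    ascii=$(printf '%d' "'$ch")
    scores="$scores $((ascii - 33))"
    i=$((i + 1))
done
# Drop the leading space added by the loop.
scores=${scores# }
echo "$scores"
```

Here `I` decodes to Q40 and `#` to Q2, so a run of `#` characters at the end of a read is a sign of low-quality bases.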

GFF format

1. Sequence ID
2. Source
- Describes the algorithm or procedure that generated this feature, typically a program name such as Genescan or a database name such as GenBank.
3. Feature Type
- Describes what the feature is (mRNA, domain, exon, etc.). These terms are constrained to Sequence Ontology terms.
4. Feature Start
5. Feature End
6. Score
- Typically E-values for sequence similarity and P-values for predictions.
7. Strand
8. Phase
- Indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.
9. Attributes
- A semicolon-separated list of tag-value pairs providing additional information about each feature. Some of these tags are predefined, e.g. ID, Name, Alias, Parent. The full list is given in the GFF3 specification.
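
The column layout above can be explored with standard Unix tools. The sketch below uses a single made-up GFF3 record (the sequence name, source, coordinates, and attributes are all hypothetical) and pulls out the feature type, strand, length, and attribute pairs.

```shell
# One made-up GFF3 record; the nine columns are tab-separated.
gff_line=$(printf 'chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=demo')
# Column 3 = feature type, 4 = start, 5 = end (1-based, inclusive), 7 = strand.
feature=$(printf '%s\n' "$gff_line" | awk -F'\t' '$3 == "gene" { print $1, $3, $7, $5 - $4 + 1 }')
echo "$feature"
# Column 9 holds tag=value pairs separated by semicolons; print one per line.
attrs=$(printf '%s\n' "$gff_line" | cut -f9 | tr ';' '\n')
echo "$attrs"
```

Because start and end are both inclusive, the feature length is end - start + 1.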

RNA Sequencing

The transcriptome is spatially and temporally dynamic
Data comes from functional units (coding regions)
Only a tiny fraction of the genome

Paper reading

Please read this paper
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8

Conda

Dependency management is one of the main reasons to use Conda.
Installing a package is not always as straightforward as it seems. Suppose you want to install Matplotlib: the installer also asks you to install NumPy and SciPy, because Matplotlib needs them to work. These are called the dependencies of Matplotlib. NumPy and SciPy may in turn have their own dependencies, which require even more packages. Conda resolves and installs this whole chain for you.

Conda env clean

conda clean --all

Conda create environment

conda create -n review python=3  

Conda activate environment

conda activate review

Install software

conda install -c bioconda -c anaconda trinity samtools multiqc fastqc rsem jellyfish bowtie2 salmon trim-galore bioconductor-ctc bioconductor-deseq2 bioconductor-edger bioconductor-biobase bioconductor-qvalue r-ape r-gplots r-fastcluster
conda install -c anaconda openblas
conda install nano
conda install -c eumetsat tree
conda install -c lmfaber transrate

Check installation

conda list

Basic Unix/Linux command

cd

cd /data/gpfs/assoc/bch709-4/<YOUR_ID>

mkdir

mkdir BCH709_midterm
cd BCH709_midterm

pwd

pwd

wget

file download

wget https://www.dropbox.com/s/yqvfm70yz79jvij/fasta.zip https://www.dropbox.com/s/jjz6aip3euh0d7q/fastq.tar
wget https://www.dropbox.com/s/szzyb3l4243xcsu/bch709.py 

Decompress tar file

tar xvf fastq.tar

ls
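
Before extracting an unfamiliar archive, it can help to list its contents first. The sketch below builds a small archive from made-up FASTQ files (demo.tar, a.fastq, and b.fastq are hypothetical names), lists it with `tar tf`, and extracts it into a separate directory.

```shell
tmpdir=$(mktemp -d)
mkdir "$tmpdir/src" "$tmpdir/out"
# Two tiny made-up FASTQ files to archive.
printf '@read1\nACGT\n+\nIIII\n' > "$tmpdir/src/a.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$tmpdir/src/b.fastq"
# c = create, f = archive file name; -C switches directory before adding files.
tar cf "$tmpdir/demo.tar" -C "$tmpdir/src" a.fastq b.fastq
# t = list the contents without extracting anything.
listing=$(tar tf "$tmpdir/demo.tar")
echo "$listing"
# x = extract; -C chooses the destination directory.
tar xf "$tmpdir/demo.tar" -C "$tmpdir/out"
ls "$tmpdir/out"
```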

Decompress zip file

unzip fasta.zip

ls

gz file

zcat

pipe

wc

rm
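
The gz file, zcat, pipe, wc, and rm entries above can be combined into one short exercise. This is a sketch on made-up data (reads.fastq.gz is a hypothetical file created inside a temporary directory): it counts the reads in a gzipped FASTQ file without writing the decompressed file to disk, then deletes the file.

```shell
tmpdir=$(mktemp -d)
# Build a tiny two-read FASTQ file and compress it (made-up data).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip -c > "$tmpdir/reads.fastq.gz"
# zcat writes the decompressed text to stdout; the pipe (|) feeds it to wc -l.
lines=$(zcat < "$tmpdir/reads.fastq.gz" | wc -l)
# Each FASTQ record is four lines, so reads = lines / 4.
reads=$((lines / 4))
echo "$reads"
# rm deletes the file permanently; there is no undo.
rm "$tmpdir/reads.fastq.gz"
rmdir "$tmpdir"
```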

RNA-Seq

RNA Sequencing workflow

Advanced bioinformatics tools

Seqkit

https://plantgenomicslab.github.io/BCH709/seqkit_tutorial/index.html

Trim-Galore

Install Trim Galore

conda install -c bioconda -c conda-forge trim-galore

Run trimming

trim_galore --help

trim_galore --paired   --three_prime_clip_R1 20 --three_prime_clip_R2 20 --cores 2  --max_n 40  --gzip -o trim pair1.fastq.gz pair2.fastq.gz --fastqc

HISAT2

Example

Install HISAT2 (uses a graph FM index, an extension of the Burrows-Wheeler Transform)

conda install -c conda-forge -c bioconda hisat2

Download reference sequence

wget https://nevada.box.com/shared/static/5v14j6gjt16c7k5d42b7g51csjmmj16l.fasta -O bch709.fasta

HISAT2 indexing

hisat2-build --help
hisat2-build bch709.fasta bch709

HISAT2 mapping

hisat2 -x bch709 --threads <YOUR CPU COUNT> -1 trim/pair1_val_1.fq.gz -2 trim/pair2_val_2.fq.gz -S align.sam 2> summarymetrics.txt

MultiQC

multiqc .

SAMtools

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM aims to be a format that:

Is flexible enough to store all the alignment information generated by various alignment programs;
Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
Is compact in file size;
Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;
Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.

SAMtools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing, and generating alignments in a per-position format. http://samtools.sourceforge.net/

samtools view -Sb align.sam -o align.bam

samtools sort align.bam  -o align_sort.bam

samtools index align_sort.bam
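
To make the SAM layout concrete, here is a sketch that reads one alignment record with plain Unix tools. The record itself is made up; the field order (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) follows the SAM specification.

```shell
# Hypothetical single SAM record with the 11 mandatory tab-separated fields.
sam_line=$(printf 'read1\t99\tchr1\t100\t60\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII')
# Read name (1), reference name (3), 1-based position (4), mapping quality (5).
summary=$(printf '%s\n' "$sam_line" | awk -F'\t' '{ print $1, $3, $4, $5 }')
# FLAG (2) is a bit field; bit 16 (0x10) marks a read aligned to the reverse strand.
flag=$(printf '%s\n' "$sam_line" | cut -f2)
if [ $((flag & 16)) -ne 0 ]; then strand="-"; else strand="+"; fi
echo "$summary $strand"
```

On a real alignment, `samtools view align_sort.bam | head` prints records in this same text layout.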

Quantification • Counts

Read counts = gene expression
Reads can be quantified on any feature (gene, transcript, exon etc)
Intersection on gene models
Gene/Transcript level
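
The idea of intersecting reads with gene models can be illustrated with a toy example. This is only a sketch of the concept, not how featureCounts or any other counting tool is implemented; the gene coordinates and read positions are made up, and each read is reduced to a single position.

```shell
tmpdir=$(mktemp -d)
# Made-up gene models: name, start, end (1-based, inclusive).
printf 'geneA 100 200\ngeneB 300 400\n' > "$tmpdir/genes.txt"
# Made-up read positions, one read per line.
printf '150\n180\n350\n500\n' > "$tmpdir/reads.txt"
# First file: store the gene models. Second file: credit each read to every gene it falls in.
counts=$(awk '
    NR == FNR { name[FNR] = $1; gstart[FNR] = $2; gend[FNR] = $3; n = FNR; next }
    { for (i = 1; i <= n; i++) if ($1 >= gstart[i] && $1 <= gend[i]) count[i]++ }
    END { for (i = 1; i <= n; i++) print name[i], count[i] + 0 }
' "$tmpdir/genes.txt" "$tmpdir/reads.txt")
echo "$counts"
rm -r "$tmpdir"
```

The read at position 500 falls outside both genes and is not counted, which mirrors how reads outside annotated features are left unassigned.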

Tools: featureCounts, HTSeq, RSEM, Cufflinks, Rcount

conda install -c conda-forge -c bioconda subread

wget https://www.dropbox.com/s/e9dvdkrl9dta4qg/bch709.gtf
featureCounts -p  -a bch709.gtf align_sort.bam -o counts.txt

References:

Conda documentation https://docs.conda.io/en/latest/
Conda-forge https://conda-forge.github.io/
BioConda https://bioconda.github.io/

BCH709 Introduction to Bioinformatics: midterm_review
