Quick Reminder

Fastq_file

basequality

RNASeq

$ cd /data/gpfs/assoc/bch709/YOURID

$ mkdir rnaseq/

$ cd rnaseq

$ conda activate rnaseq

File copy

$ cd  /data/gpfs/assoc/bch709/YOURID/

$ cd rnaseq/transcriptome_assembly/fastq

$ cp /data/gpfs/assoc/bch709/Course_material/2020/RNASeq/*.gz .

FASTQC

$ pwd 
/data/gpfs/assoc/bch709/YOURID/rnaseq/transcriptome_assembly/fastq

$ conda install -c bioconda fastqc

$ fastqc pair1.fastq.gz pair2.fastq.gz

Trim the reads

Trim IF necessary
- Synthetic bases can be an issue for SNP calling
- Insert size distribution may be more important for assemblers
Trim/Clip/Filter reads
Remove adapter sequences
Trim reads by quality
Sliding window trimming
Filter by min/max read length
Remove reads less than ~18nt
Demultiplexing/Splitting

Trimming

Cutadapt
fastp
Skewer
Prinseq
Trimmomatics
Trim Galore

Install Trim Galore

$ conda install trim-galore

Run trimming

$ trim_galore --help

$ trim_galore --paired   --three_prime_clip_R1 20 --three_prime_clip_R2 20 --cores 2  --max_n 40  --gzip -o trim paired1.fastq.gz paired2.fastq.gz 

$ ls trim/

Align the reads (mapping)

mapping

Aligning reads back to a reference sequence
Mapping to genome vs transcriptome
Splice-aware alignment (genome)

STAR
HISAT2
GSNAP
Bowtie2
Novoalign

Baruzzo, Giacomo, et al. “Simulation-based comprehensive benchmarking of RNA-seq aligners.” Nature methods 14.2 (2017): 135

algorithm banana banana

Install HISAT2 (graph FM index, spin off version Burrows-Wheeler Transform)

$ conda install hisat2

Download reference sequence

$ wget https://www.dropbox.com/s/0onch14nnxx9b94/bch709.fasta

HISAT2 indexing

$ hisat2-build --help
$ hisat2-build bch709.fasta bch709

HISAT2 mapping

$ hisat2 -x bch709 --threads <YOUR CPU COUNT> -1 trim/paired1_val_1.fq.gz -2 trim/paired2_val_2.fq.gz  -S align.sam

SAM file format

Check result

$ head align.sam

SAM file

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]

sam1

SAM flag

Demical to binary

$ echo 'obase=2;163' | bc

If you don’t have bc, please install through conda

$ conda install bc

SAM tag

There are a bunch of predefined tags, please see the SAM manual for more information. For the tags used in this example:

Any tags that start with X? are reserved fields for end users: XT:A:M, XN:i:2, XM:i:0, XO:i:0, XG:i:0

samtag

More information is below.

http://samtools.github.io/hts-specs/

SAMtools

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM aims to be a format that:

Is flexible enough to store all the alignment information generated by various alignment programs;
Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
Is compact in file size;
Allows most of operations on the alignment to work on a stream without loading the whole alignment into memory;
Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.

SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. http://samtools.sourceforge.net/

$ samtools view -Sb align.sam > align.bam
$ samtools sort align.bam  -o align_sort.bam
$ samtools index align_sort.bam

BAM file

A BAM file (.bam) is the binary version of a SAM file. A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data.

SAM/BAM	size
align.sam	903M
align.bam	166M

Alignment visualization

samtools tview align_sort.bam bch709.fasta

tview

IGV
SAMTOOLS Tview
Tablet

How to make a report?

MultiQC MultiQC

$ conda install multiqc
$ multiqc --help
$ multiqc .

Alignment QC

Number of reads mapped/unmapped/paired etc
Uniquely mapped
Insert size distribution
Coverage
Gene body coverage
Biotype counts / Chromosome counts
Counts by region: gene/intron/non-genic
Sequencing saturation
Strand specificity

samtools > stats
bamtools > stats
QoRTs
RSeQC
Qualimap

Quantification • Counts

genecount

Read counts = gene expression
Reads can be quantified on any feature (gene, transcript, exon etc)
Intersection on gene models
Gene/Transcript level

count

featureCounts HTSeq RSEM Cufflinks Rcount

conda install -c bioconda subread
conda install -c bioconda rsem

Quantification method

PCR duplicates
Ignore for RNA-Seq data
Computational deduplication (Don’t!)
Use PCR-free library-prep kits
Use UMIs during library-prep
Multi-mapping
Added (BEDTools multicov)
Discard (featureCounts, HTSeq)
Distribute counts (Cufflinks)
Rescue
- Probabilistic assignment (Rcount, Cufflinks)
- Prioritise features (Rcount)
- Probabilistic assignment with EM (RSEM)

Reference

Fu, Yu, et al. “Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers.” BMC genomics 19.1 (2018): 531 Parekh, Swati, et al. “The impact of amplification on differential expression analyses by RNA-seq.” Scientific reports 6 (2016): 25533 Klepikova, Anna V., et al. “Effect of method of deduplication on estimation of differential gene expression using RNA-seq.” PeerJ 5 (2017): e3091

Quick reminder

conda activate rnaseq

conda install python=3 jellyfish bowtie2 salmon cmake htop 

fastqc paired1.fastq.gz paired2.fastq.gz
 
trim_galore --clip_R1 10 --clip_R2 10 --paired   --three_prime_clip_R1 20 --three_prime_clip_R2 20 --cores 2  --max_n 40  --gzip -o trim/ paired1.fastq.gz paired2.fastq.gz
 
fastqc trim/paired1_val_1.fq.gz  trim/paired2_val_2.fq.gz
 
hisat2-build bch709.fasta bch709
 
hisat2 -x bch709 --threads 8 -1 trim/paired1_val_1.fq.gz -2 trim/paired2_val_2.fq.gz  -S align.sam
 
samtools view -Sb align.sam > align.bam
 
samtools sort align.bam  -o align_sort.bam
 
samtools index align_sort.bam

samtools tview  align_sort.bam

Transcriptome Assembly

De novo assembly denovo

Assembly?

assembly
assembly1

Sequencing coverage

coverage
averagecoverage

Assembly law

assemblylaw
assemblylaw1

Overlap graph

overlapgraph
greedy1
greedy2
greedy3
greedy4
greedy5
greedy6
greedy7
greedy8
greedy9

K-mer: substring of length k

k-mers are subsequences of length k contained within a biological sequence. kmer

De bruijn

hamiltonian_Eulerian DBG
dbg1

OLC vs De bruijn

assemblyalgorithm
olcdbg
realdbg
realdbg2
realdbg3

Transcriptome assembler

trinity

Transcriptome assembly error

transcriptome_error

Running Trinity

Trinity is run via the script: ‘Trinity’ found in the base installation directory.

Usage info is as follows:

conda create -n transcriptome_assembly

conda activate transcriptome_assembly

conda install -c bioconda/label/cf201901 trinity

 ###############################################################################
 #
 #     ______  ____   ____  ____   ____  ______  __ __
 #    |      ||    \ |    ||    \ |    ||      ||  |  |
 #    |      ||  D  ) |  | |  _  | |  | |      ||  |  |
 #    |_|  |_||    /  |  | |  |  | |  | |_|  |_||  ~  |
 #      |  |  |    \  |  | |  |  | |  |   |  |  |___, |
 #      |  |  |  .  \ |  | |  |  | |  |   |  |  |     |
 #      |__|  |__|\_||____||__|__||____|  |__|  |____/
 #
 ###############################################################################
 #
 # Required:
 #
 #  --seqType <string>      :type of reads: ( fa, or fq )
 #
 #  --max_memory <string>      :suggested max memory to use by Trinity where limiting can be enabled. (jellyfish, sorting, etc)
 #                            provided in Gb of RAM, ie.  '--max_memory 10G'
 #
 #  If paired reads:
 #      --left  <string>    :left reads, one or more file names (separated by commas, not spaces)
 #      --right <string>    :right reads, one or more file names (separated by commas, not spaces)
 #
 #  Or, if unpaired reads:
 #      --single <string>   :single reads, one or more file names, comma-delimited (note, if single file contains pairs, can use flag: --run_as_paired )
 #
 #  Or,
 #      --samples_file <string>         tab-delimited text file indicating biological replicate relationships.
 #                                   ex.
 #                                        cond_A    cond_A_rep1    A_rep1_left.fq    A_rep1_right.fq
 #                                        cond_A    cond_A_rep2    A_rep2_left.fq    A_rep2_right.fq
 #                                        cond_B    cond_B_rep1    B_rep1_left.fq    B_rep1_right.fq
 #                                        cond_B    cond_B_rep2    B_rep2_left.fq    B_rep2_right.fq
 #
 #                      # if single-end instead of paired-end, then leave the 4th column above empty.
 #
 ####################################
 ##  Misc:  #########################
 #
 #  --SS_lib_type <string>          :Strand-specific RNA-Seq read orientation.
 #                                   if paired: RF or FR,
 #                                   if single: F or R.   (dUTP method = RF)
 #                                   See web documentation.
 #
 #  --CPU <int>                     :number of CPUs to use, default: 2
 #  --min_contig_length <int>       :minimum assembled contig length to report
 #                                   (def=200)
 #
 #  --long_reads <string>           :fasta file containing error-corrected or circular consensus (CCS) pac bio reads
 #
 #  --genome_guided_bam <string>    :genome guided mode, provide path to coordinate-sorted bam file.
 #                                   (see genome-guided param section under --show_full_usage_info)
 #
 #  --jaccard_clip                  :option, set if you have paired reads and
 #                                   you expect high gene density with UTR
 #                                   overlap (use FASTQ input file format
 #                                   for reads).
 #                                   (note: jaccard_clip is an expensive
 #                                   operation, so avoid using it unless
 #                                   necessary due to finding excessive fusion
 #                                   transcripts w/o it.)
 #
 #  --trimmomatic                   :run Trimmomatic to quality trim reads
 #                                        see '--quality_trimming_params' under full usage info for tailored settings.
 #
 #
 #  --no_normalize_reads            :Do *not* run in silico normalization of reads. Defaults to max. read coverage of 50.
 #                                       see '--normalize_max_read_cov' under full usage info for tailored settings.
 #                                       (note, as of Sept 21, 2016, normalization is on by default)
 #
 #
 #
 #  --output <string>               :name of directory for output (will be
 #                                   created if it doesn't already exist)
 #                                   default( your current working directory: "/Users/bhaas/GITHUB/trinityrnaseq/trinity_out_dir"
 #                                    note: must include 'trinity' in the name as a safety precaution! )
 #
 #  --full_cleanup                  :only retain the Trinity fasta file, rename as ${output_dir}.Trinity.fasta
 #
 #  --cite                          :show the Trinity literature citation
 #
 #  --version                       :reports Trinity version (BLEEDING_EDGE) and exits.
 #
 #  --show_full_usage_info          :show the many many more options available for running Trinity (expert usage).
 #
 #
 ###############################################################################
 #
 #  *Note, a typical Trinity command might be:
 #
 #        Trinity --seqType fq --max_memory 50G --left reads_1.fq  --right reads_2.fq --CPU 6
 #
 #
 #    and for Genome-guided Trinity:
 #
 #        Trinity --genome_guided_bam rnaseq_alignments.csorted.bam --max_memory 50G
 #                --genome_guided_max_intron 10000 --CPU 6
 #
 #     see: /Users/bhaas/GITHUB/trinityrnaseq/sample_data/test_Trinity_Assembly/
 #          for sample data and 'runMe.sh' for example Trinity execution
 #
 #     For more details, visit: http://trinityrnaseq.github.io
 #
 ###############################################################################

Trinity performs best with strand-specific data, in which case sense and antisense transcripts can be resolved. For protocols on strand-specific RNA-Seq, see: Borodina T, Adjaye J, Sultan M. A strand-specific library preparation protocol for RNA sequencing. Methods Enzymol. 2011;500:79-98. PubMed PMID: 21943893.

If you have strand-specific data, specify the library type. There are four library types:

Paired reads:
- RF: first read (/1) of fragment pair is sequenced as anti-sense (reverse(R)), and second read (/2) is in the sense strand (forward(F)); typical of the dUTP/UDG sequencing method.
- FR: first read (/1) of fragment pair is sequenced as sense (forward), and second read (/2) is in the antisense strand (reverse)
Unpaired (single) reads:
- F: the single read is in the sense (forward) orientation
- R: the single read is in the antisense (reverse) orientation

By setting the –SS_lib_type parameter to one of the above, you are indicating that the reads are strand-specific. By default, reads are treated as not strand-specific.

strand specific specification

Other important considerations:

Whether you use Fastq or Fasta formatted input files, be sure to keep the reads oriented as they are reported by Illumina, if the data are strand-specific. This is because, Trinity will properly orient the sequences according to the specified library type. If the data are not strand-specific, no worries because the reads will be parsed in both orientations.
If you have both paired and unpaired data, and the data are NOT strand-specific, you can combine the unpaired data with the left reads of the paired fragments. Be sure that the unpaired reads have a /1 as a suffix to the accession value similarly to the left fragment reads. The right fragment reads should all have /2 as the accession suffix. Then, run Trinity using the –left and –right parameters as if all the data were paired.
If you have multiple paired-end library fragment sizes, set the ‘–group_pairs_distance’ according to the larger insert library. Pairings that exceed that distance will be treated as if they were unpaired by the Butterfly process.
by setting the ‘–CPU option’, you are indicating the maximum number of threads to be used by processes within Trinity. Note that Inchworm alone will be internally capped at 6 threads, since performance will not improve for this step beyond that setting)

Typical Trinity Command Line

A typical Trinity command for assembling non-strand-specific RNA-seq data would be like so, running the entire process on a single high-memory server (aim for ~1G RAM per ~1M ~76 base Illumina paired reads, but often much less memory is required):

Run Trinity like so:

 Trinity --seqType fq --max_memory 50G \
         --left reads_1.fq.gz  --right reads_2.fq.gz --CPU 6

If you have multiple sets of fastq files, such as corresponding to multiple tissue types or conditions, etc., you can indicate them to Trinity like so:

 Trinity --seqType fq --max_memory 50G  \
         --left condA_1.fq.gz,condB_1.fq.gz,condC_1.fq.gz \
         --right condA_2.fq.gz,condB_2.fq.gz,condC_2.fq.gz \
         --CPU 6

or better yet, create a ‘samples.txt’ file that describes the data like so:

 #      --samples_file <string>         tab-delimited text file indicating biological replicate relationships.
 #                                   ex.
 #                                        cond_A    cond_A_rep1    A_rep1_left.fq    A_rep1_right.fq
 #                                        cond_A    cond_A_rep2    A_rep2_left.fq    A_rep2_right.fq
 #                                        cond_B    cond_B_rep1    B_rep1_left.fq    B_rep1_right.fq
 #                                        cond_B    cond_B_rep2    B_rep2_left.fq    B_rep2_right.fq

This same samples file can then be used later on with other downstream analysis steps, including expression quantification and differential expression analysis.

note that fastq files can be gzip-compressed as shown above, in which case they should require a ‘.gz’ extension.

Sample file

$ cd ../

$ wget https://www.dropbox.com/s/q1c7t4xf5kkk5um/samples.txt

Sample name	stress	replicate	pair
KRWT	D	1	_1

Run trimming

$ trim_galore --paired   --three_prime_clip_R1 20 --three_prime_clip_R2 20 --cores 2  --max_n 40  --gzip -o trimmed_fastq fastq/WTD1_1.fastq.gz fastq/WTD1_2.fastq.gz 
.
.
.
.
.

$ multiqc . -n rnaseq_data

PLEASE CHECK YOUR MULTIQC

Trinity run

htop

cd /data/gpfs/assoc/bch709/YOURID/rnaseq/transcriptome_assembly

Trinity --seqType fq --max_memory 6G --samples_file samples.txt

Paper need to read https://academic.oup.com/gigascience/article/8/5/giz039/5488105 https://www.nature.com/articles/nbt.1883 https://www.nature.com/articles/nprot.2013.084

Assignment

de Bruijn graph construction (10 pts)

Draw (by hand) the de Bruijn graph for the following reads using k=3 (assume all reads are from the forward strand, no sequencing errors)
ATG
AGT
CAT
GTA
GTT
TGT
TAG
TTA
TAC

Trimming Practice

Make Trim folder under /data/gpfs/assoc/bch709/YOURID/BCH709_assignment

Change directory to the folder Trim

Copy all fastq.gz files in /data/gpfs/assoc/bch709/Course_material/2020/RNASeq_raw_fasta to Trim folder

Run Trim-Galore and process MultiQC

Generate 6 trimmed output from MultiQC and upload html file to WebCanvas. ** IF YOU USE “LOOP” FOR THIS JOB, YOU WILL GET ADDITIONAL 10 POINTS **

Reference:

Conda documentation https://docs.conda.io/en/latest/
Conda-forge https://conda-forge.github.io/
BioConda https://bioconda.github.io/

BCH709 Introduction to Bioinformatics: 7_Transcriptome Assembly

Quick Reminder

RNASeq

File copy

FASTQC

Trim the reads

Install Trim Galore

Run trimming

Align the reads (mapping)

Install HISAT2 (graph FM index, spin off version Burrows-Wheeler Transform)

Download reference sequence

HISAT2 indexing

HISAT2 mapping

SAM file format

SAM file

SAM flag

Demical to binary

SAM tag

More information is below.

SAMtools

BAM file

Alignment visualization

How to make a report?

Alignment QC

Quantification • Counts

Quantification method

Reference

Quick reminder

Transcriptome Assembly

Assembly?

Sequencing coverage

Assembly law

Overlap graph

K-mer: substring of length k

De bruijn

OLC vs De bruijn

Transcriptome assembler

Transcriptome assembly error

Running Trinity

Typical Trinity Command Line

Sample file

Run trimming

Trinity run

Assignment

Reference: