15_RNA-Seq tutorial – BCH709 Introduction to Bioinformatics

Importing Data from the NCBI Sequence Read Archive (SRA) using the DE

Course Evaluation

Students will have access to course evaluation You can log in with your NetID to http://www.unr.edu/evaluate and check live updating response rates for your course evaluations. Our institutional goal is to achieve an 85% response rate for all evaluations, and to help us achieve that, we rely on you as well as the students.

If we can achieve 100% response rate for evaluation, I will give you additional points for all of you.

Discussion is open

https://unr.canvaslms.com/courses/56453/discussion_topics/514795

The due date for the question is November 23rd 11:59pm The due date for discussion is December 4th 11:59pm

Define the biological hypotheses or bottleneck you wish to address which is related to your research, state the approach of your experiment, also state your system, study organism, or study site, and provide justification for what is the goal of your biological hypotheses. Please provide enough background information that the other students can understand your biological hypotheses or bottleneck. If your experiments are complicated, consider briefly explaining the experimental design with reason. If you get more like will get points. (30 Points)
Please provide the bioinformatics suggestion that you want to suggest for other people’s research hypotheses or bottleneck. It should be scientifically valid methods even if it does not exist. Provide enough information to create an experiment and if you want to create software, please provide reasons and explain what kind of software we need, which part of the hypotheses or bottleneck can be solved. If the software doesn’t exist, please provide the design or roadmap of your software. Citation is optional but recommended. Please provide an obstacle to other people’s suggestions. In addition, insights and addition will also get points. If you get more like will get points. (10 points per valid answer with reference or concept or hypothesis, a total of 70 Points, seven replies are needed )

3. The suggestion needs to reply as threaded format.

Examples are below

Example

Tef (or Teff) is a warm season, C4-photosynthesis grass that is gaining popularity in the U.S. as a high-quality summer forage, fodder, and gluten-free grain. However, Tef has relatively tiny seeds compare to other C4 grass. Currently, the primary goal of my research is to determine the loci of seed color and size. We are trying to use Genome-wide Association Study (GWAS) https://en.wikipedia.org/wiki/Genome-wide_association_study) (Links to an external site.) to identify the locus of seed color and size. In our lab, we have 386 teff accession and all of them have different seed colors. We extracted all of 386 teff accession DNA and sequencing was done. But I don’t know how to check the size and colors. The phenotyping is the most important but the main bottleneck of our experiment. How can we facilitate this task?

Student A (This example answer will get 3 points) I cannot find any solution but you can use a similar approach such as colony counting. The accurate counting of plates with high numbers of CFUs is error-prone since it requires a high level of attention by the counter. In the microbiome and general biology field use colony count software to analyze whole plate count. The examples are below.

https://www.nature.com/articles/s41598-018-24916-9 (Links to an external site.)

http://opencfu.sourceforge.net/ (Links to an external site.)

Student B (This example answer will get 5 points) I found one software, especially for seed size and color. GrainScan software was designed for seed size and color estimation. GrainScan uses a grayscale image is derived from the scanned color image by converting Red and Green color channel averaging. Based on the grayscaled image, the dimension measurements will be provided which include area, perimeter, and surrogates for length and width the major and minor axes of the best fit ellipse. Another great point of this software will provide color measurements for each seed in CIELAB values based on user provide color calibration options.

https://plantmethods.biomedcentral.com/articles/10.1186/1746-4811-10-23 (Links to an external site.)

Student C (This example answer will get 5 points) I don’t know how to code python, but there are several image analysis packages such as scikit-image (Links to an external site.)

Base on scikit software, you can calculate circularity with “4 * pi * props.area / props.perimeter ** 2”

The props area can be calculated with number of pixels from the centroid approach. The axis location and length can be converted by using orientation value from props and axis value can be estimated cos(orientation) * length/ 2 and sin (orientation) * length/ 2.

Delete previous work

rm -rf  /data/gpfs/assoc/bch709-1/<YOURID>/rnaseq_assembly/trinity_out_dir

WORKTING PATH

mkdir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/
cd /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/

Conda environment

CONDA_INSTRUMENTATION_ENABLED=1 conda create -n BCH709 python=3.7

conda activate BCH709

CONDA_INSTRUMENTATION_ENABLED=1 conda install -y -c bioconda -c conda-forge  sra-tools minimap2 trinity star multiqc=1.9 samtools=1.9 trim-galore gffread seqkit kraken2

CONDA_INSTRUMENTATION_ENABLED=1 conda install -y -c bioconda -c conda-forge -c r openssl=1.0 r-base icu=58.2 bioconductor-ctc  bioconductor-deseq2=1.20.0 bioconductor-biobase=2.40.0  bioconductor-qvalue=2.16.0 r-ape  r-gplots r-fastcluster=1.1.25 libiconv

SRA

Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys.

Searching the SRA: Searching the SRA can be complicated. Often a paper or reference will specify the accession number(s) connected to a dataset. You can search flexibly using a number of terms (such as the organism name) or the filters (e.g. DNA vs. RNA). The SRA Help Manual provides several useful explanations. It is important to know is that projects are organized and related at several levels, and some important terms include:

Bioproject: A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium of coordinating organizations; see for example Bio Project 272719 Bio Sample: A description of the source materials for a project Run: These are the actual sequencing runs (usually starting with SRR); see for example SRR1761506

Publication (Arabidopsis)

Kim JS et al., “ROS1-Dependent DNA Demethylation Is Required for ABA-Inducible NIC3 Expression.”, Plant Physiol, 2019 Apr;179(4):1810-1821

SRA Bioproject site

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA272719

Runinfo

Run	ReleaseDate	LoadDate	spots	bases	spots_with_mates	avgLength	size_MB	download_path	Experiment	LibraryStrategy	LibrarySelection	LibrarySource	LibraryLayout	Platform	Model	SRAStudy	BioProject	Study_Pubmed_id	ProjectID	Sample	BioSample	SampleType	TaxID	ScientificName	SampleName	Tumor	CenterName	Submission	Consent	RunHash	ReadHash
SRR1761506	1/15/2016 15:51	1/15/2015 12:43	7379945	1490748890	7379945	202	899	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761506/SRR1761506.1	SRX844600	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820503	SAMN03285048	simple	3702	Arabidopsis thaliana	GSM1585887	no	GEO	SRA232612	public	F335FB96DDD730AC6D3AE4F6683BF234	12818EB5275BCB7BCB815E147BFD0619
SRR1761507	1/15/2016 15:51	1/15/2015 12:43	9182965	1854958930	9182965	202	1123	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761507/SRR1761507.1	SRX844601	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820504	SAMN03285045	simple	3702	Arabidopsis thaliana	GSM1585888	no	GEO	SRA232612	public	00FD62759BF7BBAEF123BF5960B2A616	A61DCD3B96AB0796AB5E969F24F81B76
SRR1761508	1/15/2016 15:51	1/15/2015 12:47	19060611	3850243422	19060611	202	2324	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761508/SRR1761508.1	SRX844602	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820505	SAMN03285046	simple	3702	Arabidopsis thaliana	GSM1585889	no	GEO	SRA232612	public	B75A3E64E88B1900102264522D2281CB	657987ABC8043768E99BD82947608CAC
SRR1761509	1/15/2016 15:51	1/15/2015 12:51	16555739	3344259278	16555739	202	2016	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761509/SRR1761509.1	SRX844603	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820506	SAMN03285049	simple	3702	Arabidopsis thaliana	GSM1585890	no	GEO	SRA232612	public	27CA2B82B69EEF56EAF53D3F464EEB7B	2B56CA09F3655F4BBB412FD2EE8D956C
SRR1761510	1/15/2016 15:51	1/15/2015 12:46	12700942	2565590284	12700942	202	1552	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761510/SRR1761510.1	SRX844604	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820508	SAMN03285050	simple	3702	Arabidopsis thaliana	GSM1585891	no	GEO	SRA232612	public	D3901795C7ED74B8850480132F4688DA	476A9484DCFCF9FFFDAADAAF4CE5D0EA
SRR1761511	1/15/2016 15:51	1/15/2015 12:44	13353992	2697506384	13353992	202	1639	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761511/SRR1761511.1	SRX844605	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820507	SAMN03285047	simple	3702	Arabidopsis thaliana	GSM1585892	no	GEO	SRA232612	public	5078379601081319FCBF67C7465C404A	E3B4195AFEA115ACDA6DEF6E4AA7D8DF
SRR1761512	1/15/2016 15:51	1/15/2015 12:44	8134575	1643184150	8134575	202	1067	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761512/SRR1761512.1	SRX844606	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820509	SAMN03285051	simple	3702	Arabidopsis thaliana	GSM1585893	no	GEO	SRA232612	public	DDB8F763B71B1E29CC9C1F4C53D88D07	8F31604D3A4120A50B2E49329A786FA6
SRR1761513	1/15/2016 15:51	1/15/2015 12:43	7333641	1481395482	7333641	202	960	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761513/SRR1761513.1	SRX844607	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820510	SAMN03285053	simple	3702	Arabidopsis thaliana	GSM1585894	no	GEO	SRA232612	public	4068AE245EB0A81DFF02889D35864AF2	8E05C4BC316FBDFEBAA3099C54E7517B
SRR1761514	1/15/2016 15:51	1/15/2015 12:44	6160111	1244342422	6160111	202	807	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761514/SRR1761514.1	SRX844608	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820511	SAMN03285059	simple	3702	Arabidopsis thaliana	GSM1585895	no	GEO	SRA232612	public	0A1F3E9192E7F9F4B3758B1CE514D264	81BFDB94C797624B34AFFEB554CE4D98
SRR1761515	1/15/2016 15:51	1/15/2015 12:44	7988876	1613752952	7988876	202	1048	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761515/SRR1761515.1	SRX844609	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820512	SAMN03285054	simple	3702	Arabidopsis thaliana	GSM1585896	no	GEO	SRA232612	public	39B37A0BD484C736616C5B0A45194525	85B031D74DF90AD1815AA1BBBF1F12BD
SRR1761516	1/15/2016 15:51	1/15/2015 12:44	8770090	1771558180	8770090	202	1152	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761516/SRR1761516.1	SRX844610	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820514	SAMN03285055	simple	3702	Arabidopsis thaliana	GSM1585897	no	GEO	SRA232612	public	E4728DFBF0F9F04B89A5B041FA570EB3	B96545CB9C4C3EE1C9F1E8B3D4CE9D24
SRR1761517	1/15/2016 15:51	1/15/2015 12:44	8229157	1662289714	8229157	202	1075	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761517/SRR1761517.1	SRX844611	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820513	SAMN03285058	simple	3702	Arabidopsis thaliana	GSM1585898	no	GEO	SRA232612	public	C05BC519960B075038834458514473EB	4EF7877FC59FF5214DBF2E2FE36D67C5
SRR1761518	1/15/2016 15:51	1/15/2015 12:44	8760931	1769708062	8760931	202	1072	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761518/SRR1761518.1	SRX844612	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820515	SAMN03285052	simple	3702	Arabidopsis thaliana	GSM1585899	no	GEO	SRA232612	public	7D8333182062545CECD5308A222FF506	382F586C4BF74E474D8F9282E36BE4EC
SRR1761519	1/15/2016 15:51	1/15/2015 12:44	6643107	1341907614	6643107	202	811	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761519/SRR1761519.1	SRX844613	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820516	SAMN03285056	simple	3702	Arabidopsis thaliana	GSM1585900	no	GEO	SRA232612	public	163BD8073D7E128D8AD1B253A722DD08	DFBCC891EB5FA97490E32935E54C9E14
SRR1761520	1/15/2016 15:51	1/15/2015 12:44	8506472	1718307344	8506472	202	1040	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761520/SRR1761520.1	SRX844614	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820517	SAMN03285062	simple	3702	Arabidopsis thaliana	GSM1585901	no	GEO	SRA232612	public	791BD0D8840AA5F1D74E396668638DA1	AF4694425D34F84095F6CFD6F4A09936
SRR1761521	1/15/2016 15:51	1/15/2015 12:46	13166085	2659549170	13166085	202	1609	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761521/SRR1761521.1	SRX844615	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820518	SAMN03285057	simple	3702	Arabidopsis thaliana	GSM1585902	no	GEO	SRA232612	public	47C40480E9B7DB62B4BEE0F2193D16B3	1443C58A943C07D3275AB12DC31644A9
SRR1761522	1/15/2016 15:51	1/15/2015 12:49	9496483	1918289566	9496483	202	1162	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761522/SRR1761522.1	SRX844616	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820519	SAMN03285061	simple	3702	Arabidopsis thaliana	GSM1585903	no	GEO	SRA232612	public	BB05DF11E1F95427530D69DB5E0FA667	7706862FB2DF957E4041D2064A691CF6
SRR1761523	1/15/2016 15:51	1/15/2015 12:46	14999315	3029861630	14999315	202	1832	https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1761523/SRR1761523.1	SRX844617	RNA-Seq	cDNA	TRANSCRIPTOMIC	PAIRED	ILLUMINA	Illumina HiSeq 2500	SRP052302	PRJNA272719	3	272719	SRS820520	SAMN03285060	simple	3702	Arabidopsis thaliana	GSM1585904	no	GEO	SRA232612	public	101D3A151E632224C09A702BD2F59CF5	0AC99FAA6B8941F89FFCBB8B1910696E

Subset of data

Sample information	Run
WT_rep1	SRR1761506
WT_rep2	SRR1761507
WT_rep3	SRR1761508
ABA_rep1	SRR1761509
ABA_rep2	SRR1761510
ABA_rep3	SRR1761511

mkdir ATH && cd ATH

mkdir raw_data

mkdir trim

fastq-dump submission

#!/bin/bash
#SBATCH --job-name=fastqdump_ATH
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<youremail>
#SBATCH -o fastq-dump.out # STDOUT & STDERR
#SBATCH -p cpu-s2-core-0 
#SBATCH -A cpu-s2-bch709-1

fastq-dump SRR1761506 --split-3 --outdir ./raw_data  --gzip
fastq-dump SRR1761507 --split-3 --outdir ./raw_data  --gzip
fastq-dump SRR1761508 --split-3 --outdir ./raw_data  --gzip
fastq-dump SRR1761509 --split-3 --outdir ./raw_data  --gzip
fastq-dump SRR1761510 --split-3 --outdir ./raw_data  --gzip
fastq-dump SRR1761511 --split-3 --outdir ./raw_data  --gzip

Trim-galore

#!/bin/bash
#SBATCH --job-name=trim_ATH
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<youremail>
#SBATCH -o trim.out # STDOUT & STDERR
#SBATCH -p cpu-s2-core-0 
#SBATCH -A cpu-s2-bch709-1
#SBATCH --dependency=afterok:<PREVIOUS_JOBID>

trim_galore --paired   --three_prime_clip_R1 5 --three_prime_clip_R2 5 --cores 2  --max_n 40  --gzip -o trim raw_data/SRR1761506_1.fastq.gz raw_data/SRR1761506_2.fastq.gz --fastqc
trim_galore --paired   --three_prime_clip_R1 5 --three_prime_clip_R2 5 --cores 2  --max_n 40  --gzip -o trim raw_data/SRR1761507_1.fastq.gz raw_data/SRR1761507_2.fastq.gz --fastqc
trim_galore --paired   --three_prime_clip_R1 5 --three_prime_clip_R2 5 --cores 2  --max_n 40  --gzip -o trim raw_data/SRR1761508_1.fastq.gz raw_data/SRR1761508_2.fastq.gz --fastqc
trim_galore --paired   --three_prime_clip_R1 5 --three_prime_clip_R2 5 --cores 2  --max_n 40  --gzip -o trim raw_data/SRR1761509_1.fastq.gz raw_data/SRR1761509_2.fastq.gz --fastqc
trim_galore --paired   --three_prime_clip_R1 5 --three_prime_clip_R2 5 --cores 2  --max_n 40  --gzip -o trim raw_data/SRR1761510_1.fastq.gz raw_data/SRR1761510_2.fastq.gz --fastqc
trim_galore --paired   --three_prime_clip_R1 5 --three_prime_clip_R2 5 --cores 2  --max_n 40  --gzip -o trim raw_data/SRR1761511_1.fastq.gz raw_data/SRR1761511_2.fastq.gz --fastqc

Reference downloads

https://www.arabidopsis.org/

https://phytozome.jgi.doe.gov/pz/portal.html

cd /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH

mkdir bam

mkdir reference && cd reference

wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas --no-check-certificate

wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff  --no-check-certificate

wget ftp://ftp.arabidopsis.org/home/tair/Sequences/ATH_cDNA_EST_sequences_FASTA/ATH_cDNA_sequences_20101108.fas --no-check-certificate

seqkit stats TAIR10_chr_all.fas 

seqkit stats ATH_cDNA_sequences_20101108.fas

Reference index

#!/bin/bash
#SBATCH --job-name=reference_ATH
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<youremail>
#SBATCH -o trim.out # STDOUT & STDERR
#SBATCH -p cpu-s2-core-0 
#SBATCH -A cpu-s2-bch709-1

gffread  TAIR10_GFF3_genes.gff -T -o TAIR10_GFF3_genes.gtf

STAR  --runThreadN <YOUR THREAD> --runMode genomeGenerate --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/ --genomeFastaFiles  /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/TAIR10_chr_all.fas --sjdbGTFfile  /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/TAIR10_GFF3_genes.gtf  --sjdbOverhang 99   --genomeSAindexNbases 12

ERROR

Fatal INPUT FILE error, no valid exon lines in the GTF file: /data/gpfs/assoc/bch709-1/wyim/RNA-Seq_example/ATH/reference/TAIR10_GFF3_genes.gtf Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.

head  /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/TAIR10_GFF3_genes.gtf
head  /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/TAIR10_chr_all.fas
sed -i 's/Chr//g' TAIR10_GFF3_genes.gtf

Mapping

#!/bin/bash
#SBATCH --job-name=mapping_ATH
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<youremail>
#SBATCH -o mapping.out # STDOUT & STDERR
#SBATCH -p cpu-s2-core-0 
#SBATCH -A cpu-s2-bch709-1
#SBATCH --dependency=afterok:<PREVIOUS_JOBID(trim_ATH)>

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761506_1_val_1.fq.gz /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761506_2_val_2.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/bam/SRR1761506.bam
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761507_1_val_1.fq.gz /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761507_2_val_2.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/bam/SRR1761507.bam
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761508_1_val_1.fq.gz /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761508_2_val_2.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/bam/SRR1761508.bam
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761509_1_val_1.fq.gz /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761509_2_val_2.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/bam/SRR1761509.bam
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761510_1_val_1.fq.gz /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761510_2_val_2.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/bam/SRR1761510.bam
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 10000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761511_1_val_1.fq.gz /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/trim/SRR1761511_2_val_2.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/ATH/bam/SRR1761511.bam

Replace YOURID to your id

sed -i ‘s//your id/g' mapping.sh

Conda Environment

conda activate BCH709

CONDA_INSTRUMENTATION_ENABLED=1 conda install -y -c bioconda -c conda-forge  sra-tools minimap2 trinity star multiqc=1.9 samtools=1.9 trim-galore gffread seqkit kraken2

Publication (Drosophila)

Ramond E et al., “Comparative RNA-Seq analyses of Drosophila plasmatocytes reveal gene specific signatures in response to clean injury and septic injury”, Plos one, 2019 June 29, 2020

SRA Bioproject site

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA638422

Run	ReleaseDate	LoadDate	spots	bases	avgLength	size_MB	download_path	Experiment	LibraryName	LibraryStrategy	LibrarySelection	LibrarySource	LibraryLayout	Platform	Model	SRAStudy	BioProject	ProjectID	Sample	BioSample	SampleType	TaxID	ScientificName	SampleName	Sex	Tumor	CenterName	Submission	Consent	RunHash	ReadHash
SRR11968960	6/9/2020 17:10	6/9/2020 17:09	12256307	1237887007	101	378	https://sra-download.ncbi.nlm.nih.gov/traces/sra23/SRR/011688/SRR11968960	SRX8512716	4w1118-ci	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811237	SAMN15192434	simple	7227	Drosophila melanogaster	w1118-ci-rep1	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	3D3A8EBF0A13F90F9305C5DD917E9AE2	A111523A7FB7106EE54D2D8337D2E8F2
SRR11968959	6/9/2020 17:09	6/9/2020 17:07	14144827	1428627527	101	432	https://sra-download.ncbi.nlm.nih.gov/traces/sra1/SRR/011688/SRR11968959	SRX8512717	5w1118-ci	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811238	SAMN15192435	simple	7227	Drosophila melanogaster	w1118-ci-rep2	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	5515CADB5697C29CDC396F942C24F387	6D312D3B5BF5001309FF93CB968E584B
SRR11968958	6/9/2020 17:11	6/9/2020 17:09	16118803	1627999103	101	495	https://sra-download.ncbi.nlm.nih.gov/traces/sra60/SRR/011688/SRR11968958	SRX8512718	6w1118-ci	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811239	SAMN15192436	simple	7227	Drosophila melanogaster	w1118-ci-rep3	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	FCC81714EB524E34632C58BDC1E4C162	9F494AF29716E7175EA1E4652B08F0B7
SRR11968957	6/9/2020 17:07	6/9/2020 17:05	6215784	627794184	101	188	https://sra-download.ncbi.nlm.nih.gov/traces/sra47/SRR/011688/SRR11968957	SRX8512719	7w1118-ec	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811240	SAMN15192437	simple	7227	Drosophila melanogaster	w1118-ec-rep1	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	9FA82BA9A828F9BDDE839810689EFA4F	CC558BFAAE5EE65BDD1CC1C690575F9D
SRR11968956	6/9/2020 19:58	6/9/2020 19:56	46628659	4709494559	101	1573	https://sra-download.ncbi.nlm.nih.gov/traces/sra59/SRR/011688/SRR11968956	SRX8512720	8w1118-ec	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811241	SAMN15192438	simple	7227	Drosophila melanogaster	w1118-ec-rep2	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	16E3AE29FAC6BDFDB2B60F5300A02302	0F7770A244784C635FEC2DC814A1040C
SRR11968955	6/9/2020 17:13	6/9/2020 17:11	16299093	1646208393	101	496	https://sra-download.ncbi.nlm.nih.gov/traces/sra62/SRR/011688/SRR11968955	SRX8512721	9w1118-ec	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811242	SAMN15192439	simple	7227	Drosophila melanogaster	w1118-ec-rep3	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	CFA33A602A41E07AC4EFBEED3D2A0FE3	4F5983B317885D4E8FFC4B3D312B7674
SRR11968964	6/9/2020 17:15	6/9/2020 17:12	22436848	2266121648	101	843	https://sra-download.ncbi.nlm.nih.gov/traces/sra49/SRR/011688/SRR11968964	SRX8512712	22w1118-l3	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811233	SAMN15192443	simple	7227	Drosophila melanogaster	w1118-l3-rep1	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	2D2BB637C1817EC80B369D1EF0B39615	2136B5CFE75B7833A2A6927CF26E701E
SRR11968963	6/9/2020 19:33	6/9/2020 17:14	19826612	2002487812	101	740	https://sra-download.ncbi.nlm.nih.gov/traces/sra45/SRR/011688/SRR11968963	SRX8512713	23w1118-l3	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811234	SAMN15192444	simple	7227	Drosophila melanogaster	w1118-l3-rep2	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	A832FE389916D06C16C6F21DB93AD77A	CE81D58F649BBBCECE551891816145AF
SRR11968962	6/9/2020 17:15	6/9/2020 17:12	20056763	2025733063	101	750	https://sra-download.ncbi.nlm.nih.gov/traces/sra11/SRR/011688/SRR11968962	SRX8512714	24w1118-l3	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811235	SAMN15192445	simple	7227	Drosophila melanogaster	w1118-l3-rep3	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	3F651F739352EAC0B28096237F2254EC	98BD1486600E785F9B5F8AC7DBCD4EA6
SRR11968954	6/9/2020 17:13	6/9/2020 17:10	16301608	1646462408	101	499	https://sra-download.ncbi.nlm.nih.gov/traces/sra20/SRR/011688/SRR11968954	SRX8512722	10w1118-sa	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811243	SAMN15192440	simple	7227	Drosophila melanogaster	w1118-sa-rep1	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	55D22FA303406FAE40145D8A1E62598B	3E5BEB8C6FF03B853BA64D10989542E1
SRR11968966	6/9/2020 17:10	6/9/2020 17:08	16076977	1623774677	101	485	https://sra-download.ncbi.nlm.nih.gov/traces/sra50/SRR/011688/SRR11968966	SRX8512710	11w1118-sa	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811231	SAMN15192441	simple	7227	Drosophila melanogaster	w1118-sa-rep2	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	9789FA28D07EBFD979E3DAE45E9D8CDF	54D9D5C5343EB9C9A7817434F1D4BB8B
SRR11968965	6/9/2020 17:10	6/9/2020 17:08	10379871	1048366971	101	316	https://sra-download.ncbi.nlm.nih.gov/traces/sra76/SRR/011688/SRR11968965	SRX8512711	12w1118-sa	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811232	SAMN15192442	simple	7227	Drosophila melanogaster	w1118-sa-rep3	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	908A23B3924A405F6BE6D5362130E7B3	8BD0419DF27093A94546D02492F6661C
SRR11968968	6/9/2020 17:11	6/9/2020 17:09	16112703	1627383003	101	494	https://sra-download.ncbi.nlm.nih.gov/traces/sra51/SRR/011688/SRR11968968	SRX8512708	1w1118-uc	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811229	SAMN15192431	simple	7227	Drosophila melanogaster	w1118-uc-rep1	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	39E07B4F04BC5A14AE664312E4DD5E67	27280649B15E4B766D86363C23679BE1
SRR11968967	6/9/2020 17:08	6/9/2020 17:06	9828233	992651533	101	302	https://sra-download.ncbi.nlm.nih.gov/traces/sra46/SRR/011688/SRR11968967	SRX8512709	2w1118-uc	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811230	SAMN15192432	simple	7227	Drosophila melanogaster	w1118-uc-rep2	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	C9868075DE213901336D4DD9D22A9B72	558FD3E03B55FA60A231537D5E8EE198
SRR11968961	6/9/2020 17:15	6/9/2020 17:11	16343251	1650668351	101	498	https://sra-download.ncbi.nlm.nih.gov/traces/sra70/SRR/011688/SRR11968961	SRX8512715	3w1118-uc	RNA-Seq	Oligo-dT	TRANSCRIPTOMIC	SINGLE	ILLUMINA	Illumina HiSeq 2500	SRP266662	PRJNA638422	638422	SRS6811236	SAMN15192433	simple	7227	Drosophila melanogaster	w1118-uc-rep3	unknown	no	SWISS FEDERAL INSTITUTE OF TECHNOLOGY LAUSANNE	SRA1085163	public	1725A91FA94755464378D8FF0F18A197	870D1C8B738F5A31C179B44124757B27

Fig 2. Transcriptome summaries from unchallenged whole larvae and hemocytes from unchallenged and infected larvae. (A) Transcriptome summary showing the number of reads for each triplicate in all experimental conditions with their corresponding number of mapped reads and the average percentage of alignment to the D. melanogaster genome. (B) Venn diagram representing the quantity of shared genes between all experimental treatments: Unchallenged wandering L3 larvae, hemocytes from unchallenged larvae, hemocytes from clean-pricked larvae (CI), hemocytes from larvae pricked with Escherichia coli (Ec), hemocytes from larvae pricked with Staphylococcus aureus (Sa).

Subset of data

Sample information	Run
22w1118-l3	SRR11968964
23w1118-l3	SRR11968963
24w1118-l3	SRR11968962
10w1118-sa	SRR11968954
11w1118-sa	SRR11968966
12w1118-sa	SRR11968965

cd /data/gpfs/assoc/bch709-1/wyim/RNA-Seq_example/
mkdir Drosophila && cd Drosophila

mkdir raw_data

mkdir trim

fastq-dump submission

#!/bin/bash
#SBATCH --job-name=fastqdump_Droso
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<youremail>
#SBATCH -o fastq-dump.out # STDOUT & STDERR
#SBATCH -p cpu-s2-core-0 
#SBATCH -A cpu-s2-bch709-1

fastq-dump  SRR11968964 --outdir ./raw_data  --gzip
fastq-dump  SRR11968963 --outdir ./raw_data  --gzip
fastq-dump  SRR11968962 --outdir ./raw_data  --gzip
fastq-dump  SRR11968954 --outdir ./raw_data  --gzip
fastq-dump  SRR11968966 --outdir ./raw_data  --gzip
fastq-dump  SRR11968965 --outdir ./raw_data  --gzip

Trim-galore

#!/bin/bash
#SBATCH --job-name=trim_Droso
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<youremail>
#SBATCH -o trim.out # STDOUT & STDERR
#SBATCH -p cpu-s2-core-0 
#SBATCH -A cpu-s2-bch709-1
#SBATCH --dependency=afterok:<PREVIOUS_JOBID>

trim_galore --cores 2  --max_n 40  --gzip -o trim raw_data/SRR11968964.fastq.gz --fastqc
trim_galore --cores 2  --max_n 40  --gzip -o trim raw_data/SRR11968963.fastq.gz --fastqc
trim_galore --cores 2  --max_n 40  --gzip -o trim raw_data/SRR11968962.fastq.gz --fastqc
trim_galore --cores 2  --max_n 40  --gzip -o trim raw_data/SRR11968954.fastq.gz --fastqc
trim_galore --cores 2  --max_n 40  --gzip -o trim raw_data/SRR11968966.fastq.gz --fastqc
trim_galore --cores 2  --max_n 40  --gzip -o trim raw_data/SRR11968965.fastq.gz --fastqc

Reference downloads

https://flybase.org/

cd /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila

mkdir bam

mkdir reference && cd reference

wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.36_FB2020_05/fasta/dmel-all-chromosome-r6.36.fasta.gz
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.36_FB2020_05/gtf/dmel-all-r6.36.gtf.gz
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.36_FB2020_05/fasta/dmel-all-CDS-r6.36.fasta.gz

gunzip dmel-all-chromosome-r6.36.fasta.gz
gunzip dmel-all-r6.36.gtf.gz
gunzip dmel-all-CDS-r6.36.fasta.gz

seqkit stats dmel-all-chromosome-r6.36.fasta
seqkit stats dmel-all-CDS-r6.36.fasta  

Reference index

#!/bin/bash
#SBATCH --job-name=reference_Droso
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<youremail>
#SBATCH -o trim.out # STDOUT & STDERR
#SBATCH -p cpu-s2-core-0 
#SBATCH -A cpu-s2-bch709-1

 STAR  --runThreadN 24 --runMode genomeGenerate --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/reference/ --genomeFastaFiles  /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/reference/dmel-all-chromosome-r6.36.fasta --sjdbGTFfile /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/reference/dmel-all-r6.36.gtf  --sjdbOverhang 99 -genomeSAindexNbases 12

Mapping reads

#!/bin/bash
#SBATCH --job-name=mapping_Droso
#SBATCH --cpus-per-task=2
#SBATCH --time=2-15:00:00
#SBATCH --mem=16g
#SBATCH --mail-type=all
#SBATCH --mail-user=<youremail>
#SBATCH -o trim.out # STDOUT & STDERR
#SBATCH -p cpu-s2-core-0 
#SBATCH -A cpu-s2-bch709-1
#SBATCH --dependency=afterok:<trim_Droso>

STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/trim/SRR11968964_trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/bam/SRR11968964.bam
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/trim/SRR11968963_trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/bam/SRR11968963.bam
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/trim/SRR11968962_trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/bam/SRR11968962.bam
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/trim/SRR11968954_trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/bam/SRR11968954.bam
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/trim/SRR11968966_trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/bam/SRR11968966.bam
STAR --runMode alignReads --runThreadN 8 --readFilesCommand zcat --outFilterMultimapNmax 10 --alignIntronMin 25 --alignIntronMax 100000 --genomeDir /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/reference/ --readFilesIn /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/trim/SRR11968965_trimmed.fq.gz --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/bam/SRR11968965.bam

Assignment

cd /data/gpfs/assoc/bch709-1/wyim/RNA-Seq_example
multiqc . -n rnaseq1

Please upload rnaseq1.html to Webcampus.

Investigate taxa

Here we introduce a software called Kraken2. This tool uses k-mers to assign a taxonomic labels in form of NCBI Taxonomy to the sequence (if possible). The taxonomic label is assigned based on similar k-mer content of the sequence in question to the k-mer content of reference genome sequence. The result is a classification of the sequence in question to the most likely taxonomic label. If the k-mer content is not similar to any genomic sequence in the database used, it will not assign any taxonomic label.

Donwload most recent database

https://benlangmead.github.io/aws-indexes/k2
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_16gb_20200919.tar.gz

cd /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example

mkdir Kraken2 && cd Kraken2
kraken2-inspect --db EXAMPLE_DB | head -5

kraken2  --threads 24 --report SRR11968954 --db /data/gpfs/assoc/bch709-1/Course_material/database/ /data/gpfs/assoc/bch709-1/<YOURID>/RNA-Seq_example/Drosophila/raw_data/SRR11968954.fastq.gz

"C"/"U": a one letter code indicating that the sequence was either classified or unclassified.

https://fbreitwieser.shinyapps.io/pavian/

BCH709 Introduction to Bioinformatics: 15_RNA-Seq tutorial

Importing Data from the NCBI Sequence Read Archive (SRA) using the DE

Course Evaluation

Discussion is open

Delete previous work

WORKTING PATH

Conda environment

SRA

Publication (Arabidopsis)

SRA Bioproject site

Runinfo

Subset of data

fastq-dump submission

Trim-galore

Reference downloads

Reference index

ERROR

Mapping

Replace YOURID to your id

Conda Environment

Publication (Drosophila)

SRA Bioproject site

Subset of data

fastq-dump submission

Trim-galore

Reference downloads

Reference index

Mapping reads

Assignment

Investigate taxa

Donwload most recent database