2.11. Loading RNA-seq data

2.11.1. Load RNA-seq information

Before inserting RNA-seq count tables, you need to load information about the experiments and samples from which the data were generated. Machado focuses on the GEO/NCBI database as the source of RNA-seq experiment information and data. From GEO/NCBI we get identifiers for the series (e.g.: GSE85653), which describe the studies/projects as a whole. From a GEO series we get identifiers for biosamples or, in CHADO lingo, biomaterials (e.g.: GSM2280286). From the GEO biosamples we get identifiers for RNA-seq experiments (or assays), usually from the SRA database (e.g.: SRR4033018). SRA identifiers link to the raw data you may want to analyse.

Machado requires a .csv file describing every SRA data file for the RNA-seq assays that will be loaded.

Each line of this file must contain the following fields, in this order:

“Organism scientific name (e.g.: ‘Oryza sativa’)”, “GEO series identifier (e.g.: GSE85653)”, “GEO sample identifier (e.g.: GSM2280286)”, “SRA identifier (e.g.: SRR4033018)”, “Assay description (e.g.: ‘Heat stress leaves rep1’)”, “Treatment description (e.g.: ‘Heat stress’)”, “Biomaterial description (e.g.: ‘Leaf’)”, “Date (in format ‘%b-%d-%Y’, e.g.: Oct-16-2016)”.

A sample line for such a file can be seen below:

Oryza sativa,GSE85653,GSM2280286,SRR4033018,Heat leaf rep1,Heat stress,Leaf,May-30-2018
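
Before loading, it can help to check that each line has the eight expected fields and a date in the ‘%b-%d-%Y’ format. The following is a minimal Python sketch and is not part of Machado; the function name check_rnaseq_info_line is hypothetical, and month abbreviations are parsed according to the current locale (English month names are assumed here).

```python
import csv
import io
from datetime import datetime

EXPECTED_FIELDS = 8          # number of fields described above
DATE_FORMAT = "%b-%d-%Y"     # e.g. Oct-16-2016


def check_rnaseq_info_line(line):
    """Return a list of problems found in one CSV line (empty if none).

    Hypothetical helper for illustration; Machado does its own checks.
    """
    rows = list(csv.reader(io.StringIO(line)))
    if not rows:
        return ["empty line"]
    fields = rows[0]
    if len(fields) != EXPECTED_FIELDS:
        return [f"expected {EXPECTED_FIELDS} fields, found {len(fields)}"]
    problems = []
    try:
        # The date is the last of the eight fields.
        datetime.strptime(fields[-1], DATE_FORMAT)
    except ValueError:
        problems.append(f"date {fields[-1]!r} does not match {DATE_FORMAT}")
    return problems


sample = (
    "Oryza sativa,GSE85653,GSM2280286,SRR4033018,"
    "Heat leaf rep1,Heat stress,Leaf,May-30-2018"
)
print(check_rnaseq_info_line(sample))
```

An empty list means no problem was detected by these two checks; it does not guarantee that the identifiers exist in GEO or SRA.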

To load such a file, use a command like the one below. The databases for the project, biomaterial and assay are required:

python manage.py load_rnaseq_info --file file.csv --biomaterialdb GEO --assaydb SRA
  • Loading this file can be faster if you increase the number of threads (--cpu).
python manage.py load_rnaseq_info --help
--file .csv file *
--biomaterialdb Biomaterial database info (e.g.: ‘GEO’)
--assaydb Assay database info (e.g.: ‘SRA’)
--cpu Number of threads

* Any text editor can be used to make such a file.
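
The file can also be generated with a short script when there are many assays. The sketch below uses Python's csv module and is only an illustration: the records list, its dictionary keys and the function write_rnaseq_info are hypothetical names, not part of Machado, and the month abbreviation produced by strftime depends on the current locale (an English locale is assumed).

```python
import csv
from datetime import date

# Hypothetical records; replace them with your own series, samples and runs.
records = [
    {
        "organism": "Oryza sativa",
        "series": "GSE85653",
        "sample": "GSM2280286",
        "run": "SRR4033018",
        "assay_description": "Heat leaf rep1",
        "treatment": "Heat stress",
        "biomaterial": "Leaf",
        "date": date(2018, 5, 30),
    },
]


def write_rnaseq_info(path, records):
    """Write one CSV line per record, in the field order described above."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for record in records:
            writer.writerow([
                record["organism"],
                record["series"],
                record["sample"],
                record["run"],
                record["assay_description"],
                record["treatment"],
                record["biomaterial"],
                # Format the date as '%b-%d-%Y', e.g. May-30-2018.
                record["date"].strftime("%b-%d-%Y"),
            ])
```

Calling write_rnaseq_info("file.csv", records) writes a file that can then be passed to load_rnaseq_info with --file.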

2.11.2. Remove RNA-seq information

If, for any reason, you need to remove RNA-seq information relationships, use the command remove_file --name. Every relation from the file (e.g. ‘file.csv’) will be deleted on cascade.

python manage.py remove_file --help
  • This command requires the file name (e.g. ‘file.csv’) previously used as input to load the RNA-seq information.

2.11.3. Load RNA-seq data

To load expression count tables for RNA-seq data, provide a tabular file. It can contain data from several RNA-seq experiments (assays), one per column. This file should have the following header:

“Gene identifier” “SRA Identifier 1” “SRA Identifier 2” … “SRA Identifier n”

Example of a header for a file that contains two assays/experiments:

gene    SRR5167848.htseq        SRR2302912.htseq

Each line in the body of the table holds a gene identifier followed by the counts for that gene in each experiment.

Example of a line from such a file:

AT2G44195.1.TAIR10     0.0     0.6936967934559419

Note that the count fields can have floats or integers, depending on the normalization used (usually TPM, FPKM or raw counts).
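
The layout described above can be checked before loading. The following Python sketch is not Machado's loader; it assumes the columns are separated by tabs, and the function name read_count_table is hypothetical. It verifies that every line has as many columns as the header and that every count parses as a number.

```python
import csv
import io


def read_count_table(handle):
    """Parse a tab-separated count table.

    Returns (assay_columns, {gene_identifier: [counts as floats]}).
    Raises ValueError on a missing header, a wrong column count,
    or a count that is not a number.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        raise ValueError("empty file: no header found")
    assays = header[1:]  # first column is the gene identifier
    counts = {}
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(
                f"line {line_number}: expected {len(header)} columns, "
                f"found {len(row)}"
            )
        # float() accepts both integer and floating-point counts.
        counts[row[0]] = [float(value) for value in row[1:]]
    return assays, counts


example = (
    "gene\tSRR5167848.htseq\tSRR2302912.htseq\n"
    "AT2G44195.1.TAIR10\t0.0\t0.6936967934559419\n"
)
assays, counts = read_count_table(io.StringIO(example))
print(assays)
print(counts)
```

For a real file, pass an open file handle instead of the io.StringIO object used here.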

The gene identifier must already be loaded as a feature, usually from the organism’s genome annotation .gff file.

We used the output of the LSTrAP program as the standard format for this file.

python manage.py load_rnaseq_data --file file.tab --organism 'Oryza sativa' --programversion 1.3 --assaydb SRA
  • By default the program name is ‘LSTrAP’, but it can be changed with --program.
  • By default the data is taken as normalized (TPM, FPKM, etc.), but this can be changed with --norm.
  • Loading this file can be faster if you increase the number of threads (--cpu).
python manage.py load_rnaseq_data --help
--file Tabular text file with gene counts per line
--organism Scientific name (e.g.: ‘Oryza sativa’)
--programversion Version of the software (e.g.: ‘1.3’) (string)
--name Optional name (string)
--description Optional description (string)
--algorithm Optional algorithm description (string)
--assaydb Optional assay database info (e.g.: ‘SRA’) (string)
--timeexecuted Optional date the software was run. Mandatory format: e.g.: ‘Oct-16-2016’ (string)
--program Optional name of the software (default: ‘LSTrAP’) (string)
--norm Optional normalized data: 1-yes (TPM, FPKM, etc.); 0-no (raw counts); default is 1 (integer)

2.11.4. Remove RNA-seq data

If, for any reason, you need to remove RNA-seq data relationships, use the command remove_file --name. Every relation from the file (e.g. ‘file.tab’) will be deleted on cascade.

python manage.py remove_file --help
  • This command requires the file name (e.g. ‘file.tab’) previously used as input to load the RNA-seq data.