RNA-Seq data analysis: unsupervised classification or clustering

Goal

Detect homogeneous groups of subjects according to their transcriptomic profiles.

Data

We are studying a complex disease in which a certain hormone is known to play an important role. To investigate it, we designed an RNA-Seq experiment in mice with two groups: 6 wild-type (WT) mice and 6 mice treated with the T3 hormone.

These data were obtained from a primary analysis that included sequence quality assessment, mapping, and quantification of expression at the gene level. We have expression levels (non-normalized counts) for 38,293 genes in the 12 mice described above.
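
As a complement to inspecting the file in a spreadsheet, the layout of a count matrix can also be checked with a few lines of Python. The sketch below is illustrative only: it assumes a tab-separated file with a header row of sample names and one row per gene (gene identifier first, then integer counts), which the handout does not specify. The function name read_count_matrix, the sample names and the values are hypothetical, and a small in-memory table stands in for rnaseq_12samples.txt.

```python
import csv
import io


def read_count_matrix(handle):
    """Read a tab-separated count matrix.

    Assumed layout (not confirmed by the handout): a header row with the
    sample names, then one row per gene with the gene ID followed by counts.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # assumption: the first header cell labels the gene column
    genes, counts = [], []
    for row in reader:
        if not row:  # skip empty lines
            continue
        genes.append(row[0])
        counts.append([int(value) for value in row[1:]])
    return samples, genes, counts


# Toy stand-in for the real file (hypothetical names and values).
toy_file = io.StringIO(
    "gene\tWT_1\tWT_2\tT3_1\tT3_2\n"
    "geneA\t10\t12\t100\t90\n"
    "geneB\t0\t0\t0\t0\n"
    "geneC\t500\t450\t480\t520\n"
)

samples, genes, counts = read_count_matrix(toy_file)

# Library size = total counts per sample (column sums).
library_sizes = [sum(column) for column in zip(*counts)]

print(len(genes), "genes x", len(samples), "samples")
print(dict(zip(samples, library_sizes)))
```

With the real file, replacing toy_file with open("rnaseq_12samples.txt", newline="") should report 12 samples and 38,293 genes if the layout assumption holds. Unequal library sizes are one reason the raw counts need to be normalized before clustering.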

Work plan

  1. Open the data file rnaseq_12samples.txt with a spreadsheet and inspect its contents. There will be as many columns as subjects and as many rows as genes.
  2. Upload this txt file to Babelomics from the “Upload” menu. We will have to indicate the type of data that we are uploading: “Data matrix expression”. This link describes the different types of data that we can use in Babelomics: https://github.com/babelomics/babelomics/wiki/Data-types.
  3. After loading the data, the first step is normalization. From “Processing / Normalization NGS: RNA-Seq”, select the file and choose a normalization method (we will start with TMM). Useful tip: when the normalization finishes, open the results and, in the “Job information” section, look up the identifier of the “Output folder”. We will need it later to tell Babelomics where the normalized data are.
  4. Once the data are normalized, we are ready to perform the clustering. From “Expression / Unsupervised analysis”, select the data (this is where we select the “Output folder” from the previous step, which contains the normalized data).
  5. Next, select clustering by samples. Choose a clustering method and a distance (to begin with, the default ones). Assign a name to the job and run it.
  6. Perform a clustering by genes (again starting with the default method and distance). Assign a name to the job and run it.
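
To make the normalize-then-cluster workflow above more concrete, here is a minimal sketch in plain Python. It is not what Babelomics runs internally: counts-per-million (CPM) followed by a log2 transform is used as a simpler stand-in for TMM, and the distance (1 minus the Pearson correlation) and the average-linkage merging rule are choices made for illustration, not the tool's documented defaults. All function names, sample names and counts are hypothetical.

```python
import math


def log_cpm(counts):
    """Scale each sample to counts per million, then apply log2(x + 1).

    counts[g][s] is the raw count of gene g in sample s.
    CPM is a simple stand-in here; it is NOT the TMM method.
    """
    sizes = [sum(column) for column in zip(*counts)]
    return [
        [math.log2(c * 1e6 / size + 1) for c, size in zip(row, sizes)]
        for row in counts
    ]


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (sd_x * sd_y)


def average_linkage(profiles, names, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise distance until n_clusters remain."""
    n = len(profiles)
    dist = {
        (i, j): pearson_distance(profiles[i], profiles[j])
        for i in range(n)
        for j in range(i + 1, n)
    }

    def linkage(a, b):
        pairs = [dist[(min(i, j), max(i, j))] for i in a for j in b]
        return sum(pairs) / len(pairs)

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return [sorted(names[i] for i in cluster) for cluster in clusters]


# Toy data: 5 genes x 4 samples (hypothetical values).
names = ["WT_1", "WT_2", "T3_1", "T3_2"]
counts = [
    [100, 110, 10, 12],
    [50, 45, 400, 420],
    [300, 310, 290, 305],
    [5, 6, 80, 75],
    [200, 190, 20, 25],
]

# Clustering by samples: each profile is one column of the normalized matrix.
sample_profiles = [list(column) for column in zip(*log_cpm(counts))]
groups = average_linkage(sample_profiles, names, n_clusters=2)
print(groups)
```

Clustering by genes follows the same pattern with the rows of the normalized matrix as profiles. Note that this naive implementation compares every pair of profiles, so its cost grows quickly with the number of items being clustered.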

Questions

  1. Are there groups of samples with a similar transcriptomic profile? How many groups appear?
  2. Does any sample behave anomalously compared with the other samples?
  3. Do you think that if we performed a differential expression analysis we would obtain a large number of differentially expressed genes?
  4. Did you run into any problems when clustering by genes?