Functional characterization of transcriptomic differences between lung cancer and kidney cancer


A. Goal

To functionally characterize, using different enrichment strategies (overrepresentation and GSA methods), the results of the differential expression analysis from a human RNA-Seq experiment designed to identify the transcriptomic differences between lung cancer and kidney cancer.


B. Data

RNA samples from the 10 patients in the study (5 with kidney cancer (K) and 5 with lung cancer (L)) were sequenced, and a primary analysis was performed, including sequence quality assessment, mapping, and quantification of expression at the gene level. The data were then normalized with the TMM method, and a differential expression analysis was performed with edgeR on the 29,405 quantified genes. The results were as follows:


C. Work plan

Open the “top list” data file in a plain-text editor (Notepad or similar) and inspect its contents. Also inspect the other two files.

C.1. Functional description

C.2. Statistical analysis: Statistical overrepresentation test

Then repeat the previous two steps (C.1 and C.2) with the “Bottom list” file.

C.3. Statistical analysis: Statistical enrichment test


D. Working from R

clusterProfiler implements methods for analyzing and visualizing functional profiles (GO, KEGG, DisGeNET, Reactome…) of genes and gene clusters. Please reproduce the analyses performed with PANTHER, but this time working directly in R with the clusterProfiler package (the script included in the exercise on “Functional evaluation of the effect of hormone T3 in RNA-Seq study” may be useful).

What do we ask for in this task?


Questions

A. Over-Representation Analysis (ORA). We want to characterize, on the one hand, the list of top genes (more expressed in K than in L) and, on the other hand, the list of bottom genes (more expressed in L than in K):

  1. How many genes are included in the input of each analysis?
  2. Within a gene list, are there repeated genes? If there were repeats, would it have any impact on the final functional result?
  3. Which nomenclature or “Type” of gene identifier do you use (that is, which database do the gene IDs belong to)?
  4. In each overrepresentation analysis, which two gene lists are we comparing?
  5. What is the universe of genes we are working with: all the genes in the experiment, all the genes with some annotated function, …?
  6. Regarding the functional annotation and considering the arguments of the enrichGO function:
    • Could you select the gene sets annotated to a function according to their size? Which function argument would you use?
    • What would be the point of eliminating gene sets with very few genes, or those with very many annotated genes?
    • Is there a function argument for working with only one ontology, or with all of them simultaneously?
  7. Generate a functional overrepresentation analysis for each of the 3 GO ontologies: CC, MF, BP.
  8. Is there any ontology that presents a clearly higher number of results than another one? If there are differences, why do you think this is?
  9. What is the meaning of each of the indicators that appear in the results? “ID”, “Description”, “GeneRatio”, “BgRatio”, “pvalue”, “p.adjust”, “qvalue”, “geneID”, “Count”?
  10. How do you functionally interpret the obtained results?
  11. Interpret the obtained graphs.
  12. Could we perform such an enrichment analysis if we have NO significant genes?
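
As a rough aid for questions 4, 5 and 9, the statistic behind an overrepresentation test can be sketched outside clusterProfiler. The Python sketch below (the exercise itself is run in R) is only an illustration, assuming a one-sided hypergeometric test on toy data. The gene names g1 to g20, the term membership, and the helper names `hypergeom_upper_tail` and `ora` are hypothetical and are not part of clusterProfiler or PANTHER.

```python
from math import comb


def hypergeom_upper_tail(k, n, M, N):
    """P(X >= k) when drawing n genes from a universe of N genes,
    M of which are annotated to the term (one-sided hypergeometric test)."""
    upper = min(n, M)
    favourable = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def ora(gene_list, term_genes, universe):
    """Toy overrepresentation test for a single term (hypothetical helper)."""
    genes = set(gene_list) & universe       # set() collapses repeated identifiers
    annotated = set(term_genes) & universe  # term members present in the universe
    k = len(genes & annotated)              # list genes annotated to the term
    n, M, N = len(genes), len(annotated), len(universe)
    return {
        "GeneRatio": f"{k}/{n}",  # annotated genes in the list / list size
        "BgRatio": f"{M}/{N}",    # annotated genes in the universe / universe size
        "Count": k,
        "pvalue": hypergeom_upper_tail(k, n, M, N),
    }


# Toy data (assumed, not taken from the experiment): 20 genes in the universe,
# 5 of them annotated to the term, and an input list in which g3 is repeated.
universe = {f"g{i}" for i in range(1, 21)}
term_genes = {"g1", "g2", "g3", "g4", "g5"}
top_list = ["g1", "g2", "g3", "g3", "g10", "g11", "g12"]

result = ora(top_list, term_genes, universe)
print(result)
```

With these toy inputs the repeated g3 is counted once, so GeneRatio is 3/6 and BgRatio is 5/20. The p.adjust and qvalue columns are left out of the sketch because they are multiple-testing corrections computed across all the tested terms, not for a single term.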


B. Gene Set Analysis (GSA).

  1. How many genes are included in the input? Why is it different from the overrepresentation analysis?
  2. Generate a GSEA for each of the 3 ontologies: CC, MF, BP.
  3. In the results, how many significant features were obtained? Do they seem few or many? Is it due to the type of study? Is this amount of information obtained manageable?
  4. What does each of the statistical indicators that appear in the results mean?
  5. Comparing the results of the GSA with the overrepresentation analyses, what could you indicate?
  6. Could we perform such an enrichment analysis if we have NO significant genes?
  7. Interpret the resulting GSEA plots, commenting on these points:
    • What do these graphs represent?
    • What is shown on the X and Y axes of each of the two graphs?
    • Do the genes annotated to the GO:0070161 term have a higher level of expression in kidney tumors or in lung tumors?
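
To help with reading the plots in question 7, the running-sum idea behind a GSEA-style enrichment score can be sketched as follows. This Python sketch (the exercise itself is run in R) assumes the classic weighted running sum over a list ranked by decreasing score. It does not reproduce the implementation used by clusterProfiler. The gene names, the scores, and the function name `running_sum` are hypothetical.

```python
def running_sum(ranked, gene_set, weight=1.0):
    """Running enrichment curve over a list of (gene, score) pairs
    sorted by decreasing score (classic weighted running sum)."""
    in_set = [gene in gene_set for gene, _ in ranked]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == len(ranked):
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # Each hit steps up in proportion to |score|**weight; each miss steps
    # down by a constant, so the curve always ends back at zero.
    hit_total = sum(abs(s) ** weight for (_, s), hit in zip(ranked, in_set) if hit)
    if hit_total == 0:
        raise ValueError("gene-set members must not all have a score of zero")
    miss_step = 1.0 / (len(ranked) - n_hits)
    curve, running = [], 0.0
    for (_, score), hit in zip(ranked, in_set):
        running += abs(score) ** weight / hit_total if hit else -miss_step
        curve.append(running)
    return curve


# Toy ranking (assumed values): positive scores at the top of the list,
# negative scores at the bottom.
ranked = [("g1", 3.0), ("g2", 2.0), ("g3", 1.0),
          ("g4", -1.0), ("g5", -2.0), ("g6", -3.0)]
gene_set = {"g1", "g2"}

curve = running_sum(ranked, gene_set)
es = max(curve, key=abs)  # enrichment score: largest deviation from zero
print(curve, es)
```

In this toy case both members of the set sit at the top of the ranking, so the curve peaks early and the enrichment score is positive. A set concentrated at the bottom of the ranking would give a negative score. Which condition the top of the ranking corresponds to depends on how the contrast was defined in the differential expression analysis.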