Functional characterization of transcriptomic differences between lung cancer and kidney cancer

  • Several functional enrichment methods are available through web tools and programming languages.
  • We will start with the web tool PANTHER, where we will apply several approaches to find out which functions the genes in our list of interest participate in.
  • Depending on the omics scenario and the research question, one analysis strategy may be more direct than the others. However, the methods provide complementary information, so it is often unnecessary to settle on just one of them: taken together, the functional “pictures” they offer help us work out where and how the biological or clinical signal under study arises.

A. Goal

To functionally characterize, by means of different enrichment strategies (overrepresentation analysis, ORA, and gene set analysis, GSA), the results of the differential expression analysis from a human RNA-Seq experiment designed to identify the transcriptomic differences between lung cancer and kidney cancer.


B. Data

RNA samples were sequenced from the 10 patients in the study (5 with kidney cancer (k) and 5 with lung cancer (l)). A primary analysis was then performed, including sequence quality assessment, mapping, and quantification of expression at the gene level. The data were normalized with the TMM method, and a differential expression analysis was carried out with edgeR for the 29,405 genes quantified. The results were as follows:


C. Work plan

Open the data file “top list” in a text editor and check its contents. Also check the other two files.

  • Step 1. From the home page of the tool, go to the “Gene List Analysis” tab. Upload the txt file “top list” or paste the gene ids into the box provided for this purpose.
  • Step 2. Select the organism. In this case, “human”.
  • Step 3. Choose the type of analysis. We will go through all the available options in turn:

C.1. Functional description

  • We will start with a description of the functions (GO terms, signaling pathways) annotated to these genes. For this we will use two options: “Functional classification viewed in gene list” and “Functional classification viewed in graphic charts”, which provide, respectively, the list and the graphical summary of the functions associated with this group of genes.
  • We need:
    1. After obtaining the functional classification list of these genes, save it to a file (“Send list to file”).
    2. From the option “Functional classification viewed in graphic charts”, obtain a bar chart for each of the Gene Ontology ontologies. Then represent the same information as a “pie chart”.

C.2. Statistical analysis: Statistical overrepresentation test

  • This method provides us with the functions that are overrepresented in our gene list versus the rest of the reference genome.
  • The functional results will characterize the genes included in our list of interest.
  • We would like to:
    1. Perform an analysis using the PANTHER GO-Slim Biological Process annotation set. Visualize the results in a “multiple pie chart”.
      • Interpret each of the indicators that appear in the output.
      • Are there functional differences between the reference genome and our gene list?
    2. Repeat the analysis, this time using the complete Biological Process ontology. Comment on these results.

Then repeat the previous two points with the “Bottom list” file.
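To make the idea of overrepresentation concrete, the comparison between our list and the reference can be sketched as a one-sided hypergeometric test. This is only an illustration in plain Python, not the code PANTHER runs (PANTHER lets you choose the test type). The function name `ora_pvalue` and the toy counts are assumptions made up for this example.

```python
from math import comb


def ora_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the reference (hypothetical toy value below)
    K: reference genes annotated to the function
    n: genes in our list of interest
    k: genes in our list annotated to the function
    """
    total = comb(N, n)  # every way of drawing n genes from the reference
    upper = min(n, K)   # we cannot observe more hits than this
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Toy example: 20 reference genes, 5 annotated to the function,
# a list of 4 genes of which 3 are annotated.
p = ora_pvalue(k=3, n=4, K=5, N=20)
print(round(p, 3))  # 0.032
```

A small p-value means that the function appears in our list more often than expected if the list had been drawn at random from the reference. In a real analysis one such test is run per function, so the p-values must then be corrected for multiple testing.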

C.3. Statistical analysis: Statistical enrichment test

  • This functional enrichment method incorporates information of interest (clinical, biological or statistical) that is used to rank the genes in the list.
  • The functional results will characterize all the genes in our experiment, taking into account additional information that weights each gene, in this case its level of differential expression (the contrast statistic). The input we need is therefore the “list of all ranked genes”.
  • We would like to:
    1. Perform an analysis using the PANTHER GO-Slim Biological Process.
      • Interpret each of the indicators that appear in the output.
      • What does it mean that some functions have a positive enrichment indicator while others have a negative one?
    2. Repeat the analysis but this time we will use all the Biological Process. Comment on the results.
    3. Some questions on the comparison of methods: GSA vs. ORA
      1. What is the difference in input between the two methods?
      2. Do you think you will find more results with GSA than with ORA methods?
      3. Will the gene sets that we detect as significant contain only genes that are significantly differentially expressed, or can they also contain genes that are not individually significant but share a common expression pattern within the gene set?
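The role of the ranking can be sketched with a rank-sum (Mann-Whitney style) comparison: the statistics of the genes annotated to a function are compared with those of all the other genes, and the sign tells us in which direction the function is shifted. This is a simplified illustration that assumes a normal approximation without tie correction. The names `midranks` and `rank_sum_enrichment`, the gene ids and the statistics are all made up for the example.

```python
from math import erfc, sqrt


def midranks(values):
    """Ascending ranks (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_enrichment(scores, gene_set):
    """Return (sign, z, two-sided p) for one gene set against the rest."""
    genes = list(scores)
    ranks = midranks([scores[g] for g in genes])
    in_set = [g in gene_set for g in genes]
    n1 = sum(in_set)
    n2 = len(genes) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("the gene set must split the ranked list in two")
    u = sum(r for r, m in zip(ranks, in_set) if m) - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return ("+" if z > 0 else "-"), z, erfc(abs(z) / sqrt(2))


# Hypothetical contrast statistics for eight genes.
scores = {"g1": 3.1, "g2": 2.4, "g3": 1.9, "g4": 0.3,
          "g5": -0.2, "g6": -1.1, "g7": -2.0, "g8": -2.8}
print(rank_sum_enrichment(scores, {"g1", "g2", "g3"})[0])  # +
print(rank_sum_enrichment(scores, {"g6", "g7", "g8"})[0])  # -
```

Note that every gene contributes through its rank, whether or not it is individually significant. Which group a “+” corresponds to depends on how the contrast was defined in the differential expression analysis.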

D. Working from R

clusterProfiler implements methods for analyzing and visualizing functional profiles (GO, KEGG, DisGeNET, Reactome…) of genes and gene clusters. Please reproduce the approaches performed with PANTHER, this time working directly in R with the clusterProfiler package (the script included in the exercise “Functional evaluation of the effect of hormone T3 in RNA-Seq study” may be useful).

What do we ask for in this task?

  • A PDF report generated from R Markdown that shows the code you have used as well as the results obtained.
  • The report should also comment on the results obtained and include the answers to the following questions.

Questions

A. Over-Representation Analysis (ORA). We want to characterize on the one hand the list of top genes (more expressed in K than in L) and on the other hand the list of bottom genes (more expressed in L than in K):

  1. How many genes are included in the input of each analysis?
  2. Within a gene list, are there repeated genes? If there were repeats, would it have any impact on the final functional result?
  3. What nomenclature or “Type” do you use? (That is, which database do the gene ids belong to?)
  4. In each overrepresentation analysis, which two gene lists are we comparing?
  5. What is the universe of genes we are working with: all the genes in the experiment, all the genes that have some annotated function…?
  6. Regarding the functional annotation and considering the arguments of the enrichGO function:
    • Could you select the groups of genes annotated to a function according to their size? Which function argument would you use?
    • Why would it make sense to eliminate gene sets with very few annotated genes, or those with very many?
    • Is there a function argument to work with only one ontology, or with all of them simultaneously?
  7. Generate a functional overrepresentation analysis for each of the 3 GO ontologies: CC, MF, BP.
  8. Is there any ontology that presents a clearly higher number of results than another one? If there are differences, why do you think this is?
  9. What is the meaning of each of the indicators that appear in the results? “ID”, “Description”, “GeneRatio”, “BgRatio”, “pvalue”, “p.adjust”, “qvalue”, “geneID”, “Count”?
  10. How do you functionally interpret the obtained results?
  11. Interpret the obtained graphs.
  12. Could we perform such an enrichment analysis if we have NO significant genes?
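As a reading aid for the result indicators, the sketch below shows how counts of the GeneRatio and BgRatio kind, and multiplicity-adjusted p-values, can be derived from a gene list, a universe and the genes annotated to one function. It is a simplified illustration in plain Python and not the clusterProfiler implementation. The function names, the parameters `min_size` and `max_size`, and the gene ids are hypothetical. The adjustment shown is Benjamini-Hochberg, one common choice.

```python
def ora_row(gene_list, universe, term_genes, min_size=2, max_size=5):
    """Build the count-based columns for one function (simplified)."""
    universe = set(universe)
    genes = set(gene_list) & universe  # drops repeats and unknown ids
    term = set(term_genes) & universe
    if not min_size <= len(term) <= max_size:
        return None                    # gene set too small or too large
    hits = genes & term
    return {
        "GeneRatio": f"{len(hits)}/{len(genes)}",
        "BgRatio": f"{len(term)}/{len(universe)}",
        "Count": len(hits),
        "geneID": "/".join(sorted(hits)),
    }


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position            # rank of this p-value, 1 = smallest
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy input: "B" is repeated and "Z" is not in the universe.
row = ora_row(["A", "B", "B", "C", "Z"], "ABCDEFGH", ["A", "B", "D"])
print(row)
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

In this sketch a repeated id is counted once and an id outside the universe is ignored, so both change the denominators of the ratios. Whether a real tool behaves the same way should be checked in its documentation.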


B. Gene Set Analysis (GSA).

  1. How many genes are included in the input? Why is it different from the overrepresentation analysis?
  2. Generate a GSEA for each of the 3 ontologies: CC, MF, BP.
  3. In the results, how many significant functions were obtained? Do they seem few or many? Is it due to the type of study? Is this amount of information manageable?
  4. What does each of the statistical indicators that appear in the results mean?
  5. Comparing the results of the GSA with the overrepresentation analyses, what could you indicate?
  6. Could we perform such an enrichment analysis if we have NO significant genes?
  7. Interpret the resulting GSEA plots, commenting on these points:
    • What do these graphs represent?
    • What is shown on the X and Y axes of each of the two graphs?
    • Does the group of genes annotated to the function GO:0070161 have a higher level of expression in kidney tumors or in lung tumors?
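When reading the plots, it may help to see how a running enrichment score is built. The sketch below walks down a ranked gene list, stepping up at each gene that belongs to the gene set and down otherwise, and reports the largest deviation from zero. It uses the unweighted (Kolmogorov-Smirnov like) form to stay short, which is an assumption: GSEA implementations usually weight each step by the ranking statistic. The function name and the gene ids are made up for the example.

```python
def running_enrichment_score(ranked_genes, gene_set):
    """Return (running scores, enrichment score) for one gene set.

    ranked_genes must be sorted from the highest to the lowest statistic.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("the gene set must split the ranked list in two")
    step_up, step_down = 1 / n_hits, 1 / n_miss
    running, total = [], 0.0
    for is_hit in hits:
        total += step_up if is_hit else -step_down
        running.append(total)
    # The enrichment score is the running value furthest from zero.
    return running, max(running, key=abs)


ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
curve_top, es_top = running_enrichment_score(ranked, {"g1", "g2", "g4"})
curve_bottom, es_bottom = running_enrichment_score(ranked, {"g6", "g7", "g8"})
print(round(es_top, 2), round(es_bottom, 2))
```

Plotting `curve_top` against the position in the ranked list gives a curve that peaks early, as expected for a gene set concentrated at the top of the ranking. A gene set concentrated at the bottom gives a curve with a negative minimum near the end.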