Set enrichment analysis

Methods inspired by systems biology can take a list of genes ranked by any biological criterion (e.g. differential expression between cases and healthy controls, differing evolutionary rates, etc.) and search directly for the distribution of blocks of functionally related genes across it, without imposing any artificial threshold. Any macroscopic observation that gives rise to such a ranked list will be the consequence of the cooperative action of genes arranged into functional classes or pathways.

Each functional class responsible for the macroscopic observation will, consequently, most probably be found at the extremes of the ranking. This perspective thus avoids imposing a threshold on the rank values that ignores the cooperation among genes. Instead, methods inspired by systems biology search directly for groups of functionally related genes that accumulate significantly at the extremes of these ranked lists.

RENATO includes two gene set analysis methods: FatiScan and a logistic model. The only input required for RENATO's set enrichment analysis is a ranked gene list.

FatiScan

FatiScan implements a segmentation test that checks for asymmetrical distributions of regulatory elements (microRNAs and transcription factors) associated with the genes in a ranked list.

[Figure: FatiScan]

Uniquely among approaches of this type, this test needs only the ordered list of genes, not the original data that generated the ordering. This means that it can be applied to study the relationship between regulatory elements and any type of experiment whose outcome is a sorted list of genes. Genes sorted by differential expression between two experimental conditions can be studied, but so can genes correlated with a clinical variable (such as the level of a metabolite) or even with survival. Moreover, lists of genes ranked by any other experimental or theoretical criterion (e.g. physico-chemical properties, mutability, structural parameters, etc.) can be studied in order to find out whether any regulatory element is related to the parameter under study.

FatiScan will work as follows:

  1. Ranking: First, a list of genes is ordered using experimental information on their differential expression, according to the phenotype studied in the experiment, or using another type of value (e.g. large-scale genotyping, evolutionary analysis, etc.). For example, genes can be ordered on the basis of their differential expression between two experimental conditions (e.g. healthy versus diseased samples).
  2. Distribution of regulatory elements: The second step studies the distribution of functional terms in different partitions of this list. Using Fisher's exact test to compare such partitions, FatiScan extracts the functional terms that are significantly under- or over-represented in a set of genes. In the figure, the rows for transcription factor 1 (TF1), TF2 and TF3 show the positions across the ranking of the genes that are targets of each TF. In this case, TF1 is completely uncorrelated with the arrangement, while TF2 and TF3 are clearly associated with high expression in experimental conditions B and A, respectively.
  3. A table with the significant terms obtained from the test can be used to detect significantly asymmetrical distributions of genes, responsible for diverse biological processes, across the list.
  4. Multiple testing correction: The P-values from the test of each regulatory element are adjusted for multiple testing by controlling the false discovery rate (FDR) (Benjamini et al., 1995; Storey and Tibshirani, 2003).
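
The partition-and-test idea in steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RENATO's actual code: the function names (fisher_exact_two_sided, segmentation_scan), the choice of four equal partitions, and the toy gene and target sets are all hypothetical. Each cut point splits the ranked list into an upper and a lower part, and a two-sided Fisher's exact test compares the number of targets of one regulatory element in the two parts.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x targets falling in the upper partition.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def segmentation_scan(ranked_genes, targets, n_partitions=4):
    """Test every cut point of the ranked list; return (cut_index, p_value) pairs."""
    n = len(ranked_genes)
    results = []
    for k in range(1, n_partitions):
        cut = round(k * n / n_partitions)
        upper, lower = ranked_genes[:cut], ranked_genes[cut:]
        a = sum(g in targets for g in upper)  # targets above the cut
        c = sum(g in targets for g in lower)  # targets below the cut
        p = fisher_exact_two_sided(a, len(upper) - a, c, len(lower) - c)
        results.append((cut, p))
    return results


# Toy ranked list (g1 = top of the ranking) and two hypothetical target sets.
ranked = [f"g{i}" for i in range(1, 21)]
clustered_targets = {"g1", "g2", "g3", "g4", "g5"}      # concentrated at the top
scattered_targets = {"g2", "g7", "g11", "g15", "g19"}   # spread along the list

scan_clustered = segmentation_scan(ranked, clustered_targets)
scan_scattered = segmentation_scan(ranked, scattered_targets)
for cut, p in scan_clustered:
    print(f"clustered targets, cut at {cut}: p = {p:.4g}")
for cut, p in scan_scattered:
    print(f"scattered targets, cut at {cut}: p = {p:.4g}")
```

The clustered set plays the role of TF2 or TF3 in the figure and the scattered set the role of TF1. In a real analysis the p-values of all regulatory elements would still need the multiple testing correction of step 4.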

Logistic model

Logistic regression has been used extensively as an enrichment method in genomics. The general approach models the probability of a gene belonging to a specific gene set as a function of an experimental dataset. RENATO's logistic regression method is based on the implementation of Sartor et al. (2008). In this case, the question that we try to answer is 'Does the probability of a gene being regulated by a specific regulatory element (TF or miRNA) increase as the significance of differential expression increases?'

Logistic model analysis works as follows:

  1. Ranking: As in FatiScan, the first step of the logistic model consists of ordering the list of genes according to the experimental information on their differential expression. For example, genes can be ordered on the basis of their differential expression between two experimental conditions (e.g. healthy versus diseased samples).
  2. Create the model: For each regulatory element we model the log-odds of a gene being regulated by this regulatory element as a linear function of the experimental measurement,
    log( P(Gr) / (1 - P(Gr)) ) = α + β·x
    where P(Gr) is the probability of a gene being regulated by a specific regulatory element, x is the experimental measurement for that gene, α is the intercept and β is the slope. Then, when β > 0, we conclude that the regulatory element is ‘enriched’.
  3. Multiple testing correction: The P-values from the test of each regulatory element are adjusted for multiple testing by controlling the false discovery rate (FDR) (Benjamini et al., 1995; Storey and Tibshirani, 2003).
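
The model fit and the FDR adjustment described above can be sketched as follows. This is a minimal illustration, not RENATO's or Sartor et al.'s code, and it rests on several assumptions: the slope is tested with a Wald test (the page does not say which test produces the P-values), the adjustment uses the Benjamini-Hochberg step-up procedure, and all function names and example values are hypothetical. The fit uses plain Newton-Raphson and would not converge on perfectly separable data.

```python
import math


def fit_logistic(x, y, n_iter=25):
    """Fit log-odds(y = 1) = alpha + beta * x by Newton-Raphson.

    Returns (alpha, beta, se_beta). Assumes the data are not perfectly separable.
    """
    alpha = beta = 0.0
    for _ in range(n_iter):
        # Gradient (g0, g1) and information matrix entries (h00, h01, h11).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    # Standard error of beta from the inverse information matrix of the last step.
    se_beta = math.sqrt(h00 / det)
    return alpha, beta, se_beta


def wald_p_value(beta, se_beta):
    """Two-sided Wald test p-value for the null hypothesis beta = 0."""
    z = beta / se_beta
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical data: x is the experimental measurement of each gene,
# y is 1 when the gene is a target of the regulatory element under test.
x = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5, -0.5, -1.0, -1.5, -2.0, -2.5, -3.0]
y = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

alpha, beta, se_beta = fit_logistic(x, y)
p_value = wald_p_value(beta, se_beta)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, p = {p_value:.3f}")

# One p-value per regulatory element would be collected and adjusted together.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

A positive fitted β together with a small adjusted P-value corresponds to the 'enriched' conclusion of step 2.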
gene_set_analysis.txt · Last modified: 2012/04/23 18:13 by mbleda