Set enrichment analysis

From a systems biology perspective, simple functional enrichment analysis is far from an efficient way to understand the molecular basis of a genome-scale experiment.

Methods inspired by systems biology can use lists of genes ranked by any biological criterion (e.g. differential expression between cases and healthy controls, genes with different evolutionary rates, etc.) and directly search for the distribution of blocks of functionally related genes across the list without imposing any artificial threshold. Any macroscopic observation that gives rise to this ranked list of genes will be the consequence of the cooperative action of genes arranged into functional classes, pathways, etc.

Each functional class responsible for the macroscopic observation will, consequently, be found at the extremes of the ranking with the highest probability. This perspective therefore avoids imposing a threshold on the rank values that ignores the cooperation among genes. Systems biology inspired methods directly search for groups of functionally related genes that accumulate significantly at the extremes of these ranked lists.


FatiScan implements a segmentation test that checks for asymmetrical distributions of biological labels (GO, KEGG pathways, InterPro motifs, Swiss-Prot keywords, microRNA, transcription factor and cisRED cis-regulatory elements) associated with genes ranked in a list.

Unlike other approaches of this type, this test only needs the ordered list of genes, not the original data that generated the ordering. This means that it can be applied to study the relationship between biological labels and any type of experiment whose outcome is a sorted list of genes. Genes sorted by differential expression between two experimental conditions can be studied, but so can genes correlated with a clinical variable (such as the level of a metabolite) or even with survival. Moreover, lists of genes ranked by any other experimental or theoretical criterion (e.g. physico-chemical properties, mutability, structural parameters, etc.) can be studied in order to understand whether some biological feature (among the labels used) is related to the experimental parameter studied.

We propose using this procedure to scan ordered lists of genes and understand the biological processes operating behind them. The procedure can be useful in situations in which it is not possible to obtain statistically significant differences from the experimental measurements alone (low prevalence diseases, etc.).

FatiScan works as follows:

  • First, a list of genes is ordered using experimental information on their differential expression, according to the phenotype studied in the experiment, or using another type of value (e.g. large-scale genotyping, evolutionary analysis, etc.). For example, genes can be ordered on the basis of their differential expression between two experimental conditions (e.g. pre and post drug administration, healthy versus diseased samples, etc.).
  • The second step studies the distribution of functional terms in different partitions of this list. Using Fisher's exact test to compare the partitions, FatiScan extracts functional terms that are significantly under- or over-represented in a set of genes.
  • Finally, a table with the significant terms obtained from the test can be used to detect significant asymmetrical distributions, across the list, of genes responsible for diverse biological processes.
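
The steps above can be illustrated with a minimal sketch. It is not FatiScan's actual code: the function names, the evenly spaced cut points and the toy gene identifiers are assumptions made for illustration, and Fisher's exact test is written out with the standard library so the example is self-contained.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)

    def prob(x):
        # Hypergeometric probability of x annotated genes in the upper partition.
        return comb(col1, x) * comb(total - col1, row1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, row1 - (total - col1)), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

def scan_term(ranked_genes, annotated, n_partitions=4):
    """Test one functional term at evenly spaced cut points of a ranked list.

    The cut-point rule is an assumption; it only illustrates the idea of
    comparing the two sides of successive partitions.
    """
    n = len(ranked_genes)
    results = []
    for k in range(1, n_partitions):
        cut = n * k // n_partitions
        upper, lower = ranked_genes[:cut], ranked_genes[cut:]
        a = sum(g in annotated for g in upper)  # annotated genes, upper partition
        c = sum(g in annotated for g in lower)  # annotated genes, lower partition
        b, d = len(upper) - a, len(lower) - c   # unannotated genes on each side
        results.append((cut, fisher_exact_two_sided(a, b, c, d)))
    return results

# Toy data: 20 ranked genes, with a hypothetical term annotating the top five.
ranked = [f"g{i}" for i in range(1, 21)]
term_genes = {"g1", "g2", "g3", "g4", "g5"}
results = scan_term(ranked, term_genes)
for cut, p in results:
    print(cut, round(p, 5))
```

In this toy case the term is concentrated at the top of the list, so the smallest p-value appears at the first cut point. A real analysis would repeat the scan for every functional term and correct the resulting p-values for multiple testing.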


MarmiteScan combines two tools: it applies the threshold-free method of FatiScan, which extracts blocks of related genes from a list of genes ordered by an associated value, to Marmite, a tool that finds differential distributions of bioentities extracted from PubMed between two groups of genes.

The data are provided by BioAlma, which generated the associations using almaKnowledgeServer.


Human genes are associated with a set of bioentities by a score indicating the importance of that co-occurrence in the literature. MarmiteScan takes a list of genes ordered by an associated value and applies a threshold-free method (FatiScan) that produces a series of partitions, each dividing the list into two groups. For each partition, MarmiteScan tests whether the distributions of the scores, that is, the importance of the association with a bioentity, differ between the two groups.

MarmiteScan tells you whether there is an enrichment of any bioentity in your list of sorted genes.

Bioentities and gene co-occurrences

Starting with a set of documents (e.g. the documents in which a certain gene or disease appears), we can define keywords as those words that are significantly overrepresented compared to a standard set or background. Words that appear at much higher frequencies than expected by chance alone can be considered the content words that capture the main features of this set of documents. In addition to single words, bi-grams (two adjacent words) were taken into account because in many cases they carry more information than single words (e.g. “cell cycle” vs. “cell”, “cycle”). In the following we refer to words and bi-grams as terms. All words were stemmed before further treatment to increase their statistical significance. For each term i, we count the number of documents in which i appears in the whole collection (Xi out of Ndoc, our background) and in a specific document set a (Xia out of Na). Then, based on the hypergeometric distribution, the likelihood of finding Xia documents in a set of size Na is computed for each term. The more unlikely this event, the more specific term i is for the document set.


   Na ... number of documents in set a
   Ndoc ... number of documents in the entire collection
   Xi ... number of documents in Ndoc where term i appears
   Xia ... number of documents in Na where term i appears

   Formula for calculating keyword relevance:
         Mean value for term i in collection Na:
         Mia = Na * (Xi / Ndoc)
         The standard deviation of the distribution:
         σia = sqrt(Mia * (1 - Xi/Ndoc) * (1 - Na/Ndoc))
         The Z-score for each term i in a; the higher the score, the more relevant the term is for the document set:
         Zia = (Xia - Mia) / σia
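
As a worked illustration, the three formulas above translate directly into a few lines of Python. The function name and the document counts in the example are hypothetical.

```python
from math import sqrt

def keyword_z_score(x_ia, x_i, n_a, n_doc):
    """Z-score of term i in document set a, following the formulas above."""
    m_ia = n_a * (x_i / n_doc)  # expected number of documents (Mia)
    sigma_ia = sqrt(m_ia * (1 - x_i / n_doc) * (1 - n_a / n_doc))  # σia
    return (x_ia - m_ia) / sigma_ia  # Zia

# Hypothetical counts: the term appears in 500 of 100000 documents overall
# and in 30 of the 1000 documents that make up set a.
z = keyword_z_score(x_ia=30, x_i=500, n_a=1000, n_doc=100000)
print(round(z, 2))
```

Here the expected count Mia is 5 documents, so observing 30 yields a large positive Z-score and the term would be considered highly specific for the set.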


MarmiteScan applies a serial partitioning process to the gene list according to the values associated with the genes. The size of the windows depends on those values.
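
One possible reading of this partitioning is sketched below. The exact rule MarmiteScan uses to place the cut points is not given here, so the evenly spaced value thresholds, the function name and the toy genes are all assumptions. The sketch only shows how cutting on values rather than on ranks makes the group sizes depend on the values themselves.

```python
def serial_partitions(gene_values, n_partitions):
    """Yield (threshold, top_genes, bottom_genes) for evenly spaced value thresholds.

    gene_values is a list of (gene_id, value) pairs. Evenly spaced thresholds
    between the largest and smallest value are an assumption of this sketch.
    """
    ordered = sorted(gene_values, key=lambda gv: gv[1], reverse=True)
    hi, lo = ordered[0][1], ordered[-1][1]
    step = (hi - lo) / (n_partitions + 1)
    for k in range(1, n_partitions + 1):
        threshold = hi - k * step
        top = [g for g, v in ordered if v >= threshold]
        bottom = [g for g, v in ordered if v < threshold]
        yield threshold, top, bottom

# Toy list: three genes with high values, one near zero, one negative.
genes = [("geneA", 2.0), ("geneB", 1.5), ("geneC", 1.4), ("geneD", 0.2), ("geneE", -1.0)]
parts = list(serial_partitions(genes, n_partitions=2))
for threshold, top, bottom in parts:
    print(round(threshold, 2), top, bottom)
```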

For each partition, MarmiteScan evaluates the differences between the gene-bioentity co-occurrence values (scores) of the two groups of genes (top genes and bottom genes). We apply a Kolmogorov-Smirnov test to each pair of distributions formed by the co-occurrence scores between a bioentity and the genes in the list. Null values are not included in the distributions being compared; that is, only genes with a score indicating co-occurrence with the bioentity are included.

MarmiteScan only evaluates bioentities associated with a minimum number of genes within both groups (the default and minimum is 5, although the user can set a higher value).

We first apply a one-sided test of whether the distribution of the top genes is greater than that of the bottom genes. If the p-value of this test is greater than 0.5, we then test the opposite hypothesis, that the distribution of the bottom genes is greater than that of the top genes. Finally, we report the more probable hypothesis, that is, the one with the smaller p-value.
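
The two-step decision described above could be sketched as follows. This is not MarmiteScan's code: the function names are hypothetical, and the p-value uses the simple asymptotic approximation exp(-2 * D^2 * n*m / (n + m)) for the one-sided statistic, which is crude for small groups. Note the direction convention: a group whose scores tend to be larger has the lower empirical distribution function.

```python
from math import exp

def ks_one_sided(x, y):
    """One-sided two-sample Kolmogorov-Smirnov test.

    Returns (d_plus, p), where d_plus is the largest value of F_x(t) - F_y(t)
    and p is the asymptotic approximation exp(-2 * d_plus**2 * n*m / (n + m)).
    """
    n, m = len(x), len(y)
    xs, ys = sorted(x), sorted(y)
    d_plus, i, j = 0.0, 0, 0
    for t in sorted(set(xs + ys)):
        while i < n and xs[i] <= t:  # i / n is the empirical CDF of x at t
            i += 1
        while j < m and ys[j] <= t:  # j / m is the empirical CDF of y at t
            j += 1
        d_plus = max(d_plus, i / n - j / m)
    return d_plus, exp(-2.0 * d_plus ** 2 * n * m / (n + m))

def directional_test(top_scores, bottom_scores):
    """Choose the direction to report, following the procedure described above."""
    # F_bottom lying above F_top means bottom scores are smaller: top is greater.
    _, p_top_greater = ks_one_sided(bottom_scores, top_scores)
    if p_top_greater <= 0.5:
        return "top > bottom", p_top_greater
    _, p_bottom_greater = ks_one_sided(top_scores, bottom_scores)
    if p_bottom_greater < p_top_greater:
        return "bottom > top", p_bottom_greater
    return "top > bottom", p_top_greater

# Hypothetical co-occurrence scores for one bioentity in the two groups.
top = [8.0, 9.0, 7.0, 10.0, 9.5]
bottom = [1.0, 2.0, 3.0, 2.5, 1.5]
label, p_value = directional_test(top, bottom)
print(label, round(p_value, 4))
```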

MarmiteScan takes the multiple testing problem into account and adjusts p-values using FDR.
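
The text does not say which FDR procedure is used. As an illustration only, the sketch below assumes the widely used Benjamini-Hochberg adjustment; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four bioentities.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
print([round(p, 4) for p in adjusted])
```

A bioentity would then be called significant when its adjusted p-value falls below the chosen threshold.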

Options to select

  • Type of entity - Users can evaluate their genes using three categories of bioentities (disease associated words, chemical products, word roots).
  • Filtering entities to test - Select the minimum number of genes with a score for an entity. Entities with fewer genes than this number in both lists will be excluded from the analysis. The default and minimum is 5.
  • Number of partitions - Select the number of partitions to make. Partitions are based on the values associated with the genes. You may choose values between 20 and 100.
  • Threshold p-value - Threshold used to classify a bioentity as significant. Choose a value between 0 and 0.2 (default: 0.05).
  • Number of entities to present in results - Select the number of bioentities presented on the results page. Entities with significant p-values are always shown, so this restriction never hides relevant information. Setting it to 0 means that only significant bioentities are shown.
  • Submit gene lists - Please tick this checkbox if your lists consist only of gene names [HGNC ids, HUGO ids, common names]. The annotations use HUGO ids, so MarmiteScan converts any gene id to a HUGO id through an Ensembl id; if you provide gene names, the conversion step is skipped. Note that if you provide gene names and do not tick the box, some genes may be excluded from the analysis because they match two Ensembl ids or because the Ensembl id matches two HUGO names.
  • Do you want us to sort genes/values for you? Indicate direction - Indicate whether your gene list is already ordered or whether you want us to order it for you. This option can also be used to change the hypothesis to test.


We only provide data for human.

File format

To submit your lists of genes, provide a column of gene identifiers followed by a column of values, separated by a tab, with one gene/value pair per line. For example:

  ENSG00000195449      2.05
  ENSG00000191414      2.02
  ENSG00000195603      1.95
  ENSG00000191766      1.83
  ENSG00000192778      1.56
  ENSG00000192318      1.23
  ENSG00000195909      1.22
  ENSG00000195044      1.10
  ENSG00000191421      0.85
  ENSG00000190549      0.84
  ENSG00000194579      0.79
  ENSG00000193697      0.53
  ENSG00000192817      0.41
  ENSG00000189656      0.12
  ENSG00000189674      0.01
  ENSG00000190567      -0.03
  ENSG00000195016      -0.12
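
Before submitting, a file in this format can be checked with a few lines of Python. This is a minimal sketch with a hypothetical function name; it splits on any whitespace rather than strictly on tabs, which is a leniency chosen for this example and not a statement about what the server accepts.

```python
def parse_gene_list(text):
    """Parse 'gene_id<TAB>value' lines into a list of (gene_id, value) pairs."""
    pairs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(fields)}")
        gene_id, value = fields
        pairs.append((gene_id, float(value)))  # float() raises on non-numeric values
    return pairs

sample = "ENSG00000195449\t2.05\nENSG00000191414\t2.02\nENSG00000190567\t-0.03\n"
gene_list = parse_gene_list(sample)
print(gene_list)
```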

Application example

This is a simple example of how MarmiteScan can be applied to the functional annotation of experiments.

We downloaded a microarray experiment from GEO: GDS715. It describes a set of Acute Myeloid Leukemia (AML) samples treated with a panel of compounds that induce, with varying success, their differentiation into mature cells. The gene expression data of each AML sample treated with a compound were compared to the expression data of the negative controls (AML cells and AML cells treated with compounds that do not alter gene expression). To compare the two conditions, we applied a Student t-test to every pair of classes: AML+compound and control.

The output of T-Rex is a set of lists of genes sorted by the t statistic or, in other words, by their importance in the difference between the compound action and the AML status. We then used MarmiteScan to give biomedical annotation to these lists.
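
To illustrate how such a sorted list arises, the sketch below computes a pooled-variance Student t statistic per gene and sorts the genes by it. It is not the T-Rex implementation; the function name, gene names and expression values are made up for the example.

```python
from math import sqrt
from statistics import mean, variance

def student_t(group_a, group_b):
    """Pooled-variance (Student) t statistic for two independent groups."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))

# Hypothetical expression values: gene -> (treated samples, control samples).
expression = {
    "geneA": ([8.1, 8.4, 8.0], [5.0, 5.2, 4.9]),
    "geneB": ([6.0, 6.1, 5.9], [6.0, 6.2, 5.8]),
    "geneC": ([3.0, 3.2, 2.9], [6.5, 6.4, 6.8]),
}
ranked = sorted(
    ((gene, student_t(treated, control)) for gene, (treated, control) in expression.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for gene, t_stat in ranked:
    print(f"{gene}\t{t_stat:.2f}")
```

The printed gene/value pairs follow the two-column layout described under "File format", so a list like this could be used as MarmiteScan input.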

Example of MarmiteScan input list files and output results:

AML + sulmazole:
Input file
Chemical products
Disease associated words

AML + fluorouridine:
Input file
Chemical products
Disease associated words

AML + phenanthroline:
Input file
Chemical products
Disease associated words

gene_set_analysis.txt · Last modified: 2010/05/31 00:10 by dmontaner