Simple enrichment analysis

The ultimate aim of a typical genomic experiment is to find a molecular explanation for a given macroscopic observation: for instance, which pathways are affected by glucose deprivation in a cell, or which biological processes differentiate a healthy control from a diseased case. This functional interpretation of the data is usually performed in two steps:

  1. Genes of interest are selected, for example because they co-express in a cluster or because they are significantly over- or under-expressed when two classes of experiments are compared.
  2. The enrichment of any type of biologically relevant annotation in these genes is assessed by comparing its distribution to the corresponding distribution of the annotation in the background, typically the remaining genes.

Several tools are available, such as FatiGO (Al-Shahrour et al., 2004) and others (Zeeberg et al., 2003; Khatri and Draghici, 2005), that use different functionally relevant annotations such as GO terms (Ashburner et al., 2000) and KEGG pathways (Kanehisa et al., 2004).

Simple enrichment approaches are known to be less sensitive than set enrichment analyses. Whenever possible, set enrichment analysis is preferred over its simple enrichment counterpart.

FatiGO

FatiGO takes two lists of genes (ideally a group of interest and the rest of the genes in the experiment, although any two groups, formed in any way, can be tested against each other) and converts them into two lists of GO annotations using the corresponding gene-term or protein-term annotation table. A Fisher's exact test for 2×2 contingency tables is then used to check for significant over-representation of GO annotations in one of the sets with respect to the other. Multiple testing correction is applied to account for the multiple hypotheses tested (one for each functional term).
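
The test for a single term can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FatiGO's own code: the function name and the counts are hypothetical, and only the one-sided form (over-representation in list 1) is shown.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: genes in list 1 annotated with the term
    b: genes in list 1 without the term
    c: genes in list 2 annotated with the term
    d: genes in list 2 without the term
    Returns the hypergeometric probability of seeing at least `a`
    annotated genes in list 1 when the margins of the table are fixed.
    """
    n1 = a + b                # size of list 1
    annotated = a + c         # annotated genes in both lists together
    total = a + b + c + d
    upper = min(n1, annotated)
    tail = sum(
        comb(annotated, k) * comb(total - annotated, n1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, n1)

# Hypothetical counts: 13 of 20 genes in list 1 carry the term,
# against 50 of 500 genes in list 2.
p_enriched = fisher_exact_greater(13, 7, 50, 450)
```

One such p-value is obtained per functional term, which is why the multiple testing correction mentioned above is needed.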

In addition to Gene Ontology (Ashburner et al., 2000) annotations, FatiGO can simultaneously test other functional and regulatory annotations, including KEGG pathways (Kanehisa et al., 2004), InterPro motifs (Mulder et al., 2003), microRNAs (Griffiths-Jones et al., 2006), TFBSs (Wingender et al., 2000), cisRED motifs (Robertson et al., 2006) and BioCarta pathways. The distribution of any combination (or all) of these annotations between two groups of genes can be tested simultaneously by means of Fisher's exact test. All p-values are adjusted by FDR (Benjamini and Hochberg).
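
As an illustration of the FDR adjustment, the following minimal sketch applies the Benjamini and Hochberg step-up procedure to a list of p-values. The function name and the example values are hypothetical, and the tool's own implementation may differ in detail.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values, one per functional term.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

A term is then reported as significant when its adjusted p-value falls below the chosen threshold.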

FatiGO and the inclusive analysis

The structure of the functional labels has an important impact on the strategy for performing the test. For example, KEGG pathways have a “flat” organization, with one or more pathways per gene. Terms in GO, on the other hand, have a hierarchical structure called a DAG (directed acyclic graph), in which each term can have one or more child terms as well as one or more parent terms. Terms at higher levels of the hierarchy (closer to the root) describe more general functions or processes, while terms at lower levels are more specific. The level at which a gene is annotated in the GO hierarchy depends on how much detail the annotator had about its biological behaviour. Testing terms organised in this way poses an additional difficulty because in some cases they are not exclusive but only describe the same behaviour at different levels of detail (e.g. what is the point of testing apoptosis versus regulation of apoptosis?). In the inclusive analysis a level of the hierarchy is therefore chosen for the test, and genes annotated with terms that are descendants of a term at the chosen level take the annotation from that ancestor. If the level corresponding to, for example, apoptosis is selected, any gene annotated as either apoptosis or any of its child terms is considered to be in the same category (apoptosis) for the test. This increases the power of the test, because there are fewer terms to be tested, each with more genes (Al-Shahrour et al., 2004, 2005).
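
The idea of lifting annotations to a chosen level can be sketched as follows. The term hierarchy, the level numbers and the gene names are a toy example invented for illustration (they are not the real GO structure), and the function names are hypothetical.

```python
def ancestors_or_self(term, parents):
    """Collect a term together with all of its ancestors in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def inclusive_annotation(gene_terms, parents, level_of, target_level):
    """Replace each gene's terms by their ancestors at the target level."""
    lifted = {}
    for gene, terms in gene_terms.items():
        lifted[gene] = {
            t
            for term in terms
            for t in ancestors_or_self(term, parents)
            if level_of[t] == target_level
        }
    return lifted

# Toy hierarchy (child -> parents), assumed for illustration only.
parents = {
    "cell death": ["biological process"],
    "apoptosis": ["cell death"],
    "regulation of apoptosis": ["apoptosis"],
    "induction of apoptosis": ["regulation of apoptosis"],
}
level_of = {
    "biological process": 1,
    "cell death": 2,
    "apoptosis": 3,
    "regulation of apoptosis": 4,
    "induction of apoptosis": 5,
}
gene_terms = {
    "geneA": {"induction of apoptosis"},
    "geneB": {"apoptosis"},
    "geneC": {"cell death"},
}
lifted = inclusive_annotation(gene_terms, parents, level_of, 3)
```

In this toy case geneA and geneB both end up in the apoptosis category, while geneC, annotated only above the chosen level, contributes no term at that level.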

FatiGO data and format

FatiGO supports many gene identifiers for each organism (HGNC symbol, UniProt/Swiss-Prot, UniProtKB/TrEMBL, Ensembl IDs, RefSeq, EntrezGene, Affymetrix, Agilent, PDB, Protein Id, IPI…); the supported identifiers can be checked in the ID converter. These identifiers must be annotated in Ensembl: any gene not annotated in Ensembl will be lost in the analysis (please see the Ensembl documentation).

The input data format is a list with one gene or protein identifier per line. For example, a list of Saccharomyces cerevisiae identifiers:

YAL011W
GAL83
YDR116C
YGL104C
KNS1
ECM2
YHL018W
CDC45
YHL010C
YHR199C
SNO2
YJR141W
YOR059C
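
Before uploading, a list in this format can be checked locally. The sketch below is a hypothetical helper, not part of Babelomics: it reads one identifier per line, skips blank lines and drops repeated identifiers while keeping the original order.

```python
def read_id_list(text):
    """Parse an id list: one identifier per line, duplicates removed."""
    seen = set()
    ids = []
    for line in text.splitlines():
        identifier = line.strip()
        # Skip empty lines and identifiers that were already read.
        if identifier and identifier not in seen:
            seen.add(identifier)
            ids.append(identifier)
    return ids

# Small example with a blank line, stray spaces and a repeated id.
example = "YAL011W\nGAL83\n\n  YDR116C \nGAL83\n"
ids = read_id_list(example)
```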

Help explaining all parameters and output results is available on the FatiGO tool parameters page.

FatiGO Worked examples

How the functional profiling should never be done
It is not uncommon to find the following assertion in papers and talks: “then we examined our set of genes selected in this way (whatever) and we discovered that 65% of them were related to metabolism, so we can conclude that our experiment activates metabolism genes”. This may or may not be true, depending on the relative abundance of this term. If you look at the rest of the genes, those not activated in the experiment, and the proportion of them related to metabolism is, let's say, 10%, then you are right. Conversely, if the proportion is, let's say, 61%, then the experiment probably has nothing to do with metabolism. A statistical comparison is compulsory to support such assertions.

Comparing two lists of genes
There are many situations in which the comparison of two lists of genes answers a relevant biological question; in fact, a large number of problems can be addressed in this way. For example, one might be interested in knowing whether a group of co-expressing genes is functionally related. Typically this implies comparing a set of genes that clustered together (by any clustering method) to the rest of the genes. Another commonly addressed question is whether genes that are differentially expressed between two experimental conditions are functionally related. Many other similar questions are commonly asked when analysing microarray data or, in general, genomic data. FatiGO has been specifically designed to answer this kind of question.

1 Exploring differences in GO terms with FatiGO, basics

The simplest use of the tool is to take a quick look at the functional processes in which a set of genes takes part. The submitted list of genes is analysed against the rest of the genome to assess the significance of the abundance of GO terms or other annotation sets.

  1. Here you can find the corresponding file, for this worked example, containing a list of genes of Saccharomyces cerevisiae. Save this file to your desktop or local directory and upload it to Babelomics as an idlist type of data.
  2. Create a new project (e.g. workedExample1) and start a new FatiGO job in the Functional analysis » Single enrichment analysis section of the tools.
  3. Choose the Id List vs. Rest of genome option.
  4. Choose as List1 the data you have already uploaded.
  5. In the Options section choose Over-represented terms in List1, as we want to compare the functions of our list of genes against the rest of the genome. Since we already know our list of genes, we are sure that it contains no duplicates, but it is better to check or to apply a duplicate management option.
  6. Choose the organism database Saccharomyces cerevisiae.
  7. In the Database section check GO - biological process, then click the options link and set the range of levels from 3 to 6.
  8. Give a name to the new job (e.g. example1FatiGO).
  9. Keep the rest of the parameters as default.
  10. Submit the job (press the run button).

The number of significant functional terms is summarised in a table. If you take a look at the significant results, you can sort them by adjusted p-value.

You will get a summary table with the number of significant GO terms associated with the genes, followed by a table for each database with information about the test for each of the significant functional terms. Each table can be sorted by the difference in the percentage of genes annotated with the GO term in each list, by the p-value or by the adjusted p-value, and is shown along with a graphical distribution of these frequencies. As you can see, the red bars are darker than the blue ones. This means that the terms found are enriched only in List1, the list we submitted, as we chose the Over-represented terms in List1 option. In this example the significant terms are quite general, as they belong to levels 3 to 6. You can also see a graphical representation of the Gene Ontology terms coloured by their adjusted p-value.

Submit other jobs, playing around with other parameters of the Gene Ontology database (ontology, maximum and minimum level, and the direct annotation option, using the parents of the terms where the genes are directly annotated), other databases and the p-value.

2 Exploring differences in other functional information with FatiGO, basics

As in the previous worked example, FatiGO can be used to check other functional information such as pathways, motifs and transcription factors.

  1. Here you can find the corresponding file, for the second worked example, containing a list of genes of Homo sapiens. Save this file to your desktop or local directory and upload it as a gene - idlist data type.
  2. Create a new project (e.g. workedExample2) and start a new FatiGO job in the Functional analysis » Single enrichment analysis section of the tools.
  3. Choose the Id List vs. Rest of genome option.
  4. Choose as List1 the data you have already uploaded.
  5. In the Options section choose Over-represented terms in List1, as we want to compare the functions of our list of genes against the rest of the genome. Since we already know our list of genes, we are sure that it contains no duplicates, but it is better to check or to apply a duplicate management option.
  6. Choose the organism database Homo sapiens.
  7. In the Database section check the GO - biological process, KEGG Pathways and Biocarta check boxes, then click the options link of GO - biological process and set the range of levels from 9 to 10 to perform a deeper analysis of the ontology.
  8. Give a name to the new job (e.g. example2FatiGO).
  9. Keep the rest of the parameters as default.
  10. Submit the job (press the run button).

The number of significant functional terms for each database is summarised in a table. If you take a look at the significant results, you can sort them by adjusted p-value.

You will get a summary table with the number of significant GO terms associated with the genes, followed by a table for each database with information about the test for each of the significant functional terms. Each table can be sorted by the difference in the percentage of genes annotated with the term in each list, by the p-value or by the adjusted p-value, and is shown along with a graphical distribution of these percentages. As you can see, the red bars are darker than the blue ones. This means that the terms found are enriched only in List1, the list we submitted, as we chose the Over-represented terms in List1 option.

The GO terms are related to apoptosis, and so are the significant KEGG pathways. The significant BioCarta pathways are also related to the apoptotic process (which is not surprising, given that the list was selected to contain genes related to apoptosis).

Afterwards, launch more jobs choosing other databases, or several at a time, and change the option parameters.

3 Exploring differences in GO terms with FatiGO

Let us exemplify the application of FatiGO with a classical example. We use the data from Chu et al. (1998), The Transcriptional Program of Sporulation in Budding Yeast, Science, 282, 699-705, and cluster the genes according to their expression patterns. We choose a cluster of co-expressing genes and check the hypothesis that “genes of similar function will tend to co-express”.

  1. The files for the third worked example correspond to a cluster of co-expressing genes and the rest of the genes in the experiment, both from Saccharomyces cerevisiae. Save both files to your desktop or local directory and upload them as a gene - idlist data type.
  2. Create a new project (e.g. sporulation) and start a new FatiGO job in the Functional analysis » Single enrichment analysis section of the tools.
  3. Choose the Id List vs. Id List option.
  4. Choose as List1 the data sporulation_clus42 and as List2 the data sporulation_all_but_clus42 that you have already uploaded.
  5. In the Options section choose Over-represented terms in List1, as we want to compare the functions of our cluster of co-expressed genes against the rest of the clusters. Since we already know our list of genes, we are sure that it contains no duplicates, but it is better to check or to apply a duplicate management option.
  6. Choose the organism database Saccharomyces cerevisiae.
  7. In the Database section check the GO - biological process check box, click the options link and set the range of levels from 3 to 13 to perform a deeper analysis of the ontology. Also check GO - Cellular Component and set its range of levels from 6 to 9.
  8. Give a name to the new job (e.g. sporulationFatiGO).
  9. Keep the rest of the parameters as default.
  10. Submit the job (press the run button).

If we compare the cluster to the rest of the genes in the experiment, we can see that several terms related to meiosis and to the chromosome component are significantly overrepresented in the cluster of co-expressing genes. Keep in mind that this test assumes that you do not have any a priori hypothesis on what biological process is operating in this particular cluster of genes.

4 Exploring differences in gene ontology, pathways and reactions with FatiGO

Similarly, you can explore functional differences using other biologically relevant terms, such as pathway membership or reactions in Reactome. FatiGO can be used for this purpose.

  1. The files for the fourth worked example correspond to a list of genes related to apoptosis, which will be compared to a list of genes extracted from chromosome 19 of Homo sapiens. Save both files to your desktop or local directory and upload them as a gene - idlist data type.
  2. Create a new project (e.g. apoptosis) and start a new FatiGO job in the Functional analysis » Single enrichment analysis section of the tools.
  3. Choose the Id List vs. Id List option.
  4. Choose as List1 the data fatigo_apoptosis and as List2 fatigo_chr19 that you already uploaded.
  5. In the Options section choose Over-represented terms in List1. Choose the option to manage duplicates separately in the two lists.
  6. Choose the organism database Homo sapiens.
  7. In the Database section check GO - Biological Process, GO - Cellular Component, GO - Molecular Function, Reactome and Biocarta.
  8. Give a name to the new job (e.g. apoptosis_vs_chr19_FatiGO).
  9. Keep the rest of the parameters as default.
  10. Submit the job (press the run button).

Note that the significant results are enriched only in the apoptosis-related list. The terms are associated with programmed cell death, as can be seen in the GO term descriptions or the BioCarta pathways. The clearest result is that the only significant Reactome reaction is, not surprisingly, apoptosis.

FatiGO exercise

Studying the function of a set of co-expressing genes using FatiGO

The exercise has three parts. First we cluster the genes, then we extract a cluster, and finally we compare it to the rest of the genes in the experiment in order to see whether one or more biologically relevant terms are overrepresented in the cluster.

The data set used corresponds to an experiment, carried out by a group at Stanford University, about the previously mentioned diauxic shift in S. cerevisiae (DeRisi et al., 1997, Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. Science, 278, 680-686). Diauxie describes the growth phases of a microbial culture as it metabolizes a mixture of sugars. During the first phase, cells preferentially metabolize the sugar whose catabolism is most efficient (often glucose). Only after the first sugar has been exhausted do the cells switch to the second. At the time of the diauxic shift there is often a lag period during which the cell produces the enzymes needed to metabolize the second sugar. The diauxic shift frequently represents a change in metabolism from glucose fermentation to aerobic respiration as the glucose is depleted.

1st part: clustering the gene expression patterns

  1. Download the Diauxic Shift Data Set.
  2. Log into Babelomics.
  3. Upload the data set as a data matrix - expression data type.
  4. Choose the clustering tool in the Expression tab and set these parameters:
    • Select the diauxic data you just uploaded to Babelomics.
    • As type of clustering, select clustering by genes.
    • Select SOTA as the clustering method.
    • Select Pearson correlation coeff. as the distance measure.
    • Give a name to your job (e.g. diauxic_clust) and press the run button.
  5. Once the job has finished you can inspect the clustering of genes. To extract a cluster, move the mouse around the picture until you reach the 1447 gene cluster.

2nd part: extract the genes of the cluster

Click on the profile of the 1447 gene cluster. A pop-up window will show a list of the genes belonging to the cluster. You can download the cluster by copying and pasting it into a text file, or you can send it directly to FatiGO for the functional analysis. The aim is to test our cluster against the rest of the genes in the experiment.

3rd part: analyse the cluster in FatiGO

The clustering tool will send the extracted cluster as List 1 and the remaining genes as List 2. Don't forget to choose Over-represented terms in List1 as the Fisher's exact test option, and to select the complementary list, the species and the functional databases to test in your cluster. Keep in mind that we want to functionally characterize the 1447 gene cluster with respect to the rest of the genes in the experiment.
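
If the cluster is downloaded as a text file instead, the complementary list can be built by hand. The following is a minimal sketch with a hypothetical helper name and placeholder identifiers; it simply splits the genes of the experiment into the cluster (List 1) and everything else (List 2).

```python
def split_cluster(all_genes, cluster_genes):
    """Split the experiment's genes into the cluster and its complement."""
    cluster = set(cluster_genes)
    list1 = [g for g in all_genes if g in cluster]
    list2 = [g for g in all_genes if g not in cluster]
    return list1, list2

# Placeholder identifiers standing in for the genes of the experiment.
all_genes = ["YAL011W", "YDR116C", "YGL104C", "YHL018W", "YJR141W"]
cluster_genes = ["YDR116C", "YJR141W"]
list1, list2 = split_cluster(all_genes, cluster_genes)
```

The two lists are disjoint by construction, which is what the comparison of a cluster against the rest of the experiment requires.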

Terms related to phosphorylation in biological process, terms related to the mitochondria and the respiratory chain in cellular component, and the oxidative phosphorylation KEGG pathway are directly involved in the diauxic shift process studied in this experiment.

Try other clusters and other functional databases.

Marmite: single enrichment with text-mining derived annotations

Marmite stands for My Accurate Resource for MIning TExt and implements single enrichment analysis with text-mining derived annotations. Text-mining methods make it possible to extract informative annotations (bioentities) with different meanings (functional, chemical, clinical, etc.) that can be associated with genes. In this case, the association of an annotation with a gene has a strength derived from the number of times that the gene and the annotation are co-cited in PubMed abstracts. A Kolmogorov-Smirnov test is used instead of the conventional Fisher's exact test. Multiple testing correction is applied to account for the multiple hypotheses tested (one for each annotation).
Data is provided by BioAlma who generated the associations using almaKnowledgeServer.

Bioentities and gene co-occurrences

Starting with a set of documents (e.g. the documents where a certain gene or disease appears) we can define keywords as those words that are significantly overrepresented compared to a standard set or background. These words, which appear with much higher frequencies than one would expect by chance, can be considered the content words that capture the main features of this set of documents. In addition to single words, bi-grams (two adjacent words) were taken into account because in many cases these terms contain more information than single words (e.g. “cell cycle” vs. “cell”, “cycle”). We refer to words and bi-grams as terms in the following. All words were stemmed before further treatment to increase the statistical significance of words. For each term i, the number of documents where i appears is calculated both in the whole collection of documents (Xi in Ndoc, our background) and in a specific document set a (Xia in Na). Then, based on the hypergeometric distribution, the likelihood of finding Xia documents in a set of size Na is computed for each term. The more unlikely this event is, the more specific the term i is for the document set.

Definitions:

 Na ... number of documents in the set a
 Ndoc ... number of documents in the entire collection
 Xi ... number of documents where term i appears in Ndoc
 Xia ... number of documents where term i appears in Na

Formula for calculating keyword relevance:

 Mean value for term i in collection Na:
 Mia = Na * (Xi / Ndoc)
 The standard deviation of the distribution:
 σia = sqrt(Mia * (1 - Xi/Ndoc) * (1 - Na/Ndoc))
 The Z-score for each term i in a; the higher the score, the more relevant the term is for the document set:
 Zia = (Xia - Mia) / σia
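
The three formulas translate directly into code. The sketch below uses a hypothetical function name and invented document counts; it assumes the term appears in at least one document but not in all of them, and that the set a is smaller than the whole collection, so that the standard deviation is not zero.

```python
from math import sqrt

def keyword_z_score(x_ia, n_a, x_i, n_doc):
    """Z-score of term i for document set a, following the formulas above.

    x_ia:  documents of set a containing the term (Xia)
    n_a:   documents in set a (Na)
    x_i:   documents of the whole collection containing the term (Xi)
    n_doc: documents in the whole collection (Ndoc)
    """
    mean = n_a * (x_i / n_doc)                              # Mia
    sd = sqrt(mean * (1 - x_i / n_doc) * (1 - n_a / n_doc))  # sigma_ia
    return (x_ia - mean) / sd                               # Zia

# Invented counts: the term occurs in 500 of 10000 documents overall
# and in 40 of the 200 documents of the set.
z = keyword_z_score(x_ia=40, n_a=200, x_i=500, n_doc=10000)
```

With these counts the expected number of documents is 10, so observing 40 gives a large positive score and the term would be considered specific for the set.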

Statistics

Marmite evaluates the differences between the gene-bioentity co-occurrence values (scores) for two lists of genes. We apply a Kolmogorov-Smirnov test to each pair of distributions (one per list) formed by the scores of the co-occurrences between a bioentity and the genes within the list. Null values are not included in the distributions being evaluated; that is, only genes with a score indicating co-occurrence with the bioentity are included.

Marmite only evaluates bioentities associated with a minimum number of genes within both lists (the minimum, and default, is 5, although it can be set by the user). This gives the user a way to control the level of representation of the bioentities presented in the results with respect to each list as a whole. Each test is applied on both sides; that is, for each entity we apply two tests, to see whether the distribution of list 1 is greater or smaller than the distribution of list 2. Marmite takes the multiple testing problem into account and adjusts p-values using FDR.
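
The two one-sided comparisons for a single bioentity can be sketched as follows. This is not Marmite's code: the function names are hypothetical, the p-values use the standard large-sample approximation (which is crude for small lists), and requiring the minimum number of genes in each list separately is an assumption about how the filter is applied.

```python
from math import exp

def ks_one_sided(scores1, scores2):
    """Return (d_plus, d_minus) for two samples of scores.

    d_plus  = max over x of F1(x) - F2(x): list 1 scores tend to be smaller
    d_minus = max over x of F2(x) - F1(x): list 1 scores tend to be greater
    F1 and F2 are the empirical distribution functions of the two lists.
    """
    s1, s2 = sorted(scores1), sorted(scores2)
    n1, n2 = len(s1), len(s2)
    d_plus = d_minus = 0.0
    i = j = 0
    for x in sorted(set(s1) | set(s2)):
        while i < n1 and s1[i] <= x:
            i += 1
        while j < n2 and s2[j] <= x:
            j += 1
        diff = i / n1 - j / n2
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    return d_plus, d_minus

def asymptotic_p(d, n1, n2):
    """Large-sample one-sided p-value: exp(-2 * n1*n2/(n1+n2) * d^2)."""
    return exp(-2.0 * n1 * n2 / (n1 + n2) * d * d)

def compare_bioentity(scores1, scores2, min_genes=5):
    """Run both one-sided tests, or return None if a list is too small."""
    n1, n2 = len(scores1), len(scores2)
    if n1 < min_genes or n2 < min_genes:  # assumed reading of the filter
        return None
    d_plus, d_minus = ks_one_sided(scores1, scores2)
    return {
        "list1_smaller": asymptotic_p(d_plus, n1, n2),
        "list1_greater": asymptotic_p(d_minus, n1, n2),
    }

# Invented scores for one bioentity in the two gene lists.
result = compare_bioentity([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

The p-values collected over all bioentities would then be adjusted for multiple testing.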

Options to select

  • Select your data: Your data must be a list of human gene identifiers uploaded as an idlist data type. The annotations were created using HUGO ids, so Marmite converts any gene identifier to a HUGO id through an Ensembl id; if you provide gene names, the conversion step is omitted.
  • Bioentity name: Users can evaluate their genes using three categories of bioentities: disease associated words, chemical products or word roots.
  • Filtering entities to test: Select the minimum number of genes with a score for an entity. Entities with fewer than this number in both lists will be excluded from the analysis. The minimum and default is 5.
  • Filtering entities in the results: Select the number of bioentities presented in the results page. A normal analysis may evaluate about 200-300 words, depending on the lists and gene relevance. If all evaluated bioentities were shown, the output could be overly long or even too big for a normal browser to display. Therefore, Marmite restricts the output to 50 entities, a number that can be changed by the user. Entities with significant p-values are always shown, so this restriction never hides relevant information.

Species

We only provide annotations for Homo sapiens.

Application example

This is a simple example on how Marmite could be applied to the functional annotation of experiments.

We used data from a microarray experiment [ West et al. (2005) PLoS Biol 3:e187 ] that studies the differences in the transcriptome between two types of soft tissue tumours (muscles, tendons, fibrous tissues, nerves, etc.): SFT (Solitary Fibrous Tumor) and DTF (Desmoid-type fibromatosis).

These tumours are very different in clinical behaviour but quite similar histologically. This makes microarrays a very useful experimental approach for learning more about this kind of cancer and the differences in gene expression, and a valuable technique for inferring new markers for diagnosis.

Briefly, the experiment includes classes for DTF, SFT and other types of cancer; see West et al. for more details.

According to the authors the gene expression patterns are quite different for these two types of tumours and their separation in clusters is very clear.

On this premise, we extracted the samples (columns) for these two classes and then applied an unsupervised clustering algorithm (the SOTA method) to the preprocessed gene expression matrix to obtain groups of genes with similar expression patterns. Using the visualization of the clustering we took the two main clusters of genes which, if the premise is true, give two groups of genes with different expression profiles in the two classes of tumours (SFT and DTF). Important differences in expression patterns can be seen between the two clusters in classes SFT and DTF. The grouping has therefore succeeded in separating genes with important roles in the two classes.

We extracted these two clusters and obtained two lists of genes that we used as input for Marmite (list1, list2).

Marmite can extract the differences in the distribution of the co-occurrence measures that these two lists of genes have against three lists of bioentities: disease associated words, chemical products and word roots. The bioentities are associated with genes by a score, which measures the weight of the association found between each pair (gene-word) in the scientific literature.

As a result, Marmite returned a few bioentities (words) with significant differences between the two lists (list1 > list2).

These data are very valuable for annotating our experiment: list 1 contains genes with very specific co-occurrences with words of great importance in the characterization of this kind of tumour.

enrichment_analysis.txt · Last modified: 2011/03/10 11:10 by mmarba