
Evolutionary tests

Phylemon offers a selection of tools to test different models of molecular evolution (Model selection), to test for differences in evolutionary rates (Relative rates tests) and to detect adaptation (Adaptation tests).

Model selection

Practical Problem

We have just obtained a multiple alignment of our favorite gene or protein across several species and would like to start uncovering its evolutionary history. First, however, we have to decide which parameters to estimate and correct for in order to calculate the simplest, most informative measure of distance or change among these species.

Fortunately, many models are available that cover the range and combinations of parameters that may explain how our sequence has changed/evolved. However, which one should we use? One that is too simple will be inaccurate since it will miss out on important variations and may lead to wrong inferences. One that is too complex may complicate our analysis by increasing the variance of our measurements and possibly preventing us from making inferences.

Methods

Model selection consists of evaluating the fit of various nucleotide or amino acid substitution models to our sequences, using implementations of jModelTest and ProtTest. For DNA, the implementation evaluates the fit of each model to our nucleotide data and obtains the log-likelihood values needed to select the best model through “hierarchical likelihood ratio tests” (hLRTs), “dynamic likelihood ratio tests” (dLRTs), the “Akaike Information Criterion” (AIC), the “Bayesian Information Criterion” (BIC) or a “decision-theoretic performance-based” approach (DT). For amino acid sequences, ProtTest provides many more options and models to test on our data, and allows more than one rate class to be used in the analysis.

LRT methods (hLRT, dLRT) and information criteria methods (AIC, BIC, DT) provide two different approaches for comparing the fit of the different models.
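As a minimal sketch of how both families of criteria use the log-likelihoods, the Python below computes AIC and BIC scores and a likelihood ratio test for two nested models that differ by one free parameter. The function names and the log-likelihood values are hypothetical and are not taken from jModelTest or ProtTest; the p-value uses the closed form of the chi-square distribution with one degree of freedom, so it only applies to that case.

```python
import math

def aic(lnl, k):
    """Akaike Information Criterion: -2 lnL + 2k (k = free parameters)."""
    return -2.0 * lnl + 2.0 * k

def bic(lnl, k, n):
    """Bayesian Information Criterion: -2 lnL + k ln(n) (n = sample size)."""
    return -2.0 * lnl + k * math.log(n)

def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by ONE parameter.

    Returns the statistic 2 * (lnL_alt - lnL_null) and its p-value under a
    chi-square distribution with one degree of freedom.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    pval = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, pval

# Hypothetical log-likelihoods for a simpler and a richer nested model.
lnl_simple, k_simple = -1000.0, 10
lnl_rich, k_rich = -995.0, 11
n_sites = 500  # assumption: alignment length used as the sample size

print("AIC:", aic(lnl_simple, k_simple), aic(lnl_rich, k_rich))
print("BIC:", bic(lnl_simple, k_simple, n_sites), bic(lnl_rich, k_rich, n_sites))
print("LRT:", lrt_one_df(lnl_simple, lnl_rich))
```

For AIC and BIC the model with the smallest score is preferred, and these scores can also compare models that are not nested; the LRT only applies to nested pairs.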

ProtTest HyPhy (version 1.0)

Purpose

ProtTest is a bioinformatic tool for the selection of the most appropriate model of protein evolution (among the set of candidate models) for the data at hand. ProtTest makes this selection by finding the model with the smallest Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) score. At the same time, ProtTest obtains model-averaged estimates of different parameters (Posada and Buckley 2004 1)) and calculates the importance of each of these parameters. ProtTest differs from its nucleotide homologue Modeltest (Posada and Crandall 1998 2)) in that it does not include likelihood ratio tests (many models implemented in ProtTest are not nested).

You can get more information in the ProtTest documentation, or by downloading the manual prottest_manual.pdf.

jModelTest (version 0.1.1)

Purpose

jModelTest is a tool to carry out statistical selection of best-fit models of nucleotide substitution. It implements five different model selection strategies: hierarchical and dynamical likelihood ratio tests (hLRT 3) and dLRT 4)), Akaike and Bayesian information criteria (AIC and BIC), and a decision theory method (DT). It also provides estimates of model selection uncertainty, parameter importances and model-averaged parameter estimates, including model-averaged phylogenies. The theoretical background is described elsewhere (Posada and Buckley 2004 5); Sullivan and Joyce 2005 6)).

You can get more information by downloading the manual.
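The model selection uncertainty, parameter importances and model-averaged estimates mentioned above can be illustrated with Akaike weights. The sketch below follows the usual definitions (weights proportional to exp(-Δ/2), importance as the summed weight of the models containing a parameter); the model names, AIC scores and parameter estimates are hypothetical, and the code is not the jModelTest or ProtTest implementation.

```python
import math

def akaike_weights(scores):
    """Turn AIC scores (model -> score) into weights that sum to 1."""
    best = min(scores.values())
    rel = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}

def parameter_importance(weights, models_with_param):
    """Sum the weights of all models that include a given parameter."""
    return sum(w for m, w in weights.items() if m in models_with_param)

def model_average(weights, estimates):
    """Weighted mean of a parameter over the models that estimate it.

    `estimates` maps model -> estimate and only lists models containing
    the parameter; weights are renormalised over those models.
    """
    total = sum(weights[m] for m in estimates)
    return sum(weights[m] * v for m, v in estimates.items()) / total

# Hypothetical AIC scores for three candidate models.
scores = {"HKY": 2020.0, "HKY+G": 2012.0, "GTR+G": 2014.0}
weights = akaike_weights(scores)

# Hypothetical gamma shape estimates from the two +G models.
alpha = {"HKY+G": 0.50, "GTR+G": 0.60}
print(weights)
print("importance of G:", parameter_importance(weights, set(alpha)))
print("model-averaged alpha:", model_average(weights, alpha))
```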

Relative rates tests

RRTree (version 1.1.11)

The program RRTree compares substitution rates between DNA or protein sequences, which may or may not be grouped into phylogenetically defined lineages. The methods involved are mostly described in the RRTree article (see Citation).

Description of main parameters in Phylemon

(Chicken,(Human,((Sheep,Pig),Horse,Mink),Mouse,Guinea_pig));
(Chicken,((Human,((Sheep,Pig)80,(Horse,Mink)45)95)45,(Mouse,Guinea_pig)45)100);
(Chicken:0.4,((Human:0.2,((Sheep:0.05,Pig:0.05):0.05,(Horse:0.1,Mink:0.1):0.0):0.1):0.0,(Mouse:0.2,Guinea_pig:0.2):0.0):0.2);
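The three trees above are written in Newick format: a bare topology, a topology with support values on internal nodes, and a topology with branch lengths. As a small sketch, the Python below extracts the leaf names from each string so that they can be checked against the sequence names in an alignment. The helper name `leaf_names` is hypothetical, and the regular expression is a simplification that assumes unquoted names, as in these examples.

```python
import re

# A leaf name directly follows "(" or "," and runs until the next
# Newick delimiter; support values and branch lengths follow ")" or ":"
# and are therefore skipped.
LEAF_PATTERN = re.compile(r"(?<=[(,])([^(),:;]+)")

def leaf_names(newick):
    """Return the leaf names of a Newick string with unquoted names."""
    return [m.strip() for m in LEAF_PATTERN.findall(newick) if m.strip()]

trees = [
    "(Chicken,(Human,((Sheep,Pig),Horse,Mink),Mouse,Guinea_pig));",
    "(Chicken,((Human,((Sheep,Pig)80,(Horse,Mink)45)95)45,(Mouse,Guinea_pig)45)100);",
    "(Chicken:0.4,((Human:0.2,((Sheep:0.05,Pig:0.05):0.05,(Horse:0.1,Mink:0.1):0.0):0.1):0.0,(Mouse:0.2,Guinea_pig:0.2):0.0):0.2);",
]

for tree in trees:
    print(sorted(leaf_names(tree)))
```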

You can get more at the official help page.

Adaptation tests

SLR

Documentation is extracted from BioPerl documentation 7), and T. Massingham and N. Goldman (2005).

Documentation of main methods

These options are available in Phylemon; other options are set to their default values (shown in square brackets) and described here.

Results

Results are presented in nine columns:

  1. Site: Number (position) of the site in the alignment
  2. Neutral: (minus) Log-probability of observing site given that it was evolving neutrally (omega=1)
  3. Optimal: (minus) Log-probability of observing site given that it was evolving at the optimal value of omega.
  4. Omega: The value of omega which maximizes the log-probability of observing the site.
  5. LRT_Stat: Log-likelihood ratio statistic for non-neutral selection (or positive selection if the positive_only option is set to 1). LRT_Stat = 2 * (Neutral-Optimal)
  6. Pval: P-value for non-neutral (or positive) selection at a site, unadjusted for multiple comparisons.
  7. Adj. Pval: P-value for non-neutral (or positive) selection at a site, after adjusting for multiple comparisons using the Hochberg procedure (see the file “MultipleComparisons.txt” in the doc directory).
  8. Result: A simple visual guide to the result. Sites detected as having been under positive selection are marked with a '+'; sites under purifying selection are marked with a '-'. The number of symbols indicates the significance threshold reached:

     Number of symbols   Threshold
     1                   95%
     2                   99%
     3                   95% after adjustment
     4                   99% after adjustment

     Occasionally the result may also contain an exclamation mark “!”. This indicates that the observation at a site is not significantly different from random (equivalent to infinitely strong positive selection), which may mean that the alignment at that site is bad.
  9. Note: The following events are flagged:

Synonymous: All codons at a site code for the same amino acid.
Single character: Only one sequence at the site is ungapped, the result of a recent insertion for example.
All gaps: All sequences at a site contain a gap character.

Sites marked “Single character” or “All gaps” are not counted towards the number of sites for the purposes of correcting for multiple comparisons since it is not possible to detect selection from none or one observation under the assumptions made by the sitewise likelihood ratio test.
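To make the relation between the result columns concrete, here is a minimal Python sketch that derives LRT_Stat from the Neutral and Optimal columns, converts it to a p-value, applies the Hochberg step-up adjustment and builds the Result symbols. It is an illustration and not the SLR code: the chi-square reference distribution with one degree of freedom and the mapping of the 95% and 99% thresholds to p-values of 0.05 and 0.01 are assumptions, and all function names are hypothetical.

```python
import math

def lrt_stat(neutral, optimal):
    """LRT_Stat = 2 * (Neutral - Optimal); both are minus log-probabilities."""
    return 2.0 * (neutral - optimal)

def pval_one_df(stat):
    """P-value, ASSUMING a chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

def hochberg_adjust(pvals):
    """Hochberg step-up adjusted p-values, returned in the input order."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * len(pvals)
    running = 1.0
    for rank, idx in enumerate(order):
        # The largest p-value is multiplied by 1, the next by 2, and so on.
        running = min(running, (rank + 1) * pvals[idx])
        adjusted[idx] = running
    return adjusted

def result_symbols(omega, pval, adj_pval):
    """Build the Result column: '+' or '-' repeated 0 to 4 times."""
    if adj_pval <= 0.01:
        count = 4
    elif adj_pval <= 0.05:
        count = 3
    elif pval <= 0.01:
        count = 2
    elif pval <= 0.05:
        count = 1
    else:
        count = 0
    return ("+" if omega > 1.0 else "-") * count

# Hypothetical raw p-values for three testable sites.
raw = [0.01, 0.04, 0.03]
print(hochberg_adjust(raw))
print(result_symbols(omega=2.5, pval=0.001, adj_pval=0.02))
```

Sites flagged “Single character” or “All gaps” would be left out of the list passed to `hochberg_adjust`, in line with the note above.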

PAML (version 4.4b)

PAML (for Phylogenetic Analysis by Maximum Likelihood) is a package of programs for phylogenetic analyses of DNA and protein sequences using maximum likelihood.

The PAML package currently includes the following programs: baseml, basemlg, codeml, evolver, pamp, yn00, mcmctree, and chi2. In Phylemon, we made available 2 of those tools.

yn00

The program yn00 implements the method of Yang and Nielsen (2000) 8) for estimating synonymous and nonsynonymous substitution rates between two sequences (dS and dN). The method of Nei and Gojobori (1986) 9) is also included. The ad hoc method implemented in the program accounts for the transition/transversion rate bias and codon usage bias, and is an approximation to the ML method accounting for the transition/transversion rate ratio and assuming the F3x4 codon frequency model.
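As a rough illustration of the simpler Nei and Gojobori (1986) approach, the sketch below shows only its final step: converting the proportions of synonymous and nonsynonymous differences into dS and dN with the Jukes-Cantor correction. Counting the sites and differences from the codon alignment is omitted, the input counts are hypothetical, and the function names are not part of yn00.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion of differences p (p < 0.75)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def ng86_rates(syn_diffs, nonsyn_diffs, syn_sites, nonsyn_sites):
    """Return (dS, dN) from counts of differences and of sites."""
    d_s = jukes_cantor(syn_diffs / syn_sites)
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    return d_s, d_n

# Hypothetical counts for one pair of sequences.
d_s, d_n = ng86_rates(syn_diffs=10, nonsyn_diffs=5,
                      syn_sites=100, nonsyn_sites=300)
print("dS =", d_s, "dN =", d_n, "dN/dS =", d_n / d_s)
```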

CodeML

The program codeml was formed by merging two older programs: codonml, which implements the codon substitution model of Goldman and Yang (1994) for protein-coding DNA sequences, and aaml, which implements models for amino acid sequences. These two are now distinguished by the variable seqtype in the control file codeml.ctl, with 1 for codon sequences and 2 for amino acid sequences. Here, codonml and aaml mean codeml with seqtype = 1 and 2, respectively. The programs baseml, codonml, and aaml use similar algorithms to fit models by maximum likelihood, the main difference being that the unit of evolution in the Markov model, referred to as a “site” in the sequence, is a nucleotide, a codon, or an amino acid for the three programs, respectively. Markov process models are used to describe substitutions between nucleotides, codons or amino acids, with substitution rates assumed to be either constant or variable among sites.
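The role of seqtype can be sketched with a few lines of Python that write a control file fragment. The variable names seqfile, treefile, outfile and seqtype come from the PAML control file; the file names and the helper function are hypothetical, and a real run will typically need further settings described in the PAML documentation.

```python
def render_codeml_ctl(seqfile, treefile, outfile, seqtype):
    """Render a minimal codeml.ctl fragment as text.

    seqtype: 1 for codon sequences (codonml), 2 for amino acids (aaml).
    """
    if seqtype not in (1, 2):
        raise ValueError("this sketch only covers seqtype 1 and 2")
    lines = [
        f"seqfile = {seqfile}",
        f"treefile = {treefile}",
        f"outfile = {outfile}",
        f"seqtype = {seqtype}  * 1: codons; 2: amino acids",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical file names, for illustration only.
ctl_text = render_codeml_ctl("alignment.phy", "tree.nwk", "results.txt", seqtype=1)
print(ctl_text)
```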

The main options that we made available in Phylemon are:

Table 1, extracted from the PAML documentation: Setups of partition models of nucleotide substitution

Sequence file   Control file   Parameters across genes
No G            Mgene = 0      everything equal
Option G        Mgene = 0      the same κ and π, but different cs (proportional branch lengths)
Option G        Mgene = 2      the same κ, but different πs and cs
Option G        Mgene = 3      the same π, but different κs and cs
Option G        Mgene = 4      different κ, πs, and cs
Option G        Mgene = 1      different κ, πs, and different (unproportional) branch lengths

We highly encourage users of both yn00 and CodeML to read the main PAML documentation file from which this help section is extracted, to understand all the parameters and their relations.

Citation

ProtTest

ProtTest

ProtTest: selection of best-fit models of protein evolution.
Abascal F, Zardoya R, Posada D
Bioinformatics 21: 2104-5 (2005 May 1)

PhyML

PAL

jModelTest

jModelTest

jModelTest: phylogenetic model averaging.
Posada D
Mol Biol Evol 25: 1253-6 (2008 Jul)

PhyML

PHYLIP

If you use jModelTest to build a model-averaged tree:

PHYLIP (Phylogeny Inference Package) version 3.6.
Felsenstein, J.
Distributed by the author. Department of Genome Sciences, University of Washington, Seattle (USA) (2004)

RRTree

RRTree: relative-rate tests between groups of sequences on a phylogenetic tree.
Robinson-Rechavi M, Huchon D
Bioinformatics 16: 296-7 (2000 Mar)

SLR

Detecting amino acid sites under positive selection and purifying selection.
Massingham T, Goldman N
Genetics 169: 1753-62 (2005 Mar)

PAML

Codeml

PAML 4: phylogenetic analysis by maximum likelihood.
Yang Z
Mol Biol Evol 24: 1586-91 (2007 Aug)

yn00

1)
Posada, D., and Buckley, T.R. 2004. Model Selection and Model Averaging in Phylogenetics: Advantages of AIC and Bayesian approaches over Likelihood Ratio Tests. Systematic Biology 53: 793-808.
2)
Posada, D., and Crandall, K.A. 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14: 817-818.
3)
Example of a particular forward hierarchy of likelihood ratio tests for 24 models. At any level the null hypothesis (model on top) is either accepted (A) or rejected (R). In this example the model selected is GTR+I. Extracted from jModelTest documentation.
4)
Dynamical likelihood ratio tests for 24 models. At any level a hypothesis is either accepted (A) or rejected (R). In this example the model selected is GTR+I. Hypotheses tested are: F = base frequencies; S = substitution type; I = proportion of invariable sites; G = gamma rates. Extracted from jModelTest documentation.
5)
Posada, D., and T. R. Buckley. 2004b. Model selection and model averaging in phylogenetics: advantages of Akaike Information Criterion and Bayesian approaches over likelihood ratio tests. Systematic Biology 53:793-808.
6)
Sullivan, J., and P. Joyce. 2005. Model selection in phylogenetics. Annual Review of Ecology, Evolution and Systematics 36:445-466.
8)
Yang, Z., & Nielsen, R. (2000). Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Molecular biology and evolution, 17(1), 32-43.
9)
Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution 3:418-426.
10)
Yang, Z. 2000b. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. Journal of Molecular Evolution 51:423-432.
11)
Yoder, A. D., and Z. Yang. 2000. Estimation of primate speciation dates using local molecular clocks. Molecular Biology and Evolution 17:1081-1090
12)
Yang, Z., and A. D. Yoder. 2003. Comparison of likelihood and Bayesian methods for estimating divergence times using multiple gene loci and calibration points, with application to a radiation of cute-looking mouse lemur species. Systematic Biology 52:705-716.
13)
Yang, Z., and A. D. Yoder. 2003. Comparison of likelihood and Bayesian methods for estimating divergence times using multiple gene loci and calibration points, with application to a radiation of cute-looking mouse lemur species. Systematic Biology 52:705-716.
14)
Yang, Z. 2004. A heuristic rate smoothing procedure for maximum likelihood estimation of species divergence times. Acta Zoologica Sinica 50:645-656
15)
Yang, Z. 1996a. Maximum-likelihood models for combined analyses of multiple sequence data. Journal of Molecular Evolution 42:587-596.
16) , 21)
Yang, Z. 2006. Computational Molecular Evolution. Oxford University Press, Oxford, England.
17)
McCullagh, P., and J. A. Nelder. 1989. Generalized linear models. Chapman and Hall, London.
18)
Yang, Z. 1995. A space-time process model for the evolution of DNA sequences. Genetics 139:993-1005
19)
Yang, Z. 1994a. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. Journal of Molecular Evolution 39:306-314.
20)
Yang, Z. 1996b. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367-372.
22)
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts.