Evolutionary tests
Phylemon proposes a selection of tools and software to test different models of molecular evolution (Model selection
), to test divergence of rates (Relative rates tests
) and to detect adaptation (Adaptation tests
).
Model selection
Practical Problem
We have just obtained a multiple alignment of our favorite gene or protein against other species and would like to start uncovering the history of its evolution. However first we have to define what parameters to measure and correct for in order to calculate the simplest most informative measure of a distance or change among these species.
Fortunately, many models are available that cover the range and combinations of parameters that may explain how our sequence has changed/evolved. However, which one should we use? One that is too simple will be inaccurate since it will miss out on important variations and may lead to wrong inferences. One that is too complex may complicate our analysis by increasing the variance of our measurements and possibly preventing us from making inferences.
Methods
Model selection consists in evaluating the fit of various nucleotide or amino acid substitution models on sequences through an implementation of the Method of JModelTest, and Prot Test. The implementation for DNA will actually evaluate the fit of each model on our nucleotide data and obtain the log-likelihood values necessary for selecting the best model through “hierarchical likelihood ratio tests” (hLRTs), “dynamic likelihood ratio tests” (dLRTs), “Akaike Information Criterion” (AIC), “Bayesian Information Criterion” (BIC) or “decision-theoretic performance-based” approach (DT). Nevertheless, in case of amino acid sequences ProtTest provides many more options and models to test on our data and allow using more than one rate class for the analysis.
LRT methods (hLRT, dLRT) and Information Criteria methods (AIC, BIC DT) provide two different methods for the comparison of the fit of the different models, in example:
The AIC or Akaike Information Criterion is an asymptotically unbiased estimator of the Kullback - Leibler information quantity, which is a measure of the information that is lost when a model is used to approximate full reality. It permits comparing models which are not nested by attempting to find the minimal model (least number of parameters) that correctly explains the data. The model with the lowest AIC score is selected as the one having the best fit to the data.
The hierarchical comparison of models allows comparing nested models by goodness of fit through likelihood ratio test (hLRT). The available models are tested in a hierarchical manner following a decision tree scheme of which one example is shown to the right. It is important to note that here, depending on the confidence value threshold selected for rejecting the null hypothesis, the best model may actually not be tested depending on the route taken after each step or step
ProtTest HyPhy (version 1.0)
Purpose
ProtTest is a bioinformatic tool for the selection of the most appropriate model of protein evolution (among the set of candidate models) for the data at hand. ProtTest makes this selection by finding the model with the smallest Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) score. At the same time, ProtTest obtains model-averaged estimates of different parameters (Posada and Buckley 2004 1)) and calculates the importance of each of these parameters. ProtTest differs from its nucleotide homologue Modeltest (Posada and Crandall 1998 2)) in that it does not include likelihood ratio tests (many models implemented in ProtTest are not nested).
You can get more information here ProtTest documentation, or by downloading this documentation prottest_manual.pdf.
jModelTest (version 0.1.1)
Purpose
jModelTest is a tool to carry out statistical selection of best-fit models of nucleotide substitution. It implements five different model selection strategies: hierarchical and dynamical likelihood ratio tests (hLRT 3) and dLRT 4)), Akaike and Bayesian information criteria (AIC and BIC), and a decision theory method (DT). It also provides estimates of model selection uncertainty, parameter importances and model-averaged parameter estimates, including model-averaged phylogenies. The theoretical background is described elsewhere (Posada and Buckley 2004 5); Sullivan and Joyce 2005 6)).
You can get more information by downloading the manual.
Relative rates tests
RRTree (version 1.1.11)
The program RRTree compares substitution rates between DNA or protein sequences grouped or not in phylogenetically defined lineages. The methods involved are mostly described in article.
Description of main parameters in Phylemon
Number of lineage: positive integral number up to 3, this number should not include outgroup species.
Lineage affectation field: this field will display content when your alignment file is uploaded. You will then have to decide to which lineage you assign each one of your species. (-1 option will exclude the corresponding species from the analysis)
in the lineage names field (that depends on the number of lineage you decided to compare), you can set the name of each of your lineage.
About distances field, this field is only available for DNA sequences and lets you define the kind of substitution model to use for calculation of distances (with proteins RRTree computes a simple adaptation of the Jukes-Cantor method).
About topology field: RRTree can use a given guide tree. It is important to note that branches of the guide tree supported by less than a given support score (bootstrap or whatever kind of numerical parameter chosen by the user) can be considered as unresolved, resulting in additional polytomies. Branches of length zero are always considered as unresolved. Polytomies can also be noted by absence of branches. The branch separating outgroups from ingroups must not be considered unresolved (length 0 or support under threshold), and the outgroup must be monophyletic relative to the ingroups. The following examples are three possible notations of the same tree, with a polytomy between mouse, guinea pig and human (assuming a score limit of 50):
(Chicken,(Human,((Sheep,Pig),Horse,Mink),Mouse,Guinea_pig));
(Chicken,((Human,((Sheep,Pig)80,(Horse,Mink)45)95)45,(Mouse,Guinea_pig)45)100);
(Chicken:0.4,((Human:0.2,((Sheep:0.05,Pig:0.05):0.05,(Horse:0.1,Mink:0.1):0.0):0.1):0.0,(Mouse:0.2,Guinea_pig:0.2):0.0):0.2);
You can get more at the official help page.
Adaptation tests
SLR
Documentation of main methods
This options are available in Phylemon, other options are setted to default (value under square braquets), and described here.
seqfile: File from which to read alignment of codon sequences. The file should be in PAML format.
treefile: File from which tree should be read. The tree should be in Nexus format.
positive_only: If only positively selected sites are of interest, set this to “1”. Calculation will be slightly faster, but information about sites under purifying selection is lost.
Results
Results are presented in nine columns:
Site: Number of sites in alignment
Neutral: (minus) Log-probability of observing site given that it was evolving neutrally (omega=1)
Optimal: (minus) Log-probability of observing site given that it was evolving at the optimal value of omega.
Omega: The value of omega which maximizes the log-probability of observing
LRT_Stat: Log-likelihood ratio statistic for non-neutral selection (or positive selection if the positive_only option is set to 1). LRT_Stat = 2 * (Neutral-Optimal)
Pval: P-value for non-neutral (or positive) selection at a site, unadjusted for multiple comparisons.
Adj. Pval: P-value for non-neutral (or positive) selection at a site, after adjusting for multiple comparisons using the Hochberg procedure (see the file “MultipleComparisons.txt” in the doc directory).
Result: A simple visual guide to the result. Sites detected as having been under positive selection are marked with a '+', sites under purifying selection are marked with '-'. The number of symbols :
Number | symbols Threshold |
1 | 95% |
2 | 99% |
3 | 95% after adjustment |
4 | 99% after adjustment |
(9. Occasionally the result may also contain an exclamation mark “!”. This indicates that the observation at a site is not significantly different from random (equivalent to infinitely strong positive selection). This may indicate that the alignment at that site is bad.)
Note:
The following events are flagged:
Synonymous | All codons at a site code for the same amino acid. |
Single character | Only one sequence at the site is ungapped, the result of a recent insertion for example. |
All gaps | All sequences at a site contain a gap character. |
Sites marked “Single character” or “All gaps” are not counted towards the number of sites for the purposes of correcting for multiple comparisons since it is not possible to detect selection from none or one observation under the assumptions made by the sitewise likelihood ratio test.
PAML (version 4.4b)
PAML (for Phylogenetic Analysis by Maximum Likelihood) is a package of programs for phylogenetic analyses of DNA and protein sequences using maximum likelihood.
The PAML package currently includes the following programs: baseml, basemlg, codeml, evolver, pamp, yn00, mcmctree, and chi2. In Phylemon, we made available 2 of those tools.
yn00
The program yn00 implements the method of Yang and Nielsen (2000) 8) for estimating synonymous and nonsynonymous substitution rates between two sequences (dS and dN). The method of Nei and Gojobori (1986) 9) is also included. The ad hoc method implemented in the program accounts for the transition/transversion rate bias and codon usage bias, and is an approximation to the ML method accounting for the transition/transversion rate ratio and assuming the F3x4 codon frequency model.
CodeML
The program codeml is formed by merging two old programs: codonml, which implements the codon substitution model of Goldman and Yang (1994) for protein-coding DNA sequences, and aaml, which implements models for amino acid sequences. These two are now distinguished by the variable seqtype in the control file codeml.ctl, with 1 for codon sequences and 2 for amino acid sequences. In this document I use codonml and aaml to mean codeml with seqtype = 1 and 2, respectively. The programs baseml, codonml, and aaml use similar algorithms to fit models by maximum likelihood, the main difference being that the unit of evolution in the Markov model, referred to as a “site” in the sequence, is a nucleotide, a codon, or an amino acid for the three programs, respectively. Markov process models are used to describe substitutions between nucleotides, codons or amino acids, with substitution rates assumed to be either constant or variable among sites.
The main options that we made available in Phylemon are:
Input sequence in phylip-paml format (go to readAl utility to convert your alignment)
Input tree in newick format
Sequence related parameter: here you can set the option 1 or 3 if you are working with dna sequences or 2 if you are working with amino acids.
Model selection:
Run mode: -2 performs ML estimation of dS and dN in pairwise comparisons of protein-coding sequences (seqtype = 1). The program will collect estimates of dS and dN into the files 2ML.dS and 2ML.dN. Since many users seem interested in looking at dN/dS ratios among lineages, examination of the tree shapes indicated by branch lengths calculated from the two rates may be interesting although the analysis is ad hoc. If your species names have no more than 10 characters, you can use the output distance matrices as input to Phylip programs such as neighbor without any change. Otherwise you need to edit the files to cut the names short. For amino acid sequences (seqtype = 2), option runmode = -2 lets the program calculate ML distances under the substitution model by numerical iteration, either under the model of one rate for all sites (alpha = 0) or under the gamma model of rates for sites (alpha ≠ 0). In the latter case, the continuous gamma is used and the variable Number of categories of gamma is ignored. (With runmode = 0, the discrete gamma is used.)
Model:
for codon models: specifies the branch models (Yang 1998; Yang and Nielsen 1998). model = 0 means one ω ratio for all branches; 1 means one ratio for each branch (the free-ratio model); and 2 means an arbitrary number of ratios (such as the 2-ratios or 3-ratios models). When model = 2, you have to group branches on the tree into branch groups using the symbols # or $ in the tree. See the section above about specifying branch/node labels. With model = 2, the variable fix_omega fixes the last ω ratio (ωk − 1 if you have k ratios in total) at the value of omega specified in the file. This option is used to test whether the ratio for the foreground branch is significantly greater than one.
for amino acid models: specifies the model of amino acid substitution: 0 for the Poisson model assuming equal rates for any amino acid substitutions (Bishop and Friday 1987); 1 for the proportional model in which the rate of change to an amino acid is proportional to the frequency of that amino acid. Model = 2 specifies a class of empirical models, and the empirical amino acid substitution rate matrix is given in the file specified by aaRatefile. Files included in the package are for the empirical models of Dayhoff (dayhoff.dat), JTT, WAG (wag.dat), mtMAM (mtmam.dat), mtREV24 (mtREV24.dat), etc.
NS sites: specifies the site models, with NSsites =
m corresponds to model M
m in Yang et al. (2000b)
10). The variable
Number of categories of gamma is used to specify the number of categories in the ω distribution under some models. In Yang et al. (2000b), this is 3 for M3 (discrete), 5 for M4 (freq), 10 for the continuous distributions (M5: gamma, M6: 2gamma, M7: beta, M8:beta&w, M9:beta&gamma, M10: beta&gamma+1, M11:beta&normal>1, and M12:0&2normal>1, M13:3normal>0). For example, M8 has 11 site classes (10 from the beta distribution plus 1 additional class for positive selection with ωs ≥ 1). See the section Codon substittion models above about the changes to M1 and M2 introduced in PAML version 3.14.
The posterior probabilities for site classes as well as the expected ω values for sites are listed in the file rst, which may be useful to identify sites under positive selection.
Other parameters:
Genetic code
Number of categories of gamma: maximum number of rate categories in the discrete-gamma model (but see NS sites description for some extra issue about this parameter).
Remove sites with ambiguity data: Checking this option will remove columns with gaps or ambiguity characters, otherwise they will be taken into account by CodeML.
Codon frequency: specifies the equilibrium codon frequencies in codon substitution model. These frequencies can be assumed to be equal (1/61 each for the standard genetic code, CodonFreq = 0), calculated from the average nucleotide frequencies (CodonFreq = 1), from the average nucleotide frequencies at the three codon positions (CodonFreq = 2), or used as free parameters (CodonFreq = 3). The number of parameters involved in those models of codon frequencies is 0, 3, 9, and 60 (for the universal code), for CodonFreq = 0, 1, 2, and 3 respectively.
Clock: specifies models concerning rate constancy or variation among lineages.
clock = 0 means no clock and rates are entirely free to vary from branch to branch. An unrooted tree should be used under this model. For a binary tree with n species (sequences), this model has (2n –3) parameters (branch lengths). clock = 1 means the global clock, and all branches have the same rate. This model has (n – 1) parameters corresponding to the (n – 1) internal nodes in the binary tree. So a test of the molecular clock assumption, which compares those two models, should have d.f. = n – 2.
clock = 2 specifies the local clock models(Yoder and Yang 2000
11); Yang and Yoder 2003
12) ). Most branches in the phylogeny conform with the clock assumption and has the default rate (r0 = 1), but some pre-defined branches may have different rates. Rates for branches are specified using branch labels (
#
and
$
) in the tree file. For example, the tree
( ( (1,2) #1, 3), 4)
specifies rate r1 for the branch ancestral to species 1 and 2 while all other branches have the default rate r0, which does not have to be specified. The user need to specify which branch has which rate, and the program estimates the unknown rates (such as r1 in the above example; r0 = 1 is the default rate). You need to be careful when specifying rates for branches to make sure that all rates can be estimated by the model; if you specify too many rate parameters, especially for branches around the root, it may not be possible to estimate all of them and you will have a problem with identifiability. The number of parameters for a binary tree in the local clock model is (n – 1) plus the number of extra rate parameters for branches. In the above tree of 4 species, you have only one extra rate parameter r1, and so the local clock model has (n – 1) + 1 = n = 4 parameters. The no-clock model has 5 parameters while the global clock has 3 parameters for that tree.
The option clock = 3 is for combined analysis of multiple-gene or multiple-partition data, allowing the branch rates to vary in different ways among the data partitions (Yang and Yoder 2003). To account for differences in the evolutionary process among data partitions, you have to use the option G in the sequence file as well as the control variable Mgene in the control file (baseml.ctl or codeml.ctl). Read Yang and Yoder (2003)
13) and the readme file in the examples/MouseLemurs/ folder to duplicate the analysis of that paper. Also the variable (= 5 or 6) is used to implement the ad hoc rate smoothing procedure of Yang (2004)
14). See the file readme2.txt for instructions and the paper for details of the model.
For clock = 1, 2, or 3, a rooted tree should be used. If fossil calibration information is specified in the tree file using the symbol @ or =, the absolute rate will be calculated. Multiple calibration points can be specified this way, but only point calibrations (where a node age is assumed to be known without error) are accepted and bounds are not accepted. See instructions about the mcmctree program, which accepts bounds or other distributions as calibrations. If sequences have dates, this option will fit Andrew Rambaut’s TipDate model.
MGene: is used in combination with option G in the sequence data file, for combined analysis of data from multiple genes or multiple site partitions (such as the three codon positions) (Yang 1996a
15); pages 116-8 in Yang 2006
16) ). Such models are also called partition models. Choose 0 if option G is not used in the data file. The description here refers to multiple genes, but apply to any strategy of partitioning sites, such as by codon position. Similar partition models are implemented in codeml for analyzing codon or amino acid sequences. The models are summarized in table 1. The simplest model which assumes complete homogeneity among genes can be fitted by concatenating different genes into one sequence without using the option G (and by specifying Mgene = 0 in the control file). The most general model is equavilent to a separate analysis, and can be done by fitting the same model to each data set (each gene), but can also be done by specifying Mgene = 1 with the option G in the combined data file. The sum of the log-likelihood values over different genes is then the log likelihood of the most general model considered here. Models accounting for some aspects of the heterogeneity of multiple genes are fitted by specifying Mgene in combination with the option G in the sequence data file. Mgene = 0 means a model that asumes different substitution rates but the same pattern of nucleotide substitution for different genes. Mgene = 2 means different frequency parameters for different genes but the same rate ratio parameters (κ in the K80, F84, HKY85 models or the rate parameters in the TN93 and REV models). Mgene = 3 means different rate ratio parameters and the same frequency parameters. Mgene = 4 means both different rate ratio parameters and different frequency parameters for different genes. Parameters and assumptions made in these models are summarized in the following table, with the HKY85 model used as an example. When substitution rates are assumed to vary from site to site, the control variable Malpha specifies whether one gamma distribution will be applied across all sites (Malpha = 0) or a different gamma distribution is used for each gene (or codon position). The different cs for different genes mean that branch lengths estimated for different genes are proportional (see Table 1 extracted from PAML documentation, above). Parameters π represent the equilibrium nucleotide frequencies, which are estimated using the observed frequencies (nhomo = 0). The transition/transversion rate ratio κ in HKY85 can be replaced by the two or five rate ratio parameters under the TN93 or REV models, respectively. The likelihood ratio test can be used to compare these models, using an approach called the analysis of deviance (McCullagh and Nelder 1989)
17), which is very similar to the more familiar analysis of variance.
Rate Ancestor: 0 or 1. Usually use 0. The value 1 forces the program to do two additional analyses, which you can ignore if you don’t need the results. First under a model of variable rates across sites such as the gamma, RateAncestor = 1 forces the program to calculate rates for individual sites along the sequence (output in the file rates), using the empirical Bayes procedure (Yang and Wang 1995)
18).
Small Diff: is a small value used in the difference approximation of derivatives.
Fix branch-lengths: This tells the program what to do if the tree has branch lengths. Use 0 if you want to ignore the branch lengths. Use –1 if you want the program to start from random starting points. This might be useful if there are multiple local optima. Use 1 if you want to use the branch lengths as initial values for the ML iteration. Try to avoid using the “branch lengths” from a parsimony analysis from PAUP, as those are numbers of changes for the entire sequence (rather than per site) and are very poor initial values. Use 2 if you want the branch lengths to be fixed at those given in the tree file (rather than estimating them by ML). In this case, you should make sure that the branch lengths are sensible; for example, if two sequences in the data file are different, but the branch lengths connecting the two sequences in the tree are all zero, the data and tree will be in conflict and the program will crash.
Method: This variable controls the iteration algorithm for estimating branch lengths under a model of no clock. method = 0 implements the old algorithm in PAML, which updates all parameters including branch lengths simultaneously. method = 1 specifies an algorithm newly implemented in PAML, which updates branch lengths one by one. method = 1 does not work under the clock models (clock = 1, 2, 3).
Fixed or estimated parameters:
Fix kappa: specifies whether κ in K80, F84, or HKY85 is given at a fixed value or is to be estimated by iteration from the data. If Fix kappa is checked, the value of another variable, kappa, is the given value, and otherwise the value of kappa is used as the initial estimate for iteration (value that you should input in corresponding text field). The variables fix_kappa and kappa have no effect with JC69 or F81 which does not involve such a parameter, or with TN93 and REV which have two and five rate parameters respectively, when all of them are estimated from the data.
Fix omega: specifies whether ω is given at a fixed value or is to be estimated by iteration from the data. If Fix omega is checked, the value of another variable, ω, is the given value, and otherwise the value of ω is used as the initial estimate for iteration (value that you should input in corresponding text field).
Fix alpha: work in a similar way, where alpha refers to the shape parameter α of the gamma distribution for variable substitution rates across sites (Yang 1994a)
19). The model of a single rate for all sites is specified as fix_alpha = 1 and alpha = 0 (0 means infinity), while the (discrete-) gamma model is specified by a positive value for alpha, and ncatG is then the number of categories for the discrete-gamma model (baseml). Values such as 5, 4, 8, or 10 are reasonable. Note that the discrete gamma model has one parameter (α), like the continuous gamma model, and the number of categories is not a parameter. The mdel of invariable sites is not implemented in PAML programs. I don’t like the model as it generates a strong correlation between the proportion of invariable sites and the gamma shape parameter. It is implemented in PAUP and MrBayes, for example. To infer rates at individual sites, use RateAncestor = 1. See below. Using a large number of categories (say, ncatG = 40) may be helpful if you are interested in calculating such rates. For detailed descriptions of those models, see Yang (1996b)
20), Chapter 1 of Yang (2006)
21) and chapters 13 and 16 of Felsenstein (2004)
22).
Fix rho: work in a similar way and concern independence or correlation of rates at adjacent sites, where ρ (rho) is the correlation parameter of the auto-discrete-gamma model (Yang 1995). The model of independent rates for sites is specified as fix_rho = 1 and rho = 0; choosing alpha = 0 further means a constant rate for all sites. The auto-discrete-gamma model is specified by positive values for both alpha and rho. The model of a constant rate for sites is a special case of the (discrete) gamma model with α = ∞ (alpha = 0), and the model of independent rates for sites is a special case of the auto-discrete-gamma model with ρ = 0 (rho = 0).
Output options:
Noisy: controls how much output you want on the screen. If the model being fitted involves much computation, you can choose a large number for noisy to avoid loneliness. verbose controls how much output in the result file.
Verbose: controls the amount of output.
getSE: 0, 1, or 2 tells whether we want estimates of the standard errors of estimated parameters. These are crude estimates, calculated by the curvature method, i.e., by inverting the matrix of second derivatives of the log-likelihood with respect to parameters. The second derivatives are calculated by the difference method, and are not always reliable. Even if this approximation is reliable, tests relying on the SE's should be taken with caution, as such tests rely on the normal approximation to the maximum likelihood estimates. The likelihood ratio test should always be preferred. The option is not available and choose getSE = 0 when tree-search is performed.
Table 1 extracted from PAML documentation: Setups of partition models of nucleotide substitution:
Sequence file | Control file | Parameters across genes |
No G | everything equal | Mgene = 0 |
Option G | Mgene = 0 | the same κ and π, but different cs (proportional branch lengths) |
Option G | Mgene = 2 | the same κ, but different πs and cs |
Option G | Mgene = 3 | the same π, but different κs and cs |
Option G | Mgene = 4 | different κ, πs, and cs |
Option G | Mgene = 1 | different κ, πs, and different (unproportional) branch lengths |
We highly encourage users of both yn00 and CodeML to read the main PAML documentation file from which this help section is extracted, to understand all the parameters and their relations.
Citation
ProtTest
ProtTest
PhyML
PAL
jModelTest
jModelTest
PhyML
PHYLIP
If you use jModelTest to build a model-averaged tree:
RRTree
SLR
PAML
Codeml
yn00