Phylogeny

Phylogeny programs in Phylemon cover distance, maximum parsimony, and statistical methods. The last group includes maximum likelihood and Bayesian approaches. Phylemon puts the emphasis on distance and statistical methods. Maximum parsimony in Phylemon relies on the very basic PHYLIP programs for protein (ProtPars) and DNA sequence data (DnaPars).

Distance Methods

Distances in PHYLIP

Distance matrices for DNA and protein sequence data are computed by the DnaDist and ProtDist programs, respectively. The output consists of a single outfile with all the pairwise distances between sequences. These methods are limited by the small number of evolutionary models they offer to correct for multiple hits.
Distance tree reconstruction in PHYLIP uses the Neighbor and Fitch programs. Neighbor builds trees by means of clustering algorithms such as UPGMA and Neighbor-Joining (NJ). The Fitch program implements the Minimum Evolution and Least Squares methods. In all cases the output consists of two files, the outfile and the outtree. The outtree contains the tree in Newick format, which can be read by tree viewer programs such as ETE. The outfile contains a rough text drawing of the tree that is convenient for reading and understanding the result, but it cannot be used as input to other programs.
- We recommend using the K2P option (DnaDist) or the PAM option (ProtDist), depending on whether your data are DNA or protein sequences, as an initial correction for multiple hits, followed by an NJ tree as a rough approximation to the best tree. It is a good idea to upload this outtree from the server as the intree file the first time you run the ModelTest or ProtTest programs. Note that it is not necessary to remove the distances from the outtree when you send this file to ModelTest or ProtTest to search for the best evolutionary model for your sequences.
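
To make the multiple-hit correction concrete, here is a minimal Python sketch of the Kimura two-parameter (K2P) distance between two aligned DNA sequences. It is an illustration only, not the DnaDist implementation: the function name `k2p_distance` and the choice to skip gaps and ambiguity codes are assumptions made for this example.

```python
import math

# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Hypothetical helper for illustration; sites with gaps or ambiguity
    codes are skipped (an assumption, not necessarily what DnaDist does).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared    # proportion of transitional differences
    q = transversions / compared  # proportion of transversional differences
    # K2P formula; math.log raises ValueError when the sequences are too
    # divergent for the correction to be defined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For two sequences differing by one transition in ten sites, the corrected distance is slightly larger than the uncorrected proportion of differences (0.1), which is the effect of the multiple-hit correction.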

You can get more information here: PHYLIP documentation

NJ trees using ML distances

Phylemon computes a Neighbor-Joining (NJ) tree from maximum likelihood (ML) distances for DNA sequences by means of the NJ_ML_D algorithm. For the NJ reconstruction, users can select the genetic distances produced by the best evolutionary model chosen by ModelTest under either the hLRT or the AIC criterion. This program is based on HYPHY (Hypothesis Testing using Phylogenies) algorithms.
- We recommend this program for a good approximation to the distance tree. Note that you first need to follow our previous recommendation (see above) to find this tree.
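
The NJ step itself does not depend on how the distances were obtained. The following Python sketch shows a single NJ agglomeration step on a distance matrix: it computes the Q criterion, picks the pair to join, and derives the two branch lengths. The function name `nj_first_join` and the small matrix in the usage note are made up for illustration; a full NJ run repeats this step after replacing the joined pair with a new node.

```python
def nj_first_join(names, dist):
    """Return (taxon_i, taxon_j, branch_i, branch_j) for the first NJ join.

    names: list of taxon labels; dist: symmetric matrix (list of lists)
    of pairwise distances in the same order. Hypothetical helper.
    """
    n = len(names)
    if n < 3:
        raise ValueError("neighbor-joining needs at least three taxa")
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: the pair with the smallest value is joined.
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from the joined taxa to the new internal node.
    branch_i = 0.5 * dist[i][j] + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    return names[i], names[j], branch_i, branch_j


# Illustrative (made-up) distances for five taxa.
example_names = ["a", "b", "c", "d", "e"]
example_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_first_join(example_names, example_dist))
```

Note that the pair joined first is not simply the closest pair in the matrix (d and e here); the Q criterion also accounts for how far each taxon is from all the others.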

Maximum Likelihood (ML) Methods

Phylemon runs Maximum Likelihood (ML) methods of phylogenetic reconstruction by means of the DnaML and ProML programs in PHYLIP, TREE-PUZZLE, and PhyML. We encourage the use of PhyML to obtain fast ML tree solutions for DNA and protein sequence data. Alternatively, TREE-PUZZLE can be used to test alternative topologies (see topology testing below). Since Phylemon is a good tool for learning purposes, we added the ML programs of PHYLIP, the first tools for ML reconstruction from sequence data, developed by Joe Felsenstein.

ML trees with PhyML

PhyML 3.0 finds the ML tree for DNA or amino-acid sequence data. The input sequence format can be interleaved (default) or sequential (see ReadAl). PhyML offers a large number of substitution models to correct for unobserved changes. For DNA sequences, the default choice is HKY85 and there are six alternative models: K80, JC69, F81, F84, TN93 and GTR. For amino-acid sequences, the default choice is JTT, and nine other models are available: Dayhoff (PAM), mtREV, WAG, DCMut, RtREV, CpREV, VT, Blosum62 and MtMam.

Parameters such as the transition/transversion ratio (for DNA sequences), the proportion of invariable sites (P), and the gamma distribution parameter can be jointly optimized to maximize the probability of the observed data (the sequences). The number of substitution rate categories is 4 by default. The shape of the gamma distribution is defined by the alpha parameter. Starting unrooted tree(s) (with branch lengths) in Newick format can be supplied to approximate the tree solution. By default, PhyML starts the tree topology search from a BIONJ distance-based tree.
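
Phylemon sets these parameters through its web form. For readers who also run PhyML locally, the sketch below assembles an equivalent command line in Python without executing it. It assumes the standalone PhyML 3.0 options `-i`, `-d`, `-m`, `-t`, `-v`, `-a`, `-c` and `-u`, where the value `e` asks PhyML to estimate a parameter; the executable name `phyml`, the function name and the file names are assumptions, and the options should be checked against the help of your installed version.

```python
def phyml_command(alignment, datatype="nt", model="HKY85", tstv="e",
                  pinv="e", alpha="e", n_categories=4, start_tree=None):
    """Build (but do not run) a PhyML command line as a list of arguments.

    Hypothetical helper: "e" means "estimate from the data"; a number
    would fix the parameter at that value instead.
    """
    cmd = [
        "phyml",                    # assumed executable name
        "-i", alignment,            # alignment in PHYLIP format
        "-d", datatype,             # "nt" for DNA, "aa" for amino acids
        "-m", model,                # substitution model
        "-v", str(pinv),            # proportion of invariable sites
        "-c", str(n_categories),    # number of rate categories
        "-a", str(alpha),           # gamma shape parameter
    ]
    if datatype == "nt":
        cmd += ["-t", str(tstv)]    # transition/transversion ratio (DNA only)
    if start_tree is not None:
        cmd += ["-u", start_tree]   # user starting tree instead of BIONJ
    return cmd


print(" ".join(phyml_command("alignment.phy")))
```

Returning a list rather than a single string makes it straightforward to pass the result to `subprocess.run` later without shell quoting problems.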

Finally, users can optimize the topology and all the parameters, or fix the topology and optimize only the branch lengths and rate parameters. If you choose no optimization, PhyML just returns the likelihood of the starting tree(s). PhyML computes ML bootstrap analyses very fast, and it is very common to run 1,000 pseudoreplicates on a medium-sized phylogenetic problem (approx. 1,000 characters by 15 species). The waiting time is probably about a day. However, a very interesting alternative is to run the aLRT option to obtain another kind of pseudoparametric support (aLRT values higher than 30 correlate with bootstrap values higher than 95%).
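
A bootstrap pseudoreplicate, as used above, is an alignment of the same size built by sampling the original columns with replacement. The Python sketch below shows only this resampling step; the tree search on each pseudoreplicate is left to PhyML. The function name and the dictionary representation of the alignment are assumptions for illustration.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence string.
    rng: a random.Random instance, so that runs can be reproduced.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # Draw as many column indices as there are sites, with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


toy_alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "AGGTAC"}
print(bootstrap_replicate(toy_alignment, random.Random(0)))
```

The same column indices are applied to every sequence, so each resampled site remains a genuine column of the original alignment.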

You can get more information here: PhyML 3.0 web page

[Figure: example of the correlation between bootstrap and aLRT values]

TREE-PUZZLE and ML tests of topologies

TREE-PUZZLE searches for the best ML tree using the quartet-puzzling algorithm. It also computes pairwise maximum likelihood distances, which can be passed to a Neighbor-Joining program in Phylemon to obtain an NJ reconstruction based on ML estimates of genetic distance. In addition, TREE-PUZZLE computes the likelihood mapping, a method to investigate the phylogenetic signal of the data without computing an overall tree. We recommend using PhyML to search for the best tree and TREE-PUZZLE to test the best topology against alternatives. Example 2 runs this kind of analysis. You need to define at least two alternative topologies and evaluate them under a pre-defined model (ideally, the best model for your data). TREE-PUZZLE runs the one- and two-sided Kishino-Hasegawa tests, the Shimodaira-Hasegawa test, and Expected Likelihood Weights. The outfile points out the best tree and the statistical differences (if any) from the alternative trees.
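
To give an idea of what a topology test compares, here is a minimal Python sketch of the normal-approximation form of the two-sided Kishino-Hasegawa test, computed from per-site log-likelihoods of two trees. It is not the TREE-PUZZLE implementation, whose details may differ; the function name and the site log-likelihood values are made up for illustration.

```python
import math


def kh_test(site_lnl_1, site_lnl_2):
    """Two-sided KH test (normal approximation) for two trees.

    Each argument is a list of per-site log-likelihoods of the same
    alignment under one tree. Returns (delta, z, p_value).
    """
    n = len(site_lnl_1)
    if n != len(site_lnl_2) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 sites")
    diffs = [a - b for a, b in zip(site_lnl_1, site_lnl_2)]
    delta = sum(diffs)  # difference in total log-likelihood
    mean = delta / n
    # Variance of delta estimated from the spread of the site differences.
    var_delta = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_delta == 0.0:
        raise ValueError("site differences show no variation")
    z = delta / math.sqrt(var_delta)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return delta, z, p_value


# Made-up site log-likelihoods for two candidate topologies.
tree_a = [-2.0, -1.5, -3.0, -2.2, -1.8, -2.5, -1.9, -2.1]
penalty = [0.3, 0.4, 0.2, 0.5, 0.3, 0.4, 0.35, 0.25]
tree_b = [a - d for a, d in zip(tree_a, penalty)]
print(kh_test(tree_a, tree_b))
```

A small p-value indicates that the difference in log-likelihood between the two topologies is larger than expected from site-to-site variation alone.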

You can get more information here: TREE-PUZZLE web page

ML trees in PHYLIP

ProML and DnaML are the ML tree reconstruction programs for protein and DNA sequences, respectively, in the PHYLIP package. Both programs infer the tree using parameter values defined by the user. This means that they cannot search for the best combination of parameters (for instance alpha, the proportion of invariant sites, and the rates). Since they were the first programs to build ML trees, these options were not included. We added these programs to teach the use of the first programs that computed likelihoods.

You can get more information here: PHYLIP web page

PhyML Best AIC tree

PhyML Best AIC tree is a Python script that reconstructs ML trees using the DNA or protein model with the best AIC among all the models available in PhyML. Along with the AIC, the weight of each model is also calculated¹⁾.
- This program can be run in three ways:
  1. fast: with no topology optimization in the computation of likelihoods.
  2. smart: same as fast, but the best models are re-run with topology optimization (the best models are selected according to their AIC weights; the re-run models are those whose weights sum to 0.95)
  3. slow: with topology optimization in the computation of likelihoods, for all tested models.
- Other options:
  1. Do not check for invariant sites
  2. Do not check for gamma distribution
  3. Do not check for differences in rate frequencies
  4. Compute branch support for all trees: by checking this option you will be able to see how the SH-like support of internal nodes behaves for each of the tested models. (This option makes the program run a bit slower.)
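
The AIC weights and the 0.95 cumulative-weight selection used by the smart mode can be sketched as follows. This is a minimal Python illustration, not the source of the Phylemon script: the function names, the log-likelihoods and the parameter counts are made up for the example.

```python
import math


def aic(lnl, n_params):
    """Akaike information criterion from a log-likelihood and parameter count."""
    return 2 * n_params - 2 * lnl


def akaike_weights(aic_by_model):
    """Convert AIC values into Akaike weights that sum to 1."""
    best = min(aic_by_model.values())
    relative = {m: math.exp(-0.5 * (a - best)) for m, a in aic_by_model.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}


def models_to_rerun(weights, threshold=0.95):
    """Best models, by decreasing weight, until their weights sum to threshold."""
    chosen, cumulative = [], 0.0
    for model, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(model)
        cumulative += w
        if cumulative >= threshold:
            break
    return chosen


# Made-up (log-likelihood, number of free parameters) per model.
fits = {"JC69": (-2100.0, 10), "HKY85": (-2050.0, 14), "GTR": (-2048.5, 18)}
aics = {m: aic(lnl, k) for m, (lnl, k) in fits.items()}
weights = akaike_weights(aics)
print(weights, models_to_rerun(weights))
```

In this made-up case the best model alone carries less than 0.95 of the weight, so the second-best model would be re-run as well.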

Source code of this program is available here.

Bayesian Methods

Many people disagree with ML analysis because the method provides a statistical solution that maximizes the probability of the data (aligned sequences) given the model (topology, branch lengths, and all the parameters of the evolutionary model). If the model you chose to solve the tree is wrong, the tree will be wrong as well. An alternative is to maximize the probability of the tree and all the parameters given the data. Bayesian inference of phylogeny is based upon a quantity called the posterior probability distribution of trees, which is the probability of a tree conditioned on the observations. The conditioning is accomplished using Bayes's theorem.
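
Written out, the conditioning takes the following form. The notation is an assumption introduced here for illustration: D is the alignment, τ a tree topology, and θ the branch lengths together with the parameters of the evolutionary model.

```latex
P(\tau, \theta \mid D)
  = \frac{P(D \mid \tau, \theta)\, P(\tau, \theta)}
         {\sum_{\tau'} \int P(D \mid \tau', \theta')\, P(\tau', \theta')\, d\theta'}
```

The numerator is the likelihood used in ML analysis multiplied by the prior; the denominator sums over all trees and integrates over all parameter values.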

MrBayes

The posterior probability distribution of trees is impossible to calculate analytically; instead, MrBayes uses a simulation technique called Markov chain Monte Carlo (or MCMC) to approximate the posterior probabilities of trees.
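
The idea behind MCMC can be shown on a toy problem. The Python sketch below runs a plain Metropolis sampler over three candidate topologies with made-up unnormalized posterior scores; the fraction of steps spent on each tree approximates its posterior probability. MrBayes itself is far more elaborate (it proposes changes to trees and parameters and runs several coupled chains), so this is only an illustration, and every name in it is hypothetical.

```python
import random


def metropolis_tree_sampler(unnormalized_posterior, n_steps, rng):
    """Estimate posterior probabilities of trees with a Metropolis chain.

    unnormalized_posterior: dict mapping a tree label to a positive score
    proportional to likelihood * prior. Returns visit frequencies.
    """
    trees = list(unnormalized_posterior)
    current = rng.choice(trees)
    counts = {t: 0 for t in trees}
    for _ in range(n_steps):
        proposal = rng.choice(trees)  # symmetric proposal over all trees
        ratio = unnormalized_posterior[proposal] / unnormalized_posterior[current]
        # Always accept uphill moves; accept downhill moves with prob. ratio.
        if rng.random() < min(1.0, ratio):
            current = proposal
        counts[current] += 1
    return {t: c / n_steps for t, c in counts.items()}


# Made-up scores for the three unrooted topologies of four taxa.
scores = {"((A,B),C,D)": 6.0, "((A,C),B,D)": 3.0, "((A,D),B,C)": 1.0}
print(metropolis_tree_sampler(scores, 50000, random.Random(1)))
```

Only ratios of scores are needed, which is why the normalizing constant that cannot be computed analytically never has to be evaluated.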

Interactive mode

MrBayes can be run interactively, which lets the user execute the usual commands directly in a shell. Otherwise, if you select the non-interactive option, MrBayes will stop at the end of your block of commands.

The interactive option is useful for experienced users, and allows them to adjust the number of generations to run instead of assuming that 100,000 will be enough.

Note that MrBayes has extensive help available from inside this shell; you can start by typing Help.

Make your MrBayes commands block

If you have not built a MrBayes command block (you are only providing MrBayes with an alignment), you may be interested in checking this option. Once it is checked, the form expands to show a new set of options.

By choosing values for this set of parameters, the user can avoid building the MrBayes command block by hand, with its usual problems (typos, missing arguments…). It is important to note that this option is independent of the Interactive option: the user can build the block using the form and go deeper into the analysis in a second step using the interactive option.

Warning: a command block built with the form is not merged with any other information in the uploaded file; if another command block is found, Phylemon will replace it.
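
For reference, a hand-written command block of the kind the form replaces might be generated as in the Python sketch below. The parameters exposed here are an assumption (the actual options offered by the Phylemon form are not listed on this page), and the commands used (`set`, `lset`, `mcmc`, `sump`, `sumt`) should be checked against the help of your MrBayes version, since burn-in handling differs between releases.

```python
def mrbayes_block(nst=6, rates="invgamma", ngen=100000, samplefreq=100,
                  nchains=4, burnin=250):
    """Return a MrBayes command block as text (hypothetical helper).

    nst: number of substitution types (1, 2 or 6).
    burnin: number of sampled trees to discard when summarizing.
    """
    if nst not in (1, 2, 6):
        raise ValueError("nst must be 1, 2 or 6")
    lines = [
        "begin mrbayes;",
        "  set autoclose=yes nowarn=yes;",  # do not prompt during the run
        f"  lset nst={nst} rates={rates};",
        f"  mcmc ngen={ngen} samplefreq={samplefreq} nchains={nchains};",
        f"  sump burnin={burnin};",
        f"  sumt burnin={burnin};",
        "end;",
    ]
    return "\n".join(lines) + "\n"


print(mrbayes_block())
```

Generating the block from validated values avoids the typos and missing arguments mentioned above, which is the same motivation as the form.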

Citation

MrBayes

MrBayes 3: Bayesian phylogenetic inference under mixed models.
Ronquist F, Huelsenbeck JP
Bioinformatics 19: 1572-4 (2003 Aug 12)

PHYLIP

PHYLIP (Phylogeny Inference Package) version 3.6.
Felsenstein, J.
Distributed by the author. Department of Genome Sciences, University of Washington, Seattle (USA) (2004)

TREE-PUZZLE

TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing.
Schmidt HA, Strimmer K, Vingron M, von Haeseler A
Bioinformatics 18: 502-4 (2002 Mar)

NJ trees using ML distances (HYPHY)

HyPhy: hypothesis testing using phylogenies.
Pond SL, Frost SD, Muse SV
Bioinformatics 21: 676-9 (2005 Mar 1)

PhyML

A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.
Guindon S, Gascuel O
Syst Biol 52: 696-704 (2003 Oct)

¹⁾

Burnham, K. P., and D. R. Anderson. 2003. Model selection and multimodel inference: a practical information-theoretic approach. Springer-Verlag, New York, NY.
