Input Data Format

Sequences and alignments

Phylemon usually lets users forget about formats, as it internally converts all input data to the format required by each tool. However, we recommend a few general rules to ease this format conversion and avoid errors:

  1. Make it simple: try to avoid unusual characters such as “,”, @, \, /, (, ), ~… in your sequence names; it is also a good idea to avoid spaces or replace them with underscores (_). Using only standard alphanumeric characters will ease the format conversion done by Phylemon. Most of the errors raised by Phylemon's programs can be avoided by simplifying sequence names.
  2. Sequence name length: many programs in Phylemon (at least all programs from the PHYLIP package) truncate sequence names to 10 characters, sometimes resulting in duplicated names. For example, if you have 2 sequences named homo_sapiens_1 and homo_sapiens_2, this will result in 2 sequences both named “homo_sapie”. It is therefore best to always use short names unless you know that the tools you are going to use can deal with long names.
  3. Invisible formatting characters: complex text editors (e.g. Microsoft Word) usually insert extra hidden characters containing formatting information. You can get rid of them by using one of these simpler text editors:
    1. Notepad under Windows
    2. the “Zap Gremlins” option under MacOS (if you are using TextWrangler)
    3. Kedit, Kate, Gedit, Emacs, vi… under GNU/Linux
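
Rules 1 and 2 can be checked before uploading data. The following is a minimal Python sketch, not part of Phylemon: the function names simplify_name and truncation_clashes are hypothetical, and the 10-character limit is the PHYLIP truncation length quoted above.

```python
import re

# Assumption: 10 characters, the truncation length quoted above for PHYLIP programs.
PHYLIP_NAME_LENGTH = 10


def simplify_name(name):
    """Replace every run of non-alphanumeric characters with one underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name.strip()).strip("_")


def truncation_clashes(names, length=PHYLIP_NAME_LENGTH):
    """Group names by their first `length` characters and keep only the clashes."""
    groups = {}
    for name in names:
        groups.setdefault(name[:length], []).append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


if __name__ == "__main__":
    names = ["homo_sapiens_1", "homo_sapiens_2", "mus_mus"]
    print(simplify_name("Homo sapiens (human)"))
    print(truncation_clashes(names))
```

Any name reported by truncation_clashes would become indistinguishable after truncation, so it should be shortened by hand before submitting the data.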

Trees

Phylemon essentially supports the Newick format for trees, with some variations for programs like SLR or CodeML (the conversion is normally handled by Phylemon).

  1. Usually, programs do not like internal nodes with only one descendant (here represented by –/–):
    • tree with an internal node that has only one child: (a,(b));
     /-a
----|
     \--- /-b

    • same tree with the single-child node collapsed: (a,b);
     /-a
----|
     \-b
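
Single-child internal nodes can be removed before a tree is submitted. The sketch below is only an illustration under stated assumptions: it handles bare Newick strings made of leaf names, parentheses and commas (no branch lengths, no internal node labels, no quoted names), and the function names are hypothetical rather than anything provided by Phylemon.

```python
def parse_newick(text):
    """Parse a bare Newick string into nested lists; leaves are plain strings."""
    text = text.strip().rstrip(";")
    pos = 0

    def peek():
        # Current character, or "" once the end of the string is reached.
        return text[pos] if pos < len(text) else ""

    def node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # skip "("
            children = [node()]
            while peek() == ",":
                pos += 1  # skip ","
                children.append(node())
            if peek() != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # skip ")"
            return children
        start = pos
        while peek() and peek() not in "(),":
            pos += 1
        return text[start:pos]

    tree = node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters in Newick string")
    return tree


def collapse(tree):
    """Replace every internal node that has a single child by that child."""
    if isinstance(tree, str):
        return tree
    children = [collapse(child) for child in tree]
    return children[0] if len(children) == 1 else children


def to_newick(tree):
    """Serialise nested lists back to Newick, without the final semicolon."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


def collapse_single_child_nodes(newick):
    return to_newick(collapse(parse_newick(newick))) + ";"


if __name__ == "__main__":
    print(collapse_single_child_nodes("(a,(b));"))
```

Trees that carry branch lengths would need extra care, since the length of a removed node has to be added to that of its child; this sketch deliberately does not attempt that.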