help/help/html/calculations/tree.html

   1 <html>
   2 <!--
   3  * Jalview - A Sequence Alignment Editor and Viewer ($$Version-Rel$$)
   4  * Copyright (C) $$Year-Rel$$ The Jalview Authors
   5  *
   6  * This file is part of Jalview.
   7  *
   8  * Jalview is free software: you can redistribute it and/or
   9  * modify it under the terms of the GNU General Public License
  10  * as published by the Free Software Foundation, either version 3
  11  * of the License, or (at your option) any later version.
  12  *
  13  * Jalview is distributed in the hope that it will be useful, but
  14  * WITHOUT ANY WARRANTY; without even the implied warranty
  15  * of MERCHANTABILITY or FITNESS FOR A PARTICULAR
  16  * PURPOSE.  See the GNU General Public License for more details.
  17  *
  18  * You should have received a copy of the GNU General Public License
  19  * along with Jalview.  If not, see <http://www.gnu.org/licenses/>.
  20  * The Jalview Authors are detailed in the 'AUTHORS' file.
  21  -->
  22 <head>
  23 <title>Tree Calculation</title>
  24 </head>
  25 <body>
  26   <p>
  27     <strong>Calculation of trees from alignments</strong>
  28   </p>
  29   <p>
  30     Trees are calculated on either the complete alignment, or just the
  31     currently selected group of sequences, via the <a href="calculations.html">calculations dialog</a> opened from the <strong>Calculate&#8594;Calculate
  32       Tree or PCA...</strong> menu entry. Once calculated, trees are displayed in a new <a
  33       href="../calculations/treeviewer.html">tree viewing
  34       window</a>. Each calculation combines one of the distance
  35     measures listed below with one of two tree construction
  36     algorithms:
  37   </p>
  38   <p>
  39     <strong>Distance Measures</strong>
  40   </p>
  41   <p>Trees are calculated on the basis of a measure of similarity
  42     between each pair of sequences in the alignment:
  43   <ul>
  44     <li><strong>PID</strong><br>The percentage identity
  45       between the two sequences over their aligned positions.
  46       <ul>
  47         <li>PID = Number of equivalent aligned non-gap symbols *
  48           100 / Smallest number of non-gap positions in either of the two
  49           sequences<br> <em>This is essentially the 'number of
  50             identical bases (or residues) per 100 base pairs (or
  51             residues)'.</em>
  52         </li>
  53       </ul>
  54     <li><strong>BLOSUM62, PAM250, DNA</strong><br />These options
  55       use one of the available substitution matrices to compute a sum of
  56       scores for the residue pairs at each aligned position.
  57       <ul>
  58         <li>For details about each model, see the <a
  59           href="scorematrices.html">list of built-in score
  60             matrices</a>.
  61         </li>
  62       </ul></li>
  63     <li><strong>Sequence Feature Similarity</strong><br>Trees
  64       are constructed from a distance matrix formed from Jaccard
  65       distances between sequence features observed at each column of the
  66       alignment.
  67       <ul>
  68         <li>Similarity at column <em>i</em> = (Total number of
  69           features displayed - Sum of number of features in common at <em>i</em>)
  70           <br />Similarities are summed over all columns and divided by
  71           the number of columns. <br />Since the total number of
  72           feature types is constant over all columns of the alignment,
  73           we do not scale the matrix, so tree distances can be
  74           interpreted as the average number of features that differ over
  75           all sites in the aligned region.
  76         </li>
  77
  78       </ul> Distances are computed based on the currently displayed feature
  79       types. Sequences with similar distributions of features of the
  80       same type will be grouped together in trees computed with this
  81       metric. <em>This measure was introduced in Jalview 2.9</em></li>
  82
  83           <li><strong>Secondary Structure Similarity</strong><br>Trees are
  84           generated using a distance matrix, which is constructed from Jaccard
  85           distances that specifically consider the secondary structure features
  86           observed at each column of the alignment.
  87       <ul>
  88         <li>For secondary structure similarity analysis, at any given column
  89                 <em>i</em>, the number of unique secondary structures ranges from 0 to 2,
  90                 reflecting the presence of helices, sheets, coils and gaps.
  91                 <br>The similarity at column <em>i</em> = Total
  92                 number of unique secondary structures (which can range from 0 to 2)
  93                 - Sum of the number of secondary structures in common at column
  94                 <em>i</em> (which can be either 0 or 1)<br>The similarity scores are
  95                 summed across all columns and then divided by the total number of
  96                 columns to calculate an average similarity score.
  97         </li>
  98       </ul>
  99           Distance calculations are based on the secondary structures
 100           currently displayed. Sequences with similar distributions of secondary
 101           structures will be grouped together in trees.<br>
 102           <em>The distance between two sequences is maximal when one
 103           sequence has a defined secondary structure annotation track and the
 104           other does not, indicating complete dissimilarity between them.
 105           Conversely, the distance is minimal when neither sequence in
 106           the comparison has a defined secondary structure annotation
 107           track.</em>
 108           </li>
 109   </ul>
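  <p>
    The PID and Sequence Feature Similarity measures described above
    can be sketched in Python as follows. This is an illustrative
    sketch only, not Jalview's implementation: the function names, the
    set of gap characters, the case-insensitive comparison and the
    representation of features as one set of feature type names per
    column are all assumptions made for the example. The feature
    distance follows the reading that tree distances are the average
    number of feature types that differ per column.
  </p>
<pre>
```python
# Assumed gap characters; the real set used by Jalview may differ.
GAP_CHARS = "-. "


def percent_identity(seq_a, seq_b):
    """PID = matching non-gap pairs * 100 / smaller non-gap count."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both symbols are non-gap and equivalent
    # (case-insensitive comparison is an assumption of this sketch).
    matches = sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a not in GAP_CHARS and b not in GAP_CHARS and a.upper() == b.upper()
    )
    non_gap_a = sum(1 for a in seq_a if a not in GAP_CHARS)
    non_gap_b = sum(1 for b in seq_b if b not in GAP_CHARS)
    shortest = min(non_gap_a, non_gap_b)
    if shortest == 0:
        return 0.0  # no comparable positions
    return matches * 100.0 / shortest


def feature_distance(features_a, features_b):
    """Average number of feature types that differ per column.

    Each argument is a list with one set of feature type names per
    alignment column (a hypothetical representation for this sketch).
    """
    if len(features_a) != len(features_b):
        raise ValueError("feature lists must cover the same columns")
    if not features_a:
        return 0.0
    # The symmetric difference holds types present in exactly one sequence.
    differing = sum(len(fa ^ fb) for fa, fb in zip(features_a, features_b))
    return differing / len(features_a)
```
</pre>
  <p>
    For example, <code>percent_identity("ACGT-A", "ACCTTA")</code>
    counts four matching columns against five non-gap positions in the
    shorter sequence.
  </p>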
 110   <p>
 111     <strong>Tree Construction Methods</strong>
 112   </p>
 113   <p>Jalview currently supports two kinds of agglomerative
 114     clustering methods. These are not intended to substitute for
 115     rigorous phylogenetic tree construction, and may fail on very large
 116     alignments.
 117   <ul>
 118     <li><strong>UPGMA tree</strong><br> UPGMA stands for
 119       Unweighted Pair-Group Method using Arithmetic averages. Clusters
 120       are iteratively formed and extended by finding a non-member
 121       sequence with the lowest average dissimilarity over the cluster
 122       members.
 123       <p></p></li>
 124     <li><strong>Neighbour Joining tree</strong><br> First
 125       described in 1987 by Saitou and Nei, this method applies a greedy
 126       algorithm to find a tree with minimal total branch length.<br>
 127       This method, as implemented in Jalview, is considerably more
 128       expensive than UPGMA.</li>
 129   </ul>
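  <p>
    The average-linkage clustering behind UPGMA can be sketched in
    Python as shown here. This is a minimal illustration under stated
    assumptions, not Jalview's code: the function name
    <code>upgma</code>, the distance lookup keyed by
    <code>frozenset</code> pairs and the nested-tuple tree are all
    hypothetical choices, and branch lengths are omitted for brevity.
  </p>
<pre>
```python
from itertools import combinations


def upgma(names, dist):
    """Cluster labels by average linkage and return a nested tuple tree.

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) to the pairwise distance.
    """
    # Map each cluster (a tuple of member labels) to its subtree.
    clusters = {(name,): name for name in names}

    def average_distance(c1, c2):
        # Unweighted mean over all member pairs of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the pair of clusters with the lowest average dissimilarity.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(*pair),
        )
        merged = (clusters.pop(c1), clusters.pop(c2))
        clusters[c1 + c2] = merged

    return next(iter(clusters.values()))
```
</pre>
  <p>
    Each pass recomputes the average distances from the original
    pairwise values, which is simple to follow but slow on large
    alignments.
  </p>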
 130   <p>
 131     A newly calculated tree will be displayed in a new <a
 132       href="../calculations/treeviewer.html">tree viewing
 133       window</a>. In addition, a new entry with the same tree viewer window
 134     name will be added in the Sort menu so that the alignment can be
 135     reordered to reflect the ordering of the leaves of the tree. If the
 136     tree was calculated on a selected region of the alignment, then the
 137     title of the tree view will reflect this.
 138   </p>
 139
 140   <p>
 141     <strong>External Sources for Phylogenetic Trees</strong>
 142   </p>
 143   <p>
 144     A number of programs exist for the reliable construction of
 145     phylogenetic trees, which can cope with large numbers of sequences,
 146     use better distance methods and can perform bootstrapping. Jalview
 147     can read <a
 148       href="http://evolution.genetics.washington.edu/phylip/newick_doc.html">Newick</a>
 149     format tree files using the 'Load Associated Tree' entry of the
 150     alignment's File menu. Sequences in the alignment will be
 151     automatically associated with nodes in the tree by matching Sequence
 152     IDs to the tree's leaf names.
 153   </p>
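  <p>
    The idea of matching Sequence IDs to leaf names can be illustrated
    with the short Python sketch below. It is not Jalview's matching
    logic, which may differ: the helper names
    <code>leaf_names</code> and <code>associate</code> are
    hypothetical, and the regular expression assumes unquoted leaf
    labels with no embedded whitespace or comments.
  </p>
<pre>
```python
import re


def leaf_names(newick):
    """Return leaf labels from a simple, unquoted Newick string.

    A leaf label directly follows "(" or ","; labels that follow ")"
    belong to internal nodes and are skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def associate(sequence_ids, newick):
    """Split sequence IDs into those found among the leaves and the rest."""
    leaves = set(leaf_names(newick))
    matched = [seq_id for seq_id in sequence_ids if seq_id in leaves]
    unmatched = [seq_id for seq_id in sequence_ids if seq_id not in leaves]
    return matched, unmatched
```
</pre>
  <p>
    Sequences that end up in the unmatched list would have no
    counterpart in the loaded tree under this exact-match assumption.
  </p>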
 154
 155
 156 </body>
 157 </html>