<mapID target="edit" url="html/editing/index.html"/>\r
\r
<mapID target="trees" url="html/calculations/tree.html"/>\r
- <mapID target="conservation" url="html/calculations/conservation.html"/> \r
<mapID target="pca" url="html/calculations/pca.html"/>\r
<mapID target="pairwise" url="html/calculations/pairwise.html"/>\r
<mapID target="redundancy" url="html/calculations/redundancy.html"/>\r
<indexitem text="Input / Output" target="io"/>\r
<indexitem text="Editing Alignment" target="edit"/>\r
<indexitem text="Colour Schemes" target="colours"/>\r
- <indexitem text="Calculating trees" target="trees"/>\r
- <indexitem text="Conservation" target="conservation"/>\r
+ <indexitem text="Calculating and viewing trees" target="trees"/>\r
+ <indexitem text="Conservation" target="colours.conservation"/>\r
<indexitem text="Principal Component Analysis" target="pca"/>\r
</indexitem>\r
<indexitem text="Useful Information" target="home"/>\r
<head><title>Principal Component Analysis</title></head>\r
<body>\r
<p><strong>Principal Component Analysis</strong></p>\r
-<p>This is a method of clustering sequences based on the method developed by G.\r
- Casari, C. Sander and A. Valencia. Structural Biology volume 2, no. 2, February\r
- 1995 . Extra information can also be found at the SeqSpace server at the EBI.\r
- <br>\r
- The version implemented here only looks at the clustering of whole sequences\r
- and not individual positions in the alignment to help identify functional residues.\r
- For large alignments plans are afoot to implement a web service to do this 'residue\r
- space' PCA remotely. </p>\r
-<p>When the Principal component analysis option is selected all the sequences\r
- ( or just the selected ones) are used in the calculation and for large numbers\r
- of sequences this could take quite a time. When the calculation is finished\r
- a new window is displayed showing the projections of the sequences along the\r
- 2nd, 3rd and 4th vectors giving a 3dimensional view of how the sequences cluster.\r
+<p>This calculation creates a spatial representation of the\r
+similarities within a selected group, or all of the sequences in\r
+an alignment. After the calculation finishes, a 3D viewer displays the\r
+set of sequences as points in 'similarity space', and similar\r
+sequences tend to lie near each other in the space.</p>\r
+<p>Note: The calculation is computationally expensive, and may fail for very large sets of sequences -\r
+ usually because the JVM has run out of memory. The next release of\r
+ Jalview will execute this calculation through a web service.</p>\r
+<p>Principal components analysis is a technique for examining the\r
+structure of complex datasets. The components are a set of dimensions\r
+formed from the measured values in the dataset, and the principal\r
+component is the one with the greatest magnitude, or length. The\r
+sets of measurements that differ the most should lie at either end of\r
+this principal axis, and the other axes correspond to less extreme\r
+patterns of variation in the dataset.\r
</p>\r
-<p>This 3d view can be rotated by holding the left mouse button down in the PCA\r
- window and moving it. The user can also zoom in and out by using the up and\r
- down arrow keys. </p>\r
-<p>Individual points can be selected using the mouse and selected sequences show\r
- up green in the PCA window and the usual grey background/white text in the alignment\r
- and tree windows. </p>\r
-<p>Different eigenvectors can be used to do the projection by changing the selected\r
- dimensions in the 3 menus underneath the 3d window. <br>\r
+\r
+<p>In this case, the components are generated by an eigenvector\r
+decomposition of the matrix formed from the sum of BLOSUM scores at\r
+each aligned position between each pair of sequences. The basic method\r
+is described in the paper by G. Casari, C. Sander and\r
+A. Valencia. Structural Biology volume 2, no. 2, February 1995 (<a\r
+href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=7749921">pubmed</a>)\r
+ and implemented at the SeqSpace server (<a\r
+ href="http://industry.ebi.ac.uk/SeqSpace/">http://industry.ebi.ac.uk/SeqSpace</a>) at the EBI.\r
</p>\r
+\r
+<p><strong>The PCA Viewer</strong></p> \r
+<p>This is an interactive display of the sequences positioned within\r
+ the similarity space. The colour of each sequence point is the same\r
+ as the sequence group colouring, white if no colour has been\r
+ defined for the sequence, and green if the sequence is part of\r
+ the currently selected group.\r
+</p>\r
+ <p>The 3d view can be rotated by dragging the mouse with the\r
+ <strong>left mouse button</strong> pressed. The view can also be\r
+ zoomed in and out with the up and down <strong>arrow\r
+ keys</strong>.</p> \r
+<p>A tool tip gives the sequence ID corresponding to a point in the\r
+ space, and clicking a point toggles the selection of the\r
+ corresponding sequence in the alignment window. Rectangular region\r
+ selection is also possible, by holding the 's' key whilst\r
+ left-clicking and dragging the mouse over the display.\r
+</p>\r
+<p>Initially, the display shows the first three components of the\r
+ similarity space, but any eigenvector can be used by changing the selected\r
+ dimension for the x, y, or z axis through each axis's menu, located\r
+ below the 3d display.\r
+</p>\r
+\r
</body>\r
</html>\r
<html>\r
<head><title>Tree Calculation</title></head>\r
<body>\r
-<p><strong>UPGMA tree</strong></p>\r
-<p>If this option is selected then all sequences are used to generate a UPGMA\r
- tree. The pairwise distances used to cluster the sequences are the percentage\r
- mismatch between two sequences. For a reliable phylogenetic tree I recommend\r
- other programs (phylowin, phylip) should be used as they have the speed to use\r
- better distance methods and bootstrapping. Again, plans are afoot for a server\r
- to do this and to be able to read in tree files generated by other programs.\r
- <br>\r
- When the tree has been calculated a new window is displayed showing the tree\r
- with labels on the leaves showing the sequence ids. The user can select the\r
- ids with the mouse and the selected sequences will also be selected in the alignment\r
- window and the PCA window if that analysis has been calculated. </p>\r
+<p><strong>Calculation of trees from alignments</strong></p>\r
+<p>Trees are calculated on either the complete alignment, or just the\r
+currently selected group of sequences. Four different calculations\r
+are available, each combining one of two distance measures with one\r
+of two tree construction algorithms:\r
+</p>\r
+<p><strong>Distance Measures</strong></p>\r
+<p>Trees are calculated on the basis of a measure of similarity\r
+between each pair of sequences in the alignment:\r
+<ul>\r
+<li><strong>PID</strong><br>The percentage identity between the two\r
+sequences at each aligned position.\r
+<li><strong>BLOSUM62</strong><br>The sum of BLOSUM62 scores for the\r
+residue pair at each aligned position.\r
+</ul>\r
+</p>\r
+<p><strong>Tree Construction Methods</strong></p>\r
+<p>Jalview currently supports two kinds of agglomerative clustering\r
+methods. These are not intended to substitute for rigorous\r
+phylogenetic tree construction, and may fail on very large alignments.\r
+<ul>\r
+<li><strong>UPGMA tree</strong><br>\r
+ UPGMA stands for Unweighted Pair-Group Method using Arithmetic\r
+ averages. Clusters are iteratively formed and extended by finding a\r
+ non-member sequence with the lowest average dissimilarity over the\r
+ cluster members.\r
+<p></p>\r
+</li>\r
+<li><strong>Neighbour Joining tree</strong><br>\r
+ First described in 1987 by Saitou and Nei, this method applies a\r
+ greedy algorithm to find the tree with the smallest total branch\r
+ length.<br>\r
+ This method, as implemented in Jalview, is considerably more\r
+ expensive than UPGMA.\r
+</li>\r
+</ul>\r
+</p>\r
+<p></p>\r
+<p><strong>The Tree Viewing Window</strong></p>\r
+<p>\r
+ When the tree has been calculated a window is displayed showing the\r
+ tree, with the leaves labelled with sequence ids. \r
<p>Selecting the 'show distances' checkbox will put branch lengths on the branches.\r
These branch lengths are the percentage mismatch between two nodes. </p>\r
-<p> </p>\r
-<p><strong>Neighbour Joining tree</strong></p>\r
-<p> The distances between sequences for this tree are generated in the same way\r
- as for the UPGMA tree. The method of clustering is the neighbour joining method\r
- which doesn't just pick the two closest leaves to cluster together but compensates\r
- for long edges by subtracting from the distances the average distance from each\r
- leaf to all the others. <br>\r
- Selection and output options are the same as for the UPGMA tree.<br>\r
+ \r
+<p>\r
+ Selecting sequence ids at the leaves of the tree selects sequences\r
+ in the original alignment. These selections are reflected in any\r
+ other analysis windows open on the same alignment. </p>\r
+<p>\r
+ Clicking on an internal node of the tree will rearrange the tree\r
+ diagram, inverting the ordering of the branches at that node.\r
+</p>\r
+<p>\r
+ Clicking anywhere along the extent of the tree (but not on a leaf or\r
+ internal node) defines a tree 'partition', by cutting every branch\r
+ of the tree spanning the depth where the mouse-click occurred. Groups\r
+ are created containing sequences at the leaves of each connected\r
+ subtree. Each group is given a different colour, which is\r
+ reflected in other windows in the same way as if the sequence ids\r
+ were selected, and the groups can be edited in the same way as user\r
+ defined sequence groups.\r
</p>\r
+<p>Tree partitions are useful for comparing clusterings produced by\r
+different methods and measures. They are also an effective way of\r
+identifying specific patterns of conservation and mutation\r
+corresponding to the overall phylogenetic structure, when combined\r
+with the <a href="../colourSchemes/conservation.html">conservation\r
+based colour scheme</a>.</p>\r
+\r
+\r
+<p><strong>External Sources for Phylogenetic Tree Construction</strong></p>\r
+ <p>A number of programs exist for the reliable construction of\r
+ phylogenetic trees, which can cope with large numbers of sequences,\r
+ use better distance methods and can perform bootstrapping. See the\r
+ <a href="../webservices/phylogeny.html">Phylogenetic Web\r
+ Services</a> page for directly accessible methods. It will also be\r
+ possible to read trees into Jalview directly, in the near future.\r
+ </p>\r
+\r
</body>\r
</html>\r
<html>\r
<head><title>Conservation Calculation</title></head>\r
<body>\r
-<p><em>Conservation Colours</em></p>\r
-<p>This option is based on the AMAS method of multiple sequence alignment analysis\r
- (Livingstone C.D. and Barton G.J. (1993), Protein Sequence Alignments: A Strategy\r
- for the Hierarchical Analysis of Residue Conservation.CABIOS Vol. 9 No. 6 (745-756)).\r
- <br>\r
- Hierarchical analysis is based on each residue having certain physico-chemical\r
- properties.</p>\r
-<p>The alignment can first be divided into groups. This is best done by first\r
- creating an average distance tree (Calculate->Average distance tree). Selecting\r
- a position on the tree will cluster the sequences into groups depending on the\r
- position selected. Each group is coloured a different colour which is used for\r
- both the ids in the tree and alignment windows and the sequences themselves.\r
- If a PCA window is visible a visual comparison can be made between the clustering\r
- based on the tree and the PCA. </p>\r
-<p>The grouping by tree may not be satisfactory and the user may want to edit\r
- the groups to put any outliers together. </p>\r
-<p>When the conservation option is selected the existing colour scheme is modified\r
- so that the most conserved columns in each group have the most intense colours\r
- and the least conserved are the palest.</p>\r
-<p> </p>\r
+<p><em>Colouring by Conservation</em></p>\r
+<p>This is an approach to alignment colouring based on the one used in\r
+ the AMAS method of multiple sequence alignment analysis (Livingstone\r
+ C.D. and Barton G.J. (1993), Protein Sequence Alignments: A Strategy \r
+ for the Hierarchical Analysis of Residue Conservation. CABIOS Vol. 9\r
+ No. 6 (745-756)). \r
+</p>\r
+<p>Conservation is measured as a numerical index reflecting the\r
+ conservation of physico-chemical properties in the alignment:\r
+ identities score highest, and the next most conserved group contains\r
+ substitutions to amino acids lying in the same physico-chemical\r
+ class.</p>\r
+<p>For an already coloured alignment, the conservation index at each\r
+ alignment position is used to modify the shading intensity of the\r
+ colour at that position. This means that the most conserved columns\r
+ in each group have the most intense colours, and the least conserved\r
+ are the palest. The slider controls the contrast between these\r
+ extremes.</p>\r
+<p>Conservation can be calculated over the whole alignment, or just\r
+ within specific groups of sequences (such as those defined by\r
+ <a href="../calculations/tree.html">phylogenetic tree partitioning</a>).\r
+ The option 'apply to all groups' controls whether the contrast\r
+ slider value will be applied to the indices for the currently\r
+ selected group, or all groups defined over the alignment.</p>\r
</body>\r
</html>\r