Principal Component Analysis

From: jprocter
Date: Tue, 8 Mar 2005 16:42:01 +0000 (+0000)
Subject: more precise methods information in help description.
X-Git-Tag: Release_2_0~619
X-Git-Url: http://source.jalview.org/gitweb/?a=commitdiff_plain;h=998ae297436877c6086f657e0a02c376cb5eb4dd;p=jalview.git

more precise methods information in help description.
---
diff --git a/help/help.jhm b/help/help.jhm
index 0bf7ee8..c843321 100755
--- a/help/help.jhm
+++ b/help/help.jhm
@@ -16,7 +16,6 @@
-
diff --git a/help/helpIndex.xml b/help/helpIndex.xml
index d38cad8..62ea52f 100755
--- a/help/helpIndex.xml
+++ b/help/helpIndex.xml
@@ -8,8 +8,8 @@
-
-
+
+
diff --git a/help/html/calculations/pca.html b/help/html/calculations/pca.html
index 12fbab8..e824d7d 100755
--- a/help/html/calculations/pca.html
+++ b/help/html/calculations/pca.html
@@ -2,28 +2,55 @@
Principal Component Analysis

Principal Component Analysis

This is a method of clustering sequences based on the method developed by G. - Casari, C. Sander and A. Valencia. Structural Biology volume 2, no. 2, February - 1995 . Extra information can also be found at the SeqSpace server at the EBI. -
- The version implemented here only looks at the clustering of whole sequences - and not individual positions in the alignment to help identify functional residues. - For large alignments plans are afoot to implement a web service to do this 'residue - space' PCA remotely.

When the Principal component analysis option is selected all the sequences - ( or just the selected ones) are used in the calculation and for large numbers - of sequences this could take quite a time. When the calculation is finished - a new window is displayed showing the projections of the sequences along the - 2nd, 3rd and 4th vectors giving a 3dimensional view of how the sequences cluster. +

This calculation creates a spatial representation of the +similarities within a selected group, or all of the sequences in +an alignment. After the calculation finishes, a 3D viewer displays the +set of sequences as points in 'similarity space', and similar +sequences tend to lie near each other in the space.

Note: The calculation is computationally expensive, and may fail for very large sets of sequences - + usually because the JVM has run out of memory. The next release of + Jalview will execute this calculation through a web service.

Principal components analysis is a technique for examining the +structure of complex datasets. The components are a set of dimensions +formed from the measured values in the dataset, and the principal +component is the one with the greatest magnitude, or length. The +sets of measurements that differ the most should lie at either end of +this principal axis, and the other axes correspond to less extreme +patterns of variation in the dataset.

This 3d view can be rotated by holding the left mouse button down in the PCA - window and moving it. The user can also zoom in and out by using the up and - down arrow keys.

Individual points can be selected using the mouse and selected sequences show - up green in the PCA window and the usual grey background/white text in the alignment - and tree windows.

Different eigenvectors can be used to do the projection by changing the selected - dimensions in the 3 menus underneath the 3d window.
+ +

In this case, the components are generated by an eigenvector +decomposition of the matrix formed from the sum of BLOSUM scores at +each aligned position between each pair of sequences. The basic method +is described in the paper by G. Casari, C. Sander and +A. Valencia. Structural Biology volume 2, no. 2, February 1995 (pubmed) + and implemented at the SeqSpace server (http://industry.ebi.ac.uk/SeqSpace) at the EBI.
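The decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_score` is a hypothetical stand-in for a BLOSUM lookup, the function names are invented for this example, and power iteration recovers only the leading component. Any centring or scaling that Jalview applies to the matrix is not shown.

```python
import math

def toy_score(x, y):
    # Hypothetical stand-in for a BLOSUM lookup: 1 for identical residues, else 0.
    return 1.0 if x == y else 0.0

def similarity_matrix(seqs, score=toy_score):
    # Entry [i][j] is the sum of per-position scores between sequences i and j.
    return [[sum(score(x, y) for x, y in zip(a, b)) for b in seqs] for a in seqs]

def leading_eigenvector(m, iterations=200):
    # Power iteration: repeatedly multiply by m and renormalise.
    n = len(m)
    v = [1.0 + 0.01 * i for i in range(n)]  # slightly non-uniform start vector
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the matching eigenvalue (v has unit length).
    eigenvalue = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigenvalue, v

matrix = similarity_matrix(["ACDE", "ACDF", "GHIK"])
value, vector = leading_eigenvector(matrix)
```

Each entry of `vector` corresponds to one input sequence, so sequences with similar entries sit close together along that component.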

+ +

The PCA Viewer

This is an interactive display of the sequences positioned within + the similarity space. The colour of each sequence point is the same + as the sequence group colouring, white if no colour has been + defined for the sequence, and green if the sequence is part of + the currently selected group. +

The 3d view can be rotated by dragging the mouse with the + left mouse button pressed. The view can also be + zoomed in and out with the up and down arrow + keys.

A tool tip gives the sequence ID corresponding to a point in the + space, and clicking a point toggles the selection of the + corresponding sequence in the alignment window. Rectangular region + based selection is also possible, by holding the 's' key whilst + left-clicking and dragging the mouse over the display. +

Initially, the display shows the first three components of the + similarity space, but any eigenvector can be used by changing the selected + dimension for the x, y, or z axis through each one's menu located + below the 3d display. +

+
diff --git a/help/html/calculations/tree.html b/help/html/calculations/tree.html
index 16d7bbc..a8dbe59 100755
--- a/help/html/calculations/tree.html
+++ b/help/html/calculations/tree.html
@@ -1,28 +1,85 @@
Tree Calculation
-

UPGMA tree

If this option is selected then all sequences are used to generate a UPGMA - tree. The pairwise distances used to cluster the sequences are the percentage - mismatch between two sequences. For a reliable phylogenetic tree I recommend - other programs (phylowin, phylip) should be used as they have the speed to use - better distance methods and bootstrapping. Again, plans are afoot for a server - to do this and to be able to read in tree files generated by other programs. -
- When the tree has been calculated a new window is displayed showing the tree - with labels on the leaves showing the sequence ids. The user can select the - ids with the mouse and the selected sequences will also be selected in the alignment - window and the PCA window if that analysis has been calculated.

Calculation of trees from alignments

Trees are calculated on either the complete alignment, or just the +currently selected group of sequences. There are four different +calculations, using one of two distance measures and constructing the +tree from one of two algorithms : +

Distance Measures

Trees are calculated on the basis of a measure of similarity +between each pair of sequences in the alignment : +

PID
The percentage identity between the two +sequences at each aligned position. +
BLOSUM62
The sum of BLOSUM62 scores for the +residue pair at each aligned position. +
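As a rough illustration of the two measures, the sketch below computes a percentage identity and a summed substitution score for one pair of aligned sequences. The function names, the gap handling (columns containing a gap are skipped for PID), and the three-entry `toy_matrix` are assumptions made for this example; they are not taken from Jalview, and a real calculation would use a full BLOSUM62 table.

```python
def percent_identity(a, b, gap="-"):
    # Assumption: only columns where neither sequence has a gap are counted.
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)

def summed_score(a, b, matrix):
    # Sum the substitution score of the residue pair at each aligned position.
    # The matrix is symmetric, so look the pair up in either order.
    total = 0
    for x, y in zip(a, b):
        total += matrix.get((x, y), matrix.get((y, x), 0))
    return total

# Tiny illustrative subset only, not a complete substitution matrix.
toy_matrix = {("A", "A"): 4, ("A", "S"): 1, ("S", "S"): 4}

pid = percent_identity("ACDE", "ACDF")
score = summed_score("AS", "SS", toy_matrix)
```

Note that PID is a similarity bounded by 100, while the summed score grows with alignment length, so the two measures are not directly comparable.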

Tree Construction Methods

Jalview currently supports two kinds of agglomerative clustering +methods. These are not intended to substitute for rigorous +phylogenetic tree construction, and may fail on very large alignments. +

UPGMA tree
+ UPGMA stands for Unweighted Pair-Group Method using Arithmetic + averages. Clusters are iteratively formed and extended by finding a + non-member sequence with the lowest average dissimilarity over the + cluster members. +
+
Neighbour Joining tree
+ First described in 1987 by Saitou and Nei, this method applies a + greedy algorithm to find the tree with the shortest branch + lengths.
+ This method, as implemented in Jalview, is considerably more + expensive than UPGMA. +
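A minimal sketch of average-linkage (UPGMA) clustering follows, assuming a symmetric matrix of pairwise dissimilarities as input. It uses the common formulation that merges the closest pair of clusters at each step, which may differ in detail from the Jalview code; the function name and the nested-tuple tree representation are choices made for this example.

```python
def upgma(names, dist):
    """Return a nested (left, right, height) tuple; leaves are plain names."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, leaf count)
    d = {(i, j): float(dist[i][j]) for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of clusters (ties broken by insertion order).
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        del d[(a, b)]
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean of the two old distances.
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b, dab / 2), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

example = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
```

Here A and B join first at height 1.0, and C joins the pair at height 3.0. Each pass scans all remaining pairs, which is simple but slow for large alignments.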

The Tree Viewing Window

+ When the tree has been calculated a window is displayed showing the + tree, with the leaves labelled with sequence ids.

Selecting the 'show distances' checkbox will put branch lengths on the branches. These branch lengths are the percentage mismatch between two nodes.

Neighbour Joining tree

The distances between sequences for this tree are generated in the same way - as for the UPGMA tree. The method of clustering is the neighbour joining method - which doesn't just pick the two closest leaves to cluster together but compensates - for long edges by subtracting from the distances the average distance from each - leaf to all the others.
- Selection and output options are the same as for the UPGMA tree.
+ +

+ Selecting sequence ids at the leaves of the tree selects sequences + in the original alignment. These selections are reflected in any + other analysis windows open on the same alignment.

+ Clicking on an internal node of the tree will rearrange the tree + diagram, inverting the ordering of the branches at that node. +

+ Clicking anywhere along the extent of the tree (but not on a leaf or + internal node) defines a tree 'partition', by cutting every branch + of the tree spanning the depth where the mouse-click occurred. Groups + are created containing sequences at the leaves of each connected + subtree. These groups are each given a different colour, which is + reflected in other windows in the same way as if the sequence ids + were selected, and can be edited in the same way as user defined + sequence groups.
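The partitioning step can be sketched as a recursive cut through the tree. The representation below is an assumption for this example: a leaf is a sequence id string, and an internal node is a (left, right, height) tuple where height is the depth at which its children join. The function names are hypothetical.

```python
def leaves(tree):
    # Collect the sequence ids under a node, left to right.
    if isinstance(tree, str):
        return [tree]
    left, right, _height = tree
    return leaves(left) + leaves(right)

def partition(tree, cut_height):
    # A subtree that joins at or below the cut stays together as one group;
    # a node above the cut is split and each child is handled separately.
    if isinstance(tree, str) or tree[2] <= cut_height:
        return [leaves(tree)]
    left, right, _height = tree
    return partition(left, cut_height) + partition(right, cut_height)

tree = ("C", ("A", "B", 1.0), 3.0)
groups = partition(tree, 2.0)
```

Cutting this example tree at 2.0 separates C from the A/B pair; a cut below 1.0 yields three single-sequence groups, and a cut above 3.0 yields one group.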

Tree partitions are useful for comparing clusterings produced by +different methods and measures. They are also an effective way of +identifying specific patterns of conservation and mutation +corresponding to the overall phylogenetic structure, when combined +with the conservation +based colour scheme.

+ + +

External Sources for Phylogenetic Tree Construction

A number of programs exist for the reliable construction of + phylogenetic trees, which can cope with large numbers of sequences, + use better distance methods and can perform bootstrapping. See the + Phylogenetic Web + Services page for directly accessible methods. It will also be + possible to read trees into Jalview directly, in the near future. +

+
diff --git a/help/html/colourSchemes/conservation.html b/help/html/colourSchemes/conservation.html
index 29756ba..3ac2c13 100755
--- a/help/html/colourSchemes/conservation.html
+++ b/help/html/colourSchemes/conservation.html
@@ -1,25 +1,29 @@
Conservation Calculation
-

Conservation Colours

This option is based on the AMAS method of multiple sequence alignment analysis - (Livingstone C.D. and Barton G.J. (1993), Protein Sequence Alignments: A Strategy - for the Hierarchical Analysis of Residue Conservation.CABIOS Vol. 9 No. 6 (745-756)). -
- Hierarchical analysis is based on each residue having certain physico-chemical - properties.

The alignment can first be divided into groups. This is best done by first - creating an average distance tree (Calculate->Average distance tree). Selecting - a position on the tree will cluster the sequences into groups depending on the - position selected. Each group is coloured a different colour which is used for - both the ids in the tree and alignment windows and the sequences themselves. - If a PCA window is visible a visual comparison can be made between the clustering - based on the tree and the PCA.

The grouping by tree may not be satisfactory and the user may want to edit - the groups to put any outliers together.

When the conservation option is selected the existing colour scheme is modified - so that the most conserved columns in each group have the most intense colours - and the least conserved are the palest.

Colouring by Conservation

This is an approach to alignment colouring based on the one used in + the AMAS method of multiple sequence alignment analysis (Livingstone + C.D. and Barton G.J. (1993), Protein Sequence Alignments: A Strategy + for the Hierarchical Analysis of Residue Conservation. CABIOS Vol. 9 + No. 6 (745-756)). +

Conservation is measured as a numerical index reflecting the + conservation of physico-chemical properties in the alignment: + Identities score highest, and the next most conserved group contains + substitutions to amino acids lying in the same physico-chemical + class.

For an already coloured alignment, the conservation index at each + alignment position is used to modify the shading intensity of the + colour at that position. This means that the most conserved columns + in each group have the most intense colours, and the least conserved + are the palest. The slider controls the contrast between these + extremes.
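One way to picture the shading is as a blend between white and the base colour, weighted by the conservation index. The sketch below is purely illustrative: the linear blend, the `max_index` parameter and the `contrast` exponent are assumptions made for this example and are not the formula Jalview uses.

```python
def shade(rgb, index, max_index, contrast=1.0):
    # Fraction of the base colour to keep: 0 gives white, 1 gives the full colour.
    fraction = min(max(index / max_index, 0.0), 1.0) ** contrast
    # Move each channel from white (255) towards the base colour.
    return tuple(round(255 - (255 - channel) * fraction) for channel in rgb)

fully_conserved = shade((255, 0, 0), 10, 10)
unconserved = shade((255, 0, 0), 0, 10)
```

In this sketch a larger `contrast` value makes partially conserved columns paler, which is one plausible reading of what a contrast slider does.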

Conservation can be calculated over the whole alignment, or just + within specific groups of sequences (such as those defined by + phylogenetic tree partitioning). + The option 'apply to all groups' controls whether the contrast + slider value will be applied to the indices for the currently + selected group, or all groups defined over the alignment.