
Calculate menu
Consensus sequence
Sorting sequence by pairwise identity to consensus
Sorting by tree order
Sorting by group order
Removing redundancy of sequences
Smith-Waterman pairwise alignment
Principal component analysis
UPGMA tree using percentage identity distances
Neighbour joining tree using percentage identity distances
Conservation

Consensus sequence
Each residue in the consensus sequence is the most frequent residue in the corresponding column of the alignment, excluding the gap characters ' ', '-' and '.'. You can't access the consensus sequence directly, but it is used in the PID colour scheme.
When the editor first starts up, the consensus sequence is automatically calculated using all the sequences in the alignment, and the PID colour scheme is used as the default. If the consensus option is selected again, only the currently selected sequences are used to calculate it, and all sequences in the alignment are coloured according to that consensus.
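
The column-by-column rule above can be sketched in a few lines of Python. This is only an illustration and not the editor's own code: the function name, the handling of ties (the residue seen first wins) and the '-' placed in columns that contain nothing but gaps are all assumptions.

```python
from collections import Counter

GAP_CHARS = {" ", "-", "."}  # the gap characters named in the text above


def consensus(sequences):
    """Return the most frequent non-gap residue for each alignment column."""
    width = max(len(s) for s in sequences)
    result = []
    for col in range(width):
        # Collect the residues in this column, skipping gaps and short rows.
        residues = [s[col] for s in sequences
                    if col < len(s) and s[col] not in GAP_CHARS]
        if residues:
            # Assumption: on a tie, the residue encountered first wins.
            result.append(Counter(residues).most_common(1)[0][0])
        else:
            # Assumption: a column holding only gaps gets a '-' placeholder.
            result.append("-")
    return "".join(result)


print(consensus(["ACD-", "ACE-", "TCEF"]))
```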

Sorting sequences
Once a consensus calculation has been done, selecting this option will sort the selected sequences by their percentage identity to the consensus sequence. The most similar sequence is put at the top. If no sequences are selected then the whole alignment is sorted.
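
As a rough sketch of this sort (not the editor's own code), the Python below ranks sequences by percentage identity to a consensus string. The exact definition of percentage identity is an assumption here: columns where either sequence has a gap are simply left out of the comparison. The function names are hypothetical.

```python
GAP_CHARS = {" ", "-", "."}


def percent_identity(seq, ref):
    """Percentage of compared columns in which seq and ref hold the same residue.

    Assumption: columns where either sequence has a gap are not compared.
    """
    compared = matches = 0
    for a, b in zip(seq, ref):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue
        compared += 1
        matches += (a == b)
    return 100.0 * matches / compared if compared else 0.0


def sort_by_identity(sequences, consensus_seq):
    """Order (id, sequence) pairs so the most similar sequence comes first."""
    return sorted(sequences,
                  key=lambda item: percent_identity(item[1], consensus_seq),
                  reverse=True)


alignment = [("a", "ACDF"), ("b", "ACEF"), ("c", "TTEF")]
for seq_id, seq in sort_by_identity(alignment, "ACEF"):
    print(seq_id, seq, percent_identity(seq, "ACEF"))
```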
Sorting by tree order
If a UPGMA tree or a neighbour joining tree has been displayed, selecting this option reorders the sequences in the main alignment window so that they appear in the same order as in the tree. This makes for easier comparison of the tree and the alignment.

Sorting by group order
If the sequences have been grouped, either by hand or by selecting a point on the tree, then this option will reorder the alignment so that all sequences in the same group are together. The largest group is shown at the top of the alignment and the smallest at the bottom.
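
A small Python sketch of this reordering, for illustration only. The names are hypothetical, and the treatment of equally sized groups (the group seen first stays on top) is an assumption.

```python
from collections import defaultdict


def sort_by_group(sequence_ids, group_of):
    """Reorder ids so members of a group are adjacent, largest group first."""
    groups = defaultdict(list)
    for seq_id in sequence_ids:
        groups[group_of[seq_id]].append(seq_id)
    # sorted() is stable, so groups of equal size keep their first-seen order
    # (an assumption about tie handling, not something the text specifies).
    ordered = sorted(groups.values(), key=len, reverse=True)
    return [seq_id for members in ordered for seq_id in members]


ids = ["s1", "s2", "s3", "s4", "s5"]
group_of = {"s1": "A", "s2": "B", "s3": "B", "s4": "A", "s5": "B"}
print(sort_by_group(ids, group_of))
```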
Removing redundancy
Selecting this option brings up a window asking you to select a threshold. If the percentage identity between two sequences exceeds this value, one of the sequences (the shorter) is discarded. The redundancy calculation is done when the Apply button is pressed. For large numbers of sequences this can take a long time, as all pairs have to be compared.
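
The all-pairs comparison can be sketched as follows. This is an illustration rather than the editor's implementation: the percentage identity definition (gapped columns skipped), measuring length by counting non-gap residues, and dropping the later sequence when two have equal length are all assumptions.

```python
GAP_CHARS = {" ", "-", "."}


def residue_count(seq):
    """Sequence length measured as the number of non-gap characters."""
    return sum(1 for c in seq if c not in GAP_CHARS)


def percent_identity(a, b):
    """Percentage of matching residues over columns where neither has a gap."""
    compared = matches = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        matches += (x == y)
    return 100.0 * matches / compared if compared else 0.0


def remove_redundancy(sequences, threshold):
    """Drop the shorter of every pair whose identity exceeds the threshold."""
    discarded = set()
    n = len(sequences)
    for i in range(n):
        if i in discarded:
            continue
        for j in range(i + 1, n):
            if j in discarded:
                continue
            if percent_identity(sequences[i][1], sequences[j][1]) > threshold:
                if residue_count(sequences[i][1]) < residue_count(sequences[j][1]):
                    discarded.add(i)
                    break  # sequence i is gone, move on to the next i
                # Assumption: on equal length the later sequence is dropped.
                discarded.add(j)
    return [item for k, item in enumerate(sequences) if k not in discarded]


alignment = [("a", "ACDEFG"), ("b", "ACDEF-"), ("c", "TTTTTT")]
print(remove_redundancy(alignment, 90))
```

The nested loop makes the cost visible: n sequences need n(n-1)/2 comparisons, which is why large alignments take a long time.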
Pairwise alignment (Proteins only)
This calculation is performed on the selected sequences only. Java is not the fastest language in the world and aligning more than a handful of sequences will take a fair amount of time.
For each pair of sequences, the best global alignment is found using BLOSUM62 as the scoring matrix. The scores reported are the raw scores. The sequences are aligned using a dynamic programming technique with the following gap penalties:

Gap open : 12
Gap extend : 2
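
To give a feel for the dynamic programming involved, here is a short Python sketch of a global alignment score with separate gap open and gap extend penalties. It is not the editor's code. Two things are assumptions: score_pair is a toy stand-in for a BLOSUM62 lookup (the full matrix is too long to list here), and a gap of length k is taken to cost 12 + 2(k-1), which the text above does not spell out.

```python
NEG_INF = float("-inf")
GAP_OPEN = 12    # penalty for the first position of a gap
GAP_EXTEND = 2   # penalty for each further position (assumed convention)


def score_pair(a, b):
    """Toy stand-in for a BLOSUM62 lookup; the values are illustrative only."""
    return 4 if a == b else -1


def global_alignment_score(s, t):
    """Best global alignment score of s and t with affine gap penalties."""
    n, m = len(s), len(t)
    # M: s[i-1] aligned with t[j-1]; X: s[i-1] against a gap; Y: t[j-1] against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(GAP_OPEN + (i - 1) * GAP_EXTEND)
    for j in range(1, m + 1):
        Y[0][j] = -(GAP_OPEN + (j - 1) * GAP_EXTEND)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            M[i][j] = diagonal + score_pair(s[i - 1], t[j - 1])
            # Either open a new gap or extend one that is already running.
            X[i][j] = max(M[i - 1][j] - GAP_OPEN,
                          Y[i - 1][j] - GAP_OPEN,
                          X[i - 1][j] - GAP_EXTEND)
            Y[i][j] = max(M[i][j - 1] - GAP_OPEN,
                          X[i][j - 1] - GAP_OPEN,
                          Y[i][j - 1] - GAP_EXTEND)
    return max(M[n][m], X[n][m], Y[n][m])


print(global_alignment_score("ACDEF", "ACF"))
```

Each pair fills three tables of size (n+1) by (m+1), so the time grows with the product of the two sequence lengths and with the number of pairs.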

When you select the pairwise alignment option a new window will come up which will display the alignments in a text format as they are calculated. Also displayed is information about the alignment such as alignment score, length and percentage identity between the sequences.

If you want to save that pairwise alignment (it's not in any known format I'm afraid) you can cut and paste it from the text window with the mouse. You can also press the 'View in alignment editor' button to bring up another editor window.

Principal Component Analysis
This is a method of clustering sequences based on the method developed by G. Casari, C. Sander and A. Valencia (Nature Structural Biology, volume 2, no. 2, February 1995). Extra information can also be found at the SeqSpace server at the EBI.
The version implemented here only looks at the clustering of whole sequences and not individual positions in the alignment to help identify functional residues. For large alignments plans are afoot to use the CORBA server written by Chris Dodge to do this 'residue space' PCA remotely.

When the Calculate->Principal component analysis option is selected, all the sequences (not just the selected ones) are used in the calculation, and for large numbers of sequences this could take quite a time. When the calculation is finished, a new window is displayed showing the projections of the sequences along the 2nd, 3rd and 4th vectors, giving a 3-dimensional view of how the sequences cluster.

This 3d view can be rotated by holding the left mouse button down in the PCA window and moving the mouse. The user can also zoom in and out by using the up and down arrow keys.

Individual points can be selected using the mouse. Selected sequences show up green in the PCA window and with the usual grey background/white text in the alignment and tree windows.

Different eigenvectors can be used to do the projection by changing the selected dimensions in the 3 menus underneath the 3d window.

UPGMA tree
If this option is selected from the Calculate menu then all sequences are used to generate a UPGMA tree. The pairwise distances used to cluster the sequences are the percentage mismatch between two sequences. For a reliable phylogenetic tree I recommend other programs (phylowin, phylip) should be used as they have the speed to use better distance methods and bootstrapping. Again, plans are afoot for a server to do this and to be able to read in tree files generated by other programs.
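
A minimal Python sketch of UPGMA clustering on percentage mismatch distances, for illustration only and not the editor's own code. The function names, the nested-tuple tree representation and the rule that gapped columns are skipped when counting mismatches are assumptions.

```python
from itertools import combinations

GAP_CHARS = {" ", "-", "."}


def percent_mismatch(a, b):
    """Percentage of compared columns in which the two residues differ.

    Assumption: columns where either sequence has a gap are skipped.
    """
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        mismatches += (x != y)
    return 100.0 * mismatches / compared if compared else 0.0


def upgma(names, dist):
    """Cluster names into a nested-tuple tree.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        for other in clusters:
            # Distance to the new cluster is the size-weighted average.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


seqs = {"A": "ACDEFGHIKL", "B": "ACDEFGHIKV", "C": "TCDEYGHWKV"}
dist = {frozenset((p, q)): percent_mismatch(seqs[p], seqs[q])
        for p, q in combinations(seqs, 2)}
print(upgma(list(seqs), dist))
```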
When the tree has been calculated a new window is displayed showing the tree with labels on the leaves showing the sequence ids. The user can select the ids with the mouse and the selected sequences will also be selected in the alignment window and the PCA window if that analysis has been calculated.

Selecting the 'show distances' checkbox will put branch lengths on the branches. These branch lengths are the percentage mismatch between two nodes.

Postscript output can be generated for this tree and mailed to you by clicking the Output button. This will bring up a window asking you for your email address and you can set font options and the page orientation. Clicking the Apply button will generate the postscript and send the email.

Neighbour Joining tree
The distances between sequences for this tree are generated in the same way as for the UPGMA tree. The method of clustering is the neighbour joining method which doesn't just pick the two closest leaves to cluster together but compensates for long edges by subtracting from the distances the average distance from each leaf to all the others.
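
The effect of that compensation can be seen in a small Python sketch of the pair selection step, written with the usual neighbour joining criterion Q(i,j) = (n-2)d(i,j) - sum_k d(i,k) - sum_k d(j,k). This is an illustration only: the function names are hypothetical and the four-leaf distance table is made up so that the closest pair and the neighbour joining pair differ.

```python
from itertools import combinations


def closest_pair(names, d):
    """The pair with the smallest raw distance."""
    return min(combinations(names, 2), key=lambda p: d[frozenset(p)])


def nj_pick_pair(names, d):
    """The pair that minimises the neighbour joining criterion Q."""
    n = len(names)
    # Total distance from each leaf to all the others.
    total = {a: sum(d[frozenset((a, b))] for b in names if b != a)
             for a in names}

    def q(pair):
        a, b = pair
        return (n - 2) * d[frozenset(pair)] - total[a] - total[b]

    return min(combinations(names, 2), key=q)


# Made-up distances: B and D sit on long edges, A and C on short ones.
names = ["A", "B", "C", "D"]
d = {frozenset(("A", "B")): 5, frozenset(("A", "C")): 3,
     frozenset(("A", "D")): 6, frozenset(("B", "C")): 6,
     frozenset(("B", "D")): 9, frozenset(("C", "D")): 5}

print(closest_pair(names, d))   # the pair with the smallest raw distance
print(nj_pick_pair(names, d))   # the pair chosen after correcting for long edges
```

In this table A and C are the closest leaves, but only because both sit on short edges. Subtracting the totals makes A and B (and equally C and D) the better pair to join first.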
Selection and output options are the same as for the UPGMA tree.

Conservation
This option is based on the AMAS method of multiple sequence alignment analysis (Livingstone C.D. and Barton G.J. (1993), Protein Sequence Alignments: A Strategy for the Hierarchical Analysis of Residue Conservation. CABIOS Vol. 9 No. 6 (745-756)).
Hierarchical analysis is based on each residue having certain physico-chemical properties listed as follows:

In brief, go about it like this:

The alignment can first be divided into groups. This is best done by first creating an average distance tree (Calculate->Average distance tree). Selecting a position on the tree will cluster the sequences into groups depending on the position selected. Each group is given a different colour, which is used for the ids in the tree and alignment windows and for the sequences themselves. If a PCA window is visible, a visual comparison can be made between the clustering based on the tree and the PCA.
This link provides an example of the output after grouping for Pfam family rnaseH:

The grouping by tree may not be satisfactory and the user may want to edit the groups (Edit->Groups...) to put any outliers together.

Before selecting the conservation option, change the colour scheme to something sensible (Taylor or hydrophobicity, for example). When the conservation analysis is done, the existing colour scheme is modified so that the most conserved columns in each group have the most intense colours and the least conserved are the palest.

This link shows the results of first colouring the alignment by hydrophobicity (Colour->by hydrophobicity) then performing conservation analysis (Calculate->Conservation). Conserved hydrophobic columns are shown with predominately red residues and conserved hydrophilic columns with blue. The most conserved regions have the brightest colours.

This shows the same conservation analysis but with Taylor colours instead of hydrophobicity (Colour->Taylor).

The conservation analysis is done on each sequence group. This highlights differences and similarities in conserved residue properties between groups.