Sorting sequences
Once a consensus calculation has been done, selecting this option sorts the
selected sequences by their percentage identity to the consensus sequence. The
most similar sequence is put at the top. If no sequences are selected then the
whole alignment is sorted.
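The sort described above can be sketched roughly as follows. This is a minimal illustration, not the program's own code: the function names, the majority-vote consensus and the choice to compute identity only over columns where neither sequence has a gap are all assumptions.

```python
from collections import Counter

def consensus(seqs):
    """Most common non-gap residue in each column; '-' if a column is all gaps."""
    out = []
    for column in zip(*seqs):
        counts = Counter(c for c in column if c != "-")
        out.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(out)

def percent_identity(a, b):
    """Identity over columns where neither sequence has a gap (assumed definition)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def sort_by_consensus(alignment):
    """alignment: dict of id -> aligned sequence. Returns ids, most similar first."""
    cons = consensus(list(alignment.values()))
    return sorted(alignment, key=lambda sid: -percent_identity(alignment[sid], cons))

alignment = {"s1": "ACDE", "s2": "ACDF", "s3": "GCKF"}
print(sort_by_consensus(alignment))  # ['s2', 's1', 's3']
```

Here the consensus is ACDF, so s2 (100%) comes first, then s1 (75%), then s3 (50%).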
Sorting by tree order
If a UPGMA tree or a neighbour joining tree has been displayed then the main
alignment window displays the sequences in the same order as they appear in
the tree. This makes for easier comparison of the tree and the alignment.
Sorting by group order
If the sequences have been grouped either by hand or by selecting a point on
the tree then this option will reorder the alignment so all sequences in the
same group are together. The largest group is shown at the top of the alignment
and the smallest at the bottom.
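As a small sketch of this reordering (the function name and the data layout are hypothetical, and keeping the original order within a group and between groups of equal size is an assumption):

```python
from collections import Counter

def sort_by_group(order, group_of):
    """order: ids in current alignment order; group_of: dict of id -> group label.

    Returns the ids with each group kept together, largest group first.
    """
    sizes = Counter(group_of[sid] for sid in order)
    first_seen = {}
    for pos, sid in enumerate(order):
        first_seen.setdefault(group_of[sid], pos)
    # sorted() is stable, so sequences keep their relative order inside a group.
    return sorted(order, key=lambda sid: (-sizes[group_of[sid]], first_seen[group_of[sid]]))

order = ["a", "b", "c", "d", "e"]
group_of = {"a": 1, "b": 2, "c": 2, "d": 3, "e": 2}
print(sort_by_group(order, group_of))  # ['b', 'c', 'e', 'a', 'd']
```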
Removing redundancy
Selecting this option brings up a window asking you to select a threshold. If
the percentage identity between two sequences exceeds this value one of the
sequences (the shorter) is discarded. The redundancy calculation is done when
the Apply button is pressed. For large numbers of sequences this can take a
long time as all pairs have to be compared.
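The all-pairs comparison can be sketched as below, which also shows why the cost grows with the square of the number of sequences. The function names are hypothetical. Measuring length as the number of non-gap residues, computing identity only over ungapped columns, and dropping the later sequence when lengths tie are assumptions.

```python
def percent_identity(a, b):
    """Identity over columns where neither sequence has a gap (assumed definition)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def remove_redundancy(alignment, threshold):
    """alignment: dict of id -> aligned sequence. Returns the ids that are kept."""
    def residues(sid):
        return len(alignment[sid].replace("-", ""))

    ids = list(alignment)
    removed = set()
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if a in removed:
                break          # a was discarded, nothing left to compare it with
            if b in removed:
                continue
            if percent_identity(alignment[a], alignment[b]) > threshold:
                # Discard the shorter of the pair (the later one on a tie).
                removed.add(a if residues(a) < residues(b) else b)
    return [sid for sid in ids if sid not in removed]

alignment = {"x": "ACDEFG", "y": "ACDE--", "z": "WWWWWW"}
print(remove_redundancy(alignment, 90))  # ['x', 'z']
```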
Pairwise alignment (Proteins only)
This calculation is performed on the selected sequences only. Java is not the
fastest language in the world and aligning more than a handful of sequences
will take a fair amount of time.
For each pair of sequences the best global alignment is found using BLOSUM62
as the scoring matrix. The scores reported are the raw scores. The sequences
are aligned using a dynamic programming technique with the following gap
penalties:
Gap open : 12
Gap extend : 2
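To illustrate how separate gap open and gap extend penalties enter a global alignment score, here is a minimal sketch of the standard three-matrix (Gotoh-style) dynamic programme. It is not the program's own code. Two assumptions are made: a simple match/mismatch function stands in for BLOSUM62 to keep the example short, and a gap of length k is charged open + (k - 1) * extend, a convention the text above does not specify.

```python
NEG_INF = float("-inf")

def simple_score(x, y):
    """Stand-in for a BLOSUM62 lookup (assumption): +5 match, -4 mismatch."""
    return 5 if x == y else -4

def global_score(a, b, score=simple_score, gap_open=12, gap_extend=2):
    """Best global alignment score of a and b with affine gap penalties."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + score(a[i-1], b[j-1])
            # Opening a gap costs gap_open; continuing one costs gap_extend.
            X[i][j] = max(M[i-1][j] - gap_open, X[i-1][j] - gap_extend, Y[i-1][j] - gap_open)
            Y[i][j] = max(M[i][j-1] - gap_open, Y[i][j-1] - gap_extend, X[i][j-1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])

print(global_score("ACDE", "ACDE"))
print(global_score("ACDE", "ACE"))
```

With these stand-in scores, aligning ACDE with ACE gives three matches and one gap of length 1, so the raw score is 15 - 12 = 3.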
When you select the pairwise alignment option a new window opens and displays the alignments in a text format as they are calculated. Each alignment is shown with its alignment score, its length and the percentage identity between the two sequences.
If you want to save that pairwise alignment (it's not in any known format I'm
afraid) you can cut and paste it from the text window with the mouse. You can
also press the 'View in alignment editor' button to bring up another editor
window.
Principal Component Analysis
This is a method of clustering sequences based on the method developed by G.
Casari, C. Sander and A. Valencia, Nature Structural Biology, volume 2, no. 2, February
1995. Extra information can also be found at the SeqSpace server at the EBI.
The version implemented here only clusters whole sequences. It does not analyse
individual positions in the alignment, which the published method uses to help identify functional residues.
For large alignments plans are afoot to use the CORBA server written by Chris
Dodge to do this 'residue space' PCA remotely.
When the Calculate->Principal component analysis option is selected, all the sequences (not just the selected ones) are used in the calculation, so for large numbers of sequences it can take some time. When the calculation is finished a new window shows the projections of the sequences along the 2nd, 3rd and 4th eigenvectors, giving a 3-dimensional view of how the sequences cluster.
This 3d view can be rotated by holding the left mouse button down in the PCA window and moving the mouse. The user can also zoom in and out using the up and down arrow keys.
Individual points can be selected using the mouse. Selected sequences show up green in the PCA window and with the usual grey background/white text in the alignment and tree windows.
Different eigenvectors can be used to do the projection by changing the selected
dimensions in the 3 menus underneath the 3d window.
UPGMA tree
If this option is selected from the Calculate menu then all sequences are used
to generate a UPGMA tree. The pairwise distances used to cluster the sequences
are the percentage mismatch between two sequences. For a reliable phylogenetic
tree I recommend using other programs (phylowin, phylip), as they have
the speed to use better distance methods and bootstrapping. Again, plans are
afoot for a server to do this and to be able to read in tree files generated
by other programs.
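A minimal sketch of UPGMA clustering on percentage mismatch distances is given below. It only returns the tree topology as nested tuples and does not compute branch lengths. The function names are hypothetical, and taking the mismatch over columns where neither sequence has a gap is an assumption.

```python
from itertools import combinations

def percent_mismatch(a, b):
    """100 minus percentage identity, over columns with no gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 100.0
    return 100.0 * sum(x != y for x, y in pairs) / len(pairs)

def upgma(alignment):
    """alignment: dict of id -> aligned sequence. Returns the tree as nested tuples."""
    sizes = {sid: 1 for sid in alignment}           # cluster label -> number of leaves
    dist = {frozenset((a, b)): percent_mismatch(alignment[a], alignment[b])
            for a, b in combinations(alignment, 2)}
    while len(sizes) > 1:
        # Join the two clusters with the smallest average distance.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for c in sizes:
            # Size-weighted average keeps this the mean over all leaf pairs.
            dist[frozenset((merged, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))

alignment = {"A": "AAAA", "B": "AAAT", "C": "TTTT", "D": "TTTA"}
print(upgma(alignment))  # (('A', 'B'), ('C', 'D'))
```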
When the tree has been calculated a new window is displayed showing the tree
with labels on the leaves showing the sequence ids. The user can select the
ids with the mouse and the selected sequences will also be selected in the alignment
window and the PCA window if that analysis has been calculated.
Selecting the 'show distances' checkbox will put branch lengths on the branches. These branch lengths are the percentage mismatch between two nodes.
PostScript output can be generated for this tree and mailed to you by clicking
the Output button. This will bring up a window asking you for your email address,
where you can also set font options and the page orientation. Clicking the Apply button
will generate the PostScript and send the email.
Neighbour Joining tree
The distances between sequences for this tree are generated in the same way
as for the UPGMA tree. The method of clustering is the neighbour joining method
which doesn't just pick the two closest leaves to cluster together but compensates
for long edges by subtracting from the distances the average distance from each
leaf to all the others.
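The effect of this correction can be seen in a sketch of the pair-selection step alone (not a full tree builder). The function names and the small distance table are illustrative assumptions. The corrected distance used here is d(i,j) - (r_i + r_j) / (n - 2), where r_i is the sum of distances from leaf i to all the others, which is the usual neighbour joining criterion up to a constant factor.

```python
from itertools import combinations

def closest_pair(names, d):
    """Pair with the smallest raw distance."""
    return min(combinations(names, 2), key=lambda p: d[frozenset(p)])

def nj_pair(names, d):
    """Pair that neighbour joining would join first (needs at least 3 leaves)."""
    n = len(names)
    total = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}

    def corrected(pair):
        i, j = pair
        # Subtract the averaged distance of each leaf to the rest of the tree.
        return d[frozenset(pair)] - (total[i] + total[j]) / (n - 2)

    return min(combinations(names, 2), key=corrected)

names = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
d = {frozenset(pair): value for pair, value in raw.items()}

print(closest_pair(names, d))  # ('d', 'e')
print(nj_pair(names, d))       # ('a', 'b')
```

In this example d and e are the closest leaves, but a and b are both far from everything else, so after the correction they are the pair joined first.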
Selection and output options are the same as for the UPGMA tree.
Conservation
This option is based on the AMAS method of multiple sequence alignment analysis
(Livingstone C.D. and Barton G.J. (1993), Protein Sequence Alignments: A Strategy
for the Hierarchical Analysis of Residue Conservation. CABIOS Vol. 9 No. 6, 745-756).
Hierarchical analysis is based on each residue having certain physico-chemical
properties listed as follows:
In brief, go about it like this:
The alignment can first be divided into groups. This is best done by first
creating an average distance tree (Calculate->Average distance tree). Selecting
a position on the tree will cluster the sequences into groups depending on the
position selected. Each group is coloured a different colour which is used for
both the ids in the tree and alignment windows and the sequences themselves.
If a PCA window is visible a visual comparison can be made between the clustering
based on the tree and the PCA.
This link provides an example of the output after grouping for Pfam family rnaseH:
The grouping by tree may not be satisfactory, and the user may want to edit the groups (Edit->Groups...) to put any outliers together.
Before selecting the conservation option, change the colour scheme to something sensible (Taylor or hydrophobicity for example). When the conservation calculation is done, the existing colour scheme is modified so that the most conserved columns in each group have the most intense colours and the least conserved are the palest.
This link shows the results of first colouring the alignment by hydrophobicity
(Colour->by hydrophobicity) then performing conservation analysis (Calculate->Conservation).
Conserved hydrophobic columns are shown with predominantly red residues and
conserved hydrophilic columns with blue. The most conserved regions have the
brightest colours.
The same conservation analysis is shown here with Taylor colours instead of hydrophobicity
(Colour->Taylor).
The conservation analysis is done on each sequence group. This highlights differences and similarities in conserved residue properties between groups.