Calculation of trees from alignments

Trees are calculated on either the complete alignment or just the currently selected group of sequences, via the calculations dialog opened from the Calculate→Calculate Tree or PCA... menu entry. Once calculated, trees are displayed in a new tree viewing window. Each calculation combines one of the distance measures listed below with one of two tree construction algorithms:

Distance Measures

Trees are calculated on the basis of a measure of similarity between each pair of sequences in the alignment:

• PID
The percentage identity between the two sequences, computed over their aligned positions.
• PID = Number of identical aligned non-gap symbols * 100 / Smaller of the two sequences' non-gap position counts
This is essentially the 'number of identical bases (or residues) per 100 base pairs (or residues)'.
• BLOSUM62, PAM250, DNA
These options use one of the available substitution matrices to compute a sum of scores for the residue pairs at each aligned position.
• Sequence Feature Similarity
Trees are constructed from a distance matrix formed from Jaccard distances between sequence features observed at each column of the alignment.
• Similarity at column i = (Total number of features displayed - Sum of number of features in common at i)
Similarities are summed over all columns and divided by the number of columns.
Since the total number of feature types is constant over all columns of the alignment, we do not scale the matrix, so tree distances can be interpreted as the average number of features that differ over all sites in the aligned region.
Distances are computed from the currently displayed feature types. Sequences with similar distributions of features of the same type will be grouped together in trees computed with this metric. This measure was introduced in Jalview 2.9.
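
As an illustration of the PID measure described above, the following is a minimal Python sketch of the formula, not Jalview's own implementation. The function name percent_identity, the set of gap characters, and the case-insensitive comparison are assumptions made for the example.

```python
# Characters treated as gaps (an assumption for this sketch).
GAP_CHARS = set("-. ")


def percent_identity(seq_a, seq_b):
    """Return the PID of two aligned (equal-length) sequences.

    PID = identical aligned non-gap symbols * 100
          / smaller of the two sequences' non-gap position counts
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    # Count columns where both symbols are non-gap and identical.
    identical = sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a not in GAP_CHARS and b not in GAP_CHARS and a.upper() == b.upper()
    )

    # Number of non-gap positions in each sequence.
    non_gap_a = sum(1 for a in seq_a if a not in GAP_CHARS)
    non_gap_b = sum(1 for b in seq_b if b not in GAP_CHARS)
    shortest = min(non_gap_a, non_gap_b)

    if shortest == 0:
        return 0.0  # no comparable positions
    return identical * 100 / shortest


pid = percent_identity("ACGT-A", "ACCTTA")
```

In this example four columns are identical and the shorter sequence has five non-gap positions, so the PID is 80.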

Tree Construction Methods

Jalview currently supports two kinds of agglomerative clustering methods. These are not intended to substitute for rigorous phylogenetic tree construction, and may fail on very large alignments.

• UPGMA tree
UPGMA stands for Unweighted Pair-Group Method using Arithmetic averages. Clusters are iteratively formed and extended by finding a non-member sequence with the lowest average dissimilarity over the cluster members.

• Neighbour Joining tree
First described in 1987 by Saitou and Nei, this method applies a greedy algorithm to find a tree with the smallest total branch length.
This method, as implemented in Jalview, is considerably more expensive than UPGMA.
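
The difference between the two methods can be sketched in Python. The code below is the textbook form of each algorithm under simple assumptions, not Jalview's implementation: upgma builds a whole tree from a symmetric distance matrix using size-weighted average linkage, while nj_pick_pair shows only the pair-selection step of neighbour joining, using the standard Q criterion. The function names and the Newick-style output are choices made for this example.

```python
from itertools import combinations


def upgma(names, matrix):
    """Cluster by average linkage; return a Newick-style string."""
    # cluster id -> (subtree text, number of leaves, node height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    dist = {
        frozenset((i, j)): matrix[i][j]
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)

    while len(clusters) > 1:
        # Join the two clusters with the lowest average dissimilarity.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: dist[frozenset(pair)],
        )
        d_ab = dist[frozenset((a, b))]
        tree_a, size_a, height_a = clusters.pop(a)
        tree_b, size_b, height_b = clusters.pop(b)
        height = d_ab / 2

        # Distance from the merged cluster to each remaining cluster is
        # the size-weighted average of the two old distances.
        for c in clusters:
            d_ac = dist[frozenset((a, c))]
            d_bc = dist[frozenset((b, c))]
            dist[frozenset((next_id, c))] = (
                size_a * d_ac + size_b * d_bc
            ) / (size_a + size_b)

        subtree = (
            f"({tree_a}:{height - height_a:g},{tree_b}:{height - height_b:g})"
        )
        clusters[next_id] = (subtree, size_a + size_b, height)
        next_id += 1

    (tree, _, _), = clusters.values()
    return tree + ";"


def nj_pick_pair(matrix):
    """Return the index pair (i, j) that neighbour joining merges first.

    Minimises Q(i, j) = (n - 2) * d(i, j) - r_i - r_j, where r_i is the
    sum of row i of the distance matrix.
    """
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    return min(
        combinations(range(n), 2),
        key=lambda p: (n - 2) * matrix[p[0]][p[1]]
        - row_sums[p[0]]
        - row_sums[p[1]],
    )
```

UPGMA always joins the closest pair, whereas neighbour joining corrects each distance by how far the two candidates are from everything else, so the two methods can choose different pairs on the same matrix. Recomputing the Q values after every join is part of why neighbour joining costs more.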

A newly calculated tree will be displayed in a new tree viewing window. In addition, a new entry with the same name as the tree viewer window will be added to the Sort menu so that the alignment can be reordered to reflect the ordering of the leaves of the tree. If the tree was calculated on a selected region of the alignment, then the title of the tree view will reflect this.

External Sources for Phylogenetic Trees

A number of programs exist for the reliable construction of phylogenetic trees. These can cope with large numbers of sequences, use better distance methods and can perform bootstrapping. Jalview can read Newick format tree files using the 'Load Associated Tree' entry of the alignment's File menu. Sequences in the alignment will be automatically associated with nodes in the tree by matching Sequence IDs to the tree's leaf names.
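
The idea of matching Sequence IDs to leaf names can be sketched in Python. This is an illustration only and does not reproduce Jalview's matching rules. It assumes unquoted leaf names and exact, case-sensitive matches, and the helper names newick_leaf_names and match_ids_to_leaves are hypothetical.

```python
import re

# A leaf name directly follows "(" or ","; labels that follow ")" belong
# to internal nodes and are skipped. Quoted names are not handled.
LEAF_PATTERN = re.compile(r"[(,]\s*([^():,;\s]+)")


def newick_leaf_names(newick):
    """Return the leaf names of a Newick string, in file order."""
    return LEAF_PATTERN.findall(newick)


def match_ids_to_leaves(sequence_ids, newick):
    """Split sequence IDs into those with and without a matching leaf."""
    leaves = set(newick_leaf_names(newick))
    matched = [seq_id for seq_id in sequence_ids if seq_id in leaves]
    unmatched = [seq_id for seq_id in sequence_ids if seq_id not in leaves]
    return matched, unmatched


tree = "((seqA:0.1,seqB:0.2):0.05,seqC:0.3);"
matched, unmatched = match_ids_to_leaves(["seqA", "seqC", "seqD"], tree)
```

Here seqA and seqC are matched and seqD is left unassociated. Checking for unmatched IDs in this way before loading a tree is a quick test of whether the names in the tree file agree with those in the alignment.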