website/prog_docs/clustalw.txt

   1 \r
   2 \r
   3 \r
   4  CLUSTAL 2.0.12 Multiple Sequence Alignments\r
   5 \r
   6 \r
   7 \r
   8 \r
   9 >> HELP NEW <<             NEW FEATURES/OPTIONS\r
  10 \r
  11 ==UPGMA== \r
  12  The UPGMA algorithm has been added to allow faster tree construction. The user now\r
  13  has the choice of using Neighbour Joining or UPGMA. The default is still NJ, but the\r
  14  user can change this by setting the clustering parameter.\r
  15  \r
  16  -CLUSTERING=   :NJ or UPGMA\r
  17  \r
  18 ==ITERATION==\r
  19 \r
   20  A remove-first iteration scheme has been added. This can be used to improve the final\r
   21  alignment or improve the alignment at each stage of the progressive alignment. During the \r
   22  iteration step each sequence is removed in turn and realigned. If the resulting alignment \r
   23  is better than the previous alignment it is kept. This process is repeated until the score\r
   24  converges (the score is not improved) or until the maximum number of iterations is \r
   25  reached. The user can iterate at each step of the progressive alignment by setting the \r
   26  iteration parameter to TREE or just on the final alignment by setting the iteration \r
   27  parameter to ALIGNMENT. The default is no iteration. The maximum number of iterations can \r
   28  be set using the numiter parameter. The default number of iterations is 3.\r
  29   \r
  30  -ITERATION=    :NONE or TREE or ALIGNMENT\r
  31  \r
  32  -NUMITER=n     :Maximum number of iterations to perform\r
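The clustering and iteration options above can be combined on one command line. The following is a minimal sketch of how such a command line might be assembled from a script. The helper name build_clustalw_args and the executable name "clustalw2" are assumptions made for illustration; the -INFILE and -ALIGN options are taken from the command line help further down.

```python
def build_clustalw_args(infile, clustering="NJ", iteration="NONE",
                        numiter=3, program="clustalw2"):
    """Return an argument list using -CLUSTERING, -ITERATION and -NUMITER.

    `program` is a hypothetical executable name; adjust it to your install.
    """
    if clustering not in ("NJ", "UPGMA"):
        raise ValueError("clustering must be NJ or UPGMA")
    if iteration not in ("NONE", "TREE", "ALIGNMENT"):
        raise ValueError("iteration must be NONE, TREE or ALIGNMENT")
    if numiter < 1:
        raise ValueError("numiter must be at least 1")

    args = [program, f"-INFILE={infile}", "-ALIGN",
            f"-CLUSTERING={clustering}", f"-ITERATION={iteration}"]
    # -NUMITER only matters when some form of iteration is switched on.
    if iteration != "NONE":
        args.append(f"-NUMITER={numiter}")
    return args


if __name__ == "__main__":
    print(" ".join(build_clustalw_args("seqs.fasta", "UPGMA", "TREE", 5)))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with file names.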
  33  \r
  34 ==HELP==\r
  35  \r
  36  -FULLHELP      :Print out the complete help content\r
  37  \r
  38 ==MISC==\r
  39 \r
  40  -MAXSEQLEN=n   :Maximum allowed sequence length\r
  41  \r
  42  -QUIET         :Reduce console output to minimum\r
  43  \r
   44  -STATS=file    :Log some alignment statistics to file\r
  45 \r
  46 \r
  47 >> HELP 1 <<             General help for CLUSTAL W (2.0.12)\r
  48 \r
  49 Clustal W is a general purpose multiple alignment program for DNA or proteins.\r
  50 \r
  51 SEQUENCE INPUT:  all sequences must be in 1 file, one after another.  \r
  52 7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT, \r
  53 Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file.\r
  54 All non-alphabetic characters (spaces, digits, punctuation marks) are ignored\r
  55 except "-" which is used to indicate a GAP ("." in MSF-RSF).  \r
  56 \r
  57 To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to \r
  58 INPUT them; go to menu item 2 to do the multiple alignment.\r
  59 \r
  60 PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments.  Use this to\r
  61 add a new sequence to an old alignment, or to use secondary structure to guide \r
  62 the alignment process.  GAPS in the old alignments are indicated using the "-" \r
  63 character.   PROFILES can be input in ANY of the allowed formats; just \r
  64 use "-" (or "." for MSF-RSF) for each gap position.\r
  65 \r
  66 PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in\r
  67 with "-" characters to indicate gaps) OR after a multiple alignment while the \r
  68 alignment is still in memory.\r
  69 \r
  70 \r
  71 The program tries to automatically recognise the different file formats used\r
  72 and to guess whether the sequences are amino acid or nucleotide.  This is not\r
  73 always foolproof.\r
  74 \r
  75 FASTA and NBRF-PIR formats are recognised by having a ">" as the first \r
  76 character in the file.  \r
  77 \r
  78 EMBL-Swiss Prot formats are recognised by the letters\r
  79 ID at the start of the file (the token for the entry name field).  \r
  80 \r
  81 CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.\r
  82 \r
  83 GCG-MSF format is recognised by one of the following:\r
  84        - the word PileUp at the start of the file. \r
  85        - the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT\r
  86          at the start of the file.\r
   87        - the word MSF on the first line of the file, and the characters ..\r
  88          at the end of this line.\r
  89 \r
  90 GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of\r
  91 the file.\r
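The recognition rules above can be sketched as a small sniffing function. This is an illustration only, not the program's own code: the name guess_format, the skipping of leading whitespace, and the order in which the rules are tried are all assumptions. GDE flat files are not covered because no rule for them is given above.

```python
def guess_format(text):
    """Guess the input format from the start of the file contents."""
    stripped = text.lstrip()          # assumption: leading whitespace is ignored
    lines = stripped.splitlines()
    first_line = lines[0].rstrip() if lines else ""

    if stripped.startswith("!!RICH_SEQUENCE"):
        return "GCG-RSF"
    if (stripped.startswith("PileUp")
            or stripped.startswith("!!AA_MULTIPLE_ALIGNMENT")
            or stripped.startswith("!!NA_MULTIPLE_ALIGNMENT")
            or ("MSF" in first_line and first_line.endswith(".."))):
        return "GCG-MSF"
    if stripped.startswith("CLUSTAL"):
        return "CLUSTAL"
    if stripped.startswith("ID"):
        return "EMBL-SWISSPROT"
    if stripped.startswith(">"):
        # Both formats start with ">"; telling them apart needs more context.
        return "FASTA or NBRF-PIR"
    return "unknown"
```

Because FASTA and NBRF-PIR share the same first character, the sketch reports them together rather than guessing between them.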
  92 \r
  93 \r
  94 If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the\r
  95 sequence will be assumed to be nucleotide.  This works in 97.3% of cases\r
  96 but watch out!\r
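The 85% rule above can be written out as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, and non-alphabetic characters (including gaps) are left out of the count, in line with the input rules given earlier.

```python
NUCLEOTIDE_CODES = set("ACGTUN")


def looks_like_nucleotide(sequence, threshold=0.85):
    """Return True if at least `threshold` of the letters are A, C, G, T, U or N."""
    letters = [ch.upper() for ch in sequence if ch.isalpha()]
    if not letters:
        return False          # nothing to judge; assumption: treat as not nucleotide
    hits = sum(1 for ch in letters if ch in NUCLEOTIDE_CODES)
    return hits / len(letters) >= threshold
```

A short peptide made mostly of A, C, G and T residues would be misclassified by this rule, which is why the sequence type can also be set explicitly with -TYPE=.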
  97 \r
  98 \r
  99 >> HELP 2 <<             Help for multiple alignments\r
 100 \r
 101 If you have already loaded sequences, use menu item 1 to do the complete\r
 102 multiple alignment.  You will be prompted for 2 output files: 1 for the \r
 103 alignment itself; another to store a dendrogram that describes the similarity\r
 104 of the sequences to each other.\r
 105 \r
 106 Multiple alignments are carried out in 3 stages (automatically done from menu\r
 107 item 1 ...Do complete multiple alignments now):\r
 108 \r
 109 1) all sequences are compared to each other (pairwise alignments);\r
 110 \r
 111 2) a dendrogram (like a phylogenetic tree) is constructed, describing the\r
 112 approximate groupings of the sequences by similarity (stored in a file).\r
 113 \r
 114 3) the final multiple alignment is carried out, using the dendrogram as a guide.\r
 115 \r
 116 \r
 117 PAIRWISE ALIGNMENT parameters control the speed-sensitivity of the initial\r
 118 alignments.\r
 119 \r
 120 MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.\r
 121 \r
 122 \r
 123 RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences\r
 124 during multiple alignment if you wish to change the parameters and try again.\r
 125 This only takes effect just before you do a second multiple alignment.  You\r
 126 can make phylogenetic trees after alignment whether or not this is ON.\r
 127 If you turn this OFF, the new gaps are kept even if you do a second multiple\r
 128 alignment. This allows you to iterate the alignment gradually.  Sometimes, the \r
 129 alignment is improved by a second or third pass.\r
 130 \r
 131 SCREEN DISPLAY (menu item 8) can be used to send the output alignments to the \r
 132 screen as well as to the output file.\r
 133 \r
 134 You can skip the first stages (pairwise alignments; dendrogram) by using an\r
 135 old dendrogram file (menu item 3); or you can just produce the dendrogram\r
 136 with no final multiple alignment (menu item 2).\r
 137 \r
 138 \r
  139 OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 7 \r
  140 different alignment formats (CLUSTAL, GCG, NBRF-PIR, PHYLIP, GDE, NEXUS, and FASTA).  \r
 141 \r
 142 \r
 143 \r
 144 >> HELP 3 <<             Help for pairwise alignment parameters\r
 145 \r
 146 A distance is calculated between every pair of sequences and these are used to\r
 147 construct the dendrogram which guides the final multiple alignment. The scores\r
 148 are calculated from separate pairwise alignments. These can be calculated using\r
 149 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur\r
 150 and Lipman (extremely fast but approximate). \r
 151 \r
 152 You can choose between the 2 alignment methods using menu option 8.  The\r
 153 slow-accurate method is fine for short sequences but will be VERY SLOW for \r
 154 many (e.g. >100) long (e.g. >1000 residue) sequences.   \r
 155 \r
 156 SLOW-ACCURATE alignment parameters:\r
  157         These parameters do not have any effect on the speed of the alignments. \r
 158 They are used to give initial alignments which are then rescored to give percent\r
 159 identity scores.  These % scores are the ones which are displayed on the \r
 160 screen.  The scores are converted to distances for the trees.\r
 161 \r
 162 1) Gap Open Penalty:      the penalty for opening a gap in the alignment.\r
 163 2) Gap extension penalty: the penalty for extending a gap by 1 residue.\r
 164 3) Protein weight matrix: the scoring table which describes the similarity\r
 165                           of each amino acid to each other.\r
 166 4) DNA weight matrix:     the scores assigned to matches and mismatches \r
 167                           (including IUB ambiguity codes).\r
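The rescoring step described above (pairwise alignment, then percent identity, then distance) can be sketched as follows. The details are assumptions made for illustration: positions where either sequence has a gap are skipped, and the distance is taken as 1 minus the identity fraction. The function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the ungapped positions of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue          # assumption: gapped positions are not scored
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_to_distance(pid):
    """Convert a percent identity score to a distance between 0 and 1."""
    return 1.0 - pid / 100.0
```

With this convention, identical sequences have distance 0 and sequences with no identical positions have distance 1.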
 168 \r
 169 \r
 170 FAST-APPROXIMATE alignment parameters:\r
 171 \r
 172 These similarity scores are calculated from fast, approximate, global align-\r
 173 ments, which are controlled by 4 parameters.   2 techniques are used to make\r
 174 these alignments very fast: 1) only exactly matching fragments (k-tuples) are\r
 175 considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)\r
 176 are used.\r
 177 \r
 178 K-TUPLE SIZE:  This is the size of exactly matching fragment that is used. \r
 179 INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.\r
 180 For longer sequences (e.g. >1000 residues) you may need to increase the default.\r
 181 \r
 182 GAP PENALTY:   This is a penalty for each gap in the fast alignments.  It has\r
  183 little effect on the speed or sensitivity except for extreme values.\r
 184 \r
 185 TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary\r
 186 dot-matrix plot) is calculated.  Only the best ones (with most matches) are\r
 187 used in the alignment.  This parameter specifies how many.  Decrease for speed;\r
 188 increase for sensitivity.\r
 189 \r
 190 WINDOW SIZE:  This is the number of diagonals around each of the 'best' \r
 191 diagonals that will be used.  Decrease for speed; increase for sensitivity.\r
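The roles of K-TUPLE SIZE, TOP DIAGONALS and WINDOW SIZE can be illustrated with a small sketch. It is not the Wilbur and Lipman implementation used by the program; it only shows how exact k-tuple matches fall on diagonals, how the best diagonals are picked, and how a window widens them. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def ktuple_diagonal_counts(seq_a, seq_b, k=2):
    """Count exact k-tuple matches on each diagonal (i - j) of the dot matrix."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        positions[seq_b[j:j + k]].append(j)
    counts = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in positions.get(seq_a[i:i + k], ()):
            counts[i - j] += 1
    return counts


def top_diagonals(counts, n):
    """Return the n diagonals with the most k-tuple matches."""
    return [diag for diag, _ in counts.most_common(n)]


def diagonals_in_window(best, window):
    """Expand each best diagonal by `window` diagonals on either side."""
    selected = set()
    for diag in best:
        selected.update(range(diag - window, diag + window + 1))
    return selected
```

A larger k means fewer tuples match by chance, so fewer diagonals need to be examined; a larger window or more top diagonals means more of the dot matrix is searched.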
 192 \r
 193 \r
 194 >> HELP 4 <<             Help for multiple alignment parameters\r
 195 \r
 196 These parameters control the final multiple alignment. This is the core of the\r
 197 program and the details are complicated. To fully understand the use of the\r
 198 parameters and the scoring system, you will have to refer to the documentation.\r
 199 \r
 200 Each step in the final multiple alignment consists of aligning two alignments \r
 201 or sequences.  This is done progressively, following the branching order in \r
 202 the GUIDE TREE.  The basic parameters to control this are two gap penalties and\r
  203 the scores for various identical-non-identical residues.  \r
 204 \r
 205 1) and 2) The GAP PENALTIES are set by menu items 1 and 2. These control the \r
 206 cost of opening up every new gap and the cost of every item in a gap. \r
 207 Increasing the gap opening penalty will make gaps less frequent. Increasing \r
 208 the gap extension penalty will make gaps shorter. Terminal gaps are not \r
 209 penalised.\r
 210 \r
 211 3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most\r
 212 distantly related sequences until after the most closely related sequences have \r
 213 been aligned.   The setting shows the percent identity level required to delay\r
 214 the addition of a sequence; sequences that are less identical than this level\r
 215 to any other sequences will be aligned later.\r
 216 \r
 217 \r
 218 \r
 219 4) The TRANSITION WEIGHT gives transitions (A <--> G or C <--> T \r
 220 i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0\r
 221 and 1; a weight of zero means that the transitions are scored as mismatches,\r
 222 while a weight of 1 gives the transitions the match score. For distantly related\r
 223 DNA sequences, the weight should be near to zero; for closely related sequences\r
 224 it can be useful to assign a higher score.\r
 225 \r
 226 \r
 227 5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a choice of\r
 228 weight matrices. The default for proteins in version 1.8 is the PAM series \r
 229 derived by Gonnet and colleagues. Note, a series is used! The actual matrix\r
 230 that is used depends on how similar the sequences to be aligned at this \r
 231 alignment step are. Different matrices work differently at each evolutionary\r
 232 distance. \r
 233 \r
 234 6) DNA WEIGHT MATRIX leads to a new menu where a single matrix (not a series)\r
 235 can be selected. The default is the matrix used by BESTFIT for comparison of\r
 236 nucleic acid sequences.\r
 237 \r
 238 Further help is offered in the weight matrix menu.\r
 239 \r
 240 \r
 241 7)  In the weight matrices, you can use negative as well as positive values if\r
 242 you wish, although the matrix will be automatically adjusted to all positive\r
 243 scores, unless the NEGATIVE MATRIX option is selected.\r
 244 \r
 245 8) PROTEIN GAP PARAMETERS displays a menu allowing you to set some Gap Penalty\r
 246 options which are only used in protein alignments.\r
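The effect of the TRANSITION WEIGHT (item 4 above) can be sketched as a scoring function. The source only fixes the two end points (weight 0 scores a transition as a mismatch, weight 1 scores it as a match); the linear scaling in between, the default scores, and the function names are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for A <--> G or C <--> T substitutions."""
    a, b = a.upper(), b.upper()
    if a == b:
        return False
    return {a, b} <= PURINES or {a, b} <= PYRIMIDINES


def pair_score(a, b, transition_weight, match_score=1.0, mismatch_score=0.0):
    """Score one aligned nucleotide pair under a transition weight in [0, 1]."""
    if a.upper() == b.upper():
        return match_score
    if is_transition(a, b):
        # Assumption: scale linearly between the mismatch and match scores.
        return mismatch_score + transition_weight * (match_score - mismatch_score)
    return mismatch_score
```

Transversions (purine to pyrimidine or the reverse) always receive the mismatch score in this sketch, whatever the weight.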
 247 \r
 248 \r
 249 >> HELP A <<             Help for protein gap parameters.\r
 250 \r
 251 1) RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce\r
 252 or increase the gap opening penalties at each position in the alignment or\r
 253 sequence.  See the documentation for details.  As an example, positions that \r
 254 are rich in glycine are more likely to have an adjacent gap than positions that\r
 255 are rich in valine.\r
 256 \r
 257 2) 3) HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within\r
 258 a run (5 or more residues) of hydrophilic amino acids; these are likely to\r
 259 be loop or random coil regions where gaps are more common.  The residues that \r
 260 are "considered" to be hydrophilic are set by menu item 3.\r
 261 \r
 262 4) GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too\r
 263 close to each other. Gaps that are less than this distance apart are penalised\r
 264 more than other gaps. This does not prevent close gaps; it makes them less\r
 265 frequent, promoting a block-like appearance of the alignment.\r
 266 \r
 267 5) END GAP SEPARATION treats end gaps just like internal gaps for the purposes\r
 268 of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above).\r
 269 If you turn this off, end gaps will be ignored for this purpose.  This is\r
 270 useful when you wish to align fragments where the end gaps are not biologically\r
 271 meaningful.\r
 272 \r
 273 \r
 274 >> HELP 5 <<             Help for output format options.\r
 275 \r
 276 Six output formats are offered. You can choose any (or all 6 if you wish).  \r
 277 \r
  278 CLUSTAL format output is a self-explanatory alignment format.  It shows the\r
 279 sequences aligned in blocks.  It can be read in again at a later date to\r
 280 (for example) calculate a phylogenetic tree or add a new sequence with a \r
 281 profile alignment.\r
 282 \r
 283 GCG output can be used by any of the GCG programs that can work on multiple\r
 284 alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN).  It is the same as the GCG\r
 285 .msf format files (multiple sequence file); new in version 7 of GCG.\r
 286 \r
 287 PHYLIP format output can be used for input to the PHYLIP package of Joe \r
 288 Felsenstein.  This is an extremely widely used package for doing every \r
  289 imaginable form of phylogenetic analysis (MUCH more than the modest\r
  290 introduction offered by this program).\r
 291 \r
 292 NBRF-PIR:  this is the same as the standard PIR format with ONE ADDITION.  Gap\r
 293 characters "-" are used to indicate the positions of gaps in the multiple \r
 294 alignment.  These files can be re-used as input in any part of clustal that\r
 295 allows sequences (or alignments or profiles) to be read in.  \r
 296 \r
 297 GDE:  this is the flat file format used by the GDE package of Steven Smith.\r
 298 \r
 299 NEXUS: the format used by several phylogeny programs, including PAUP and\r
 300 MacClade.\r
 301 \r
 302 GDE OUTPUT CASE: sequences in GDE format may be written in either upper or\r
 303 lower case.\r
 304 \r
 305 CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the\r
 306 alignment lines in clustalw format.\r
 307 \r
 308 OUTPUT ORDER is used to control the order of the sequences in the output\r
 309 alignments.  By default, the order corresponds to the order in which the\r
 310 sequences were aligned (from the guide tree-dendrogram), thus automatically\r
 311 grouping closely related sequences. This switch can be used to set the order\r
 312 to the same as the input file.\r
 313 \r
 314 PARAMETER OUTPUT: This option allows you to save all your parameter settings\r
 315 in a parameter file. This file can be used subsequently to rerun Clustal W\r
 316 using the same parameters.\r
 317 \r
 318 \r
 319 >> HELP 6 <<             Help for profile and structure alignments\r
 320 \r
 321 By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile \r
 322 alignments allow you to store alignments of your favourite sequences and add\r
 323 new sequences to them in small bunches at a time. A profile is simply an\r
 324 alignment of one or more sequences (e.g. an alignment output file from CLUSTAL\r
 325 W). Each input can be a single sequence. One or both sets of input sequences\r
 326 may include secondary structure assignments or gap penalty masks to guide the\r
 327 alignment. \r
 328 \r
 329 The profiles can be in any of the allowed input formats with "-" characters\r
 330 used to specify gaps (except for MSF-RSF where "." is used).\r
 331 \r
 332 You have to specify the 2 profiles by choosing menu items 1 and 2 and giving\r
 333 2 file names.  Then Menu item 3 will align the 2 profiles to each other. \r
 334 Secondary structure masks in either profile can be used to guide the alignment.\r
 335 \r
 336 Menu item 4 will take the sequences in the second profile and align them to\r
 337 the first profile, 1 at a time.  This is useful to add some new sequences to\r
 338 an existing alignment, or to align a set of sequences to a known structure.  \r
 339 In this case, the second profile would not be pre-aligned.\r
 340 \r
 341 \r
 342 The alignment parameters can be set using menu items 5, 6 and 7. These are\r
 343 EXACTLY the same parameters as used by the general, automatic multiple\r
 344 alignment procedure. The general multiple alignment procedure is simply a\r
 345 series of profile alignments. Carrying out a series of profile alignments on\r
 346 larger and larger groups of sequences, allows you to manually build up a\r
 347 complete alignment, if necessary editing intermediate alignments.\r
 348 \r
 349 SECONDARY STRUCTURE OPTIONS. Menu Option 0 allows you to set 2D structure\r
 350 parameters. If a solved structure is available, it can be used to guide the \r
 351 alignment by raising gap penalties within secondary structure elements, so \r
 352 that gaps will preferentially be inserted into unstructured surface loops.\r
 353 Alternatively, a user-specified gap penalty mask can be supplied directly.\r
 354 \r
 355 A gap penalty mask is a series of numbers between 1 and 9, one per position in \r
 356 the alignment. Each number specifies how much the gap opening penalty is to be \r
 357 raised at that position (raised by multiplying the basic gap opening penalty\r
 358 by the number) i.e. a mask figure of 1 at a position means no change\r
 359 in gap opening penalty; a figure of 4 means that the gap opening penalty is\r
 360 four times greater at that position, making gaps 4 times harder to open.\r
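The multiplication described above is simple enough to show directly. In this minimal sketch the function name and the example penalty of 10.0 are hypothetical; only the rule itself (basic gap opening penalty times the mask digit at each position) comes from the text.

```python
def apply_gap_mask(gap_open, mask):
    """Return the per-position gap opening penalty for a mask of digits 1-9."""
    if any(ch not in "123456789" for ch in mask):
        raise ValueError("mask must contain only the digits 1 to 9")
    return [gap_open * int(ch) for ch in mask]


if __name__ == "__main__":
    # A mask figure of 1 leaves the penalty unchanged; 4 makes it four times larger.
    print(apply_gap_mask(10.0, "1124"))   # [10.0, 10.0, 20.0, 40.0]
```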
 361 \r
 362 The format for gap penalty masks and secondary structure masks is explained\r
 363 in the help under option 0 (secondary structure options).\r
 364 \r
 365 \r
 366 >> HELP B <<             Help for secondary structure - gap penalty masks\r
 367 \r
 368 The use of secondary structure-based penalties has been shown to improve the\r
 369 accuracy of multiple alignment. Therefore CLUSTAL W now allows gap penalty \r
 370 masks to be supplied with the input sequences. The masks work by raising gap \r
 371 penalties in specified regions (typically secondary structure elements) so that\r
 372 gaps are preferentially opened in the less well conserved regions (typically \r
 373 surface loops).\r
 374 \r
 375 Options 1 and 2 control whether the input secondary structure information or\r
 376 gap penalty masks will be used.\r
 377 \r
 378 Option 3 controls whether the secondary structure and gap penalty masks should\r
 379 be included in the output alignment.\r
 380 \r
 381 Options 4 and 5 provide the value for raising the gap penalty at core Alpha \r
 382 Helical (A) and Beta Strand (B) residues. In CLUSTAL format, capital residues \r
 383 denote the A and B core structure notation. The basic gap penalties are\r
 384 multiplied by the amount specified.\r
 385 \r
 386 Option 6 provides the value for the gap penalty in Loops. By default this \r
 387 penalty is not raised. In CLUSTAL format, loops are specified by "." in the \r
 388 secondary structure notation.\r
 389 \r
 390 Option 7 provides the value for setting the gap penalty at the ends of \r
 391 secondary structures. Ends of secondary structures are observed to grow \r
 392 and-or shrink in related structures. Therefore by default these are given \r
 393 intermediate values, lower than the core penalties. All secondary structure \r
 394 read in as lower case in CLUSTAL format gets the reduced terminal penalty.\r
 395 \r
 396 Options 8 and 9 specify the range of structure termini for the intermediate \r
 397 penalties. In the alignment output, these are indicated as lower case. \r
 398 For Alpha Helices, by default, the range spans the end helical turn. For \r
 399 Beta Strands, the default range spans the end residue and the adjacent loop \r
 400 residue, since sequence conservation often extends beyond the actual H-bonded\r
 401 Beta Strand.\r
 402 \r
 403 CLUSTAL W can read the masks from SWISS-PROT, CLUSTAL or GDE format input\r
 404 files. For many 3-D protein structures, secondary structure information is\r
 405 recorded in the feature tables of SWISS-PROT database entries. You should\r
 406 always check that the assignments are correct - some are quite inaccurate.\r
 407 CLUSTAL W looks for SWISS-PROT HELIX and STRAND assignments e.g.\r
 408 \r
 409 FT   HELIX       100    115\r
 410 FT   STRAND      118    119\r
 411 \r
 412 The structure and penalty masks can also be read from CLUSTAL alignment format \r
 413 as comment lines beginning "!SS_" or "!GM_" e.g.\r
 414 \r
 415 !SS_HBA_HUMA    ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA\r
 416 !GM_HBA_HUMA    112224444444444222122244444444442222224222111111111222444444\r
 417 HBA_HUMA        VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK\r
 418 \r
 419 Note that the mask itself is a set of numbers between 1 and 9 each of which is \r
 420 assigned to the residue(s) in the same column below. \r
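Reading the "!SS_" and "!GM_" comment lines out of a CLUSTAL format file can be sketched as below. This is an illustration, not the program's own reader: the function name is hypothetical, and it is assumed that each mask line consists of a name and a mask string separated by whitespace, with masks from successive alignment blocks joined in order.

```python
def read_masks(lines):
    """Collect secondary structure (!SS_) and gap penalty (!GM_) masks by name."""
    structure, penalty = {}, {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue                      # not a "name  mask" line
        name, value = fields
        if name.startswith("!SS_"):
            key = name[len("!SS_"):]
            structure[key] = structure.get(key, "") + value
        elif name.startswith("!GM_"):
            key = name[len("!GM_"):]
            penalty[key] = penalty.get(key, "") + value
    return structure, penalty


if __name__ == "__main__":
    demo = [                              # DEMO is a made-up sequence name
        "!SS_DEMO    ..aaAA",
        "!GM_DEMO    112244",
        "DEMO        VLSPAD",
    ]
    print(read_masks(demo))
```

Each mask character lines up with the residue in the same column of the sequence line, so the mask and the sequence should end up the same length.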
 421 \r
 422 In GDE flat file format, the masks are specified as text and the names must\r
 423 begin with "SS_ or "GM_.\r
 424 \r
 425 Either a structure or penalty mask or both may be used. If both are included in\r
 426 an alignment, the user will be asked which is to be used.\r
 427 \r
 428 \r
 429 >> HELP C <<             Help for secondary structure - gap penalty mask output options\r
 430 \r
 431 The options in this menu let you choose whether or not to include the masks\r
 432 in the CLUSTAL W output alignments. Showing both is useful for understanding\r
 433 how the masks work. The secondary structure information is itself very useful\r
 434 in judging the alignment quality and in seeing how residue conservation\r
 435 patterns vary with secondary structure.\r
 436 \r
 437 \r
 438 >> HELP 7 <<             Help for phylogenetic trees\r
 439 \r
  440 1) Before calculating a tree, you must have an ALIGNMENT in memory. This can be\r
  441 input in any format, or it can be the result of a full multiple alignment that\r
  442 you have just carried out and that is still in memory. \r
 443 \r
 444 \r
 445 *************** Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!! ***************\r
 446 \r
 447 \r
 448 The methods used are NJ (Neighbour Joining) and UPGMA. First\r
  449 you calculate distances (percent divergence) between all pairs of sequences from\r
 450 a multiple alignment; second you apply the NJ or UPGMA method to the distance matrix.\r
 451 \r
 452 2) EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where\r
 453 ANY of the sequences have a gap will be ignored. This means that 'like' will be\r
 454 compared to 'like' in all distances, which is highly desirable. It also\r
 455 automatically throws away the most ambiguous parts of the alignment, which are\r
 456 concentrated around gaps (usually). The disadvantage is that you may throw away\r
 457 much of the data if there are many gaps (which is why it is difficult for us to\r
 458 make it the default).  \r
 459 \r
 460 \r
 461 \r
 462 3) CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this\r
 463 option makes no difference. For greater divergence, it corrects for the fact\r
 464 that observed distances underestimate actual evolutionary distances. This is\r
 465 because, as sequences diverge, more than one substitution will happen at many\r
 466 sites. However, you only see one difference when you look at the present day\r
 467 sequences. Therefore, this option has the effect of stretching branch lengths\r
 468 in trees (especially long branches). The corrections used here (for DNA or\r
 469 proteins) are both due to Motoo Kimura. See the documentation for details.  \r
 470 \r
 471 Where possible, this option should be used. However, for VERY divergent\r
 472 sequences, the distances cannot be reliably corrected. You will be warned if\r
 473 this happens. Even if none of the distances in a data set exceed the reliable\r
 474 threshold, if you bootstrap the data, some of the bootstrap distances may\r
 475 randomly exceed the safe limit.  \r
 476 \r
 477 4) To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED\r
 478 tree and all branch lengths. The root of the tree can only be inferred by\r
 479 using an outgroup (a sequence that you are certain branches at the outside\r
 480 of the tree .... certain on biological grounds) OR if you assume a degree\r
 481 of constancy in the 'molecular clock', you can place the root in the 'middle'\r
 482 of the tree (roughly equidistant from all tips).\r
 483 \r
 484 5) TOGGLE PHYLIP BOOTSTRAP POSITIONS\r
 485 By default, the bootstrap values are correctly placed on the tree branches of\r
 486 the phylip format output tree. The toggle allows them to be placed on the\r
 487 nodes, which is incorrect, but some display packages (e.g. TreeTool, TreeView\r
 488 and Phylowin) only support node labelling but not branch labelling. Care\r
 489 should be taken to note which branches and labels go together.\r
 490 \r
 491 6) OUTPUT FORMATS: four different formats are allowed. None of these displays\r
 492 the tree visually. Useful display programs accepting PHYLIP format include\r
 493 NJplot (from Manolo Gouy and supplied with Clustal W), TreeView (Mac-PC), and\r
 494 PHYLIP itself - OR get the PHYLIP package and use the tree drawing facilities\r
 495 there. (Get the PHYLIP package anyway if you are interested in trees). The\r
 496 NEXUS format can be read into PAUP or MacClade.\r
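Items 2 and 3 above (excluding positions with gaps, and correcting for multiple substitutions) can be sketched together. The function names are hypothetical. For the correction, the sketch assumes that the protein case refers to Kimura's approximation D = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing positions; the program may handle very large p differently, and the DNA correction is not shown.

```python
import math


def drop_gapped_columns(alignment):
    """Remove every column in which ANY sequence has a gap."""
    kept = [col for col in zip(*alignment) if "-" not in col]
    if not kept:
        return ["" for _ in alignment]
    return ["".join(chars) for chars in zip(*kept)]


def observed_divergence(seq_a, seq_b):
    """Fraction of differing positions, ignoring positions with a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def kimura_protein_distance(p):
    """Corrected distance for proteins (assumed formula, see the note above)."""
    inside = 1.0 - p - 0.2 * p * p
    if inside <= 0.0:
        raise ValueError("sequences too divergent to correct reliably")
    return -math.log(inside)
```

For small p the corrected distance is almost equal to p, and it grows faster than p as divergence increases, which matches the stretching of long branches described in item 3.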
 497 \r
 498 \r
 499 >> HELP 8 <<             Help for choosing a weight matrix\r
 500 \r
 501 For protein alignments, you use a weight matrix to determine the similarity of\r
 502 non-identical amino acids.  For example, Tyr aligned with Phe is usually judged \r
 503 to be 'better' than Tyr aligned with Pro.\r
 504 \r
 505 There are three 'in-built' series of weight matrices offered. Each consists of\r
 506 several matrices which work differently at different evolutionary distances. To\r
 507 see the exact details, read the documentation. Crudely, we store several\r
 508 matrices in memory, spanning the full range of amino acid distance (from almost\r
 509 identical sequences to highly divergent ones). For very similar sequences, it\r
 510 is best to use a strict weight matrix which only gives a high score to\r
 511 identities and the most favoured conservative substitutions. For more divergent\r
 512 sequences, it is appropriate to use "softer" matrices which give a high score\r
 513 to many other frequent substitutions.\r
 514 \r
 515 1) BLOSUM (Henikoff). These matrices appear to be the best available for \r
  516 carrying out database similarity (homology) searches. The matrices used are:\r
 517 Blosum 80, 62, 45 and 30. (BLOSUM was the default in earlier Clustal W\r
 518 versions)\r
 519 \r
 520 2) PAM (Dayhoff). These have been extremely widely used since the late '70s.\r
 521 We use the PAM 20, 60, 120 and 350 matrices.\r
 522 \r
 523 3) GONNET. These matrices were derived using almost the same procedure as the\r
 524 Dayhoff one (above) but are much more up to date and are based on a far larger\r
 525 data set. They appear to be more sensitive than the Dayhoff series. We use the\r
 526 GONNET 80, 120, 160, 250 and 350 matrices. This series is the default for\r
 527 Clustal W version 1.8.\r
 528 \r
 529 We also supply an identity matrix which gives a score of 1.0 to two identical \r
 530 amino acids and a score of zero otherwise. This matrix is not very useful.\r
 531 Alternatively, you can read in your own (just one matrix, not a series).\r
 532 \r
 533 A new matrix can be read from a file on disk, if the filename consists only\r
 534 of lower case characters. The values in the new weight matrix must be integers\r
 535 and the scores should be similarities. You can use negative as well as positive\r
 536 values if you wish, although the matrix will be automatically adjusted to all\r
 537 positive scores.\r
 538 \r
 539 \r
 540 \r
 541 For DNA, a single matrix (not a series) is used. Two hard-coded matrices are \r
 542 available:\r
 543 \r
 544 \r
 545 1) IUB. This is the default scoring matrix used by BESTFIT for the comparison\r
 546 of nucleic acid sequences. X's and N's are treated as matches to any IUB\r
 547 ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.\r
 548  \r
 549  \r
 550 2) CLUSTALW(1.6). The previous system used by Clustal W, in which matches score\r
 551 1.0 and mismatches score 0. All matches for IUB symbols also score 0.\r
 552 \r
 553 INPUT FORMAT  The format used for a new matrix is the same as the BLAST program.\r
 554 Any lines beginning with a # character are assumed to be comments. The first\r
 555 non-comment line should contain a list of amino acids in any order, using the\r
 556 1 letter code, followed by a * character. This should be followed by a square\r
 557 matrix of integer scores, with one row and one column for each amino acid. The\r
 558 last row and column of the matrix (corresponding to the * character) contain\r
 559 the minimum score over the whole matrix.\r
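A reader for the matrix format described above might look like the following. This is a minimal sketch, not the program's own parser: the function name is hypothetical, and accepting an optional residue label at the start of each row (as BLAST matrix files usually have) is an assumption.

```python
def parse_blast_matrix(text):
    """Parse a BLAST-style matrix into (symbols, {(row, col): score})."""
    symbols = None
    scores = {}
    row_index = 0
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                          # blank line or comment
        fields = line.split()
        if symbols is None:
            symbols = fields                  # header: residues, then "*"
            continue
        if not fields[0].lstrip("-").isdigit():
            fields = fields[1:]               # assumption: drop a leading row label
        if row_index >= len(symbols) or len(fields) != len(symbols):
            raise ValueError("matrix is not square")
        row_symbol = symbols[row_index]
        for col_symbol, value in zip(symbols, fields):
            scores[(row_symbol, col_symbol)] = int(value)
        row_index += 1
    if symbols is None or row_index != len(symbols):
        raise ValueError("matrix is not square")
    return symbols, scores
```

The returned dictionary is keyed by residue pairs, so a lookup such as scores[("A", "R")] gives the similarity of alanine and arginine in the supplied matrix.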
 560 \r
 561 \r
 562 >> HELP 9 <<             Help for command line parameters\r
 563 \r
 564                 DATA (sequences)\r
 565 \r
 566 -INFILE=file.ext                             :input sequences.\r
 567 -PROFILE1=file.ext  and  -PROFILE2=file.ext  :profiles (old alignment).\r
 568 \r
 569 \r
 570                 VERBS (do things)\r
 571 \r
 572 -OPTIONS            :list the command line parameters\r
 573 -HELP  or -CHECK    :outline the command line params.\r
 574 -FULLHELP           :output full help content.\r
 575 -ALIGN              :do full multiple alignment.\r
 576 -TREE               :calculate NJ tree.\r
 577 -PIM                :output percent identity matrix (while calculating the tree)\r
 578 -BOOTSTRAP(=n)      :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).\r
 579 -CONVERT            :output the input sequences in a different file format.\r
 580 \r
 581 \r
 582                 PARAMETERS (set things)\r
 583 \r
 584 ***General settings:****\r
 585 -INTERACTIVE :read command line, then enter normal interactive menus\r
 586 -QUICKTREE   :use FAST algorithm for the alignment guide tree\r
 587 -TYPE=       :PROTEIN or DNA sequences\r
 588 -NEGATIVE    :protein alignment with negative values in matrix\r
 589 -OUTFILE=    :sequence alignment file name\r
 590 -OUTPUT=     :GCG, GDE, PHYLIP, PIR or NEXUS\r
 591 -OUTORDER=   :INPUT or ALIGNED\r
 592 -CASE=       :LOWER or UPPER (for GDE output only)\r
 593 -SEQNOS=     :OFF or ON (for Clustal output only)\r
 594 -SEQNO_RANGE=:OFF or ON (NEW: for all output formats)\r
 595 -RANGE=m,n   :sequence range to write starting m to m+n\r
 596 -MAXSEQLEN=n :maximum allowed input sequence length\r
 597 -QUIET       :Reduce console output to minimum\r
 598 -STATS=      :Log some alignment statistics to file\r
 599 \r
 600 ***Fast Pairwise Alignments:***\r
 601 -KTUPLE=n    :word size\r
 602 -TOPDIAGS=n  :number of best diags.\r
 603 -WINDOW=n    :window around best diags.\r
 604 -PAIRGAP=n   :gap penalty\r
 605 -SCORE=      :PERCENT or ABSOLUTE\r
 606 \r
 607 \r
 608 ***Slow Pairwise Alignments:***\r
 609 -PWMATRIX=    :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename\r
 610 -PWDNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename\r
 611 -PWGAPOPEN=f  :gap opening penalty        \r
 612 -PWGAPEXT=f   :gap extension penalty\r
 613 \r
 614 \r
 615 ***Multiple Alignments:***\r
 616 -NEWTREE=      :file for new guide tree\r
 617 -USETREE=      :file for old guide tree\r
 618 -MATRIX=       :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename\r
 619 -DNAMATRIX=    :DNA weight matrix=IUB, CLUSTALW or filename\r
 620 -GAPOPEN=f     :gap opening penalty        \r
 621 -GAPEXT=f      :gap extension penalty\r
 622 -ENDGAPS       :no end gap separation pen. \r
 623 -GAPDIST=n     :gap separation pen. range\r
 624 -NOPGAP        :residue-specific gaps off  \r
 625 -NOHGAP        :hydrophilic gaps off\r
 626 -HGAPRESIDUES= :list hydrophilic res.    \r
 627 -MAXDIV=n      :% ident. for delay\r
 628 -TYPE=         :PROTEIN or DNA\r
 629 -TRANSWEIGHT=f :transitions weighting\r
 630 -ITERATION=    :NONE or TREE or ALIGNMENT\r
 631 -NUMITER=n     :maximum number of iterations to perform\r
 632 -NOWEIGHTS     :disable sequence weighting\r
 633 \r
 634 \r
 635 ***Profile Alignments:***\r
 636 -PROFILE      :Merge two alignments by profile alignment\r
 637 -NEWTREE1=    :file for new guide tree for profile1\r
 638 -NEWTREE2=    :file for new guide tree for profile2\r
 639 -USETREE1=    :file for old guide tree for profile1\r
 640 -USETREE2=    :file for old guide tree for profile2\r
 641 \r
 642 \r
 643 ***Sequence to Profile Alignments:***\r
 644 -SEQUENCES   :Sequentially add profile2 sequences to profile1 alignment\r
 645 -NEWTREE=    :file for new guide tree\r
 646 -USETREE=    :file for old guide tree\r
 647 \r
 648 \r
 649 ***Structure Alignments:***\r
 650 -NOSECSTR1     :do not use secondary structure-gap penalty mask for profile 1 \r
 651 -NOSECSTR2     :do not use secondary structure-gap penalty mask for profile 2\r
 652 -SECSTROUT=STRUCTURE or MASK or BOTH or NONE   :output in alignment file\r
 653 -HELIXGAP=n    :gap penalty for helix core residues \r
 654 -STRANDGAP=n   :gap penalty for strand core residues\r
 655 -LOOPGAP=n     :gap penalty for loop regions\r
 656 -TERMINALGAP=n :gap penalty for structure termini\r
 657 -HELIXENDIN=n  :number of residues inside helix to be treated as terminal\r
 658 -HELIXENDOUT=n :number of residues outside helix to be treated as terminal\r
 659 -STRANDENDIN=n :number of residues inside strand to be treated as terminal\r
 660 -STRANDENDOUT=n:number of residues outside strand to be treated as terminal \r
 661 \r
 662 \r
 663 ***Trees:***\r
 664 -OUTPUTTREE=nj OR phylip OR dist OR nexus\r
 665 -SEED=n        :seed number for bootstraps.\r
 666 -KIMURA        :use Kimura's correction.   \r
 667 -TOSSGAPS      :ignore positions with gaps.\r
 668 -BOOTLABELS=node OR branch :position of bootstrap values in tree display\r
 669 -CLUSTERING=   :NJ or UPGMA\r
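
The parameters listed above follow two patterns: verbs and switches are given
as -NAME, and settings as -NAME=value. The Python sketch below assembles such
an argument list from a script. The helper name and the executable name
"clustalw2" are assumptions for this example; use whatever name the program
has on your system.

```python
def build_clustalw_args(program="clustalw2", verbs=(), **params):
    """Build an argument list: verbs become -VERB, settings become -NAME=value."""
    args = [program]
    for verb in verbs:
        args.append("-" + verb.upper())
    for name, value in params.items():
        # Only the option name is upper-cased; values such as file names are kept as-is.
        args.append("-%s=%s" % (name.upper(), value))
    return args


args = build_clustalw_args(
    verbs=["align"],
    infile="seqs.fasta",     # hypothetical input file name
    type="PROTEIN",
    output="PHYLIP",
    clustering="UPGMA",
)
print(" ".join(args))
# clustalw2 -ALIGN -INFILE=seqs.fasta -TYPE=PROTEIN -OUTPUT=PHYLIP -CLUSTERING=UPGMA
```

The resulting list can be handed to subprocess.run(args) to start the program.
Building a list rather than a single string avoids shell quoting problems with
file names that contain spaces.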
 670 \r
 671 \r
 672 >> HELP 0 <<             Help for tree output format options\r
 673 \r
 674 Four output formats are offered: 1) Clustal, 2) Phylip, 3) Just the distances,\r
 675 4) Nexus.\r
 676 \r
 677 None of these formats displays the results graphically. Many packages can\r
 678 display trees in the PHYLIP format 2) below. It can also be imported into\r
 679 the PHYLIP programs RETREE, DRAWTREE and DRAWGRAM for graphical display. \r
 680 NEXUS format trees can be read by PAUP and MacClade.\r
 681 \r
 682 1) Clustal format output. \r
 683 This format is verbose and lists all of the distances between the sequences and\r
 684 the number of alignment positions used for each. The tree is described at the\r
 685 end of the file. It lists the sequences that are joined at each alignment step\r
 686 and the branch lengths. After two sequences are joined, the pair is referred to\r
 687 later as a NODE. The number of a NODE is the number of the lowest sequence in that\r
 688 NODE.   \r
 689 \r
 690 2) Phylip format output.\r
 691 This format is the New Hampshire format, used by many phylogenetic analysis\r
 692 packages. It consists of a series of nested parentheses, describing the\r
 693 branching order, with the sequence names and branch lengths. It can be used by\r
 694 the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the\r
 695 trees graphically. This is the same format used during multiple alignment for\r
 696 the guide trees. \r
 697 \r
 698 Use this format with NJplot (Manolo Gouy), supplied with Clustal W. Some other\r
 699 packages that can read and display New Hampshire format are TreeView (Mac/PC),\r
 700 TreeTool (UNIX), and Phylowin.\r
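
To make the nested-parentheses layout concrete, here is a small Python sketch
that reads a New Hampshire string into nested (name, length, children) tuples
and lists the sequences with their branch lengths. It is a minimal reader for
illustration only: the function names are invented here, the example tree is
made up, and quoted names and bracketed comments are not handled.

```python
def parse_newick(text):
    """Parse a New Hampshire string into nested (name, length, children) tuples."""
    text = "".join(text.split())           # drop spaces and line breaks
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                       # consume '('
            children.append(node())
            while pos < len(text) and text[pos] == ",":
                pos += 1                   # consume ','
                children.append(node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1                       # consume ')'
        start = pos
        while pos < len(text) and text[pos] not in "(),:;":
            pos += 1
        name = text[start:pos]             # empty for unnamed internal nodes
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in "(),;":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return node()


def leaves(tree):
    """Return (sequence name, branch length) pairs in left-to-right order."""
    name, length, children = tree
    if not children:
        return [(name, length)]
    return [leaf for child in children for leaf in leaves(child)]


tree = parse_newick("((seqA:0.1,seqB:0.2):0.05,seqC:0.3);")
print(leaves(tree))
# [('seqA', 0.1), ('seqB', 0.2), ('seqC', 0.3)]
```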
 701 \r
 702 3) The distances only.\r
 703 This format just outputs a matrix of all the pairwise distances in a format\r
 704 that can be used by the Phylip package. It used to be useful when one could not\r
 705 produce distances from protein sequences in the Phylip package but is now\r
 706 redundant (Protdist of Phylip 3.5 now does this).\r
 707 \r
 708 4) NEXUS FORMAT TREE. This format is used by several popular phylogeny programs,\r
 709 including PAUP and MacClade. The format is described fully in:\r
 710 Maddison, D. R., D. L. Swofford and W. P. Maddison.  1997.\r
 711 NEXUS: an extensible file format for systematic information.\r
 712 Systematic Biology 46:590-621.\r
 713 \r
 714 5) TOGGLE PHYLIP BOOTSTRAP POSITIONS\r
 715 By default, the bootstrap values are placed on the nodes of the phylip format\r
 716 output tree. This is inaccurate as the bootstrap values should be associated\r
 717 with the tree branches and not the nodes. However, this format can be read and\r
 718 displayed by TreeTool, TreeView and Phylowin. An option is available to\r
 719 correctly place the bootstrap values on the branches with which they are\r
 720 associated.\r
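
The difference between the two placements can be illustrated with a short
Python sketch that rewrites one form into the other. The notation is an
assumption made for this example: a node label is taken to be a number written
directly after a closing parenthesis, and a branch label a number in square
brackets after the branch length. Check the tree files written by your own
version before relying on it. The function name and the example tree are
invented.

```python
import re

# ")95:0.05" -> "):0.05[95]" under the notation assumed above.
NODE_LABEL = re.compile(r"\)(\d+):([0-9.eE+-]+)")


def node_to_branch_labels(newick):
    """Move bootstrap values from node labels to bracketed branch labels."""
    return NODE_LABEL.sub(r"):\2[\1]", newick)


print(node_to_branch_labels("((A:0.1,B:0.2)95:0.05,C:0.3);"))
# ((A:0.1,B:0.2):0.05[95],C:0.3);
```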