   1
   2
   3
   4  CLUSTAL 2.1 Multiple Sequence Alignments
   5
   6
   7
   8
   9 >> HELP NEW <<             NEW FEATURES/OPTIONS
  10
  11 ==UPGMA==
  12  The UPGMA algorithm has been added to allow faster tree construction. The user now
  13  has the choice of using Neighbour Joining or UPGMA. The default is still NJ, but the
  14  user can change this by setting the clustering parameter.
  15
  16  -CLUSTERING=   :NJ or UPGMA
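
 For readers unfamiliar with UPGMA, the following is a minimal sketch of
 average-linkage clustering on a distance matrix. It is an illustration only,
 not the code used by Clustal; the function name upgma and the nested-tuple
 output are assumptions made for this example, and branch lengths are omitted.

```python
def upgma(names, dist):
    """Cluster by repeatedly merging the closest pair (average linkage).

    names: list of sequence names; dist: symmetric matrix (list of lists).
    Returns the tree topology as nested tuples (no branch lengths).
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    # Pair distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        (a, b), _ = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del d[(a, b)]
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


if __name__ == "__main__":
    print(upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
```

 In the small example, A and B are merged first because they are the closest
 pair, and C joins last.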
  17
  18 ==ITERATION==
  19
  20  A remove-first iteration scheme has been added. This can be used to improve the final
  21  alignment or improve the alignment at each stage of the progressive alignment. During the
  22  iteration step each sequence is removed in turn and realigned. If the resulting alignment
  23  is better than the previous alignment it is kept. This process is repeated until the score
  24  converges (the score is not improved) or until the maximum number of iterations is
  25  reached. The user can iterate at each step of the progressive alignment by setting the
  26  iteration parameter to TREE or just on the final alignment by setting the iteration
  27  parameter to ALIGNMENT. The default is no iteration. The maximum number of iterations can
  28  be set using the numiter parameter. The default number of iterations is 3.
  29
  30  -ITERATION=    :NONE or TREE or ALIGNMENT
  31
  32  -NUMITER=n     :Maximum number of iterations to perform
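
 The control flow of the iteration scheme described above can be sketched as
 follows. This is a generic outline under stated assumptions, not the Clustal
 source: realign and score are hypothetical callables supplied by the caller,
 and an alignment is represented as a plain list.

```python
def iterate_remove_first(alignment, realign, score, max_iterations=3):
    """Remove-and-realign loop: keep a candidate only if it scores better.

    realign(alignment, index) -> new alignment with sequence `index`
                                 removed and realigned (hypothetical helper)
    score(alignment)          -> number, higher is better (hypothetical helper)
    """
    best = alignment
    best_score = score(best)
    for _ in range(max_iterations):
        improved = False
        for index in range(len(best)):
            candidate = realign(best, index)
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
                improved = True
        if not improved:
            break  # the score has converged
    return best, best_score


if __name__ == "__main__":
    # Toy stand-ins: each "sequence" is a number that realignment can raise to 2.
    toy_realign = lambda aln, i: aln[:i] + [min(aln[i] + 1, 2)] + aln[i + 1:]
    print(iterate_remove_first([0, 0], toy_realign, sum))
```

 The loop stops either when a full pass brings no improvement or when
 max_iterations passes have been made, matching the two stopping conditions
 given in the text.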
  33
  34 ==HELP==
  35
  36  -FULLHELP      :Print out the complete help content
  37
  38 ==MISC==
  39
  40  -MAXSEQLEN=n   :Maximum allowed sequence length
  41
  42  -QUIET         :Reduce console output to minimum
  43
  44  -STATS=file    :Log some alignment statistics to file
  45
  46
  47 >> HELP 1 <<             General help for CLUSTAL W (2.1)
  48
  49 Clustal W is a general purpose multiple alignment program for DNA or proteins.
  50
  51 SEQUENCE INPUT:  all sequences must be in 1 file, one after another.
  52 7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT,
  53 Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file.
  54 All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
  55 except "-" which is used to indicate a GAP ("." in MSF-RSF).
  56
  57 To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to
  58 INPUT them; go to menu item 2 to do the multiple alignment.
  59
  60 PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments.  Use this to
  61 add a new sequence to an old alignment, or to use secondary structure to guide
  62 the alignment process.  GAPS in the old alignments are indicated using the "-"
  63 character.   PROFILES can be input in ANY of the allowed formats; just
  64 use "-" (or "." for MSF-RSF) for each gap position.
  65
  66 PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
  67 with "-" characters to indicate gaps) OR after a multiple alignment while the
  68 alignment is still in memory.
  69
  70
  71 The program tries to automatically recognise the different file formats used
  72 and to guess whether the sequences are amino acid or nucleotide.  This is not
  73 always foolproof.
  74
  75 FASTA and NBRF-PIR formats are recognised by having a ">" as the first
  76 character in the file.
  77
  78 EMBL-Swiss Prot formats are recognised by the letters
  79 ID at the start of the file (the token for the entry name field).
  80
  81 CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.
  82
  83 GCG-MSF format is recognised by one of the following:
  84        - the word PileUp at the start of the file.
  85        - the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
  86          at the start of the file.
  87        - the word MSF on the first line of the file, and the characters ..
  88          at the end of this line.
  89
  90 GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of
  91 the file.
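
 The recognition rules listed above can be written down as a short sketch.
 It follows only the rules stated in this help text; the function name
 guess_format, the skipping of leading whitespace and the "unknown" fallback
 are assumptions, and FASTA is not distinguished from NBRF-PIR because the
 text does not say how that is done.

```python
def guess_format(text):
    """Guess the input format from the start of the file contents."""
    stripped = text.lstrip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if stripped.startswith(">"):
        return "FASTA or NBRF-PIR"
    if stripped.startswith("ID"):
        return "EMBL-SWISSPROT"
    if stripped.startswith("CLUSTAL"):
        return "CLUSTAL"
    if stripped.startswith("!!RICH_SEQUENCE"):
        return "GCG-RSF"
    msf_markers = ("PileUp", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT")
    if stripped.startswith(msf_markers):
        return "GCG-MSF"
    # Third MSF rule: "MSF" on the first line, and ".." at the end of it.
    if "MSF" in first_line and first_line.rstrip().endswith(".."):
        return "GCG-MSF"
    return "unknown"


if __name__ == "__main__":
    print(guess_format(">seq1\nACGT\n"))
    print(guess_format("CLUSTAL 2.1 multiple sequence alignment\n"))
```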
  92
  93
  94 If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the
  95 sequence will be assumed to be nucleotide.  This works in 97.3% of cases
  96 but watch out!
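
 A minimal sketch of the 85% rule, assuming that only alphabetic characters
 are counted (as stated under SEQUENCE INPUT) and that case is ignored. The
 function name and the "unknown" result for an empty sequence are assumptions
 of this example.

```python
def guess_sequence_type(sequence, threshold=0.85):
    """Return "DNA" if at least `threshold` of the letters are A,C,G,T,U or N."""
    letters = [ch.upper() for ch in sequence if ch.isalpha()]
    if not letters:
        return "unknown"
    nucleotide_like = sum(ch in "ACGTUN" for ch in letters)
    return "DNA" if nucleotide_like / len(letters) >= threshold else "PROTEIN"


if __name__ == "__main__":
    print(guess_sequence_type("ACGT-ACGT"))
    print(guess_sequence_type("MKVLAAGIVG"))
```

 A short peptide made up mostly of A, C, G and T would be misclassified by
 this rule, which is why the warning above applies.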
  97
  98
  99 >> HELP 2 <<             Help for multiple alignments
 100
 101 If you have already loaded sequences, use menu item 1 to do the complete
 102 multiple alignment.  You will be prompted for 2 output files: 1 for the
 103 alignment itself; another to store a dendrogram that describes the similarity
 104 of the sequences to each other.
 105
 106 Multiple alignments are carried out in 3 stages (automatically done from menu
 107 item 1 ...Do complete multiple alignments now):
 108
 109 1) all sequences are compared to each other (pairwise alignments);
 110
 111 2) a dendrogram (like a phylogenetic tree) is constructed, describing the
 112 approximate groupings of the sequences by similarity (stored in a file).
 113
 114 3) the final multiple alignment is carried out, using the dendrogram as a guide.
 115
 116
 117 PAIRWISE ALIGNMENT parameters control the speed-sensitivity of the initial
 118 alignments.
 119
 120 MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.
 121
 122
 123 RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences
 124 during multiple alignment if you wish to change the parameters and try again.
 125 This only takes effect just before you do a second multiple alignment.  You
 126 can make phylogenetic trees after alignment whether or not this is ON.
 127 If you turn this OFF, the new gaps are kept even if you do a second multiple
 128 alignment. This allows you to iterate the alignment gradually.  Sometimes, the
 129 alignment is improved by a second or third pass.
 130
 131 SCREEN DISPLAY (menu item 8) can be used to send the output alignments to the
 132 screen as well as to the output file.
 133
 134 You can skip the first stages (pairwise alignments; dendrogram) by using an
 135 old dendrogram file (menu item 3); or you can just produce the dendrogram
 136 with no final multiple alignment (menu item 2).
 137
 138
 139 OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 7
 140 different alignment formats (CLUSTAL, GCG, NBRF-PIR, PHYLIP, GDE, NEXUS, and FASTA).
 141
 142
 143
 144 >> HELP 3 <<             Help for pairwise alignment parameters
 145
 146 A distance is calculated between every pair of sequences and these are used to
 147 construct the dendrogram which guides the final multiple alignment. The scores
 148 are calculated from separate pairwise alignments. These can be calculated using
 149 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur
 150 and Lipman (extremely fast but approximate).
 151
 152 You can choose between the 2 alignment methods using menu option 8.  The
 153 slow-accurate method is fine for short sequences but will be VERY SLOW for
 154 many (e.g. >100) long (e.g. >1000 residue) sequences.
 155
 156 SLOW-ACCURATE alignment parameters:
 157         These parameters do not have any effect on the speed of the alignments.
 158 They are used to give initial alignments which are then rescored to give percent
 159 identity scores.  These % scores are the ones which are displayed on the
 160 screen.  The scores are converted to distances for the trees.
 161
 162 1) Gap Open Penalty:      the penalty for opening a gap in the alignment.
 163 2) Gap extension penalty: the penalty for extending a gap by 1 residue.
 164 3) Protein weight matrix: the scoring table which describes the similarity
 165                           of each amino acid to each other.
 166 4) DNA weight matrix:     the scores assigned to matches and mismatches
 167                           (including IUB ambiguity codes).
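
 The rescoring step described above (alignment, then percent identity, then
 distance) can be illustrated with a small sketch. How Clustal W treats
 gapped positions and the exact conversion to a distance are not given in
 this help text, so skipping gapped columns and using 1 - identity/100 are
 assumptions made here.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def identity_to_distance(identity_percent):
    """Convert a percent identity score to a distance between 0 and 1."""
    return 1.0 - identity_percent / 100.0


if __name__ == "__main__":
    pid = percent_identity("ACGT", "ACGA")
    print(pid, identity_to_distance(pid))
```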
 168
 169
 170 FAST-APPROXIMATE alignment parameters:
 171
 172 These similarity scores are calculated from fast, approximate, global
 173 alignments, which are controlled by 4 parameters.   2 techniques are used to make
 174 these alignments very fast: 1) only exactly matching fragments (k-tuples) are
 175 considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
 176 are used.
 177
 178 K-TUPLE SIZE:  This is the size of exactly matching fragment that is used.
 179 INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
 180 For longer sequences (e.g. >1000 residues) you may need to increase the default.
 181
 182 GAP PENALTY:   This is a penalty for each gap in the fast alignments.  It has
 183 little effect on the speed or sensitivity except for extreme values.
 184
 185 TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
 186 dot-matrix plot) is calculated.  Only the best ones (with most matches) are
 187 used in the alignment.  This parameter specifies how many.  Decrease for speed;
 188 increase for sensitivity.
 189
 190 WINDOW SIZE:  This is the number of diagonals around each of the 'best'
 191 diagonals that will be used.  Decrease for speed; increase for sensitivity.
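
 To make the K-TUPLE SIZE and TOP DIAGONALS parameters concrete, here is a
 simplified sketch that counts exact k-tuple matches on each diagonal of the
 imaginary dot-matrix plot and keeps the best ones. It is an illustration of
 the idea only, not the Wilbur and Lipman code used by the program; the
 function and parameter names are assumptions, and the window and gap
 penalty steps are left out.

```python
from collections import Counter


def top_diagonals(seq_a, seq_b, ktuple=1, top_diags=5):
    """Return the `top_diags` diagonals with the most exact k-tuple matches.

    A diagonal is identified by (position in seq_a - position in seq_b).
    The result is a list of (diagonal, match_count) pairs, best first.
    """
    # Index every k-tuple of seq_b by its start positions.
    positions = {}
    for j in range(len(seq_b) - ktuple + 1):
        positions.setdefault(seq_b[j:j + ktuple], []).append(j)
    # Count matches per diagonal.
    counts = Counter()
    for i in range(len(seq_a) - ktuple + 1):
        for j in positions.get(seq_a[i:i + ktuple], ()):
            counts[i - j] += 1
    return counts.most_common(top_diags)


if __name__ == "__main__":
    print(top_diagonals("ACGTACGT", "ACGTACGT", ktuple=2, top_diags=1))
```

 A larger ktuple produces fewer chance matches and so less work, which is
 why increasing it trades sensitivity for speed.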
 192
 193
 194 >> HELP 4 <<             Help for multiple alignment parameters
 195
 196 These parameters control the final multiple alignment. This is the core of the
 197 program and the details are complicated. To fully understand the use of the
 198 parameters and the scoring system, you will have to refer to the documentation.
 199
 200 Each step in the final multiple alignment consists of aligning two alignments
 201 or sequences.  This is done progressively, following the branching order in
 202 the GUIDE TREE.  The basic parameters to control this are two gap penalties and
 203 the scores for various identical-non-identical residues.
 204
 205 1) and 2) The GAP PENALTIES are set by menu items 1 and 2. These control the
 206 cost of opening up every new gap and the cost of every item in a gap.
 207 Increasing the gap opening penalty will make gaps less frequent. Increasing
 208 the gap extension penalty will make gaps shorter. Terminal gaps are not
 209 penalised.
 210
 211 3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most
 212 distantly related sequences until after the most closely related sequences have
 213 been aligned.   The setting shows the percent identity level required to delay
 214 the addition of a sequence; sequences that are less identical than this level
 215 to any other sequences will be aligned later.
 216
 217
 218
 219 4) The TRANSITION WEIGHT gives transitions (A <--> G or C <--> T
 220 i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0
 221 and 1; a weight of zero means that the transitions are scored as mismatches,
 222 while a weight of 1 gives the transitions the match score. For distantly related
 223 DNA sequences, the weight should be near to zero; for closely related sequences
 224 it can be useful to assign a higher score.
 225
 226
 227 5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a choice of
 228 weight matrices. The default for proteins in version 1.8 is the PAM series
 229 derived by Gonnet and colleagues. Note, a series is used! The actual matrix
 230 that is used depends on how similar the sequences to be aligned at this
 231 alignment step are. Different matrices work differently at each evolutionary
 232 distance.
 233
 234 6) DNA WEIGHT MATRIX leads to a new menu where a single matrix (not a series)
 235 can be selected. The default is the matrix used by BESTFIT for comparison of
 236 nucleic acid sequences.
 237
 238 Further help is offered in the weight matrix menu.
 239
 240
 241 7)  In the weight matrices, you can use negative as well as positive values if
 242 you wish, although the matrix will be automatically adjusted to all positive
 243 scores, unless the NEGATIVE MATRIX option is selected.
 244
 245 8) PROTEIN GAP PARAMETERS displays a menu allowing you to set some Gap Penalty
 246 options which are only used in protein alignments.
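
 Items 3 and 4 above can be illustrated with a short sketch. Both functions
 are assumptions made for this example: the text fixes only the two end
 points of the transition weight (0 gives the mismatch score, 1 gives the
 match score), so the linear interpolation between them is a guess, and the
 delay rule is read here as "the best identity to any other sequence is
 below the threshold".

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def dna_pair_score(a, b, match, mismatch, transition_weight):
    """Score one DNA column; transitions get a score between mismatch and match."""
    if a == b:
        return match
    if frozenset((a, b)) in TRANSITIONS:
        # Assumed linear interpolation between the two documented end points.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch


def delayed_sequences(identity_matrix, max_div):
    """Indices of sequences whose best identity to any other is below max_div.

    identity_matrix: square matrix of percent identities (list of lists).
    """
    delayed = []
    for i, row in enumerate(identity_matrix):
        best = max((v for j, v in enumerate(row) if j != i), default=100.0)
        if best < max_div:
            delayed.append(i)
    return delayed


if __name__ == "__main__":
    print(dna_pair_score("A", "G", 1.0, 0.0, 0.5))
    print(delayed_sequences([[100, 90, 20], [90, 100, 25], [20, 25, 100]], 30))
```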
 247
 248
 249 >> HELP A <<             Help for protein gap parameters.
 250
 251 1) RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce
 252 or increase the gap opening penalties at each position in the alignment or
 253 sequence.  See the documentation for details.  As an example, positions that
 254 are rich in glycine are more likely to have an adjacent gap than positions that
 255 are rich in valine.
 256
 257 2) 3) HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within
 258 a run (5 or more residues) of hydrophilic amino acids; these are likely to
 259 be loop or random coil regions where gaps are more common.  The residues that
 260 are "considered" to be hydrophilic are set by menu item 3.
 261
 262 4) GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too
 263 close to each other. Gaps that are less than this distance apart are penalised
 264 more than other gaps. This does not prevent close gaps; it makes them less
 265 frequent, promoting a block-like appearance of the alignment.
 266
 267 5) END GAP SEPARATION treats end gaps just like internal gaps for the purposes
 268 of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above).
 269 If you turn this off, end gaps will be ignored for this purpose.  This is
 270 useful when you wish to align fragments where the end gaps are not biologically
 271 meaningful.
 272
 273
 274 >> HELP 5 <<             Help for output format options.
 275
 276 Several output formats are offered. You can choose any (or all 7 if you wish).
 277
 278 CLUSTAL format output is a self explanatory alignment format.  It shows the
 279 sequences aligned in blocks.  It can be read in again at a later date to
 280 (for example) calculate a phylogenetic tree or add a new sequence with a
 281 profile alignment.
 282
 283 GCG output can be used by any of the GCG programs that can work on multiple
 284 alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN).  It is the same as the GCG
 285 .msf format files (multiple sequence file); new in version 7 of GCG.
 286
 287 Fasta output is widely used because of its simplicity. Each sequence name is
 288 preceded by a '>' sign. The sequence itself is printed on the following lines.
 289
 290 PHYLIP format output can be used for input to the PHYLIP package of Joe
 291 Felsenstein.  This is an extremely widely used package for doing every
 292 imaginable form of phylogenetic analysis (MUCH more than the modest
 293 introduction offered by this program).
 294
 295 NBRF-PIR:  this is the same as the standard PIR format with ONE ADDITION.  Gap
 296 characters "-" are used to indicate the positions of gaps in the multiple
 297 alignment.  These files can be re-used as input in any part of clustal that
 298 allows sequences (or alignments or profiles) to be read in.
 299
 300 GDE:  this is the flat file format used by the GDE package of Steven Smith.
 301
 302 NEXUS: the format used by several phylogeny programs, including PAUP and
 303 MacClade.
 304
 305 GDE OUTPUT CASE: sequences in GDE format may be written in either upper or
 306 lower case.
 307
 308 CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the
 309 alignment lines in clustalw format.
 310
 311 OUTPUT ORDER is used to control the order of the sequences in the output
 312 alignments.  By default, the order corresponds to the order in which the
 313 sequences were aligned (from the guide tree-dendrogram), thus automatically
 314 grouping closely related sequences. This switch can be used to set the order
 315 to the same as the input file.
 316
 317 PARAMETER OUTPUT: This option allows you to save all your parameter settings
 318 in a parameter file. This file can be used subsequently to rerun Clustal W
 319 using the same parameters.
 320
 321
 322 >> HELP 6 <<             Help for profile and structure alignments
 323
 324 By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile
 325 alignments allow you to store alignments of your favourite sequences and add
 326 new sequences to them in small bunches at a time. A profile is simply an
 327 alignment of one or more sequences (e.g. an alignment output file from CLUSTAL
 328 W). Each input can be a single sequence. One or both sets of input sequences
 329 may include secondary structure assignments or gap penalty masks to guide the
 330 alignment.
 331
 332 The profiles can be in any of the allowed input formats with "-" characters
 333 used to specify gaps (except for MSF-RSF where "." is used).
 334
 335 You have to specify the 2 profiles by choosing menu items 1 and 2 and giving
 336 2 file names.  Then Menu item 3 will align the 2 profiles to each other.
 337 Secondary structure masks in either profile can be used to guide the alignment.
 338
 339 Menu item 4 will take the sequences in the second profile and align them to
 340 the first profile, 1 at a time.  This is useful to add some new sequences to
 341 an existing alignment, or to align a set of sequences to a known structure.
 342 In this case, the second profile would not be pre-aligned.
 343
 344
 345 The alignment parameters can be set using menu items 5, 6 and 7. These are
 346 EXACTLY the same parameters as used by the general, automatic multiple
 347 alignment procedure. The general multiple alignment procedure is simply a
 348 series of profile alignments. Carrying out a series of profile alignments on
 349 larger and larger groups of sequences, allows you to manually build up a
 350 complete alignment, if necessary editing intermediate alignments.
 351
 352 SECONDARY STRUCTURE OPTIONS. Menu Option 0 allows you to set 2D structure
 353 parameters. If a solved structure is available, it can be used to guide the
 354 alignment by raising gap penalties within secondary structure elements, so
 355 that gaps will preferentially be inserted into unstructured surface loops.
 356 Alternatively, a user-specified gap penalty mask can be supplied directly.
 357
 358 A gap penalty mask is a series of numbers between 1 and 9, one per position in
 359 the alignment. Each number specifies how much the gap opening penalty is to be
 360 raised at that position (raised by multiplying the basic gap opening penalty
 361 by the number) i.e. a mask figure of 1 at a position means no change
 362 in gap opening penalty; a figure of 4 means that the gap opening penalty is
 363 four times greater at that position, making gaps 4 times harder to open.
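
 The arithmetic of a gap penalty mask is simple enough to show directly.
 This sketch multiplies a basic gap opening penalty by each mask figure, as
 described above; the function name and the example penalty of 10.0 are
 assumptions.

```python
def position_gap_open_penalties(base_gap_open, mask):
    """Return the gap opening penalty at each position for a mask of digits 1-9."""
    penalties = []
    for figure in mask:
        if figure not in "123456789":
            raise ValueError(f"mask figures must be between 1 and 9, got {figure!r}")
        penalties.append(base_gap_open * int(figure))
    return penalties


if __name__ == "__main__":
    print(position_gap_open_penalties(10.0, "1142"))
```

 A figure of 1 leaves the penalty unchanged, while a figure of 4 makes a gap
 four times as expensive to open at that position.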
 364
 365 The format for gap penalty masks and secondary structure masks is explained
 366 in the help under option 0 (secondary structure options).
 367
 368
 369 >> HELP B <<             Help for secondary structure - gap penalty masks
 370
 371 The use of secondary structure-based penalties has been shown to improve the
 372 accuracy of multiple alignment. Therefore CLUSTAL W now allows gap penalty
 373 masks to be supplied with the input sequences. The masks work by raising gap
 374 penalties in specified regions (typically secondary structure elements) so that
 375 gaps are preferentially opened in the less well conserved regions (typically
 376 surface loops).
 377
 378 Options 1 and 2 control whether the input secondary structure information or
 379 gap penalty masks will be used.
 380
 381 Option 3 controls whether the secondary structure and gap penalty masks should
 382 be included in the output alignment.
 383
 384 Options 4 and 5 provide the value for raising the gap penalty at core Alpha
 385 Helical (A) and Beta Strand (B) residues. In CLUSTAL format, capital residues
 386 denote the A and B core structure notation. The basic gap penalties are
 387 multiplied by the amount specified.
 388
 389 Option 6 provides the value for the gap penalty in Loops. By default this
 390 penalty is not raised. In CLUSTAL format, loops are specified by "." in the
 391 secondary structure notation.
 392
 393 Option 7 provides the value for setting the gap penalty at the ends of
 394 secondary structures. Ends of secondary structures are observed to grow
 395 and-or shrink in related structures. Therefore by default these are given
 396 intermediate values, lower than the core penalties. All secondary structure
 397 read in as lower case in CLUSTAL format gets the reduced terminal penalty.
 398
 399 Options 8 and 9 specify the range of structure termini for the intermediate
 400 penalties. In the alignment output, these are indicated as lower case.
 401 For Alpha Helices, by default, the range spans the end helical turn. For
 402 Beta Strands, the default range spans the end residue and the adjacent loop
 403 residue, since sequence conservation often extends beyond the actual H-bonded
 404 Beta Strand.
 405
 406 CLUSTAL W can read the masks from SWISS-PROT, CLUSTAL or GDE format input
 407 files. For many 3-D protein structures, secondary structure information is
 408 recorded in the feature tables of SWISS-PROT database entries. You should
 409 always check that the assignments are correct - some are quite inaccurate.
 410 CLUSTAL W looks for SWISS-PROT HELIX and STRAND assignments e.g.
 411
 412 FT   HELIX       100    115
 413 FT   STRAND      118    119
 414
 415 The structure and penalty masks can also be read from CLUSTAL alignment format
 416 as comment lines beginning "!SS_" or "!GM_" e.g.
 417
 418 !SS_HBA_HUMA    ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
 419 !GM_HBA_HUMA    112224444444444222122244444444442222224222111111111222444444
 420 HBA_HUMA        VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
 421
 422 Note that the mask itself is a set of numbers between 1 and 9 each of which is
 423 assigned to the residue(s) in the same column below.
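
 As a sketch of how such lines might be handled, the code below collects the
 "!SS_" and "!GM_" comment lines and derives a penalty mask from a structure
 line. The multipliers (4 for upper case core residues, 2 for lower case
 termini, 1 for loops) are chosen to match the example above and are not
 stated defaults; the function names are assumptions.

```python
def parse_mask_lines(lines):
    """Collect structure (!SS_) and penalty mask (!GM_) lines, keyed by name."""
    structures, masks = {}, {}
    for line in lines:
        if line.startswith("!SS_"):
            name, value = line[4:].split()
            structures[name] = value
        elif line.startswith("!GM_"):
            name, value = line[4:].split()
            masks[name] = value
    return structures, masks


def structure_to_mask(structure, core=4, terminal=2, loop=1):
    """Turn a secondary structure line into a gap penalty mask string."""
    figures = []
    for symbol in structure:
        if symbol == ".":
            figures.append(loop)        # loop position
        elif symbol.isupper():
            figures.append(core)        # core helix or strand residue
        else:
            figures.append(terminal)    # end of a secondary structure element
    return "".join(str(figure) for figure in figures)


if __name__ == "__main__":
    ss, gm = parse_mask_lines(["!SS_SEQ1    ..aaAAaa.", "!GM_SEQ1    112244221"])
    print(structure_to_mask(ss["SEQ1"]) == gm["SEQ1"])
```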
 424
 425 In GDE flat file format, the masks are specified as text and the names must
 426 begin with "SS_ or "GM_.
 427
 428 Either a structure or penalty mask or both may be used. If both are included in
 429 an alignment, the user will be asked which is to be used.
 430
 431
 432 >> HELP C <<             Help for secondary structure - gap penalty mask output options
 433
 434 The options in this menu let you choose whether or not to include the masks
 435 in the CLUSTAL W output alignments. Showing both is useful for understanding
 436 how the masks work. The secondary structure information is itself very useful
 437 in judging the alignment quality and in seeing how residue conservation
 438 patterns vary with secondary structure.
 439
 440
 441 >> HELP 7 <<             Help for phylogenetic trees
 442
 443 1) Before calculating a tree, you must have an ALIGNMENT in memory. This can be
 444 read in from a file in any of the allowed formats, or it can be the result of a
 445 full multiple alignment that is still in memory.
 446
 447
 448 *************** Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!! ***************
 449
 450
 451 The methods used are NJ (Neighbour Joining) and UPGMA. First
 452 you calculate distances (percent divergence) between all pairs of sequences from
 453 a multiple alignment; second you apply the NJ or UPGMA method to the distance matrix.
 454
 455 2) EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where
 456 ANY of the sequences have a gap will be ignored. This means that 'like' will be
 457 compared to 'like' in all distances, which is highly desirable. It also
 458 automatically throws away the most ambiguous parts of the alignment, which are
 459 concentrated around gaps (usually). The disadvantage is that you may throw away
 460 much of the data if there are many gaps (which is why it is difficult for us to
 461 make it the default).
 462
 463
 464
 465 3) CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this
 466 option makes no difference. For greater divergence, it corrects for the fact
 467 that observed distances underestimate actual evolutionary distances. This is
 468 because, as sequences diverge, more than one substitution will happen at many
 469 sites. However, you only see one difference when you look at the present day
 470 sequences. Therefore, this option has the effect of stretching branch lengths
 471 in trees (especially long branches). The corrections used here (for DNA or
 472 proteins) are both due to Motoo Kimura. See the documentation for details.
 473
 474 Where possible, this option should be used. However, for VERY divergent
 475 sequences, the distances cannot be reliably corrected. You will be warned if
 476 this happens. Even if none of the distances in a data set exceed the reliable
 477 threshold, if you bootstrap the data, some of the bootstrap distances may
 478 randomly exceed the safe limit.
 479
 480 4) To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED
 481 tree and all branch lengths. The root of the tree can only be inferred by
 482 using an outgroup (a sequence that you are certain branches at the outside
 483 of the tree .... certain on biological grounds) OR if you assume a degree
 484 of constancy in the 'molecular clock', you can place the root in the 'middle'
 485 of the tree (roughly equidistant from all tips).
 486
 487 5) TOGGLE PHYLIP BOOTSTRAP POSITIONS
 488 By default, the bootstrap values are correctly placed on the tree branches of
 489 the phylip format output tree. The toggle allows them to be placed on the
 490 nodes, which is incorrect, but some display packages (e.g. TreeTool, TreeView
 491 and Phylowin) only support node labelling but not branch labelling. Care
 492 should be taken to note which branches and labels go together.
 493
 494 6) OUTPUT FORMATS: four different formats are allowed. None of these displays
 495 the tree visually. Useful display programs accepting PHYLIP format include
 496 NJplot (from Manolo Gouy and supplied with Clustal W), TreeView (Mac-PC), and
 497 PHYLIP itself - OR get the PHYLIP package and use the tree drawing facilities
 498 there. (Get the PHYLIP package anyway if you are interested in trees). The
 499 NEXUS format can be read into PAUP or MacClade.
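
 The distance steps in items 1 to 3 above can be sketched as follows. The
 gap exclusion and the observed distance follow the text directly. The
 protein correction uses Kimura's empirical formula K = -ln(1 - D - D*D/5);
 whether this matches the program in every detail should be checked against
 the main documentation, and the DNA correction is not shown. All function
 names are assumptions.

```python
import math


def drop_gapped_columns(alignment, gap="-"):
    """Remove every column in which ANY sequence has a gap."""
    columns = [col for col in zip(*alignment) if gap not in col]
    if not columns:
        return ["" for _ in alignment]
    return ["".join(chars) for chars in zip(*columns)]


def fraction_different(seq_a, seq_b, gap="-"):
    """Observed distance: fraction of differing residues over ungapped columns."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def kimura_protein_distance(observed):
    """Correct an observed protein distance for multiple substitutions."""
    inside = 1.0 - observed - 0.2 * observed * observed
    if inside <= 0.0:
        raise ValueError("sequences too divergent to correct reliably")
    return -math.log(inside)


if __name__ == "__main__":
    rows = drop_gapped_columns(["AC-T", "ACGT", "A-GT"])
    print(rows)
    print(kimura_protein_distance(0.1), kimura_protein_distance(0.5))
```

 The corrected distance is always at least as large as the observed one, and
 the difference grows with divergence, which is the stretching of long
 branches mentioned in item 3.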
 500
 501
 502 >> HELP 8 <<             Help for choosing a weight matrix
 503
 504 For protein alignments, you use a weight matrix to determine the similarity of
 505 non-identical amino acids.  For example, Tyr aligned with Phe is usually judged
 506 to be 'better' than Tyr aligned with Pro.
 507
 508 There are three 'in-built' series of weight matrices offered. Each consists of
 509 several matrices which work differently at different evolutionary distances. To
 510 see the exact details, read the documentation. Crudely, we store several
 511 matrices in memory, spanning the full range of amino acid distance (from almost
 512 identical sequences to highly divergent ones). For very similar sequences, it
 513 is best to use a strict weight matrix which only gives a high score to
 514 identities and the most favoured conservative substitutions. For more divergent
 515 sequences, it is appropriate to use "softer" matrices which give a high score
 516 to many other frequent substitutions.
 517
 518 1) BLOSUM (Henikoff). These matrices appear to be the best available for
 519 carrying out database similarity (homology) searches. The matrices used are:
 520 Blosum 80, 62, 45 and 30. (BLOSUM was the default in earlier Clustal W
 521 versions)
 522
 523 2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
 524 We use the PAM 20, 60, 120 and 350 matrices.
 525
 526 3) GONNET. These matrices were derived using almost the same procedure as the
 527 Dayhoff one (above) but are much more up to date and are based on a far larger
 528 data set. They appear to be more sensitive than the Dayhoff series. We use the
 529 GONNET 80, 120, 160, 250 and 350 matrices. This series is the default for
 530 Clustal W version 1.8.
 531
 532 We also supply an identity matrix which gives a score of 1.0 to two identical
 533 amino acids and a score of zero otherwise. This matrix is not very useful.
 534 Alternatively, you can read in your own (just one matrix, not a series).
 535
 536 A new matrix can be read from a file on disk, if the filename consists only
 537 of lower case characters. The values in the new weight matrix must be integers
 538 and the scores should be similarities. You can use negative as well as positive
 539 values if you wish, although the matrix will be automatically adjusted to all
 540 positive scores.
 541
 542
 543
 544 For DNA, a single matrix (not a series) is used. Two hard-coded matrices are
 545 available:
 546
 547
 548 1) IUB. This is the default scoring matrix used by BESTFIT for the comparison
 549 of nucleic acid sequences. X's and N's are treated as matches to any IUB
 550 ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
 551
 552
 553 2) CLUSTALW(1.6). The previous system used by Clustal W, in which matches score
 554 1.0 and mismatches score 0. All matches for IUB symbols also score 0.
 555
 556 INPUT FORMAT  The format used for a new matrix is the same as the BLAST program.
 557 Any lines beginning with a # character are assumed to be comments. The first
 558 non-comment line should contain a list of amino acids in any order, using the
 559 1 letter code, followed by a * character. This should be followed by a square
 560 matrix of integer scores, with one row and one column for each amino acid. The
 561 last row and column of the matrix (corresponding to the * character) contain
 562 the minimum score over the whole matrix.
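
 A minimal sketch of a reader for the matrix format described above. It
 accepts rows with or without a leading residue label (BLAST matrix files
 normally carry one), which is an assumption about the files you will meet
 rather than a statement of what the program accepts; the function name is
 also an assumption.

```python
def parse_blast_matrix(text):
    """Parse a BLAST-style matrix into a dict mapping (row, column) -> score."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]
    if not rows:
        raise ValueError("no matrix data found")
    symbols = rows[0]                      # amino acid codes, ending with "*"
    if len(rows) - 1 != len(symbols):
        raise ValueError("matrix is not square")
    scores = {}
    for row_symbol, row in zip(symbols, rows[1:]):
        # Drop an optional leading row label.
        values = row[1:] if len(row) == len(symbols) + 1 else row
        if len(values) != len(symbols):
            raise ValueError(f"wrong number of scores in row {row_symbol}")
        for column_symbol, value in zip(symbols, values):
            scores[(row_symbol, column_symbol)] = int(value)
    return scores


if __name__ == "__main__":
    toy = "# toy matrix\n   A  C  *\nA  4 -1 -1\nC -1  5 -1\n* -1 -1 -1\n"
    matrix = parse_blast_matrix(toy)
    print(matrix[("A", "A")], matrix[("A", "C")])
```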
 563
 564
 565 >> HELP 9 <<             Help for command line parameters
 566
 567                 DATA (sequences)
 568
 569 -INFILE=file.ext                             :input sequences.
 570 -PROFILE1=file.ext  and  -PROFILE2=file.ext  :profiles (old alignment).
 571
 572
 573                 VERBS (do things)
 574
 575 -OPTIONS            :list the command line parameters
 576 -HELP  or -CHECK    :outline the command line params.
 577 -FULLHELP           :output full help content.
 578 -ALIGN              :do full multiple alignment.
 579 -TREE               :calculate NJ tree.
 580 -PIM                :output percent identity matrix (while calculating the tree)
 581 -BOOTSTRAP(=n)      :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
 582 -CONVERT            :output the input sequences in a different file format.
 583
 584
 585                 PARAMETERS (set things)
 586
 587 ***General settings:****
 588 -INTERACTIVE :read command line, then enter normal interactive menus
 589 -QUICKTREE   :use FAST algorithm for the alignment guide tree
 590 -TYPE=       :PROTEIN or DNA sequences
 591 -NEGATIVE    :protein alignment with negative values in matrix
 592 -OUTFILE=    :sequence alignment file name
 593 -OUTPUT=     :CLUSTAL(default), GCG, GDE, PHYLIP, PIR, NEXUS and FASTA
 594 -OUTORDER=   :INPUT or ALIGNED
 595 -CASE        :LOWER or UPPER (for GDE output only)
 596 -SEQNOS=     :OFF or ON (for Clustal output only)
 597 -SEQNO_RANGE=:OFF or ON (NEW: for all output formats)
 598 -RANGE=m,n   :sequence range to write starting m to m+n
 599 -MAXSEQLEN=n :maximum allowed input sequence length
 600 -QUIET       :Reduce console output to minimum
 601 -STATS=      :Log some alignment statistics to file
 602
 603 ***Fast Pairwise Alignments:***
 604 -KTUPLE=n    :word size
 605 -TOPDIAGS=n  :number of best diags.
 606 -WINDOW=n    :window around best diags.
 607 -PAIRGAP=n   :gap penalty
 608 -SCORE       :PERCENT or ABSOLUTE
 609
 610
 611 ***Slow Pairwise Alignments:***
 612 -PWMATRIX=    :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
 613 -PWDNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename
 614 -PWGAPOPEN=f  :gap opening penalty
 615 -PWGAPEXT=f   :gap extension penalty
 616
 617
 618 ***Multiple Alignments:***
 619 -NEWTREE=      :file for new guide tree
 620 -USETREE=      :file for old guide tree
 621 -MATRIX=       :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
 622 -DNAMATRIX=    :DNA weight matrix=IUB, CLUSTALW or filename
 623 -GAPOPEN=f     :gap opening penalty
 624 -GAPEXT=f      :gap extension penalty
 625 -ENDGAPS       :no end gap separation pen.
 626 -GAPDIST=n     :gap separation pen. range
 627 -NOPGAP        :residue-specific gaps off
 628 -NOHGAP        :hydrophilic gaps off
 629 -HGAPRESIDUES= :list hydrophilic res.
 630 -MAXDIV=n      :% ident. for delay
 631 -TYPE=         :PROTEIN or DNA
 632 -TRANSWEIGHT=f :transitions weighting
 633 -ITERATION=    :NONE or TREE or ALIGNMENT
 634 -NUMITER=n     :maximum number of iterations to perform
 635 -NOWEIGHTS     :disable sequence weighting
 636
 637
 638 ***Profile Alignments:***
 639 -PROFILE      :Merge two alignments by profile alignment
 640 -NEWTREE1=    :file for new guide tree for profile1
 641 -NEWTREE2=    :file for new guide tree for profile2
 642 -USETREE1=    :file for old guide tree for profile1
 643 -USETREE2=    :file for old guide tree for profile2
 644
 645
 646 ***Sequence to Profile Alignments:***
 647 -SEQUENCES   :Sequentially add profile2 sequences to profile1 alignment
 648 -NEWTREE=    :file for new guide tree
 649 -USETREE=    :file for old guide tree
 650
 651
 652 ***Structure Alignments:***
 653 -NOSECSTR1     :do not use secondary structure-gap penalty mask for profile 1
 654 -NOSECSTR2     :do not use secondary structure-gap penalty mask for profile 2
 655 -SECSTROUT=STRUCTURE or MASK or BOTH or NONE   :output in alignment file
 656 -HELIXGAP=n    :gap penalty for helix core residues
 657 -STRANDGAP=n   :gap penalty for strand core residues
 658 -LOOPGAP=n     :gap penalty for loop regions
 659 -TERMINALGAP=n :gap penalty for structure termini
 660 -HELIXENDIN=n  :number of residues inside helix to be treated as terminal
 661 -HELIXENDOUT=n :number of residues outside helix to be treated as terminal
 662 -STRANDENDIN=n :number of residues inside strand to be treated as terminal
 663 -STRANDENDOUT=n:number of residues outside strand to be treated as terminal
 664
 665
 666 ***Trees:***
 667 -OUTPUTTREE=nj OR phylip OR dist OR nexus
 668 -SEED=n        :seed number for bootstraps.
 669 -KIMURA        :use Kimura's correction.
 670 -TOSSGAPS      :ignore positions with gaps.
 671 -BOOTLABELS=node OR branch :position of bootstrap values in tree display
 672 -CLUSTERING=   :NJ or UPGMA
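
As an illustration of how the verbs and parameters above combine on one
command line, here is a minimal Python sketch that assembles an argument list.
It only builds the list and runs nothing. The executable name clustalw2, the
input file name and the helper build_clustalw_args are assumptions made for
this example; -INFILE= is the input option from the data section of this help.

```python
def build_clustalw_args(infile, verbs=(), **params):
    """Assemble a ClustalW-style argument list (nothing is executed)."""
    # The executable name "clustalw2" is an assumption; adjust to your install.
    args = ["clustalw2", f"-INFILE={infile}"]
    # Verbs are bare switches such as -ALIGN or -TREE.
    args += [f"-{verb.upper()}" for verb in verbs]
    # Parameters: True means a bare switch, anything else becomes -NAME=value.
    for name, value in params.items():
        flag = f"-{name.upper()}"
        args.append(flag if value is True else f"{flag}={value}")
    return args


cmd = build_clustalw_args(
    "seqs.fasta",          # hypothetical input file
    verbs=["align"],
    type="PROTEIN",
    output="PHYLIP",
    gapopen=10.0,
    gapext=0.2,
    quiet=True,
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run if the program is installed.
Keeping the options in a list rather than one string avoids shell quoting
problems with file names.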
 673
 674
 675 >> HELP 0 <<             Help for tree output format options
 676
 677 Four output formats are offered: 1) Clustal, 2) Phylip, 3) Just the distances,
 678 4) Nexus.
 679
 680 None of these formats displays the results graphically. Many packages can
 681 display trees in the PHYLIP format (2 below). It can also be imported into
 682 the PHYLIP programs RETREE, DRAWTREE and DRAWGRAM for graphical display.
 683 NEXUS format trees can be read by PAUP and MacClade.
 684
 685 1) Clustal format output.
 686 This format is verbose and lists all of the distances between the sequences and
 687 the number of alignment positions used for each. The tree is described at the
 688 end of the file. It lists the sequences that are joined at each alignment step
 689 and the branch lengths. After two sequences are joined, the pair is referred to
 690 later as a NODE. The number of a NODE is the number of the lowest sequence in
 691 that NODE.
 692
 693 2) Phylip format output.
 694 This format is the New Hampshire format, used by many phylogenetic analysis
 695 packages. It consists of a series of nested parentheses, describing the
 696 branching order, with the sequence names and branch lengths. It can be used by
 697 the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the
 698 trees graphically. This is the same format used during multiple alignment for
 699 the guide trees.
 700
 701 Use this format with NJplot (Manolo Gouy), supplied with Clustal W. Some other
 702 packages that can read and display New Hampshire format are TreeView (Mac/PC),
 703 TreeTool (UNIX), and Phylowin.
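
To make the nested-parentheses layout concrete, the following Python sketch
reads a New Hampshire string into nested tuples. It is a simplified reader
written for this example, not the parser used by Clustal W: it assumes
unquoted names, no comments and a terminating semicolon. The function names
and the sample tree are hypothetical.

```python
def parse_newick(text):
    """Parse a New Hampshire string into (name, length, children) tuples."""
    text = "".join(text.split())          # tree files may wrap over lines
    if not text.endswith(";"):
        text += ";"
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]            # empty for unnamed internal nodes
        length = None
        if text[pos] == ":":
            pos += 1                      # consume ":"
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    return node()


def leaf_names(tree):
    """Return the sequence names at the tips, in file order."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


tree = parse_newick("(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);")
print(leaf_names(tree))
```

Each tuple holds the name (empty for internal nodes), the branch length (None
when absent) and the list of child nodes, mirroring the nesting in the file.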
 704
 705 3) The distances only.
 706 This format just outputs a matrix of all the pairwise distances in a format
 707 that can be used by the Phylip package. It used to be useful when one could not
 708 produce distances from protein sequences in the Phylip package but is now
 709 redundant (Protdist of Phylip 3.5 now does this).
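
For readers who want to load such a matrix, here is a short Python sketch. It
assumes the usual Phylip square layout: a first line with the number of
sequences, then for each sequence a name followed by its distances, with names
that contain no spaces. The function name and sample values are invented for
the example.

```python
def parse_phylip_distances(text):
    """Read a square Phylip distance matrix into (names, matrix)."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0])
    # Work on tokens so that rows wrapped over several lines still parse.
    tokens = " ".join(lines[1:]).split()
    names, matrix = [], []
    for i in range(n):
        record = tokens[i * (n + 1):(i + 1) * (n + 1)]
        names.append(record[0])
        matrix.append([float(value) for value in record[1:]])
    return names, matrix


sample = """\
    3
A   0.000 0.100 0.200
B   0.100 0.000 0.300
C   0.200 0.300 0.000
"""
names, matrix = parse_phylip_distances(sample)
print(names, matrix[1][2])
```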
 710
 711 4) NEXUS FORMAT TREE. This format is used by several popular phylogeny programs,
 712 including PAUP and MacClade. The format is described fully in:
 713 Maddison, D. R., D. L. Swofford and W. P. Maddison.  1997.
 714 NEXUS: an extensible file format for systematic information.
 715 Systematic Biology 46:590-621.
 716
 717 5) TOGGLE PHYLIP BOOTSTRAP POSITIONS
 718 By default, the bootstrap values are placed on the nodes of the phylip format
 719 output tree. This is inaccurate as the bootstrap values should be associated
 720 with the tree branches and not the nodes. However, this format can be read and
 721 displayed by TreeTool, TreeView and Phylowin. An option is available to
 722 correctly place the bootstrap values on the branches with which they are
 723 associated.