getting rid of ambiguous documentation folder. all prog_doc are now stored in website...

diff --git a/binaries/help/clustalw-help.txt b/binaries/help/clustalw-help.txt
deleted file mode 100644
index fd50671..0000000
--- a/binaries/help/clustalw-help.txt
+++ /dev/null
@@ -1,720 +0,0 @@
-\r
-\r
-\r
- CLUSTAL 2.0.12 Multiple Sequence Alignments\r
-\r
-\r
-\r
-\r
->> HELP NEW <<             NEW FEATURES/OPTIONS\r
-\r
-==UPGMA== \r
- The UPGMA algorithm has been added to allow faster tree construction. The user now\r
- has the choice of using Neighbour Joining or UPGMA. The default is still NJ, but the\r
- user can change this by setting the clustering parameter.\r
- \r
- -CLUSTERING=   :NJ or UPGMA\r
- \r
-==ITERATION==\r
-\r
- A remove-first iteration scheme has been added. This can be used to improve the final\r
- alignment or improve the alignment at each stage of the progressive alignment. During the\r
- iteration step each sequence is removed in turn and realigned. If the resulting alignment\r
- is better than the previous alignment it is kept. This process is repeated until the score\r
- converges (the score is not improved) or until the maximum number of iterations is\r
- reached. The user can iterate at each step of the progressive alignment by setting the\r
- iteration parameter to TREE or just on the final alignment by setting the iteration\r
- parameter to ALIGNMENT. The default is no iteration. The maximum number of iterations can\r
- be set using the numiter parameter. The default number of iterations is 3.\r
-  \r
- -ITERATION=    :NONE or TREE or ALIGNMENT\r
- \r
- -NUMITER=n     :Maximum number of iterations to perform\r
- \r
-==HELP==\r
- \r
- -FULLHELP      :Print out the complete help content\r
- \r
-==MISC==\r
-\r
- -MAXSEQLEN=n   :Maximum allowed sequence length\r
- \r
- -QUIET         :Reduce console output to minimum\r
- \r
- -STATS=file    :Log some alignment statistics to file\r
-\r
-\r
->> HELP 1 <<             General help for CLUSTAL W (2.0.12)\r
-\r
-Clustal W is a general purpose multiple alignment program for DNA or proteins.\r
-\r
-SEQUENCE INPUT:  all sequences must be in 1 file, one after another.  \r
-7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT, \r
-Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file.\r
-All non-alphabetic characters (spaces, digits, punctuation marks) are ignored\r
-except "-" which is used to indicate a GAP ("." in MSF-RSF).  \r
-\r
-To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to \r
-INPUT them; go to menu item 2 to do the multiple alignment.\r
-\r
-PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments.  Use this to\r
-add a new sequence to an old alignment, or to use secondary structure to guide \r
-the alignment process.  GAPS in the old alignments are indicated using the "-" \r
-character.   PROFILES can be input in ANY of the allowed formats; just \r
-use "-" (or "." for MSF-RSF) for each gap position.\r
-\r
-PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in\r
-with "-" characters to indicate gaps) OR after a multiple alignment while the \r
-alignment is still in memory.\r
-\r
-\r
-The program tries to automatically recognise the different file formats used\r
-and to guess whether the sequences are amino acid or nucleotide.  This is not\r
-always foolproof.\r
-\r
-FASTA and NBRF-PIR formats are recognised by having a ">" as the first \r
-character in the file.  \r
-\r
-EMBL-Swiss Prot formats are recognised by the letters\r
-ID at the start of the file (the token for the entry name field).  \r
-\r
-CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.\r
-\r
-GCG-MSF format is recognised by one of the following:\r
-       - the word PileUp at the start of the file. \r
-       - the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT\r
-         at the start of the file.\r
-       - the word MSF on the first line of the file, and the characters ..\r
-         at the end of this line.\r
-\r
-GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of\r
-the file.\r
-\r
-\r
-If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the\r
-sequence will be assumed to be nucleotide.  This works in 97.3% of cases\r
-but watch out!\r
-\r
-\r
->> HELP 2 <<             Help for multiple alignments\r
-\r
-If you have already loaded sequences, use menu item 1 to do the complete\r
-multiple alignment.  You will be prompted for 2 output files: 1 for the \r
-alignment itself; another to store a dendrogram that describes the similarity\r
-of the sequences to each other.\r
-\r
-Multiple alignments are carried out in 3 stages (automatically done from menu\r
-item 1 ...Do complete multiple alignments now):\r
-\r
-1) all sequences are compared to each other (pairwise alignments);\r
-\r
-2) a dendrogram (like a phylogenetic tree) is constructed, describing the\r
-approximate groupings of the sequences by similarity (stored in a file).\r
-\r
-3) the final multiple alignment is carried out, using the dendrogram as a guide.\r
-\r
-\r
-PAIRWISE ALIGNMENT parameters control the speed-sensitivity of the initial\r
-alignments.\r
-\r
-MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.\r
-\r
-\r
-RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences\r
-during multiple alignment if you wish to change the parameters and try again.\r
-This only takes effect just before you do a second multiple alignment.  You\r
-can make phylogenetic trees after alignment whether or not this is ON.\r
-If you turn this OFF, the new gaps are kept even if you do a second multiple\r
-alignment. This allows you to iterate the alignment gradually.  Sometimes, the \r
-alignment is improved by a second or third pass.\r
-\r
-SCREEN DISPLAY (menu item 8) can be used to send the output alignments to the \r
-screen as well as to the output file.\r
-\r
-You can skip the first stages (pairwise alignments; dendrogram) by using an\r
-old dendrogram file (menu item 3); or you can just produce the dendrogram\r
-with no final multiple alignment (menu item 2).\r
-\r
-\r
-OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 7 \r
-different alignment formats (CLUSTAL, GCG, NBRF-PIR, PHYLIP, GDE, NEXUS, and FASTA).  \r
-\r
-\r
-\r
->> HELP 3 <<             Help for pairwise alignment parameters\r
-\r
-A distance is calculated between every pair of sequences and these are used to\r
-construct the dendrogram which guides the final multiple alignment. The scores\r
-are calculated from separate pairwise alignments. These can be calculated using\r
-2 methods: dynamic programming (slow but accurate) or by the method of Wilbur\r
-and Lipman (extremely fast but approximate). \r
-\r
-You can choose between the 2 alignment methods using menu option 8.  The\r
-slow-accurate method is fine for short sequences but will be VERY SLOW for \r
-many (e.g. >100) long (e.g. >1000 residue) sequences.   \r
-\r
-SLOW-ACCURATE alignment parameters:\r
-       These parameters do not have any effect on the speed of the alignments. \r
-They are used to give initial alignments which are then rescored to give percent\r
-identity scores.  These % scores are the ones which are displayed on the \r
-screen.  The scores are converted to distances for the trees.\r
-\r
-1) Gap Open Penalty:      the penalty for opening a gap in the alignment.\r
-2) Gap extension penalty: the penalty for extending a gap by 1 residue.\r
-3) Protein weight matrix: the scoring table which describes the similarity\r
-                          of each amino acid to each other.\r
-4) DNA weight matrix:     the scores assigned to matches and mismatches \r
-                          (including IUB ambiguity codes).\r
-\r
-\r
-FAST-APPROXIMATE alignment parameters:\r
-\r
-These similarity scores are calculated from fast, approximate, global\r
-alignments, which are controlled by 4 parameters.   2 techniques are used to\r
-make these alignments very fast: 1) only exactly matching fragments (k-tuples)\r
-are considered; 2) only the 'best' diagonals (the ones with most k-tuple\r
-matches) are used.\r
-\r
-K-TUPLE SIZE:  This is the size of exactly matching fragment that is used. \r
-INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.\r
-For longer sequences (e.g. >1000 residues) you may need to increase the default.\r
-\r
-GAP PENALTY:   This is a penalty for each gap in the fast alignments.  It has\r
-little effect on the speed or sensitivity except for extreme values.\r
-\r
-TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary\r
-dot-matrix plot) is calculated.  Only the best ones (with most matches) are\r
-used in the alignment.  This parameter specifies how many.  Decrease for speed;\r
-increase for sensitivity.\r
-\r
-WINDOW SIZE:  This is the number of diagonals around each of the 'best' \r
-diagonals that will be used.  Decrease for speed; increase for sensitivity.\r
-\r
-\r
->> HELP 4 <<             Help for multiple alignment parameters\r
-\r
-These parameters control the final multiple alignment. This is the core of the\r
-program and the details are complicated. To fully understand the use of the\r
-parameters and the scoring system, you will have to refer to the documentation.\r
-\r
-Each step in the final multiple alignment consists of aligning two alignments \r
-or sequences.  This is done progressively, following the branching order in \r
-the GUIDE TREE.  The basic parameters to control this are two gap penalties and\r
-the scores for various identical-non-identical residues.  \r
-\r
-1) and 2) The GAP PENALTIES are set by menu items 1 and 2. These control the \r
-cost of opening up every new gap and the cost of every item in a gap. \r
-Increasing the gap opening penalty will make gaps less frequent. Increasing \r
-the gap extension penalty will make gaps shorter. Terminal gaps are not \r
-penalised.\r
-\r
-3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most\r
-distantly related sequences until after the most closely related sequences have \r
-been aligned.   The setting shows the percent identity level required to delay\r
-the addition of a sequence; sequences that are less identical than this level\r
-to any other sequences will be aligned later.\r
-\r
-\r
-\r
-4) The TRANSITION WEIGHT gives transitions (A <--> G or C <--> T \r
-i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0\r
-and 1; a weight of zero means that the transitions are scored as mismatches,\r
-while a weight of 1 gives the transitions the match score. For distantly related\r
-DNA sequences, the weight should be near to zero; for closely related sequences\r
-it can be useful to assign a higher score.\r
-\r
-\r
-5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a choice of\r
-weight matrices. The default for proteins in version 1.8 is the PAM series \r
-derived by Gonnet and colleagues. Note, a series is used! The actual matrix\r
-that is used depends on how similar the sequences being aligned are at this \r
-alignment step. Different matrices work differently at each evolutionary\r
-distance. \r
-\r
-6) DNA WEIGHT MATRIX leads to a new menu where a single matrix (not a series)\r
-can be selected. The default is the matrix used by BESTFIT for comparison of\r
-nucleic acid sequences.\r
-\r
-Further help is offered in the weight matrix menu.\r
-\r
-\r
-7)  In the weight matrices, you can use negative as well as positive values if\r
-you wish, although the matrix will be automatically adjusted to all positive\r
-scores, unless the NEGATIVE MATRIX option is selected.\r
-\r
-8) PROTEIN GAP PARAMETERS displays a menu allowing you to set some Gap Penalty\r
-options which are only used in protein alignments.\r
-\r
-\r
->> HELP A <<             Help for protein gap parameters.\r
-\r
-1) RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce\r
-or increase the gap opening penalties at each position in the alignment or\r
-sequence.  See the documentation for details.  As an example, positions that \r
-are rich in glycine are more likely to have an adjacent gap than positions that\r
-are rich in valine.\r
-\r
-2) 3) HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within\r
-a run (5 or more residues) of hydrophilic amino acids; these are likely to\r
-be loop or random coil regions where gaps are more common.  The residues that \r
-are "considered" to be hydrophilic are set by menu item 3.\r
-\r
-4) GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too\r
-close to each other. Gaps that are less than this distance apart are penalised\r
-more than other gaps. This does not prevent close gaps; it makes them less\r
-frequent, promoting a block-like appearance of the alignment.\r
-\r
-5) END GAP SEPARATION treats end gaps just like internal gaps for the purposes\r
-of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above).\r
-If you turn this off, end gaps will be ignored for this purpose.  This is\r
-useful when you wish to align fragments where the end gaps are not biologically\r
-meaningful.\r
-\r
-\r
->> HELP 5 <<             Help for output format options.\r
-\r
-Six output formats are offered. You can choose any (or all 6 if you wish).  \r
-\r
-CLUSTAL format output is a self explanatory alignment format.  It shows the\r
-sequences aligned in blocks.  It can be read in again at a later date to\r
-(for example) calculate a phylogenetic tree or add a new sequence with a \r
-profile alignment.\r
-\r
-GCG output can be used by any of the GCG programs that can work on multiple\r
-alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN).  It is the same as the GCG\r
-.msf format files (multiple sequence file); new in version 7 of GCG.\r
-\r
-PHYLIP format output can be used for input to the PHYLIP package of Joe \r
-Felsenstein.  This is an extremely widely used package for doing every \r
-imaginable form of phylogenetic analysis (MUCH more than the modest\r
-introduction offered by this program).\r
-\r
-NBRF-PIR:  this is the same as the standard PIR format with ONE ADDITION.  Gap\r
-characters "-" are used to indicate the positions of gaps in the multiple \r
-alignment.  These files can be re-used as input in any part of clustal that\r
-allows sequences (or alignments or profiles) to be read in.  \r
-\r
-GDE:  this is the flat file format used by the GDE package of Steven Smith.\r
-\r
-NEXUS: the format used by several phylogeny programs, including PAUP and\r
-MacClade.\r
-\r
-GDE OUTPUT CASE: sequences in GDE format may be written in either upper or\r
-lower case.\r
-\r
-CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the\r
-alignment lines in clustalw format.\r
-\r
-OUTPUT ORDER is used to control the order of the sequences in the output\r
-alignments.  By default, the order corresponds to the order in which the\r
-sequences were aligned (from the guide tree-dendrogram), thus automatically\r
-grouping closely related sequences. This switch can be used to set the order\r
-to the same as the input file.\r
-\r
-PARAMETER OUTPUT: This option allows you to save all your parameter settings\r
-in a parameter file. This file can be used subsequently to rerun Clustal W\r
-using the same parameters.\r
-\r
-\r
->> HELP 6 <<             Help for profile and structure alignments\r
-\r
-By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile \r
-alignments allow you to store alignments of your favourite sequences and add\r
-new sequences to them in small bunches at a time. A profile is simply an\r
-alignment of one or more sequences (e.g. an alignment output file from CLUSTAL\r
-W). Each input can be a single sequence. One or both sets of input sequences\r
-may include secondary structure assignments or gap penalty masks to guide the\r
-alignment. \r
-\r
-The profiles can be in any of the allowed input formats with "-" characters\r
-used to specify gaps (except for MSF-RSF where "." is used).\r
-\r
-You have to specify the 2 profiles by choosing menu items 1 and 2 and giving\r
-2 file names.  Then Menu item 3 will align the 2 profiles to each other. \r
-Secondary structure masks in either profile can be used to guide the alignment.\r
-\r
-Menu item 4 will take the sequences in the second profile and align them to\r
-the first profile, 1 at a time.  This is useful to add some new sequences to\r
-an existing alignment, or to align a set of sequences to a known structure.  \r
-In this case, the second profile would not be pre-aligned.\r
-\r
-\r
-The alignment parameters can be set using menu items 5, 6 and 7. These are\r
-EXACTLY the same parameters as used by the general, automatic multiple\r
-alignment procedure. The general multiple alignment procedure is simply a\r
-series of profile alignments. Carrying out a series of profile alignments on\r
-larger and larger groups of sequences, allows you to manually build up a\r
-complete alignment, if necessary editing intermediate alignments.\r
-\r
-SECONDARY STRUCTURE OPTIONS. Menu Option 0 allows you to set 2D structure\r
-parameters. If a solved structure is available, it can be used to guide the \r
-alignment by raising gap penalties within secondary structure elements, so \r
-that gaps will preferentially be inserted into unstructured surface loops.\r
-Alternatively, a user-specified gap penalty mask can be supplied directly.\r
-\r
-A gap penalty mask is a series of numbers between 1 and 9, one per position in \r
-the alignment. Each number specifies how much the gap opening penalty is to be \r
-raised at that position (raised by multiplying the basic gap opening penalty\r
-by the number) i.e. a mask figure of 1 at a position means no change\r
-in gap opening penalty; a figure of 4 means that the gap opening penalty is\r
-four times greater at that position, making gaps 4 times harder to open.\r
-\r
-The format for gap penalty masks and secondary structure masks is explained\r
-in the help under option 0 (secondary structure options).\r
-\r
-\r
->> HELP B <<             Help for secondary structure - gap penalty masks\r
-\r
-The use of secondary structure-based penalties has been shown to improve the\r
-accuracy of multiple alignment. Therefore CLUSTAL W now allows gap penalty \r
-masks to be supplied with the input sequences. The masks work by raising gap \r
-penalties in specified regions (typically secondary structure elements) so that\r
-gaps are preferentially opened in the less well conserved regions (typically \r
-surface loops).\r
-\r
-Options 1 and 2 control whether the input secondary structure information or\r
-gap penalty masks will be used.\r
-\r
-Option 3 controls whether the secondary structure and gap penalty masks should\r
-be included in the output alignment.\r
-\r
-Options 4 and 5 provide the value for raising the gap penalty at core Alpha \r
-Helical (A) and Beta Strand (B) residues. In CLUSTAL format, capital residues \r
-denote the A and B core structure notation. The basic gap penalties are\r
-multiplied by the amount specified.\r
-\r
-Option 6 provides the value for the gap penalty in Loops. By default this \r
-penalty is not raised. In CLUSTAL format, loops are specified by "." in the \r
-secondary structure notation.\r
-\r
-Option 7 provides the value for setting the gap penalty at the ends of \r
-secondary structures. Ends of secondary structures are observed to grow \r
-and-or shrink in related structures. Therefore by default these are given \r
-intermediate values, lower than the core penalties. All secondary structure \r
-read in as lower case in CLUSTAL format gets the reduced terminal penalty.\r
-\r
-Options 8 and 9 specify the range of structure termini for the intermediate \r
-penalties. In the alignment output, these are indicated as lower case. \r
-For Alpha Helices, by default, the range spans the end helical turn. For \r
-Beta Strands, the default range spans the end residue and the adjacent loop \r
-residue, since sequence conservation often extends beyond the actual H-bonded\r
-Beta Strand.\r
-\r
-CLUSTAL W can read the masks from SWISS-PROT, CLUSTAL or GDE format input\r
-files. For many 3-D protein structures, secondary structure information is\r
-recorded in the feature tables of SWISS-PROT database entries. You should\r
-always check that the assignments are correct - some are quite inaccurate.\r
-CLUSTAL W looks for SWISS-PROT HELIX and STRAND assignments e.g.\r
-\r
-FT   HELIX       100    115\r
-FT   STRAND      118    119\r
-\r
-The structure and penalty masks can also be read from CLUSTAL alignment format \r
-as comment lines beginning "!SS_" or "!GM_" e.g.\r
-\r
-!SS_HBA_HUMA    ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA\r
-!GM_HBA_HUMA    112224444444444222122244444444442222224222111111111222444444\r
-HBA_HUMA        VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK\r
-\r
-Note that the mask itself is a set of numbers between 1 and 9 each of which is \r
-assigned to the residue(s) in the same column below. \r
-\r
-In GDE flat file format, the masks are specified as text and the names must\r
-begin with "SS_ or "GM_.\r
-\r
-Either a structure or penalty mask or both may be used. If both are included in\r
-an alignment, the user will be asked which is to be used.\r
-\r
-\r
->> HELP C <<             Help for secondary structure - gap penalty mask output options\r
-\r
-The options in this menu let you choose whether or not to include the masks\r
-in the CLUSTAL W output alignments. Showing both is useful for understanding\r
-how the masks work. The secondary structure information is itself very useful\r
-in judging the alignment quality and in seeing how residue conservation\r
-patterns vary with secondary structure.\r
-\r
-\r
->> HELP 7 <<             Help for phylogenetic trees\r
-\r
-1) Before calculating a tree, you must have an ALIGNMENT in memory. This can be\r
-input in any format, or it can be the result of a full multiple alignment that\r
-you have just carried out and that is still in memory. \r
-\r
-\r
-*************** Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!! ***************\r
-\r
-\r
-The methods used are NJ (Neighbour Joining) and UPGMA. First\r
-you calculate distances (percent divergence) between all pairs of sequences from\r
-a multiple alignment; second you apply the NJ or UPGMA method to the distance matrix.\r
-\r
-2) EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where\r
-ANY of the sequences have a gap will be ignored. This means that 'like' will be\r
-compared to 'like' in all distances, which is highly desirable. It also\r
-automatically throws away the most ambiguous parts of the alignment, which are\r
-concentrated around gaps (usually). The disadvantage is that you may throw away\r
-much of the data if there are many gaps (which is why it is difficult for us to\r
-make it the default).  \r
-\r
-\r
-\r
-3) CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this\r
-option makes no difference. For greater divergence, it corrects for the fact\r
-that observed distances underestimate actual evolutionary distances. This is\r
-because, as sequences diverge, more than one substitution will happen at many\r
-sites. However, you only see one difference when you look at the present day\r
-sequences. Therefore, this option has the effect of stretching branch lengths\r
-in trees (especially long branches). The corrections used here (for DNA or\r
-proteins) are both due to Motoo Kimura. See the documentation for details.  \r
-\r
-Where possible, this option should be used. However, for VERY divergent\r
-sequences, the distances cannot be reliably corrected. You will be warned if\r
-this happens. Even if none of the distances in a data set exceed the reliable\r
-threshold, if you bootstrap the data, some of the bootstrap distances may\r
-randomly exceed the safe limit.  \r
-\r
-4) To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED\r
-tree and all branch lengths. The root of the tree can only be inferred by\r
-using an outgroup (a sequence that you are certain branches at the outside\r
-of the tree .... certain on biological grounds) OR if you assume a degree\r
-of constancy in the 'molecular clock', you can place the root in the 'middle'\r
-of the tree (roughly equidistant from all tips).\r
-\r
-5) TOGGLE PHYLIP BOOTSTRAP POSITIONS\r
-By default, the bootstrap values are correctly placed on the tree branches of\r
-the phylip format output tree. The toggle allows them to be placed on the\r
-nodes, which is incorrect, but some display packages (e.g. TreeTool, TreeView\r
-and Phylowin) only support node labelling but not branch labelling. Care\r
-should be taken to note which branches and labels go together.\r
-\r
-6) OUTPUT FORMATS: four different formats are allowed. None of these displays\r
-the tree visually. Useful display programs accepting PHYLIP format include\r
-NJplot (from Manolo Gouy and supplied with Clustal W), TreeView (Mac-PC), and\r
-PHYLIP itself - OR get the PHYLIP package and use the tree drawing facilities\r
-there. (Get the PHYLIP package anyway if you are interested in trees). The\r
-NEXUS format can be read into PAUP or MacClade.\r
-\r
-\r
->> HELP 8 <<             Help for choosing a weight matrix\r
-\r
-For protein alignments, you use a weight matrix to determine the similarity of\r
-non-identical amino acids.  For example, Tyr aligned with Phe is usually judged \r
-to be 'better' than Tyr aligned with Pro.\r
-\r
-There are three 'in-built' series of weight matrices offered. Each consists of\r
-several matrices which work differently at different evolutionary distances. To\r
-see the exact details, read the documentation. Crudely, we store several\r
-matrices in memory, spanning the full range of amino acid distance (from almost\r
-identical sequences to highly divergent ones). For very similar sequences, it\r
-is best to use a strict weight matrix which only gives a high score to\r
-identities and the most favoured conservative substitutions. For more divergent\r
-sequences, it is appropriate to use "softer" matrices which give a high score\r
-to many other frequent substitutions.\r
-\r
-1) BLOSUM (Henikoff). These matrices appear to be the best available for \r
-carrying out database similarity (homology) searches. The matrices used are:\r
-Blosum 80, 62, 45 and 30. (BLOSUM was the default in earlier Clustal W\r
-versions)\r
-\r
-2) PAM (Dayhoff). These have been extremely widely used since the late '70s.\r
-We use the PAM 20, 60, 120 and 350 matrices.\r
-\r
-3) GONNET. These matrices were derived using almost the same procedure as the\r
-Dayhoff one (above) but are much more up to date and are based on a far larger\r
-data set. They appear to be more sensitive than the Dayhoff series. We use the\r
-GONNET 80, 120, 160, 250 and 350 matrices. This series is the default for\r
-Clustal W version 1.8.\r
-\r
-We also supply an identity matrix which gives a score of 1.0 to two identical \r
-amino acids and a score of zero otherwise. This matrix is not very useful.\r
-Alternatively, you can read in your own (just one matrix, not a series).\r
-\r
-A new matrix can be read from a file on disk, if the filename consists only\r
-of lower case characters. The values in the new weight matrix must be integers\r
-and the scores should be similarities. You can use negative as well as positive\r
-values if you wish, although the matrix will be automatically adjusted to all\r
-positive scores.\r
-\r
-\r
-\r
-For DNA, a single matrix (not a series) is used. Two hard-coded matrices are \r
-available:\r
-\r
-\r
-1) IUB. This is the default scoring matrix used by BESTFIT for the comparison\r
-of nucleic acid sequences. X's and N's are treated as matches to any IUB\r
-ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.\r
- \r
- \r
-2) CLUSTALW(1.6). The previous system used by Clustal W, in which matches score\r
-1.0 and mismatches score 0. All matches for IUB symbols also score 0.\r
-\r
-INPUT FORMAT  The format used for a new matrix is the same as the BLAST program.\r
-Any lines beginning with a # character are assumed to be comments. The first\r
-non-comment line should contain a list of amino acids in any order, using the\r
-1 letter code, followed by a * character. This should be followed by a square\r
-matrix of integer scores, with one row and one column for each amino acid. The\r
-last row and column of the matrix (corresponding to the * character) contain\r
-the minimum score over the whole matrix.\r
-\r
-\r
->> HELP 9 <<             Help for command line parameters\r
-\r
-                DATA (sequences)\r
-\r
--INFILE=file.ext                             :input sequences.\r
--PROFILE1=file.ext  and  -PROFILE2=file.ext  :profiles (old alignment).\r
-\r
-\r
-                VERBS (do things)\r
-\r
--OPTIONS            :list the command line parameters\r
--HELP  or -CHECK    :outline the command line params.\r
--FULLHELP           :output full help content.\r
--ALIGN              :do full multiple alignment.\r
--TREE               :calculate NJ tree.\r
--PIM                :output percent identity matrix (while calculating the tree)\r
--BOOTSTRAP(=n)      :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).\r
--CONVERT            :output the input sequences in a different file format.\r
-\r
-\r
-                PARAMETERS (set things)\r
-\r
-***General settings:****\r
--INTERACTIVE :read command line, then enter normal interactive menus\r
--QUICKTREE   :use FAST algorithm for the alignment guide tree\r
--TYPE=       :PROTEIN or DNA sequences\r
--NEGATIVE    :protein alignment with negative values in matrix\r
--OUTFILE=    :sequence alignment file name\r
--OUTPUT=     :GCG, GDE, PHYLIP, PIR or NEXUS\r
--OUTORDER=   :INPUT or ALIGNED\r
--CASE        :LOWER or UPPER (for GDE output only)\r
--SEQNOS=     :OFF or ON (for Clustal output only)\r
--SEQNO_RANGE=:OFF or ON (NEW: for all output formats)\r
--RANGE=m,n   :sequence range to write starting m to m+n\r
--MAXSEQLEN=n :maximum allowed input sequence length\r
--QUIET       :Reduce console output to minimum\r
--STATS=      :Log some alignment statistics to file\r
-\r
-***Fast Pairwise Alignments:***\r
--KTUPLE=n    :word size\r
--TOPDIAGS=n  :number of best diags.\r
--WINDOW=n    :window around best diags.\r
--PAIRGAP=n   :gap penalty\r
--SCORE       :PERCENT or ABSOLUTE\r
-\r
-\r
-***Slow Pairwise Alignments:***\r
--PWMATRIX=    :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename\r
--PWDNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename\r
--PWGAPOPEN=f  :gap opening penalty        \r
--PWGAPEXT=f   :gap extension penalty\r
-\r
-\r
-***Multiple Alignments:***\r
--NEWTREE=      :file for new guide tree\r
--USETREE=      :file for old guide tree\r
--MATRIX=       :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename\r
--DNAMATRIX=    :DNA weight matrix=IUB, CLUSTALW or filename\r
--GAPOPEN=f     :gap opening penalty        \r
--GAPEXT=f      :gap extension penalty\r
--ENDGAPS       :no end gap separation pen. \r
--GAPDIST=n     :gap separation pen. range\r
--NOPGAP        :residue-specific gaps off  \r
--NOHGAP        :hydrophilic gaps off\r
--HGAPRESIDUES= :list hydrophilic res.    \r
--MAXDIV=n      :% ident. for delay\r
--TYPE=         :PROTEIN or DNA\r
--TRANSWEIGHT=f :transitions weighting\r
--ITERATION=    :NONE or TREE or ALIGNMENT\r
--NUMITER=n     :maximum number of iterations to perform\r
--NOWEIGHTS     :disable sequence weighting\r
-\r
-\r
-***Profile Alignments:***\r
--PROFILE      :Merge two alignments by profile alignment\r
--NEWTREE1=    :file for new guide tree for profile1\r
--NEWTREE2=    :file for new guide tree for profile2\r
--USETREE1=    :file for old guide tree for profile1\r
--USETREE2=    :file for old guide tree for profile2\r
-\r
-\r
-***Sequence to Profile Alignments:***\r
--SEQUENCES   :Sequentially add profile2 sequences to profile1 alignment\r
--NEWTREE=    :file for new guide tree\r
--USETREE=    :file for old guide tree\r
-\r
-\r
-***Structure Alignments:***\r
--NOSECSTR1     :do not use secondary structure-gap penalty mask for profile 1 \r
--NOSECSTR2     :do not use secondary structure-gap penalty mask for profile 2\r
--SECSTROUT=STRUCTURE or MASK or BOTH or NONE   :output in alignment file\r
--HELIXGAP=n    :gap penalty for helix core residues \r
--STRANDGAP=n   :gap penalty for strand core residues\r
--LOOPGAP=n     :gap penalty for loop regions\r
--TERMINALGAP=n :gap penalty for structure termini\r
--HELIXENDIN=n  :number of residues inside helix to be treated as terminal\r
--HELIXENDOUT=n :number of residues outside helix to be treated as terminal\r
--STRANDENDIN=n :number of residues inside strand to be treated as terminal\r
--STRANDENDOUT=n:number of residues outside strand to be treated as terminal \r
-\r
-\r
-***Trees:***\r
--OUTPUTTREE=nj OR phylip OR dist OR nexus\r
--SEED=n        :seed number for bootstraps.\r
--KIMURA        :use Kimura's correction.   \r
--TOSSGAPS      :ignore positions with gaps.\r
--BOOTLABELS=node OR branch :position of bootstrap values in tree display\r
--CLUSTERING=   :NJ or UPGMA\r
-\r
-\r
->> HELP 0 <<             Help for tree output format options\r
-\r
-Four output formats are offered: 1) Clustal, 2) Phylip, 3) Just the distances,\r
-4) Nexus.\r
-\r
-None of these formats displays the results graphically. Many packages can\r
-display trees in the PHYLIP format, 2) below. It can also be imported into\r
-the PHYLIP programs RETREE, DRAWTREE and DRAWGRAM for graphical display. \r
-NEXUS format trees can be read by PAUP and MacClade.\r
-\r
-1) Clustal format output. \r
-This format is verbose and lists all of the distances between the sequences and\r
-the number of alignment positions used for each. The tree is described at the\r
-end of the file. It lists the sequences that are joined at each alignment step\r
-and the branch lengths. After two sequences are joined, the pair is referred to\r
-later as a NODE. The number of a NODE is the number of the lowest-numbered\r
-sequence in that NODE.\r
-\r
-2) Phylip format output.\r
-This format is the New Hampshire format, used by many phylogenetic analysis\r
-packages. It consists of a series of nested parentheses, describing the\r
-branching order, with the sequence names and branch lengths. It can be used by\r
-the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the\r
-trees graphically. This is the same format used during multiple alignment for\r
-the guide trees. \r
-\r
-Use this format with NJplot (Manolo Gouy), supplied with Clustal W. Some other\r
-packages that can read and display New Hampshire format are TreeView (Mac/PC),\r
-TreeTool (UNIX), and Phylowin.\r
-\r
-3) The distances only.\r
-This format just outputs a matrix of all the pairwise distances in a format\r
-that can be used by the Phylip package. It used to be useful when one could not\r
-produce distances from protein sequences in the Phylip package but is now\r
-redundant (Protdist of Phylip 3.5 now does this).\r
-\r
-4) NEXUS format tree.\r
-This format is used by several popular phylogeny programs, including PAUP and\r
-MacClade. The format is described fully in:\r
-Maddison, D. R., D. L. Swofford and W. P. Maddison.  1997.\r
-NEXUS: an extensible file format for systematic information.\r
-Systematic Biology 46:590-621.\r
-\r
-5) TOGGLE PHYLIP BOOTSTRAP POSITIONS\r
-By default, the bootstrap values are placed on the nodes of the phylip format\r
-output tree. This is inaccurate as the bootstrap values should be associated\r
-with the tree branches and not the nodes. However, this format can be read and\r
-displayed by TreeTool, TreeView and Phylowin. An option is available to\r
-correctly place the bootstrap values on the branches with which they are\r
-associated.\r