   1 \r
   2 \r
   3 CLUSTAL-OMEGA is a general purpose multiple sequence alignment program\r
   4 for proteins.\r
   5 \r
   6 \r
   7 \r
   8 INTRODUCTION\r
   9 \r
  10 Clustal-Omega is a general purpose multiple sequence alignment (MSA)\r
  11 program for proteins. It produces high quality MSAs and is capable of\r
  12 handling data-sets of hundreds of thousands of sequences in reasonable\r
  13 time.\r
  14 \r
In default mode, users give a file of sequences to be aligned. These
are clustered to produce a guide tree, which is then used to guide a
"progressive alignment" of the sequences. There are also facilities
for aligning existing alignments to each other, for aligning a
sequence to an alignment, and for using a hidden Markov model (HMM) to
help guide an alignment of new sequences that are homologous to the
sequences used to make the HMM. This last procedure is referred to as
"external profile alignment" or EPA.
  23 \r
  24 Clustal-Omega uses HMMs for the alignment engine, based on the HHalign\r
  25 package from Johannes Soeding [1]. Guide trees are made using an\r
  26 enhanced version of mBed [2] which can cluster very large numbers of\r
  27 sequences in O(N*log(N)) time. Multiple alignment then proceeds by\r
  28 aligning larger and larger alignments using HHalign, following the\r
  29 clustering given by the guide tree.\r
  30 \r
In its current form Clustal-Omega can align only protein sequences,
not DNA/RNA sequences. It is envisioned that DNA/RNA alignment will
become available in a future version.
  34 \r
  35 \r
  36 \r
  37 SEQUENCE INPUT:\r
  38 \r
  39 -i, --in, --infile={<file>,-}\r
  40         Multiple sequence input file (- for stdin)\r
  41 \r
  42 --hmm-in=<file>\r
  43         HMM input files\r
  44 \r
  45 --dealign\r
  46         Dealign input sequences\r
  47 \r
  48 --profile1, --p1=<file>\r
  49         Pre-aligned multiple sequence file (aligned columns will be kept fixed)\r
  50 \r
  51 --profile2, --p2=<file>\r
  52         Pre-aligned multiple sequence file (aligned columns will be kept fixed)\r
  53 \r
  54 \r
  55 For sequence and profile input Clustal-Omega uses the Squid library\r
  56 from Sean Eddy [3].\r
  57 \r
  58 \r
  59 Clustal-Omega accepts 3 types of sequence input: (i) a sequence file\r
  60 with un-aligned or aligned sequences, (ii) profiles (a multiple\r
  61 alignment in a file) of aligned sequences, (iii) a HMM. Valid\r
  62 combinations of the above are:\r
  63 \r
  64 (a) one file with un-aligned or aligned sequences (i); the sequences\r
  65     will be aligned, and the alignment will be written out. For this\r
  66     mode use the -i flag. If the sequences are aligned (all sequences\r
  67     have the same length and at least one sequence has at least one\r
  68     gap), then the alignment is turned into a HMM, the sequences are\r
  69     de-aligned and the now un-aligned sequences are aligned using the\r
  70     HMM as an External Profile for External Profile Alignment (EPA).\r
  71     If no EPA is desired use the --dealign flag.\r
  72 \r
  73     Use the above option to make a multiple alignment from a set of\r
  74     sequences. A sequence file must contain more than one sequence (at\r
  75     least two sequences).\r
  76 \r
  77 (b) two profiles (ii)+(ii); the columns in each profile will be kept\r
  78     fixed and the alignment of the two profiles will be written\r
  79     out. Use the --p1 and --p2 flags for this mode.\r
  80 \r
  81     Use this option to align two alignments (profiles) together.\r
  82 \r
  83 (c) one file with un/aligned sequences (i) and one profile (ii); the\r
  84     profile is converted into a HMM and the un-aligned sequences will\r
  85     be multiply aligned (using the HMM background information) to form\r
  86     a profile; this constructed profile is aligned with the input\r
  87     profile; the columns in each profile (the original one and the one\r
  88     created from the un-aligned sequences) will be kept fixed and the\r
  89     alignment of the two profiles will be written out. Use the -i flag\r
  90     in conjunction with the --p1 flag for this mode.\r
  91       The un/aligned sequences file (i) must contain at least two\r
  92     sequences. If a single sequence has to be aligned with a profile\r
  93     the profile-profile option (b) has to be used.\r
  94 \r
  95     Use the above option to add new sequences to an existing\r
  96     alignment.\r
  97 \r
(d) one file with un-aligned sequences (i) and one HMM (iii); the
    un-aligned sequences will be aligned to form a profile, using the
    HMM as an External Profile. Only HMMer2 and HMMer3 formats are
    allowed. More than one HMM can be given; however, in the current
    version all but the first HMM will be ignored. Because only one
    HMM can be used at the moment, no HMM is produced from the input
    sequences if they are already aligned. The alignment will be
    written out; the HMM information is discarded. Use the -i flag in
    conjunction with the --hmm-in flag for this mode.
 107 \r
 108     Use this option to make a new multiple alignment of sequences from\r
 109     the input file and use the HMM as a guide (EPA).\r
 110 \r
 111 \r
 112 Invalid combinations of the above are:\r
 113 \r
 114 (v) an un/aligned sequence file containing just one sequence (i)\r
 115 \r
 116 (w) an un/aligned sequence file containing just one sequence and a profile\r
 117     (i)+(ii)\r
 118 \r
 119 (x) an un/aligned sequence file containing just one sequence and a HMM\r
 120     (i)+(iii)\r
 121 \r
 122 (y) two or more HMMs (iii)+(iii)+... cannot be aligned to one another.\r
 123 \r
 124 (z) one profile (ii) cannot be aligned with a HMM (iii)\r
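
The valid and invalid combinations above can be summarised as a small
decision function. This is only an illustrative sketch of the rules as
listed, not Clustal-Omega's own code; the function name input_mode and
its parameters are hypothetical.

```python
def input_mode(n_seqs=0, n_profiles=0, n_hmms=0):
    """Return the letter (a-d) of the valid combination, or raise ValueError.

    n_seqs     -- number of sequences in the -i file (0 if -i is not given)
    n_profiles -- number of profiles given via --p1/--p2
    n_hmms     -- number of HMMs given via --hmm-in
    """
    if n_profiles and n_hmms:
        raise ValueError("(z) a profile cannot be aligned with a HMM")
    if n_seqs == 0:
        if n_profiles == 2:
            return "b"  # two profiles: --p1 and --p2
        raise ValueError("no valid combination, e.g. (y) HMMs only")
    if n_seqs == 1:
        raise ValueError("(v)-(x) a sequence file needs at least two sequences")
    if n_profiles == 0 and n_hmms == 0:
        return "a"  # sequence file only: -i
    if n_profiles == 1:
        return "c"  # sequence file plus one profile: -i with --p1
    if n_hmms >= 1:
        return "d"  # sequence file plus HMM: -i with --hmm-in (first HMM used)
    raise ValueError("combination not described in the list above")


print(input_mode(n_seqs=25, n_profiles=1))
```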
 125 \r
 126 \r
 127 The following MSA file formats are allowed:\r
 128 \r
 129     a2m=fasta, (vienna)\r
 130     clustal,\r
 131     msf,\r
 132     phylip,\r
 133     selex,\r
 134     stockholm\r
 135 \r
 136 \r
 137 Prior to MSA, Clustal-Omega de-aligns all sequence input (i). However,\r
 138 alignment information is automatically converted into a HMM and used\r
 139 during MSA, unless the --dealign flag is specifically set.  Profiles\r
 140 (ii) are not de-aligned.\r
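
The test for "aligned" input described under (a), namely that all
sequences have the same length and at least one sequence has at least
one gap, can be sketched as follows. The function names and the set of
gap characters ("-" and ".") are assumptions made for illustration.

```python
def looks_aligned(seqs, gap_chars="-."):
    """True if all sequences have equal length and at least one has a gap."""
    if len(seqs) < 2:
        return False
    same_length = len({len(s) for s in seqs}) == 1
    has_gap = any(ch in gap_chars for s in seqs for ch in s)
    return same_length and has_gap


def dealign(seqs, gap_chars="-."):
    """Remove all gap characters, leaving the residues in their original order."""
    return ["".join(ch for ch in s if ch not in gap_chars) for s in seqs]


aligned = ["MKV-LA", "MKVALA", "M-VALA"]
print(looks_aligned(aligned), dealign(aligned))
```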
 141 \r
The Clustal-Omega alignment engine cannot currently process DNA/RNA.
If a sequence input file (i) or a profile (ii) is interpreted as
DNA/RNA, the program will terminate during the file input stage.
 145 \r
 146 \r
 147 \r
 148 CLUSTERING:\r
 149 \r
 150   --distmat-in=<file>\r
 151         Pairwise distance matrix input file (skips distance computation)\r
 152 \r
 153   --distmat-out=<file>\r
 154         Pairwise distance matrix output file\r
 155 \r
 156   --guidetree-in=<file>\r
 157         Guide tree input file\r
 158         (skips distance computation and guide tree clustering step)\r
 159 \r
 160   --guidetree-out=<file>\r
 161         Guide tree output file\r
 162 \r
 163   --full\r
 164         Use full distance matrix for guide-tree calculation (slow; mBed is default)\r
 165 \r
 166   --full-iter\r
 167         Use full distance matrix for guide-tree calculation during iteration (mBed is default)\r
 168 \r
 169 \r
In order to produce a multiple alignment Clustal-Omega requires a
guide tree, which defines the order in which sequences/profiles are
aligned. A guide tree in turn is constructed from a distance matrix.
Conventionally, this distance matrix comprises all the pair-wise
distances of the sequences. The distance measure Clustal-Omega uses
for pair-wise distances of un-aligned sequences is the k-tuple measure
[4], which was also implemented in Clustal 1.83 and ClustalW2 [5,6].
If the sequences input via -i are aligned, Clustal-Omega uses the
Kimura-corrected pairwise aligned identities [7].

The computational effort (time/memory) to calculate and store a full
distance matrix grows quadratically with the number of sequences.
Clustal-Omega can improve this scalability to N*log(N) by employing a
fast clustering algorithm called mBed [2]; this option is used by
default. If a full distance matrix evaluation is desired, the --full
flag has to be set. The mBed mode calculates a reduced set of
pair-wise distances. These distances are used in a k-means algorithm
that forms clusters of at most 100 sequences. For each cluster a full
distance matrix is calculated. No full distance matrix (of all input
sequences) is calculated in mBed mode. If there are fewer than 100
sequences in the input, then in effect a full distance matrix is
calculated in mBed mode; however, no distance matrix can be written
out (see below).
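
To give a feeling for the difference between quadratic and N*log(N)
growth, the sketch below compares the number of entries in a full
pairwise distance matrix with N*log2(N). The latter shows only the
order of growth; the constant factor of mBed is not stated here, so
the second column is not an actual count of distances computed.

```python
import math


def full_matrix_pairs(n):
    """Number of distinct pairwise distances among n sequences."""
    return n * (n - 1) // 2


def nlogn_order(n):
    """N*log2(N), the order of growth quoted for mBed (no constant factor)."""
    return n * math.log2(n)


for n in (100, 10_000, 100_000):
    print(f"{n:>7} sequences: {full_matrix_pairs(n):>13} pairs, "
          f"N*log2(N) = {nlogn_order(n):.0f}")
```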
 192 \r
 193 \r
Clustal-Omega uses Muscle's [8] fast UPGMA implementation to construct
its guide trees from the distance matrix. By default, the distance
matrix is used internally to construct the guide tree and is then
discarded. By specifying --distmat-out the internal distance matrix
can be written to file. This is only possible in --full mode; in mBed
mode no full distance matrix can be written out. The guide trees are
by default used internally to guide the multiple alignment and are
then discarded. By specifying the --guidetree-out option these
internal guide trees can be written out to file; this is possible in
both mBed and --full mode.

Conversely, the distance calculation and/or guide tree building stage
can be skipped by reading in a pre-calculated distance matrix and/or
pre-calculated guide tree. These options are invoked by specifying the
--distmat-in and/or --guidetree-in flags, respectively. However,
distance matrix reading is disabled in the current version.

By default, distance matrix and guide tree files are not over-written
if a file with the specified name already exists. In this case
Clustal-Omega aborts during the command-line processing stage. To
force over-writing of already existing files use the --force flag (see
MISCELLANEOUS).
 215 \r
Guide trees can be iterated to refine the alignment (see section
ITERATION). Clustal-Omega takes the alignment that was produced
initially and constructs a new distance matrix from this alignment.
 219 The distance measure used at this stage is the Kimura distance [7]. By\r
 220 default, Clustal-Omega constructs a reduced distance matrix at this\r
 221 stage using the mBed algorithm, which will then be used to create an\r
 222 improved (iterated) new guide tree. To turn off mBed-like clustering\r
 223 at this stage the --full-iter flag has to be set. While Kimura\r
 224 distances in general are much faster to calculate than k-tuple\r
 225 distances, time and memory requirements still scale quadratically with\r
 226 the number of sequences and --full-iter clustering should only be\r
 227 considered for smaller cases (<< 10,000 sequences).\r
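
The text does not spell out the Kimura distance formula. The sketch
below assumes the commonly quoted form of Kimura's correction for
protein sequences, d = -ln(1 - D - D*D/5), where D is the fraction of
differing residues over the aligned (non-gap) positions. Whether
Clustal-Omega computes it exactly this way, for example for very
divergent pairs, is not stated here; the function names are
hypothetical.

```python
import math


def fraction_different(a, b, gap_chars="-."):
    """Fraction of differing residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        raise ValueError("no column with residues in both sequences")
    return sum(x != y for x, y in pairs) / len(pairs)


def kimura_distance(a, b):
    """Assumed Kimura correction: d = -ln(1 - D - D*D/5)."""
    d = fraction_different(a, b)
    arg = 1.0 - d - 0.2 * d * d
    if arg <= 0.0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(arg)


print(kimura_distance("ACDE-G", "ACDF-G"))
```

The corrected distance is always at least as large as the raw fraction
of differences D, and grows faster as the sequences diverge.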
 228 \r
 229 \r
 230 \r
 231 ALIGNMENT OUTPUT:\r
 232 \r
  -o, --out, --outfile={<file>,-} Multiple sequence alignment output file (default: stdout)
 234 \r
 235   --outfmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]} MSA output file format (default: fasta)\r
 236 \r
 237 \r
 238 By default Clustal-Omega writes its results (alignments) to stdout. An\r
 239 output file can be specified with the -o flag. Output to stdout is not\r
 240 possible in verbose mode (-v, see MISCELLANEOUS) as verbose/debugging\r
 241 messages would interfere with the alignment output.  By default,\r
 242 alignment files are not over-written, if a file with the specified\r
 243 name already exists. In this case Clustal-Omega aborts during the\r
 244 command-line processing stage. To force over-writing of already\r
 245 existing files use the --force flag (see MISCELLANEOUS).\r
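
The rules above (no output to stdout in verbose mode, and no
over-writing of existing files without --force) can be sketched as a
check made before any alignment work is done. This is an illustration
of the described behaviour only; check_output and its parameters are
hypothetical names.

```python
import os


def check_output(outfile=None, verbosity=0, force=False):
    """Return "stdout" or the output path, or raise if the rules are violated."""
    if outfile in (None, "-"):
        if verbosity > 0:
            # Verbose messages would interfere with the alignment on stdout.
            raise ValueError("verbose mode requires an output file (-o)")
        return "stdout"
    if os.path.exists(outfile) and not force:
        raise FileExistsError(f"{outfile} exists; use --force to overwrite")
    return outfile


print(check_output(None))
```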
 246 \r
 247 Clustal-Omega can output alignments in various formats by setting the\r
 248 --outfmt flag:\r
 249 \r
 250   * for Fasta format set: --outfmt=a2m  or  --outfmt=fa  or  --outfmt=fasta\r
 251 \r
 252   * for Clustal format set: --outfmt=clu  or  --outfmt=clustal\r
 253 \r
  * for Msf format set: --outfmt=msf
 255 \r
 256   * for Phylip format set: --outfmt=phy  or  --outfmt=phylip\r
 257 \r
 258   * for Selex format set: --outfmt=selex\r
 259 \r
 260   * for Stockholm format set: --outfmt=st  or  --outfmt=stockholm\r
 261 \r
 262   * for Vienna format set: --outfmt=vie  or  --outfmt=vienna\r
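
The accepted spellings listed above can be collected in a lookup table
that maps each one to a single format name. This is a sketch for
scripts that wrap Clustal-Omega; the names OUTFMT_ALIASES and
canonical_outfmt are hypothetical, and only the spellings listed above
are included.

```python
OUTFMT_ALIASES = {
    "a2m": "fasta", "fa": "fasta", "fasta": "fasta",
    "clu": "clustal", "clustal": "clustal",
    "msf": "msf",
    "phy": "phylip", "phylip": "phylip",
    "selex": "selex",
    "st": "stockholm", "stockholm": "stockholm",
    "vie": "vienna", "vienna": "vienna",
}


def canonical_outfmt(value="fasta"):
    """Map a listed --outfmt spelling to one format name (default: fasta)."""
    try:
        return OUTFMT_ALIASES[value]
    except KeyError:
        raise ValueError(f"unknown output format: {value!r}") from None


print(canonical_outfmt("st"))
```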
 263 \r
 264 \r
 265 ITERATION:\r
 266 \r
 267   --iterations, --iter=<n>  Number of (combined guide tree/HMM) iterations\r
 268 \r
 269   --max-guidetree-iterations=<n> Maximum guide tree iterations\r
 270 \r
 271   --max-hmm-iterations=<n>  Maximum number of HMM iterations\r
 272 \r
 273 \r
 274 By default, Clustal-Omega calculates (or reads in) a guide tree and\r
 275 performs a multiple alignment in the order specified by this guide\r
 276 tree. This alignment is then outputted. Clustal-Omega can 'iterate'\r
 277 its guide tree. The hope is that the (Kimura) distances, that can be\r
 278 derived from the initial alignment, will give rise to a better guide\r
 279 tree, and by extension, to a better alignment.\r
 280 \r
 281 A similar rationale applies to HMM-iteration. MSAs in general are very\r
 282 'vulnerable' at their early stages. Sequences that are aligned at an\r
 283 early stage remain fixed for the rest of the MSA. Another way of\r
 284 putting this is: 'once a gap, always a gap'. This behaviour can be\r
 285 mitigated by HMM iteration. An initial alignment is created and turned\r
 286 into a HMM. This HMM can help in a new round of MSA to 'anticipate'\r
 287 where residues should align. This is using the HMM as an External\r
 288 Profile and carrying out iterative EPA.  In practice, individual\r
 289 sequences and profiles are aligned to the External HMM, derived after\r
 290 the initial alignment. Pseudo-count information is then transferred to\r
 291 the (internal) HMM, corresponding to the individual\r
 292 sequence/profile. The now somewhat 'softened' sequences/profiles are\r
 293 then in turn aligned in the order specified by the guide\r
 294 tree. Pseudo-count transfer is reduced with the size of the\r
 295 profile. Individual sequences attain the greatest pseudo-count\r
 296 transfer, larger profiles less so. Pseudo-count transfer to profiles\r
 297 larger than, say, 10 is negligible. The effect of HMM iteration is\r
 298 more pronounced in larger test sets (that is, with more sequences).\r
 299 \r
Both HMM- and guide tree-iteration come at the cost of increased
run-time. One round of guide tree iteration adds on (roughly) the time
 302 it took to construct the initial alignment. If, for example, the\r
 303 initial alignment took 1min, then it will take (roughly) 2min to\r
 304 iterate the guide tree once, 3min to iterate the guide tree twice, and\r
 305 so on. HMM-iteration is more costly, as each round of iteration adds\r
 306 three times the time required for the alignment stage. For example, if\r
 307 the initial alignment took 1min, then each additional round of HMM\r
 308 iteration will add on 3min; so 4 iterations will take 13min\r
 309 (=1min+4*3min). The factor of 3 stems from the fact that at every\r
 310 stage both intermediate profiles have to be aligned with the\r
 311 background HMM, and finally the (softened) HMMs have to be aligned as\r
 312 well. All times are quoted for single processors.\r
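
The rough timings above can be written as two small formulas. They
reproduce the single-processor estimates quoted in the text for each
kind of iteration on its own; how the two costs combine when both are
iterated is not stated, so no combined estimate is attempted. The
function names are hypothetical.

```python
def guide_tree_iteration_time(initial, rounds):
    """Total time if each guide tree round adds the initial alignment time."""
    return initial * (1 + rounds)


def hmm_iteration_time(initial, rounds):
    """Total time if each HMM round adds three times the initial alignment time."""
    return initial * (1 + 3 * rounds)


# With a 1min initial alignment:
print(guide_tree_iteration_time(1, 2))  # two guide tree rounds
print(hmm_iteration_time(1, 4))         # four HMM rounds
```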
 313 \r
By default, guide tree iteration and HMM-iteration are coupled. This
means that at each iteration step both the guide tree and the HMM are
re-calculated. This is invoked by setting the --iter flag. For
 317 example, if --iter=1, then first an initial alignment is produced\r
 318 (without external HMM background information and using k-tuple\r
 319 distances to calculate the guide tree). This initial alignment is then\r
 320 used to re-calculate a new guide tree (using Kimura distances) and to\r
 321 create a HMM. The new guide tree and the HMM are then used to produce\r
 322 a new MSA.\r
 323 \r
 324 Iteration of guide tree and HMM can be de-coupled. This means that the\r
 325 number of guide tree iterations and HMM iterations can be\r
 326 different. This can be done by combining the --iter flag with the\r
 327 --max-guidetree-iterations and/or the --max-hmm-iterations flag.  The\r
 328 number of guide tree iterations is the minimum of --iter and\r
 329 --max-guidetree-iterations, while the number of HMM iterations is the\r
 330 minimum of --iter and --max-hmm-iterations.  If, for example, HMM\r
 331 iteration should be performed 5 times but guide tree iteration should\r
 332 be performed only 3 times, then one should set --iter=5 and\r
--max-guidetree-iterations=3. All three flags can be specified at the
same time (however, this is never necessary). It is not sufficient to
specify --max-guidetree-iterations and --max-hmm-iterations without
--iter: if any iteration is desired, --iter has to be set.
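
The way the three flags interact can be sketched as follows. The
function illustrates the "minimum" rule described above; its name and
the use of None for "flag not given" are assumptions for illustration.

```python
def effective_iterations(iterations=0, max_guidetree=None, max_hmm=None):
    """Return (guide tree iterations, HMM iterations) actually performed.

    iterations    -- value of --iter (0 if not set)
    max_guidetree -- value of --max-guidetree-iterations, or None
    max_hmm       -- value of --max-hmm-iterations, or None
    """
    gt = iterations if max_guidetree is None else min(iterations, max_guidetree)
    hmm = iterations if max_hmm is None else min(iterations, max_hmm)
    return gt, hmm


# --iter=5 --max-guidetree-iterations=3
print(effective_iterations(5, max_guidetree=3))
# Limits without --iter: no iteration at all
print(effective_iterations(0, max_guidetree=3, max_hmm=5))
```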
 337 \r
 338 \r
 339 LIMITS (will exit early, if exceeded):\r
 340 \r
 341   --maxnumseq=<n>           Maximum allowed number of sequences\r
 342 \r
 343   --maxseqlen=<l>           Maximum allowed sequence length\r
 344 \r
 345 Limits can be imposed on the number of sequences in the input file\r
 346 and/or the lengths of the sequences. This cap can be set with the\r
 347 --maxnumseq and --maxseqlen flags, respectively. Clustal-Omega will\r
 348 exit early, if these limits are exceeded.\r
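
A wrapper script could apply the same kind of caps before calling
Clustal-Omega. The sketch below mirrors the two limits described
above; check_limits is a hypothetical name, and "exceeded" is taken to
mean strictly greater than the limit.

```python
def check_limits(seqs, maxnumseq=None, maxseqlen=None):
    """Raise ValueError if the sequence count or the longest sequence exceeds a cap."""
    if maxnumseq is not None and len(seqs) > maxnumseq:
        raise ValueError(f"{len(seqs)} sequences exceed --maxnumseq={maxnumseq}")
    longest = max((len(s) for s in seqs), default=0)
    if maxseqlen is not None and longest > maxseqlen:
        raise ValueError(f"sequence length {longest} exceeds --maxseqlen={maxseqlen}")


check_limits(["MKVLA", "MKV"], maxnumseq=2, maxseqlen=5)  # within both limits
```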
 349 \r
 350 \r
 351 MISCELLANEOUS:\r
 352 \r
 353   --auto                    Set options automatically (might overwrite some of your options)\r
 354 \r
 355   --threads=<n>             Number of processors to use\r
 356 \r
 357   -l, --log=<file>          Log all non-essential output to this file\r
 358 \r
 359   -h, --help                Print help and exit\r
 360 \r
 361   -v, --verbose             Verbose output (increases if given multiple times)\r
 362 \r
 363   --version                 Print version information and exit\r
 364 \r
 365   --long-version            Print long version information and exit\r
 366 \r
 367   --force                   Force file overwriting\r
 368 \r
 369 \r
Users may feel unsure which options are appropriate in certain
situations, even though using Clustal-Omega without any special
options should give the desired results. The --auto flag tries to
alleviate this problem and selects accuracy/speed flags according to
the number of sequences. In all cases it will use mBed and thereby
possibly override the --full option. For more than 1,000 sequences
iteration is turned off, as the effect of iteration is more
noticeable for 'larger' problems. Otherwise iterations are set to 1 if
not already set to a higher value by the user. Expert users may want
to avoid this flag and exercise more fine-tuned control by selecting
the appropriate options manually.
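
The choices made by --auto, as described above, can be summarised in a
few lines. This is a sketch of the stated rules only, not the actual
implementation; auto_options and the dictionary keys are hypothetical.

```python
def auto_options(n_seqs, user_iterations=0):
    """Return the settings --auto would select for n_seqs input sequences."""
    if n_seqs > 1000:
        iterations = 0                        # iteration is turned off
    else:
        iterations = max(1, user_iterations)  # at least 1, keep a higher user value
    return {"use_mbed": True, "iterations": iterations}


print(auto_options(250))
print(auto_options(5000, user_iterations=3))
```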
 381 \r
Certain parts of the MSA calculation have been parallelised, most
notably the distance matrix calculation and certain aspects of the
HMM building stage. Clustal-Omega uses OpenMP. By default,
 385 Clustal-Omega will attempt to use as many threads as possible. For\r
 386 example, on a 4-core machine Clustal-Omega will attempt to use 4\r
 387 threads. The number of threads can be limited by setting the --threads\r
 388 flag. This may be desirable, for example, in the case of\r
 389 benchmarking/timing.\r
 390 \r
 391 Usually, non-essential (verbose) output is written to screen. This\r
 392 output can be written to file by specifying the --log flag.\r
 393 \r
 394 Help is available by specifying the -h flag.\r
 395 \r
 396 By default Clustal-Omega does not print any information to stdout\r
 397 (other than the final alignment, if no output file is\r
 398 specified). Information concerning the progress of the alignment can\r
 399 be obtained by specifying one verbosity flag (-v). This may be\r
 400 desirable, to verify what Clustal-Omega is actually doing at the\r
 401 moment. If two verbosity flags (-v -v) are specified, command-line\r
 402 flags (explicitly and implicitly set) are printed in addition to the\r
 403 progress report.  Triple verbose level (-v -v -v) is the most verbose\r
 404 level. In addition to single- and double-verbose information much more\r
 405 information is displayed: input sequences and names, details of the\r
 406 tree construction and intermediate alignments. Tree construction\r
information includes pairwise distances. The number of pairwise
distances scales with the square of the number of sequences, so
triple-verbose mode is probably only useful for a small number of
sequences.
 411 \r
 412 The current version number of Clustal-Omega can be displayed by\r
 413 setting the --version flag.\r
 414 \r
 415 The current version number of Clustal-Omega as well as the code-name\r
 416 and the build date can be displayed by setting the --long-version\r
 417 flag.\r
 418 \r
 419 By default, Clustal-Omega does not over-write files. These can be (i)\r
 420 alignment output, (ii) distance matrix and (iii) guide\r
 421 tree. Overwriting can be forced by setting the --force flag.\r
 422 \r
 423 \r
 424 EXAMPLES:\r
 425 \r
 426 ./clustalo -i globin.fa\r
 427 \r
 428 Clustal-Omega reads the sequence file globin.fa, aligns the sequences\r
 429 and prints the result to screen in fasta/a2m format.\r
 430 \r
 431 \r
 432 ./clustalo -i globin.fa -o globin.sto --outfmt=st\r
 433 \r
 434 If the file globin.sto does not exist, then Clustal-Omega reads the\r
 435 sequence file globin.fa, aligns the sequences and prints the result to\r
 436 globin.sto in Stockholm format. If the file globin.sto does exist\r
 437 already, then Clustal-Omega terminates the alignment process before\r
 438 reading globin.fa.\r
 439 \r
 440 \r
 441 ./clustalo -i globin.fa -o globin.aln --outfmt=clu --force\r
 442 \r
 443 Clustal-Omega reads the sequence file globin.fa, aligns the sequences\r
 444 and prints the result to globin.aln in Clustal format, overwriting the\r
 445 file globin.aln, if it already exists.\r
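

Command lines like the ones above can also be assembled from a script.
The sketch below only builds the argument list, using flags documented
in this file; clustalo_command is a hypothetical helper, and the
executable path ./clustalo is taken from the examples.

```python
def clustalo_command(infile, outfile=None, outfmt=None, force=False,
                     exe="./clustalo"):
    """Build an argument list for a basic Clustal-Omega run."""
    cmd = [exe, "-i", infile]
    if outfile is not None:
        cmd += ["-o", outfile]
    if outfmt is not None:
        cmd.append(f"--outfmt={outfmt}")
    if force:
        cmd.append("--force")
    return cmd


print(" ".join(clustalo_command("globin.fa", "globin.aln", "clu", force=True)))
```

The resulting list can be passed to subprocess.run(cmd, check=True)
from the Python standard library to run the alignment.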
 446 \r
 447 \r
./clustalo -i globin.fa --full --distmat-out=globin.mat --guidetree-out=globin.dnd --force

Clustal-Omega reads the sequence file globin.fa, aligns the sequences,
prints the result to screen in fasta/a2m format (default), the guide
tree to globin.dnd and the distance matrix to globin.mat, overwriting
those files if they already exist. The --full flag is set because the
distance matrix can only be written out in --full mode (see
CLUSTERING).
 454 \r
 455 \r
 456 ./clustalo -i globin.fa --guidetree-in=globin.dnd\r
 457 \r
 458 Clustal-Omega reads the files globin.fa and globin.dnd, skipping\r
 459 distance calculation and guide tree creation, using instead the guide\r
 460 tree specified in globin.dnd.\r
 461 \r
 462 \r
 463 ./clustalo -i globin.fa --hmm-in=PF00042.hmm\r
 464 \r
 465 Clustal-Omega reads the sequence file globin.fa and the HMM file\r
 466 PF00042.hmm (in HMMer2 or HMMer3 format).  It then performs the\r
 467 alignment, transferring pseudo-count information contained in\r
 468 PF00042.hmm to the sequences/profiles during the MSA.\r
 469 \r
 470 \r
 471 ./clustalo -i globin.sto\r
 472 \r
 473 Clustal-Omega reads the file globin.sto (of aligned sequences in\r
 474 Stockholm format). It converts the alignment into a HMM, de-aligns the\r
 475 sequences and re-aligns them, transferring pseudo-count information to\r
 476 the sequences/profiles during the MSA. The guide tree is constructed\r
 477 using a full distance matrix of Kimura distances.\r
 478 \r
 479 \r
 480 ./clustalo -i globin.sto  --dealign\r
 481 \r
 482 Clustal-Omega reads the file globin.sto (of aligned sequences in\r
 483 Stockholm format). It de-aligns the sequences and then re-aligns\r
them. No HMM is produced in the process and no pseudo-count information
is transferred. Consequently, the output must be the same as for
un-aligned input (as in the first example, ./clustalo -i globin.fa).
 487 \r
 488 \r
 489 ./clustalo -i globin.fa --iter=2\r
 490 \r
 491 Clustal-Omega reads the file globin.fa, creates a UPGMA guide tree\r
 492 built from k-tuple distances, and performs an initial alignment. This\r
 493 initial alignment is converted into a HMM and a new guide tree is\r
 494 built from the Kimura distances of the initial alignment. The\r
 495 un-aligned sequences are then aligned (for the second time but this\r
 496 time) using pseudo-count information from the HMM created after the\r
 497 initial alignment (and using the new guide tree). This second\r
 498 alignment is then again converted into a HMM and a new guide tree is\r
 499 constructed. The un-aligned sequences are then aligned (for a third\r
 500 time), again using pseudo-count information of the HMM from the\r
 501 previous step and the most recent guide tree. The final alignment is\r
 502 written to screen.\r
 503 \r
 504 \r
 505 ./clustalo -i globin.fa --iter=5 --max-guidetree-iterations=1\r
 506 \r
 507 Clustal-Omega reads the file globin.fa, creates a UPGMA guide tree\r
 508 built from k-tuple distances, and performs an initial alignment. This\r
 509 initial alignment is converted into a HMM and a new guide tree is\r
 510 built from the Kimura distances of the initial alignment. The\r
 511 un-aligned sequences are then aligned (for the second time but this\r
 512 time) using pseudo-count information from the HMM created after the\r
 513 initial alignment (and using the new guide tree). For the last 4\r
 514 iterations the guide tree is left unchanged and only HMM iteration is\r
 515 performed. This means that intermediate alignments are converted to\r
 516 HMMs, and these intermediate HMMs are used to guide the MSA during\r
 517 subsequent iteration stages.\r
 518 \r
 519 \r
 520 ./clustalo -i globin.fa -o globin.a2m -v\r
 521 \r
 522 In case the file globin.a2m does not exist, Clustal-Omega reads the\r
 523 file globin.fa, prints a progress report to screen and writes the\r
 524 alignment in (default) Fasta format to globin.a2m. The progress report\r
 525 consists of the number of threads used, the number of sequences read,\r
 526 the current progress in the k-tuple distance calculation, completion\r
 527 of the guide tree computation and current progress of the MSA stage.\r
 528 If the file globin.a2m already exists Clustal-Omega aborts before\r
 529 reading the file globin.fa. Note that in verbose mode an output file\r
 530 has to be specified, because progress/debugging information, which is\r
 531 printed to screen, would interfere with the alignment being printed to\r
 532 screen.\r
 533 \r
 534 \r
 535 ./clustalo -i PF00042_full.fa --dealign --full --outfmt=vie -o PF00042_full.vie --force\r
 536 \r
 537 Clustal-Omega reads the file PF00042_full.fa. This file contains\r
 538 several thousand aligned sequences. --dealign tells Clustal-Omega to\r
 539 erase all alignment information and re-align the sequences from\r
scratch. As there are several thousand sequences, calculating a full
 541 distance matrix may be slow. Setting the --full flag specifically\r
 542 selects the full distance mode over the default mBed mode. The\r
 543 alignment is then written out in Vienna format (fasta format all on\r
 544 one line, no line breaks per sequence) to file PF00042_full.vie.\r
 545 \r
 546 \r
 547 ./clustalo -i PF00042_full.fa --dealign --outfmt=vie -o PF00042_full.vie --force\r
 548 \r
 549 Clustal-Omega reads the file PF00042_full.fa. This file contains\r
 550 several thousand aligned sequences. --dealign tells Clustal-Omega to\r
 551 erase all alignment information and re-align the sequences from\r
 552 scratch. Calculating the distance matrix will be done by mBed\r
 553 (default). Clustal-Omega will calculate pairwise distances to a\r
 554 small number of reference sequences only. This will give a significant\r
 555 speed-up. The speed-up is greater for larger families (more\r
 556 sequences). The alignment is then written out in Vienna format (fasta\r
 557 format all on one line, no line breaks per sequence) to file\r
 558 PF00042_full.vie.\r
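
The difference between Vienna and ordinary Fasta output, as described
above, is only whether each sequence is written on a single line. The
sketch below illustrates this with two hypothetical writer functions;
the line width of 60 for wrapped Fasta is an assumption.

```python
def to_vienna(records):
    """Write (name, sequence) pairs with each sequence on one line."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


def to_fasta(records, width=60):
    """Write (name, sequence) pairs with sequences wrapped at `width` characters."""
    out = []
    for name, seq in records:
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


records = [("seq1", "MKV-LAAGH"), ("seq2", "MKVALA-GH")]
print(to_vienna(records), end="")
print(to_fasta(records, width=4), end="")
```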
 559 \r
 560 \r
 561 ./clustalo --p1=globin.sto --p2=PF00042_full.vie -o globin+pf00042.fa\r
 562 \r
Clustal-Omega reads files globin.sto and PF00042_full.vie of aligned
sequences (profiles). The two profiles are then aligned to each other.
The relative positions of residues within each profile are not changed
during this alignment; however, columns of gaps may be inserted into
either profile. The final alignment is written to file
globin+pf00042.fa in fasta format.
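
What "columns are kept fixed" means in a profile-profile alignment can
be illustrated as follows: whole columns of gaps may be added to a
profile, but the existing columns are never rearranged. The function
below is a hypothetical illustration of that property, not the
alignment procedure itself.

```python
def insert_gap_columns(profile, columns, gap="-"):
    """Insert an all-gap column before each index listed in `columns`.

    Indices refer to columns of the original profile; an index equal to
    the profile width appends a gap column at the end.
    """
    width = len(profile[0])
    if any(len(row) != width for row in profile):
        raise ValueError("all rows of a profile must have the same length")
    result = []
    for row in profile:
        pieces = []
        for i in range(width + 1):
            pieces.append(gap * columns.count(i))  # new gap column(s) here
            if i < width:
                pieces.append(row[i])              # original column, unchanged
        result.append("".join(pieces))
    return result


print(insert_gap_columns(["AC-D", "ACED"], [2]))
```

Every row receives the same gap columns, so residues that shared a
column before the insertion still share one afterwards.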
 569 \r
 570 \r
 571 ./clustalo -i globin.fa --p1=PF00042_full.vie -o pf00042+globin.fa\r
 572 \r
 573 Clustal-Omega reads file globin.fa of un-aligned sequences and the\r
 574 profile (of aligned sequences) in file PF00042_full.vie. A HMM is\r
 575 created from the profile. This HMM is used to guide the alignment of\r
 576 the un-aligned sequences in globin.fa. The profile that was generated\r
 577 during this alignment of un-aligned globin.fa sequences is then\r
aligned to the input profile PF00042_full.vie. The relative positions
of residues in profile PF00042_full.vie are not changed during this
alignment; however, columns of gaps may be inserted into the
profile. The final alignment is output to file pf00042+globin.fa in
fasta format. The alignment in this example may be slightly different
from the alignment in the previous example, because no HMM guidance
was used to generate the profile globin.sto. In this example HMM
guidance was used to align the sequences in globin.fa, the hope being
that this intermediate alignment will have profited from the bigger
profile.
 587 \r
 588 \r
 589 \r
 590 LITERATURE:\r
 591 \r
[1] Johannes Soeding (2005) Protein homology detection by HMM-HMM
    comparison. Bioinformatics 21 (7): 951–960.
 594 \r
 595 [2] Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG.  Sequence\r
 596     embedding for fast construction of guide trees for multiple\r
 597     sequence alignment.  Algorithms Mol Biol. 2010 May 14;5:21.\r
 598 \r
 599 [3] http://www.genetics.wustl.edu/eddy/software/#squid\r
 600 \r
 601 [4] Wilbur and Lipman, 1983; PMID 6572363\r
 602 \r
 603 [5] Thompson JD, Higgins DG, Gibson TJ.  (1994). CLUSTAL W: improving\r
 604     the sensitivity of progressive multiple sequence alignment through\r
 605     sequence weighting, position-specific gap penalties and weight\r
 606     matrix choice. Nucleic Acids Res., 22, 4673-4680.\r
 607 \r
 608 [6] Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA,\r
 609     McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD,\r
 610     Gibson TJ, Higgins DG.  (2007). Clustal W and Clustal X version\r
 611     2.0. Bioinformatics, 23, 2947-2948.\r
 612 \r
 613 [7] Kimura M (1980). "A simple method for estimating evolutionary\r
 614     rates of base substitutions through comparative studies of\r
 615     nucleotide sequences". Journal of Molecular Evolution 16: 111–120.\r
 616 \r
[8] Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high
    accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797.
 619 \r