   3 CLUSTAL-OMEGA is a general purpose multiple sequence alignment program
   4 for proteins.
   5
   6
   7
   8 INTRODUCTION
   9
  10 Clustal-Omega is a general purpose multiple sequence alignment (MSA)
  11 program for proteins. It produces high quality MSAs and is capable of
  12 handling data-sets of hundreds of thousands of sequences in reasonable
  13 time.
  14
In default mode, users supply a file of sequences to be aligned. The
sequences are clustered to produce a guide tree, which is then used to
guide a "progressive alignment" of the sequences. There are also
facilities for aligning existing alignments to each other, aligning a
sequence to an alignment, and using a hidden Markov model (HMM) to
help guide an alignment of new sequences that are homologous to the
sequences used to make the HMM. This last procedure is referred to as
"external profile alignment" or EPA.
  23
  24 Clustal-Omega uses HMMs for the alignment engine, based on the HHalign
  25 package from Johannes Soeding [1]. Guide trees are made using an
  26 enhanced version of mBed [2] which can cluster very large numbers of
  27 sequences in O(N*log(N)) time. Multiple alignment then proceeds by
  28 aligning larger and larger alignments using HHalign, following the
  29 clustering given by the guide tree.
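
The way a guide tree dictates the order of a progressive alignment
can be sketched as follows. This is only an illustration, not
Clustal-Omega's code: the nested-tuple tree representation and the
function name merge_order are assumptions made for this example. Each
step pairs the two groups of sequences that would be aligned to each
other at that point.

```python
def merge_order(tree):
    """List the pairwise merge steps implied by a binary guide tree.

    `tree` is a hypothetical representation: a leaf is a sequence name
    (str), an internal node is a pair (left, right).
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left, right = node
        left_leaves = visit(left)
        right_leaves = visit(right)
        # Both children are complete, so their alignments can be merged.
        steps.append((left_leaves, right_leaves))
        return left_leaves + right_leaves

    visit(tree)
    return steps


guide_tree = (("A", "B"), ("C", ("D", "E")))
for left, right in merge_order(guide_tree):
    print("align", left, "with", right)
```

Small groups are merged first and the alignments grow larger towards
the root of the tree.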
  30
In its current form Clustal-Omega can align only protein sequences,
not DNA/RNA sequences. Support for DNA/RNA is envisioned for a future
version.
  34
  35
  36
  37 SEQUENCE INPUT:
  38
  39 -i, --in, --infile={<file>,-}
  40         Multiple sequence input file (- for stdin)
  41
  42 --hmm-in=<file>
  43         HMM input files
  44
  45 --dealign
  46         Dealign input sequences
  47
  48 --profile1, --p1=<file>
  49         Pre-aligned multiple sequence file (aligned columns will be kept fixed)
  50
  51 --profile2, --p2=<file>
  52         Pre-aligned multiple sequence file (aligned columns will be kept fixed)
  53
  54
  55 For sequence and profile input Clustal-Omega uses the Squid library
  56 from Sean Eddy [3].
  57
  58
  59 Clustal-Omega accepts 3 types of sequence input: (i) a sequence file
  60 with un-aligned or aligned sequences, (ii) profiles (a multiple
  61 alignment in a file) of aligned sequences, (iii) a HMM. Valid
  62 combinations of the above are:
  63
  64 (a) one file with un-aligned or aligned sequences (i); the sequences
  65     will be aligned, and the alignment will be written out. For this
  66     mode use the -i flag. If the sequences are aligned (all sequences
  67     have the same length and at least one sequence has at least one
  68     gap), then the alignment is turned into a HMM, the sequences are
  69     de-aligned and the now un-aligned sequences are aligned using the
  70     HMM as an External Profile for External Profile Alignment (EPA).
  71     If no EPA is desired use the --dealign flag.
  72
  73     Use the above option to make a multiple alignment from a set of
  74     sequences. A sequence file must contain more than one sequence (at
  75     least two sequences).
  76
  77 (b) two profiles (ii)+(ii); the columns in each profile will be kept
  78     fixed and the alignment of the two profiles will be written
  79     out. Use the --p1 and --p2 flags for this mode.
  80
  81     Use this option to align two alignments (profiles) together.
  82
  83 (c) one file with un/aligned sequences (i) and one profile (ii); the
  84     profile is converted into a HMM and the un-aligned sequences will
  85     be multiply aligned (using the HMM background information) to form
  86     a profile; this constructed profile is aligned with the input
  87     profile; the columns in each profile (the original one and the one
  88     created from the un-aligned sequences) will be kept fixed and the
  89     alignment of the two profiles will be written out. Use the -i flag
  90     in conjunction with the --p1 flag for this mode.
    The un/aligned sequences file (i) must contain at least two
    sequences. To align a single sequence with a profile, use the
    profile-profile option (b) instead.
  94
  95     Use the above option to add new sequences to an existing
  96     alignment.
  97
(d) one file with un-aligned sequences (i) and one HMM (iii); the
    un-aligned sequences will be aligned to form a profile, using the
    HMM as an External Profile. Only HMMer2 and HMMer3 formats are
    allowed. The alignment will be written out; the HMM information
    is discarded. Because only one HMM can be used at the moment, no
    HMM is produced from the input if the sequences are already
    aligned. Use the -i flag in conjunction with the --hmm-in flag
    for this mode. Multiple HMMs can be supplied; however, in the
    current version all but the first HMM will be ignored.
 107
 108     Use this option to make a new multiple alignment of sequences from
 109     the input file and use the HMM as a guide (EPA).
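
The test for "aligned" input described under (a), namely that all
sequences have the same length and at least one sequence contains at
least one gap, can be sketched as below. The function name and the set
of gap characters ('-' and '.') are assumptions for this illustration;
the README does not say which gap symbols Clustal-Omega recognises.

```python
GAP_CHARS = "-."  # assumed gap symbols, not taken from the README


def looks_aligned(sequences):
    """Return True if the sequences satisfy the README's alignment test."""
    if len(sequences) < 2:
        return False
    same_length = len({len(seq) for seq in sequences}) == 1
    has_gap = any(ch in GAP_CHARS for seq in sequences for ch in seq)
    return same_length and has_gap


print(looks_aligned(["MKV-LA", "MKVALA"]))  # same length, one gap
print(looks_aligned(["MKVLA", "MKVALA"]))   # different lengths
```

Note that equal-length sequences without any gap are treated as
un-aligned under this rule.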
 110
 111
 112 Invalid combinations of the above are:
 113
 114 (v) an un/aligned sequence file containing just one sequence (i)
 115
 116 (w) an un/aligned sequence file containing just one sequence and a profile
 117     (i)+(ii)
 118
 119 (x) an un/aligned sequence file containing just one sequence and a HMM
 120     (i)+(iii)
 121
 122 (y) two or more HMMs (iii)+(iii)+... cannot be aligned to one another.
 123
 124 (z) one profile (ii) cannot be aligned with a HMM (iii)
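
The valid and invalid input combinations listed above can be
summarised in a small lookup. This is a sketch under stated
assumptions: the function input_mode and its parameters are
hypothetical and only encode the cases (a)-(d) and (v)-(z) as this
README describes them; any other combination is rejected as "not
described" rather than guessed at.

```python
def input_mode(n_seqs=0, n_profiles=0, has_hmm=False):
    """Map an input combination to the README's mode letter.

    n_seqs     -- number of sequences in the -i file (0 if -i is not used)
    n_profiles -- number of profiles given via --p1/--p2
    has_hmm    -- whether --hmm-in is used
    """
    if n_seqs == 1:
        raise ValueError("cases (v), (w), (x): a sequence file must "
                         "contain at least two sequences")
    if n_seqs == 0 and n_profiles == 1 and has_hmm:
        raise ValueError("case (z): a profile cannot be aligned with a HMM")
    modes = {
        (True, 0, False): "a",   # -i only
        (False, 2, False): "b",  # --p1 and --p2
        (True, 1, False): "c",   # -i and --p1
        (True, 0, True): "d",    # -i and --hmm-in
    }
    key = (n_seqs >= 2, n_profiles, has_hmm)
    if key not in modes:
        raise ValueError("combination not described in this README")
    return modes[key]


print(input_mode(n_seqs=10))                # a
print(input_mode(n_profiles=2))             # b
print(input_mode(n_seqs=10, n_profiles=1))  # c
print(input_mode(n_seqs=10, has_hmm=True))  # d
```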
 125
 126
 127 The following MSA file formats are allowed:
 128
 129     a2m=fasta, (vienna)
 130     clustal,
 131     msf,
 132     phylip,
 133     selex,
 134     stockholm
 135
 136
 137 Prior to MSA, Clustal-Omega de-aligns all sequence input (i). However,
 138 alignment information is automatically converted into a HMM and used
 139 during MSA, unless the --dealign flag is specifically set.  Profiles
 140 (ii) are not de-aligned.
 141
The Clustal-Omega alignment engine cannot currently process DNA/RNA.
If a sequence input file (i) or a profile (ii) is interpreted as
DNA/RNA, the program will terminate during the file input stage.
 145
 146
 147
 148 CLUSTERING:
 149
 150   --distmat-in=<file>
 151         Pairwise distance matrix input file (skips distance computation)
 152
 153   --distmat-out=<file>
 154         Pairwise distance matrix output file
 155
 156   --guidetree-in=<file>
 157         Guide tree input file
 158         (skips distance computation and guide tree clustering step)
 159
 160   --guidetree-out=<file>
 161         Guide tree output file
 162
 163   --full
 164         Use full distance matrix for guide-tree calculation (slow; mBed is default)
 165
 166   --full-iter
 167         Use full distance matrix for guide-tree calculation during iteration (mBed is default)
 168
 169
In order to produce a multiple alignment Clustal-Omega requires a
guide tree, which defines the order in which sequences/profiles are
aligned. A guide tree in turn is constructed from a distance matrix.
Conventionally, this distance matrix consists of all the pair-wise
distances of the sequences. The distance measure Clustal-Omega uses
for pair-wise distances of un-aligned sequences is the k-tuple
measure [4], which was also implemented in Clustal 1.83 and ClustalW2
[5,6]. If the sequences input via -i are aligned, Clustal-Omega uses
the Kimura-corrected pairwise aligned identities [7]. The
computational effort (time/memory) to calculate and store a full
distance matrix grows quadratically with the number of sequences.
Clustal-Omega can improve this scalability to N*log(N) by employing a
fast clustering algorithm called mBed [2]; this option is invoked
automatically (default). If a full distance matrix evaluation is
desired, the --full flag has to be set. The mBed mode calculates a
reduced set of pair-wise distances. These distances are used in a
k-means algorithm that produces clusters of at most 100 sequences.
For each cluster a full distance matrix is calculated. No full
distance matrix (of all input sequences) is calculated in mBed mode.
If there are fewer than 100 sequences in the input, then in effect a
full distance matrix is calculated in mBed mode; however, no distance
matrix can be written out (see below).
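
To get a feeling for the difference between quadratic and N*log(N)
growth, the following sketch compares the number of entries in a full
pair-wise distance matrix with the quantity N*log2(N). The second
figure only illustrates the scaling behaviour; it is not the actual
number of distances mBed computes, and the choice of base 2 for the
logarithm is an assumption.

```python
import math


def full_matrix_pairs(n):
    """Number of distinct pair-wise distances among n sequences."""
    return n * (n - 1) // 2


def n_log_n(n):
    """N*log2(N), used here purely to illustrate the scaling."""
    return n * math.log2(n)


for n in (100, 10_000, 100_000):
    print(n, full_matrix_pairs(n), round(n_log_n(n)))
```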
 192
 193
Clustal-Omega uses Muscle's [8] fast UPGMA implementation to construct
its guide trees from the distance matrix. By default, the distance
matrix is used internally to construct the guide tree and is then
discarded. By specifying --distmat-out the internal distance matrix
can be written to file. This is only possible in --full mode; in mBed
mode no full distance matrix is available to write out. The guide
trees by default are used internally to guide the multiple alignment
and are then discarded. By specifying the --guidetree-out option these
internal guide trees can be written out to file, in either mBed or
--full mode. Conversely, the distance calculation and/or guide tree
building stage can be skipped by reading in a pre-calculated distance
matrix and/or pre-calculated guide tree. These options are invoked by
specifying the --distmat-in and/or --guidetree-in flags, respectively.
However, distance matrix reading is disabled in the current version.
By default, distance matrix and guide tree files are not over-written
if a file with the specified name already exists. In this case
Clustal-Omega aborts during the command-line processing stage. To
force over-writing of already existing files use the --force flag
(see MISCELLANEOUS).
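
For readers unfamiliar with UPGMA, a naive textbook version is
sketched below. It is not Muscle's fast implementation and not
Clustal-Omega's code; the function name, the dictionary-based distance
input and the nested-tuple output are assumptions chosen to keep the
example short. At each step the two closest clusters are joined, and
distances to the new cluster are the size-weighted average of the
distances to its two parts.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree (nested pairs) from pair-wise distances.

    names -- list of sequence names
    dist  -- dict mapping (name_a, name_b) to a distance
    """
    # Active clusters: tree (name or nested pair) -> number of leaves.
    sizes = {name: 1 for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        for c in sizes:
            # Size-weighted average distance to the merged cluster.
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


distances = {("A", "B"): 2.0, ("C", "D"): 4.0,
             ("A", "C"): 10.0, ("A", "D"): 10.0,
             ("B", "C"): 10.0, ("B", "D"): 10.0}
print(upgma(["A", "B", "C", "D"], distances))
```

This naive version rescans all pairs at every step and is only
suitable for small inputs.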
 215
Guide trees can be iterated to refine the alignment (see section
ITERATION). Clustal-Omega takes the alignment that was produced
initially and constructs a new distance matrix from it. The distance
measure used at this stage is the Kimura distance [7]. By default,
Clustal-Omega constructs a reduced distance matrix at this stage using
the mBed algorithm, which is then used to create an improved
(iterated) guide tree. To turn off mBed-like clustering at this stage
the --full-iter flag has to be set. While Kimura distances in general
are much faster to calculate than k-tuple distances, time and memory
requirements still scale quadratically with the number of sequences,
so --full-iter clustering should only be considered for smaller cases
(<< 10,000 sequences).
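
As an illustration of a distance derived from an existing alignment,
the sketch below computes the fraction p of differing residues between
two aligned sequences and applies a Kimura-style correction. This
README does not spell out the formula; the sketch assumes the widely
used protein correction D = -ln(1 - p - p*p/5), and it assumes that
columns containing a gap ('-' or '.') in either sequence are ignored.
Both are assumptions, and the code may differ from what Clustal-Omega
actually does.

```python
import math


def fraction_different(seq_a, seq_b, gaps="-."):
    """Fraction of differing residues over columns without gaps."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b)
             if x not in gaps and y not in gaps]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def kimura_protein_distance(p):
    """Assumed correction D = -ln(1 - p - p*p/5) for 0 <= p < ~0.85."""
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0.0:
        raise ValueError("p too large for this formula")
    return -math.log(arg)


p = fraction_different("MKVL-A", "MRVLGA")
print(p, kimura_protein_distance(p))
```

The corrected distance is always at least as large as p, reflecting
substitutions that are hidden by multiple changes at the same site.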
 228
 229
 230
 231 ALIGNMENT OUTPUT:
 232
  -o, --out, --outfile={<file>,-} Multiple sequence alignment output file (default: stdout)
 234
 235   --outfmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]} MSA output file format (default: fasta)
 236
 237
 238 By default Clustal-Omega writes its results (alignments) to stdout. An
 239 output file can be specified with the -o flag. Output to stdout is not
 240 possible in verbose mode (-v, see MISCELLANEOUS) as verbose/debugging
 241 messages would interfere with the alignment output.  By default,
 242 alignment files are not over-written, if a file with the specified
 243 name already exists. In this case Clustal-Omega aborts during the
 244 command-line processing stage. To force over-writing of already
 245 existing files use the --force flag (see MISCELLANEOUS).
 246
 247 Clustal-Omega can output alignments in various formats by setting the
 248 --outfmt flag:
 249
 250   * for Fasta format set: --outfmt=a2m  or  --outfmt=fa  or  --outfmt=fasta
 251
 252   * for Clustal format set: --outfmt=clu  or  --outfmt=clustal
 253
  * for Msf format set: --outfmt=msf
 255
 256   * for Phylip format set: --outfmt=phy  or  --outfmt=phylip
 257
 258   * for Selex format set: --outfmt=selex
 259
 260   * for Stockholm format set: --outfmt=st  or  --outfmt=stockholm
 261
 262   * for Vienna format set: --outfmt=vie  or  --outfmt=vienna
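
The --outfmt values listed above can be collected in a lookup table,
for example when a wrapper script wants to validate a format name
before calling Clustal-Omega. The table below contains only the
spellings given in this README; the names OUTFMT_ALIASES and
canonical_outfmt are hypothetical and not part of Clustal-Omega.

```python
# Spellings taken from the list above, mapped to one name per format.
OUTFMT_ALIASES = {
    "a2m": "fasta", "fa": "fasta", "fasta": "fasta",
    "clu": "clustal", "clustal": "clustal",
    "msf": "msf",
    "phy": "phylip", "phylip": "phylip",
    "selex": "selex",
    "st": "stockholm", "stockholm": "stockholm",
    "vie": "vienna", "vienna": "vienna",
}


def canonical_outfmt(value="fasta"):
    """Return the format name for an --outfmt value (default: fasta)."""
    try:
        return OUTFMT_ALIASES[value]
    except KeyError:
        raise ValueError("unknown --outfmt value: %r" % value)


print(canonical_outfmt("st"))
print(canonical_outfmt("a2m"))
```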
 263
 264
 265 ITERATION:
 266
 267   --iterations, --iter=<n>  Number of (combined guide tree/HMM) iterations
 268
 269   --max-guidetree-iterations=<n> Maximum guide tree iterations
 270
 271   --max-hmm-iterations=<n>  Maximum number of HMM iterations
 272
 273
By default, Clustal-Omega calculates (or reads in) a guide tree and
performs a multiple alignment in the order specified by this guide
tree. This alignment is then written out. Clustal-Omega can 'iterate'
its guide tree. The hope is that the (Kimura) distances that can be
derived from the initial alignment will give rise to a better guide
tree and, by extension, to a better alignment.
 280
 281 A similar rationale applies to HMM-iteration. MSAs in general are very
 282 'vulnerable' at their early stages. Sequences that are aligned at an
 283 early stage remain fixed for the rest of the MSA. Another way of
 284 putting this is: 'once a gap, always a gap'. This behaviour can be
 285 mitigated by HMM iteration. An initial alignment is created and turned
 286 into a HMM. This HMM can help in a new round of MSA to 'anticipate'
 287 where residues should align. This is using the HMM as an External
 288 Profile and carrying out iterative EPA.  In practice, individual
 289 sequences and profiles are aligned to the External HMM, derived after
 290 the initial alignment. Pseudo-count information is then transferred to
 291 the (internal) HMM, corresponding to the individual
 292 sequence/profile. The now somewhat 'softened' sequences/profiles are
 293 then in turn aligned in the order specified by the guide
 294 tree. Pseudo-count transfer is reduced with the size of the
 295 profile. Individual sequences attain the greatest pseudo-count
 296 transfer, larger profiles less so. Pseudo-count transfer to profiles
 297 larger than, say, 10 is negligible. The effect of HMM iteration is
 298 more pronounced in larger test sets (that is, with more sequences).
 299
Both HMM- and guide tree-iteration come at the cost of increased
run-time. One round of guide tree iteration adds on (roughly) the time
it took to construct the initial alignment. If, for example, the
initial alignment took 1min, then it will take (roughly) 2min to
iterate the guide tree once, 3min to iterate the guide tree twice, and
so on. HMM-iteration is more costly, as each round of iteration adds
three times the time required for the alignment stage. For example, if
the initial alignment took 1min, then each additional round of HMM
iteration will add on 3min; so 4 iterations will take 13min
(=1min+4*3min). The factor of 3 stems from the fact that at every
stage both intermediate profiles have to be aligned with the
background HMM, and finally the (softened) HMMs have to be aligned as
well. All times are quoted for single processors.
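
The rough run-time arithmetic above can be written down as two small
helper functions. They only restate the rules of thumb from this
section (one extra alignment time per guide tree iteration, three per
HMM iteration); the function names are made up for this example and
the results are estimates, not measurements.

```python
def guide_tree_iteration_time(t_initial, n_iterations):
    """Estimated total time with guide tree iteration only."""
    return t_initial * (1 + n_iterations)


def hmm_iteration_time(t_initial, n_iterations):
    """Estimated total time with HMM iteration only."""
    return t_initial * (1 + 3 * n_iterations)


# Initial alignment of 1 minute, as in the text above.
print(guide_tree_iteration_time(1, 2))  # 3 (minutes)
print(hmm_iteration_time(1, 4))         # 13 (minutes)
```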
 313
By default, guide tree iteration and HMM-iteration are coupled. This
means that at each iteration step both the guide tree and the HMM are
re-calculated. This is invoked by setting the --iter flag. For
example, if --iter=1, then first an initial alignment is produced
(without external HMM background information and using k-tuple
distances to calculate the guide tree). This initial alignment is then
used to re-calculate a new guide tree (using Kimura distances) and to
create a HMM. The new guide tree and the HMM are then used to produce
a new MSA.
 323
Iteration of guide tree and HMM can be de-coupled. This means that the
number of guide tree iterations and HMM iterations can be
different. This can be done by combining the --iter flag with the
--max-guidetree-iterations and/or the --max-hmm-iterations flag.  The
number of guide tree iterations is the minimum of --iter and
--max-guidetree-iterations, while the number of HMM iterations is the
minimum of --iter and --max-hmm-iterations.  If, for example, HMM
iteration should be performed 5 times but guide tree iteration should
be performed only 3 times, then one should set --iter=5 and
--max-guidetree-iterations=3. All three flags can be specified at the
same time, although there is little reason to do so. It is not
sufficient to specify only --max-guidetree-iterations and
--max-hmm-iterations without --iter. If any iteration is desired,
--iter has to be set.
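
The "minimum" rule for de-coupled iteration can be checked with a few
lines of code. The function effective_iterations is hypothetical; it
merely applies the rule stated above, treating an unset --iter as 0
and an unset maximum as "no additional cap".

```python
def effective_iterations(iterations=0, max_guidetree=None, max_hmm=None):
    """Return (guide tree iterations, HMM iterations) actually performed."""
    guidetree = iterations if max_guidetree is None else min(iterations, max_guidetree)
    hmm = iterations if max_hmm is None else min(iterations, max_hmm)
    return guidetree, hmm


# --iter=5 --max-guidetree-iterations=3
print(effective_iterations(5, max_guidetree=3))             # (3, 5)
# Only the maximum flags, without --iter: no iteration at all.
print(effective_iterations(0, max_guidetree=3, max_hmm=5))  # (0, 0)
```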
 337
 338
 339 LIMITS (will exit early, if exceeded):
 340
 341   --maxnumseq=<n>           Maximum allowed number of sequences
 342
 343   --maxseqlen=<l>           Maximum allowed sequence length
 344
Limits can be imposed on the number of sequences in the input file
and/or the lengths of the sequences. These caps can be set with the
--maxnumseq and --maxseqlen flags, respectively. Clustal-Omega will
exit early if these limits are exceeded.
 349
 350
 351 MISCELLANEOUS:
 352
 353   --auto                    Set options automatically (might overwrite some of your options)
 354
 355   --threads=<n>             Number of processors to use
 356
 357   -l, --log=<file>          Log all non-essential output to this file
 358
 359   -h, --help                Print help and exit
 360
 361   -v, --verbose             Verbose output (increases if given multiple times)
 362
 363   --version                 Print version information and exit
 364
 365   --long-version            Print long version information and exit
 366
 367   --force                   Force file overwriting
 368
 369
Users may feel unsure which options are appropriate in certain
situations, even though running Clustal-Omega without any special
options should give the desired results. The --auto flag tries to
alleviate this problem and selects accuracy/speed flags according to
the number of sequences. In all cases --auto selects mBed and may
thereby override the --full option. For more than 1,000 sequences
iteration is turned off as the effect of iteration is more
noticeable for 'larger' problems. Otherwise iterations are set to 1 if
not already set to a higher value by the user. Expert users may want
to avoid this flag and exercise more fine-tuned control by selecting
the appropriate options manually.
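
The behaviour of --auto as described in the paragraph above can be
summarised in code. This is a reading of the README text only, not of
the program's source; the function apply_auto and the keys of the
returned dictionary are invented for this sketch.

```python
def apply_auto(n_sequences, iterations=0, full=False):
    """Return the settings --auto would select, as described above."""
    settings = {
        "full": False,                  # mBed is used in all cases
        "full_overridden": bool(full),  # a user-supplied --full is overridden
    }
    if n_sequences > 1000:
        settings["iterations"] = 0      # iteration is turned off
    else:
        settings["iterations"] = max(iterations, 1)
    return settings


print(apply_auto(500))
print(apply_auto(5000, iterations=3, full=True))
```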
 381
Certain parts of the MSA calculation have been parallelised, most
notably the distance matrix calculation and certain aspects of the
HMM building stage. Clustal-Omega uses OpenMP. By default,
Clustal-Omega will attempt to use as many threads as possible. For
example, on a 4-core machine Clustal-Omega will attempt to use 4
threads. The number of threads can be limited by setting the --threads
flag. This may be desirable, for example, in the case of
benchmarking/timing.
 390
 391 Usually, non-essential (verbose) output is written to screen. This
 392 output can be written to file by specifying the --log flag.
 393
 394 Help is available by specifying the -h flag.
 395
By default Clustal-Omega does not print any information to stdout
(other than the final alignment, if no output file is specified).
Information concerning the progress of the alignment can be obtained
by specifying one verbosity flag (-v). This may be desirable to
verify what Clustal-Omega is actually doing at the moment. If two
verbosity flags (-v -v) are specified, command-line flags (explicitly
and implicitly set) are printed in addition to the progress report.
Triple verbose level (-v -v -v) is the most verbose level. In
addition to single- and double-verbose information, much more
information is displayed: input sequences and names, details of the
tree construction and intermediate alignments. Tree construction
information includes pairwise distances. The number of pairwise
distances scales with the square of the number of sequences, so
triple verbose mode is probably only useful for a small number of
sequences.
 411
 412 The current version number of Clustal-Omega can be displayed by
 413 setting the --version flag.
 414
 415 The current version number of Clustal-Omega as well as the code-name
 416 and the build date can be displayed by setting the --long-version
 417 flag.
 418
 419 By default, Clustal-Omega does not over-write files. These can be (i)
 420 alignment output, (ii) distance matrix and (iii) guide
 421 tree. Overwriting can be forced by setting the --force flag.
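
A wrapper script can mimic the overwrite protection described above
before launching a long run. The sketch below is an assumption about
how such a check might look in user code (the function check_outputs
is not part of Clustal-Omega); it refuses to continue if any output
path already exists, unless force is set.

```python
import os


def check_outputs(paths, force=False):
    """Raise if an output path exists and overwriting is not forced.

    paths -- output file names (alignment, distance matrix, guide tree);
             None entries are skipped.
    Returns the list of existing paths that would be overwritten.
    """
    existing = [p for p in paths if p is not None and os.path.exists(p)]
    if existing and not force:
        raise FileExistsError("refusing to overwrite: " + ", ".join(existing))
    return existing


# A path that is assumed not to exist passes the check.
print(check_outputs(["no_such_alignment_file.aln"]))
```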
 422
 423
 424 EXAMPLES:
 425
 426 ./clustalo -i globin.fa
 427
 428 Clustal-Omega reads the sequence file globin.fa, aligns the sequences
 429 and prints the result to screen in fasta/a2m format.
 430
 431
 432 ./clustalo -i globin.fa -o globin.sto --outfmt=st
 433
 434 If the file globin.sto does not exist, then Clustal-Omega reads the
 435 sequence file globin.fa, aligns the sequences and prints the result to
 436 globin.sto in Stockholm format. If the file globin.sto does exist
 437 already, then Clustal-Omega terminates the alignment process before
 438 reading globin.fa.
 439
 440
 441 ./clustalo -i globin.fa -o globin.aln --outfmt=clu --force
 442
 443 Clustal-Omega reads the sequence file globin.fa, aligns the sequences
 444 and prints the result to globin.aln in Clustal format, overwriting the
 445 file globin.aln, if it already exists.
 446
 447
./clustalo -i globin.fa --full --distmat-out=globin.mat --guidetree-out=globin.dnd --force

Clustal-Omega reads the sequence file globin.fa, aligns the sequences,
prints the result to screen in fasta/a2m format (default), the guide
tree to globin.dnd and the distance matrix to globin.mat, overwriting
those files if they already exist. The --full flag is included
because distance matrix output is only possible in --full mode.
 454
 455
 456 ./clustalo -i globin.fa --guidetree-in=globin.dnd
 457
 458 Clustal-Omega reads the files globin.fa and globin.dnd, skipping
 459 distance calculation and guide tree creation, using instead the guide
 460 tree specified in globin.dnd.
 461
 462
 463 ./clustalo -i globin.fa --hmm-in=PF00042.hmm
 464
 465 Clustal-Omega reads the sequence file globin.fa and the HMM file
 466 PF00042.hmm (in HMMer2 or HMMer3 format).  It then performs the
 467 alignment, transferring pseudo-count information contained in
 468 PF00042.hmm to the sequences/profiles during the MSA.
 469
 470
 471 ./clustalo -i globin.sto
 472
 473 Clustal-Omega reads the file globin.sto (of aligned sequences in
 474 Stockholm format). It converts the alignment into a HMM, de-aligns the
 475 sequences and re-aligns them, transferring pseudo-count information to
 476 the sequences/profiles during the MSA. The guide tree is constructed
 477 using a full distance matrix of Kimura distances.
 478
 479
 480 ./clustalo -i globin.sto  --dealign
 481
Clustal-Omega reads the file globin.sto (of aligned sequences in
Stockholm format). It de-aligns the sequences and then re-aligns
them. No HMM is produced in the process and no pseudo-count
information is transferred. Consequently, the output must be the
same as for unaligned input (as in the first example,
./clustalo -i globin.fa).
 487
 488
 489 ./clustalo -i globin.fa --iter=2
 490
 491 Clustal-Omega reads the file globin.fa, creates a UPGMA guide tree
 492 built from k-tuple distances, and performs an initial alignment. This
 493 initial alignment is converted into a HMM and a new guide tree is
 494 built from the Kimura distances of the initial alignment. The
 495 un-aligned sequences are then aligned (for the second time but this
 496 time) using pseudo-count information from the HMM created after the
 497 initial alignment (and using the new guide tree). This second
 498 alignment is then again converted into a HMM and a new guide tree is
 499 constructed. The un-aligned sequences are then aligned (for a third
 500 time), again using pseudo-count information of the HMM from the
 501 previous step and the most recent guide tree. The final alignment is
 502 written to screen.
 503
 504
 505 ./clustalo -i globin.fa --iter=5 --max-guidetree-iterations=1
 506
 507 Clustal-Omega reads the file globin.fa, creates a UPGMA guide tree
 508 built from k-tuple distances, and performs an initial alignment. This
 509 initial alignment is converted into a HMM and a new guide tree is
 510 built from the Kimura distances of the initial alignment. The
 511 un-aligned sequences are then aligned (for the second time but this
 512 time) using pseudo-count information from the HMM created after the
 513 initial alignment (and using the new guide tree). For the last 4
 514 iterations the guide tree is left unchanged and only HMM iteration is
 515 performed. This means that intermediate alignments are converted to
 516 HMMs, and these intermediate HMMs are used to guide the MSA during
 517 subsequent iteration stages.
 518
 519
 520 ./clustalo -i globin.fa -o globin.a2m -v
 521
 522 In case the file globin.a2m does not exist, Clustal-Omega reads the
 523 file globin.fa, prints a progress report to screen and writes the
 524 alignment in (default) Fasta format to globin.a2m. The progress report
 525 consists of the number of threads used, the number of sequences read,
 526 the current progress in the k-tuple distance calculation, completion
 527 of the guide tree computation and current progress of the MSA stage.
 528 If the file globin.a2m already exists Clustal-Omega aborts before
 529 reading the file globin.fa. Note that in verbose mode an output file
 530 has to be specified, because progress/debugging information, which is
 531 printed to screen, would interfere with the alignment being printed to
 532 screen.
 533
 534
 535 ./clustalo -i PF00042_full.fa --dealign --full --outfmt=vie -o PF00042_full.vie --force
 536
 537 Clustal-Omega reads the file PF00042_full.fa. This file contains
 538 several thousand aligned sequences. --dealign tells Clustal-Omega to
 539 erase all alignment information and re-align the sequences from
 540 scratch. As there are several thousand sequences calculating a full
 541 distance matrix may be slow. Setting the --full flag specifically
 542 selects the full distance mode over the default mBed mode. The
 543 alignment is then written out in Vienna format (fasta format all on
 544 one line, no line breaks per sequence) to file PF00042_full.vie.
 545
 546
 547 ./clustalo -i PF00042_full.fa --dealign --outfmt=vie -o PF00042_full.vie --force
 548
 549 Clustal-Omega reads the file PF00042_full.fa. This file contains
 550 several thousand aligned sequences. --dealign tells Clustal-Omega to
 551 erase all alignment information and re-align the sequences from
 552 scratch. Calculating the distance matrix will be done by mBed
 553 (default). Clustal-Omega will calculate pairwise distances to a
 554 small number of reference sequences only. This will give a significant
 555 speed-up. The speed-up is greater for larger families (more
 556 sequences). The alignment is then written out in Vienna format (fasta
 557 format all on one line, no line breaks per sequence) to file
 558 PF00042_full.vie.
 559
 560
 561 ./clustalo --p1=globin.sto --p2=PF00042_full.vie -o globin+pf00042.fa
 562
Clustal-Omega reads files globin.sto and PF00042_full.vie of aligned
sequences (profiles). The two profiles are then aligned to each
other. The relative positions of residues within each profile are not
changed during this alignment; however, columns of gaps may be
inserted into either profile. The final alignment is written to file
globin+pf00042.fa in fasta format.
 569
 570
 571 ./clustalo -i globin.fa --p1=PF00042_full.vie -o pf00042+globin.fa
 572
Clustal-Omega reads file globin.fa of un-aligned sequences and the
profile (of aligned sequences) in file PF00042_full.vie. A HMM is
created from the profile. This HMM is used to guide the alignment of
the un-aligned sequences in globin.fa. The profile that was generated
during this alignment of un-aligned globin.fa sequences is then
aligned to the input profile PF00042_full.vie. The relative positions
of residues in profile PF00042_full.vie are not changed during this
alignment; however, columns of gaps may be inserted into the
profile. The final alignment is written to file pf00042+globin.fa in
fasta format. The alignment in this example may be slightly different
from the alignment in the previous example, because no HMM guidance
was used to generate the profile globin.sto. In this example HMM
guidance was used to align the sequences in globin.fa, the hope being
that this intermediate alignment will have profited from the bigger
profile.
 587
 588
 589
 590 LITERATURE:
 591
[1] Söding J. (2005). Protein homology detection by HMM-HMM
    comparison. Bioinformatics, 21(7), 951-960.

[2] Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG. (2010).
    Sequence embedding for fast construction of guide trees for
    multiple sequence alignment. Algorithms Mol Biol., 5, 21.

[3] http://www.genetics.wustl.edu/eddy/software/#squid

[4] Wilbur and Lipman, 1983; PMID 6572363

[5] Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving
    the sensitivity of progressive multiple sequence alignment through
    sequence weighting, position-specific gap penalties and weight
    matrix choice. Nucleic Acids Res., 22, 4673-4680.

[6] Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA,
    McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD,
    Gibson TJ, Higgins DG. (2007). Clustal W and Clustal X version
    2.0. Bioinformatics, 23, 2947-2948.

[7] Kimura M. (1980). A simple method for estimating evolutionary
    rates of base substitutions through comparative studies of
    nucleotide sequences. J Mol Evol., 16, 111-120.

[8] Edgar RC. (2004). MUSCLE: multiple sequence alignment with high
    accuracy and high throughput. Nucleic Acids Res., 32(5), 1792-1797.
 619