   1 .TH "hmmbuild" 1 @RELEASEDATE@ "HMMER @RELEASE@" "HMMER Manual"
   2
   3 .SH NAME
   4 .TP
   5 hmmbuild - build a profile HMM from an alignment
   6
   7 .SH SYNOPSIS
   8 .B hmmbuild
   9 .I [options]
  10 .I hmmfile
  11 .I alignfile
  12
  13 .SH DESCRIPTION
  14
  15 .B hmmbuild
  16 reads a multiple sequence alignment file
   17 .I alignfile,
   18 builds a new profile HMM, and saves the HMM in
  19 .I hmmfile.
  20
  21 .PP
  22 .I alignfile
  23 may be in ClustalW, GCG MSF, SELEX, Stockholm, or aligned FASTA
  24 alignment format. The format is automatically detected.
  25
  26 .PP
  27 By default, the model is configured to find one or more
  28 nonoverlapping alignments to the complete model: multiple
  29 global alignments with respect to the model, and local with
  30 respect to the sequence.
  31 This
  32 is analogous to the behavior of the
  33 .B hmmls
  34 program of HMMER 1.
  35 To configure the model for multiple
  36 .I local
  37 alignments
  38 with respect to the model and local with respect to
  39 the sequence,
  40 a la the old program
  41 .B hmmfs,
  42 use the
  43 .B -f
  44 (fragment) option. More rarely, you may want to
  45 configure the model for a single
  46 global alignment (global with respect to both
  47 model and sequence), using the
  48 .B -g
  49 option;
  50 or to configure the model for a single local/local alignment
  51 (a la standard Smith/Waterman, or the old
  52 .B hmmsw
  53 program), use the
  54 .B -s
  55 option.
  56
  57 .SH OPTIONS
  58
  59 .TP
  60 .B -f
  61 Configure the model for finding multiple domains per sequence,
  62 where each domain can be a local (fragmentary) alignment. This
  63 is analogous to the old
  64 .B hmmfs
  65 program of HMMER 1.
  66
  67 .TP
  68 .B -g
  69 Configure the model for finding a single global alignment to
  70 a target sequence, analogous to
  71 the old
  72 .B hmms
  73 program of HMMER 1.
  74
  75 .TP
  76 .B -h
  77 Print brief help; includes version number and summary of
  78 all options, including expert options.
  79
  80 .TP
  81 .BI -n " <s>"
  82 Name this HMM
  83 .I <s>.
  84 .I <s>
   85 can be any string of non-whitespace characters (i.e. one "word").
  86 There is no length limit (at least not one imposed by HMMER;
  87 your shell will complain about command line lengths first).
  88
  89 .TP
  90 .BI -o " <f>"
  91 Re-save the starting alignment to
  92 .I <f>,
  93 in Stockholm format.
  94 The columns which were assigned to match states will be
   95 marked with x's in a #=GC RF annotation line.
  96 If either the
  97 .B --hand
  98 or
  99 .B --fast
 100 construction options were chosen, the alignment may have
 101 been slightly altered to be compatible with Plan 7 transitions,
 102 so saving the final alignment and comparing to the
 103 starting alignment can let you view these alterations.
 104 See the User's Guide for more information on this arcane
 105 side effect.
 106
 107 .TP
 108 .B -s
 109 Configure the model for finding a single local alignment per
 110 target sequence. This is analogous to the standard Smith/Waterman
 111 algorithm or the
 112 .B hmmsw
 113 program of HMMER 1.
 114
 115 .TP
 116 .B -A
 117 Append this model to an existing
 118 .I hmmfile
 119 rather than creating
 120 .I hmmfile.
 121 Useful for building HMM libraries (like Pfam).
 122
 123 .TP
 124 .B -F
 125 Force overwriting of an existing
 126 .I hmmfile.
 127 Otherwise HMMER will refuse to clobber your existing HMM files,
 128 for safety's sake.
 129
 130 .SH EXPERT OPTIONS
 131
 132 .TP
 133 .B --amino
 134 Force the sequence alignment to be interpreted as amino acid
 135 sequences. Normally HMMER autodetects whether the alignment is
 136 protein or DNA, but sometimes alignments are so small that
 137 autodetection is ambiguous. See
 138 .B --nucleic.
 139
 140 .TP
 141 .BI --archpri " <x>"
 142 Set the "architecture prior" used by MAP architecture construction to
 143 .I <x>,
 144 where
 145 .I <x>
 146 is a probability between 0 and 1. This parameter governs a geometric
 147 prior distribution over model lengths. As
 148 .I <x>
 149 increases, longer models are favored a priori.
 150 As
 151 .I <x>
 152 decreases, it takes more residue conservation in a column to
 153 make a column a "consensus" match column in the model architecture.
 154 The 0.85 default has been chosen empirically as a reasonable setting.
 155
 156 .TP
 157 .B --binary
 158 Write the HMM to
 159 .I hmmfile
 160 in HMMER binary format instead of readable ASCII text.
 161
 162 .TP
 163 .BI --cfile " <f>"
 164 Save the observed emission and transition counts to
 165 .I <f>
  166 after the architecture has been determined (i.e. after residues/gaps
 167 have been assigned to match, delete, and insert states).
 168 This option is used in HMMER development for generating data files
 169 useful for training new Dirichlet priors. The format of
 170 count files is documented in the User's Guide.
 171
 172 .TP
 173 .B --fast
 174 Quickly and heuristically determine the architecture of the model by
  175 assigning all columns with more than a certain fraction of gap
 176 characters to insert states. By default this fraction is 0.5, and it
 177 can be changed using the
 178 .B --gapmax
 179 option.
 180 The default construction algorithm is a maximum a posteriori (MAP)
 181 algorithm, which is slower.
 182
 183 .TP
 184 .BI --gapmax " <x>"
  185 Controls the
  186 .B --fast
  187 model construction algorithm; it has no effect unless
  188 .B --fast
  189 is also in use.
 190 If a column has more than a fraction
 191 .I <x>
 192 of gap symbols in it, it gets assigned to an insert column.
 193 .I <x>
 194 is a frequency from 0 to 1, and by default is set
 195 to 0.5. Higher values of
 196 .I <x>
 197 mean more columns get assigned to consensus, and models get
 198 longer; smaller values of
 199 .I <x>
 200 mean fewer columns get assigned to consensus, and models get
  201 smaller.
 203
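The gap-fraction rule described for --fast and --gapmax can be illustrated
with a minimal Python sketch. This is not HMMER's source code: the function
name assign_columns, the set of gap symbols, and the toy alignment are
assumptions made for illustration. A column is labeled insert (I) when its
gap fraction exceeds the threshold, and match (M) otherwise.

```python
# Assumed gap symbols; the real program may recognize a different set.
GAP_CHARS = set("-._~")

def assign_columns(alignment, gapmax=0.5):
    # alignment: list of equal-length aligned sequence strings.
    nseq = len(alignment)
    labels = []
    for col in zip(*alignment):
        gap_fraction = sum(ch in GAP_CHARS for ch in col) / nseq
        # More than gapmax gaps -> insert column, otherwise match column.
        labels.append("I" if gap_fraction > gapmax else "M")
    return "".join(labels)

toy = ["AC-GT", "AC-G-", "A--GT", "ACTG-"]
print(assign_columns(toy))        # MMIMM
print(assign_columns(toy, 0.2))   # MIIMI
print(assign_columns(toy, 0.8))   # MMMMM
```

In this toy case, raising the threshold to 0.8 turns the mostly-gap third
column into a match column, so the model gets longer; lowering it to 0.2
shortens the model.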
 204 .TP
 205 .B --hand
 206 Specify the architecture of the model by hand: the alignment file must
 207 be in SELEX or Stockholm format, and the reference annotation
 208 line (#=RF in SELEX, #=GC RF in Stockholm) is used to specify
 209 the architecture. Any column marked with a non-gap symbol (such
 210 as an 'x', for instance) is assigned as a consensus (match) column in
 211 the model.
 212
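As a rough illustration of how a reference line selects match columns for
--hand, the following Python sketch reads the #=GC RF annotation from a small
Stockholm-style alignment. The helper names rf_from_stockholm and
match_columns, the assumed gap symbols, and the toy alignment are all
hypothetical; HMMER's own parser is more thorough.

```python
# Assumed gap symbols for the reference line; anything else marks a match column.
RF_GAPS = set("-._~")

def rf_from_stockholm(lines):
    # Concatenate every "#=GC RF" line (an alignment may span several blocks).
    parts = []
    for line in lines:
        fields = line.split()
        if len(fields) == 3 and fields[0] == "#=GC" and fields[1] == "RF":
            parts.append(fields[2])
    return "".join(parts)

def match_columns(rf):
    # 0-based indices of columns that become consensus (match) columns.
    return [i for i, ch in enumerate(rf) if ch not in RF_GAPS]

toy = [
    "# STOCKHOLM 1.0",
    "seq1    ACD-EF",
    "seq2    AC--EF",
    "#=GC RF xxx.xx",
    "//",
]
rf = rf_from_stockholm(toy)
print(rf)                 # xxx.xx
print(match_columns(rf))  # [0, 1, 2, 4, 5]
```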
 213 .TP
 214 .BI --idlevel " <x>"
 215 Controls both the determination of effective sequence number and
 216 the behavior of the
  217 .B --wblosum
 218 weighting option. The sequence alignment is clustered by percent
 219 identity, and the number of clusters at a cutoff threshold of
 220 .I <x>
 221 is used to determine the effective sequence number.
 222 Higher values of
 223 .I <x>
 224 give more clusters and higher effective sequence
 225 numbers; lower values of
 226 .I <x>
 227 give fewer clusters and lower effective sequence numbers.
 228 .I <x>
 229 is a fraction from 0 to 1, and
 230 by default is set to 0.62 (corresponding to the clustering level used
 231 in constructing the BLOSUM62 substitution matrix).
 232
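The relationship between the identity cutoff and the effective sequence
number can be sketched in Python as follows. The details are assumptions made
for illustration, not a description of HMMER's code: pairwise identity is
computed over columns where both sequences have a residue, sequences are
joined by single linkage, and the function names are hypothetical.

```python
def percent_identity(a, b, gap_chars="-."):
    # Fraction of identical residues over columns where neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def cluster_by_identity(alignment, idlevel=0.62):
    # Single-linkage clustering via union-find: two sequences share a cluster
    # if a chain of pairs with identity >= idlevel connects them.
    n = len(alignment)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if percent_identity(alignment[i], alignment[j]) >= idlevel:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

toy = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEYGHIRV", "MNPQRSTVWY"]
for cutoff in (0.62, 0.85, 0.95):
    clusters = cluster_by_identity(toy, cutoff)
    print(cutoff, clusters, "effective sequence number:", len(clusters))
```

With this toy alignment the cluster count rises from 2 to 3 to 4 as the
cutoff increases, matching the trend described above.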
 233 .TP
 234 .BI --informat " <s>"
 235 Assert that the input
  236 .I alignfile
 237 is in format
 238 .I <s>;
  239 do not run Babelfish format autodetection. This increases
 240 the reliability of the program somewhat, because
 241 the Babelfish can make mistakes; particularly
 242 recommended for unattended, high-throughput runs
 243 of HMMER. Valid format strings include FASTA,
 244 GENBANK, EMBL, GCG, PIR, STOCKHOLM, SELEX, MSF,
 245 CLUSTAL, and PHYLIP. See the User's Guide for a complete
 246 list.
 247
 248 .TP
 249 .B --noeff
 250 Turn off the effective sequence number calculation, and use the
 251 true number of sequences instead. This will usually reduce the
 252 sensitivity of the final model (so don't do it without good reason!)
 253
 254 .TP
 255 .B --nucleic
 256 Force the alignment to be interpreted as nucleic acid sequence,
 257 either RNA or DNA. Normally HMMER autodetects whether the alignment is
 258 protein or DNA, but sometimes alignments are so small that
 259 autodetection is ambiguous. See
 260 .B --amino.
 261
 262 .TP
 263 .BI --null " <f>"
 264 Read a null model from
 265 .I <f>.
 266 The default for protein is to use average amino acid frequencies from
 267 Swissprot 34 and p1 = 350/351; for nucleic acid, the default is
 268 to use 0.25 for each base and p1 = 1000/1001. For documentation
 269 of the format of the null model file and further explanation
 270 of how the null model is used, see the User's Guide.
 271
 272 .TP
 273 .BI --pam " <f>"
 274 Apply a heuristic PAM- (substitution matrix-) based prior on match
 275 emission probabilities instead of
 276 the default mixture Dirichlet. The substitution matrix is read
 277 from
 278 .I <f>.
 279 See
 280 .B --pamwgt.
 281
 282 The default Dirichlet state transition prior and insert emission prior
 283 are unaffected. Therefore in principle you could combine
 284 .B --prior
 285 with
 286 .B --pam
  287 but this isn't recommended, as it hasn't been tested.
  288 .RB ( --pam
 289 itself hasn't been tested much!)
 290
 291 .TP
 292 .BI --pamwgt " <x>"
 293 Controls the weight on a PAM-based prior. Only has effect if
 294 .B --pam
 295 option is also in use.
 296 .I <x>
 297 is a positive real number, 20.0 by default.
 298 .I <x>
  299 is the number of "pseudocounts" contributed by the heuristic
 300 prior. Very high values of
 301 .I <x>
 302 can force a scoring system that is entirely driven by the
 303 substitution matrix, making
 304 HMMER somewhat approximate Gribskov profiles.
 305
 306 .TP
 307 .BI --pbswitch " <n>"
 308 For alignments with a very large number of sequences,
 309 the GSC, BLOSUM, and Voronoi weighting schemes are slow;
 310 they're O(N^2) for N sequences. Henikoff position-based
 311 weights (PB weights) are more efficient. At or above a certain
 312 threshold sequence number
  313 .I <n>,
 314 .B hmmbuild
 315 will switch from GSC, BLOSUM, or Voronoi weights to
 316 PB weights. To disable this switching behavior (at the cost
  317 of compute time), set
 318 .I <n>
 319 to be something larger than the number of sequences in
 320 your alignment.
 321 .I <n>
 322 is a positive integer; the default is 1000.
 323
 324 .TP
 325 .BI --prior " <f>"
 326 Read a Dirichlet prior from
 327 .I <f>,
 328 replacing the default mixture Dirichlet.
 329 The format of prior files is documented in the User's Guide,
 330 and an example is given in the Demos directory of the HMMER
 331 distribution.
 332
 333 .TP
 334 .BI --swentry " <x>"
 335 Controls the total probability that is distributed to local entries
 336 into the model, versus starting at the beginning of the model
 337 as in a global alignment.
 338 .I <x>
 339 is a probability from 0 to 1, and by default is set to 0.5.
 340 Higher values of
 341 .I <x>
 342 mean that hits that are fragments on their left (N or 5'-terminal) side will be
 343 penalized less, but complete global alignments will be penalized more.
 344 Lower values of
 345 .I <x>
 346 mean that fragments on the left will be penalized more, and
 347 global alignments on this side will be favored.
 348 This option only affects the configurations that allow local
 349 alignments,
 350 e.g.
 351 .B -s
 352 and
 353 .B -f;
 354 unless one of these options is also activated, this option has no effect.
 355 You have independent control over local/global alignment behavior for
 356 the N/C (5'/3') termini of your target sequences using
 357 .B --swentry
 358 and
 359 .B --swexit.
 360
 361 .TP
 362 .BI --swexit " <x>"
 363 Controls the total probability that is distributed to local exits
 364 from the model, versus ending an alignment at the end of the model
 365 as in a global alignment.
 366 .I <x>
 367 is a probability from 0 to 1, and by default is set to 0.5.
 368 Higher values of
 369 .I <x>
 370 mean that hits that are fragments on their right (C or 3'-terminal) side will be
 371 penalized less, but complete global alignments will be penalized more.
 372 Lower values of
 373 .I <x>
 374 mean that fragments on the right will be penalized more, and
 375 global alignments on this side will be favored.
 376 This option only affects the configurations that allow local
 377 alignments,
 378 e.g.
 379 .B -s
 380 and
 381 .B -f;
 382 unless one of these options is also activated, this option has no effect.
 383 You have independent control over local/global alignment behavior for
 384 the N/C (5'/3') termini of your target sequences using
 385 .B --swentry
 386 and
 387 .B --swexit.
 388
 389 .TP
 390 .B --verbose
 391 Print more possibly useful stuff, such as the individual scores for
 392 each sequence in the alignment.
 393
 394 .TP
 395 .B --wblosum
 396 Use the BLOSUM filtering algorithm to weight the sequences,
 397 instead of the default.
 398 Cluster the sequences at a given percentage identity
 399 (see
 400 .B --idlevel);
 401 assign each cluster a total weight of 1.0, distributed equally
 402 amongst the members of that cluster.
 403
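The weight assignment described for --wblosum can be sketched in a few lines
of Python. The function name cluster_weights is hypothetical, and the
clusters are assumed to be given as lists of 0-based sequence indices
produced by some earlier clustering step.

```python
def cluster_weights(clusters):
    # Each cluster gets a total weight of 1.0, split equally among its members.
    nseq = sum(len(members) for members in clusters)
    weights = [0.0] * nseq
    for members in clusters:
        for i in members:
            weights[i] = 1.0 / len(members)
    return weights

# Three similar sequences in one cluster, one outlier in its own cluster.
weights = cluster_weights([[0, 1, 2], [3]])
print(weights)
print(sum(weights))  # total weight equals the number of clusters (about 2.0)
```

The three redundant sequences together count as much as the single outlier,
which is the purpose of this kind of weighting.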
 404
 405 .TP
 406 .B --wgsc
 407 Use the Gerstein/Sonnhammer/Chothia ad hoc sequence weighting
 408 algorithm. This is already the default, so this option has no effect
 409 (unless it follows another option in the --w family, in which case it
 410 overrides it).
 411
 412 .TP
 413 .B --wme
 414 Use the Krogh/Mitchison maximum entropy algorithm to "weight"
  415 the sequences. This supersedes the Eddy/Mitchison/Durbin
 416 maximum discrimination algorithm, which gives almost
 417 identical weights but is less robust. ME weighting seems
 418 to give a marginal increase in sensitivity
 419 over the default GSC weights, but takes a fair amount of time.
 420
 421 .TP
 422 .B --wnone
 423 Turn off all sequence weighting.
 424
 425 .TP
 426 .B --wpb
 427 Use the Henikoff position-based weighting scheme.
 428
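For readers unfamiliar with position-based weights, here is a minimal Python
sketch of the general Henikoff scheme: in each column, a sequence receives
1/(k*n), where k is the number of distinct residue types in the column and n
is the number of sequences sharing its residue. How gaps are treated and how
the weights are normalized are assumptions here, and the function name is
hypothetical; this is not taken from HMMER's source.

```python
from collections import Counter

def position_based_weights(alignment, gap_chars="-."):
    nseq = len(alignment)
    weights = [0.0] * nseq
    for col in zip(*alignment):
        # Count residue types in this column, ignoring gaps (assumption).
        counts = Counter(ch for ch in col if ch not in gap_chars)
        k = len(counts)
        for i, ch in enumerate(col):
            if ch in counts:
                weights[i] += 1.0 / (k * counts[ch])
    total = sum(weights)
    if total == 0:
        return weights
    # Normalize so the weights sum to the number of sequences (assumption).
    return [w * nseq / total for w in weights]

toy = ["AAA", "AAA", "ACA"]
weights = position_based_weights(toy)
print(weights)  # the third sequence, with the unusual C, gets the largest weight
```

Each column is visited once per sequence, so the cost grows linearly with the
number of sequences rather than quadratically.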
 429 .TP
 430 .B --wvoronoi
 431 Use the Sibbald/Argos Voronoi sequence weighting algorithm
 432 in place of the default GSC weighting.
 433
 434 .SH SEE ALSO
 435
 436 .PP
 437 Master man page, with full list of and guide to the individual man
 438 pages: see
 439 .B hmmer(1).
 440 .PP
  441 A User's Guide and tutorial come with the distribution:
 442 .B Userguide.ps
 443 [Postscript] and/or
 444 .B Userguide.pdf
 445 [PDF].
 446 .PP
 447 Finally, all documentation is also available online via WWW:
 448 .B http://hmmer.wustl.edu/
 449
 450 .SH AUTHOR
 451
 452 This software and documentation is:
 453 .nf
 454 @COPYRIGHT@
 455 HMMER - Biological sequence analysis with profile HMMs
 456 Copyright (C) 1992-1999 Washington University School of Medicine
 457 All Rights Reserved
 458
 459     This source code is distributed under the terms of the
 460     GNU General Public License. See the files COPYING and LICENSE
 461     for details.
 462 .fi
 463 See the file COPYING in your distribution for complete details.
 464
 465 .nf
 466 Sean Eddy
 467 HHMI/Dept. of Genetics
 468 Washington Univ. School of Medicine
 469 4566 Scott Ave.
 470 St Louis, MO 63110 USA
 471 Phone: 1-314-362-7666
 472 FAX  : 1-314-362-7855
 473 Email: eddy@genetics.wustl.edu
 474 .fi
 475
 476