1 .TH "hmmbuild" 1 @RELEASEDATE@ "HMMER @RELEASE@" "HMMER Manual"
5 hmmbuild - build a profile HMM from an alignment
16 reads a multiple sequence alignment file
18 , builds a new profile HMM, and saves the HMM in
23 may be in ClustalW, GCG MSF, SELEX, Stockholm, or aligned FASTA
24 alignment format. The format is automatically detected.
27 By default, the model is configured to find one or more
28 nonoverlapping alignments to the complete model: multiple
29 global alignments with respect to the model, and local with
30 respect to the sequence.
32 is analogous to the behavior of the
35 To configure the model for multiple
38 with respect to the model and local with respect to
44 (fragment) option. More rarely, you may want to
45 configure the model for a single
46 global alignment (global with respect to both
47 model and sequence), using the
50 or to configure the model for a single local/local alignment
51 (a la standard Smith/Waterman, or the old
61 Configure the model for finding multiple domains per sequence,
62 where each domain can be a local (fragmentary) alignment. This
63 is analogous to the old
69 Configure the model for finding a single global alignment to
70 a target sequence, analogous to
77 Print brief help; includes version number and summary of
78 all options, including expert options.
85 can be any string of non-whitespace characters (i.e., one "word").
86 There is no length limit (at least not one imposed by HMMER;
87 your shell will complain about command line lengths first).
91 Re-save the starting alignment to
94 The columns which were assigned to match states will be
95 marked with x's in an #=RF annotation line.
100 construction options were chosen, the alignment may have
101 been slightly altered to be compatible with Plan 7 transitions,
102 so saving the final alignment and comparing to the
103 starting alignment can let you view these alterations.
104 See the User's Guide for more information on this arcane
109 Configure the model for finding a single local alignment per
110 target sequence. This is analogous to the standard Smith/Waterman
117 Append this model to an existing
121 Useful for building HMM libraries (like Pfam).
125 Force overwriting of an existing
127 Otherwise HMMER will refuse to clobber your existing HMM files,
134 Force the sequence alignment to be interpreted as amino acid
135 sequences. Normally HMMER autodetects whether the alignment is
136 protein or DNA, but sometimes alignments are so small that
137 autodetection is ambiguous. See
142 Set the "architecture prior" used by MAP architecture construction to
146 is a probability between 0 and 1. This parameter governs a geometric
147 prior distribution over model lengths. As
149 increases, longer models are favored a priori.
152 decreases, it takes more residue conservation in a column to
153 make a column a "consensus" match column in the model architecture.
154 The 0.85 default has been chosen empirically as a reasonable setting.
160 in HMMER binary format instead of readable ASCII text.
164 Save the observed emission and transition counts to
166 after the architecture has been determined (i.e., after residues/gaps
167 have been assigned to match, delete, and insert states).
168 This option is used in HMMER development for generating data files
169 useful for training new Dirichlet priors. The format of
170 count files is documented in the User's Guide.
174 Quickly and heuristically determine the architecture of the model by
175 assigning all columns with more than a certain fraction of gap
176 characters to insert states. By default this fraction is 0.5, and it
177 can be changed using the
180 The default construction algorithm is a maximum a posteriori (MAP)
181 algorithm, which is slower.
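
The gap-fraction rule described above can be illustrated with a minimal
Python sketch. This is not HMMER's code: the function name
assign_columns, the gap characters "-" and ".", and the toy alignment
are assumptions made for illustration only.
.nf
```python
def assign_columns(alignment, gapmax=0.5, gap_chars="-."):
    """Label each alignment column 'match' or 'insert' by its gap fraction.

    A column becomes an insert column only when its fraction of gap
    characters is strictly greater than gapmax (assumed reading of
    "more than a certain fraction").
    """
    nseq = len(alignment)
    ncol = len(alignment[0])
    labels = []
    for col in range(ncol):
        gaps = sum(1 for seq in alignment if seq[col] in gap_chars)
        labels.append("insert" if gaps / nseq > gapmax else "match")
    return labels


# Hypothetical four-sequence alignment, five columns wide.
example = ["ACD-F", "AC--F", "A---F", "ACDEF"]
labels = assign_columns(example)
print(labels)
```
.fi
In this toy alignment only the fourth column (three gaps out of four
sequences) exceeds the 0.5 default, so it is the only insert column.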
187 model construction algorithm, but if
189 is not being used, has no effect.
190 If a column has more than a fraction
192 of gap symbols in it, it gets assigned to an insert column.
194 is a frequency from 0 to 1, and by default is set
195 to 0.5. Higher values of
197 mean more columns get assigned to consensus, and models get
198 longer; smaller values of
200 mean fewer columns get assigned to consensus, and models get
206 Specify the architecture of the model by hand: the alignment file must
207 be in SELEX or Stockholm format, and the reference annotation
208 line (#=RF in SELEX, #=GC RF in Stockholm) is used to specify
209 the architecture. Any column marked with a non-gap symbol (such
210 as an 'x', for instance) is assigned as a consensus (match) column in
215 Controls both the determination of effective sequence number and
218 weighting option. The sequence alignment is clustered by percent
219 identity, and the number of clusters at a cutoff threshold of
221 is used to determine the effective sequence number.
224 give more clusters and higher effective sequence
225 numbers; lower values of
227 give fewer clusters and lower effective sequence numbers.
229 is a fraction from 0 to 1, and
230 by default is set to 0.62 (corresponding to the clustering level used
231 in constructing the BLOSUM62 substitution matrix).
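
As a rough illustration of counting clusters at an identity cutoff, here
is a minimal Python sketch. It assumes single-linkage clustering and
defines identity over columns where both sequences have a residue;
HMMER's actual clustering procedure and identity definition may differ.
The names percent_identity and count_clusters are hypothetical.
.nf
```python
def percent_identity(a, b, gap_chars="-."):
    """Fraction of identical residues over columns where neither is a gap."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def count_clusters(alignment, idlevel=0.62):
    """Count single-linkage clusters at the given identity threshold."""
    parent = list(range(len(alignment)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            if percent_identity(alignment[i], alignment[j]) >= idlevel:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(alignment))})


# Hypothetical alignment: two near-identical sequences and one outlier.
seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
print(count_clusters(seqs))                # cluster count at the 0.62 default
print(count_clusters(seqs, idlevel=0.95))  # a stricter cutoff splits more
```
.fi
Raising the threshold from 0.62 to 0.95 separates the two similar
sequences, so the cluster count (and hence the effective sequence
number) goes up.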
234 .BI --informat " <s>"
235 Assert that the input
239 do not run Babelfish format autodetection. This increases
240 the reliability of the program somewhat, because
241 the Babelfish can make mistakes; particularly
242 recommended for unattended, high-throughput runs
243 of HMMER. Valid format strings include FASTA,
244 GENBANK, EMBL, GCG, PIR, STOCKHOLM, SELEX, MSF,
245 CLUSTAL, and PHYLIP. See the User's Guide for a complete
250 Turn off the effective sequence number calculation, and use the
251 true number of sequences instead. This will usually reduce the
252 sensitivity of the final model (so don't do it without good reason!)
256 Force the alignment to be interpreted as nucleic acid sequence,
257 either RNA or DNA. Normally HMMER autodetects whether the alignment is
258 protein or DNA, but sometimes alignments are so small that
259 autodetection is ambiguous. See
264 Read a null model from
266 The default for protein is to use average amino acid frequencies from
267 Swissprot 34 and p1 = 350/351; for nucleic acid, the default is
268 to use 0.25 for each base and p1 = 1000/1001. For documentation
269 of the format of the null model file and further explanation
270 of how the null model is used, see the User's Guide.
274 Apply a heuristic PAM- (substitution matrix-) based prior on match
275 emission probabilities instead of
276 the default mixture Dirichlet. The substitution matrix is read
282 The default Dirichlet state transition prior and insert emission prior
283 are unaffected. Therefore in principle you could combine
287 but this isn't recommended, as it hasn't been tested. (
289 itself hasn't been tested much!)
293 Controls the weight on a PAM-based prior. Only has effect if
295 option is also in use.
297 is a positive real number, 20.0 by default.
299 is the number of "pseudocounts" contributed by the heuristic
300 prior. Very high values of
302 can force a scoring system that is entirely driven by the
303 substitution matrix, making
304 HMMER somewhat approximate Gribskov profiles.
307 .BI --pbswitch " <n>"
308 For alignments with a very large number of sequences,
309 the GSC, BLOSUM, and Voronoi weighting schemes are slow;
310 they're O(N^2) for N sequences. Henikoff position-based
311 weights (PB weights) are more efficient. At or above a certain
312 threshold sequence number
315 will switch from GSC, BLOSUM, or Voronoi weights to
316 PB weights. To disable this switching behavior (at the cost
319 to be something larger than the number of sequences in
322 is a positive integer; the default is 1000.
326 Read a Dirichlet prior from
328 replacing the default mixture Dirichlet.
329 The format of prior files is documented in the User's Guide,
330 and an example is given in the Demos directory of the HMMER
335 Controls the total probability that is distributed to local entries
336 into the model, versus starting at the beginning of the model
337 as in a global alignment.
339 is a probability from 0 to 1, and by default is set to 0.5.
342 mean that hits that are fragments on their left (N or 5'-terminal) side will be
343 penalized less, but complete global alignments will be penalized more.
346 mean that fragments on the left will be penalized more, and
347 global alignments on this side will be favored.
348 This option only affects the configurations that allow local
354 unless one of these options is also activated, this option has no effect.
355 You have independent control over local/global alignment behavior for
356 the N/C (5'/3') termini of your target sequences using
363 Controls the total probability that is distributed to local exits
364 from the model, versus ending an alignment at the end of the model
365 as in a global alignment.
367 is a probability from 0 to 1, and by default is set to 0.5.
370 mean that hits that are fragments on their right (C or 3'-terminal) side will be
371 penalized less, but complete global alignments will be penalized more.
374 mean that fragments on the right will be penalized more, and
375 global alignments on this side will be favored.
376 This option only affects the configurations that allow local
382 unless one of these options is also activated, this option has no effect.
383 You have independent control over local/global alignment behavior for
384 the N/C (5'/3') termini of your target sequences using
391 Print additional, possibly useful information, such as the individual scores for
392 each sequence in the alignment.
396 Use the BLOSUM filtering algorithm to weight the sequences,
397 instead of the default.
398 Cluster the sequences at a given percentage identity
401 assign each cluster a total weight of 1.0, distributed equally
402 amongst the members of that cluster.
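
The weight assignment described above (each cluster gets a total weight
of 1.0, shared equally by its members) can be sketched in Python as
follows. The function name blosum_weights and the cluster labels are
hypothetical; the clustering step itself is assumed to have been done
already.
.nf
```python
from collections import Counter


def blosum_weights(cluster_ids):
    """Give each sequence a weight of 1 / (size of its cluster)."""
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]


# Hypothetical cluster labels for six sequences in three clusters.
weights = blosum_weights([0, 0, 1, 2, 2, 2])
print(weights)
```
.fi
The weights sum to the number of clusters, because every cluster
contributes exactly 1.0 in total.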
407 Use the Gerstein/Sonnhammer/Chothia ad hoc sequence weighting
408 algorithm. This is already the default, so this option has no effect
409 (unless it follows another option in the --w family, in which case it
414 Use the Krogh/Mitchison maximum entropy algorithm to "weight"
415 the sequences. This supersedes the Eddy/Mitchison/Durbin
416 maximum discrimination algorithm, which gives almost
417 identical weights but is less robust. ME weighting seems
418 to give a marginal increase in sensitivity
419 over the default GSC weights, but takes a fair amount of time.
423 Turn off all sequence weighting.
427 Use the Henikoff position-based weighting scheme.
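
For readers unfamiliar with position-based weights, here is a minimal
Python sketch of the general Henikoff scheme: in each column, a residue
type seen in n sequences, in a column with k distinct residue types,
contributes 1/(k*n) to each of those sequences. Skipping gap characters
and rescaling the weights to sum to the number of sequences are
assumptions of this sketch, not documented HMMER behavior; the function
name is hypothetical.
.nf
```python
from collections import Counter


def position_based_weights(alignment, gap_chars="-."):
    """Henikoff-style position-based sequence weights (illustrative)."""
    nseq = len(alignment)
    weights = [0.0] * nseq
    for col in zip(*alignment):
        counts = Counter(r for r in col if r not in gap_chars)
        k = len(counts)  # number of distinct residue types in this column
        for i, r in enumerate(col):
            if r not in gap_chars:
                weights[i] += 1.0 / (k * counts[r])
    total = sum(weights)
    if total == 0:
        # Degenerate all-gap alignment: fall back to uniform weights.
        return [1.0] * nseq
    # Assumed normalization: weights sum to the number of sequences.
    return [w * nseq / total for w in weights]


# Two identical sequences and one divergent sequence (toy data).
pb = position_based_weights(["AAA", "AAA", "CCC"])
print(pb)
```
.fi
Each column is processed once, so the cost grows linearly with the
number of sequences rather than quadratically. In the toy data, the
divergent sequence receives twice the weight of each duplicate.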
431 Use the Sibbald/Argos Voronoi sequence weighting algorithm
432 in place of the default GSC weighting.
437 Master man page, with full list of and guide to the individual man
441 A User's Guide and tutorial come with the distribution:
447 Finally, all documentation is also available online via WWW:
448 .B http://hmmer.wustl.edu/
452 This software and documentation is:
455 HMMER - Biological sequence analysis with profile HMMs
456 Copyright (C) 1992-1999 Washington University School of Medicine
459 This source code is distributed under the terms of the
460 GNU General Public License. See the files COPYING and LICENSE
463 See the file COPYING in your distribution for complete details.
467 HHMI/Dept. of Genetics
468 Washington Univ. School of Medicine
470 St Louis, MO 63110 USA
471 Phone: 1-314-362-7666
473 Email: eddy@genetics.wustl.edu