HMMER 2.2 release notes
http://hmmer.wustl.edu/
SRE, Fri May  4 13:00:33 2001
---------------------------------------------------------------

As it has been more than 2 years since the last HMMER release, this is
unlikely to be a comprehensive list of changes.

HMMER is now maintained under CVS. Anonymous read-only access to the
development code is permitted. To download the current snapshot:
    > setenv CVSROOT :pserver:anonymous@skynet.wustl.edu:/repository/sre
    > cvs login
      [password is "anonymous"]
    > cvs checkout hmmer
    > cd hmmer
    > cvs checkout squid
    > cvs logout

The following programs were added to the distribution:

   - The program "afetch" can fetch an alignment from
     a Stockholm format multiple alignment database (e.g. Pfam).
     "afetch --index" creates the index files for such
     a database.

   - The program "shuffle" makes "randomized" sequences.
     It supports a variety of sequence randomization methods,
     including an implementation of Altschul/Erickson's
     shuffling-while-preserving-digram-composition algorithm.

   - The program "sindex" creates SSI indices from sequence
     files, which "sfetch" can use to rapidly retrieve sequences
     from databases. Previously, index files were constructed
     with Perl scripts that were not supported as part of the
     HMMER distribution.
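The idea behind indexed retrieval, as used by "sindex" and "sfetch",
can be sketched in a few lines of Python. This is only an illustration
under simplifying assumptions: it records the byte offset of each FASTA
record in an in-memory dictionary and does not read or write the real
SSI file format. The names build_index and fetch are hypothetical and
are not part of HMMER.

```python
import io

def build_index(handle):
    """Map each FASTA record name to the byte offset of its '>' line."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()          # position before reading the line
        line = handle.readline()
        if not line:                    # end of file
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:                  # first word of the header is the name
                index[fields[0].decode()] = offset
    return index

def fetch(handle, index, name):
    """Seek straight to one record and return (header, sequence)."""
    handle.seek(index[name])
    header = handle.readline().decode().rstrip()
    residues = []
    for line in handle:
        if line.startswith(b">"):       # next record begins, stop here
            break
        residues.append(line.decode().strip())
    return header, "".join(residues)

# A tiny in-memory "database" standing in for a sequence file.
demo = io.BytesIO(b">seq1 first\nACDE\nFGHI\n>seq2 second\nKLMN\n")
idx = build_index(demo)
print(fetch(demo, idx, "seq2"))
```

The point of the index is that retrieval cost no longer depends on how
far into the file a sequence sits: one seek replaces a linear scan.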

The following features were added:

   - hmmsearch and hmmpfam can now use Pfam GA, TC, NC cutoffs,
     if these have been picked up in the HMM file (by hmmbuild).
     See the --cut_ga, --cut_tc, and --cut_nc options.

   - "Stockholm format" alignments are supported, and have replaced
     SELEX format as the default alignment format. Stockholm format
     is the alignment format agreed upon by the Pfam Consortium,
     providing extensible markup and annotation capabilities. HMMER
     writes Stockholm format alignments by default. The program
     sreformat can reformat alignments to other formats, including
     Clustal and GCG MSF formats.

   - To improve robustness, particularly in high-throughput annotation
     pipelines, all programs now accept an option --informat <s>,
     where <s> is the name of a sequence file format (FASTA, for
     example). The format autodetection code that is used by default
     is almost always right, and is very helpful in interactive use
     (HMMER reads almost anything without you worrying much about
     format issues). --informat bypasses the autodetector, asserts
     a particular format, and decreases the likelihood that HMMER
     misparses a sequence file.

   - new options:
     hmmpfam --acc reports HMM accession numbers instead of
     HMM names in output files. [Pfam infrastructure]

     sreformat --nogap, when reformatting an alignment,
     removes all columns containing any gap symbols; useful
     as a prefilter for phylogenetic analysis.

   - The real software version of HMMER is logged into
     the HMMER2.0 line of ASCII save files, for better
     version control (e.g. bug tracking, but there are
     no bugs in HMMER).

   - GCG MSF format reading/writing is now much more robust,
     thanks to assistance from Steve Smith at GCG.

   - The PVM implementation of hmmcalibrate is now
     parallelized in a finer-grained fashion; single models
     can be accelerated. (The previous version parallelized
     by assigning models to processors, so could not
     accelerate a single model calibration.)

   - hmmemit can now take HMM libraries as input, not just
     a single HMM at a time - useful for instance for producing
     "consensus sequences" for every model in Pfam with one
     command.
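What "sreformat --nogap" does to an alignment can be illustrated with
a short Python sketch. It is not the sreformat code: the set of gap
symbols below is an assumption, and strip_gapped_columns is a
hypothetical helper that works on plain strings rather than on an
alignment file.

```python
# Assumed gap symbols; the real program may recognize a different set.
GAP_CHARS = set("-._~")

def strip_gapped_columns(aligned):
    """Return the rows with every gap-containing column removed."""
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("aligned sequences must have equal length")
    # Keep a column only if no row has a gap symbol in it.
    keep = [i for i in range(width)
            if not any(row[i] in GAP_CHARS for row in aligned)]
    return ["".join(row[i] for i in keep) for row in aligned]

rows = ["AC-DEF",
        "ACGD.F",
        "A-GDEF"]
print(strip_gapped_columns(rows))
```

Only columns 1, 4 and 6 are gap-free in all three rows, so each row
is reduced to "ADF".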

The following changes may affect HMMER-compatible software:

   - The name of the sequence retrieval program "getseq" was
     changed to "sfetch" in this release. The name "getseq"
     clashes with a Genetics Computer Group package program
     of similar functionality.

   - The output format for the headers of hmmsearch and hmmpfam
     was changed. The accessions and descriptions of query
     HMMs or sequences, respectively, are reported on separate
     lines. An option ("--compat") is provided for reverting
     to the previous format, if you don't want to rewrite your
     parser(s) right away.

   - hmmpfam now calculates E-values based on the actual
     number of HMMs in the database that is searched, unless
     overridden with the -Z option from the command line.
     It used to use Z=59021 semi-arbitrarily to make results
     jibe with a typical hmmsearch, but this just confused
     people more than it helped. hmmpfam E-values will therefore
     become more significant in this release by about 37x,
     for a typical Pfam search (59021/1600 = 37).
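The arithmetic behind the 37x figure can be checked with a small
Python sketch. It assumes that an E-value is the per-comparison
P-value multiplied by the search space size Z; the P-value used here
is a made-up number for illustration, and evalue is a hypothetical
helper, not a HMMER function.

```python
def evalue(pvalue, z):
    """Expected number of hits this good in a search of Z comparisons."""
    return pvalue * z

OLD_Z = 59021           # fixed Z previously used by hmmpfam
TYPICAL_PFAM_Z = 1600   # models in a typical Pfam database, per the text

p = 1e-6                # hypothetical P-value for one hit
old = evalue(p, OLD_Z)
new = evalue(p, TYPICAL_PFAM_Z)
print(old, new, old / new)
```

The same hit is reported with an E-value roughly 37 times smaller
under the new default, so any E-value thresholds tuned against older
hmmpfam output may need revisiting, or Z can be pinned with -Z.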

The following major bugs were fixed:
    [none]

The following minor bugs were fixed:
   - more argument casting to silence compiler warnings
     [M. Regelson, Paracel ]

   - a potential reentrancy problem with setting the
     alphabet type in the threads version was
     fixed, but this problem is unlikely to have ever affected
     anyone. [M. Sievers, Paracel].

   - fixed a bug where hmmbuild on Solaris machines would crash
     when presented with an alignment with an #=ID line.
     Same bug caused a crash when building a model from a single
     sequence FASTA file [A. Bateman, Sanger]

   - The configure script was modified to deal better with
     different vendors' implementations of pthreads, in response
     to a DEC Digital UNIX compilation problem [W. Pearson,
     U. Virginia]

   - Automatic sequence file format detection was slightly
     improved, fixing a bug in detecting GCG-reformatted
     Swissprot files [reported by J. Holzwarth]

   - hmmpfam-pvm and hmmindex had a bad interaction if an HMM file had
     accession numbers as well as names (e.g., Pfam). The phenotype was
     that hmmpfam-pvm would search each model twice: once for its name,
     and once for its accession. hmmindex now uses a new
     indexing scheme (SSI, replacing GSI). [multiple reports;
     often manifested as a failure of the StL Pfam server to
     install, because of an hmmindex --one2one option in the Makefile; this was
     a local hack, never distributed in HMMER].

   - a rare floating exception bug in ExtremeValueP() was fixed;
     range-checking protections in the function were in error, and
     a range error in a log() calculation appeared on
     Digital Unix platforms for a *very* tiny set of scores
     for any given mu, lambda.

   - The default null2 score correction was applied in
     a way that was justifiable, but differed between per-seq
     and per-domain scores; thus per-domain scores did not
     necessarily add up to per-seq scores. In certain cases
     this produced counterintuitive results. null2 is now
     applied in a way that is still justifiable, and also
     consistent; per-domain scores add up to the per-seq score.
     [first reported by David Kerk]

   - --domE and --domT did not work correctly in hmmpfam, because
     the code assumed that E-values are monotonic with score.
     In some cases, this could cause HMMER to fail to report some
     significant domains. [Christiane VanSchlun, GCG]
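To show the kind of range checking the ExtremeValueP() item refers to,
here is a minimal Python sketch of an extreme value (Gumbel) tail
probability, P(S >= x) = 1 - exp(-exp(-lambda * (x - mu))), with
guards on both tails. It is not the HMMER source: the thresholds are
assumptions chosen for IEEE double precision, and extreme_value_p is
a hypothetical name.

```python
import math
import sys

def extreme_value_p(x, mu, lam):
    """P(S >= x) for a Gumbel distribution, guarded against range errors."""
    z = lam * (x - mu)
    # Far left tail: P equals 1.0 to within double precision, and
    # exp(-z) could overflow, so return early.
    if z <= -math.log(-math.log(sys.float_info.epsilon)):
        return 1.0
    # Far right tail: exp(-z) is below the smallest normal double,
    # so P is effectively 0.
    if z >= 2.3 * sys.float_info.max_10_exp:
        return 0.0
    y = math.exp(-z)
    # -expm1(-y) is 1 - exp(-y) computed without losing precision
    # when y is tiny.
    return -math.expm1(-y)

print(extreme_value_p(-1000.0, 0.0, 1.0))  # guarded: exp(1000) never evaluated
print(extreme_value_p(10.0, 0.0, 1.0))
```

Without the first guard, a very low score would make math.exp() raise
OverflowError in Python, the analogue of the floating exception
described above.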

The following obscure bugs were fixed (i.e., there were no reports of
anyone but me detecting these bugs):

  - sreformat no longer core dumps when reformatting a
    single sequence to an alignment format.

  - Banner() was printing a line to stdout instead of its
    file handle... but Banner is always called w/ stdout as
    its filehandle in the current implementation.
    [M. Regelson, Paracel]

  - .gz file reading is only supported on POSIX operating systems.
    A compile time define, SRE_STRICT_ANSI, may be defined to allow
    compiling on ANSI compliant but non-POSIX operating systems.

  - Several problems with robustness w.r.t. unexpected
    combinations of command line options were detected by
    GCG quality control testing. [Christiane VanSchlun]

(At least) the following projects remain incomplete:

  - Ian Holmes' posterior probability routines (POSTAL) are
    partially assimilated; see postprob.c, display.c

  - CPU times can now be reported for serial, threaded,
    and PVM executions; this is only supported by hmmcalibrate
    right now.

  - Mixture Dirichlet priors now include some ongoing work
    in collaboration with Michael Asman and Erik Sonnhammer
    in Stockholm; also #=GC X-PRM, X-PRT, X-PRI support in
    hmmbuild/Stockholm annotation.