
   1
   2  $Name: fa_34_26_5 $ - $Id: readme.v33t0,v 1.45 2001/07/10 18:03:42 wrp Exp $
   3
   4 ================ readme.v33t0 ================
   5
   6 This release includes an MPI implementation of the parallel
   7 library-vs-library comparison code.  See readme.mpi_3.3 and
   8 readme.pvm_3.3 for more information.
   9
  10 =====
  11 >>July 9, 2001
  12
  13 Considerable changes to support no-global library functions.
  14
  15 (1) Separate ascii/sequence mapping arrays are used by the
  16     query-reading (qascii), library-reading (lascii), and sequence
  17     comparison function (pascii) routines.  As a result, there is no
  18     longer a need for tgetlib.o/lgetlib.o - lgetlib.o can serve both
  19     functions.
  20
  21 (2) This also allows us to remove all #ifdef TFAST/FASTX conditionals
  22     from complib.c/comp_thr.c/p2_complib.c.  We no longer need
  23     tcomp_thr.o, comp_thrx.o, etc.  We still have a variety of
  24     p2_complib.o variations to support the different c34.work* files.
  25
  26 (3) Because non-global openlib/getlib functions are available, exactly
  27     the same open/get functions are available for reading both the
  28     query and reference libraries in pv34comp* programs.  The
  29     host-specific openlib/getlib functions in hxgetaa.c are now
  30     provided by nmgetlib.c, etc. This has two effects:
  31
  32     (a) it is now possible to compare a query database generated by an
  33         SQL query to a library database generated by a different SQL
  34         query.
  35
  36     (b) pv34comp* has lost (at least in this version) the ability to
  37         automatically detect the query sequence type. To search with a
  38         DNA query, you MUST use "-n".
  39
  40 (4) The resetp() function is now responsible for almost all of the
  41     function-specific (TFAST/FASTX/etc) initializations.  All of the
  42     function-specific code has been removed from complib.c/comp_thr.c
  43     and most of it has been moved to initfa.c/resetp().
  44
  45 (5) manageacc.c has been merged into compacc.c (mostly prhist()).
  46
  47 (6) Although it may reflect a subtle bug in my code, it is not
  48     possible to reliably run threaded/memory mapped versions of the
  49     fasta34_t code.  I have spent considerable time tracking down the
  50     problem, and have determined that, in threaded code, something
  51     happens during the thread initialization to corrupt the
  52     description offset information used when files are memory mapped.
  53     This never occurs when the unthreaded versions of the code are
  54     used.  And it does not occur under MacOSX, Compaq Tru64Unix, Sun
  55     Solaris/Sparc, or SGI IRIX.
  56
  57     Thus, I cannot recommend using the threaded code versions (_t)
  58     under Linux (RH6.2 or 7.1).
  59
  60 =====
  61 >>June 1, 2001
  62
  63 Many changes to accommodate a new no-global-variable strategy for
  64 reading sequence databases.  Every time a file is opened, a struct
  65 lmf_str is allocated which can be used for memory mapped files, ncbl2
  66 files, and mysql files.
  67
  68 In addition, an open'ed file has a default sequence type: DNA or
  69 protein, or one can open a file in a mode that will allow the sequence
  70 type to be changed.
  71
  72 =====
  73 >>May 18, 2001          CVS: fa33t09d0
  74
  75 A new compile-time parameter, -DGAP_OPEN, is available to change the
  76 definition of the "-f gap-open" parameter from the penalty for the
  77 first residue in a gap to a true gap-open penalty, as is used in BLAST
  78 and many other comparison algorithms.  This will probably become the
  79 default for fasta in version 3.4.
  80
  81 Fixes to conflicts between "-S" and "-s matrix".  When a scoring
  82 matrix file was specified, lower-case alignments were not displayed
  83 with -S (although the scores were calculated properly).
  84
  85 More extensive testing of mysql_lib.c (mySQL query-libraries) with
  86 the pv4comp* and mp4comp* programs.
  87
  88 =====
  89 >>April 5, 2001         CVS: fa33t08d4b3
  90
  91 Changes in nmgetlib.c and ncbl2_mlib.c to return long sequence
  92 descriptions for PCOMPLIB (pv4/mp3comp*).  Also fix p2_complib.c to
  93 request DNA library for translated comparisons.
  94
  95 Fix for prss33(_t) to read both sequences from stdin.
  96
  97 =====
  98 >>March 27, 2001        CVS: fa33t08d4
  99  --> fa33t08d4
 100
 101 Problems in ncbl2_mlib.c found searching NCBI non-redundant nucleotide
 102 database "nt" were fixed.  Testing revealed a minor memory leak, which
 103 was fixed by modifying showbest.c, showalign.c, comp_thr.c, complib.c,
 104 and p2_complib.c to remember the last opened database file more
 105 effectively.
 106
 107 Modifications to allow 64-bit file offsets on machines, such as Sun and
 108 Linux/Intel, that support -D_FILE_OFFSET_BITS=64, -D_LARGEFILE_SOURCE,
 109 off_t, fseeko(), and ftello(); the option -DUSE_FSEEKO enables them.
 110 Machines with 64-bit longs do not need this option.  Machines with
 111 32-bit longs that allow files >2 Gb can read them with the 64-bit file
 112 access functions fseeko() and ftello(), which work with off_t file
 113 offsets instead of longs.
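
A minimal sketch of the off_t-based calls described above, assuming a
POSIX system.  The helper name seek_and_tell() is hypothetical and is
not taken from the FASTA sources; it only shows that the file position
travels as an off_t rather than a long.

```c
/* Must come before any #include so that off_t is 64 bits wide even
   where long is only 32 bits. */
#define _FILE_OFFSET_BITS 64
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* declares fseeko()/ftello() */
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper (not a FASTA function): move to an absolute byte
   offset and report where the stream ended up, or -1 on error. */
static off_t seek_and_tell(FILE *fp, off_t offset)
{
    assert(fp != NULL);
    if (fseeko(fp, offset, SEEK_SET) != 0)
        return (off_t)-1;
    return ftello(fp);
}
```

With fseek()/ftell() the same position would have to fit in a long,
which is the 2 Gb limit on machines with 32-bit longs.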
 114
 115 =====
 116 >>March 3, 2001         CVS: fa33t08d2
 117
 118 Corrected problems in nmgetaa.c and mysql_lib.c with parallel
 119 programs, and one serious problem with alternate DNA scoring matrices
 120 (initfa.c, initsw.c) not being set properly.  A subtle problem with
 121 the merge of scaleswn.c and scaleswg.c is fixed.
 122
 123 >>February 17, 2001
 124
 125 Modified mysql_lib.c to use "#", rather than "%ld", to indicate the
 126 position of the GID.  This change was made because sprintf() cannot be
 127 used reliably to generate an SQL string, as '"' and '%' are used in
 128 such strings.
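
The idea can be sketched as follows.  This is not the code in
mysql_lib.c; fill_gid() is a hypothetical helper, and replacing only
the first '#' is an assumption made for the example.  The point is that
the SQL template is copied byte for byte and is never used as a format
string, so any '%' or '"' it contains is harmless.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not the mysql_lib.c code: copy `tmpl` into `out`,
   replacing the first '#' with the decimal GID.  Returns 0 on success,
   -1 if there is no '#' or `out` is too small. */
static int fill_gid(const char *tmpl, long gid, char *out, size_t out_len)
{
    const char *hash;
    char num[32];
    size_t head, nlen, tail;

    assert(tmpl != NULL && out != NULL);
    hash = strchr(tmpl, '#');
    if (hash == NULL)
        return -1;

    /* Only the GID is formatted; the template is never a format string. */
    snprintf(num, sizeof num, "%ld", gid);
    head = (size_t)(hash - tmpl);      /* text before the '#' */
    nlen = strlen(num);
    tail = strlen(hash + 1);           /* text after the '#'  */
    if (head + nlen + tail + 1 > out_len)
        return -1;

    memcpy(out, tmpl, head);
    memcpy(out + head, num, nlen);
    memcpy(out + head + nlen, hash + 1, tail + 1);  /* copies the NUL too */
    return 0;
}
```

For example, the template "select gid, sequence from proteins where
gid=#;" with GID 12345 yields "... where gid=12345;".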
 129
 130 =====
 131 >>January 17, 2001
 132 (no version change, date change)
 133
 134 Minor fixes to initfa.c, initsw.c to deal with DNA scoring matrices
 135 properly. "-n -s dna.mat" is required for the sequence/matrix to be
 136 recognized as DNA.
 137
 138 >>January 16, 2001
 139 -->v34t00
 140
 141 Merge of the main CVS trunk - fa33t06 with the latest release branch,
 142 fa33t08.
 143
 144 In addition, PCOMPLIB mods have been made to mysql_lib.c.  Because
 145 p2_complib.c gets sequence description information during the first
 146 read of the database, the mysql_query must be changed to return:
 147 result[0]=GID, result[1]=description, result[2]=sequence.  In the
 148 PCOMPLIB case, the other SQL queries (for GID description, sequence)
 149 are not necessary but must still be provided.
 150
 151 =====
 152 >>January 16, 2001
 153 (no version change, previous version not released)
 154
 155 changes to p2_complib.c to correct openlib() incompatibility.
 156
 157 changes to nmgetaa.c, ncbl2_lib.c to incorporate PCOMPLIB.  nxgetaa.c
 158 removed.
 159
 160 =====
 161 >>January 12, 2001
 162 (no version change, previous version not released)
 163
 164 Change to initfa.c to move ktup check from query_parm() to last_init().
 165
 166 =====
 167 >>January 10, 2001
 168 --> v33t08
 169
 170 Fixes to complib.c, comp_thr.c to deal properly with long query
 171 protein sequences when a short library chunk (e.g. -N 5000) was given.
 172 In the case where the chunk size is too short, it will be reset to a
 173 length which allows the search to proceed, by including an amount of
 174 new sequence that is equal to the amount of overlap sequence.
 175
 176 scaleswn.c and scaleswg.c have been merged.
 177
 178 v33t08 includes the initial implementation for mySQL described below
 179 for v33t07x.
 180
 181 ======
 182 >>Dec. 20, 2000
 183 --> v33t07x
 184
 185 Initial implementation of a syntax for mySQL database queries.  A new
 186 file, mysql_lib.c has been added, and changes have been made to
 187 nmgetaa.c (which should now replace nxgetaa.c) and altlib.h.  A mySQL
 188 database search needs a file with 4 parts:
 189
 190 (1) description of the database, user, password
 191 (2) a select statement that generates the set of protein sequences
 192     as: UID, sequence
 193 (3) a select statement that generates a UID, description given a UID
 194 (4) a select statement that generates a single UID, sequence given a UID
 195
 196 Each of the four parts should be separated by ';'.  For example, in
 197 the database that we are using for testing, a file "demo.sql" that
 198 contains:
 199
 200 ================
 201 localhost taxonomy username secret;
 202 SELECT proteins.gid, proteins.sequence FROM proteins,swissprot WHERE proteins.gid=swissprot.gid AND swissprot.spid IS NOT NULL;
 203 select proteins.gid, concat(swissprot.spid," ",proteins.description) from proteins,swissprot where proteins.gid=%ld AND swissprot.gid=proteins.gid;
 204 select gid, sequence from proteins where gid=%ld;
 205 ================
 206
 207 will find all the proteins in the BLAST "nr" database that also have
 208 SwissProt ID's when given the command line:
 209
 210         fasta33 -q query.aa "demo.sql 16"
 211
 212 At least for simple queries, there is surprisingly little overhead for the
 213 search.  For more complex queries involving several tables, the overhead
 214 can be significant.
 215
 216 At the moment, libraries that need the functions in mysql_lib.c will
 217 use library type 16.  We may also use file type 17 for SQL queries
 218 that return binary sequences.
 219
 220 This implementation of mysql_lib.c was written to require a minimal
 221 amount of change to the other programs.  Only nmgetaa.c and altlib.h
 222 needed to be changed to incorporate this new capability.  One result
 223 of this limitation is that one cannot mix mySQL database queries with
 224 other databases in the same search.  Eventually, I would like to make
 225 a mySQL database like any other, so that several mysql database
 226 queries could be searched in the same run, and mysql databases could
 227 be mixed with other (flat file) databases, but this will require some
 228 changes in the function calls throughout the code.  (Right now, the
 229 various programs do not distinguish between an openlib() that is made
 230 before searching a large database, and one before retrieving a single
 231 sequence.  This must be changed for a database query like mySQL to
 232 behave like other databases.)
 233
 234 Several mySQL demo files have been provided: mysql_demo*.sql.
 235
 236 (10 January 2001) The mySQL code has been tested on Intel Linux and
 237 Compaq/Alpha/Tru64 Unix.
 238
 239 >>Dec. 9, 2000
 240
 241 Changes to apam.c to tie different default gap penalties to
 242 alternate scoring matrices.  In addition, changes to apam.c to deal
 243 with user-specified matrices with or without '*'.
 244
 245 >>Nov. 5, 2000 (date updated)
 246
 247 pst.dnaseq can now have 3 values: -1 or 0 -> protein, 1 -> DNA, and 2 -> other.
 248 This becomes important for things like init_karlin_a, which needs a
 249 background frequency of residues.
 250
 251 >>Nov. 1, 2000
 252
 253 Significant bug fixes for the -z 6/-z 16 option.  An uninitialized
 254 variable was fixed in karlin.c, and comp_thr.c did not pass the
 255 correct composition argument type in find_zp().  The -z 6/16 option
 256 has now been tested and works correctly on Alphas, Linux x86, SGI, Sun
 257 and Mac OSX. Another problem was fixed in scaleswn.c (simplex()) that
 258 prevented the code from being reused by the pv4/mp4 complib programs.
 259
 260 >>Oct. 9, 2000
 261
 262 Several changes made to accommodate Mac OSX.  Longer lists of superfamily
 263 numbers now supported in p[su]4comp/m[su]4comp programs.
 264
 265 >>Sept 25, 2000
 266
 267 All global variables have been removed from scaleswn.c. The last to
 268 go, db_struct db, required many edits, because until now, the fasta
 269 programs have kept two versions of the db_struct data (entries,
 270 length). One version was kept by the main program, which updated entry
 271 number and db length as sequences were read; a second copy of this
 272 information was kept by the statistical estimation routines.  Now
 273 there is only one copy, which means that the E() values will be a
 274 function of the complete database, not the database with some high
 275 scoring sequences removed.
 276
 277 >>Sept 23, 2000
 278
 279 Continued removal of global variables from scaleswn.c.  Only one
 280 global is left, db_struct db, which contains the number of entries in
 281 the database and the number of residues.  It will be the next to go
 282 (changing all the zs_to_*() functions) and scaleswn.c will be free
 283 of globals.  scaleswg.c is gone - scaleswn.c compiles to scaleswg.c
 284 with -DNORMAL_DIST.
 285
 286 >>Sept 20, 2000
 287
 288 Removal of histogram globals required changes in p2_complib.c as well.
 289 p_complib.c has not been updated.  scaleswg.c has been modified to
 290 reflect the new histogram strategy.
 291
 292 >>Sept 19, 2000
 293
 294 Substantial changes to remove globals for printing histogram.  m_msg
 295 now contains a hist_str, which keeps histogram information.
 296
 297 >>Sept. 19, 2000
 298 (no version change, previous version not released)
 299
 300 Correct bug introduced into scaleswn.c (inithist()) by changing
 301 score2_sums[], score_sums[] from int to double.
 302
 303 Reporting of version numbers is more consistent between fasta33,
 304 fasta33_t, and pv4compfa/mp4compfa.  The programs now report the same
 305 numbers/dates in similar places.
 306
 307 >>Sept. 15, 2000
 308 --> v33t07
 309
 310 Changes to fix problems with statistical estimates when a large
 311 fraction (but not all) of the database is related.  Several users
 312 reported problems when searching with rRNA genes with version 33t06.
 313 In some cases, a 100% identical match over 1500 nt would not be
 314 statistically significant against a search of the bacterial division
 315 of Genbank.  This problem was not seen with some releases of v33t05.
 316
 317 The cause of the problem was a change between v33t05 and v33t06 to
 318 allow scoring matrices with unusual scaling to be used.  In v33t05,
 319 there was a line that excluded all scores > 300 from the statistical
 320 estimation procedure.  While 300 is a high score with any "normal"
 321 scoring matrix, some investigators were using matrices scaled 10X, so
 322 that a score of 300 was really a score of 30 with a conventional
 323 matrix, and should not be excluded.  Unfortunately, removing the test
 324 to exclude scores > 300 meant that when a rRNA sequence was used to
 325 search the bacterial division, tens of thousands of high scoring
 326 related sequences were treated as if they were unrelated, with the
 327 result that the variance estimates were much too high, and thus high
 328 real scores had low z-scores, and thus were not statistically
 329 significant.  (There appear to be more than 20,000 rRNA sequences in
 330 the bacterial division of Genbank, almost 25% of all sequences).
 331
 332 The solution to the problem is a substantial enhancement in the
 333 strategies used to exclude high-scoring, related sequences, the -z 1,
 334 4, and 5 parameter estimation strategies.  The programs now estimate
 335 the expected high scoring sequence by calculating an ungapped Lambda
 336 and K, and then use a relatively conservative threshold for excluding
 337 scores that are higher than would be expected 0.01 times by chance.
 338 By calculating Lambda and K, we can scale the cutoff thresholds to
 339 allow scoring matrices with unusual scales.  For "normal" searches,
 340 there should be little change, but there should be an improvement for
 341 searches with large numbers of related sequences in the database.
 342
 343 As a result of testing for this change, a bug in the karlin() function
 344 used with -z 6 was found and corrected.
 345
 346 =======
 347 >>Sept. 9, 2000
 348
 349 Changes to manshowbest.c to include correct display coordinates.
 350
 351 Significant changes to structs.h, param.h, p2_complib.c,
 352 p2_workcomp.c, to store and use a reliable a_struct for alignment
 353 coordinates.
 354
 355 Other cosmetic changes.
 356
 357 >>Sept. 7, 2000
 358
 359 Minor changes to complib.c, showrss.c, so that prss33 -q uses 200
 360 shuffles and prss33 provides bit scores, rather than z-scores.
 361 (no version number change).
 362
 363 Modifications to p2_complib.c to include superfamily numbers for
 364 ps4comp* ms4comp*.
 365
 366 >>Aug 22, 2000
 367
 368 Changes to mmgetaa.c, ncbl2_mlib.c, dropfs.c to accommodate AIX.
 369 00README.1st updated to reflect the current version and correct
 370 outdated information on threads.
 371
 372 >>Aug. 3, 2000
 373
 374 Modifications to initpam2() in initsw.c to correct a problem with pam_x
 375 when the -S option is used.
 376
 377 Modifications to compacc.c, scaleswn.c to ensure that residue numbers
 378 are calculated properly when more than 2 Gb of sequence is searched.
 379
 380 >>July 12, 2000
 381
 382 Modifications to dropnfa.c so that DNA matches to 'N' will be included
 383 in the "ungapped %identity".  Thus, a sequence that is 100% identical
 384 for 100 nt on either side of a 100 nt region that has been masked to
 385 'NNNNN' will be reported as: "67% identical (100% ungapped)".  This
 386 has been added to deal with masked BAC-end databases.  It would be
 387 better if masking changed the letters to lowercase, but the mouse
 388 BAC-end sequences at TIGR use 'NNNNN'.  This is currently available
 389 only for the fasta function, not [t]fast[x/y], etc, and only for DNA
 390 sequences.
 391
 392 mk_n_pam() in apam.c modified to ensure that mismatch scores of -1
 393 remain -1.
 394
 395 >>June 25, 2000
 396
 397 Modification to nxgetaa.c, nmgetaa.c, mmgetaa.c to return Genbank Accession
 398 number as part of the descriptive string.
 399
 400 >>June 11, 2000
 401
 402 (no version change - not yet released)
 403
 404 Modifications to calcons(), calc_id(), showbest(), p_workcomp.c to
 405 provide ngap_q (number of alignment gaps in query) , ngap_l (number
 406 of gaps in library) information for -m 9 output.
 407
 408 >>June 6, 2000
 409
 410 (no version change - not yet released)
 411
 412 Modified scaleswn.c to provide better support for unconventional
 413 scoring matrices, in particular, scoring matrices where every
 414 value is 50-times higher.  Previous versions of the MLE estimator (-z
 415 2) started with lambda = 0.2, which is too high for a scoring matrix
 416 going from -500:+1500. The initial estimate for lambda is now
 417 calculated using the formula: lambda = pi/sqrt(6*variance).  For the
 418 default -z 1, a restriction to limit scores to a maximum of 300 for
 419 the statistical analysis was removed.
 420
 421 >>June 3, 2000
 422
 423 Modified alignment output, and -m 9 and -m 10, to report an "ungapped"
 424 identity as well as the traditional "gapped" identity.  The
 425 traditional "gapped" identity reports the number of identities divided
 426 by the overall length of the alignment, including gaps.  The
 427 "ungapped" identity does not include gaps in the length of the
 428 alignment.  This new value is included for alignments that include
 429 introns; thus, a tfastx33 search might find the 100% identical genomic
 430 sequence but report the gapped percent identity if a short intron were
 431 included in the alignment (the alignment probably would not span a
 432 long exon) as 66%.  The "ungapped" identity would remain 100%.  The
 433 ungapped identity value is also shown in the "-m 9" output line after
 434 the "gapped" fraction identical.
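
The two definitions can be sketched as follows.  This is not
calc_id(); percent_identity() and struct identity are hypothetical
names, and the alignment is assumed to be given as two rows of equal
length with '-' marking a gap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct identity {
    double gapped;     /* identities / alignment length, gaps included */
    double ungapped;   /* identities / aligned columns without a gap   */
};

/* Hypothetical helper, not calc_id(). */
static struct identity percent_identity(const char *a, const char *b)
{
    struct identity id = {0.0, 0.0};
    size_t len = strlen(a), i, ident = 0, gaps = 0;

    assert(strlen(b) == len);
    for (i = 0; i < len; i++) {
        if (a[i] == '-' || b[i] == '-')
            gaps++;                    /* column contains a gap */
        else if (a[i] == b[i])
            ident++;                   /* identical residues    */
    }
    if (len > 0)
        id.gapped = 100.0 * (double)ident / (double)len;
    if (len > gaps)
        id.ungapped = 100.0 * (double)ident / (double)(len - gaps);
    return id;
}
```

For the rows "MKVL--AG" and "MKVLTTAG" there are 6 identities in 8
columns, 2 of which are gaps, so the gapped identity is 75% and the
ungapped identity is 100%.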
 435
 436 >>June 1, 2000
 437
 438 Modified -m 9 output to provide fraction identical, alignment boundary
 439 information with the initial list of high scoring sequences, just as
 440 the pv3comp and mp_comp versions do.  The -m 9 option now shows the
 441 same alignment display as -m 0, but the width of the alignment is
 442 increased by 40.  Thus, by default, -m 9 will show the list of best
 443 hits, with percent identity, Smith-Waterman score, and alignment
 444 boundaries initially, and then show standard (-m 0) alignments
 445 with 100 residues/line.
 446
 447 >>May 29, 2000
 448
 449 Correct some problems with reading data files with <CR>'s under unix.
 450
 451 nmgetaa.c/nxgetaa.c/mmgetaa.c have been modified to convert <TAB>
 452 ('\t') to <SPC> (' ') in descriptive lines.
 453
 454 =======
 455
 456 >>May 3, 2000
 457
 458   Corrected problem with very low mean_var in fit_llen() in scaleswn.c.
 459
 460 >>May 2, 2000
 461   (no version number change - previous version not released)
 462
 463   Merged fasta33t05d2 with fasta33t06.  Also removed restriction on
 464 "-M size-range" to proteins - the size range now can be applied to DNA
 465 as well.
 466
 467 >>May 1, 2000
 468  (changes to v33t05d merged into v33t06)
 469
 470 Introduced changes to include '*' as a valid sequence character, which
 471 indicates termination.  Thus, 'TGA', 'TAG', and 'TAA' are now
 472 translated to '*' rather than 'X', and the protein PAM matrices have
 473 been modified to provide a match score of approximately 1/2 the max
 474 identity score for a '*:*' match.  Otherwise, '*' is the same as 'X'.
 475 This change only affects query sequences that include a '*' to
 476 indicate an end of sequence; the '*' is not there by default.
 477
 478 The inclusion of '*' broke some things in tfasts33, tfastf33, fasty33,
 479 and tfasty33, which were fixed today.
 480
 481 >>March 28, 2000/April 24, 2000
 482  --> v33t06
 483
 484 (a) -z 6 statistics that factor in composition
 485 (b) -smatrix-offset pam-offset parameter
 486
 487 (a) This release provides a new statistics option, -z 6, which
 488 provides a more sophisticated model that accounts for sequence
 489 composition.  When -z 6 is used (only for fasta33(_t) and
 490 ssearch33(_t)), the program calculates a composition parameter
 491 comp=1/lambda using a modified version of the Karlin-Altschul karlin()
 492 function.  As a result, every sequence in the database has an
 493 associated length (n1) and composition (comp).
 494
 495 The length n1 and composition comp are used in the maximum likelihood
 496 estimation described by Mott (1992) Bull. Math. Biol. 54:59-75.  Four
 497 parameters are estimated, a0, a1, a2, and b1, and the probability of
 498 obtaining a score is then:
 499
 500 p(s >= x) = 1-exp(-exp(-( a0 + a1*comp + a2*comp*log(n0*n1) + x)/(b1*comp)))
 501
 502 The maximum likelihood estimates of a0, a1, a2, and b1 are calculated
 503 using the Nelder-Mead simplex search strategy.
 504
 505 The average Lambda is reported for the search using Lambda =
 506 1/(b1*ave_comp), where ave_comp is the geometric mean of the comp values
 507 calculated during the statistical estimates.
 508
 509 The "lambda/comp" calculation can fail for sequences with very biased
 510 amino acid composition.  When this occurs, 'comp' is set to -1.0 (as
 511 is 'H', the information content parameter) and the 'ave_comp' value is
 512 used to calculate statistical significance.  (But obviously 'ave_comp'
 513 is not really appropriate, since if the sequence had an average 'comp'
 514 value, it would have been calculated.)  When -z 6 is used, the
 515 alignment display shows the 'comp' and 'H' values for that library
 516 sequence.
 517
 518 (b) Scoring matrix offsets - The main reason that the "lambda/comp"
 519 calculation fails is that, for the particular query/library sequence
 520 pair, the expected score is not < 0, instead, Sum {p_ij S_ij} >= 0.0.
 521 This problem is reported to 'stderr' when it occurs.  The simplest
 522 solution to the problem is to provide an offset to the scoring matrix;
 523 for example, to use Blosum62 - 1, which ranges from +10 to -5, rather
 524 than the standard +11 to -4.  This option used to be available with
 525 the -S offset option, but -S is now used to specify a lower-case
 526 seg-ed database.  The offset can now be specified as part of the
 527 scoring matrix name.  Thus, "-s BL62-1" uses Blosum62 reduced by 1 at
 528 each entry.  The '-' character is used to indicate an offset, so
 529 scoring matrix files must not have a '-' in their name.
 530 Alternatively, "-s BL80+1" or "-s BL80--1" would add one to each value.
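
The naming rule can be sketched as follows.  This is not the FASTA
option-parsing code; split_matrix_offset() is a hypothetical helper
that treats the first '+' or '-' in the argument as the start of the
offset, which is why a matrix file name cannot contain a '-'.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the matrix name (the text before the first
   '+' or '-') into `name` and return the offset, or 0 if none is given. */
static int split_matrix_offset(const char *arg, char *name, size_t name_len)
{
    size_t split, n;
    int value;

    assert(arg != NULL && name != NULL && name_len > 0);
    split = strcspn(arg, "+-");               /* length of the name part */
    n = split < name_len ? split : name_len - 1;
    memcpy(name, arg, n);
    name[n] = '\0';

    if (arg[split] == '\0')
        return 0;                             /* plain matrix name */

    value = atoi(arg + split + 1);            /* number after the sign */
    return arg[split] == '-' ? -value : value;
}
```

With this rule "BL62-1" gives the name BL62 and an offset of -1, while
"BL80+1" and "BL80--1" both give BL80 and an offset of +1.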
 531
 532 nxgetaa.c, nmgetaa.c, and mmgetaa.c have been edited to avoid string
 533 run-off problems after strncpy().
 534
 535 Fixed problem where positive gap extension penalties in ssearch33
 536 were not converted to negative values.
 537
 538 >>April 8, 2000
 539
 540 Fixed problem in calculating corrected sequence lengths for
 541 Altschul-Gish probabilities.
 542
 543 >>March 30, 2000
 544   (no version change, date updated to March 30, 2000)
 545
 546 Corrected problem with -m 9 option.
 547
 548 The '*' character is now available to allow translated alignments to
 549 extend through the termination codon. Thus, if a protein sequence ends
 550 with a '*', and matches into a translated termination codon, the
 551 score will be increased.  The *:* match score is set to 1/2 the max
 552 positive score for the matrix (see upam.h).  This strategy can also be
 553 used to upweight a match that extends all the way to the end of a
 554 full-length sequence by putting '*' at the end of both the query and
 555 library protein sequences.  Recognition of '*' will probably become a
 556 command line option.
 557
 558 >>March 21, 2000
 559   (no version change, previous version not distributed)
 560
 561 Changes to map_db.c, list_db.c, and mmgetaa.c to accommodate large
 562 sequence files.  Long (64-bit on some systems) variables are now used
 563 to specify file and memory position for the memory mapped functions.
 564 As a result, there are now two *.xin (memory mapped index) file
 565 formats: MP0, which uses 32-bit longs, and MP1, which uses 64-bit
 566 longs. On 64-bit machines, MP0 32-bit indices are read properly, but
 567 limit the database size to 2 or 4 Gb; MP1 64-bit indices allow very
 568 large databases.  Blast2.0 formatdb databases are still limited to
 569 4Gb.  To compile map_db.c to generate 64-bit index files, include the
 570 compile time option -DBIG_LIB64 in the Makefile.  (Currently this
 571 option has been tested only on the DEC Alpha and SGI platforms, and
 572 will work only with Unix versions that provide 64-bit longs and 64-bit
 573 ftell()'s.)
 574
 575 The -R results file now uses sfn_cmp() to report a matching
 576 superfamily number, if one exists, and '0' otherwise.
 577
 578 >>March 12, 2000
 579   (no version change, previous version not distributed)
 580
 581 Provide new strategy for specifying library abbreviations.  In
 582 addition to:
 583
 584         fasta33 query.aa %anr
 585
 586 one can also specify:
 587
 588         fasta33 query.aa %pir1+sp+nr
 589 or
 590         fasta33 query.aa +pir1+sp+nr
 591 or
 592         fasta33 query.aa %+pir1+sp+nr
 593
 594 where the + anywhere in the library name string indicates that
 595 variable length library names, separated by '+', are being used (the
 596 last '+' is optional).  The FASTLIBS file then becomes:
 597
 598 ================
 599 PIR1 Annotated Protein Database (rel 56)$0+pir1+/slib2/blast/pir1.lseg
 600 NBRF Protein database (complete)$0+nbrf+@/seqlib/lib/NBRF.nam
 601 NRL_3d structure database$0D/seqlib/lib/nrl_3d.seq 5
 602 NCBI/Blast non-redundant proteins$0+nr+/slib2/blast/nr.lseg
 603 NCBI/Blast Swissprot$0+sp+/slib2/blast/swissprot.lseg
 604 ================
 605
 606 The two abbreviation types, single letter and +word+, cannot be
 607 intermixed, and at least initially, +word+ specifiers are
 608 case-sensitive (single letter abbreviations are not) and will not be
 609 available interactively, only on the command line.
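
A sketch of how the two specifier styles could be told apart.  This is
not the FASTA parser; split_lib_abbr(), MAX_ABBR, and ABBR_LEN are
hypothetical names and limits chosen for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_ABBR 16   /* hypothetical limit on abbreviations per run */
#define ABBR_LEN 32   /* hypothetical limit on abbreviation length   */

/* Hypothetical helper: split a library specifier into abbreviations
   and return how many were found. */
static int split_lib_abbr(const char *spec, char out[][ABBR_LEN], int max_out)
{
    int n = 0;

    assert(spec != NULL && out != NULL);
    if (*spec == '%')
        spec++;                               /* optional leading '%' */

    if (strchr(spec, '+') == NULL) {          /* old style: one letter each */
        for (; *spec != '\0' && n < max_out; spec++) {
            out[n][0] = *spec;
            out[n][1] = '\0';
            n++;
        }
        return n;
    }

    while (*spec != '\0' && n < max_out) {    /* new style: +word+word+ */
        size_t len = strcspn(spec, "+");
        size_t keep = len < ABBR_LEN ? len : ABBR_LEN - 1;

        if (len > 0) {                        /* skip empty fields */
            memcpy(out[n], spec, keep);
            out[n][keep] = '\0';
            n++;
        }
        spec += len;
        if (*spec == '+')
            spec++;
    }
    return n;
}
```

Under these assumptions "%pir1+sp+nr", "+pir1+sp+nr", and
"%+pir1+sp+nr" all yield pir1, sp, and nr, while "%anr" yields the
single-letter abbreviations a, n, and r.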
 610
 611 Removed 'K' estimate for Expectation_n, Expectation_i fits to the
 612 distribution of unrelated similarity scores.  'K' cannot be calculated
 613 from the data available.  'Lambda' can be calculated; it is
 614 1.28255/sqrt(mean_var), and is still available.
 615
 616 >>March 3, 2000
 617  (no version change)
 618
 619 changed Makefile33.common, Makefile.common, to incorporate $(NRAND)
 620 rather than "rand48".  Provide nrandom.c which uses random(), as
 621 replacement for nrand.c, which uses rand48().
 622
 623 >>February 8, 2000
 624   --> v33t05
 625
 626 Fixes to scaleswn.c (proc_hist_ml) to set num_db_entries properly.
 627 Scaleswn.c also provides Lambda estimates for -z 1/11 (Expectation_n),
 628 and -z 1/14 (Expectation_i) statistical estimates.
 629
 630 Modifications to calc_id() to correct bug in counting identities.
 631 Modified showalign() to use calc_id() with -m 9, for simpler
 632 debugging.
 633
 634 Additional modifications to dropfa*.c files to deal properly with 'n's
 635 and 'x's.
 636
 637 Added new option: -x #, which allows one to override the penalty for a
 638 match against 'x' (or 'N') provided by the scoring matrix.  This
 639 option is particularly useful in fast[x/y] searches, where out of
 640 frame low complexity regions can generate high scores.
 641
 642 The old function of '-x' - to specify an alternate coordinate system,
 643 is now available as '-X # #'.
 644
 645 Updated scaleswn.c to provide window shuffle information for -z 12.
 646
 647 Updated compacc.c, workacc.c, to fix serious bug in wshuffle()
 648 that destroyed aa1[n1]=0.
 649
 650 >>January 25, 2000
 651   --> v33t04
 652
 653   A serious bug in all of the fasta related programs has been
 654 corrected.  The new code in fasta33 which ignores certain residues
 655 failed to initialize one of the arrays properly.  As a result, in
 656 pathological situations, a very strong match could be missed.
 657
 658   Corrected minor bug in initsw.c that caused a misplaced "ktup" command
 659 line argument, which should be ignored by ssearch, to be read as -d
 660 ktup.
 661
 662   Improved error message for 0 length query sequence.
 663
 664 >>January 17, 2000
 665   --> no external version number change
 666
 667 Modified mmgetaa.c, map_db.c, and nmgetaa.c to provide memory mapping
 668 of genbank flatfile (format=1) files.  This format could be read much
 669 more efficiently, however.
 670
 671 >>January 12, 2000
 672   --> no external version number change
 673
 674 Changed the behavior of the options that set the number of high scores
 675 (-b) and alignments (-d) that are displayed.  Previously, fasta33 -E
 676 10.0 -d 10 would show 50 best scores, rather than all the scores with
 677 E() < 10.0.  To get the -E threshold to limit, -E 10.0 -b 10000 -d 10
 678 was required. This is now fixed. Setting "-d 10" does not affect the
 679 number of best scores shown.
 680
 681 Minor change in mw.h to remove unused defines.
 682
 683 fasta3x.me (fasta3x.doc) updated.
 684
 685 >>January 6, 2000
 686   --> v33t03
 687
 688 Corrected bug in memory mapped reads of gcg_binary format files
 689 that potentially caused the last 63 residues to be read improperly.
 690
 691 Changes to comp_thr.c, pthr_subs.c, uthr_subs.c, ibm_pthr_subs.c to
 692 ensure that each thread has its own work_info structure. This solves
 693 some minor race conditions that sometimes caused some parameters
 694 not to be reported properly.
 695
 696 Changes to most of the drop*.c files to correct some minor problems
 697 with sequence alphabets. Code in mmgetaa.c (memory mapped code for
 698 FASTA, GCG compressed files) reordered to prevent files from being
 699 memory mapped if appropriate index files are not available.
 700
 701 See readme.pvm_3.3 for updates to the pvm programs.
 702
 703 >>December 10, 1999
 704   (no version change - modifications largely affect ps3comp*)
 705
 706 Modifications to showsum.c to deal with 2 scores/sequence.  Modifications
 707 to mmgetaa.c for superfamily numbers.
 708
 709 >>December 7, 1999
 710  (no version change, previous version not released)
 711
 712 Corrected problem in mmgetaa.c that caused searches on a memory mapped
 713 single long sequence (e.g. Chr22) to fail.  Corrected bug in map_db.c
 714 that caused it to crash on some architectures if a filename was not
 715 specified.  Corrected off-by-three error in fasty/tfasty.  Corrected
 716 indexing error in dropfz2.c.
 717
 718 >>December 5, 1999
 719  --> v33t02
 720
 721 Corrected some bugs in initfa.c/initsw.c/doinit.c that caused
 722 abbreviated function names to be lost.
 723
 724 modify showbest.c, showalign.c to include information on position in
 725 library sequence (bbp->cont) to distinguish subsegment of very long
 726 sequences.  Currently, the new label is available only with -m 6.
 727
 728 >>November 29, 1999
 729  [t]fastz33 uses v33t02 of fasty function.
 730
 731 Replace dropfz.c with dropfz2.c.  Dropfz2.c interprets any codon
 732 that includes the nucleotide 'N' as the amino acid 'X'. Previously, 'N' was
 733 treated as 'A', so 'NNN' ended up 'K'.  This modification, together
 734 with the -S option and lower-case pseg'ed databases, should ensure
 735 that DNA queries with large numbers of 'N's do not match low
 736 complexity regions.
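
As a rough illustration of the change described above, the Python
sketch below contrasts the two translation rules.  It is not the
dropfz2.c code: the function names and the use of the standard genetic
code table are assumptions made for this example only.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_n_as_a(dna):
    """Older rule (hypothetical helper): read every 'N' as 'A'."""
    dna = dna.upper().replace("N", "A")
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def translate_n_as_x(dna):
    """Newer rule (hypothetical helper): any codon containing 'N' is 'X'."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        protein.append("X" if "N" in codon else CODON_TABLE.get(codon, "X"))
    return "".join(protein)
```

With the older rule, "NNNATG" is read as "AAAATG" and translates to
"KM"; with the newer rule it translates to "XM".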
 737
 738 >>November 20, 1999
 739  (no version change, previous version not released)
 740
 741 Modify initfa.c to display initn, init1 scores for [t]fast[fs].
 742 Include "-B" option to show previous z-scores.
 743
 744 >>November 17, 1999
 745  (no version change, previous version not released)
 746
 747 Modify dropfx.c to use saatran(), rather than aatran().  saatran
 748 translates any 'N' containing codon as 'X'.  aatran() treats 'N' as
 749 an 'A'.  Although more steps are required for translation, the program
 750 appears to run just as fast.
 751
 752 >>November 7, 1999
 753  --> v33t01
 754
 755 Substantial changes to the output format in showbest.c (the list of
 756 high scoring sequences) and showalign.c (the alignments).  The classic
 757 list of best scores:
 758
 759 The best scores are:                             initn init1 opt z-sc E(82014)
 760 gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIO  ( 218) 1497 1497 1497 1761.1 2.3e-91
 761 gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE  ( 218) 1413 1413 1413 1662.9 6.7e-86
 762
 763 has been replaced by:
 764
 765 The best scores are:                                       opt bits E(82138)
 766 gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIONE S-TRAN  ( 218) 1497 354 7.6e-98
 767 gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE S-TRANSF  ( 218) 1413 335 5.3e-92
 768
 769 This display provides more information and removes the outdated initn
 770 and init1 scores, which are no longer used. The "bit" score is
 771 comparable to the blast2 bit score.  It is calculated as: (lambda*S -
 772 ln K)/ln 2, where S is the raw similarity score, lambda and K are
 773 statistical parameters estimated from the distribution of unrelated
 774 sequence similarity scores.  All of the similarity scores, including
 775 init1, initn, and z-scores are reported with the alignment data.
 776 Z-scores are displayed instead of bit scores in the list of high
 777 scores if the command line option "-B" is specified.
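
The bit score formula given above can be written as a short Python
function.  This is only a sketch of the arithmetic; the function and
parameter names are chosen for the example, and real values of lambda
and K come from the program's own statistical estimates.

```python
import math


def bit_score(raw_score, lam, k):
    """Return (lambda*S - ln K) / ln 2 for raw similarity score S."""
    return (lam * raw_score - math.log(k)) / math.log(2)
```

For example, bit_score(958, lam, k) would convert the raw score 958
into bits once lam and k have been estimated for the search.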
 778
 779 In addition, the alignment score line has changed from:
 780
 781 >>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER  (220 aa)
 782  initn: 954 init1: 954 opt: 958 Z-score: 1130.9 expect() 1.1e-56
 783 Smith-Waterman score: 958;  61.927% identity in 218 aa overlap (1-218:1-218)
 784
 785 to:
 786
 787 >>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER  (220 aa)
 788  initn: 954 init1: 954 opt: 958  Z-score: 1130.9  bits: 216.4 E(): 2.8e-56
 789 Smith-Waterman score: 958;  61.927% identity in 218 aa overlap (1-218:1-218)
 790
 791 Along with the new "bits:" score, the "expect()" label
 792 has changed to "E()" to save some space.
 793
 794 >>November 4,12, 1999
 795 (no version change)
 796
 797 Fixed serious bug in -z 2 lambda/K calculation in scaleswn.c
 798
 799 Fixed bugs in llgetaa.c (openlib()) and definition of superfamily
 800 numbers.
 801
 802 >>October 21, 1999
 803 (no version change)
 804
 805 Begin using CVS for version control. Correct faulty error message in
 806 dropfs.c.  Corrected bad "goto loopl;" in dropfz.c.  Corrected prss3.rsp
 807 for Makefile.tc (Win32 version).
 808
 809 >>October 18, 1999
 810  --> v33t0
 811
 812 Corrected some serious bugs with the various fasta/x/y programs when
 813 the -DALLOCN0 was used to save memory.  Improvements to fasta3x.me/.doc
 814 documentation.
 815
 816 >>October 12, 1999
 817  --> v33tx
 818
 819 For this initial release of version 33 of the FASTA programs, the
 820 Makefile's have been modified to make "fasta33(_t)", "fastx33(_t)",
 821 etc, so that you can test fasta33 while retaining fasta3 (from release
 822 v32t08).  The FASTA33 programs are somewhat slower than previous
 823 releases, but I believe the ability to handle low complexity regions
 824 without 'X'ing them out outweighs the slowdown.  By (temporarily)
 825 changing the names of the programs slightly, it will be easier for you
 826 to judge the relative cost and benefit.  To "make" the programs as
 827 "fasta3(_t)", etc, simply replace "Makefile33.common" with
 828 "Makefile.common" in the "Makefile" that you use.
 829
 830 >>September 30, 1999
 831
 832 ssearch3/fasta3/fastx3/fasty3 have been modified to search databases
 833 containing both upper and lower case letters, where lower case letters
 834 indicate low-complexity regions.  With the modified programs, lower
 835 case letters are treated as 'X's' in the initial scan, but are then
 836 treated normally in the final alignment.  In addition, alignments can
 837 contain lower case letters.  Lower case letters are treated as
 838 low-complexity regions during the search phase of the program, but as
 839 "conventional" residues during the alignment phase, with the "-S"
 840 option.  Currently, lower case letters are mapped to 'X's during the
 841 scan of the entire library.  In the future, alternate weights will be
 842 available. This is a substantial improvement for very large scale
 843 comparison, where one seeks both accurate statistical estimates and
 844 accurate %identities and alignments, and for translated DNA:protein
 845 comparisons, like "fastx3" and "fasty3", where out-of-frame
 846 translations tend to match low complexity regions (see Pearson et
 847 al. (1997) Genomics 46:24-36).
 848
 849 Protein databases (and query sequences) can be generated in the
 850 appropriate format using John Wootton's "pseg" program, available from
 851 ftp://ncbi.nlm.nih.gov/pub/seg/pseg.  Once you have compiled the "pseg"
 852 program, use the command:
 853
 854         pseg database.fasta -z 1 -q  > database.lc_seg
 855
 856 Once you have database.lc_seg, run the command "map_db" to generate
 857 a ".xin" file that can be used to efficiently memory map the database.
 858
 859 You can then search database.lc_seg with or without the "-S" option.
 860 Without "-S", the database is treated as any other FASTA format file -
 861 all the residues are present.  With "-S", lower case residues will be
 862 treated as 'x's' during the initial scan but as normal residues when
 863 final alignments are displayed.
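
The two views of a lower-case masked sequence described above can be
sketched in Python as follows.  The function names are hypothetical and
the code only illustrates the idea; it does not reproduce how the
programs map residues internally.

```python
def scan_view(seq):
    """Sequence as seen by the initial scan with -S: lower case becomes 'X'."""
    return "".join("X" if ch.islower() else ch for ch in seq)


def align_view(seq):
    """Sequence as scored in the final alignment: lower case counts normally."""
    return seq.upper()
```

For a library entry "MKVllllllAG", scan_view() gives "MKVXXXXXXAG"
while align_view() gives "MKVLLLLLLAG".  The displayed alignment can
still show the original lower case letters.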
 864
 865 When the -S option is used, the matrix information line is changed
 866 from: "BL50 matrix (15:-5)" to "BL50 matrix (15:-5)xS".  The "-S"
 867 option is no longer available to provide a scoring matrix offset.
 868
 869 Unfortunately, Blast2.0 format files cannot contain lower case
 870 letters.  We have addressed this problem by providing efficient memory
 871 mapped access to Fasta and GCG/PIR, and GCG/compressed-binary files in
 872 the last release of fasta32t08. The memory mapped file I/O
 873 improvements are provided in fasta33 as well.
 874
 875 ================ readme.v32 ================
 876
 877 FASTX/Y and FASTA (DNA) are now half as fast, because the programs now
 878 search both the forward and reverse strands by default.
 879
 880 The documentation in fasta3x.me/fasta3x.doc has been substantially
 881 revised.
 882
 883 >>October 20, 1999
 884 (no version change)
 885
 886 Modify nxgetaa.c/nmgetaa.c to recognize 'N' as a possible DNA character.
 887
 888 >>October 9, 1999
 889  --> v32t08 (no version number change)
 890
 891 Added "-M low-high" option, where low and high are inclusion limits
 892 for library sequences.  If a library sequence is shorter than "low" or
 893 longer than "high", it will not be considered in the search.  Thus,
 894 "-M 200-250" limits the database search to proteins between 200 and
 895 250 residues in length.  This should be particularly useful for fasts3
 896 and fastf3.  -M -500 searches library sequences < 500; -M 200-
 897 searches sequences > 200.  This limit applies only to protein
 898 sequences.
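
One way to read a "low-high" argument of this kind is sketched below in
Python.  The helper names are hypothetical, and the sketch assumes that
both limits are inclusive (a sequence is skipped only if it is shorter
than "low" or longer than "high") and that an open-ended limit is
written without a space, as in "-500" or "200-".

```python
def parse_length_limits(arg):
    """Split 'low-high' into (low, high); a missing side is returned as None."""
    low_text, sep, high_text = arg.strip().partition("-")
    if not sep:
        raise ValueError("expected low-high, -high, or low-")
    low = int(low_text) if low_text.strip() else None
    high = int(high_text) if high_text.strip() else None
    return low, high


def within_limits(length, low, high):
    """True if a library sequence of this length should be searched."""
    if low is not None and length < low:
        return False
    if high is not None and length > high:
        return False
    return True
```

With these helpers, parse_length_limits("200-250") returns (200, 250),
and a 199-residue sequence is excluded while a 250-residue sequence is
kept.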
 899
 900 Modified scaleswn.c to fall back to maximum likelihood estimates of
 901 lambda, K rather than mean/variance estimates. (This allows MLE
 902 estimation to be used instead of proc_hist_n when a limited range of
 903 scores is examined.)
 904
 905 >>October 2, 1999
 906  --> v32t08
 907
 908 Many changes:
 909
 910 (1) memory mapped (mmap()ed) database reading - other database reading fixes
 911 (2) BLAST2 databases supported
 912 (3) true maximum likelihood estimates for Lambda, K
 913 (4) Misc. minor fixes
 914
 915 (1) (Sept. 26 - Oct. 2, 1999) Memory mapped database access.
 916 It is now possible to use mmap()ed access to FASTA format databases,
 917 if the "map_db" program has been used to produce an ".xin" file.  If
 918 USE_MMAP is defined at compile time and a ".xin" file is present, the
 919 ".xin" will be used to access sequences directly after the file is
 920 mmap()ed.  On my 4-processor Alpha, this can reduce elapsed time by
 921 50%. It is not quite as efficient as BLAST2 format, but it is close.
 922
 923 Currently, memory mapping is supported for type 0 (FASTA), 5
 924 (PIR/GCG ascii), and 6 (GCG binary).  Memory mapping is used if a
 925 ".xin" file is present. ".xin" files are created by the new program
 926 "map_db".  The syntax for "map_db" is:
 927
 928         map_db [-n] "/dir/database.fa"
 929
 930 which creates the file /dir/database.fa.xin.  Library types can be
 931 included in the filename; thus:
 932
 933         map_db -n "/gcggenbank/gb_om.seq 6"
 934
 935 would be used for a type 6 GCG binary file.
 936
 937 The ".xin" file must be updated each time the database file changes.
 938 map_db writes the size of the database file into the ".xin" file, so
 939 that if the database file changes, making the ".xin" offset
 940 information invalid, the ".xin" file is not used. "list_db" is
 941 provided to print out the offset information in the ".xin" file.
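
The size check described above can be illustrated with a small Python
sketch.  The real ".xin" layout is not documented here, so the sidecar
file name (".xin.json"), the JSON format, and the function names are
all assumptions made for the example; only the idea of recording the
database size and rejecting the index when the size changes comes from
the text.

```python
import json
import os


def index_path(db_path):
    # Hypothetical sidecar name; the real program uses a ".xin" file.
    return db_path + ".xin.json"


def write_index(db_path, offsets):
    """Store sequence offsets together with the current database size."""
    record = {"db_size": os.path.getsize(db_path), "offsets": list(offsets)}
    with open(index_path(db_path), "w") as fh:
        json.dump(record, fh)


def load_index_if_fresh(db_path):
    """Return the stored offsets, or None if the index is missing or stale."""
    try:
        with open(index_path(db_path)) as fh:
            record = json.load(fh)
    except FileNotFoundError:
        return None
    if record["db_size"] != os.path.getsize(db_path):
        return None  # database changed, so the offsets cannot be trusted
    return record["offsets"]
```

A size comparison is cheap but does not catch edits that leave the file
length unchanged, which is one reason to regenerate the index whenever
the database is updated.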
 942
 943 (Oct 2, 1999) The memory mapping routines have been changed to
 944 allow several files to be memory mapped simultaneously. Indeed, once a
 945 database has been memory mapped, it will not be unmap()ed until the
 946 program finishes.  This fixes a problem under Digital Unix, and should
 947 make re-access to mmap()ed files (as when displaying high scores and
 948 alignments) much more efficient.  If no more memory is available for
 949 mmap()ing, the file will be read using conventional fread/fgets.
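
The fallback described above (try to memory map, otherwise read the
file conventionally) looks roughly like this in Python.  This is a
sketch of the pattern only, using Python's mmap module rather than the
C routines in the distribution; the function name is hypothetical.

```python
import mmap


def read_database(path):
    """Return (data, method), where method is 'mmap' or 'read'."""
    with open(path, "rb") as fh:
        try:
            # Length 0 maps the whole file; this fails for empty files
            # or when the mapping cannot be created.
            with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                return mapped[:], "mmap"
        except (OSError, ValueError):
            # Fall back to conventional buffered reading.
            fh.seek(0)
            return fh.read(), "read"
```

The sketch copies the mapped bytes before returning so that the mapping
can be closed; a real reader would keep the mapping open and index into
it directly.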
 950
 951 (Oct 2, 1999) The names of the database reading functions have been
 952 changed to allow both Blast1.4 and Blast2.0 databases to be read.  In
 953 addition, Makefile.common now includes an option to link both
 954 ncbl_lib.o and ncbl2_lib.o, which provides support for both libraries.
 955 However, Blast1.4 support has not been tested.
 956
 957 The Makefile structure has been improved.  Each architecture specific
 958 Makefile (Makefile.alpha, Makefile.linux, etc) now includes
 959 Makefile.common.  Thus, changes to the program structure should be
 960 correct for all platforms.  "map_db" and "list_db" are not made with
 961 "make all".
 962
 963 The database reading functions in nxgetaa.c can now return a database
 964 length of 0, which indicates that no residues were read.  Previously,
 965 0-length sequences returned a length of 1, which were ignored.
 966 Complib.c and comp_thr.c have changed to accommodate this
 967 modification.  This change was made to ensure that each residue,
 968 including the last, of each sequence is read.
 969
 970 Corrected bug in nxgetaa.c with FASTA format files with very long
 971 (>512 char) definition lines.
 972
 973 (2) (September 20, 1999) BLAST2 format databases supported
 974
 975 This release supports NCBI Blast2.0 format databases, using either
 976 conventional file reading or memory mapped files.  The Blast2.0 format
 977 can be read very efficiently, so there is only a modest improvement in
 978 performance with memory mapping.  The decision to use mmap()'ed files
 979 is made at compile time, by defining USE_MMAP.  My thanks to Eamonn
 980 O'Toole of DEC/Compaq, and Daryl Madura of Sun Microsystems, for
 981 providing mmap()'ed modifications to fasta3.  On my machines, Blast2.0
 982 format reduces search time by about 30%.  At the moment, ambiguous DNA
 983 sequences are not decoded properly.
 984
 985 (3) (September 30, 1999) A new statistical estimation option is
 986 available.  -z 2 has been changed from ln()-scaling, which never
 987 should have been used, to scaling using Maximum Likelihood Estimates
 988 (MLEs) of Lambda and K.  The MLE estimation routines were written by
 989 Aaron Mackey, based on a discussion of MLE estimates of Lambda and K
 990 written by Sean Eddy.  The MLE estimation examines the middle 95% of
 991 scores, if there are fewer than 10000 sequences in the database;
 992 otherwise it excludes (censors) the top 250 scores and the bottom 250
 993 scores.  This approach seems to effectively prevent related sequences
 994 from contaminating the estimation process.  As with -z 1, -z 12 causes
 995 the program to generate a shuffled sequence score for each of the
 996 library sequences; in this case, no censoring is done.  If the
 997 estimation process is reliable, Lambda and K should not vary much with
 998 different queries or query lengths.  Lambda appears not to vary much
 999 with the comparison algorithm, although K does.
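
The censoring rule described in (3) can be sketched in Python as shown
below.  It covers only the selection of scores passed to the estimator,
not the MLE fit itself.  The function name is hypothetical, and the
sketch assumes that "the middle 95%" means dropping 2.5% of the scores
from each tail.

```python
def censor_scores(scores, small_db_limit=10000, tail_count=250):
    """Return the sorted scores that would be kept for estimation."""
    ordered = sorted(scores)
    n = len(ordered)
    if n < small_db_limit:
        # Middle 95%: drop 2.5% from each end (integer arithmetic).
        trim = (n * 25) // 1000
    else:
        # Large databases: drop a fixed number from each end.
        trim = tail_count
    return ordered[trim:n - trim]
```

Note that the two rules agree at the boundary: 2.5% of 10000 scores is
250 scores from each tail.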
1000
1001 (4) Minor changes include fixes to some of the alignment display routines,
1002 individual copies of the pstruct structure for each thread, and some
1003 changes to ensure that every last residue in a library is available
1004 for matching (sometimes the last residue could be ignored).  This
1005 version has undergone extensive testing with high-throughput sequences
1006 to confirm that long sequences are read properly.  Problems with
1007 fastf3/fasts3 alignment display have also been addressed.
1008
1009 >>August 26, 1999 (no version change - not released)
1010
1011 Corrected problem in "apam.c" that prevented scoring matrices from
1012 being imported for [t]fasts3/[t]fastf3.
1013
1014 >>August 17, 1999
1015  --> v32t07
1016
1017 Corrected problem with opt_cut initialization that only appeared
1018 with pvcomp* programs.
1019
1020 Improved calculation of FASTA optcut threshold for DNA sequence
1021 comparison for match scores much less than +5 (e.g. +3).  The previous
1022 optcut threshold was too high when the match penalty was < 4 and
1023 ktup=6; it is now scaled more appropriately.
1024
1025 Optcut thresholds have also been raised slightly for
1026 fastx/y3/tfastx/y3.  This should improve performance with minimal
1027 effects on sensitivity.
1028
1029 >>July 29, 1999
1030 (no version change - date change)
1031
1032 Corrected various uninitialized variables and buffer overruns that
1033 were detected.
1034
1035 >>July 26, 1999 - new distribution
1036 (no version change - v32t06, previous version not released)
1037
1038 Changed the location of "(reverse complement)" label in tfasta/x/y/s/f
1039 programs.
1040
1041 Statistical calculations for tfasta/x/y in unthreaded version
1042 corrected.  Statistical estimates for threaded and unthreaded versions
1043 of the tfasta/x/y/s/f programs should be much more consistent.
1044
1045 Substantial modifications in alignment coordinate calculation/
1046 presentation.  Minor error in fastx/y/tfastx/y end of alignment
1047 corrected.  Major problems with tfasta alignment coordinates
1048 corrected.  tfasta and tfastx/y coordinates should now be consistent.
1049
1050 Corrected problem with -N 5000 in tfasta/x/y3(_t) searches encountered
1051 with long query sequences.
1052
1053 Updated pthr_subs.c/Makefile.linux to increase the pthreads stacksize
1054 to try to avoid "cannot allocate diagonal arrays" error message.
1055 Pthreads stacksize can be changed with RedHat 6.0, but not RedHat 5.2,
1056 so Makefile.linux uses -DLINUX5 for RedHat5.* (no pthreads stack size).
1057 I am still getting this message, so it has not been completely
1058 successful.  Makefile.linux now uses -DALLOCN0 to avoid this problem,
1059 at some cost in speed.
1060
1061 The pvcomp* programs have been updated to work properly with
1062 forward/reverse DNA searches.  See readme.pvm_3.2.
1063
1064 >>July 7, 1999 - not released
1065  --> v32t06
1066
1067 Corrected bug in complib.c (fasta3, fastx3, etc) that caused core
1068 dumps with "-o" option.
1069
1070 Corrected a subtle bug in fastx/y/tfastx/y alignment display.
1071
1072 >>June 30, 1999 - new distribution
1073 (no version change)
1074
1075 Corrected doinit.c to allow DNA substitution matrices with -s matrix
1076 option.
1077
1078 Changed ".gbl" files to ".h" files.
1079
1080 >>June 2 - 9, 1999 - new distribution
1081 (no version change)
1082
1083 Added additional DNA lambda/K/H to alt_parms.h.  Corrected some
1084 other problems with those tables for the case where (inf,inf)
1085 gap penalties were not included.
1086
1087 Fixed complib.c/comp_thr.c error message to properly report filename
1088 when library file is not found.
1089
1090 Included approximate Lambda/K/H for BL80 in alt_parms.h.
1091 BL80 scoring matrix changed from 1/3 bit to 1/2 bit units.
1092
1093 Included some additional perl files for searchfa.cgi, searchnn.cgi
1094 in the distribution (my-cgi.pl, cgi-lib.pl).
1095
1096 >>May 30, 1999, June 2, 1999 - new distribution
1097 (no version number change)
1098
1099 Added Makefile.NetBSD, if !defined(__NetBSD__) for values.h.  Changed
1100 zs_to_E() and z_to_E() in scaleswn.c to correctly calculate E() value
1101 when only one sequence is compared and -z 3 is used.
1102
1103 >>May 27, 1999
1104 (no version number change)
1105
1106 Corrected bug in alignment numbering on the % identity line
1107         27.4% identity in 234 aa (101-234:110-243)
1108 for reverse complements with offset coordinates (test.aa:101-250)
1109
1110 >>May 23, 1999
1111 (no version number change)
1112
1113 Correction to Makefile.linux (tgetaa.o : failed to -DTFAST).
1114
1115 >>May 19, 1999
1116 (no version number change)
1117
1118 Minor changes to pvm_showalign.c to allow #define FIRSTNODE 1.
1119 Changes to showsum.c to change off-end reporting.  (Neither of these
1120 changes is likely to affect anyone outside my research group.)
1121
1122 >>May 12, 1999
1123  --> v32t05
1124
1125 Fixed a serious bug in the fastx3/tfastx3 alignment display which
1126 caused t/fastx3 to produce incorrect alignments (and incorrectly low
1127 percent identities).  The scores were correct, but the alignment
1128 percent identities were too low and the alignments were wrong.
1129
1130 Numbering errors were also corrected in fastx3/tfastx3 and
1131 fasty3/tfasty3 and when partial query sequences were used.
1132
1133 >>May 7, 1999
1134
1135 Fixed a subtle bug in dropgsw.c that caused do_work() to calculate
1136 incorrect Smith-Waterman scores after do_walign() had been called.
1137 This affected only pvcompsw searches with the "-m 9" option.
1138
1139 >>May 5, 1999
1140
1141 Modified showalign.c to provide improved alignment information that
1142 includes explicitly the boundaries of the alignment.  Default
1143 alignments now say:
1144
1145 Smith-Waterman score: 175;  24.645% identity in 211 aa overlap (5:207-7:207)
1146
1147 >>May 3, 1999
1148
1149 Modified nxgetaa.c, showsum.c, showbest.c, manshowun.c to allow a
1150 "not" superfamily annotation for the query sequence only.  The
1151 goal is to be able to specify that certain superfamily numbers be
1152 ignored in some of the search summaries.  Thus, a description line
1153 of the form:
1154
1155 >GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675
1156
1157 says that GT8.7 belongs to superfamily 40001, but any library
1158 sequences with superfamily number 90043 should be ignored in any
1159 listing or summary of best scores.
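
A description line of this form could be parsed as in the Python sketch
below.  The function name and the returned dictionary are hypothetical,
and the sketch assumes the layout shown in the example line: name,
superfamily field, and description separated by "|", with "!"
introducing the superfamily numbers to be ignored.

```python
def parse_superfamily_header(line):
    """Split '>name | sf ! not_sf | description' into its parts."""
    fields = [f.strip() for f in line.lstrip(">").split("|", 2)]
    if len(fields) < 3:
        raise ValueError("expected: >name | superfamily | description")
    name, sf_field, description = fields
    member_text, _, ignore_text = sf_field.partition("!")
    return {
        "name": name,
        "superfamilies": [int(tok) for tok in member_text.split()],
        "ignored": [int(tok) for tok in ignore_text.split()],
        "description": description,
    }
```

For the GT8.7 line above, this gives superfamilies [40001] and ignored
[90043].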
1160
1161 In addition, it is now possible to make a fasta3r/prcompfa, which is
1162 the converse of fasta3u/pucompfa. fasta3u reports the highest scoring
1163 unrelated sequences in a search using the superfamily annotation.
1164 fasta3r shows only the scores of related sequences.  This might be
1165 used in combination with the -F e_val option to show the scores
1166 obtained by the most distantly related members of a family.
1167
1168 >>April 25, 1999
1169
1170  -->v32t04 (not distributed)
1171
1172 Modified nxgetaa.c to remove the dependence of tgetaa.o on TFASTA
1173 (necessary for a more rational Makefile structure).  No code changes.
1174
1175 >>April 19, 1999
1176
1177 Fixed a bug in showalign.c that displayed incorrect alignment coordinates.
1178 (no version number change).
1179
1180 >>April 17, 1999
1181
1182  --> v32t03
1183
1184 Fixed a serious bug, introduced in version fasta32, that affected DNA
1185 alignments when the sequence had been broken into multiple segments.
1186 In addition, several minor problems with -z 3 statistics on
1187 DNA sequences were fixed.
1188
1189 Added -m 9 option, which unfortunately does different things in
1190 pvcompfa/sw and fasta3/ssearch3.  In both programs, -m 9 provides the
1191 id's of the two sequences, length, E(), %_ident, and start and end of
1192 the alignment in both sequences.  pvcompfa/sw provides this
1193 information with the list of high scoring sequences.  fasta3/ssearch3
1194 provides the information in lieu of an alignment.
1195
1196 >>March 18, 1999
1197
1198  --> v32t02
1199
1200 Added information on the algorithm/parameter description line to
1201 report the range of the pam matrices.  Useful for matrices like
1202 MD_10, _20, and _40 which require much higher gap penalties.
1203
1204 >>March 13, 1999 (not distributed)
1205
1206  --> v32t01
1207
1208  -r results.file  has been changed to -R results.file to accommodate
1209  DNA match/mismatch penalties of the form: -r "+1/-3".
1210
1211 >>February 10, 1999
1212
1213 Modify functions in scalesw*.c to prevent underflow after exp() on
1214 Alpha Linux machines.  The Alpha/LINUX gcc compiler is buggy and
1215 doesn't behave properly with "denormalized" numbers, so "gcc -g -m
1216 ieee" is recommended.
1217
1218 Add "Display alignments also (y/n)[n] "
1219
1220 pvcomplib.c again provides alignments!!  In addition, there is a
1221 new "-m 9" option, which reports alignments as:
1222
1223 >>>/home/wrp/slib/hlibs/hum0.aa#5>HS5 gi:1280326 T-cell receptor beta chain 30 aa, 30 aa vs /home/wrp/slib/hlibs/hum0.seg library
1224 HS5               30    HS5               30    1.873e-11       1.000     30       1      30       1      30
1225 HS5               30    HS2249            40    1.061e-07       0.774     31       1      30       7      37
1226 HS5               30    HS2221            38    1.207e-07       0.833     30       1      30       7      35
1227 HS5               30    HS2283            40    1.455e-07       0.774     31       1      30       7      37
1228 HS5               30    HS2239            38    1.939e-07       0.800     30       1      30       7      35
1229
1230 where the columns are:
1231
1232 query-name      q-len   lib-name      lib-len   E()             %id    align-len  q-start q-end   l-start l-end
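
Because the "-m 9" lines are whitespace separated, they are easy to
read back into named fields.  The Python sketch below assumes exactly
the eleven columns listed above and names that contain no spaces; the
class and function names are invented for the example.  In the sample
output the %id column holds a fraction (0.774), so it is stored as one.

```python
from typing import NamedTuple


class M9Hit(NamedTuple):
    query_name: str
    query_len: int
    lib_name: str
    lib_len: int
    e_value: float
    frac_identity: float
    align_len: int
    q_start: int
    q_end: int
    l_start: int
    l_end: int


def parse_m9_line(line):
    """Convert one '-m 9' result line into an M9Hit record."""
    parts = line.split()
    if len(parts) != 11:
        raise ValueError("expected 11 columns, found %d" % len(parts))
    return M9Hit(parts[0], int(parts[1]), parts[2], int(parts[3]),
                 float(parts[4]), float(parts[5]),
                 *(int(p) for p in parts[6:]))
```

The header line beginning with ">>>" has a different layout and must be
skipped before calling parse_m9_line().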
1233
1234 >>February 9, 1999
1235
1236 Corrected bug in showalign.c that offset reverse complement alignments
1237 by one.
1238
1239 >>February 2, 1999
1240
1241 Changed the formatting slightly in showbest.c to have columns line up better.
1242
1243 >>January 11, 1999
1244
1245 Corrected some bugs introduced into fastf3(_t) in the previous version.
1246
1247 >>December 28, 1998
1248
1249 Corrected various problems in dropfz.c affecting alignment scores
1250 and coordinates.
1251
1252 Introduced a new program, fasts3(_t), for searching with peptide
1253 sequences.
1254
1255 >>November 11, 1998
1256
1257   --> v32t0
1258
1259 Added code to correct problems with coordinate number in long library
1260 sequences with tfastx/tfasty.  With this release, sequences should be
1261 numbered properly, and sequence numbers count down with reverse
1262 complement library sequences.
1263
1264 In addition, with this release, fastx/y and tfastx/y translated
1265 protein alignments are numbered as nucleotides (increasing by 3,
1266 labels every 30 nucleotides) rather than codons.
1267