binaries/src/fasta34/readme.v32t0

   1
   2 FASTX/Y and FASTA (DNA) are now half as fast, because the programs now
   3 search both the forward and reverse strands by default.
   4
   5 The documentation in fasta3x.me/fasta3x.doc has been substantially
   6 revised.
   7
  23 >>October 20, 1999
  24 (no version change)
  25
  26 Modify nxgetaa.c/nmgetaa.c to recognize 'N' as a possible DNA character.
  27
  28 >>October 9, 1999
  29  --> v32t08 (no version number change)
  30
  31 Added "-M low-high" option, where low and high are inclusion limits
  32 for library sequences.  If a library sequence is shorter than "low" or
  33 longer than "high", it will not be considered in the search.  Thus,
  34 "-M 200-250" limits the database search to proteins between 200 and
  35 250 residues in length.  This should be particularly useful for fasts3
  36 and fastf3.  "-M -500" searches library sequences < 500; "-M 200-"
  37 searches sequences > 200.  This limit applies only to protein
  38 sequences.
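
A minimal sketch of how a "-M low-high" argument could be interpreted,
assuming both limits are inclusive and that an omitted bound means no
limit on that side.  The names parse_m_range and in_range are
hypothetical and are not taken from the fasta3 source; the exact
boundary handling in the program may differ.

```python
def parse_m_range(arg):
    """Split a "-M" argument such as "200-250", "-500", or "200-" into (low, high).

    A missing bound is returned as None, meaning "no limit on that side".
    """
    low_text, sep, high_text = arg.strip().partition("-")
    if not sep:
        raise ValueError("expected low-high, got %r" % arg)
    low = int(low_text) if low_text.strip() else None
    high = int(high_text) if high_text.strip() else None
    return low, high


def in_range(length, low, high):
    """Return True if a library sequence of this length would be searched."""
    if low is not None and length < low:
        return False
    if high is not None and length > high:
        return False
    return True


if __name__ == "__main__":
    low, high = parse_m_range("200-250")
    lengths = [150, 200, 225, 250, 300]
    # Keep only the lengths that fall inside the inclusive limits.
    print([n for n in lengths if in_range(n, low, high)])
```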
  39
  40 Modified scaleswn.c to fall back to maximum likelihood estimates of
  41 lambda, K rather than mean/variance estimates. (This allows MLE
  42 estimation to be used instead of proc_hist_n when a limited range of
  43 scores is examined.)
  44
  45 >>October 2, 1999
  46  --> v32t08
  47
  48 Many changes:
  49
  50 (1) memory mapped (mmap()ed) database reading - other database reading fixes
  51 (2) BLAST2 databases supported
  52 (3) true maximum likelihood estimates for Lambda, K
  53 (4) Misc. minor fixes
  54
  55 (1) (Sept. 26 - Oct. 2, 1999) Memory mapped database access.
  56 It is now possible to use mmap()ed access to FASTA format databases,
  57 if the "map_db" program has been used to produce an ".xin" file.  If
  58 USE_MMAP is defined at compile time and a ".xin" file is present, the
  59 ".xin" will be used to access sequences directly after the file is
  60 mmap()ed.  On my 4-processor Alpha, this can reduce elapsed time by
  61 50%. It is not quite as efficient as BLAST2 format, but it is close.
  62
  63 Currently, memory mapping is supported for type 0 (FASTA), 5
  64 (PIR/GCG ascii), and 6 (GCG binary).  Memory mapping is used if a
  65 ".xin" file is present. ".xin" files are created by the new program
  66 "map_db".  The syntax for "map_db" is:
  67
  68         map_db [-n] "/dir/database.fa"
  69
  70 which creates the file /dir/database.fa.xin.  Library types can be
  71 included in the filename; thus:
  72
  73         map_db -n "/gcggenbank/gb_om.seq 6"
  74
  75 would be used for a type 6 GCG binary file.
  76
  77 The ".xin" file must be updated each time the database file changes.
  78 map_db writes the size of the database file into the ".xin" file, so
  79 that if the database file changes, making the ".xin" offset
  80 information invalid, the ".xin" file is not used. "list_db" is
  81 provided to print out the offset information in the ".xin" file.
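
The size check described above can be illustrated with a small sketch.
It assumes a simplified index (a dictionary holding the database size
and the byte offset of each '>' line); the real ".xin" layout is not
described in these notes, and build_index and index_is_current are
hypothetical names, not functions from map_db.

```python
import os
import tempfile


def build_index(db_path):
    """Record the database size plus the byte offset of every '>' header line."""
    offsets = []
    position = 0
    with open(db_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                offsets.append(position)
            position += len(line)
    return {"db_size": os.path.getsize(db_path), "offsets": offsets}


def index_is_current(index, db_path):
    """The index is trusted only if the database is still the recorded size."""
    return index["db_size"] == os.path.getsize(db_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "demo.fa")
        with open(db, "wb") as out:
            out.write(b">a\nACGT\n>b\nGG\n")
        index = build_index(db)
        print(index_is_current(index, db))   # index was just built from this file
        with open(db, "ab") as out:
            out.write(b">c\nTT\n")
        print(index_is_current(index, db))   # file has grown since the index was built
```

A size comparison is cheap but cannot detect an edit that leaves the
file length unchanged, which is one reason to rebuild the index
whenever the database is updated.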
  82
  83 (Oct 2, 1999) The memory mapping routines have been changed to
  84 allow several files to be memory mapped simultaneously. Indeed, once a
  85 database has been memory mapped, it will not be munmap()ed until the
  86 program finishes.  This fixes a problem under Digital Unix, and should
  87 make re-access to mmap()ed files (as when displaying high scores and
  88 alignments) much more efficient.  If no more memory is available for
  89 mmap()ing, the file will be read using conventional fread/fgets.
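
The behaviour described above (map once, keep the mapping until exit,
fall back to ordinary reads when mapping fails) can be sketched as
follows.  This uses Python's standard mmap module rather than the C
code in fasta3; open_database and get_database are hypothetical names
chosen for the illustration.

```python
import mmap
import os
import tempfile

_mapped = {}  # path -> (data, mapped); entries are kept until the program exits


def open_database(path):
    """Return (data, mapped): a bytes-like view of the file and whether mmap was used."""
    handle = open(path, "rb")
    try:
        # Length 0 maps the whole file; the mapping stays valid after the
        # file object is closed.
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ), True
    except (OSError, ValueError):
        # Mapping failed (for example no memory left, or an empty file):
        # fall back to a conventional read.
        handle.seek(0)
        return handle.read(), False
    finally:
        handle.close()


def get_database(path):
    """Map a database the first time it is requested and reuse it afterwards."""
    if path not in _mapped:
        _mapped[path] = open_database(path)
    return _mapped[path]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.fa")
        with open(path, "wb") as out:
            out.write(b">seq1\nACGT\n")
        data, mapped = get_database(path)
        print(bytes(data[:5]), mapped)
        if mapped:
            data.close()  # only so the temporary directory can be removed cleanly
```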
  90
  91 (Oct 2, 1999) The names of the database reading functions have been
  92 changed to allow both Blast1.4 and Blast2.0 databases to be read.  In
  93 addition, Makefile.common now includes an option to link both
  94 ncbl_lib.o and ncbl2_lib.o, which provides support for both libraries.
  95 However, Blast1.4 support has not been tested.
  96
  97 The Makefile structure has been improved.  Each architecture specific
  98 Makefile (Makefile.alpha, Makefile.linux, etc) now includes
  99 Makefile.common.  Thus, changes to the program structure should be
 100 correct for all platforms.  "map_db" and "list_db" are not made with
 101 "make all".
 102
 103 The database reading functions in nxgetaa.c can now return a database
  104 length of 0, which indicates that no residues were read.  Previously,
  105 0-length sequences returned a length of 1 and were ignored.
 106 Complib.c and comp_thr.c have changed to accommodate this
 107 modification.  This change was made to ensure that each residue,
 108 including the last, of each sequence is read.
 109
 110 Corrected bug in nxgetaa.c with FASTA format files with very long
 111 (>512 char) definition lines.
 112
 113 (2) (September 20, 1999) BLAST2 format databases supported
 114
 115 This release supports NCBI Blast2.0 format databases, using either
 116 conventional file reading or memory mapped files.  The Blast2.0 format
 117 can be read very efficiently, so there is only a modest improvement in
 118 performance with memory mapping.  The decision to use mmap()'ed files
 119 is made at compile time, by defining USE_MMAP.  My thanks to Eamonn
 120 O'Toole of DEC/Compaq, and Daryl Madura of Sun Microsystems, for
 121 providing mmap()'ed modifications to fasta3.  On my machines, Blast2.0
 122 format reduces search time by about 30%.  At the moment, ambiguous DNA
 123 sequences are not decoded properly.
 124
 125 (3) (September 30, 1999) A new statistical estimation option is
 126 available.  -z 2 has been changed from ln()-scaling, which never
 127 should have been used, to scaling using Maximum Likelihood Estimates
 128 (MLEs) of Lambda and K.  The MLE estimation routines were written by
 129 Aaron Mackey, based on a discussion of MLE estimates of Lambda and K
 130 written by Sean Eddy.  The MLE estimation examines the middle 95% of
 131 scores, if there are fewer than 10000 sequences in the database;
 132 otherwise it excludes (censors) the top 250 scores and the bottom 250
 133 scores.  This approach seems to effectively prevent related sequences
 134 from contaminating the estimation process.  As with -z 1, -z 12 causes
 135 the program to generate a shuffled sequence score for each of the
 136 library sequences; in this case, no censoring is done.  If the
 137 estimation process is reliable, Lambda and K should not vary much with
 138 different queries or query lengths.  Lambda appears not to vary much
 139 with the comparison algorithm, although K does.
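
The censoring rule described in (3) can be sketched as below.  Only
the selection of scores is shown, not the maximum likelihood fit
itself.  The sketch assumes that "the middle 95%" means trimming 2.5%
of the scores from each end, rounded down; censor_scores is a
hypothetical name and the rounding used in scaleswn.c may differ.

```python
def censor_scores(scores):
    """Return the sorted scores that would be passed to the Lambda/K estimation.

    Fewer than 10000 scores: keep the middle 95% (assumed 2.5% off each end).
    Otherwise: drop the 250 lowest and the 250 highest scores.
    """
    ordered = sorted(scores)
    count = len(ordered)
    if count < 10000:
        cut = count * 25 // 1000  # 2.5% of the scores, rounded down
    else:
        cut = 250
    return ordered[cut:count - cut]


if __name__ == "__main__":
    small = censor_scores(range(1000))
    large = censor_scores(range(20000))
    print(len(small), len(large))
```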
 140
 141 (4) Minor changes include fixes to some of the alignment display routines,
 142 individual copies of the pstruct structure for each thread, and some
 143 changes to ensure that every last residue in a library is available
  144 for matching (sometimes the last residue could be ignored).  This
 145 version has undergone extensive testing with high-throughput sequences
 146 to confirm that long sequences are read properly.  Problems with
 147 fastf3/fasts3 alignment display have also been addressed.
 148
 149 >>August 26, 1999 (no version change - not released)
 150
 151 Corrected problem in "apam.c" that prevented scoring matrices from
 152 being imported for [t]fasts3/[t]fastf3.
 153
 154 >>August 17, 1999
 155  --> v32t07
 156
 157 Corrected problem with opt_cut initialization that only appeared
 158 with pvcomp* programs.
 159
 160 Improved calculation of FASTA optcut threshold for DNA sequence
 161 comparison for match scores much less than +5 (e.g. +3).  The previous
  162 optcut threshold was too high when the match penalty was < 4 and
 163 ktup=6; it is now scaled more appropriately.
 164
 165 Optcut thresholds have also been raised slightly for
 166 fastx/y3/tfastx/y3.  This should improve performance with minimal
 167 effects on sensitivity.
 168
 169 >>July 29, 1999
 170 (no version change - date change)
 171
 172 Corrected various uninitialized variables and buffer overruns
 173 detected.
 174
 175 >>July 26, 1999 - new distribution
 176 (no version change - v32t06, previous version not released)
 177
 178 Changed the location of "(reverse complement)" label in tfasta/x/y/s/f
 179 programs.
 180
 181 Statistical calculations for tfasta/x/y in unthreaded version
 182 corrected.  Statistical estimates for threaded and unthreaded versions
 183 of the tfasta/x/y/s/f programs should be much more consistent.
 184
 185 Substantial modifications in alignment coordinate calculation/
 186 presentation.  Minor error in fastx/y/tfastx/y end of alignment
 187 corrected.  Major problems with tfasta alignment coordinates
 188 corrected.  tfasta and tfastx/y coordinates should now be consistent.
 189
 190 Corrected problem with -N 5000 in tfasta/x/y3(_t) searches encountered
 191 with long query sequences.
 192
 193 Updated pthr_subs.c/Makefile.linux to increase the pthreads stacksize
 194 to try to avoid "cannot allocate diagonal arrays" error message.
 195 Pthreads stacksize can be changed with RedHat 6.0, but not RedHat 5.2,
 196 so Makefile.linux uses -DLINUX5 for RedHat5.* (no pthreads stack size).
 197 I am still getting this message, so it has not been completely
 198 successful.  Makefile.linux now uses -DALLOCN0 to avoid this problem,
 199 at some cost in speed.
 200
 201 The pvcomp* programs have been updated to work properly with
 202 forward/reverse DNA searches.  See readme.pvm_3.2.
 203
 204 >>July 7, 1999 - not released
 205  --> v32t06
 206
 207 Corrected bug in complib.c (fasta3, fastx3, etc) that caused core
 208 dumps with "-o" option.
 209
 210 Corrected a subtle bug in fastx/y/tfastx/y alignment display.
 211
 212 >>June 30, 1999 - new distribution
 213 (no version change)
 214
 215 Corrected doinit.c to allow DNA substitution matrices with -s matrix
 216 option.
 217
 218 Changed ".gbl" files to ".h" files.
 219
 220 >>June 2 - 9, 1999 - new distribution
 221 (no version change)
 222
  223 Added additional DNA lambda/K/H to alt_parms.h.  Corrected some
  224 other problems with those tables for the case where (inf,inf)
  225 gap penalties were not included.
 226
 227 Fixed complib.c/comp_thr.c error message to properly report filename
 228 when library file is not found.
 229
 230 Included approximate Lambda/K/H for BL80 in alt_parms.h.
 231 BL80 scoring matrix changed from 1/3 bit to 1/2 bit units.
 232
 233 Included some additional perl files for searchfa.cgi, searchnn.cgi
 234 in the distribution (my-cgi.pl, cgi-lib.pl).
 235
 236 >>May 30, 1999, June 2, 1999 - new distribution
 237 (no version number change)
 238
 239 Added Makefile.NetBSD, if !defined(__NetBSD__) for values.h.  Changed
 240 zs_to_E() and z_to_E() in scaleswn.c to correctly calculate E() value
 241 when only one sequence is compared and -z 3 is used.
 242
 243 >>May 27, 1999
 244 (no version number change)
 245
 246 Corrected bug in alignment numbering on the % identity line
 247         27.4% identity in 234 aa (101-234:110-243)
 248 for reverse complements with offset coordinates (test.aa:101-250)
 249
 250 >>May 23, 1999
 251 (no version number change)
 252
 253 Correction to Makefile.linux (tgetaa.o : failed to -DTFAST).
 254
 255 >>May 19, 1999
 256 (no version number change)
 257
 258 Minor changes to pvm_showalign.c to allow #define FIRSTNODE 1.
 259 Changes to showsum.c to change off-end reporting.  (Neither of these
 260 changes is likely to affect anyone outside my research group.)
 261
 262 >>May 12, 1999
 263  --> v32t05
 264
  265 Fixed a serious bug in the fastx3/tfastx3 alignment display which
  266 caused t/fastx3 to produce incorrect alignments (and incorrectly
  267 low percent identities).  The scores themselves were correct; only
  268 the displayed alignments and percent identities were affected.
 269
 270 Numbering errors were also corrected in fastx3/tfastx3 and
 271 fasty3/tfasty3 and when partial query sequences were used.
 272
 273 >>May 7, 1999
 274
 275 Fixed a subtle bug in dropgsw.c that caused do_work() to calculate
 276 incorrect Smith-Waterman scores after do_walign() had been called.
 277 This affected only pvcompsw searches with the "-m 9" option.
 278
 279 >>May 5, 1999
 280
 281 Modified showalign.c to provide improved alignment information that
 282 includes explicitly the boundaries of the alignment.  Default
 283 alignments now say:
 284
 285 Smith-Waterman score: 175;  24.645% identity in 211 aa overlap (5:207-7:207)
 286
 287 >>May 3, 1999
 288
 289 Modified nxgetaa.c, showsum.c, showbest.c, manshowun.c to allow a
 290 "not" superfamily annotation for the query sequence only.  The
 291 goal is to be able to specify that certain superfamily numbers be
 292 ignored in some of the search summaries.  Thus, a description line
 293 of the form:
 294
 295 >GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675
 296
 297 says that GT8.7 belongs to superfamily 40001, but any library
 298 sequences with superfamily number 90043 should be ignored in any
 299 listing or summary of best scores.
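
A minimal sketch of reading such a description line, assuming the
superfamily field is the second "|"-delimited field and that numbers
after "!" are the ones to ignore.  The function names are hypothetical
and the parsing in nxgetaa.c may accept other layouts.

```python
def parse_superfamily_header(line):
    """Split '>name | sf ! not_sf | description' into its parts."""
    fields = [field.strip() for field in line.lstrip(">").split("|", 2)]
    if len(fields) != 3:
        raise ValueError("expected three '|'-separated fields: %r" % line)
    name, sf_field, description = fields
    member_text, _, ignore_text = sf_field.partition("!")
    return {
        "name": name,
        "superfamilies": [int(x) for x in member_text.split()],
        "ignored": [int(x) for x in ignore_text.split()],
        "description": description,
    }


def is_ignored(query, library_superfamily):
    """True if a library sequence should be left out of the best-score listings."""
    return library_superfamily in query["ignored"]


if __name__ == "__main__":
    query = parse_superfamily_header(
        ">GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675")
    print(query["superfamilies"], query["ignored"], is_ignored(query, 90043))
```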
 300
 301 In addition, it is now possible to make a fasta3r/prcompfa, which is
 302 the converse of fasta3u/pucompfa. fasta3u reports the highest scoring
 303 unrelated sequences in a search using the superfamily annotation.
 304 fasta3r shows only the scores of related sequences.  This might be
 305 used in combination with the -F e_val option to show the scores
 306 obtained by the most distantly related members of a family.
 307
 308 >>April 25, 1999
 309
 310  -->v32t04 (not distributed)
 311
 312 Modified nxgetaa.c to remove the dependence of tgetaa.o on TFASTA
 313 (necessary for a more rational Makefile structure).  No code changes.
 314
 315 >>April 19, 1999
 316
 317 Fixed a bug in showalign.c that displayed incorrect alignment coordinates.
 318 (no version number change).
 319
 320 >>April 17, 1999
 321
 322  --> v32t03
 323
  324 Fixed a serious bug, introduced in version fasta32, in DNA alignments
  325 when the sequence has been broken into multiple segments.
  326 In addition, several minor problems with -z 3 statistics on
  327 DNA sequences were fixed.
 328
 329 Added -m 9 option, which unfortunately does different things in
 330 pvcompfa/sw and fasta3/ssearch3.  In both programs, -m 9 provides the
 331 id's of the two sequences, length, E(), %_ident, and start and end of
 332 the alignment in both sequences.  pvcompfa/sw provides this
 333 information with the list of high scoring sequences.  fasta3/ssearch3
 334 provides the information in lieu of an alignment.
 335
 336 >>March 18, 1999
 337
 338  --> v32t02
 339
 340 Added information on the algorithm/parameter description line to
 341 report the range of the pam matrices.  Useful for matrices like
 342 MD_10, _20, and _40 which require much higher gap penalties.
 343
 344 >>March 13, 1999 (not distributed)
 345
 346  --> v32t01
 347
  348  -r results.file  has been changed to -R results.file to accommodate
 349  DNA match/mismatch penalties of the form: -r "+1/-3".
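
For illustration, a "+1/-3" style argument can be split into its two
values as below.  parse_match_mismatch is a hypothetical helper, not
the option-parsing code in doinit.c.

```python
def parse_match_mismatch(arg):
    """Turn a string such as "+1/-3" into (match, mismatch) integers."""
    match_text, sep, mismatch_text = arg.partition("/")
    if not sep:
        raise ValueError("expected match/mismatch, got %r" % arg)
    return int(match_text), int(mismatch_text)


if __name__ == "__main__":
    print(parse_match_mismatch("+1/-3"))
```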
 350
 351 >>February 10, 1999
 352
 353 Modify functions in scalesw*.c to prevent underflow after exp() on
 354 Alpha Linux machines.  The Alpha/LINUX gcc compiler is buggy and
  355 doesn't behave properly with "denormalized" numbers, so
  356 "gcc -g -mieee" is recommended.
 357
  358 Added the prompt "Display alignments also (y/n)[n] ".
 359
 360 pvcomplib.c again provides alignments!!  In addition, there is a
 361 new "-m 9" option, which reports alignments as:
 362
 363 >>>/home/wrp/slib/hlibs/hum0.aa#5>HS5 gi:1280326 T-cell receptor beta chain 30 aa, 30 aa vs /home/wrp/slib/hlibs/hum0.seg library
 364 HS5               30    HS5               30    1.873e-11       1.000     30       1      30       1      30
 365 HS5               30    HS2249            40    1.061e-07       0.774     31       1      30       7      37
 366 HS5               30    HS2221            38    1.207e-07       0.833     30       1      30       7      35
 367 HS5               30    HS2283            40    1.455e-07       0.774     31       1      30       7      37
 368 HS5               30    HS2239            38    1.939e-07       0.800     30       1      30       7      35
 369
 370 where the columns are:
 371
 372 query-name      q-len   lib-name      lib-len   E()             %id    align-len  q-start q-end   l-start l-end
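
A short sketch of reading one "-m 9" result row into named fields,
following the column order listed above.  It assumes the columns are
whitespace separated and that sequence names contain no spaces; M9Row
and parse_m9_row are hypothetical names.  The "%id" column in the
sample output is a fraction (for example 0.774), so it is stored as
frac_id.

```python
from collections import namedtuple

M9Row = namedtuple(
    "M9Row",
    "query q_len lib lib_len e_value frac_id align_len q_start q_end l_start l_end",
)


def parse_m9_row(line):
    """Convert one whitespace-separated -m 9 row into an M9Row."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError("expected 11 columns, got %d" % len(fields))
    return M9Row(
        fields[0], int(fields[1]),
        fields[2], int(fields[3]),
        float(fields[4]), float(fields[5]),
        *(int(value) for value in fields[6:])
    )


if __name__ == "__main__":
    row = parse_m9_row(
        "HS5               30    HS2249            40    1.061e-07"
        "       0.774     31       1      30       7      37")
    print(row.lib, row.e_value, row.l_start, row.l_end)
```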
 373
 374 >>February 9, 1999
 375
 376 Corrected bug in showalign.c that offset reverse complement alignments
 377 by one.
 378
  379 >>February 2, 1999
 380
 381 Changed the formatting slightly in showbest.c to have columns line up better.
 382
 383 >>January 11, 1999
 384
 385 Corrected some bugs introduced into fastf3(_t) in the previous version.
 386
 387 >>December 28, 1998
 388
 389 Corrected various problems in dropfz.c affecting alignment scores
 390 and coordinates.
 391
 392 Introduced a new program, fasts3(_t), for searching with peptide
 393 sequences.
 394
 395 >>November 11, 1998
 396
 397   --> v32t0
 398
  399 Added code to correct problems with coordinate numbering in long library
 400 sequences with tfastx/tfasty.  With this release, sequences should be
 401 numbered properly, and sequence numbers count down with reverse
 402 complement library sequences.
 403
 404 In addition, with this release, fastx/y and tfastx/y translated
 405 protein alignments are numbered as nucleotides (increasing by 3,
 406 labels every 30 nucleotides) rather than codons.
 407