2 FASTX/Y and FASTA (DNA) are now half as fast, because the programs
3 search both the forward and reverse strands by default.
5 The documentation in fasta3x.me/fasta3x.doc has been substantially
9 --> v32t08 (no version number change)
11 Added "-M low-high" option, where low and high are inclusion limits
12 for library sequences. If a library sequence is shorter than "low" or
13 longer than "high", it will not be considered in the search. Thus,
14 "-M 200-250" limits the database search to proteins between 200 and
15 250 residues in length. This should be particularly useful for fasts3
16 and fastf3. This limit applies only to protein sequences.
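
The length filter described for "-M low-high" can be sketched as follows. This is an illustrative Python sketch, not the code in the distribution: the function names parse_length_range and in_range are hypothetical, and the limits are treated as inclusive, following the "inclusion limits" wording above. The open-ended forms ("-500", "200 -") follow the later description of the same option.

```python
from typing import Optional, Tuple


def parse_length_range(spec: str) -> Tuple[Optional[int], Optional[int]]:
    """Parse 'low-high', '-high', or 'low-' into (low, high).

    None means that side of the range is unbounded.
    (Hypothetical helper; not the parser used by the FASTA programs.)
    """
    low_text, sep, high_text = spec.replace(" ", "").partition("-")
    if not sep:
        raise ValueError(f"expected low-high, got {spec!r}")
    low = int(low_text) if low_text else None
    high = int(high_text) if high_text else None
    return low, high


def in_range(length: int, low: Optional[int], high: Optional[int]) -> bool:
    """Return True if a library sequence of this length should be searched."""
    if low is not None and length < low:
        return False
    if high is not None and length > high:
        return False
    return True


if __name__ == "__main__":
    low, high = parse_length_range("200-250")
    for length in (150, 200, 225, 250, 300):
        print(length, in_range(length, low, high))
```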
18 Modified scaleswn.c to fall back to maximum likelihood estimates of
19 lambda, K rather than mean/variance estimates. (This allows MLE
20 estimation to be used instead of proc_hist_n when a limited range of
26 Modify nxgetaa.c/nmgetaa.c to recognize 'N' as a possible DNA character.
29 --> v32t08 (no version number change)
31 Added "-M low-high" option, where low and high are inclusion limits
32 for library sequences. If a library sequence is shorter than "low" or
33 longer than "high", it will not be considered in the search. Thus,
34 "-M 200-250" limits the database search to proteins between 200 and
35 250 residues in length. This should be particularly useful for fasts3
36 and fastf3. -M -500 searches library sequences < 500; -M 200 -
37 searches sequences > 200. This limit applies only to protein
40 Modified scaleswn.c to fall back to maximum likelihood estimates of
41 lambda, K rather than mean/variance estimates. (This allows MLE
42 estimation to be used instead of proc_hist_n when a limited range of
50 (1) memory mapped (mmap()ed) database reading - other database reading fixes
51 (2) BLAST2 databases supported
52 (3) true maximum likelihood estimates for Lambda, K
55 (1) (Sept. 26 - Oct. 2, 1999) Memory mapped database access.
56 It is now possible to use mmap()ed access to FASTA format databases,
57 if the "map_db" program has been used to produce an ".xin" file. If
58 USE_MMAP is defined at compile time and a ".xin" file is present, the
59 ".xin" will be used to access sequences directly after the file is
60 mmap()ed. On my 4-processor Alpha, this can reduce elapsed time by
61 50%. It is not quite as efficient as BLAST2 format, but it is close.
63 Currently, memory mapping is supported for type 0 (FASTA), 5
64 (PIR/GCG ascii), and 6 (GCG binary). Memory mapping is used if a
65 ".xin" file is present. ".xin" files are created by the new program
66 "map_db". The syntax for "map_db" is:
68 map_db [-n] "/dir/database.fa"
70 which creates the file /dir/database.fa.xin. Library types can be
71 included in the filename; thus:
73 map_db -n "/gcggenbank/gb_om.seq 6"
75 would be used for a type 6 GCG binary file.
77 The ".xin" file must be updated each time the database file changes.
78 map_db writes the size of the database file into the ".xin" file, so
79 that if the database file changes, making the ".xin" offset
80 information invalid, the ".xin" file is not used. "list_db" is
81 provided to print out the offset information in the ".xin" file.
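
The staleness check described above (store the database size in the index, ignore the index if the size no longer matches) can be sketched as follows. This is only an illustration in Python: the ".idx.json" file and its JSON layout are hypothetical and are NOT the real ".xin" format, and build_index and load_index are made-up names, not map_db or list_db.

```python
import json
import os


def build_index(db_path):
    """Record the database size and the byte offset of every '>' header line.

    Writes a hypothetical JSON index next to the database file.
    """
    offsets = []
    pos = 0
    with open(db_path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                offsets.append(pos)
            pos += len(line)
    index = {"db_size": os.path.getsize(db_path), "offsets": offsets}
    with open(db_path + ".idx.json", "w") as out:
        json.dump(index, out)
    return index


def load_index(db_path):
    """Return the index, or None if it is missing or the database has changed size."""
    idx_path = db_path + ".idx.json"
    if not os.path.exists(idx_path):
        return None
    with open(idx_path) as fh:
        index = json.load(fh)
    if index["db_size"] != os.path.getsize(db_path):
        return None  # offsets can no longer be trusted
    return index
```

A size comparison is cheap but does not catch edits that leave the file length unchanged, which is one reason the index should be rebuilt whenever the database is updated.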
83 (Oct 2, 1999) The memory mapping routines have been changed to
84 allow several files to be memory mapped simultaneously. Indeed, once a
85 database has been memory mapped, it will not be unmap()ed until the
86 program finishes. This fixes a problem under Digital Unix, and should
87 make re-access to mmap()ed files (as when displaying high scores and
88 alignments) much more efficient. If no more memory is available for
89 mmap()ing, the file will be read using conventional fread/fgets.
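
The behaviour described above (keep each mapping open for the life of the program, and fall back to ordinary reads if mapping fails) might look roughly like this in Python. It is a sketch under assumptions: open_database, get_database, and the _open_maps cache are hypothetical names, and Python's mmap module stands in for the C mmap() call.

```python
import mmap

# Mappings are kept for the life of the process and never unmapped,
# so re-access to the same file reuses the existing buffer.
_open_maps = {}


def open_database(path):
    """Return (buffer, mapped): a memory map if possible, else the bytes read normally."""
    fh = open(path, "rb")
    try:
        buf = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
        return buf, True
    except (OSError, ValueError):
        # Mapping failed (for example, no memory available or an empty file):
        # fall back to a conventional read.
        fh.seek(0)
        return fh.read(), False
    finally:
        fh.close()  # the mapping, if any, stays valid after the file is closed


def get_database(path):
    """Return the cached (buffer, mapped) pair for path, opening it on first use."""
    if path not in _open_maps:
        _open_maps[path] = open_database(path)
    return _open_maps[path]
```

Both the memory map and the fallback bytes object support slicing, so callers can read a sequence at a known offset the same way in either case.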
91 (Oct 2, 1999) The names of the database reading functions have been
92 changed to allow both Blast1.4 and Blast2.0 databases to be read. In
93 addition, Makefile.common now includes an option to link both
94 ncbl_lib.o and ncbl2_lib.o, which provides support for both libraries.
95 However, Blast1.4 support has not been tested.
97 The Makefile structure has been improved. Each architecture specific
98 Makefile (Makefile.alpha, Makefile.linux, etc) now includes
99 Makefile.common. Thus, changes to the program structure should be
100 correct for all platforms. "map_db" and "list_db" are not made with
103 The database reading functions in nxgetaa.c can now return a database
104 length of 0, which indicates that no residues were read. Previously,
105 0-length sequences returned a length of 1 and were ignored.
106 Complib.c and comp_thr.c have changed to accommodate this
107 modification. This change was made to ensure that each residue,
108 including the last, of each sequence is read.
110 Corrected bug in nxgetaa.c with FASTA format files with very long
111 (>512 char) definition lines.
113 (2) (September 20, 1999) BLAST2 format databases supported
115 This release supports NCBI Blast2.0 format databases, using either
116 conventional file reading or memory mapped files. The Blast2.0 format
117 can be read very efficiently, so there is only a modest improvement in
118 performance with memory mapping. The decision to use mmap()'ed files
119 is made at compile time, by defining USE_MMAP. My thanks to Eamonn
120 O'Toole of DEC/Compaq, and Daryl Madura of Sun Microsystems, for
121 providing mmap()'ed modifications to fasta3. On my machines, Blast2.0
122 format reduces search time by about 30%. At the moment, ambiguous DNA
123 sequences are not decoded properly.
125 (3) (September 30, 1999) A new statistical estimation option is
126 available. -z 2 has been changed from ln()-scaling, which never
127 should have been used, to scaling using Maximum Likelihood Estimates
128 (MLEs) of Lambda and K. The MLE estimation routines were written by
129 Aaron Mackey, based on a discussion of MLE estimates of Lambda and K
130 written by Sean Eddy. The MLE estimation examines the middle 95% of
131 scores, if there are fewer than 10000 sequences in the database;
132 otherwise it excludes (censors) the top 250 scores and the bottom 250
133 scores. This approach seems to effectively prevent related sequences
134 from contaminating the estimation process. As with -z 1, -z 12 causes
135 the program to generate a shuffled sequence score for each of the
136 library sequences; in this case, no censoring is done. If the
137 estimation process is reliable, Lambda and K should not vary much with
138 different queries or query lengths. Lambda appears not to vary much
139 with the comparison algorithm, although K does.
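
The censoring rule described in (3) can be sketched as below. This covers only the selection of which scores to keep, not the MLE fit of Lambda and K itself. It is a Python illustration, not the code in scaleswn.c: censor_scores is a hypothetical name, and "middle 95%" is assumed to mean trimming 2.5% of the scores from each end.

```python
def censor_scores(scores, small_db_limit=10000, fixed_cut=250):
    """Return the sorted scores that would be kept for estimation.

    Fewer than small_db_limit scores: keep the middle 95%
    (assumed here to mean dropping 2.5% from each end).
    Otherwise: drop the fixed_cut lowest and fixed_cut highest scores.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n < small_db_limit:
        cut = n * 25 // 1000  # 2.5% from each end, in integer arithmetic
    else:
        cut = fixed_cut
    return ordered[cut:n - cut]


if __name__ == "__main__":
    kept = censor_scores(range(1000))
    print(len(kept), kept[0], kept[-1])
```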
141 (4) Minor changes include fixes to some of the alignment display routines,
142 individual copies of the pstruct structure for each thread, and some
143 changes to ensure that every last residue in a library is available
144 for matching (sometimes the last residue could be ignored). This
145 version has undergone extensive testing with high-throughput sequences
146 to confirm that long sequences are read properly. Problems with
147 fastf3/fasts3 alignment display have also been addressed.
149 >>August 26, 1999 (no version change - not released)
151 Corrected problem in "apam.c" that prevented scoring matrices from
152 being imported for [t]fasts3/[t]fastf3.
157 Corrected problem with opt_cut initialization that only appeared
158 with pvcomp* programs.
160 Improved calculation of FASTA optcut threshold for DNA sequence
161 comparison for match scores much less than +5 (e.g. +3). The previous
162 optcut threshold was too high when the match score was < 4 and
163 ktup=6; it is now scaled more appropriately.
165 Optcut thresholds have also been raised slightly for
166 fastx/y3/tfastx/y3. This should improve performance with minimal
167 effects on sensitivity.
170 (no version change - date change)
172 Corrected various uninitialized variables and buffer overruns
175 >>July 26, 1999 - new distribution
176 (no version change - v32t06, previous version not released)
178 Changed the location of "(reverse complement)" label in tfasta/x/y/s/f
181 Statistical calculations for tfasta/x/y in unthreaded version
182 corrected. Statistical estimates for threaded and unthreaded versions
183 of the tfasta/x/y/s/f programs should be much more consistent.
185 Substantial modifications in alignment coordinate calculation/
186 presentation. Minor error in fastx/y/tfastx/y end of alignment
187 corrected. Major problems with tfasta alignment coordinates
188 corrected. tfasta and tfastx/y coordinates should now be consistent.
190 Corrected problem with -N 5000 in tfasta/x/y3(_t) searches encountered
191 with long query sequences.
193 Updated pthr_subs.c/Makefile.linux to increase the pthreads stacksize
194 to try to avoid "cannot allocate diagonal arrays" error message.
195 Pthreads stacksize can be changed with RedHat 6.0, but not RedHat 5.2,
196 so Makefile.linux uses -DLINUX5 for RedHat5.* (no pthreads stack size).
197 I am still getting this message, so it has not been completely
198 successful. Makefile.linux now uses -DALLOCN0 to avoid this problem,
199 at some cost in speed.
201 The pvcomp* programs have been updated to work properly with
202 forward/reverse DNA searches. See readme.pvm_3.2.
204 >>July 7, 1999 - not released
207 Corrected bug in complib.c (fasta3, fastx3, etc) that caused core
208 dumps with "-o" option.
210 Corrected a subtle bug in fastx/y/tfastx/y alignment display.
212 >>June 30, 1999 - new distribution
215 Corrected doinit.c to allow DNA substitution matrices with -s matrix
218 Changed ".gbl" files to ".h" files.
220 >>June 2 - 9, 1999 - new distribution
223 Added additional DNA lambda/K/H to alt_parms.h. Corrected some
224 other problems with those tables for the case where (inf,inf)
225 gap penalties were not included.
227 Fixed complib.c/comp_thr.c error message to properly report filename
228 when library file is not found.
230 Included approximate Lambda/K/H for BL80 in alt_parms.h.
231 BL80 scoring matrix changed from 1/3 bit to 1/2 bit units.
233 Included some additional perl files for searchfa.cgi, searchnn.cgi
234 in the distribution (my-cgi.pl, cgi-lib.pl).
236 >>May 30, 1999, June 2, 1999 - new distribution
237 (no version number change)
239 Added Makefile.NetBSD, if !defined(__NetBSD__) for values.h. Changed
240 zs_to_E() and z_to_E() in scaleswn.c to correctly calculate E() value
241 when only one sequence is compared and -z 3 is used.
244 (no version number change)
246 Corrected bug in alignment numbering on the % identity line
247 27.4% identity in 234 aa (101-234:110-243)
248 for reverse complements with offset coordinates (test.aa:101-250)
251 (no version number change)
253 Correction to Makefile.linux (tgetaa.o : failed to -DTFAST).
256 (no version number change)
258 Minor changes to pvm_showalign.c to allow #define FIRSTNODE 1.
259 Changes to showsum.c to change off-end reporting. (Neither of these
260 changes is likely to affect anyone outside my research group.)
265 Fixed a serious bug in the fastx3/tfastx3 alignment display that
266 caused both programs to produce incorrect alignments with percent
267 identities that were too low. The scores themselves were correct.
270 Numbering errors were also corrected in fastx3/tfastx3 and
271 fasty3/tfasty3 and when partial query sequences were used.
275 Fixed a subtle bug in dropgsw.c that caused do_work() to calculate
276 incorrect Smith-Waterman scores after do_walign() had been called.
277 This affected only pvcompsw searches with the "-m 9" option.
281 Modified showalign.c to provide improved alignment information that
282 includes explicitly the boundaries of the alignment. Default
285 Smith-Waterman score: 175; 24.645% identity in 211 aa overlap (5:207-7:207)
289 Modified nxgetaa.c, showsum.c, showbest.c, manshowun.c to allow a
290 "not" superfamily annotation for the query sequence only. The
291 goal is to be able to specify that certain superfamily numbers be
292 ignored in some of the search summaries. Thus, a description line
295 >GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675
297 says that GT8.7 belongs to superfamily 40001, but any library
298 sequences with superfamily number 90043 should be ignored in any
299 listing or summary of best scores.
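
A description line of the form shown above could be split into its parts roughly as follows. This is a hedged Python sketch, not the parser in nxgetaa.c: parse_superfamily_header and is_ignored are hypothetical names, and it assumes that the field between the two "|" characters holds whitespace-separated superfamily numbers, with any numbers after "!" being the ones to ignore.

```python
def parse_superfamily_header(line):
    """Split '>name | sf ! not_sf | description' into its four parts.

    Returns (name, member_superfamilies, ignored_superfamilies, description).
    """
    name, annot, descr = [part.strip() for part in line.lstrip(">").split("|", 2)]
    member_text, _, ignored_text = annot.partition("!")
    members = [int(tok) for tok in member_text.split()]
    ignored = [int(tok) for tok in ignored_text.split()]
    return name, members, ignored, descr


def is_ignored(library_superfamily, ignored):
    """True if a library sequence's superfamily should be left out of summaries."""
    return library_superfamily in ignored


if __name__ == "__main__":
    header = ">GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675"
    print(parse_superfamily_header(header))
```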
301 In addition, it is now possible to make a fasta3r/prcompfa, which is
302 the converse of fasta3u/pucompfa. fasta3u reports the highest scoring
303 unrelated sequences in a search using the superfamily annotation.
304 fasta3r shows only the scores of related sequences. This might be
305 used in combination with the -F e_val option to show the scores
306 obtained by the most distantly related members of a family.
310 -->v32t04 (not distributed)
312 Modified nxgetaa.c to remove the dependence of tgetaa.o on TFASTA
313 (necessary for a more rational Makefile structure). No code changes.
317 Fixed a bug in showalign.c that displayed incorrect alignment coordinates.
318 (no version number change).
324 Fixed a serious bug, introduced in version fasta32, that affected DNA
325 alignments when the sequence has been broken into multiple segments.
326 In addition, several minor problems with -z 3 statistics on
327 DNA sequences were fixed.
329 Added -m 9 option, which unfortunately does different things in
330 pvcompfa/sw and fasta3/ssearch3. In both programs, -m 9 provides the
331 id's of the two sequences, length, E(), %_ident, and start and end of
332 the alignment in both sequences. pvcompfa/sw provides this
333 information with the list of high scoring sequences. fasta3/ssearch3
334 provides the information in lieu of an alignment.
340 Added information on the algorithm/parameter description line to
341 report the range of the pam matrices. Useful for matrices like
342 MD_10, _20, and _40 which require much higher gap penalties.
344 >>March 13, 1999 (not distributed)
348 -r results.file has been changed to -R results.file to accommodate
349 DNA match/mismatch penalties of the form: -r "+1/-3".
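
As an illustration of the "+1/-3" form, a match/mismatch specification could be parsed and applied as below. This is a Python sketch with hypothetical names (parse_match_mismatch, score_pair); it does not reproduce how the FASTA programs build their DNA scoring matrix, and it ignores ambiguity codes.

```python
def parse_match_mismatch(spec):
    """Parse a string such as '+1/-3' into (match, mismatch) integers."""
    match_text, sep, mismatch_text = spec.partition("/")
    if not sep:
        raise ValueError(f"expected match/mismatch, got {spec!r}")
    return int(match_text), int(mismatch_text)


def score_pair(a, b, match, mismatch):
    """Score one pair of nucleotides (simplified: exact comparison only)."""
    return match if a.upper() == b.upper() else mismatch


if __name__ == "__main__":
    match, mismatch = parse_match_mismatch("+1/-3")
    print(sum(score_pair(a, b, match, mismatch) for a, b in zip("ACGT", "ACGA")))
```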
353 Modify functions in scalesw*.c to prevent underflow after exp() on
354 Alpha Linux machines. The Alpha/LINUX gcc compiler is buggy and
355 doesn't behave properly with "denormalized" numbers, so "gcc -g -m
356 ieee" is recommended.
358 Add "Display alignments also (y/n)[n] "
360 pvcomplib.c again provides alignments!! In addition, there is a
361 new "-m 9" option, which reports alignments as:
363 >>>/home/wrp/slib/hlibs/hum0.aa#5>HS5 gi:1280326 T-cell receptor beta chain 30 aa, 30 aa vs /home/wrp/slib/hlibs/hum0.seg library
364 HS5 30 HS5 30 1.873e-11 1.000 30 1 30 1 30
365 HS5 30 HS2249 40 1.061e-07 0.774 31 1 30 7 37
366 HS5 30 HS2221 38 1.207e-07 0.833 30 1 30 7 35
367 HS5 30 HS2283 40 1.455e-07 0.774 31 1 30 7 37
368 HS5 30 HS2239 38 1.939e-07 0.800 30 1 30 7 35
370 where the columns are:
372 query-name q-len lib-name lib-len E() %id align-len q-start q-end l-start l-end
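
A result line in this layout can be read into named fields with a few lines of Python. This is a sketch for downstream scripts, not part of the distribution: M9Hit and parse_m9_line are hypothetical names, and it assumes the eleven columns are separated by whitespace and that sequence names contain no spaces.

```python
from typing import NamedTuple


class M9Hit(NamedTuple):
    query_name: str
    q_len: int
    lib_name: str
    lib_len: int
    e_value: float
    frac_ident: float  # the %id column, given as a fraction in the example output
    align_len: int
    q_start: int
    q_end: int
    l_start: int
    l_end: int


def parse_m9_line(line):
    """Parse one eleven-column '-m 9' result line into an M9Hit."""
    f = line.split()
    if len(f) != 11:
        raise ValueError(f"expected 11 columns, got {len(f)}")
    return M9Hit(f[0], int(f[1]), f[2], int(f[3]), float(f[4]),
                 float(f[5]), int(f[6]), *(int(x) for x in f[7:11]))


if __name__ == "__main__":
    print(parse_m9_line("HS5 30 HS2249 40 1.061e-07 0.774 31 1 30 7 37"))
```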
376 Corrected bug in showalign.c that offset reverse complement alignments
381 Changed the formatting slightly in showbest.c to have columns line up better.
385 Corrected some bugs introduced into fastf3(_t) in the previous version.
389 Corrected various problems in dropfz.c affecting alignment scores
392 Introduced a new program, fasts3(_t), for searching with peptide
399 Added code to correct problems with coordinate numbering in long library
400 sequences with tfastx/tfasty. With this release, sequences should be
401 numbered properly, and sequence numbers count down with reverse
402 complement library sequences.
404 In addition, with this release, fastx/y and tfastx/y translated
405 protein alignments are numbered as nucleotides (increasing by 3,
406 labels every 30 nucleotides) rather than codons.
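
The nucleotide-based numbering can be illustrated with a small Python sketch. It is an assumption-laden illustration, not the display code: codon_to_nucleotide and label_positions are hypothetical names, codon indices are taken to start at 0, and the coordinate of a codon is taken to be the coordinate of its first nucleotide.

```python
def codon_to_nucleotide(codon_index, first_nt, reverse=False):
    """Nucleotide coordinate of the codon at codon_index (0-based).

    Coordinates increase by 3 per codon, or decrease by 3 for a
    reverse-complement sequence.
    """
    step = -3 if reverse else 3
    return first_nt + step * codon_index


def label_positions(n_codons, first_nt, reverse=False, label_every=30):
    """Coordinates that would receive a label, one every label_every nucleotides."""
    codons_per_label = label_every // 3
    return [codon_to_nucleotide(i, first_nt, reverse)
            for i in range(0, n_codons, codons_per_label)]


if __name__ == "__main__":
    print(label_positions(25, 1))
    print(label_positions(25, 300, reverse=True))
```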