2 $Name: fa_34_26_5 $ - $Id: readme.v33t0,v 1.45 2001/07/10 18:03:42 wrp Exp $
4 ================ readme.v33t0 ================
6 This release includes an MPI implementation of the parallel
7 library-vs-library comparison code. See readme.mpi_3.3 and
8 readme.pvm_3.3 for more information.
13 Considerable changes to support no-global library functions.
15 (1) Separate ascii/sequence mapping arrays are used by the
16 query-reading (qascii), library-reading (lascii), and sequence
17 comparison function (pascii) routines. As a result, there is no
18 longer a need for tgetlib.o/lgetlib.o - lgetlib.o can serve both
21 (2) This also allows us to remove all #ifdef TFAST/FASTX conditionals
22 from complib.c/comp_thr.c/p2_complib.c. We no longer need
23 tcomp_thr.o, comp_thrx.o, etc. We still have a variety of
24 p2_complib.o variations to support the different c34.work* files.
26 (3) Because non-global openlib/getlib functions are available, exactly
27 the same open/get functions are available for reading both the
28 query and reference libraries in pv34comp* programs. The
29 host-specific openlib/getlib functions in hxgetaa.c are now
30 provided by nmgetlib.c, etc. This has two effects:
32 (a) it is now possible to compare a query database generated by an
33 SQL query to a library database generated by a different SQL
36 (b) pv34comp* has lost (at least in this version) the ability to
37 automatically detect the query sequence type. To search with a
38 DNA query, you MUST use "-n".
40 (4) the resetp() function is now responsible for almost all of the
41 function-specific (TFAST/FASTX/etc) initializations. All of the
42 function-specific code has been removed from complib.c/comp_thr.c
43 and most of it has been moved to initfa.c/resetp().
45 (5) manageacc.c has been merged into compacc.c (mostly prhist()).
47 (6) Although it may reflect a subtle bug in my code, it is not
48 possible to reliably run threaded/memory mapped versions of the
49 fasta34_t code. I have spent considerable time tracking down the
50 problem, and have determined that, in threaded code, something
51 happens during the thread initialization to corrupt the
52 description offset information used when files are memory mapped.
53 This never occurs when the unthreaded versions of the code are
54 used. And it does not occur under MacOSX, Compaq Tru64Unix, Sun
55 Solaris/Sparc, or SGI IRIX.
57 Thus, I cannot recommend using the threaded code versions (_t)
58 under Linux (RH6.2 or 7.1).
63 Many changes to accommodate a new - no global variable - strategy for
64 reading sequence databases. Every time a file is opened, a struct
65 lmf_str is allocated which can be used for memory mapped files, ncbl2
66 files, and mysql files.
68 In addition, an open'ed file has a default sequence type: DNA or
69 protein, or one can open a file in a mode that will allow the sequence
73 >>May 18, 2001 CVS: fa33t09d0
75 A new compile time parameter - -DGAP_OPEN, is available to change the
76 definition of the "-f gap-open" parameter from the penalty for the
77 first residue in a gap to a true gap-open penalty, as is used in BLAST
78 and many other comparison algorithms. This will probably become the
79 default for fasta in version 3.4.
81 Fixes to conflicts between "-S" and "-s matrix". When a scoring
82 matrix file was specified, lower-case alignments were not displayed
83 with -S (although the scores were calculated properly).
85 More extensive testing of mysql_lib.c (mySQL query-libraries) with
86 the pv4comp* and mp4comp* programs.
89 >>April 5, 2001 CVS: fa33t08d4b3
91 Changes in nmgetlib.c and ncbl2_mlib.c to return long sequence
92 descriptions for PCOMPLIB (pv4/mp3comp*). Also fix p2_complib.c to
93 request DNA library for translated comparisons.
95 Fix for prss33(_t) to read both sequences from stdin.
98 >>March 27, 2001 CVS: fa33t08d4
101 Problems in ncbl2_mlib.c found searching NCBI non-redundant nucleotide
102 database "nt" were fixed. Testing revealed a minor memory leak, which
103 was fixed by modifying showbest.c, showalign.c, comp_thr.c, complib.c,
104 and p2_complib.c to remember the last opened database file more
107 Modifications to allow 64-bit fseek/ftell on machines like Sun,
108 Linux/Intel, that support -D_FILE_OFFSET_BITS=64, -D_LARGE_FILE_SOURCE,
109 off_t, and fseeko(), ftello() with the option -DUSE_FSEEKO. Machines
110 with 64-bit long's do not need this option. Machines with 32-bit
111 longs that allow files >2 Gb can do so with 64-bit file access
112 functions, including fseeko() and ftello(), which work with off_t file
113 offsets instead of long's.
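The difference between long and off_t file offsets can be sketched as
follows. This is a minimal illustration, not code from the FASTA
sources: the function name file_size_bytes() is hypothetical, and it
assumes a POSIX system that provides fseeko() and ftello().

```c
/* Request a 64-bit off_t on systems where long is 32 bits.
   This must appear before any system header is included. */
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <sys/types.h>

/* Return the size of a file in bytes, or -1 on error.
   fseeko()/ftello() take and return off_t rather than long, so the
   result is not limited to 2 Gb when off_t is 64 bits wide.
   (Hypothetical helper, for illustration only.) */
off_t file_size_bytes(const char *path)
{
    FILE *fp = fopen(path, "rb");
    off_t size;

    if (fp == NULL) return (off_t)-1;
    if (fseeko(fp, 0, SEEK_END) != 0) {
        fclose(fp);
        return (off_t)-1;
    }
    size = ftello(fp);
    fclose(fp);
    return size;
}
```

On machines with 64-bit longs, plain fseek()/ftell() already cover
the same range, which is why those machines do not need the option.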
116 >>March 3, 2001 CVS: fa33t08d2
118 Corrected problems in nmgetaa.c and mysql_lib.c with parallel
119 programs, and one serious problem with alternate DNA scoring matrices
120 (initfa.c, initsw.c) not being set properly. A subtle problem with
121 the merge of scaleswn.c and scaleswg.c is fixed.
125 Modified mysql_lib.c to use "#", rather than "%ld", to indicate the
126 position of the GID. This change was made because sprintf() cannot be
127 used reliably to generate an SQL string, as '"' and '%' are used in
132 (no version change, date change)
134 Minor fixes to initfa.c, initsw.c to deal with DNA scoring matrices
135 properly. "-n -s dna.mat" is required for the sequence/matrix to be
141 Merge of the main CVS trunk - fa33t06 with the latest release branch,
144 In addition, PCOMPLIB mods have been made to mysql_lib.c. Because
145 p2_complib.c gets sequence description information during the first
146 read of the database, the mysql_query must be changed to return:
147 result[0]=GID, result[1]=description, result[2]=sequence. In the
148 PCOMPLIB case, the other SQL queries (for GID description, sequence)
149 are not necessary but must still be provided.
153 (no version change, previous version not released)
155 changes to p2_complib.c to correct openlib() incompatibility.
157 changes to nmgetaa.c, ncbl2_lib.c to incorporate PCOMPLIB. nxgetaa.c
162 (no version change, previous version not released)
164 Change to initfa.c to move ktup check from query_parm() to last_init().
170 Fixes to complib.c, comp_thr.c to deal properly with long query
171 protein sequences when a short library chunk (e.g. -N 5000) was given.
172 In the case where the chunk size is too short, it will be reset to a
173 length which allows the search to proceed, by including an amount of
174 new sequence that is equal to the amount of overlap sequence.
176 scaleswn.c and scaleswg.c have been merged.
178 v33t08 includes the initial implementation for mySQL described below
185 Initial implementation of a syntax for mySQL database queries. A new
186 file, mysql_lib.c has been added, and changes have been made to
187 nmgetaa.c (which should now replace nxgetaa.c) and altlib.h. A mySQL
188 database search needs a file with 4 parts:
190 (1) description of the database, user, password
191 (2) a select statement that generates the set of protein sequences
193 (3) a select statement that generates a UID, description given a UID
194 (4) a select statement that generates a single UID, sequence given a UID
196 Each of the four parts should be separated by ';'. For example, in
197 the database that we are using for testing, a file "demo.sql" that
201 localhost taxonomy username secret;
202 SELECT proteins.gid, proteins.sequence FROM proteins,swissprot WHERE proteins.gid=swissprot.gid AND swissprot.spid IS NOT NULL;
203 select proteins.gid, concat(swissprot.spid," ",proteins.description) from proteins,swissprot where proteins.gid=%ld AND swissprot.gid=proteins.gid;
204 select gid, sequence from proteins where gid=%ld;
207 will find all the proteins in the BLAST "nr" database that also have
208 SwissProt ID's when given the command line:
210 fasta33 -q query.aa "demo.sql 16"
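As a rough sketch of the file layout described above, the four
';'-separated parts could be split like this. The function
split_sql_parts() and the SQL_PARTS constant are hypothetical names
used only for illustration; this is not the parser in mysql_lib.c,
and it assumes no ';' appears inside a quoted SQL string.

```c
#include <ctype.h>
#include <string.h>

#define SQL_PARTS 4

/* Split buf in place at ';' into at most SQL_PARTS pieces.
   parts[i] points at the start of piece i, with leading whitespace
   skipped. Returns the number of pieces found.
   Limitation: a ';' inside a quoted SQL string is not handled. */
int split_sql_parts(char *buf, char *parts[SQL_PARTS])
{
    int n = 0;
    char *p = buf;

    while (n < SQL_PARTS) {
        char *end;

        while (*p != '\0' && isspace((unsigned char)*p)) p++;
        if (*p == '\0') break;
        end = strchr(p, ';');
        if (end == NULL) break;        /* unterminated piece: stop */
        *end = '\0';
        parts[n++] = p;
        p = end + 1;
    }
    return n;
}
```

A caller would treat a return value other than 4 as a malformed
query file.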
212 At least for simple queries, there is surprisingly little overhead for the
213 search. For more complex queries involving several tables, the overhead
216 At the moment, libraries that need the functions in mysql_lib.c will
217 use library type 16. We may also use file type 17 for SQL queries
218 that return binary sequences.
220 This implementation of mysql_lib.c was written to require a minimal
221 amount of change to the other programs. Only nmgetaa.c and altlib.h
222 needed to be changed to incorporate this new capability. One result
223 of this limitation is that one cannot mix mySQL database queries with
224 other databases in the same search. Eventually, I would like to make
225 a mySQL database like any other, so that several mysql database
226 queries could be searched in the same run, and mysql databases could
227 be mixed with other (flat file) databases, but this will require some
228 changes in the function calls throughout the code. (Right now, the
229 various programs do not distinguish between an openlib() that is made
230 before searching a large database, and one before retrieving a single
231 sequence. This must be changed for a database query like mySQL to
232 behave like other databases.)
234 Several mySQL demo files have been provided: mysql_demo*.sql.
236 (10 January 2001) The mySQL code has been tested on Intel Linux and
237 Compaq/Alpha/Tru64 Unix.
241 Changes to apam.c to tie different default gap penalties to
242 alternate scoring matrices. In addition, changes to apam.c, to deal
243 with user-specified matrices with or without '*'.
245 >>Nov. 5, 2000 (date updated)
247 pst.dnaseq can now have 3 values, -1, or 0-> protein, 1->DNA, and 2->other.
248 This becomes important for things like init_karlin_a, which needs a
249 background frequency of residues.
253 Significant bug fixes for the -z 6/-z 16 option. An uninitialized
254 variable was fixed in karlin.c, and comp_thr.c did not pass the
255 correct composition argument type in find_zp(). The -z 6/16 option
256 has now been tested and works correctly on Alphas, Linux x86, SGI, Sun
257 and Mac OSX. Another problem was fixed in scaleswn.c (simplex()) that
258 prevented the code from being reused by the pv4/mp4 complib programs.
262 Several changes made to accommodate Mac OSX. Longer lists of superfamily
263 numbers now supported in p[su]4comp/m[su]4comp programs.
267 All global variables have been removed from scaleswn.c. The last to
268 go, db_struct db, required many edits, because until now, the fasta
269 programs have kept two versions of the db_struct data (entries,
270 length). One version was kept by the main program, which updated entry
271 number and db length as sequences were read; a second copy of this
272 information was kept by the statistical estimation routines. Now
273 there is only one copy, which means that the E() values will be a
274 function of the complete database, not the database with some high
275 scoring sequences removed.
279 Continued removal of global variables from scaleswn.c. Only one
280 global is left, db_struct db, which contains the number of entries in
281 the database and the number of residues. It will be the next to go
282 (changing all the zs_to_*() functions) and scaleswn.c will be free
283 of globals. scaleswg.c is gone - scaleswn.c compiles to scaleswg.c
288 Removal of histogram globals required changes in p2_complib.c as well.
289 p_complib.c has not been updated. scaleswg.c has been modified to
290 reflect the new histogram strategy.
294 Substantial changes to remove globals for printing histogram. m_msg
295 now contains a hist_str, which keeps histogram information.
298 (no version change, previous version not released)
300 Correct bug introduced into scaleswn.c (inithist()) by changing
301 score2_sums[], score_sums[] from int to double.
303 Reporting of version numbers is more consistent between fasta33,
304 fasta33_t, and pv4compfa/mp4compfa. The programs now report the same
305 numbers/dates in similar places.
310 Changes to fix problems with statistical estimates when a large
311 fraction (but not all) of the database is related. Several users
312 reported problems when searching with rRNA genes with version 33t06.
313 In some cases, a 100% identical match over 1500 nt would not be
314 statistically significant against a search of the bacterial division
315 of Genbank. This problem was not seen with some releases of v33t05.
317 The cause of the problem was a change between v33t05 and v33t06 to
318 allow scoring matrices with unusual scaling to be used. In v33t05,
319 there was a line that excluded all scores > 300 from the statistical
320 estimation procedure. While 300 is a high score with any "normal"
321 scoring matrix, some investigators were using matrices scaled 10X, so
322 that a score of 300 was really a score of 30 with a conventional
323 matrix, and should not be excluded. Unfortunately, removing the test
324 to exclude scores > 300 meant that when a rRNA sequence was used to
325 search the bacterial division, tens of thousands of high scoring
326 related sequences were treated as if they were unrelated, with the
327 result that the variance estimates were much too high, and thus high
328 real scores had low z-scores, and thus were not statistically
329 significant. (There appear to be more than 20,000 rRNA sequences in
330 the bacterial division of Genbank, almost 25% of all sequences).
332 The solution to the problem is a substantial enhancement in the
333 strategies used to exclude high-scoring, related sequences, the -z 1,
334 4, and 5 parameter estimation strategies. The programs now estimate
335 the expected high scoring sequence by calculating an ungapped Lambda
336 and K, and then use a relatively conservative threshold for excluding
337 scores that are higher than would be expected 0.01 times by chance.
338 By calculating Lambda and K, we can scale the cutoff thresholds to
339 allow scoring matrices with unusual scales. For "normal" searches,
340 there should be little change, but there should be an improvement for
341 searches with large numbers of related sequences in the database.
343 As a result of testing for this change, a bug in the karlin() function
344 used with -z 6 was found and corrected.
349 Changes to manshowbest.c to include correct display coordinates.
351 Significant changes to structs.h, param.h, p2_complib.c,
352 p2_workcomp.c, to store and use a reliable a_struct for alignment
355 Other cosmetic changes.
359 Minor changes to complib.c, showrss.c, so that prss33 -q uses 200
360 shuffles and prss33 provides bit scores, rather than z-scores.
361 (no version number change).
363 Modifications to p2_complib.c to include superfamily numbers for
368 Changes to mmgetaa.c, ncbl2_mlib.c, dropfs.c to accommodate AIX.
369 00README.1st updated to reflect the current version and correct
370 outdated information on threads.
374 Modifications to initpam2() in initsw.c to correct a problem with pam_x
375 when the -S option is used.
377 Modifications to compacc.c, scaleswn.c to ensure that residue numbers
378 are calculated properly when more than 2 Gb of sequence is searched.
382 Modifications to dropnfa.c so that DNA matches to 'N' will be included
383 in the "ungapped %identity". Thus, a sequence that is 100% identical
384 for 100 nt on either side of a 100 nt region that has been masked to
385 'NNNNN' will be reported as: "67% identical (100% ungapped)". This
386 has been added to deal with masked BAC-end databases. It would be
387 better if masking changed the letters to lowercase, but the mouse
388 BAC-end sequences at TIGR use 'NNNNN'. This is currently available
389 only for the fasta function, not [t]fast[x/y], etc, and only for DNA
392 mk_n_pam() in apam.c modified to ensure that mismatch scores of -1
397 Modification to nxgetaa.c, nmgetaa.c, mmgetaa.c to return Genbank Accession
398 number as part of the descriptive string.
402 (no version change - not yet released)
404 Modifications to calcons(), calc_id(), showbest(), p_workcomp.c to
405 provide ngap_q (number of alignment gaps in query) , ngap_l (number
406 of gaps in library) information for -m 9 output.
410 (no version change - not yet released)
412 Modified scaleswn.c to provide better support for unconventional
413 scoring matrices, in particular, scoring matrices where every
414 value is 50-times higher. Previous versions of the MLE estimator (-z
415 2) started with lambda = 0.2, which is too high for a scoring matrix
416 going from -500:+1500. The initial estimate for lambda is now
417 calculated using the formula: lambda = pi/sqrt(6*variance). For the
418 default -z 1, a restriction to limit scores to a maximum of 300 for
419 the statistical analysis was removed.
423 Modified alignment output, and -m 9 and -m10, to report an "ungapped"
424 identity as well as the traditional "gapped" identity. The
425 traditional "gapped" identity reports the number of identities divided
426 by the overall length of the alignment, including gaps. The
427 "ungapped" identity does not include gaps in the length of the
428 alignment. This new value is included for alignments that include
429 introns; thus, a tfastx33 search might find the 100% identical genomic
430 sequence but report the gapped percent identity if a short intron were
431 included in the alignment (the alignment probably would not span a
432 long exon) as 66%. The "ungapped" identity would remain 100%. The
433 ungapped identity value is also shown in the "-m 9" output line after
434 the "gapped" fraction identical.
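The two identity values can be illustrated with a small function that
works on a pair of aligned strings. This is a sketch under simple
assumptions (both strings have the same length, '-' marks a gap,
comparison is case-sensitive); percent_identity() is a hypothetical
name and is not the calc_id() routine in the FASTA sources.

```c
#include <stddef.h>

/* Compute both percent identities for two aligned strings in which
   '-' marks a gap.
   gapped   = identities / all alignment columns
   ungapped = identities / columns where neither string has a gap
   (Hypothetical helper, for illustration only.) */
void percent_identity(const char *a, const char *b,
                      double *gapped, double *ungapped)
{
    size_t i, ident = 0, len = 0, nogap = 0;

    for (i = 0; a[i] != '\0' && b[i] != '\0'; i++) {
        len++;
        if (a[i] == '-' || b[i] == '-') continue;
        nogap++;
        if (a[i] == b[i]) ident++;
    }
    *gapped   = len   ? 100.0 * ident / len   : 0.0;
    *ungapped = nogap ? 100.0 * ident / nogap : 0.0;
}
```

For the pair "ACGT--ACGT" and "ACGTTTACGT" there are 8 identities in
10 columns, 2 of which are gaps, so the gapped identity is 80% and
the ungapped identity is 100%.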
438 Modified -m 9 output to provide fraction identical, alignment boundary
439 information with the initial list of high scoring sequences, just as
440 the pv3comp and mp_comp versions do. The -m 9 option now shows the
441 same alignment display as -m 0, but the width of the alignment is
442 increased by 40. Thus, by default, -m 9 will show the list of best
443 hits, with percent identity, Smith-Waterman score, and alignment
444 boundaries initially, and then show standard (-m 0)
445 alignments with 100 residues/line.
449 Correct some problems with reading data files with <CR>'s under unix.
451 nmgetaa.c/nxgetaa.c/mmgetaa.c have been modified to convert <TAB>
452 ('\t') to <SPC> (' ') in descriptive lines.
458 Corrected problem with very low mean_var in fit_llen() in scaleswn.c.
461 (no version number change - previous version not released)
463 Merged fasta33t05d2 with fasta33t06. Also removed restriction on
464 "-M size-range" to proteins - the size range now can be applied to DNA
468 (changes to v33t05d merged into v33t06)
470 Introduced changes to include '*' as a valid sequence character, which
471 indicates termination. Thus, 'TGA', 'TAG', and 'TAA' are now
472 translated to '*' rather than 'X', and the protein PAM matrices have
473 been modified to provide a match score of approximately 1/2 the max
474 identity score for a '*:*' match. Otherwise, '*' is the same as 'X'.
475 This change only affects query sequences that include a '*' to
476 indicate an end of sequence, the '*' is not there by default.
478 The inclusion of '*' broke some things in tfasts33, tfastf33, fasty33,
479 and tfasty33, which were fixed today.
481 >>March 28, 2000/April 24, 2000
484 (a) -z 6 statistics that factor in composition
485 (b) -smatrix-offset pam-offset parameter
487 (a) This release provides a new statistics option, -z 6, which
488 provides a more sophisticated model that accounts for sequence
489 composition. When -z 6 is used (only for fasta33(_t) and
490 ssearch33(_t)), the program calculates a composition parameter
491 comp=1/lambda using a modified version of the Karlin-Altschul karlin()
492 function. As a result, every sequence in the database has an
493 associated length (n1) and composition (comp).
495 The length n1 and composition comp are used in the maximum likelihood
496 estimation described by Mott (1992) Bull. Math. Biol. 54:59-75. Four
497 parameters are estimated, a0, a1, a2, and b1, and the probability of
498 obtaining a score is then:
500 p(s >= x) = 1-exp(-exp(-( a0 + a1*comp + a2*comp*log(n0*n1) + x)/(b1*comp)))
502 The maximum likelihood estimates of a0, a1, a2, and b1 are calculated
503 using the Nelder-Mead simplex search strategy.
505 The average Lambda is reported for the search using Lambda =
506 1/(b1*ave_comp), where ave_comp is the geometric mean of the comp values
507 calculated during the statistical estimates.
509 The "lambda/comp" calculation can fail for sequences with very biased
510 amino acid composition. When this occurs, 'comp' is set to -1.0 (as
511 is 'H', the information content parameter) and the 'ave_comp' value is
512 used to calculate statistical significance. (But obviously 'ave_comp'
513 is not really appropriate, since if the sequence had an average 'comp'
514 value, it would have been calculated.) When -z 6 is used, the
515 alignment display shows the 'comp' and 'H' values for that library
518 (b) Scoring matrix offsets - The main reason that the "lambda/comp"
519 calculation fails is that, for the particular query/library sequence
520 pair, the expected score is not < 0, instead, Sum {p_ij S_ij} >= 0.0.
521 This problem is reported to 'stderr' when it occurs. The simplest
522 solution to the problem is to provide an offset to the scoring matrix;
523 for example, to use Blosum62 - 1, which ranges from +10 to -5, rather
524 than the standard +11 to -4. This option used to be available with
525 the -S offset option, but -S is now used to specify a lower-case
526 seg-ed database. The offset can now be specified as part of the
527 scoring matrix name. Thus, "-s BL62-1" uses Blosum62 reduced by 1 at
528 each entry. The '-' character is used to indicate an offset, so
529 scoring matrix files must not have a '-' in their name.
530 Alternatively, "-s BL80+1" or "-s BL80--1" would add one to each value.
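The naming rule can be sketched as a small parser. This is an
illustration of the convention described above, not the code used by
the programs; parse_matrix_spec() is a hypothetical name, and
treating '+' as a separator as well as '-' is an assumption drawn
from the "BL80+1" example.

```c
#include <stdlib.h>
#include <string.h>

/* Split a matrix specification such as "BL62-1" into the matrix
   name ("BL62") and a signed offset (-1).
   "+N" adds N to each entry; "-N" subtracts N, so "--1" adds 1.
   Returns 0 on success, -1 if the name does not fit in name[].
   (Hypothetical helper, for illustration only.) */
int parse_matrix_spec(const char *spec, char *name, size_t name_len,
                      int *offset)
{
    size_t n = strcspn(spec, "+-");    /* length of the matrix name */

    if (n >= name_len) return -1;
    memcpy(name, spec, n);
    name[n] = '\0';

    if (spec[n] == '\0')
        *offset = 0;                   /* no offset given */
    else if (spec[n] == '+')
        *offset = atoi(spec + n + 1);
    else
        *offset = -atoi(spec + n + 1);
    return 0;
}
```

Because the name ends at the first '-' or '+', a matrix file whose
name contains either character cannot be selected this way, which is
the restriction noted above.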
532 nxgetaa.c, nmgetaa.c, and mmgetaa.c have been edited to avoid string
533 run-off problems after strncpy().
535 Fixed problem where positive gap extension penalties in ssearch33
536 were not converted to negative values.
540 Fixed problem in calculating corrected sequence lengths for
541 Altschul-Gish probabilities.
544 (no version change, date updated to March 30, 2000)
546 Corrected problem with -m 9 option.
548 The '*' character is now available to allow translated alignments to
549 extend through the termination codon. Thus, if a protein sequence ends
550 with a '*', and matches in to a translated termination codon, the
551 score will be increased. The *:* match score is set to 1/2 the max
552 positive score for the matrix (see upam.h). This strategy can also be
553 used to upweight a match that extends all the way to the end of a
554 full-length sequence by putting '*' at the end of both the query and
555 library protein sequences. Recognition of '*' will probably become a
559 (no version change, previous version not distributed)
561 Changes to map_db.c, list_db.c, and mmgetaa.c to accommodate large
562 sequence files. Long (64-bit on some systems) variables are now used
563 to specify file and memory position for the memory mapped functions.
564 As a result, there are now two *.xin (memory mapped index) file
565 formats: MP0, which uses 32-bit longs, and MP1, which uses 64-bit
566 longs. On 64-bit machines, MP0 32-bit indices are read properly, but
567 limit the database size to 2 or 4 Gb; MP1 64-bit indices allow very
568 large databases. Blast2.0 formatdb databases are still limited to
569 4Gb. To compile map_db.c to generate 64-bit index files, include the
570 compile time option -DBIG_LIB64 in the Makefile. (Currently this
571 option has been tested only on the DEC Alpha and SGI platforms, and
572 will work only with Unix versions that provide 64-bit longs and 64-bit
575 The -R results file now uses sfn_cmp() to report a matching
576 superfamily number, if one exists, and '0' otherwise.
579 (no version change, previous version not distributed)
581 Provide new strategy for specifying library abbreviations. In
584 fasta33 query.aa %anr
586 one can also specify:
588 fasta33 query.aa %pir1+sp+nr
590 fasta33 query.aa +pir1+sp+nr
592 fasta33 query.aa %+pir1+sp+nr
594 where the + anywhere in the library name string indicates that
595 variable length library names, separated by '+', are being used (the
596 last '+' is optional). The FASTLIBS file then becomes:
599 PIR1 Annotated Protein Database (rel 56)$0+pir1+/slib2/blast/pir1.lseg
600 NBRF Protein database (complete)$0+nbrf+@/seqlib/lib/NBRF.nam
601 NRL_3d structure database$0D/seqlib/lib/nrl_3d.seq 5
602 NCBI/Blast non-redundant proteins$0+nr+/slib2/blast/nr.lseg
603 NCBI/Blast Swissprot$0+sp+/slib2/blast/swissprot.lseg
606 The two abbreviation types, single letter and +word+, cannot be
607 intermixed, and at least initially, +word+ specifiers are
608 case-sensitive (single letter abbreviations are not) and will not be
609 available interactively, only on the command line.
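A minimal sketch of how a +word+ library string might be broken into
its abbreviations is shown below. The names split_lib_abbrevs() and
MAX_LIBS are hypothetical; this is not the code in the FASTA sources.
It assumes the caller has already found a '+' in the string, since a
string without one (such as %anr) uses single-letter abbreviations.

```c
#include <string.h>

#define MAX_LIBS 16

/* Split a "+word+" library string in place into abbreviations.
   A leading '%' is skipped; strtok() drops the empty fields caused
   by a leading or trailing '+', so both are optional.
   Returns the number of abbreviations stored in names[].
   Note: strtok() keeps internal state and is not thread-safe.
   (Hypothetical helper, for illustration only.) */
int split_lib_abbrevs(char *spec, char *names[MAX_LIBS])
{
    int n = 0;
    char *tok;

    if (*spec == '%') spec++;
    for (tok = strtok(spec, "+");
         tok != NULL && n < MAX_LIBS;
         tok = strtok(NULL, "+"))
        names[n++] = tok;
    return n;
}
```

Each returned word would then be compared, case-sensitively, with the
+word+ field of the FASTLIBS entries.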
611 Removed 'K' estimate for Expectation_n, Expectation_i fits to the
612 distribution of unrelated similarity scores. 'K' cannot be calculated
613 from the data available. 'Lambda' can be calculated, it is
614 1.28255/sqrt(mean_var), and is still available.
619 changed Makefile33.common, Makefile.common, to incorporate $(NRAND)
620 rather than "rand48". Provide nrandom.c which uses random(), as
621 replacement for nrand.c, which uses rand48().
626 Fixes to scaleswn.c (proc_hist_ml) to set num_db_entries properly.
627 Scaleswn.c also provides Lambda estimates for -z 1/11 (Expectation_n),
628 and -z 1/14 (Expectation_i) statistical estimates.
630 Modifications to calc_id() to correct bug in counting identities.
631 Modified showalign() to use calc_id() with -m 9, for simpler
634 Additional modifications to dropfa*.c files to deal properly with 'n's
637 Added new option: -x #, which allows one to override the penalty for a
638 match against 'x' (or 'N') provided by the scoring matrix. This
639 option is particularly useful in fast[x/y] searches, where out of
640 frame low complexity regions can generate high scores.
642 The old function of '-x' - to specify an alternate coordinate system,
643 is now available as '-X # #'.
645 Updated scaleswn.c to provide window shuffle information for -z 12.
647 Updated compacc.c, workacc.c, to fix serious bug in wshuffle()
648 that destroyed aa1[n1]=0.
653 A serious bug in all of the fasta related programs has been
654 corrected. The new code in fasta33 which ignores certain residues
655 failed to initialize one of the arrays properly. As a result, in
656 pathological situations, a very strong match could be missed.
658 Corrected minor bug in initsw.c that caused a misplaced "ktup" command
659 line argument, which should be ignored by ssearch, to be read as -d
662 Improved error message for 0 length query sequence.
665 --> no external version number change
667 Modified mmgetaa.c, map_db.c, and nmgetaa.c to provide memory mapping
668 of genbank flatfile (format=1) files. This format could be read much
669 more efficiently, however.
672 --> no external version number change
674 Changed the behavior of the options that set the number of high scores
675 (-b) and alignments (-d) that are displayed. Previously, fasta33 -E
676 10.0 -d 10 would show 50 best scores, rather than all the scores with
677 E() < 10.0. To get the -E threshold to limit, -E 10.0 -b 10000 -d 10
678 was required. This is now fixed. Setting "-d 10" does not affect the
679 number of best scores shown.
681 Minor change in mw.h to remove unused defines.
683 fasta3x.me (fasta3x.doc) updated.
688 Corrected bug in memory mapped reads of gcg_binary format files
689 that potentially caused the last 63 residues to be read improperly.
691 Changes to comp_thr.c, pthr_subs.c, uthr_subs.c, ibm_pthr_subs.c to
692 ensure that each thread has its own work_info structure. This solves
693 some minor race conditions that sometimes caused some parameters
694 not to be reported properly.
696 Changes to most of the drop*.c files to correct some minor problems
697 with sequence alphabets. Code in mmgetaa.c (memory mapped code for
698 FASTA, GCG compressed files) reordered to prevent files from being
699 memory mapped if appropriate index files are not available.
701 See readme.pvm_3.3 for updates to the pvm programs.
704 (no version change - modifications largely affect ps3comp*)
706 Modifications to showsum.c to deal with 2 scores/sequence. Modifications
707 to mmgetaa.c for superfamily numbers.
710 (no version change, previous version not released)
712 Corrected problem in mmgetaa.c that caused searches on a memory mapped
713 single long sequence (e.g. Chr22) to fail. Corrected bug in map_db.c
714 that caused it to crash on some architectures if a filename was not
715 specified. Corrected off-by-three error in fasty/tfasty. Corrected
716 indexing error in dropfz2.c.
721 corrected some bugs in initfa.c/initsw.c/doinit.c that caused
722 abbreviated function names to be lost.
724 modify showbest.c, showalign.c to include information on position in
725 library sequence (bbp->cont) to distinguish subsegment of very long
726 sequences. Currently, the new label is available only with -m 6.
729 [t]fastz33 uses v33t02 of fasty function.
731 Replace dropfz.c with dropfz2.c. Dropfz2.c interprets any codon
732 that includes the nucleotide 'N' as the amino acid 'X'. Previously, 'N' was
733 treated as 'A', so 'NNN' ended up 'K'. This modification, together
734 with the -S option and lower-case pseg'ed databases, should ensure
735 that DNA queries with large numbers of 'N's do not match low
739 (no version change, previous version not released)
741 Modify initfa.c to display initn, init1 scores for [t]fast[fs].
742 Include "-B" option to show previous z-scores.
745 (no version change, previous version not released)
747 Modify dropfx.c to use saatran(), rather than aatran(). saatran
748 translates any 'N' containing codon as 'X'. aatran() treats 'N' as
749 an 'A'. Although more steps are required for translation, the program
750 appears to run just as fast.
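The difference between the two translation rules can be sketched with
a table lookup over the standard genetic code. This is an
illustration only, not the saatran() code: translate_codon() and
base_index() are hypothetical names, and the codon is assumed to
point at three or more characters.

```c
/* Map a nucleotide to its position in T, C, A, G order.
   Anything else, including 'N', returns -1. */
static int base_index(char c)
{
    switch (c) {
    case 'T': case 't': case 'U': case 'u': return 0;
    case 'C': case 'c': return 1;
    case 'A': case 'a': return 2;
    case 'G': case 'g': return 3;
    default: return -1;
    }
}

/* Translate one codon (three characters) with the standard genetic
   code. Any codon containing a character other than T/U, C, A, or G,
   such as 'N', is translated as 'X'.
   Termination codons are translated as '*'.
   (Hypothetical helper, for illustration only.) */
char translate_codon(const char *codon)
{
    static const char table[] =
        "FFLLSSSSYY**CC*W"      /* T.. */
        "LLLLPPPPHHQQRRRR"      /* C.. */
        "IIIMTTTTNNKKSSRR"      /* A.. */
        "VVVVAAAADDEEGGGG";     /* G.. */
    int b0 = base_index(codon[0]);
    int b1 = base_index(codon[1]);
    int b2 = base_index(codon[2]);

    if (b0 < 0 || b1 < 0 || b2 < 0) return 'X';
    return table[16 * b0 + 4 * b1 + b2];
}
```

Under the older rule, where 'N' was read as 'A', the codon NNN would
be looked up as AAA and come out as 'K'; here it comes out as 'X'.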
755 Substantial changes to the output format in showbest.c (the list of
756 high scoring sequences) and showalign.c (the alignments). The classic
759 The best scores are: initn init1 opt z-sc E(82014)
760 gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIO ( 218) 1497 1497 1497 1761.1 2.3e-91
761 gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE ( 218) 1413 1413 1413 1662.9 6.7e-86
763 has been replaced by:
765 The best scores are: opt bits E(82138)
766 gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIONE S-TRAN ( 218) 1497 354 7.6e-98
767 gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE S-TRANSF ( 218) 1413 335 5.3e-92
769 This display provides more information and removes the outdated initn
770 and init1 scores, which are no longer used. The "bit" score is
771 comparable to the blast2 bit score. It is calculated as: (lambda*S -
772 ln K)/ln 2, where S is the raw similarity score, lambda and K are
773 statistical parameters estimated from the distribution of unrelated
774 sequence similarity scores. All of the similarity scores, including
775 init1, initn, and z-scores are reported with the alignment data.
776 Z-scores are displayed instead of bit scores in the list of high
777 scores if the command line option "-B" is specified.
779 In addition, the alignment score line has changed from:
781 >>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER (220 aa)
782 initn: 954 init1: 954 opt: 958 Z-score: 1130.9 expect() 1.1e-56
783 Smith-Waterman score: 958; 61.927% identity in 218 aa overlap (1-218:1-218)
787 >>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER (220 aa)
788 initn: 954 init1: 954 opt: 958 Z-score: 1130.9 bits: 216.4 E(): 2.8e-56
789 Smith-Waterman score: 958; 61.927% identity in 218 aa overlap (1-218:1-218)
791 Besides the new "bits:" score, the "expect()" label
792 has changed to "E()" to save some space.
794 >>November 4,12, 1999
797 Fixed serious bug in -z 2 lambda/K calculation in scaleswn.c
799 Fixed bugs in llgetaa.c (openlib()) and definition of superfamily
805 Begin using CVS for version control. Correct faulty error message in
806 dropfs.c. Corrected bad "goto loopl;" in dropfz.c. Corrected prss3.rsp
807 for Makefile.tc (Win32 version).
812 Corrected some serious bugs with the various fasta/x/y programs when
813 the -DALLOCN0 was used to save memory. Improvements to fasta3x.me/.doc
819 For this initial release of version 33 of the FASTA programs, the
820 Makefiles have been modified to make "fasta33(_t)", "fastx33(_t)",
821 etc, so that you can test fasta33 while retaining fasta3 (from release
822 v32t08). The FASTA33 programs are somewhat slower than previous
823 releases, but I believe the ability to handle low complexity regions
824 without 'X'ing them out outweighs the slowdown. By (temporarily)
825 changing the names of the programs slightly, it will be easier for you
826 to judge the relative cost and benefit. To "make" the programs as
827 "fasta3(_t)", etc, simply replace "Makefile33.common" with
828 "Makefile.common" in the "Makefile" that you use.
832 ssearch3/fasta3/fastx3/fasty3 have been modified to search databases
833 containing both upper and lower case letters, where lower case letters
834 indicate low-complexity regions. With the modified programs, lower
835 case letters are treated as 'X's in the initial scan, but are then
836 treated normally in the final alignment. In addition, alignments can
837 contain lower case letters. Lower case letters are treated as
838 low-complexity regions during the search phase of the program, but as
839 "conventional" residues during the alignment phase, with the "-S"
840 option. Currently, lower case letters are mapped to 'X's during the
841 scan of the entire library. In the future, alternate weights will be
842 available. This is a substantial improvement for very large scale
843 comparison, where one seeks both accurate statistical estimates and
844 accurate %identities and alignments, and for translated DNA:protein
845 comparisons, like "fastx3" and "fasty3", where out-of-frame
846 translations tend to match low complexity regions (see Pearson et
847 al. (1997) Genomics 46:24-36).
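
The effect of "-S" on the initial scan can be illustrated with a small sketch. It is not the library-reading code from the FASTA programs; the function name `scan_view` and the example sequence are assumptions made for illustration.

```python
def scan_view(seq, mask_lower=True):
    """Return the residues as the initial library scan would see them.

    With masking on (the "-S" behaviour described above), lower-case
    residues are replaced by 'X'. With masking off, case is ignored and
    all residues are present.
    """
    if not mask_lower:
        return seq.upper()
    return "".join("X" if c.islower() else c for c in seq)
```

For a hypothetical entry "MKTaaaaaLLV", the masked scan sees "MKTXXXXXLLV", while the final alignment still works with the original residues.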
849 Protein databases (and query sequences) can be generated in the
850 appropriate format using John Wootton's "pseg" program, available from
851 ftp://ncbi.nlm.nih.gov/pub/seg/pseg. Once you have compiled the "pseg"
852 program, use the command:
854 pseg database.fasta -z 1 -q > database.lc_seg
856 Once you have database.lc_seg, run the command "map_db" to generate
857 a ".xin" file that can be used to efficiently memory map the database.
859 You can then search database.lc_seg with or without the "-S" option.
860 Without "-S", the database is treated as any other FASTA format file -
861 all the residues are present. With "-S", lower case residues will be
862 treated as 'X's during the initial scan but as normal residues when
863 final alignments are displayed.
865 When the -S option is used, the matrix information line is changed
866 from: "BL50 matrix (15:-5)" to "BL50 matrix (15:-5)xS". The "-S"
867 option is no longer available to provide a scoring matrix offset.
869 Unfortunately, Blast2.0 format files cannot contain lower case
870 letters. We have addressed this problem by providing efficient memory
871 mapped access to FASTA, GCG/PIR, and GCG/compressed-binary files in
872 the last release of fasta32t08. The memory mapped file I/O
873 improvements are provided in fasta33 as well.
875 ================ readme.v32 ================
877 FASTX/Y and FASTA (DNA) are now half as fast, because the programs now
878 search both the forward and reverse strands by default.
880 The documentation in fasta3x.me/fasta3x.doc has been substantially
886 Modify nxgetaa.c/nmgetaa.c to recognize 'N' as a possible DNA character.
889 --> v32t08 (no version number change)
891 Added "-M low-high" option, where low and high are inclusion limits
892 for library sequences. If a library sequence is shorter than "low" or
893 longer than "high", it will not be considered in the search. Thus,
894 "-M 200-250" limits the database search to proteins between 200 and
895 250 residues in length. This should be particularly useful for fasts3
896 and fastf3. -M -500 searches library sequences < 500; -M 200 -
897 searches sequences > 200. This limit applies only to protein
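
The three forms of the "-M" argument described above ("200-250", "-500", "200 -") can be parsed as in the sketch below. This is not the option-parsing code in the FASTA programs; the function names are hypothetical, and treating both limits as inclusive is an assumption based on the phrase "inclusion limits".

```python
def parse_length_limits(arg):
    """Parse a "-M" style range into (low, high); a missing side becomes None."""
    low_text, sep, high_text = arg.partition("-")
    if not sep:
        raise ValueError("expected low-high, -high, or low-")
    low = int(low_text) if low_text.strip() else None
    high = int(high_text) if high_text.strip() else None
    return low, high


def length_allowed(n, low, high):
    """True if a library sequence of length n falls within the limits."""
    if low is not None and n < low:
        return False
    if high is not None and n > high:
        return False
    return True
```

For example, `parse_length_limits("200-250")` yields `(200, 250)`, so a 218-residue protein is searched and a 300-residue protein is skipped.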
900 Modified scaleswn.c to fall back to maximum likelihood estimates of
901 lambda, K rather than mean/variance estimates. (This allows MLE
902 estimation to be used instead of proc_hist_n when a limited range of
910 (1) memory mapped (mmap()ed) database reading - other database reading fixes
911 (2) BLAST2 databases supported
912 (3) true maximum likelihood estimates for Lambda, K
913 (4) Misc. minor fixes
915 (1) (Sept. 26 - Oct. 2, 1999) Memory mapped database access.
916 It is now possible to use mmap()ed access to FASTA format databases,
917 if the "map_db" program has been used to produce an ".xin" file. If
918 USE_MMAP is defined at compile time and a ".xin" file is present, the
919 ".xin" will be used to access sequences directly after the file is
920 mmap()ed. On my 4-processor Alpha, this can reduce elapsed time by
921 50%. It is not quite as efficient as BLAST2 format, but it is close.
923 Currently, memory mapping is supported for type 0 (FASTA), 5
924 (PIR/GCG ascii), and 6 (GCG binary). Memory mapping is used if a
925 ".xin" file is present. ".xin" files are created by the new program
926 "map_db". The syntax for "map_db" is:
928 map_db [-n] "/dir/database.fa"
930 which creates the file /dir/database.fa.xin. Library types can be
931 included in the filename; thus:
933 map_db -n "/gcggenbank/gb_om.seq 6"
935 would be used for a type 6 GCG binary file.
937 The ".xin" file must be updated each time the database file changes.
938 map_db writes the size of the database file into the ".xin" file, so
939 that if the database file changes, making the ".xin" offset
940 information invalid, the ".xin" file is not used. "list_db" is
941 provided to print out the offset information in the ".xin" file.
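
The staleness check described above (compare the database size recorded at indexing time with the current size) can be sketched as follows. The layout of the ".xin" file is not described here, so the sketch assumes the recorded size has already been read from it; `index_is_current` and `recorded_size` are hypothetical names.

```python
import os


def index_is_current(db_path, recorded_size):
    """Return True if the database file still has the size recorded in its index.

    A mismatch (or a missing database file) means the stored offsets can no
    longer be trusted, so the index should be ignored.
    """
    try:
        return os.path.getsize(db_path) == recorded_size
    except OSError:
        return False
```

A size comparison is cheap but cannot detect an edit that leaves the file length unchanged, which is one more reason to rerun map_db whenever the database is updated.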
943 (Oct 2, 1999) The memory mapping routines have been changed to
944 allow several files to be memory mapped simultaneously. Indeed, once a
945 database has been memory mapped, it will not be unmap()ed until the
946 program finishes. This fixes a problem under Digital Unix, and should
947 make re-access to mmap()ed files (as when displaying high scores and
948 alignments) much more efficient. If no more memory is available for
949 mmap()ing, the file will be read using conventional fread/fgets.
951 (Oct 2, 1999) The names of the database reading functions have been
952 changed to allow both Blast1.4 and Blast2.0 databases to be read. In
953 addition, Makefile.common now includes an option to link both
954 ncbl_lib.o and ncbl2_lib.o, which provides support for both libraries.
955 However, Blast1.4 support has not been tested.
957 The Makefile structure has been improved. Each architecture specific
958 Makefile (Makefile.alpha, Makefile.linux, etc) now includes
959 Makefile.common. Thus, changes to the program structure should be
960 correct for all platforms. "map_db" and "list_db" are not made with
963 The database reading functions in nxgetaa.c can now return a database
964 length of 0, which indicates that no residues were read. Previously,
965 0-length sequences returned a length of 1 and were ignored.
966 Complib.c and comp_thr.c have changed to accommodate this
967 modification. This change was made to ensure that each residue,
968 including the last, of each sequence is read.
970 Corrected bug in nxgetaa.c with FASTA format files with very long
971 (>512 char) definition lines.
973 (2) (September 20, 1999) BLAST2 format databases supported
975 This release supports NCBI Blast2.0 format databases, using either
976 conventional file reading or memory mapped files. The Blast2.0 format
977 can be read very efficiently, so there is only a modest improvement in
978 performance with memory mapping. The decision to use mmap()'ed files
979 is made at compile time, by defining USE_MMAP. My thanks to Eamonn
980 O'Toole of DEC/Compaq, and Daryl Madura of Sun Microsystems, for
981 providing mmap()'ed modifications to fasta3. On my machines, Blast2.0
982 format reduces search time by about 30%. At the moment, ambiguous DNA
983 sequences are not decoded properly.
985 (3) (September 30, 1999) A new statistical estimation option is
986 available. -z 2 has been changed from ln()-scaling, which never
987 should have been used, to scaling using Maximum Likelihood Estimates
988 (MLEs) of Lambda and K. The MLE estimation routines were written by
989 Aaron Mackey, based on a discussion of MLE estimates of Lambda and K
990 written by Sean Eddy. The MLE estimation examines the middle 95% of
991 scores, if there are fewer than 10000 sequences in the database;
992 otherwise it excludes (censors) the top 250 scores and the bottom 250
993 scores. This approach seems to effectively prevent related sequences
994 from contaminating the estimation process. As with -z 1, -z 12 causes
995 the program to generate a shuffled sequence score for each of the
996 library sequences; in this case, no censoring is done. If the
997 estimation process is reliable, Lambda and K should not vary much with
998 different queries or query lengths. Lambda appears not to vary much
999 with the comparison algorithm, although K does.
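
The censoring rule described above (middle 95% of scores for fewer than 10000 sequences, otherwise drop the top 250 and bottom 250) can be sketched as below. This only shows which scores are kept; it is not the MLE routine itself, and the function name and the integer rounding of the 2.5% tails are assumptions.

```python
def censor_scores(scores):
    """Return the sorted scores that would be kept for lambda/K estimation."""
    ordered = sorted(scores)
    n = len(ordered)
    if n < 10000:
        # Middle 95%: trim 2.5% from each tail (rounded down).
        trim = (n * 25) // 1000
    else:
        # Large libraries: censor a fixed 250 scores from each tail.
        trim = 250
    return ordered[trim:n - trim]
```

Note that the two rules agree at the boundary: 2.5% of 10000 scores is 250 scores per tail.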
1001 (4) Minor changes include fixes to some of the alignment display routines,
1002 individual copies of the pstruct structure for each thread, and some
1003 changes to ensure that every last residue in a library is available
1004 for matching (sometimes the last residue could be ignored). This
1005 version has undergone extensive testing with high-throughput sequences
1006 to confirm that long sequences are read properly. Problems with
1007 fastf3/fasts3 alignment display have also been addressed.
1009 >>August 26, 1999 (no version change - not released)
1011 Corrected problem in "apam.c" that prevented scoring matrices from
1012 being imported for [t]fasts3/[t]fastf3.
1017 Corrected problem with opt_cut initialization that only appeared
1018 with pvcomp* programs.
1020 Improved calculation of FASTA optcut threshold for DNA sequence
1021 comparison for match scores much less than +5 (e.g. +3). The previous
1022 optcut threshold was too high when the match penalty was < 4 and
1023 ktup=6; it is now scaled more appropriately.
1025 Optcut thresholds have also been raised slightly for
1026 fastx/y3/tfastx/y3. This should improve performance with minimal
1027 effects on sensitivity.
1030 (no version change - date change)
1032 Corrected various uninitialized variables and buffer overruns
1035 >>July 26, 1999 - new distribution
1036 (no version change - v32t06, previous version not released)
1038 Changed the location of "(reverse complement)" label in tfasta/x/y/s/f
1041 Statistical calculations for tfasta/x/y in unthreaded version
1042 corrected. Statistical estimates for threaded and unthreaded versions
1043 of the tfasta/x/y/s/f programs should be much more consistent.
1045 Substantial modifications in alignment coordinate calculation/
1046 presentation. Minor error in fastx/y/tfastx/y end of alignment
1047 corrected. Major problems with tfasta alignment coordinates
1048 corrected. tfasta and tfastx/y coordinates should now be consistent.
1050 Corrected problem with -N 5000 in tfasta/x/y3(_t) searches encountered
1051 with long query sequences.
1053 Updated pthr_subs.c/Makefile.linux to increase the pthreads stacksize
1054 to try to avoid "cannot allocate diagonal arrays" error message.
1055 Pthreads stacksize can be changed with RedHat 6.0, but not RedHat 5.2,
1056 so Makefile.linux uses -DLINUX5 for RedHat5.* (no pthreads stack size).
1057 I am still getting this message, so it has not been completely
1058 successful. Makefile.linux now uses -DALLOCN0 to avoid this problem,
1059 at some cost in speed.
1061 The pvcomp* programs have been updated to work properly with
1062 forward/reverse DNA searches. See readme.pvm_3.2.
1064 >>July 7, 1999 - not released
1067 Corrected bug in complib.c (fasta3, fastx3, etc) that caused core
1068 dumps with "-o" option.
1070 Corrected a subtle bug in fastx/y/tfastx/y alignment display.
1072 >>June 30, 1999 - new distribution
1075 Corrected doinit.c to allow DNA substitution matrices with -s matrix
1078 Changed ".gbl" files to ".h" files.
1080 >>June 2 - 9, 1999 - new distribution
1083 Added additional DNA lambda/K/H to alt_parms.h. Corrected some
1084 other problems with those tables for the case where (inf,inf)
1085 gap penalties were not included.
1087 Fixed complib.c/comp_thr.c error message to properly report filename
1088 when library file is not found.
1090 Included approximate Lambda/K/H for BL80 in alt_parms.h.
1091 BL80 scoring matrix changed from 1/3 bit to 1/2 bit units.
1093 Included some additional perl files for searchfa.cgi, searchnn.cgi
1094 in the distribution (my-cgi.pl, cgi-lib.pl).
1096 >>May 30, 1999, June 2, 1999 - new distribution
1097 (no version number change)
1099 Added Makefile.NetBSD, if !defined(__NetBSD__) for values.h. Changed
1100 zs_to_E() and z_to_E() in scaleswn.c to correctly calculate E() value
1101 when only one sequence is compared and -z 3 is used.
1104 (no version number change)
1106 Corrected bug in alignment numbering on the % identity line
1107 27.4% identity in 234 aa (101-234:110-243)
1108 for reverse complements with offset coordinates (test.aa:101-250)
1111 (no version number change)
1113 Correction to Makefile.linux (tgetaa.o : failed to -DTFAST).
1116 (no version number change)
1118 Minor changes to pvm_showalign.c to allow #define FIRSTNODE 1.
1119 Changes to showsum.c to change off-end reporting. (Neither of these
1120 changes is likely to affect anyone outside my research group.)
1125 Fixed a serious bug in the fastx3/tfastx3 alignment display which
1126 caused t/fastx3 to produce incorrect alignments (and incorrectly low
1127 percent identities). The scores were correct, but the alignment
1128 percent identities were too low and the alignments were wrong.
1130 Numbering errors were also corrected in fastx3/tfastx3 and
1131 fasty3/tfasty3 and when partial query sequences were used.
1135 Fixed a subtle bug in dropgsw.c that caused do_work() to calculate
1136 incorrect Smith-Waterman scores after do_walign() had been called.
1137 This affected only pvcompsw searches with the "-m 9" option.
1141 Modified showalign.c to provide improved alignment information that
1142 includes explicitly the boundaries of the alignment. Default
1145 Smith-Waterman score: 175; 24.645% identity in 211 aa overlap (5:207-7:207)
1149 Modified nxgetaa.c, showsum.c, showbest.c, manshowun.c to allow a
1150 "not" superfamily annotation for the query sequence only. The
1151 goal is to be able to specify that certain superfamily numbers be
1152 ignored in some of the search summaries. Thus, a description line
1155 >GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675
1157 says that GT8.7 belongs to superfamily 40001, but any library
1158 sequences with superfamily number 90043 should be ignored in any
1159 listing or summary of best scores.
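
A description line of the form shown above can be split into its parts as in this sketch. It is not the parser in nxgetaa.c; the function name is hypothetical, and the sketch assumes exactly the "name | superfamily [! excluded] | description" layout of the example.

```python
def parse_superfamily_header(line):
    """Split '>name | sf [! not_sf] | description' into its four parts.

    Returns (name, superfamily, excluded_superfamily_or_None, description).
    """
    name, sf_field, description = (
        part.strip() for part in line.lstrip(">").split("|", 2)
    )
    member_text, _, excluded_text = sf_field.partition("!")
    member = int(member_text)
    excluded = int(excluded_text) if excluded_text.strip() else None
    return name, member, excluded, description
```

Applied to the GT8.7 example, this gives superfamily 40001 with 90043 as the superfamily to ignore; a header without "!" gives `None` for the excluded superfamily.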
1161 In addition, it is now possible to make a fasta3r/prcompfa, which is
1162 the converse of fasta3u/pucompfa. fasta3u reports the highest scoring
1163 unrelated sequences in a search using the superfamily annotation.
1164 fasta3r shows only the scores of related sequences. This might be
1165 used in combination with the -F e_val option to show the scores
1166 obtained by the most distantly related members of a family.
1170 -->v32t04 (not distributed)
1172 Modified nxgetaa.c to remove the dependence of tgetaa.o on TFASTA
1173 (necessary for a more rational Makefile structure). No code changes.
1177 Fixed a bug in showalign.c that displayed incorrect alignment coordinates.
1178 (no version number change).
1184 Fixed a serious bug, introduced in version fasta32, that affected DNA
1185 alignments when the sequence has been broken into multiple segments.
1186 In addition, several minor problems with -z 3 statistics on
1187 DNA sequences were fixed.
1189 Added -m 9 option, which unfortunately does different things in
1190 pvcompfa/sw and fasta3/ssearch3. In both programs, -m 9 provides the
1191 id's of the two sequences, length, E(), %_ident, and start and end of
1192 the alignment in both sequences. pvcompfa/sw provides this
1193 information with the list of high scoring sequences. fasta3/ssearch3
1194 provides the information in lieu of an alignment.
1200 Added information on the algorithm/parameter description line to
1201 report the range of the pam matrices. Useful for matrices like
1202 MD_10, _20, and _40 which require much higher gap penalties.
1204 >>March 13, 1999 (not distributed)
1208 -r results.file has been changed to -R results.file to accommodate
1209 DNA match/mismatch penalties of the form: -r "+1/-3".
1213 Modify functions in scalesw*.c to prevent underflow after exp() on
1214 Alpha Linux machines. The Alpha/LINUX gcc compiler is buggy and
1215 doesn't behave properly with "denormalized" numbers, so "gcc -g -m
1216 ieee" is recommended.
1218 Add "Display alignments also (y/n)[n] "
1220 pvcomplib.c again provides alignments!! In addition, there is a
1221 new "-m 9" option, which reports alignments as:
1223 >>>/home/wrp/slib/hlibs/hum0.aa#5>HS5 gi:1280326 T-cell receptor beta chain 30 aa, 30 aa vs /home/wrp/slib/hlibs/hum0.seg library
1224 HS5 30 HS5 30 1.873e-11 1.000 30 1 30 1 30
1225 HS5 30 HS2249 40 1.061e-07 0.774 31 1 30 7 37
1226 HS5 30 HS2221 38 1.207e-07 0.833 30 1 30 7 35
1227 HS5 30 HS2283 40 1.455e-07 0.774 31 1 30 7 37
1228 HS5 30 HS2239 38 1.939e-07 0.800 30 1 30 7 35
1230 where the columns are:
1232 query-name q-len lib-name lib-len E() %id align-len q-start q-end l-start l-end
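
The eleven columns listed above are whitespace-separated, so a "-m 9" line from this version of pvcomplib can be read with a short parser such as the sketch below. The class and function names are hypothetical, and later releases may use a different layout. In the sample output the "%id" column holds a fraction (for example 0.774), and the sketch keeps it as such.

```python
from typing import NamedTuple


class M9Hit(NamedTuple):
    query_name: str
    query_len: int
    lib_name: str
    lib_len: int
    evalue: float
    identity: float  # fraction identical, as printed in the sample output
    align_len: int
    q_start: int
    q_end: int
    l_start: int
    l_end: int


def parse_m9_line(line):
    """Parse one whitespace-separated "-m 9" result line into an M9Hit."""
    f = line.split()
    if len(f) != 11:
        raise ValueError("expected 11 columns, got %d" % len(f))
    return M9Hit(f[0], int(f[1]), f[2], int(f[3]),
                 float(f[4]), float(f[5]), *(int(x) for x in f[6:]))
```

Parsing the second sample line gives a hit against HS2249 with E() 1.061e-07 covering library residues 7 to 37.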
1236 Corrected bug in showalign.c that offset reverse complement alignments
1241 Changed the formatting slightly in showbest.c to have columns line up better.
1245 Corrected some bugs introduced into fastf3(_t) in the previous version.
1249 Corrected various problems in dropfz.c affecting alignment scores
1252 Introduced a new program, fasts3(_t), for searching with peptide
1259 Added code to correct problems with coordinate numbering in long library
1260 sequences with tfastx/tfasty. With this release, sequences should be
1261 numbered properly, and sequence numbers count down with reverse
1262 complement library sequences.
1264 In addition, with this release, fastx/y and tfastx/y translated
1265 protein alignments are numbered as nucleotides (increasing by 3,
1266 labels every 30 nucleotides) rather than codons.