$Name: fa_34_26_5 $ - $Id: readme.v33t0,v 1.45 2001/07/10 18:03:42 wrp Exp $

================
readme.v33t0
================

This release includes an MPI implementation of the parallel library-vs-library comparison code. See readme.mpi_3.3 and readme.pvm_3.3 for more information.

=====
>>July 9, 2001

Considerable changes to support no-global library functions.

(1) Separate ascii/sequence mapping arrays are used by the query-reading (qascii), library-reading (lascii), and sequence comparison function (pascii) routines. As a result, there is no longer a need for tgetlib.o/lgetlib.o - lgetlib.o can serve both functions.

(2) This also allows us to remove all #ifdef TFAST/FASTX conditionals from complib.c/comp_thr.c/p2_complib.c. We no longer need tcomp_thr.o, comp_thrx.o, etc. We still have a variety of p2_complib.o variations to support the different c34.work* files.

(3) Because non-global openlib/getlib functions are available, exactly the same open/get functions are available for reading both the query and reference libraries in pv34comp* programs. The host-specific openlib/getlib functions in hxgetaa.c are now provided by nmgetlib.c, etc. This has two effects:

    (a) it is now possible to compare a query database generated by an SQL query to a library database generated by a different SQL query.

    (b) pv34comp* has lost (at least in this version) the ability to automatically detect the query sequence type. To search with a DNA query, you MUST use "-n".

(4) The resetp() function is now responsible for almost all of the function-specific (TFAST/FASTX/etc) initializations. All of the function-specific code has been removed from complib.c/comp_thr.c and most of it has been moved to initfa.c/resetp().

(5) manageacc.c has been merged into compacc.c (mostly prhist()).

(6) Although it may reflect a subtle bug in my code, it is not possible to reliably run threaded/memory mapped versions of the fasta34_t code.
I have spent considerable time tracking down the problem, and have determined that, in threaded code, something happens during the thread initialization to corrupt the description offset information used when files are memory mapped. This never occurs when the unthreaded versions of the code are used. And it does not occur under MacOSX, Compaq Tru64Unix, Sun Solaris/Sparc, or SGI IRIX. Thus, I cannot recommend using the threaded code versions (_t) under Linux (RH6.2 or 7.1).

=====
>>June 1, 2001

Many changes to accommodate a new - no global variable - strategy for reading sequence databases. Every time a file is opened, a struct lmf_str is allocated which can be used for memory mapped files, ncbl2 files, and mysql files. In addition, an open'ed file has a default sequence type: DNA or protein, or one can open a file in a mode that will allow the sequence type to be changed.

=====
>>May 18, 2001 CVS: fa33t09d0

A new compile time parameter, -DGAP_OPEN, is available to change the definition of the "-f gap-open" parameter from the penalty for the first residue in a gap to a true gap-open penalty, as is used in BLAST and many other comparison algorithms. This will probably become the default for fasta in version 3.4.

Fixes to conflicts between "-S" and "-s matrix". When a scoring matrix file was specified, lower-case alignments were not displayed with -S (although the scores were calculated properly).

More extensive testing of mysql_lib.c (mySQL query-libraries) with the pv4comp* and mp4comp* programs.

=====
>>April 5, 2001 CVS: fa33t08d4b3

Changes in nmgetlib.c and ncbl2_mlib.c to return long sequence descriptions for PCOMPLIB (pv4/mp3comp*). Also fix p2_complib.c to request DNA library for translated comparisons.

Fix for prss33(_t) to read both sequences from stdin.

=====
>>March 27, 2001 CVS: fa33t08d4 --> fa33t08d4

Problems in ncbl2_mlib.c found searching NCBI non-redundant nucleotide database "nt" were fixed.
Testing revealed a minor memory leak, which was fixed by modifying showbest.c, showalign.c, comp_thr.c, complib.c, and p2_complib.c to remember the last opened database file more effectively.

Modifications to allow 64-bit fseek/ftell on machines like Sun, Linux/Intel, that support -D_FILE_OFFSET_BITS=64, -D_LARGE_FILE_SOURCE off_t, and fseeko(), ftello() with the option -DUSE_FSEEKO. Machines with 64-bit long's do not need this option. Machines with 32-bit longs that allow files >2 Gb can do so with 64-bit file access functions, including fseeko() and ftello(), which work with off_t file offsets instead of long's.

=====
>>March 3, 2001 CVS: fa33t08d2

Corrected problems in nmgetaa.c and mysql_lib.c with parallel programs, and one serious problem with alternate DNA scoring matrices (initfa.c, initsw.c) not being set properly. A subtle problem with the merge of scaleswn.c and scaleswg.c is fixed.

>>February 17, 2001

Modified mysql_lib.c to use "#", rather than "%ld", to indicate the position of the GID. This change was made because sprintf() cannot be used reliably to generate an SQL string, as '"' and '%' are used in such strings.

=====
>>January 17, 2001 (no version change, date change)

Minor fixes to initfa.c, initsw.c to deal with DNA scoring matrices properly. "-n -s dna.mat" is required for the sequence/matrix to be recognized as DNA.

>>January 16, 2001 -->v34t00

Merge of the main CVS trunk - fa33t06 with the latest release branch, fa33t08.

In addition, PCOMPLIB mods have been made to mysql_lib.c. Because p2_complib.c gets sequence description information during the first read of the database, the mysql_query must be changed to return: result[0]=GID, result[1]=description, result[2]=sequence. In the PCOMPLIB case, the other SQL queries (for GID description, sequence) are not necessary but must still be provided.

=====
>>January 16, 2001 (no version change, previous version not released)

changes to p2_complib.c to correct openlib() incompatibility.
changes to nmgetaa.c, ncbl2_lib.c to incorporate PCOMPLIB. nxgetaa.c removed.

=====
>>January 12, 2001 (no version change, previous version not released)

Change to initfa.c to move ktup check from query_parm() to last_init().

=====
>>January 10, 2001 --> v33t08

Fixes to complib.c, comp_thr.c to deal properly with long query protein sequences when a short library chunk (e.g. -N 5000) was given. In the case where the chunk size is too short, it will be reset to a length which allows the search to proceed, by including an amount of new sequence that is equal to the amount of overlap sequence.

scaleswn.c and scaleswg.c have been merged.

v33t08 includes the initial implementation for mySQL described below for v33t07x.

======
>>Dec. 20, 2000 --> v33t07x

Initial implementation of a syntax for mySQL database queries. A new file, mysql_lib.c has been added, and changes have been made to nmgetaa.c (which should now replace nxgetaa.c) and altlib.h.

A mySQL database search needs a file with 4 parts:

(1) description of the database, user, password
(2) a select statement that generates the set of protein sequences as: UID, sequence
(3) a select statement that generates a UID, description given a UID
(4) a select statement that generates a single UID, sequence given a UID

Each of the four parts should be separated by ';'.
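The four-part layout described above can be illustrated with a minimal sketch. The function name split_sql_library, the key names, and the sample text are hypothetical and are not taken from mysql_lib.c; the sketch also assumes that no ';' appears inside any of the SQL statements.

```python
def split_sql_library(text):
    """Split a mySQL library description into its four ';'-separated parts."""
    parts = [p.strip() for p in text.split(";")]
    parts = [p for p in parts if p]  # drop the empty piece after the final ';'
    if len(parts) != 4:
        raise ValueError("expected 4 parts, found %d" % len(parts))
    # Key names are illustrative only; they mirror parts (1)-(4) above.
    keys = ("connection", "select_all", "select_description", "select_sequence")
    return dict(zip(keys, parts))


# Hypothetical file contents, simplified for illustration.
sample = """localhost taxonomy username secret;
SELECT gid, sequence FROM proteins;
SELECT gid, description FROM proteins WHERE gid=%ld;
SELECT gid, sequence FROM proteins WHERE gid=%ld;
"""
parsed = split_sql_library(sample)
print(parsed["connection"])
```

A file with a missing part raises ValueError in this sketch, which is one simple way to catch a forgotten ';' before any query is sent.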
For example, in the database that we are using for testing, a file "demo.sql" that contains:

================
localhost taxonomy username secret;
SELECT proteins.gid, proteins.sequence FROM proteins,swissprot WHERE proteins.gid=swissprot.gid AND swissprot.spid IS NOT NULL;
select proteins.gid, concat(swissprot.spid," ",proteins.description) from proteins,swissprot where proteins.gid=%ld AND swissprot.gid=proteins.gid;
select gid, sequence from proteins where gid=%ld;
================

will find all the proteins in the BLAST "nr" database that also have SwissProt ID's when given the command line:

    fasta33 -q query.aa "demo.sql 16"

At least for simple queries, there is surprisingly little overhead for the search. For more complex queries involving several tables, the overhead can be significant.

At the moment, libraries that need the functions in mysql_lib.c will use library type 16. We may also use file type 17 for SQL queries that return binary sequences.

This implementation of mysql_lib.c was written to require a minimal amount of change to the other programs. Only nmgetaa.c and altlib.h needed to be changed to incorporate this new capability. One result of this limitation is that one cannot mix mySQL database queries with other databases in the same search. Eventually, I would like to make a mySQL database like any other, so that several mysql database queries could be searched in the same run, and mysql databases could be mixed with other (flat file) databases, but this will require some changes in the function calls throughout the code. (Right now, the various programs do not distinguish between an openlib() that is made before searching a large database, and one before retrieving a single sequence. This must be changed for a database query like mySQL to behave like other databases.)

Several mySQL demo files have been provided: mysql_demo*.sql.

(10 January 2001) The mySQL code has been tested on Intel Linux and Compaq/Alpha/Tru64 Unix.

>>Dec. 9, 2000

Changes to apam.c to tie different default gap penalties to alternate scoring matrices. In addition, changes to apam.c to deal with user-specified matrices with or without '*'.

>>Nov. 5, 2000 (date updated)

pst.dnaseq can now have 3 values: -1 or 0 -> protein, 1 -> DNA, and 2 -> other. This becomes important for things like init_karlin_a, which needs a background frequency of residues.

>>Nov. 1, 2000

Significant bug fixes for the -z 6/-z 16 option. An uninitialized variable was fixed in karlin.c, and comp_thr.c did not pass the correct composition argument type in find_zp(). The -z 6/16 option has now been tested and works correctly on Alphas, Linux x86, SGI, Sun and Mac OSX.

Another problem was fixed in scaleswn.c (simplex()) that prevented the code from being reused by the pv4/mp4 complib programs.

>>Oct. 9, 2000

Several changes made to accommodate Mac OSX.

Longer lists of superfamily numbers now supported in p[su]4comp/m[su]4comp programs.

>>Sept 25, 2000

All global variables have been removed from scaleswn.c. The last to go, db_struct db, required many edits, because until now, the fasta programs have kept two versions of the db_struct data (entries, length). One version was kept by the main program, which updated entry number and db length as sequences were read; a second copy of this information was kept by the statistical estimation routines. Now there is only one copy, which means that the E() values will be a function of the complete database, not the database with some high scoring sequences removed.

>>Sept 23, 2000

Continued removal of global variables from scaleswn.c. Only one global is left, db_struct db, which contains the number of entries in the database and the number of residues. It will be the next to go (changing all the zs_to_*() functions) and scaleswn.c will be free of globals.

scaleswg.c is gone - scaleswn.c compiles to scaleswg.c with -DNORMAL_DIST.

>>Sept 20, 2000

Removal of histogram globals required changes in p2_complib.c as well. p_complib.c has not been updated. scaleswg.c has been modified to reflect the new histogram strategy.

>>Sept 19, 2000

Substantial changes to remove globals for printing histogram. m_msg now contains a hist_str, which keeps histogram information.

>>Sept. 19, 2000 (no version change, previous version not released)

Correct bug introduced into scaleswn.c (inithist()) by changing score2_sums[], score_sums[] from int to double.

Reporting of version numbers is more consistent between fasta33, fasta33_t, and pv4compfa/mp4compfa. The programs now report the same numbers/dates in similar places.

>>Sept. 15, 2000 --> v33t07

Changes to fix problems with statistical estimates when a large fraction (but not all) of the database is related.

Several users reported problems when searching with rRNA genes with version 33t06. In some cases, a 100% identical match over 1500 nt would not be statistically significant against a search of the bacterial division of Genbank. This problem was not seen with some releases of v33t05.

The cause of the problem was a change between v33t05 and v33t06 to allow scoring matrices with unusual scaling to be used. In v33t05, there was a line that excluded all scores > 300 from the statistical estimation procedure. While 300 is a high score with any "normal" scoring matrix, some investigators were using matrices scaled 10X, so that a score of 300 was really a score of 30 with a conventional matrix, and should not be excluded.

Unfortunately, removing the test to exclude scores > 300 meant that when a rRNA sequence was used to search the bacterial division, tens of thousands of high scoring related sequences were treated as if they were unrelated, with the result that the variance estimates were much too high, and thus high real scores had low z-scores, and thus were not statistically significant.
(There appear to be more than 20,000 rRNA sequences in the bacterial division of Genbank, almost 25% of all sequences).

The solution to the problem is a substantial enhancement in the strategies used to exclude high-scoring, related sequences, the -z 1, 4, and 5 parameter estimation strategies. The programs now estimate the expected high scoring sequence by calculating an ungapped Lambda and K, and then use a relatively conservative threshold for excluding scores that are higher than would be expected 0.01 times by chance. By calculating Lambda and K, we can scale the cutoff thresholds to allow scoring matrices with unusual scales.

For "normal" searches, there should be little change, but there should be an improvement for searches with large numbers of related sequences in the database.

As a result of testing for this change, a bug in the karlin() function used with -z 6 was found and corrected.

=======
>>Sept. 9, 2000

Changes to manshowbest.c to include correct display coordinates. Significant changes to structs.h, param.h, p2_complib.c, p2_workcomp.c, to store and use a reliable a_struct for alignment coordinates. Other cosmetic changes.

>>Sept. 7, 2000

Minor changes to complib.c, showrss.c, so that prss33 -q uses 200 shuffles and prss33 provides bit scores, rather than z-scores. (no version number change).

Modifications to p2_complib.c to include superfamily numbers for ps4comp* ms4comp*.

>>Aug 22, 2000

Changes to mmgetaa.c, ncbl2_mlib.c, dropfs.c to accommodate AIX.

00README.1st updated to reflect the current version and correct outdated information on threads.

>>Aug. 3, 2000

Modifications to initpam2() in initsw.c to correct a problem with pam_x when the -S option is used.

Modifications to compacc.c, scaleswn.c to ensure that residue numbers are calculated properly when more than 2 Gb of sequence is searched.

>>July 12, 2000

Modifications to dropnfa.c so that DNA matches to 'N' will be included in the "ungapped %identity".
Thus, a sequence that is 100% identical for 100 nt on either side of a 100 nt region that has been masked to 'NNNNN' will be reported as: "67% identical (100% ungapped)". This has been added to deal with masked BAC-end databases. It would be better if masking changed the letters to lowercase, but the mouse BAC-end sequences at TIGR use 'NNNNN'. This is currently available only for the fasta function, not [t]fast[x/y], etc, and only for DNA sequences.

mk_n_pam() in apam.c modified to ensure that mismatch scores of -1 remain -1.

>>June 25, 2000

Modification to nxgetaa.c, nmgetaa.c, mmgetaa.c to return Genbank Accession number as part of the descriptive string.

>>June 11, 2000 (no version change - not yet released)

Modifications to calcons(), calc_id(), showbest(), p_workcomp.c to provide ngap_q (number of alignment gaps in query), ngap_l (number of gaps in library) information for -m 9 output.

>>June 6, 2000 (no version change - not yet released)

Modified scaleswn.c to provide better support for unconventional scoring matrices, in particular, scoring matrices where every value is 50-times higher. Previous versions of the MLE estimator (-z 2) started with lambda = 0.2, which is too high for a scoring matrix going from -500:+1500. The initial estimate for lambda is now calculated using the formula: lambda = pi/sqrt(6*variance).

For the default -z 1, a restriction to limit scores to a maximum of 300 for the statistical analysis was removed.

>>June 3, 2000

Modified alignment output, and -m 9 and -m 10, to report an "ungapped" identity as well as the traditional "gapped" identity. The traditional "gapped" identity reports the number of identities divided by the overall length of the alignment, including gaps. The "ungapped" identity does not include gaps in the length of the alignment.
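The two identity definitions above can be sketched as follows. The function name and the toy alignment are hypothetical, '-' is assumed to mark a gap, and the sketch is not the calc_id() code; it only applies the two definitions as stated.

```python
def identities(query, library):
    """Return (gapped, ungapped) percent identity for two aligned strings.

    Both strings must have the same length; '-' marks a gap.
    """
    if len(query) != len(library):
        raise ValueError("aligned strings must have equal length")
    columns = list(zip(query, library))
    matches = sum(1 for q, s in columns if q == s and q != "-")
    gap_columns = sum(1 for q, s in columns if q == "-" or s == "-")
    if gap_columns == len(columns):
        raise ValueError("alignment contains only gap columns")
    gapped = 100.0 * matches / len(columns)                    # gaps counted in length
    ungapped = 100.0 * matches / (len(columns) - gap_columns)  # gaps excluded
    return gapped, ungapped


# Hypothetical alignment: 8 identical columns plus a 4-column gap in the query.
gapped, ungapped = identities("ACGT----ACGT", "ACGTTTTTACGT")
print("%.1f%% identical (%.1f%% ungapped)" % (gapped, ungapped))
```

In this toy case one third of the alignment columns are gaps, so the gapped identity is two thirds while the ungapped identity stays at 100%.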
This new value is included for alignments that include introns; thus, a tfastx33 search might find the 100% identical genomic sequence but report the gapped percent identity if a short intron were included in the alignment (the alignment probably would not span a long exon) as 66%. The "ungapped" identity would remain 100%. The ungapped identity value is also shown in the "-m 9" output line after the "gapped" fraction identical.

>>June 1, 2000

Modified -m 9 output to provide fraction identical, alignment boundary information with the initial list of high scoring sequences, just as the pv3comp and mp_comp versions do. The -m 9 option now shows the same alignment display as -m 0, but the width of the alignment is increased by 40. Thus, by default, -m 9 will show the list of best hits, with percent identity, Smith-Waterman score, and alignment boundaries initially, and then show standard (-m 0) alignments with 100 residues/line.

>>May 29, 2000

Correct some problems with reading data files with 's under unix.

nmgetaa.c/nxgetaa.c/mmgetaa.c have been modified to convert ('\t') to (' ') in descriptive lines.

=======
>>May 3, 2000

Corrected problem with very low mean_var in fit_llen() in scaleswn.c.

>>May 2, 2000 (no version number change - previous version not released)

Merged fasta33t05d2 with fasta33t06. Also removed restriction on "-M size-range" to proteins - the size range now can be applied to DNA as well.

>>May 1, 2000 (changes to v33t05d merged into v33t06)

Introduced changes to include '*' as a valid sequence character, which indicates termination. Thus, 'TGA', 'TAG', and 'TAA' are now translated to '*' rather than 'X', and the protein PAM matrices have been modified to provide a match score of approximately 1/2 the max identity score for a '*:*' match. Otherwise, '*' is the same as 'X'. This change only affects query sequences that include a '*' to indicate an end of sequence; the '*' is not there by default.
The inclusion of '*' broke some things in tfasts33, tfastf33, fasty33, and tfasty33, which were fixed today.

>>March 28, 2000/April 24, 2000 --> v33t06

(a) -z 6 statistics that factor in composition
(b) -smatrix-offset pam-offset parameter

(a) This release provides a new statistics option, -z 6, which provides a more sophisticated model that accounts for sequence composition. When -z 6 is used (only for fasta33(_t) and ssearch33(_t)), the program calculates a composition parameter comp=1/lambda using a modified version of the Karlin-Altschul karlin() function. As a result, every sequence in the database has an associated length (n1) and composition (comp).

The length n1 and composition comp are used in the maximum likelihood estimation described by Mott (1992) Bull. Math. Biol. 54:59-75. Four parameters are estimated, a0, a1, a2, and b1, and the probability of obtaining a score is then:

    p(s >= x) = 1-exp(-exp(-( a0 + a1*comp + a2*comp*log(n0*n1) + x)/(b1*comp)))

The maximum likelihood estimates of a0, a1, a2, and b1 are calculated using the Nelder-Mead simplex search strategy. The average Lambda is reported for the search using Lambda = 1/(b1*ave_comp), where ave_comp is the geometric mean of the comp values calculated during the statistical estimates.

The "lambda/comp" calculation can fail for sequences with very biased amino acid composition. When this occurs, 'comp' is set to -1.0 (as is 'H', the information content parameter) and the 'ave_comp' value is used to calculate statistical significance. (But obviously 'ave_comp' is not really appropriate, since if the sequence had an average 'comp' value, it would have been calculated.)

When -z 6 is used, the alignment display shows the 'comp' and 'H' values for that library sequence.

(b) Scoring matrix offsets - The main reason that the "lambda/comp" calculation fails is that, for the particular query/library sequence pair, the expected score is not < 0, instead, Sum {p_ij S_ij} >= 0.0.
This problem is reported to 'stderr' when it occurs. The simplest solution to the problem is to provide an offset to the scoring matrix; for example, to use Blosum62 - 1, which ranges from +10 to -5, rather than the standard +11 to -4. This option used to be available with the -S offset option, but -S is now used to specify a lower-case seg-ed database. The offset can now be specified as part of the scoring matrix name. Thus, "-s BL62-1" uses Blosum62 reduced by 1 at each entry. The '-' character is used to indicate an offset, so scoring matrix files must not have a '-' in their name. Alternatively, "-s BL80+1" or "-s BL80--1" would add one to each value.

nxgetaa.c, nmgetaa.c, and mmgetaa.c have been edited to avoid string run-off problems after strncpy().

Fixed problem where positive gap extension penalties in ssearch33 were not converted to negative values.

>>April 8, 2000

Fixed problem in calculating corrected sequence lengths for Altschul-Gish probabilities.

>>March 30, 2000 (no version change, date updated to March 30, 2000)

Corrected problem with -m 9 option.

The '*' character is now available to allow translated alignments to extend through the termination codon. Thus, if a protein sequence ends with a '*', and matches in to a translated termination codon, the score will be increased. The *:* match score is set to 1/2 the max positive score for the matrix (see upam.h). This strategy can also be used to upweight a match that extends all the way to the end of a full-length sequence by putting '*' at the end of both the query and library protein sequences. Recognition of '*' will probably become a command line option.

>>March 21, 2000 (no version change, previous version not distributed)

Changes to map_db.c, list_db.c, and mmgetaa.c to accommodate large sequence files. Long (64-bit on some systems) variables are now used to specify file and memory position for the memory mapped functions.
As a result, there are now two *.xin (memory mapped index) file formats: MP0, which uses 32-bit longs, and MP1, which uses 64-bit longs. On 64-bit machines, MP0 32-bit indices are read properly, but limit the database size to 2 or 4 Gb; MP1 64-bit indices allow very large databases. Blast2.0 formatdb databases are still limited to 4Gb. To compile map_db.c to generate 64-bit index files, include the compile time option -DBIG_LIB64 in the Makefile. (Currently this option has been tested only on the DEC Alpha and SGI platforms, and will work only with Unix versions that provide 64-bit longs and 64-bit ftell()'s.)

The -R results file now uses sfn_cmp() to report a matching superfamily number, if one exists, and '0' otherwise.

>>March 12, 2000 (no version change, previous version not distributed)

Provide new strategy for specifying library abbreviations. In addition to:

    fasta33 query.aa %anr

one can also specify:

    fasta33 query.aa %pir1+sp+nr
or
    fasta33 query.aa +pir1+sp+nr
or
    fasta33 query.aa %+pir1+sp+nr

where the + anywhere in the library name string indicates that variable length library names, separated by '+', are being used (the last '+' is optional). The FASTLIBS file then becomes:

================
PIR1 Annotated Protein Database (rel 56)$0+pir1+/slib2/blast/pir1.lseg
NBRF Protein database (complete)$0+nbrf+@/seqlib/lib/NBRF.nam
NRL_3d structure database$0D/seqlib/lib/nrl_3d.seq 5
NCBI/Blast non-redundant proteins$0+nr+/slib2/blast/nr.lseg
NCBI/Blast Swissprot$0+sp+/slib2/blast/swissprot.lseg
================

The two abbreviation types, single letter and +word+, cannot be intermixed, and at least initially, +word+ specifiers are case-sensitive (single letter abbreviations are not) and will not be available interactively, only on the command line.

Removed 'K' estimate for Expectation_n, Expectation_i fits to the distribution of unrelated similarity scores. 'K' cannot be calculated from the data available.
'Lambda' can be calculated, it is 1.28255/sqrt(mean_var), and is still available.

>>March 3, 2000 (no version change)

changed Makefile33.common, Makefile.common, to incorporate $(NRAND) rather than "rand48". Provide nrandom.c which uses random(), as replacement for nrand.c, which uses rand48().

>>February 8, 2000 --> v33t05

Fixes to scaleswn.c (proc_hist_ml) to set num_db_entries properly. Scaleswn.c also provides Lambda estimates for -z 1/11 (Expectation_n), and -z 1/14 (Expectation_i) statistical estimates.

Modifications to calc_id() to correct bug in counting identities. Modified showalign() to use calc_id() with -m 9, for simpler debugging.

Additional modifications to dropfa*.c files to deal properly with 'n's and 'x's.

Added new option: -x #, which allows one to override the penalty for a match against 'x' (or 'N') provided by the scoring matrix. This option is particularly useful in fast[x/y] searches, where out of frame low complexity regions can generate high scores. The old function of '-x' - to specify an alternate coordinate system, is now available as '-X # #'.

Updated scaleswn.c to provide window shuffle information for -z 12.

Updated compacc.c, workacc.c, to fix serious bug in wshuffle() that destroyed aa1[n1]=0.

>>January 25, 2000 --> v33t04

A serious bug in all of the fasta related programs has been corrected. The new code in fasta33 which ignores certain residues failed to initialize one of the arrays properly. As a result, in pathological situations, a very strong match could be missed.

Corrected minor bug in initsw.c that caused a misplaced "ktup" command line argument, which should be ignored by ssearch, to be read as -d ktup.

Improved error message for 0 length query sequence.

>>January 17, 2000 --> no external version number change

Modified mmgetaa.c, map_db.c, and nmgetaa.c to provide memory mapping of genbank flatfile (format=1) files. This format could be read much more efficiently, however.
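
The constant 1.28255 in the Lambda relation quoted in the March 12, 2000 entry above is pi/sqrt(6), so that relation is the same as the lambda = pi/sqrt(6*variance) form given in the June 6, 2000 entry. The short sketch below checks the arithmetic; the function name and the sample variance are hypothetical, and this is not the scaleswn.c code.

```python
import math


def lambda_from_variance(mean_var):
    """Estimate Lambda from the variance of unrelated similarity scores."""
    if mean_var <= 0:
        raise ValueError("variance must be positive")
    return math.pi / math.sqrt(6.0 * mean_var)


constant = math.pi / math.sqrt(6.0)
print("pi/sqrt(6) = %.5f" % constant)

# Hypothetical variance, chosen only for illustration.
example_var = 100.0
print("Lambda = %.5f" % lambda_from_variance(example_var))
```

Because Lambda scales as 1/sqrt(variance), a scoring matrix with every value multiplied by 10 (variance multiplied by 100) gives a Lambda that is 10 times smaller.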

>>January 12, 2000 --> no external version number change

Changed the behavior of the options that set the number of high scores (-b) and alignments (-d) that are displayed. Previously,

    fasta33 -E 10.0 -d 10

would show 50 best scores, rather than all the scores with E() < 10.0. To get the -E threshold to limit,

    -E 10.0 -b 10000 -d 10

was required. This is now fixed. Setting "-d 10" does not affect the number of best scores shown.

Minor change in mw.h to remove unused defines.

fasta3x.me (fasta3x.doc) updated.

>>January 6, 2000 --> v33t03

Corrected bug in memory mapped reads of gcg_binary format files that potentially caused the last 63 residues to be read improperly.

Changes to comp_thr.c, pthr_subs.c, uthr_subs.c, ibm_pthr_subs.c to ensure that each thread has its own work_info structure. This solves some minor race conditions that sometimes caused some parameters not to be reported properly.

Changes to most of the drop*.c files to correct some minor problems with sequence alphabets.

Code in mmgetaa.c (memory mapped code for FASTA, GCG compressed files) reordered to prevent files from being memory mapped if appropriate index files are not available.

See readme.pvm_3.3 for updates to the pvm programs.

>>December 10, 1999 (no version change - modifications largely affect ps3comp*)

Modifications to showsum.c to deal with 2 scores/sequence. Modifications to mmgetaa.c for superfamily numbers.

>>December 7, 1999 (no version change, previous version not released)

Corrected problem in mmgetaa.c that caused searches on a memory mapped single long sequence (e.g. Chr22) to fail.

Corrected bug in map_db.c that caused it to crash on some architectures if a filename was not specified.

Corrected off-by-three error in fasty/tfasty. Corrected indexing error in dropfz2.c.

>>December 5, 1999 --> v33t02

corrected some bugs in initfa.c/initsw.c/doinit.c that caused abbreviated function names to be lost.
modify showbest.c, showalign.c to include information on position in library sequence (bbp->cont) to distinguish subsegment of very long sequences. Currently, the new label is available only with -m 6.

>>November 29, 1999

[t]fastz33 uses v33t02 of fasty function.

Replace dropfz.c with dropfz2.c. Dropfz2.c interprets any codon that includes the nucleotide 'N' as the amino acid 'X'. Previously, 'N' was treated as 'A', so 'NNN' ended up 'K'. This modification, together with the -S option and lower-case pseg'ed databases, should ensure that DNA queries with large numbers of 'N's do not match low complexity regions.

>>November 20, 1999 (no version change, previous version not released)

Modify initfa.c to display initn, init1 scores for [t]fast[fs]. Include "-B" option to show previous z-scores.

>>November 17, 1999 (no version change, previous version not released)

Modify dropfx.c to use saatran(), rather than aatran(). saatran translates any 'N' containing codon as 'X'. aatran() treats 'N' as an 'A'. Although more steps are required for translation, the program appears to run just as fast.

>>November 7, 1999 --> v33t01

Substantial changes to the output format in showbest.c (the list of high scoring sequences) and showalign.c (the alignments). The classic list of best scores:

The best scores are:                             initn init1 opt z-sc E(82014)
gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIO ( 218) 1497 1497 1497 1761.1 2.3e-91
gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE ( 218) 1413 1413 1413 1662.9 6.7e-86

has been replaced by:

The best scores are:                                     opt bits E(82138)
gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIONE S-TRAN ( 218) 1497 354 7.6e-98
gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE S-TRANSF ( 218) 1413 335 5.3e-92

This display provides more information and removes the outdated initn and init1 scores, which are no longer used. The "bit" score is comparable to the blast2 bit score.
It is calculated as: (lambda*S - ln K)/ln 2, where S is the raw similarity score, lambda and K are statistical parameters estimated from the distribution of unrelated sequence similarity scores. All of the similarity scores, including init1, initn, and z-scores are reported with the alignment data. Z-scores are displayed instead of bit scores in the list of high scores if the command line option "-B" is specified.

In addition, the alignment score line has changed from:

>>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER (220 aa)
 initn: 954 init1: 954 opt: 958 Z-score: 1130.9 expect() 1.1e-56
Smith-Waterman score: 958; 61.927% identity in 218 aa overlap (1-218:1-218)

to:

>>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER (220 aa)
 initn: 954 init1: 954 opt: 958 Z-score: 1130.9 bits: 216.4 E(): 2.8e-56
Smith-Waterman score: 958; 61.927% identity in 218 aa overlap (1-218:1-218)

Besides the addition of the "bits:" score, the "expect()" label has changed to "E()" to save some space.

>>November 4,12, 1999 (no version change)

Fixed serious bug in -z 2 lambda/K calculation in scaleswn.c

Fixed bugs in llgetaa.c (openlib()) and definition of superfamily numbers.

>>October 21, 1999 (no version change)

Begin using CVS for version control.

Correct faulty error message in dropfs.c. Corrected bad "goto loopl;" in dropfz.c. Corrected prss3.rsp for Makefile.tc (Win32 version).

>>October 18, 1999 --> v33t0

Corrected some serious bugs with the various fasta/x/y programs when the -DALLOCN0 was used to save memory.

Improvements to fasta3x.me/.doc documentation.

>>October 12, 1999 --> v33tx

For this initial release of version 33 of the FASTA programs, the Makefile's have been modified to make "fasta33(_t)", "fastx33(_t)", etc, so that you can test fasta33 while retaining fasta3 (from release v32t08). The FASTA33 programs are somewhat slower than previous releases, but I believe the ability to handle low complexity regions without 'X'ing them out outweighs the slowdown.
By (temporarily) changing the names of the programs slightly, it will be easier for you to judge the relative cost and benefit. To "make" the programs as "fasta3(_t)", etc, simply replace "Makefile33.common" with "Makefile.common" in the "Makefile" that you use.

>>September 30, 1999

ssearch3/fasta3/fastx3/fasty3 have been modified to search databases containing both upper and lower case letters, where lower case letters indicate low-complexity regions. With the modified programs, lower case letters are treated as 'X's in the initial scan, but are then treated normally in the final alignment. In addition, alignments can contain lower case letters.

Lower case letters are treated as low-complexity regions during the search phase of the program, but as "conventional" residues during the alignment phase, with the "-S" option. Currently, lower case letters are mapped to 'X's during the scan of the entire library. In the future, alternate weights will be available.

This is a substantial improvement for very large scale comparison, where one seeks both accurate statistical estimates and accurate %identities and alignments, and for translated DNA:protein comparisons, like "fastx3" and "fasty3", where out-of-frame translations tend to match low complexity regions (see Pearson et al. (1997) Genomics 46:24-36).

Protein databases (and query sequences) can be generated in the appropriate format using John Wootton's "pseg" program, available from ftp://ncbi.nlm.nih.gov/pub/seg/pseg. Once you have compiled the "pseg" program, use the command:

  pseg database.fasta -z 1 -q > database.lc_seg

Once you have database.lc_seg, run the command "map_db" to generate a ".xin" file that can be used to efficiently memory map the database.

You can then search database.lc_seg with or without the "-S" option. Without "-S", the database is treated as any other FASTA format file - all the residues are present.
With "-S", lower case residues will be treated as 'x's during the initial scan but as normal residues when final alignments are displayed. When the -S option is used, the matrix information line is changed from: "BL50 matrix (15:-5)" to "BL50 matrix (15:-5)xS". The "-S" option is no longer available to provide a scoring matrix offset.

Unfortunately, Blast2.0 format files cannot contain lower case letters. We have addressed this problem by providing efficient memory mapped access to FASTA, GCG/PIR, and GCG/compressed-binary files in the last release of fasta32t08. The memory mapped file I/O improvements are provided in fasta33 as well.

================ readme.v32 ================

FASTX/Y and FASTA (DNA) are now half as fast, because the programs now search both the forward and reverse strands by default.

The documentation in fasta3x.me/fasta3x.doc has been substantially revised.

>>October 20, 1999 (no version change)

Modify nxgetaa.c/nmgetaa.c to recognize 'N' as a possible DNA character.

>>October 9, 1999 --> v32t08 (no version number change)

Added "-M low-high" option, where low and high are inclusion limits for library sequences. If a library sequence is shorter than "low" or longer than "high", it will not be considered in the search. Thus, "-M 200-250" limits the database search to proteins between 200 and 250 residues in length. This should be particularly useful for fasts3 and fastf3. -M -500 searches library sequences < 500; -M 200 - searches sequences > 200. This limit applies only to protein sequences.

Modified scaleswn.c to fall back to maximum likelihood estimates of lambda, K rather than mean/variance estimates. (This allows MLE estimation to be used instead of proc_hist_n when a limited range of scores is examined.)

>>October 2, 1999 --> v32t08

Many changes:

 (1) memory mapped (mmap()ed) database reading - other database reading fixes
 (2) BLAST2 databases supported
 (3) true maximum likelihood estimates for Lambda, K
 (4) Misc. minor fixes

(1) (Sept.
26 - Oct. 2, 1999) Memory mapped database access. It is now possible to use mmap()ed access to FASTA format databases, if the "map_db" program has been used to produce an ".xin" file. If USE_MMAP is defined at compile time and a ".xin" file is present, the ".xin" will be used to access sequences directly after the file is mmap()ed. On my 4-processor Alpha, this can reduce elapsed time by 50%. It is not quite as efficient as BLAST2 format, but it is close. Currently, memory mapping is supported for type 0 (FASTA), 5 (PIR/GCG ascii), and 6 (GCG binary).

Memory mapping is used if a ".xin" file is present. ".xin" files are created by the new program "map_db". The syntax for "map_db" is:

  map_db [-n] "/dir/database.fa"

which creates the file /dir/database.fa.xin. Library types can be included in the filename; thus:

  map_db -n "/gcggenbank/gb_om.seq 6"

would be used for a type 6 GCG binary file.

The ".xin" file must be updated each time the database file changes. map_db writes the size of the database file into the ".xin" file, so that if the database file changes, making the ".xin" offset information invalid, the ".xin" file is not used. "list_db" is provided to print out the offset information in the ".xin" file.

(Oct 2, 1999) The memory mapping routines have been changed to allow several files to be memory mapped simultaneously. Indeed, once a database has been memory mapped, it will not be unmap()ed until the program finishes. This fixes a problem under Digital Unix, and should make re-access to mmap()ed files (as when displaying high scores and alignments) much more efficient. If no more memory is available for mmap()ing, the file will be read using conventional fread/fgets.

(Oct 2, 1999) The names of the database reading functions have been changed to allow both Blast1.4 and Blast2.0 databases to be read. In addition, Makefile.common now includes an option to link both ncbl_lib.o and ncbl2_lib.o, which provides support for both libraries.
However, Blast1.4 support has not been tested.

The Makefile structure has been improved. Each architecture specific Makefile (Makefile.alpha, Makefile.linux, etc) now includes Makefile.common. Thus, changes to the program structure should be correct for all platforms. "map_db" and "list_db" are not made with "make all".

The database reading functions in nxgetaa.c can now return a database length of 0, which indicates that no residues were read. Previously, 0-length sequences returned a length of 1, which were ignored. Complib.c and comp_thr.c have changed to accommodate this modification. This change was made to ensure that each residue, including the last, of each sequence is read.

Corrected bug in nxgetaa.c with FASTA format files with very long (>512 char) definition lines.

(2) (September 20, 1999) BLAST2 format databases supported

This release supports NCBI Blast2.0 format databases, using either conventional file reading or memory mapped files. The Blast2.0 format can be read very efficiently, so there is only a modest improvement in performance with memory mapping. The decision to use mmap()'ed files is made at compile time, by defining USE_MMAP. My thanks to Eamonn O'Toole of DEC/Compaq, and Daryl Madura of Sun Microsystems, for providing mmap()'ed modifications to fasta3. On my machines, Blast2.0 format reduces search time by about 30%. At the moment, ambiguous DNA sequences are not decoded properly.

(3) (September 30, 1999) A new statistical estimation option is available. -z 2 has been changed from ln()-scaling, which never should have been used, to scaling using Maximum Likelihood Estimates (MLEs) of Lambda and K. The MLE estimation routines were written by Aaron Mackey, based on a discussion of MLE estimates of Lambda and K written by Sean Eddy. The MLE estimation examines the middle 95% of scores, if there are fewer than 10000 sequences in the database; otherwise it excludes (censors) the top 250 scores and the bottom 250 scores.
This approach seems to effectively prevent related sequences from contaminating the estimation process. As with -z 1, -z 12 causes the program to generate a shuffled sequence score for each of the library sequences; in this case, no censoring is done. If the estimation process is reliable, Lambda and K should not vary much with different queries or query lengths. Lambda appears not to vary much with the comparison algorithm, although K does.

(4) Minor changes include fixes to some of the alignment display routines, individual copies of the pstruct structure for each thread, and some changes to ensure that every last residue in a library is available for matching (sometimes the last residue could be ignored). This version has undergone extensive testing with high-throughput sequences to confirm that long sequences are read properly. Problems with fastf3/fasts3 alignment display have also been addressed.

>>August 26, 1999 (no version change - not released)

Corrected problem in "apam.c" that prevented scoring matrices from being imported for [t]fasts3/[t]fastf3.

>>August 17, 1999 --> v32t07

Corrected problem with opt_cut initialization that only appeared with pvcomp* programs.

Improved calculation of FASTA optcut threshold for DNA sequence comparison for match scores much less than +5 (e.g. +3). The previous optcut threshold was too high when the match penalty was < 4 and ktup=6; it is now scaled more appropriately.

Optcut thresholds have also been raised slightly for fastx/y3/tfastx/y3. This should improve performance with minimal effects on sensitivity.

>>July 29, 1999 (no version change - date change)

Corrected various uninitialized variables and buffer overruns detected.

>>July 26, 1999 - new distribution (no version change - v32t06, previous version not released)

Changed the location of "(reverse complement)" label in tfasta/x/y/s/f programs.

Statistical calculations for tfasta/x/y in unthreaded version corrected.
Statistical estimates for threaded and unthreaded versions of the tfasta/x/y/s/f programs should be much more consistent.

Substantial modifications in alignment coordinate calculation/presentation. Minor error in fastx/y/tfastx/y end of alignment corrected. Major problems with tfasta alignment coordinates corrected. tfasta and tfastx/y coordinates should now be consistent.

Corrected problem with -N 5000 in tfasta/x/y3(_t) searches encountered with long query sequences.

Updated pthr_subs.c/Makefile.linux to increase the pthreads stacksize to try to avoid "cannot allocate diagonal arrays" error message. Pthreads stacksize can be changed with RedHat 6.0, but not RedHat 5.2, so Makefile.linux uses -DLINUX5 for RedHat5.* (no pthreads stack size). I am still getting this message, so it has not been completely successful. Makefile.linux now uses -DALLOCN0 to avoid this problem, at some cost in speed.

The pvcomp* programs have been updated to work properly with forward/reverse DNA searches. See readme.pvm_3.2.

>>July 7, 1999 - not released --> v32t06

Corrected bug in complib.c (fasta3, fastx3, etc) that caused core dumps with "-o" option.

Corrected a subtle bug in fastx/y/tfastx/y alignment display.

>>June 30, 1999 - new distribution (no version change)

Corrected doinit.c to allow DNA substitution matrices with -s matrix option. Changed ".gbl" files to ".h" files.

>>June 2 - 9, 1999 - new distribution (no version change)

Added additional DNA lambda/K/H to alt_parms.h. Corrected some other problems with those tables for the case where (inf,inf) gap penalties were not included.

Fixed complib.c/comp_thr.c error message to properly report filename when library file is not found.

Included approximate Lambda/K/H for BL80 in alt_parms.h. BL80 scoring matrix changed from 1/3 bit to 1/2 bit units.

Included some additional perl files for searchfa.cgi, searchnn.cgi in the distribution (my-cgi.pl, cgi-lib.pl).
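
For readers wondering how the Lambda and K values mentioned in these entries are used: the readme.v33t0 notes give the bit score as (lambda*S - ln K)/ln 2. The Python sketch below only restates that formula. The function name and the parameter values are made up for illustration; they are not taken from alt_parms.h or from any FASTA source file.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw similarity score S to bits: (lambda*S - ln K) / ln 2."""
    if k <= 0:
        raise ValueError("K must be positive")
    return (lam * raw_score - math.log(k)) / math.log(2)


# Hypothetical Lambda and K, chosen only to exercise the formula.
example_lambda = 0.16
example_k = 0.02

for s in (50, 100, 958):
    print(f"S={s:4d}  bits={bit_score(s, example_lambda, example_k):.1f}")
```

Because the result depends on Lambda and K, the same raw score gives different bit scores under different scoring matrices and gap penalties.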
>>May 30, 1999, June 2, 1999 - new distribution (no version number change)

Added Makefile.NetBSD, if !defined(__NetBSD__) for values.h.

Changed zs_to_E() and z_to_E() in scaleswn.c to correctly calculate E() value when only one sequence is compared and -z 3 is used.

>>May 27, 1999 (no version number change)

Corrected bug in alignment numbering on the % identity line

  27.4% identity in 234 aa (101-234:110-243)

for reverse complements with offset coordinates (test.aa:101-250)

>>May 23, 1999 (no version number change)

Correction to Makefile.linux (tgetaa.o : failed to -DTFAST).

>>May 19, 1999 (no version number change)

Minor changes to pvm_showalign.c to allow #define FIRSTNODE 1. Changes to showsum.c to change off-end reporting. (Neither of these changes is likely to affect anyone outside my research group.)

>>May 12, 1999 --> v32t05

Fixed a serious bug in the fastx3/tfastx3 alignment display which caused t/fastx3 to produce incorrect alignments (and incorrectly low percent identities). The scores were correct, but the alignment percent identities were too low and the alignments were wrong.

Numbering errors were also corrected in fastx3/tfastx3 and fasty3/tfasty3 and when partial query sequences were used.

>>May 7, 1999

Fixed a subtle bug in dropgsw.c that caused do_work() to calculate incorrect Smith-Waterman scores after do_walign() had been called. This affected only pvcompsw searches with the "-m 9" option.

>>May 5, 1999

Modified showalign.c to provide improved alignment information that includes explicitly the boundaries of the alignment. Default alignments now say:

  Smith-Waterman score: 175; 24.645% identity in 211 aa overlap (5:207-7:207)

>>May 3, 1999

Modified nxgetaa.c, showsum.c, showbest.c, manshowun.c to allow a "not" superfamily annotation for the query sequence only. The goal is to be able to specify that certain superfamily numbers be ignored in some of the search summaries. Thus, a description line of the form:

>GT8.7 | 40001 ! 90043 | transl.
of pa875.con, 19 to 675

says that GT8.7 belongs to superfamily 40001, but any library sequences with superfamily number 90043 should be ignored in any listing or summary of best scores.

In addition, it is now possible to make a fasta3r/prcompfa, which is the converse of fasta3u/pucompfa. fasta3u reports the highest scoring unrelated sequences in a search using the superfamily annotation. fasta3r shows only the scores of related sequences. This might be used in combination with the -F e_val option to show the scores obtained by the most distantly related members of a family.

>>April 25, 1999 --> v32t04 (not distributed)

Modified nxgetaa.c to remove the dependence of tgetaa.o on TFASTA (necessary for a more rational Makefile structure). No code changes.

>>April 19, 1999

Fixed a bug in showalign.c that displayed incorrect alignment coordinates. (no version number change).

>>April 17, 1999 --> v32t03

A serious bug in DNA alignments when the sequence has been broken into multiple segments that was introduced in version fasta32 has been fixed. In addition, several minor problems with -z 3 statistics on DNA sequences were fixed.

Added -m 9 option, which unfortunately does different things in pvcompfa/sw and fasta3/ssearch3. In both programs, -m 9 provides the id's of the two sequences, length, E(), %_ident, and start and end of the alignment in both sequences. pvcompfa/sw provides this information with the list of high scoring sequences. fasta3/ssearch3 provides the information in lieu of an alignment.

>>March 18, 1999 --> v32t02

Added information on the algorithm/parameter description line to report the range of the pam matrices. Useful for matrices like MD_10, _20, and _40 which require much higher gap penalties.

>>March 13, 1999 (not distributed) --> v32t01

-r results.file has been changed to -R results.file to accommodate DNA match/mismatch penalties of the form: -r "+1/-3".
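
The "not" superfamily annotation from the May 3, 1999 entry can be illustrated with a short Python sketch. It assumes the description line always has the three "|"-separated fields shown in the GT8.7 example, with at most one number on each side of the "!". The names Defline, parse_defline, visible_hits and the LIB_* hits are hypothetical and do not correspond to anything in nxgetaa.c or showbest.c.

```python
from typing import List, NamedTuple, Optional, Tuple


class Defline(NamedTuple):
    name: str
    superfamily: int
    ignored_superfamily: Optional[int]  # number after "!", if any
    description: str


def parse_defline(line: str) -> Defline:
    """Split '>name | sf ! not_sf | description' into its parts."""
    name, sf_field, description = (
        part.strip() for part in line.lstrip(">").split("|", 2)
    )
    member, _, ignored = sf_field.partition("!")
    return Defline(
        name,
        int(member),
        int(ignored) if ignored.strip() else None,
        description,
    )


def visible_hits(query: Defline, hits: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Drop library hits whose superfamily number the query marks as 'not'."""
    return [(lib, sf) for lib, sf in hits if sf != query.ignored_superfamily]


query = parse_defline(">GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675")
# Made-up library hits: (library sequence name, superfamily number).
hits = [("LIB_A", 40001), ("LIB_B", 90043), ("LIB_C", 12345)]
print(query)
print(visible_hits(query, hits))
```

In this sketch a query without a "!" field ignores nothing, so every hit stays in the listing.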
>>February 10, 1999

Modify functions in scalesw*.c to prevent underflow after exp() on Alpha Linux machines. The Alpha/LINUX gcc compiler is buggy and doesn't behave properly with "denormalized" numbers, so "gcc -g -m ieee" is recommended.

Add "Display alignments also (y/n)[n] "

pvcomplib.c again provides alignments!! In addition, there is a new "-m 9" option, which reports alignments as:

>>>/home/wrp/slib/hlibs/hum0.aa#5>HS5 gi:1280326 T-cell receptor beta chain 30 aa, 30 aa
 vs /home/wrp/slib/hlibs/hum0.seg library
HS5   30  HS5      30  1.873e-11  1.000   30   1   30   1   30
HS5   30  HS2249   40  1.061e-07  0.774   31   1   30   7   37
HS5   30  HS2221   38  1.207e-07  0.833   30   1   30   7   35
HS5   30  HS2283   40  1.455e-07  0.774   31   1   30   7   37
HS5   30  HS2239   38  1.939e-07  0.800   30   1   30   7   35

where the columns are:

  query-name q-len lib-name lib-len E() %id align-len q-start q-end l-start l-end

>>February 9, 1999

Corrected bug in showalign.c that offset reverse complement alignments by one.

>>February 2, 1999

Changed the formatting slightly in showbest.c to have columns line up better.

>>January 11, 1999

Corrected some bugs introduced into fastf3(_t) in the previous version.

>>December 28, 1998

Corrected various problems in dropfz.c affecting alignment scores and coordinates.

Introduced a new program, fasts3(_t), for searching with peptide sequences.

>>November 11, 1998 --> v32t0

Added code to correct problems with coordinate number in long library sequences with tfastx/tfasty. With this release, sequences should be numbered properly, and sequence numbers count down with reverse complement library sequences.

In addition, with this release, fastx/y and tfastx/y translated protein alignments are numbered as nucleotides (increasing by 3, labels every 30 nucleotides) rather than codons.
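
The eleven "-m 9" columns listed in the February 10, 1999 entry can be split with a few lines of Python. This is only a sketch: the field names are labels chosen here rather than identifiers from pvcomplib.c, and it assumes whitespace-separated fields with no spaces inside sequence names.

```python
from typing import NamedTuple


class M9Row(NamedTuple):
    query_name: str
    q_len: int
    lib_name: str
    lib_len: int
    e_value: float
    frac_id: float   # the "%id" column, reported as a fraction (e.g. 0.774)
    align_len: int
    q_start: int
    q_end: int
    l_start: int
    l_end: int


def parse_m9_line(line: str) -> M9Row:
    """Parse one '-m 9' result line into named fields."""
    f = line.split()
    if len(f) != 11:
        raise ValueError(f"expected 11 fields, got {len(f)}: {line!r}")
    return M9Row(
        f[0], int(f[1]), f[2], int(f[3]),
        float(f[4]), float(f[5]),
        *(int(x) for x in f[6:]),
    )


# Second result line from the example in the February 10, 1999 entry.
row = parse_m9_line("HS5 30 HS2249 40 1.061e-07 0.774 31 1 30 7 37")
print(row.lib_name, row.e_value, row.l_start, row.l_end)
```

The header lines that start with ">>>" or " vs " do not have eleven fields, so a caller would need to skip them before calling parse_m9_line().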