$Name: fa_34_26_5 $ - $Id: readme.v33t0,v 1.45 2001/07/10 18:03:42 wrp Exp $

================ readme.v33t0 ================

This release includes an MPI implementation of the parallel
library-vs-library comparison code.  See readme.mpi_3.3 and
readme.pvm_3.3 for more information.

=====
>>July 9, 2001

Considerable changes to support no-global library functions. 

(1) Separate ascii/sequence mapping arrays are used by the
    query-reading (qascii), library-reading (lascii), and sequence
    comparison function (pascii) routines.  As a result, there is no
    longer a need for tgetlib.o/lgetlib.o - lgetlib.o can serve both
    functions.

(2) This also allows us to remove all #ifdef TFAST/FASTX conditionals
    from complib.c/comp_thr.c/p2_complib.c.  We no longer need
    tcomp_thr.o, comp_thrx.o, etc.  We still have a variety of
    p2_complib.o variations to support the different c34.work* files.

(3) Because non-global openlib/getlib functions are available, exactly
    the same open/get functions are available for reading both the
    query and reference libraries in pv34comp* programs.  The
    host-specific openlib/getlib functions in hxgetaa.c are now
    provided by nmgetlib.c, etc. This has two effects:

    (a) it is now possible to compare a query database generated by an
        SQL query to a library database generated by a different SQL
        query.

    (b) pv34comp* has lost (at least in this version) the ability to
        automatically detect the query sequence type. To search with a
        DNA query, you MUST use "-n".

(4) the resetp() function is now responsible for almost all of the
    function-specific (TFAST/FASTX/etc) initializations.  All of the
    function specific code has been removed from complib.c/comp_thr.c
    and most of it has been moved to initfa.c/resetp().

(5) manageacc.c has been merged into compacc.c (mostly prhist()).

(6) Although it may reflect a subtle bug in my code, it is not
    possible to reliably run threaded/memory mapped versions of the
    fasta34_t code.  I have spent considerable time tracking down the
    problem, and have determined that, in threaded code, something
    happens during the thread initialization to corrupt the
    description offset information used when files are memory mapped.
    This never occurs when the unthreaded versions of the code are
    used.  And it does not occur under MacOSX, Compaq Tru64Unix, Sun
    Solaris/Sparc, or SGI IRIX.

    Thus, I cannot recommend using the threaded code versions (_t)
    under Linux (RH6.2 or 7.1).

=====
>>June 1, 2001

Many changes to accommodate a new - no global variable - strategy for
reading sequence databases.  Every time a file is opened, a struct
lmf_str is allocated which can be used for memory mapped files, ncbl2
files, and mysql files.

In addition, an open'ed file has a default sequence type: DNA or
protein, or one can open a file in a mode that will allow the sequence
type to be changed.
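
As a rough illustration of the no-global strategy (a sketch only; the
class, field, and table names below are hypothetical and are not the
fields of struct lmf_str), each open can return its own handle that
carries its own sequence type and ascii/sequence mapping, so a protein
query file and a DNA library file can be open at the same time without
sharing state:

```python
# Hypothetical mapping tables: residue character -> small integer code.
PROTEIN_MAP = {c: i + 1 for i, c in enumerate("ARNDCQEGHILKMFPSTWYVBZX*")}
DNA_MAP = {c: i + 1 for i, c in enumerate("ACGTN")}
MAPS = {"protein": PROTEIN_MAP, "dna": DNA_MAP}


class LibHandle:
    """Per-open-file state (illustrative stand-in, not struct lmf_str)."""

    def __init__(self, name, seq_type="protein", allow_change=False):
        self.name = name
        self.seq_type = seq_type          # default sequence type
        self.allow_change = allow_change  # opened in a retypable mode?
        self.ascii_map = MAPS[seq_type]   # private to this handle

    def set_seq_type(self, seq_type):
        if not self.allow_change:
            raise ValueError("file was not opened in a type-changeable mode")
        self.seq_type = seq_type
        self.ascii_map = MAPS[seq_type]

    def encode(self, residues):
        # Characters missing from the table map to 0 in this sketch.
        return [self.ascii_map.get(c, 0) for c in residues.upper()]


query = LibHandle("query.aa", "protein")
library = LibHandle("library.nt", "dna", allow_change=True)
```

Because the mapping lives in the handle, changing the type of one
open file does not affect any other open file.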

=====
>>May 18, 2001		CVS: fa33t09d0

A new compile time parameter - -DGAP_OPEN, is available to change the
definition of the "-f gap-open" parameter from the penalty for the
first residue in a gap to a true gap-open penalty, as is used in BLAST
and many other comparison algorithms.  This will probably become the
default for fasta in version 3.4.
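
The difference between the two definitions can be sketched as follows
(a minimal illustration, not code from the package; the function names
and the penalty values -12/-2 are assumptions chosen for the example).
Under the older definition the first gap residue costs "first" and
every later residue costs "extend"; under a true gap-open definition
the gap costs "gap_open" once plus "extend" for every residue,
including the first:

```python
def gap_cost_first_residue(n, first, extend):
    """Older -f meaning: 'first' covers residue 1, 'extend' each later one."""
    return first + (n - 1) * extend if n > 0 else 0


def gap_cost_true_open(n, gap_open, extend):
    """True gap-open meaning: 'gap_open' once, 'extend' for every residue."""
    return gap_open + n * extend if n > 0 else 0


def first_to_true_open(first, extend):
    """Convert a first-residue penalty to the equivalent gap-open penalty."""
    return first - extend


first, extend = -12, -2                         # hypothetical penalties
gap_open = first_to_true_open(first, extend)    # -10
costs_old = [gap_cost_first_residue(n, first, extend) for n in range(1, 6)]
costs_new = [gap_cost_true_open(n, gap_open, extend) for n in range(1, 6)]
```

With the conversion applied, both definitions give the same cost for
every gap length; only the number typed after "-f" changes.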

Fixes to conflicts between "-S" and "-s matrix".  When a scoring
matrix file was specified, lower-case alignments were not displayed
with -S (although the scores were calculated properly).

More extensive testing of mysql_lib.c (mySQL query-libraries) with
the pv4comp* and mp4comp* programs.

=====
>>April 5, 2001		CVS: fa33t08d4b3

Changes in nmgetlib.c and ncbl2_mlib.c to return long sequence
descriptions for PCOMPLIB (pv4/mp3comp*).  Also fix p2_complib.c to
request DNA library for translated comparisons.

Fix for prss33(_t) to read both sequences from stdin.

=====
>>March 27, 2001	CVS: fa33t08d4
 --> fa33t08d4

Problems in ncbl2_mlib.c found searching NCBI non-redundant nucleotide
database "nt" were fixed.  Testing revealed a minor memory leak, which
was fixed by modifying showbest.c, showalign.c, comp_thr.c, complib.c,
and p2_complib.c to remember the last opened database file more
effectively.

Modifications to allow 64-bit fseek/ftell on machines, such as Sun and
Linux/Intel, that support -D_FILE_OFFSET_BITS=64, -D_LARGEFILE_SOURCE,
off_t, fseeko(), and ftello(); these are enabled with the option
-DUSE_FSEEKO.  Machines with 64-bit longs do not need this option.
Machines with 32-bit longs can access files >2 Gb by using the 64-bit
file access functions fseeko() and ftello(), which work with off_t
file offsets instead of longs.

=====
>>March 3, 2001		CVS: fa33t08d2

Corrected problems in nmgetaa.c and mysql_lib.c with parallel
programs, and one serious problem with alternate DNA scoring matrices
(initfa.c, initsw.c) not being set properly.  A subtle problem with
the merge of scaleswn.c and scaleswg.c is fixed.

>>February 17, 2001

Modified mysql_lib.c to use "#", rather than "%ld", to indicate the
position of the GID.  This change was made because sprintf() cannot be
used reliably to generate an SQL string, as '"' and '%' are used in 
such strings.
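
The motivation can be sketched with an analogy in Python (this is not
the C code in mysql_lib.c; the function name and the sample query are
hypothetical).  A printf-style formatter tries to interpret every '%'
in the template, so an SQL LIKE pattern breaks it, while a plain
substitution of '#' leaves the rest of the statement untouched:

```python
def fill_gid(template, gid):
    """Replace every '#' in an SQL template with the numeric GID."""
    return template.replace("#", str(int(gid)))


# Hypothetical query whose text legitimately contains '%'.
template = ("select gid, sequence from proteins "
            "where description like '%kinase%' and gid=#;")
sql = fill_gid(template, 12345)

# The printf-style equivalent fails because '%k' is not a valid format.
try:
    "select gid from proteins where description like '%kinase%' and gid=%d;" % 12345
    printf_style_failed = False
except (ValueError, TypeError):
    printf_style_failed = True
```

This sketch replaces every '#'; a query that needs a literal '#' would
require some escaping convention, which is not addressed here.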

=====
>>January 17, 2001
(no version change, date change)

Minor fixes to initfa.c, initsw.c to deal with DNA scoring matrices
properly. "-n -s dna.mat" is required for the sequence/matrix to be
recognized as DNA.

>>January 16, 2001
-->v34t00

Merge of the main CVS trunk - fa33t06 with the latest release branch,
fa33t08.

In addition, PCOMPLIB mods have been made to mysql_lib.c.  Because
p2_complib.c gets sequence description information during the first
read of the database, the mysql_query must be changed to return:
result[0]=GID, result[1]=description, result[2]=sequence.  In the
PCOMPLIB case, the other SQL queries (for GID description, sequence)
are not necessary but must still be provided.

=====
>>January 16, 2001
(no version change, previous version not released)

changes to p2_complib.c to correct openlib() incompatibility.

changes to nmgetaa.c, ncbl2_lib.c to incorporate PCOMPLIB.  nxgetaa.c
removed.

=====
>>January 12, 2001
(no version change, previous version not released)

Change to initfa.c to move ktup check from query_parm() to last_init().

=====
>>January 10, 2001
--> v33t08

Fixes to complib.c, comp_thr.c to deal properly with long query
protein sequences when a short library chunk (e.g. -N 5000) was given.
In the case where the chunk size is too short, it will be reset to a
length which allows the search to proceed, by including an amount of
new sequence that is equal to the amount of overlap sequence.

scaleswn.c and scaleswg.c have been merged.

v33t08 includes the initial implementation for mySQL described below
for v33t07x.

======
>>Dec. 20, 2000
--> v33t07x

Initial implementation of a syntax for mySQL database queries.  A new
file, mysql_lib.c has been added, and changes have been made to
nmgetaa.c (which should now replace nxgetaa.c) and altlib.h.  A mySQL
database search needs a file with 4 parts:

(1) description of the database, user, password
(2) a select statement that generates the set of protein sequences
    as: UID, sequence
(3) a select statement that generates a UID, description given a UID
(4) a select statement that generates a single UID, sequence given a UID
    	
Each of the four parts should be separated by ';'.  For example, in
the database that we are using for testing, a file "demo.sql" that
contains:

================
localhost taxonomy username secret;
SELECT proteins.gid, proteins.sequence FROM proteins,swissprot WHERE proteins.gid=swissprot.gid AND swissprot.spid IS NOT NULL;
select proteins.gid, concat(swissprot.spid," ",proteins.description) from proteins,swissprot where proteins.gid=%ld AND swissprot.gid=proteins.gid;
select gid, sequence from proteins where gid=%ld;
================

will find all the proteins in the BLAST "nr" database that also have
SwissProt ID's when given the command line:

	fasta33 -q query.aa "demo.sql 16"

At least for simple queries, there is surprisingly little overhead for the
search.  For more complex queries involving several tables, the overhead
can be significant.
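
The four-part layout can be illustrated with a small parser (a sketch
under stated assumptions, not the parsing code in mysql_lib.c; the
function and key names are hypothetical, and a ';' inside a quoted SQL
string would defeat this simple split):

```python
DEMO_SQL = '''localhost taxonomy username secret;
SELECT proteins.gid, proteins.sequence FROM proteins,swissprot WHERE proteins.gid=swissprot.gid AND swissprot.spid IS NOT NULL;
select proteins.gid, concat(swissprot.spid," ",proteins.description) from proteins,swissprot where proteins.gid=%ld AND swissprot.gid=proteins.gid;
select gid, sequence from proteins where gid=%ld;
'''


def parse_sql_library(text):
    """Split a query-library file into its four ';'-separated parts."""
    parts = [p.strip() for p in text.split(";") if p.strip()]
    if len(parts) != 4:
        raise ValueError("expected 4 parts, found %d" % len(parts))
    host, database, user, password = parts[0].split()
    return {
        "host": host,
        "database": database,
        "user": user,
        "password": password,
        "select_all": parts[1],          # UID, sequence for every entry
        "select_description": parts[2],  # UID, description for one UID
        "select_sequence": parts[3],     # UID, sequence for one UID
    }


cfg = parse_sql_library(DEMO_SQL)
```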

At the moment, libraries that need the functions in mysql_lib.c will
use library type 16.  We may also use file type 17 for SQL queries
that return binary sequences.

This implementation of mysql_lib.c was written to require a minimal
amount of change to the other programs.  Only nmgetaa.c and altlib.h
needed to be changed to incorporate this new capability.  One result
of this limitation is that one cannot mix mySQL database queries with
other databases in the same search.  Eventually, I would like to make
a mySQL database like any other, so that several mysql database
queries could be searched in the same run, and mysql databases could
be mixed with other (flat file) databases, but this will require some
changes in the function calls throughout the code.  (Right now, the
various programs do not distinguish between an openlib() that is made
before searching a large database, and one before retrieving a single
sequence.  This must be changed for a database query like mySQL to
behave like other databases.)

Several mySQL demo files have been provided: mysql_demo*.sql.

(10 January 2001) The mySQL code has been tested on Intel Linux and
Compaq/Alpha/Tru64 Unix.

>>Dec. 9, 2000

Changes to apam.c to tie different default gap penalties to
alternate scoring matrices.  In addition, changes to apam.c, to deal
with user-specified matrices with or without '*'.

>>Nov. 5, 2000 (date updated)

pst.dnaseq can now have 3 values: -1 or 0 -> protein, 1 -> DNA, and 2 -> other.
This becomes important for things like init_karlin_a, which needs a
background frequency of residues.

>>Nov. 1, 2000

Significant bug fixes for the -z 6/-z 16 option.  An uninitialized
variable was fixed in karlin.c, and comp_thr.c did not pass the
correct composition argument type in find_zp().  The -z 6/16 option
has now been tested and works correctly on Alphas, Linux x86, SGI, Sun
and Mac OSX. Another problem was fixed in scaleswn.c (simplex()) that
prevented the code from being reused by the pv4/mp4 complib programs.

>>Oct. 9, 2000

Several changes made to accommodate Mac OSX.  Longer lists of superfamily
numbers now supported in p[su]4comp/m[su]4comp programs.

>>Sept 25, 2000

All global variables have been removed from scaleswn.c. The last to
go, db_struct db, required many edits, because until now, the fasta
programs have kept two versions of the db_struct data (entries,
length). One version was kept by the main program, which updated entry
number and db length as sequences were read; a second copy of this
information was kept by the statistical estimation routines.  Now
there is only one copy, which means that the E() values will be a
function of the complete database, not the database with some high
scoring sequences removed.

>>Sept 23, 2000

Continued removal of global variables from scaleswn.c.  Only one
global is left, db_struct db, which contains the number of entries in
the database and the number of residues.  It will be the next to go
(changing all the zs_to_*() functions) and scaleswn.c will be free
of globals.  scaleswg.c is gone - scaleswn.c compiles to scaleswg.o
with -DNORMAL_DIST.

>>Sept 20, 2000

Removal of histogram globals required changes in p2_complib.c as well.
p_complib.c has not been updated.  scaleswg.c has been modified to
reflect the new histogram strategy.

>>Sept 19, 2000

Substantial changes to remove globals for printing histogram.  m_msg
now contains a hist_str, which keeps histogram information.

>>Sept. 19, 2000
(no version change, previous version not released)

Correct bug introduced into scaleswn.c (inithist()) by changing
score2_sums[], score_sums[] from int to double.

Reporting of version numbers is more consistent between fasta33,
fasta33_t, and pv4compfa/mp4compfa.  The programs now report the same
numbers/dates in similar places.

>>Sept. 15, 2000
--> v33t07

Changes to fix problems with statistical estimates when a large
fraction (but not all) of the database is related.  Several users
reported problems when searching with rRNA genes with version 33t06.
In some cases, a 100% identical match over 1500 nt would not be
statistically significant against a search of the bacterial division
of Genbank.  This problem was not seen with some releases of v33t05.

The cause of the problem was a change between v33t05 and v33t06 to
allow scoring matrices with unusual scaling to be used.  In v33t05,
there was a line that excluded all scores > 300 from the statistical
estimation procedure.  While 300 is a high score with any "normal"
scoring matrix, some investigators were using matrices scaled 10X, so
that a score of 300 was really a score of 30 with a conventional
matrix, and should not be excluded.  Unfortunately, removing the test
to exclude scores > 300 meant that when a rRNA sequence was used to
search the bacterial division, tens of thousands of high scoring
related sequences were treated as if they were unrelated, with the
result that the variance estimates were much too high, and thus high
real scores had low z-scores, and thus were not statistically
significant.  (There appear to be more than 20,000 rRNA sequences in
the bacterial division of Genbank, almost 25% of all sequences).

The solution to the problem is a substantial enhancement in the
strategies used to exclude high-scoring, related sequences, the -z 1,
4, and 5 parameter estimation strategies.  The programs now estimate
the expected high scoring sequence by calculating an ungapped Lambda
and K, and then use a relatively conservative threshold for excluding
scores that are higher than would be expected 0.01 times by chance.
By calculating Lambda and K, we can scale the cutoff thresholds to
allow scoring matrices with unusual scales.  For "normal" searches,
there should be little change, but there should be an improvement for
searches with large numbers of related sequences in the database.
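
The scaling argument can be sketched with the Karlin-Altschul
relation E = K*m*n*exp(-Lambda*S).  This is an illustration only: the
function name and the numeric values are hypothetical, and the exact
search-space term used by the programs is an assumption here.  Solving
for the score expected 0.01 times by chance gives a cutoff that moves
with the scale of the scoring matrix, unlike a fixed limit of 300:

```python
import math


def score_cutoff(lam, K, query_len, lib_len, expect=0.01):
    """Score S at which K*m*n*exp(-lam*S) equals 'expect'."""
    return math.log(K * query_len * lib_len / expect) / lam


# Hypothetical parameters for a conventionally scaled matrix.
lam, K = 0.3, 0.1
cut_normal = score_cutoff(lam, K, query_len=1500, lib_len=1500)

# A matrix scaled 10X has a Lambda that is 10X smaller (K is unchanged),
# so the cutoff is 10X larger.
cut_scaled = score_cutoff(lam / 10.0, K, query_len=1500, lib_len=1500)
```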

As a result of testing for this change, a bug in the karlin() function
used with -z 6 was found and corrected.

=======
>>Sept. 9, 2000

Changes to manshowbest.c to include correct display coordinates.

Significant changes to structs.h, param.h, p2_complib.c,
p2_workcomp.c, to store and use a reliable a_struct for alignment
coordinates.

Other cosmetic changes.

>>Sept. 7, 2000

Minor changes to complib.c, showrss.c, so that prss33 -q uses 200
shuffles and prss33 provides bit scores, rather than z-scores.
(no version number change).

Modifications to p2_complib.c to include superfamily numbers for
ps4comp* ms4comp*.

>>Aug 22, 2000

Changes to mmgetaa.c, ncbl2_mlib.c, dropfs.c to accommodate AIX.
00README.1st updated to reflect the current version and correct
outdated information on threads.

>>Aug. 3, 2000

Modifications to initpam2() in initsw.c to correct a problem with pam_x
when the -S option is used.

Modifications to compacc.c, scaleswn.c to ensure that residue numbers
are calculated properly when more than 2 Gb of sequence is searched.

>>July 12, 2000

Modifications to dropnfa.c so that DNA matches to 'N' will be included
in the "ungapped %identity".  Thus, a sequence that is 100% identical
for 100 nt on either side of a 100 nt region that has been masked to
'NNNNN' will be reported as: "67% identical (100% ungapped)".  This
has been added to deal with masked BAC-end databases.  It would be
better if masking changed the letters to lowercase, but the mouse
BAC-end sequences at TIGR use 'NNNNN'.  This is currently available
only for the fasta function, not [t]fast[x/y], etc, and only for DNA
sequences.

mk_n_pam() in apam.c modified to ensure that mismatch scores of -1
remain -1.

>>June 25, 2000

Modification to nxgetaa.c, nmgetaa.c, mmgetaa.c to return Genbank Accession
number as part of the descriptive string.

>>June 11, 2000

(no version change - not yet released)

Modifications to calcons(), calc_id(), showbest(), p_workcomp.c to
provide ngap_q (number of alignment gaps in query) , ngap_l (number
of gaps in library) information for -m 9 output.

>>June 6, 2000

(no version change - not yet released)

Modified scaleswn.c to provide better support for unconventional
scoring matrices, in particular, scoring matrices where every
value is 50-times higher.  Previous versions of the MLE estimator (-z
2) started with lambda = 0.2, which is too high for a scoring matrix
going from -500:+1500. The initial estimate for lambda is now
calculated using the formula: lambda = pi/sqrt(6*variance).  For the
default -z 1, a restriction to limit scores to a maximum of 300 for
the statistical analysis was removed.
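
The formula follows from the variance of the extreme value
distribution, pi^2/(6*lambda^2).  A minimal sketch (the function name
and the sample scores are hypothetical, and the use of the sample
variance is an assumption) shows why the estimate adapts to the scale
of the matrix:

```python
import math


def initial_lambda(scores):
    """Starting lambda from the variance of the observed scores."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.pi / math.sqrt(6.0 * variance)


scores = [22.0, 31.0, 27.0, 40.0, 35.0, 29.0, 33.0, 25.0]  # made-up values
lam_normal = initial_lambda(scores)

# Scores from a matrix whose values are 50-times higher.
lam_scaled = initial_lambda([50.0 * s for s in scores])
```

Multiplying every score by 50 divides the estimate by 50, so the
search for lambda starts in the right range for either matrix.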

>>June 3, 2000

Modified alignment output, and -m 9 and -m10, to report an "ungapped"
identity as well as the traditional "gapped" identity.  The
traditional "gapped" identity reports the number of identities divided
by the overall length of the alignment, including gaps.  The
"ungapped" identity does not include gaps in the length of the
alignment.  This new value is included for alignments that include
introns; thus, a tfastx33 search might find the 100% identical genomic
sequence but report the gapped percent identity if a short intron were
included in the alignment (the alignment probably would not span a
long exon) as 66%.  The "ungapped" identity would remain 100%.  The
ungapped identity value is also shown in the "-m 9" output line after
the "gapped" fraction identical.
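
The two values can be computed from a pair of aligned strings as
follows (a sketch, not the calc_id() code; the function name, the gap
character, and the toy alignment are assumptions):

```python
def percent_identity(query_aln, lib_aln, gap="-"):
    """Return (gapped, ungapped) percent identity for two aligned strings."""
    if len(query_aln) != len(lib_aln):
        raise ValueError("aligned strings must have the same length")
    pairs = list(zip(query_aln, lib_aln))
    identical = sum(1 for q, l in pairs if q == l and q != gap)
    ungapped_len = sum(1 for q, l in pairs if q != gap and l != gap)
    gapped = 100.0 * identical / len(pairs)
    ungapped = 100.0 * identical / ungapped_len
    return gapped, ungapped


# Toy example: a protein aligned to a translated sequence that contains
# five extra residues (standing in for a short intron) in the middle.
query_aln = "MKTAYIAKQR" + "-----" + "QISFVKSHFS"
lib_aln   = "MKTAYIAKQR" + "LLLLL" + "QISFVKSHFS"
gapped, ungapped = percent_identity(query_aln, lib_aln)
```

Here the 20 aligned residues are all identical, so the ungapped
identity is 100%, while the five gap columns lower the gapped
identity to 80%.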

>>June 1, 2000

Modified -m 9 output to provide fraction identical, alignment boundary
information with the initial list of high scoring sequences, just as
the pv3comp and mp_comp versions do.  The -m 9 option now shows the
same alignment display as -m 0, but the width of the alignment is
increased by 40.  Thus, by default, -m 9 will show the list of best
hits, with percent identity, Smith-Waterman score, and alignment
boundaries initially, and then show standard (-m 0) alignments
with 100 residues/line.

>>May 29, 2000

Correct some problems with reading data files with <CR>'s under unix.

nmgetaa.c/nxgetaa.c/mmgetaa.c have been modified to convert <TAB>
('\t') to <SPC> (' ') in descriptive lines.

=======

>>May 3, 2000

  Corrected problem with very low mean_var in fit_llen() in scaleswn.c.

>>May 2, 2000
  (no version number change - previous version not released)

  Merged fasta33t05d2 with fasta33t06.  Also removed restriction on
"-M size-range" to proteins - the size range now can be applied to DNA
as well.

>>May 1, 2000
 (changes to v33t05d merged into v33t06) 

Introduced changes to include '*' as a valid sequence character, which
indicates termination.  Thus, 'TGA', 'TAG', and 'TAA' are now
translated to '*' rather than 'X', and the protein PAM matrices have
been modified to provide a match score of approximately 1/2 the max
identity score for a '*:*' match.  Otherwise, '*' is the same as 'X'.
This change only affects query sequences that include a '*' to
indicate an end of sequence; the '*' is not there by default.

The inclusion of '*' broke some things in tfasts33, tfastf33, fasty33,
and tfasty33, which were fixed today.

>>March 28, 2000/April 24, 2000
 --> v33t06

(a) -z 6 statistics that factor in composition
(b) -smatrix-offset pam-offset parameter

(a) This release provides a new statistics option, -z 6, which
provides a more sophisticated model that accounts for sequence
composition.  When -z 6 is used (only for fasta33(_t) and
ssearch33(_t)), the program calculates a composition parameter
comp=1/lambda using a modified version of the Karlin-Altschul karlin()
function.  As a result, every sequence in the database has an
associated length (n1) and composition (comp).

The length n1 and composition comp are used in the maximum likelihood
estimation described by Mott (1992) Bull. Math. Biol. 54:59-75.  Four
parameters are estimated, a0, a1, a2, and b1, and the probability of
obtaining a score is then:

p(s >= x) = 1-exp(-exp(-( a0 + a1*comp + a2*comp*log(n0*n1) + x)/(b1*comp)))

The maximum likelihood estimates of a0, a1, a2, and b1 are calculated
using the Nelder-Mead simplex search strategy.

The average Lambda is reported for the search using Lambda =
1/(b1*ave_comp), where ave_comp is the geometric mean of the comp values
calculated during the statistical estimates.
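
The probability and the average Lambda can be written out directly
from the expressions above (a sketch that follows the formula exactly
as stated in this note; the function names are hypothetical and the
parameter values are made up for illustration, not fitted values):

```python
import math


def p_score_ge(x, n0, n1, comp, a0, a1, a2, b1):
    """p(s >= x) as given above, for one library sequence."""
    z = -(a0 + a1 * comp + a2 * comp * math.log(n0 * n1) + x) / (b1 * comp)
    # 1 - exp(-exp(z)), computed with expm1 for small probabilities.
    return -math.expm1(-math.exp(z))


def average_lambda(b1, comps):
    """Lambda = 1/(b1*ave_comp), with ave_comp the geometric mean of comp."""
    ave_comp = math.exp(sum(math.log(c) for c in comps) / len(comps))
    return 1.0 / (b1 * ave_comp)


# Made-up parameters, chosen only so that the numbers are well behaved.
params = dict(n0=200, n1=300, comp=3.5, a0=-20.0, a1=0.0, a2=-1.0, b1=1.0)
p_low_score = p_score_ge(50.0, **params)
p_high_score = p_score_ge(100.0, **params)
lam = average_lambda(1.0, [2.0, 8.0])   # geometric mean of 2 and 8 is 4
```

Sequences whose 'comp' could not be calculated (comp = -1.0) would
have to be replaced by ave_comp before calling p_score_ge(), as
described in the next paragraph.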

The "lambda/comp" calculation can fail for sequences with very biased
amino acid composition.  When this occurs, 'comp' is set to -1.0 (as
is 'H', the information content parameter) and the 'ave_comp' value is
used to calculate statistical significance.  (But obviously 'ave_comp'
is not really appropriate, since if the sequence had an average 'comp'
value, it would have been calculated.)  When -z 6 is used, the
alignment display shows the 'comp' and 'H' values for that library
sequence.

(b) Scoring matrix offsets - The main reason that the "lambda/comp"
calculation fails is that, for the particular query/library sequence
pair, the expected score is not < 0; instead, Sum {p_ij S_ij} >= 0.0.
This problem is reported to 'stderr' when it occurs.  The simplest
solution to the problem is to provide an offset to the scoring matrix;
for example, to use Blosum62 - 1, which ranges from +10 to -5, rather
than the standard +11 to -4.  This option used to be available with
the -S offset option, but -S is now used to specify a lower-case
seg-ed database.  The offset can now be specified as part of the
scoring matrix name.  Thus, "-s BL62-1" uses Blosum62 reduced by 1 at
each entry.  The '-' character is used to indicate an offset, so
scoring matrix files must not have a '-' in their name.
Alternatively, "-s BL80+1" or "-s BL80--1" would add one to each value.
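
The naming convention and its effect on the expected score can be
sketched as follows (an illustration only, not the parsing code in
the package; the function names and the two-letter toy matrix are
hypothetical, and p_ij is taken to be the product of the two residue
frequencies):

```python
def parse_matrix_spec(spec):
    """Split "-s" text into (matrix name, offset added to each entry)."""
    for i, ch in enumerate(spec):
        if ch in "+-" and i > 0:
            value = int(spec[i + 1:])
            # '-' subtracts the number that follows, '+' adds it.
            return spec[:i], (value if ch == "+" else -value)
    return spec, 0


def apply_offset(matrix, offset):
    return {a: {b: s + offset for b, s in row.items()}
            for a, row in matrix.items()}


def expected_score(freqs, matrix):
    """Sum of p_i * p_j * S_ij over all residue pairs."""
    return sum(freqs[a] * freqs[b] * matrix[a][b]
               for a in matrix for b in matrix[a])


toy = {"A": {"A": 2, "B": -1}, "B": {"A": -1, "B": 2}}   # toy matrix
freqs = {"A": 0.5, "B": 0.5}
name, offset = parse_matrix_spec("BL62-1")
before = expected_score(freqs, toy)                       # not < 0
after = expected_score(freqs, apply_offset(toy, offset))  # now < 0
```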

nxgetaa.c, nmgetaa.c, and mmgetaa.c have been edited to avoid string
run-off problems after strncpy().

Fixed problem where positive gap extension penalties in ssearch33
were not converted to negative values.

>>April 8, 2000

Fixed problem in calculating corrected sequence lengths for
Altschul-Gish probabilities.

>>March 30, 2000
  (no version change, date updated to March 30, 2000)

Corrected problem with -m 9 option.

The '*' character is now available to allow translated alignments to
extend through the termination codon. Thus, if a protein sequence ends
with a '*', and the alignment extends into a translated termination codon, the
score will be increased.  The *:* match score is set to 1/2 the max
positive score for the matrix (see upam.h).  This strategy can also be
used to upweight a match that extends all the way to the end of a
full-length sequence by putting '*' at the end of both the query and
library protein sequences.  Recognition of '*' will probably become a
command line option.

>>March 21, 2000
  (no version change, previous version not distributed)

Changes to map_db.c, list_db.c, and mmgetaa.c to accommodate large
sequence files.  Long (64-bit on some systems) variables are now used
to specify file and memory position for the memory mapped functions.
As a result, there are now two *.xin (memory mapped index) file
formats: MP0, which uses 32-bit longs, and MP1, which uses 64-bit
longs. On 64-bit machines, MP0 32-bit indices are read properly, but
limit the database size to 2 or 4 Gb; MP1 64-bit indices allow very
large databases.  Blast2.0 formatdb databases are still limited to
4Gb.  To compile map_db.c to generate 64-bit index files, include the
compile time option -DBIG_LIB64 in the Makefile.  (Currently this
option has been tested only on the DEC Alpha and SGI platforms, and
will work only with Unix versions that provide 64-bit longs and 64-bit
ftell()'s.)
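
The size limits follow directly from the width of the stored offset.
The sketch below only demonstrates the integer ranges; it does not
reproduce the actual *.xin file layout, and the function name is
hypothetical:

```python
import struct


def offset_fits(fmt, offset):
    """True if 'offset' can be stored in the given fixed-width field."""
    try:
        struct.pack(fmt, offset)
        return True
    except struct.error:
        return False


GB = 2 ** 30
signed32_3gb = offset_fits("<i", 3 * GB)     # signed 32-bit: limit is 2 Gb
unsigned32_3gb = offset_fits("<I", 3 * GB)   # unsigned 32-bit: limit is 4 Gb
unsigned32_5gb = offset_fits("<I", 5 * GB)
signed64_5gb = offset_fits("<q", 5 * GB)     # 64-bit: far beyond 4 Gb
```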

The -R results file now uses sfn_cmp() to report a matching
superfamily number, if one exists, and '0' otherwise.

>>March 12, 2000
  (no version change, previous version not distributed)

Provide new strategy for specifying library abbreviations.  In
addition to:

	fasta33 query.aa %anr

one can also specify:

	fasta33 query.aa %pir1+sp+nr
or
	fasta33 query.aa +pir1+sp+nr
or 
	fasta33 query.aa %+pir1+sp+nr

where the + anywhere in the library name string indicates that
variable length library names, separated by '+', are being used (the
last '+' is optional).  The FASTLIBS file then becomes:

================
PIR1 Annotated Protein Database (rel 56)$0+pir1+/slib2/blast/pir1.lseg
NBRF Protein database (complete)$0+nbrf+@/seqlib/lib/NBRF.nam
NRL_3d structure database$0D/seqlib/lib/nrl_3d.seq 5
NCBI/Blast non-redundant proteins$0+nr+/slib2/blast/nr.lseg
NCBI/Blast Swissprot$0+sp+/slib2/blast/swissprot.lseg
================

The two abbreviation types, single letter and +word+, cannot be
intermixed, and at least initially, +word+ specifiers are
case-sensitive (single letter abbreviations are not) and will not be
available interactively, only on the command line.
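
How the two abbreviation styles might be told apart can be sketched
as follows (an illustration, not the code in the package; the
function names are hypothetical, and folding single letters to lower
case is an assumption based on their being case-insensitive):

```python
def parse_library_spec(spec):
    """Return (style, names) for a command-line library abbreviation."""
    body = spec.lstrip("%")
    if "+" in spec:
        # Variable-length names separated by '+'; empty pieces are dropped.
        return "word", [w for w in body.split("+") if w]
    return "letter", [c.lower() for c in body]


def parse_fastlibs_line(line):
    """Split a FASTLIBS line into (title, type char, abbreviation, file)."""
    title, rest = line.rstrip("\n").split("$", 1)
    type_char, rest = rest[0], rest[1:]
    if rest.startswith("+"):
        abbrev, filename = rest[1:].split("+", 1)
    else:
        abbrev, filename = rest[0], rest[1:]
    return title, type_char, abbrev, filename


words = parse_library_spec("%pir1+sp+nr")
letters = parse_library_spec("%anr")
entry = parse_fastlibs_line(
    "NCBI/Blast Swissprot$0+sp+/slib2/blast/swissprot.lseg")
```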

Removed 'K' estimate for Expectation_n, Expectation_i fits to the
distribution of unrelated similarity scores.  'K' cannot be calculated
from the data available.  'Lambda' can be calculated; it is
1.28255/sqrt(mean_var), and is still available.

>>March 3, 2000 
 (no version change)

changed Makefile33.common, Makefile.common, to incorporate $(NRAND)
rather than "rand48".  Provide nrandom.c which uses random(), as
replacement for nrand.c, which uses rand48().

>>February 8, 2000
  --> v33t05

Fixes to scaleswn.c (proc_hist_ml) to set num_db_entries properly.
Scaleswn.c also provides Lambda estimates for -z 1/11 (Expectation_n),
and -z 1/14 (Expectation_i) statistical estimates.

Modifications to calc_id() to correct bug in counting identities.
Modified showalign() to use calc_id() with -m 9, for simpler
debugging.

Additional modifications to dropfa*.c files to deal properly with 'n's
and 'x's.

Added new option: -x #, which allows one to override the penalty for a
match against 'x' (or 'N') provided by the scoring matrix.  This
option is particularly useful in fast[x/y] searches, where out of
frame low complexity regions can generate high scores.

The old function of '-x' - to specify an alternate coordinate system,
is now available as '-X # #'.

Updated scaleswn.c to provide window shuffle information for -z 12.

Updated compacc.c, workacc.c, to fix serious bug in wshuffle()
that destroyed aa1[n1]=0.

>>January 25, 2000
  --> v33t04

  A serious bug in all of the fasta related programs has been
corrected.  The new code in fasta33 which ignores certain residues
failed to initialize one of the arrays properly.  As a result, in
pathological situations, a very strong match could be missed.

  Corrected minor bug in initsw.c that caused a misplaced "ktup" command
line argument, which should be ignored by ssearch, to be read as -d
ktup.

  Improved error message for 0 length query sequence.

>>January 17, 2000
  --> no external version number change

Modified mmgetaa.c, map_db.c, and nmgetaa.c to provide memory mapping
of genbank flatfile (format=1) files.  This format could be read much
more efficiently, however.

>>January 12, 2000
  --> no external version number change

Changed the behavior of the options that set the number of high scores
(-b) and alignments (-d) that are displayed.  Previously, fasta33 -E
10.0 -d 10 would show 50 best scores, rather than all the scores with
E() < 10.0.  To get the -E threshold to limit, -E 10.0 -b 10000 -d 10
was required. This is now fixed. Setting "-d 10" does not affect the
number of best scores shown.

Minor change in mw.h to remove unused defines.

fasta3x.me (fasta3x.doc) updated.

>>January 6, 2000
  --> v33t03

Corrected bug in memory mapped reads of gcg_binary format files
that potentially caused the last 63 residues to be read improperly.

Changes to comp_thr.c, pthr_subs.c, uthr_subs.c, ibm_pthr_subs.c to
ensure that each thread has its own work_info structure. This solves
some minor race conditions that sometimes caused some parameters
not to be reported properly.

Changes to most of the drop*.c files to correct some minor problems
with sequence alphabets. Code in mmgetaa.c (memory mapped code for
FASTA, GCG compressed files) reordered to prevent files from being
memory mapped if appropriate index files are not available.

See readme.pvm_3.3 for updates to the pvm programs.

>>December 10, 1999
  (no version change - modifications largely affect ps3comp*)

Modifications to showsum.c to deal with 2 scores/sequence.  Modifications
to mmgetaa.c for superfamily numbers.

>>December 7, 1999
 (no version change, previous version not released)

Corrected problem in mmgetaa.c that caused searches on a memory mapped
single long sequence (e.g. Chr22) to fail.  Corrected bug in map_db.c
that caused it to crash on some architectures if a filename was not
specified.  Corrected off-by-three error in fasty/tfasty.  Corrected
indexing error in dropfz2.c.

>>December 5, 1999
 --> v33t02 
 
corrected some bugs in initfa.c/initsw.c/doinit.c that caused
abbreviated function names to be lost.

modify showbest.c, showalign.c to include information on position in
library sequence (bbp->cont) to distinguish subsegment of very long
sequences.  Currently, the new label is available only with -m 6.

>>November 29, 1999
 [t]fastz33 uses v33t02 of fasty function.

Replace dropfz.c with dropfz2.c.  Dropfz2.c interprets any codon
that includes the nucleotide 'N' as the amino acid 'X'. Previously, 'N' was
treated as 'A', so 'NNN' ended up 'K'.  This modification, together
with the -S option and lower-case pseg'ed databases, should ensure
that DNA queries with large numbers of 'N's do not match low
complexity regions.
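
The difference between the two behaviors can be sketched with the
standard genetic code (an illustration only, not the translation code
in the drop*.c files; the function names are hypothetical, and this
table writes termination codons as '*'):

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_n_as_a(codon):
    """Earlier behavior: 'N' is treated as 'A' before translation."""
    return CODON_TABLE.get(codon.upper().replace("N", "A"), "X")


def translate_n_as_x(codon):
    """Newer behavior: any codon containing 'N' becomes 'X'."""
    codon = codon.upper()
    if "N" in codon:
        return "X"
    return CODON_TABLE.get(codon, "X")


old_nnn = translate_n_as_a("NNN")   # 'AAA' is lysine, so this is 'K'
new_nnn = translate_n_as_x("NNN")   # 'X'
```

A run of 'N's therefore no longer translates to a run of 'K's that
could match a lysine-rich low complexity region.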

>>November 20, 1999
 (no version change, previous version not released)

Modify initfa.c to display initn, init1 scores for [t]fast[fs].
Include "-B" option to show previous z-scores.

>>November 17, 1999
 (no version change, previous version not released)
 
Modify dropfx.c to use saatran(), rather than aatran().  saatran
translates any 'N' containing codon as 'X'.  aatran() treats 'N' as
an 'A'.  Although more steps are required for translation, the program
appears to run just as fast.

>>November 7, 1999
 --> v33t01

Substantial changes to the output format in showbest.c (the list of
high scoring sequences) and showalign.c (the alignments).  The classic
list of best scores:

The best scores are:                             initn init1 opt z-sc E(82014)
gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIO  ( 218) 1497 1497 1497 1761.1 2.3e-91
gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE  ( 218) 1413 1413 1413 1662.9 6.7e-86

has been replaced by:

The best scores are:                                       opt bits E(82138)
gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIONE S-TRAN  ( 218) 1497 354 7.6e-98
gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE S-TRANSF  ( 218) 1413 335 5.3e-92

This display provides more information and removes the outdated initn
and init1 scores, which are no longer used. The "bit" score is
comparable to the blast2 bit score.  It is calculated as: (lambda*S -
ln K)/ln 2, where S is the raw similarity score, lambda and K are
statistical parameters estimated from the distribution of unrelated
sequence similarity scores.  All of the similarity scores, including
init1, initn, and z-scores are reported with the alignment data.
Z-scores are displayed instead of bit scores in the list of high
scores if the command line option "-B" is specified.
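
The bit score calculation is short enough to write out (a sketch; the
function name is hypothetical and the lambda and K values are made up,
since the fitted values depend on the search):

```python
import math


def bit_score(raw_score, lam, K):
    """(lambda*S - ln K)/ln 2, as given above."""
    return (lam * raw_score - math.log(K)) / math.log(2.0)


# Hypothetical statistical parameters for one search.
bits = bit_score(958, lam=0.16, K=0.05)

# Sanity check of the scaling: with lambda = ln 2 and K = 1, one raw
# score unit is exactly one bit.
unit = bit_score(10, lam=math.log(2.0), K=1.0)
```

Because lambda and K absorb the scale of the scoring matrix, bit
scores from searches with different matrices can be compared, which
is not true of the raw opt scores.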

In addition, the alignment score line has changed from:

>>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER  (220 aa)
 initn: 954 init1: 954 opt: 958 Z-score: 1130.9 expect() 1.1e-56
Smith-Waterman score: 958;  61.927% identity in 218 aa overlap (1-218:1-218)

to:

>>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER  (220 aa)
 initn: 954 init1: 954 opt: 958  Z-score: 1130.9  bits: 216.4 E(): 2.8e-56
Smith-Waterman score: 958;  61.927% identity in 218 aa overlap (1-218:1-218)

Besides the addition of the "bits:" score, the "expect()" label
has changed to "E()" to save some space.

>>November 4,12, 1999
(no version change)

Fixed serious bug in -z 2 lambda/K calculation in scaleswn.c

Fixed bugs in llgetaa.c (openlib()) and definition of superfamily
numbers.

>>October 21, 1999
(no version change)

Began using CVS for version control. Corrected faulty error message in
dropfs.c.  Corrected bad "goto loopl;" in dropfz.c.  Corrected prss3.rsp
for Makefile.tc (Win32 version).

>>October 18, 1999
 --> v33t0

Corrected some serious bugs with the various fasta/x/y programs when
-DALLOCN0 was used to save memory.  Improvements to fasta3x.me/.doc
documentation.

>>October 12, 1999
 --> v33tx

For this initial release of version 33 of the FASTA programs, the
Makefiles have been modified to make "fasta33(_t)", "fastx33(_t)",
etc, so that you can test fasta33 while retaining fasta3 (from release
v32t08).  The FASTA33 programs are somewhat slower than previous
releases, but I believe the ability to handle low complexity regions
without 'X'ing them out outweighs the slowdown.  By (temporarily)
changing the names of the programs slightly, it will be easier for you
to judge the relative cost and benefit.  To "make" the programs as
"fasta3(_t)", etc, simply replace "Makefile33.common" with
"Makefile.common" in the "Makefile" that you use.

>>September 30, 1999

ssearch3/fasta3/fastx3/fasty3 have been modified to search databases
containing both upper and lower case letters, where lower case letters
indicate low-complexity regions.  With the modified programs and the
"-S" option, lower case letters are treated as 'X's (low-complexity
regions) during the initial search phase, but as "conventional"
residues during the final alignment phase.  In addition, alignments
can contain lower case letters.  Currently, lower case letters are
mapped to 'X's during the scan of the entire library.  In the future,
alternate weights will be
available. This is a substantial improvement for very large scale
comparison, where one seeks both accurate statistical estimates and
accurate %identities and alignments, and for translated DNA:protein
comparisons, like "fastx3" and "fasty3", where out-of-frame
translations tend to match low complexity regions (see Pearson et
al. (1997) Genomics 46:24-36).

Protein databases (and query sequences) can be generated in the
appropriate format using John Wootton's "pseg" program, available from
ftp://ncbi.nlm.nih.gov/pub/seg/pseg.  Once you have compiled the "pseg"
program, use the command:

	pseg database.fasta -z 1 -q  > database.lc_seg

Once you have database.lc_seg, run the command "map_db" to generate
a ".xin" file that can be used to efficiently memory map the database.

You can then search database.lc_seg with or without the "-S" option.
Without "-S", the database is treated as any other FASTA format file -
all the residues are present.  With "-S", lower case residues will be
treated as 'X's during the initial scan but as normal residues when
final alignments are displayed.
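
The two views of a lower-case (low-complexity) region can be pictured
with the short Python sketch below.  It is not the code used by the
programs; the function names are hypothetical, and folding lower case
to upper case when "-S" is absent is my assumption about what "treated
as any other FASTA format file" means.

```python
def scan_view(seq, mask_lower=True):
    """Residues as seen by the initial library scan.

    With mask_lower=True (the "-S" case) lower case letters become 'X'.
    With mask_lower=False the letters are simply folded to upper case
    (an assumption for this sketch).
    """
    if mask_lower:
        return "".join("X" if c.islower() else c for c in seq)
    return seq.upper()

def alignment_view(seq):
    """Residues as shown in the final alignment: left unchanged."""
    return seq

if __name__ == "__main__":
    seq = "MKTaaaaLLV"
    print(scan_view(seq), alignment_view(seq))
```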

When the -S option is used, the matrix information line is changed
from: "BL50 matrix (15:-5)" to "BL50 matrix (15:-5)xS".  The "-S"
option is no longer available to provide a scoring matrix offset.

Unfortunately, Blast2.0 format files cannot contain lower case
letters.  We have addressed this problem by providing efficient memory
mapped access to Fasta and GCG/PIR, and GCG/compressed-binary files in
the last release of fasta32t08. The memory mapped file I/O
improvements are provided in fasta33 as well.

================ readme.v32 ================

FASTX/Y and FASTA (DNA) are now half as fast, because the programs now
search both the forward and reverse strands by default.

The documentation in fasta3x.me/fasta3x.doc has been substantially
revised.

>>October 20, 1999
(no version change)

Modify nxgetaa.c/nmgetaa.c to recognize 'N' as a possible DNA character.

>>October 9, 1999
 --> v32t08 (no version number change)

Added "-M low-high" option, where low and high are inclusion limits
for library sequences.  If a library sequence is shorter than "low" or
longer than "high", it will not be considered in the search.  Thus,
"-M 200-250" limits the database search to proteins between 200 and
250 residues in length.  This should be particularly useful for fasts3
and fastf3.  "-M -500" searches library sequences < 500; "-M 200-"
searches sequences > 200.  This limit applies only to protein
sequences.
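
A minimal Python sketch of how a "low-high" range of this kind could
be parsed and applied follows.  The function names are hypothetical.
The bounds are treated as inclusive, following "shorter than low or
longer than high"; the program itself may treat the open-ended forms
as strict limits.

```python
def parse_length_range(arg):
    """Parse 'low-high', '-high', or 'low-' into (low, high).

    A missing bound is returned as None, meaning "no limit".
    """
    low_text, sep, high_text = arg.partition("-")
    if not sep:
        raise ValueError("expected low-high, -high, or low-")
    low = int(low_text) if low_text.strip() else None
    high = int(high_text) if high_text.strip() else None
    return low, high

def in_range(length, low, high):
    """True if a library sequence of this length would be searched."""
    if low is not None and length < low:
        return False
    if high is not None and length > high:
        return False
    return True

if __name__ == "__main__":
    low, high = parse_length_range("200-250")
    print(in_range(218, low, high), in_range(300, low, high))
```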

Modified scaleswn.c to fall back to maximum likelihood estimates of
lambda, K rather than mean/variance estimates. (This allows MLE
estimation to be used instead of proc_hist_n when a limited range of
scores is examined.)

>>October 2, 1999
 --> v32t08

Many changes:

(1) memory mapped (mmap()ed) database reading - other database reading fixes
(2) BLAST2 databases supported
(3) true maximum likelihood estimates for Lambda, K
(4) Misc. minor fixes

(1) (Sept. 26 - Oct. 2, 1999) Memory mapped database access.
It is now possible to use mmap()ed access to FASTA format databases,
if the "map_db" program has been used to produce an ".xin" file.  If
USE_MMAP is defined at compile time and a ".xin" file is present, the
".xin" will be used to access sequences directly after the file is
mmap()ed.  On my 4-processor Alpha, this can reduce elapsed time by
50%. It is not quite as efficient as BLAST2 format, but it is close.

Currently, memory mapping is supported for type 0 (FASTA), 5
(PIR/GCG ascii), and 6 (GCG binary).  Memory mapping is used if a
".xin" file is present. ".xin" files are created by the new program
"map_db".  The syntax for "map_db" is:

	map_db [-n] "/dir/database.fa"

which creates the file /dir/database.fa.xin.  Library types can be
included in the filename; thus:

	map_db -n "/gcggenbank/gb_om.seq 6"

would be used for a type 6 GCG binary file. 

The ".xin" file must be updated each time the database file changes.
map_db writes the size of the database file into the ".xin" file, so
that if the database file changes, making the ".xin" offset
information invalid, the ".xin" file is not used. "list_db" is
provided to print out the offset information in the ".xin" file.
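
The size check described above can be illustrated with a toy index
written in Python.  The text layout used here (database size on the
first line, one offset per line after it) and the ".xin.txt" name are
invented for the example; they are not the real ".xin" format, which
is not described in this file.

```python
import os
import tempfile

def write_index(db_path, offsets):
    """Write a toy index: database size first, then one offset per line."""
    index_path = db_path + ".xin.txt"   # hypothetical name and format
    with open(index_path, "w") as fh:
        fh.write("%d\n" % os.path.getsize(db_path))
        for off in offsets:
            fh.write("%d\n" % off)
    return index_path

def load_index_if_current(db_path, index_path):
    """Return the offsets, or None if the database size has changed."""
    with open(index_path) as fh:
        recorded_size = int(fh.readline())
        offsets = [int(line) for line in fh]
    if recorded_size != os.path.getsize(db_path):
        return None                     # index is stale; do not use it
    return offsets

def demo():
    """Build a small database, index it, then grow it to make the index stale."""
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "database.fa")
        with open(db, "wb") as fh:
            fh.write(b">seq1\nMKT\n>seq2\nLLV\n")
        idx = write_index(db, [0, 10])
        fresh = load_index_if_current(db, idx)
        with open(db, "ab") as fh:
            fh.write(b">seq3\nGGG\n")
        stale = load_index_if_current(db, idx)
    return fresh, stale

if __name__ == "__main__":
    print(demo())
```

A size comparison is cheap, but it cannot detect an edit that leaves
the file length unchanged, which is one more reason to rebuild the
index whenever the database is updated.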

(Oct 2, 1999) The memory mapping routines have been changed to
allow several files to be memory mapped simultaneously. Indeed, once a
database has been memory mapped, it will not be unmap()ed until the
program finishes.  This fixes a problem under Digital Unix, and should
make re-access to mmap()ed files (as when displaying high scores and
alignments) much more efficient.  If no more memory is available for
mmap()ing, the file will be read using conventional fread/fgets.

(Oct 2, 1999) The names of the database reading functions have been
changed to allow both Blast1.4 and Blast2.0 databases to be read.  In
addition, Makefile.common now includes an option to link both
ncbl_lib.o and ncbl2_lib.o, which provides support for both libraries.
However, Blast1.4 support has not been tested.

The Makefile structure has been improved.  Each architecture specific
Makefile (Makefile.alpha, Makefile.linux, etc) now includes
Makefile.common.  Thus, changes to the program structure should be
correct for all platforms.  "map_db" and "list_db" are not made with
"make all".

The database reading functions in nxgetaa.c can now return a database
length of 0, which indicates that no residues were read.  Previously,
0-length sequences returned a length of 1, which was ignored.
Complib.c and comp_thr.c have changed to accommodate this
modification.  This change was made to ensure that each residue,
including the last, of each sequence is read.

Corrected bug in nxgetaa.c with FASTA format files with very long
(>512 char) definition lines.

(2) (September 20, 1999) BLAST2 format databases supported

This release supports NCBI Blast2.0 format databases, using either
conventional file reading or memory mapped files.  The Blast2.0 format
can be read very efficiently, so there is only a modest improvement in
performance with memory mapping.  The decision to use mmap()'ed files
is made at compile time, by defining USE_MMAP.  My thanks to Eamonn
O'Toole of DEC/Compaq, and Daryl Madura of Sun Microsystems, for
providing mmap()'ed modifications to fasta3.  On my machines, Blast2.0
format reduces search time by about 30%.  At the moment, ambiguous DNA
sequences are not decoded properly.

(3) (September 30, 1999) A new statistical estimation option is
available.  -z 2 has been changed from ln()-scaling, which never
should have been used, to scaling using Maximum Likelihood Estimates
(MLEs) of Lambda and K.  The MLE estimation routines were written by
Aaron Mackey, based on a discussion of MLE estimates of Lambda and K
written by Sean Eddy.  The MLE estimation examines the middle 95% of
scores, if there are fewer than 10000 sequences in the database;
otherwise it excludes (censors) the top 250 scores and the bottom 250
scores.  This approach seems to effectively prevent related sequences
from contaminating the estimation process.  As with -z 1, -z 12 causes
the program to generate a shuffled sequence score for each of the
library sequences; in this case, no censoring is done.  If the
estimation process is reliable, Lambda and K should not vary much with
different queries or query lengths.  Lambda appears not to vary much
with the comparison algorithm, although K does.
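
The censoring rule described in (3) can be sketched in Python as shown
below.  Only the selection of scores is illustrated, not the MLE fit
itself.  The function name is hypothetical, and splitting the excluded
5% evenly between the two tails is my assumption.

```python
def censor_scores(scores):
    """Return the scores that would be kept for the Lambda/K estimate.

    Fewer than 10000 scores: keep the middle 95% (2.5% trimmed from
    each tail, an assumption of this sketch).  Otherwise: drop the top
    250 and the bottom 250 scores.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n < 10000:
        cut = n * 25 // 1000     # 2.5% of n, in integer arithmetic
    else:
        cut = 250
    return ordered[cut:n - cut]

if __name__ == "__main__":
    print(len(censor_scores(range(1000))), len(censor_scores(range(20000))))
```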

(4) Minor changes include fixes to some of the alignment display routines,
individual copies of the pstruct structure for each thread, and some
changes to ensure that every last residue in a library is available
for matching (sometimes the last residue could be ignored).  This
version has undergone extensive testing with high-throughput sequences
to confirm that long sequences are read properly.  Problems with
fastf3/fasts3 alignment display have also been addressed.

>>August 26, 1999 (no version change - not released)

Corrected problem in "apam.c" that prevented scoring matrices from
being imported for [t]fasts3/[t]fastf3.

>>August 17, 1999
 --> v32t07

Corrected problem with opt_cut initialization that only appeared
with pvcomp* programs.

Improved calculation of FASTA optcut threshold for DNA sequence
comparison for match scores much less than +5 (e.g. +3).  The previous
optcut threshold was too high when the match penalty was < 4 and
ktup=6; it is now scaled more appropriately.

Optcut thresholds have also been raised slightly for
fastx/y3/tfastx/y3.  This should improve performance with minimal
effects on sensitivity.

>>July 29, 1999
(no version change - date change)

Corrected various uninitialized variables and buffer overruns
detected.

>>July 26, 1999 - new distribution
(no version change - v32t06, previous version not released)

Changed the location of "(reverse complement)" label in tfasta/x/y/s/f
programs.

Statistical calculations for tfasta/x/y in unthreaded version
corrected.  Statistical estimates for threaded and unthreaded versions
of the tfasta/x/y/s/f programs should be much more consistent.

Substantial modifications in alignment coordinate calculation/
presentation.  Minor error in fastx/y/tfastx/y end of alignment
corrected.  Major problems with tfasta alignment coordinates
corrected.  tfasta and tfastx/y coordinates should now be consistent.

Corrected problem with -N 5000 in tfasta/x/y3(_t) searches encountered
with long query sequences.

Updated pthr_subs.c/Makefile.linux to increase the pthreads stacksize
to try to avoid "cannot allocate diagonal arrays" error message.
Pthreads stacksize can be changed with RedHat 6.0, but not RedHat 5.2,
so Makefile.linux uses -DLINUX5 for RedHat5.* (no pthreads stack size).
I am still getting this message, so it has not been completely
successful.  Makefile.linux now uses -DALLOCN0 to avoid this problem,
at some cost in speed.

The pvcomp* programs have been updated to work properly with
forward/reverse DNA searches.  See readme.pvm_3.2.

>>July 7, 1999 - not released
 --> v32t06

Corrected bug in complib.c (fasta3, fastx3, etc) that caused core
dumps with "-o" option.

Corrected a subtle bug in fastx/y/tfastx/y alignment display.

>>June 30, 1999 - new distribution
(no version change)

Corrected doinit.c to allow DNA substitution matrices with -s matrix
option.

Changed ".gbl" files to ".h" files.

>>June 2 - 9, 1999 - new distribution
(no version change)

Added additional DNA lambda/K/H to alt_parms.h.  Corrected some
other problems with those tables for the case where (inf,inf)
gap penalties were not included.

Fixed complib.c/comp_thr.c error message to properly report filename
when library file is not found.

Included approximate Lambda/K/H for BL80 in alt_parms.h.
BL80 scoring matrix changed from 1/3 bit to 1/2 bit units.

Included some additional perl files for searchfa.cgi, searchnn.cgi
in the distribution (my-cgi.pl, cgi-lib.pl).

>>May 30, 1999, June 2, 1999 - new distribution
(no version number change)

Added Makefile.NetBSD, if !defined(__NetBSD__) for values.h.  Changed
zs_to_E() and z_to_E() in scaleswn.c to correctly calculate E() value
when only one sequence is compared and -z 3 is used.

>>May 27, 1999
(no version number change)

Corrected bug in alignment numbering on the % identity line
	27.4% identity in 234 aa (101-234:110-243)
for reverse complements with offset coordinates (test.aa:101-250)

>>May 23, 1999
(no version number change)

Correction to Makefile.linux (tgetaa.o : failed to -DTFAST). 

>>May 19, 1999
(no version number change)

Minor changes to pvm_showalign.c to allow #define FIRSTNODE 1.
Changes to showsum.c to change off-end reporting.  (Neither of these
changes is likely to affect anyone outside my research group.)

>>May 12, 1999
 --> v32t05

Fixed a serious bug in the fastx3/tfastx3 alignment display which
caused t/fastx3 to produce incorrect alignments (and incorrectly low
percent identities).  The scores were correct, but the alignment
percent identities were too low and the alignments were wrong.

Numbering errors were also corrected in fastx3/tfastx3 and
fasty3/tfasty3 and when partial query sequences were used.

>>May 7, 1999

Fixed a subtle bug in dropgsw.c that caused do_work() to calculate
incorrect Smith-Waterman scores after do_walign() had been called.
This affected only pvcompsw searches with the "-m 9" option.

>>May 5, 1999

Modified showalign.c to provide improved alignment information that
includes explicitly the boundaries of the alignment.  Default
alignments now say:

Smith-Waterman score: 175;  24.645% identity in 211 aa overlap (5:207-7:207)

>>May 3, 1999

Modified nxgetaa.c, showsum.c, showbest.c, manshowun.c to allow a
"not" superfamily annotation for the query sequence only.  The
goal is to be able to specify that certain superfamily numbers be
ignored in some of the search summaries.  Thus, a description line
of the form:

>GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675

says that GT8.7 belongs to superfamily 40001, but any library
sequences with superfamily number 90043 should be ignored in any
listing or summary of best scores.
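
A description line of this form could be split into its parts with
the Python sketch below.  This is not the parsing code in nxgetaa.c;
the function names are hypothetical, and the sketch handles only the
"name | superfamily ! excluded | description" layout shown above, not
NCBI-style lines that use "|" inside the identifier.

```python
def parse_superfamily(defline):
    """Split '>name | sf ! not_sf | description' into its parts."""
    fields = [f.strip() for f in defline.lstrip(">").split("|", 2)]
    name = fields[0]
    member, excluded = [], []
    if len(fields) > 1:
        keep, _, skip = fields[1].partition("!")
        member = [int(x) for x in keep.split()]
        excluded = [int(x) for x in skip.split()]
    description = fields[2] if len(fields) > 2 else ""
    return name, member, excluded, description

def is_ignored(library_superfamily, excluded):
    """True if a library sequence should be left out of the summaries."""
    return library_superfamily in excluded

if __name__ == "__main__":
    line = ">GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675"
    print(parse_superfamily(line))
```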

In addition, it is now possible to make a fasta3r/prcompfa, which is
the converse of fasta3u/pucompfa. fasta3u reports the highest scoring
unrelated sequences in a search using the superfamily annotation.
fasta3r shows only the scores of related sequences.  This might be
used in combination with the -F e_val option to show the scores
obtained by the most distantly related members of a family.

>>April 25, 1999

 -->v32t04 (not distributed)

Modified nxgetaa.c to remove the dependence of tgetaa.o on TFASTA
(necessary for a more rational Makefile structure).  No code changes.

>>April 19, 1999

Fixed a bug in showalign.c that displayed incorrect alignment coordinates.
(no version number change).

>>April 17, 1999

 --> v32t03

Fixed a serious bug, introduced in version fasta32, in DNA alignments
when the sequence has been broken into multiple segments.  In
addition, several minor problems with -z 3 statistics on DNA sequences
were fixed.

Added -m 9 option, which unfortunately does different things in
pvcompfa/sw and fasta3/ssearch3.  In both programs, -m 9 provides the
id's of the two sequences, length, E(), %_ident, and start and end of
the alignment in both sequences.  pvcompfa/sw provides this
information with the list of high scoring sequences.  fasta3/ssearch3
provides the information in lieu of an alignment.

>>March 18, 1999

 --> v32t02

Added information on the algorithm/parameter description line to
report the range of the pam matrices.  Useful for matrices like
MD_10, _20, and _40 which require much higher gap penalties.

>>March 13, 1999 (not distributed)

 --> v32t01 

 -r results.file  has been changed to -R results.file to accommodate
 DNA match/mismatch penalties of the form: -r "+1/-3".

>>February 10, 1999

Modify functions in scalesw*.c to prevent underflow after exp() on
Alpha Linux machines.  The Alpha/LINUX gcc compiler is buggy and
doesn't behave properly with "denormalized" numbers, so "gcc -g
-mieee" is recommended.

Add "Display alignments also (y/n)[n] "

pvcomplib.c again provides alignments!!  In addition, there is a
new "-m 9" option, which reports alignments as:

>>>/home/wrp/slib/hlibs/hum0.aa#5>HS5 gi:1280326 T-cell receptor beta chain 30 aa, 30 aa vs /home/wrp/slib/hlibs/hum0.seg library
HS5         	  30	HS5         	  30	1.873e-11	1.000	  30	   1	  30	   1	  30
HS5         	  30	HS2249      	  40	1.061e-07	0.774	  31	   1	  30	   7	  37
HS5         	  30	HS2221      	  38	1.207e-07	0.833	  30	   1	  30	   7	  35
HS5         	  30	HS2283      	  40	1.455e-07	0.774	  31	   1	  30	   7	  37
HS5         	  30	HS2239      	  38	1.939e-07	0.800	  30	   1	  30	   7	  35

where the columns are:

query-name      q-len   lib-name      lib-len   E()             %id    align-len  q-start q-end   l-start l-end
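
Output lines of this kind are easy to read back into a script.  The
Python sketch below assumes the eleven columns are separated by tab
characters, as they appear to be in the example above; the field names
are my own labels for the columns, not names used by the programs.

```python
from collections import namedtuple

Hit = namedtuple(
    "Hit",
    "query q_len lib lib_len e_value ident align_len "
    "q_start q_end l_start l_end",
)

def parse_m9_line(line):
    """Parse one tab-delimited "-m 9" result line into a Hit."""
    f = [x.strip() for x in line.rstrip("\n").split("\t")]
    if len(f) != 11:
        raise ValueError("expected 11 tab-separated columns, got %d" % len(f))
    return Hit(f[0], int(f[1]), f[2], int(f[3]), float(f[4]), float(f[5]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]), int(f[10]))

if __name__ == "__main__":
    example = ("HS5         \t  30\tHS2249      \t  40\t1.061e-07\t0.774"
               "\t  31\t   1\t  30\t   7\t  37")
    print(parse_m9_line(example))
```

Note that in the example output the identity column holds a fraction
(0.774) rather than a percentage.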

>>February 9, 1999

Corrected bug in showalign.c that offset reverse complement alignments
by one.

>>February 2, 1999

Changed the formatting slightly in showbest.c to have columns line up better.

>>January 11, 1999

Corrected some bugs introduced into fastf3(_t) in the previous version.

>>December 28, 1998

Corrected various problems in dropfz.c affecting alignment scores
and coordinates.

Introduced a new program, fasts3(_t), for searching with peptide
sequences.

>>November 11, 1998

  --> v32t0

Added code to correct problems with coordinate numbering in long library
sequences with tfastx/tfasty.  With this release, sequences should be
numbered properly, and sequence numbers count down with reverse
complement library sequences.

In addition, with this release, fastx/y and tfastx/y translated
protein alignments are numbered as nucleotides (increasing by 3,
labels every 30 nucleotides) rather than codons.