Mac binaries

[jabaws.git] / website / archive / binaries / mac / src / fasta34 / fasta20.doc
diff --git a/website/archive/binaries/mac/src/fasta34/fasta20.doc b/website/archive/binaries/mac/src/fasta34/fasta20.doc

new file mode 100644 (file)

index 0000000..34a5095
--- /dev/null
+++ b/website/archive/binaries/mac/src/fasta34/fasta20.doc
@@ -0,0 +1,1144 @@
+
+                        COPYRIGHT NOTICE
+
+Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R.
+Pearson and the University of Virginia.  All rights reserved. The
+FASTA program and documentation may not be sold or incorporated
+into a commercial product, in whole or in part, without written
+consent of William R. Pearson and the University of Virginia.
+For further information regarding permission for use or
+reproduction, please contact: David Hudson, Assistant Provost for
+Research, University of Virginia, P.O. Box 9025, Charlottesville,
+VA 22906-9025, (434) 924-6853
+
+
+The FASTA program package
+
+Introduction
+
+     This documentation describes the version 2.0x of the FASTA
+program package (see W. R. Pearson and D. J. Lipman (1988),
+"Improved Tools for Biological Sequence Analysis", PNAS 85:2444-
+2448, and W. R.  Pearson (1990) "Rapid and Sensitive Sequence
+Comparison with FASTP and FASTA" Methods in Enzymology 183:63-
+98). Version 2.0 modifies version 1.8 to include explicit
+statistical estimates for similarity scores based on the extreme
+value distribution.  In addition, FASTA protein alignments now
+use the Smith-Waterman algorithm with no limitation on gap size.
+FASTA and SSEARCH now use the BLOSUM50 matrix by default, with
+options to change gap penalties on the command line. Version 1.7
+replaces rdf2 and rss with prdf and prss, which use the extreme-
+value distribution to calculate accurate probability estimates.
+
+
+Although there are a large number of programs in this package,
+they belong to four groups:
+
+
+    Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH
+
+    Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN
+
+    Statistical significance: PRDF, RELATE, PRSS, RANDSEQ
+
+    Global alignment: ALIGN
+
+
+
+In addition, I have included several programs for protein
+sequence analysis, including a Kyte-Doolittle hydropathicity
+plotting program (GREASE, TGREASE), and a secondary structure
+prediction package (GARNIER).
+
+     The FASTA sequence comparison programs on this disk are
+improved versions of the FASTP program, originally described in
+Science (Lipman and Pearson, (1985) Science 227:1435-1441).  We
+have made several improvements.  First, the library search
+programs use a more sensitive method for the initial comparison
+of two sequences which allows the scores of several similar
+regions to be combined.  As a result, the results of a library
+search are now given with three scores, initn (the new initial
+score which may include several similar regions), init1 (the old
+fastp initial score from the best initial region), and opt (the
+old fastp optimized score allowing gaps in a 32 residue wide
+band).
+
+     These programs have also been modified to become "universal"
+(hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or
+FAST-N (nucleotides)); by changing the environment variable
+SMATRIX, the programs can be used to search protein sequences,
+DNA sequences, or whatever you like.  By default, FASTA, LFASTA,
+and the PRDF programs automatically recognize protein and DNA
+sequences.  Sequences are first read as amino acids, and then
+converted to nucleotides if the sequence is greater than 85%
+A,C,G,T (the '-n' option can be used to indicate DNA sequences).
+TFASTA compares protein sequences to a translated DNA sequence.
+Alternative scoring matrices can also be used.  In addition to
+the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices
+based on simple identities or the genetic code can also be used
+for sequence comparisons or evaluation of significance.  Several
+different protein sequence matrices have been included;
+instructions for constructing your own scoring matrix are
+included in the file FORMAT.DOC.
+
+
+The remainder of this document is divided into three sections:
+(1) a brief history of the changes to the FASTA package; (2) A
+guide to installing the programs and databases; (3) A guide to
+using the FASTA programs. The programs are very easy to use, so
+if you are using them on a machine that is administered by
+someone else, you may want to skip to section (3) to learn how to
+use the programs, and then read section (1) to look at some of
+the more recent changes.  If you are installing the programs on
+your own machine, you will need to read section (2) carefully.
+
+
+1.  Revision History
+
+1.1.  Changes with version 2.0u
+
+     Version 2.0u provides several major improvements over
+previous versions of FASTA (and SSEARCH).  The most important is
+the incorporation of explicit statistical estimates and
+appropriate normalization of similarity scores. This improvement
+is discussed in more detail below in the section entitled
+Statistical Significance.  In addition, all of the protein
+comparison programs now use the BLOSUM50 matrix, with gap
+penalties of -12, -2, by default.  BLOSUM50 performs
+significantly better than the older PAM250 matrix.  PAM250 can
+still be used with the command line option: -s 250.  (DNA
+sequence comparisons use a more stringent gap penalty of -16, -4,
+which produces excellent statistical estimates when optimized
+scores are used. TFASTA uses -16, -4 as well.)
+
+     The quality of the fit of the extreme value distribution to
+the actual distribution of similarity scores is summarized with
+the Kolmogorov-Smirnov statistic.  The acceptance limits for this
+statistic can be found in many statistics books.  In general,
+values <0.10 (N=30) indicate excellent agreement between the
+actual and theoretical distributions.  If this statistic is >
+0.2, consider using a higher (more stringent) gap penalty, e.g.
+-16, -4 rather than -12, -2.  The default scoring matrix for DNA
+has been changed to score +5 for an identity and -4 for a
+mismatch.  These are the same scores used by BLASTN.
+
+     With explicit expectation calculations, the program now
+shows all scores and alignments with expectations less than 10.0
+(with optimized scores, 2.0 without optimization) when the "-Q"
+(quiet) mode is used.  The expectation threshold can be changed
+with the "-E" option.
+
+     Finally, the algorithm used to produce the final alignments
+of protein sequences is now a full Smith-Waterman, with unlimited
+gaps.  (The older band-limited alignments are used for DNA
+sequences and TFASTA by default, because Smith-Waterman
+alignments are very slow for long sequences.)  Both the optimized
+and Smith-Waterman scores are reported; if the Smith-Waterman
+score is higher, then additional gaps allowed a better alignment
+and similarity score to be calculated.
+
+     FASTA searches now optimize similarity scores by default
+(this slows searches about 2-fold (worst case) for ktup=2). Thus,
+the meaning of the "-o" option has been reversed; "-o" now turns
+off optimization and reports results sorted by "initn" scores.
+Optimization significantly improves the sensitivity of FASTA, so
+that it almost matches Smith-Waterman.  With version 2.0, the
+default band width used for optimized calculations can be varied
+with the "-y" option.  For proteins with ktup=2, a width of 16
+(-y 16) is used; 16 is also used for DNA sequences.  For proteins
+and ktup=1, a width of 32 is used. Searches that disable
+optimization with the "-o" option will work fine for sequences
+that share 25% or more identity in general, but to detect
+evolutionary relationships with 20% - 25% identity, the more
+sensitive default optimization is often required.  Optimization
+is required for accurate statistical estimates with either
+protein or DNA sequences.
+
+     The FASTA package now includes FASTX, a program that
+compares a DNA sequence to a protein sequence database by
+translating the DNA sequence in three frames (the reverse frames
+are selected with the -i option) and aligning the three-frame
+translation with the sequences in the protein database.
+Alignment scores allow frameshifts so that a cDNA or EST sequence
+with insertion/deletion errors can be aligned with its homologues
+from beginning to end.
+
+     With release 20u6, there is also a TFASTX program, which is
+a replacement for TFASTA.  TFASTA treats each of the six reading
+frames of a DNA library sequence as a different sequence; TFASTX
+compares a protein sequence against only two sequences from each
+DNA sequence - the forward and reverse orientation.  For a given
+orientation, TFASTX calculates a similarity score for alignments
+that allow frameshifts, thus considering all possible reading
+frames.
+
+     Another new program is included - randseq - which will
+produce a randomly shuffled (uniform or local shuffle) from an
+input sequence.  This randomly shuffled sequence can be used to
+evaluate the statistical estimates produced by FASTA, SSEARCH, or
+BLAST.
+
+1.2.  Changes with version 1.7
+Version 1.7 has been released to provide the PRDF and PRSS
+programs for shuffling sequences and estimating accurately the
+probabilities of the unshuffled-sequence scores.
+
+PRDF      a version of RDF2 that uses calculates the probability
+          of a similarity score more accurately by using a fit to
+          an extreme value distribution.  Code to fit the extreme
+          value distribution parameters and the impetus to update
+          RDF2 was provided by Phil Green, U. of Washington.
+
+PRSS      a version of PRDF that uses a rigorous Smith-Waterman
+          calculation to score similarities
+
+1.3.  Changes with version 1.6
+
+     FASTA version 1.6 uses a new method for calculating optimal
+scores in a band (the optimization or last step in the FASTA
+algorithm). In addition, it uses a linear-space method for
+calculating the actual alignments.  FASTA v1.6 package includes
+several new programs:
+
+SSEARCH   a program to search a sequence database using the
+          rigorous Smith-Waterman algorithm (this program is
+          about 100-fold slower than FASTA with ktup=2 (for
+          proteins).
+
+LALIGN    A rigorous local sequence alignment program that will
+          display the N-best local alignments (N=10 by default).
+
+PLALIGN   a version of lalign that plots the local alignments to
+          a tektronix display.
+
+FLALIGN   a version of lalign that plots the local alignments to
+          a GCG Figure file.
+
+     The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim"
+algorithm described by Huang and Miller (1991) Adv. Appl. Math.
+12:337-357.  The SSEARCH and PRSS programs incorporate algorithms
+described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
+
+     LFASTA and PLFASTA now calculate a different number of local
+similarities; they now behave more like LALIGN/PLALIGN.  Since
+local alignments of identical sequences produce "mirror-image"
+alignments, lalign and lfasta consider only one-half of the
+potential alignments between sequences from identical file names.
+Thus
+
+    lfasta mchu.aa mchu.aa
+
+Displays only two alignments, with earlier versions of the
+program, it would have displayed five, including the identity
+alignment.  PLFASTA does display five alignments; when two
+identical filenames are given, it draws the identity alignment,
+calculates the two unique local alignments, draws them, and draws
+their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
+filenames, rather than the actual sequences, to determine whether
+sequences are identical; you can "trick" the programs into
+behaving the old way by putting the same sequence in two
+different files.
+
+1.4.  Changes with version 1.5
+
+     FASTA version 1.5 includes a number of substantial revisions
+to improve the performance and sensitivity of the program.  It is
+now possible to tell the program to optimize all of the initn
+scores greater than a threshold.  The threshold is set at the
+same value as the old FASTA cutoff score.  Alternatively, you can
+tell FASTA to sort the results by the init1, rather than the
+initn, score by using the -1 option.  FASTA -1 ... will report
+the results the way the older FASTP program did.
+
+     A new method has been provided for selecting libraries. In
+the past, one could enter the name of a sequence file to be
+searched or a single letter that would specify a library from the
+list included in the $FASTLIBS file. Now, you can specify a set
+of library files with a string of letters preceded by a '%'.
+Thus, if the FASTLIBS file has the lines:
+
+    Genbank 70 primates$1P/seqlib/gbpri.seq 1
+    Genbank 70 rodents$1R/seqlib/gbrod.seq 1
+    Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
+    Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1
+
+Then the string: "%PRMB" would tell FASTA to search the four
+libraries listed above.  The %PRMB string can be entered either
+on the command line or when the program asks for a filename or
+library letter.
+
+     FASTA1.5 also provides additional flexibility for specifying
+the number of results and alignments to be displayed with the -Q
+(quiet) option.  The -b number option allows you to specify the
+number of sequence scores to show when the search is finished.
+Thus
+
+
+    FASTA -b 100 ...
+
+
+tells the program to display the top 100 sequence scores. In the
+past, if you displayed 100 scores (in -Q mode), you would also
+have store 100 alignments. The -d option allows you to limit the
+number of alignments shown.  FASTA -b 100 -d 20 would show 100
+scores and 20 alignments.
+
+     Finally, FASTA can provide a complete list of all of the
+sequences and scores calculated to a file with the -r (results)
+option.  FASTA -r results.out ... creates a file with a list of
+scores for every sequence in the library.  The list is not
+sorted, and only includes those scores calculated during the
+initial scan of the library.
+
+2.  Installing the FASTA package
+
+2.1.  Installing the programs
+
+2.1.1.  Unix version
+
+     The FASTA distribution comes with several makefile's that
+can be used to compile the FASTA programs.  Over the years, as
+ATT Unix System 5 and BSD unix have converged, these files have
+become very similar. To begin with, I recommend using the
+standard Makefile.  There are two values in the makefile that
+should be checked against the values used on your system: the HZ
+value, which is the frequency in ticks per second used by the
+times() system call, this value can usually be found by running:
+
+    grep HZ /usr/include/sys/*
+
+and the functions available to return random numbers.  If you
+have a rand48() function that returns a 32-bit random number, use
+it and use the lines:
+
+    NRAND=nrand48
+    RANFLG= -DRAND32
+
+If not, you will need to use the rand() function call and
+determine whether it returns a 16-bit or a 32-bit value.  These
+functions are used by PRDF and PRSS.  If you have problems
+compiling the programs, you may want to examine the makefile.unx
+and makefile.sun files, to look for differences.  I have tried to
+use very standard unix functions in these programs, and they have
+been successfully compiled, with very small changes to the
+Makefile, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS
+machines (under the BSD environment).
+
+2.1.2.  IBM-PC/DOS version
+
+     For the IBM-PC/DOS version, the FASTA source code disk
+contains the complete source code to all of the programs on the
+other disks.  The programs were compiled with Borland's Turbo
+'C++', using Borland's MAKE utility.  The graphics programs
+(PLFASTA, TGREASE) use the graphics device drivers supplied with
+the Turbo 'C' V2.0 package.  Also included are the documentation
+files PROGRAMS.DOC and FORMAT.DOC.  You do not need any of the
+files the source code disk to run the programs.  The files on
+this disk are identical to the UNIX and VMS versions that run on
+larger machines.  Also included is the code to compile
+ALIGN0.EXE.  ALIGN0 is the same as ALIGN, but does not penalize
+for end-gaps.
+
+     If you have the DOS or Macintosh version of the FASTA
+package, to install the programs you should:
+
+ (1)   Make a new directory (folder) for the FASTA programs.
+       This need not be the same as the directory for your
+       sequence databases.
+
+ (2)   Copy the files from the FASTA source disk to the new
+       directory.
+
+ (3)   (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your
+       PATH command to include the FASTA directory and (b) add
+       the line:
+
+           set FASTLIBS=c:\yourfastadirectory\fastgbs
+
+       On the Macintosh, you may need to edit the "environment"
+       file and change the line that reads:
+
+           FASTLIBS=fastgbs
+
+       to indicate the full directory path for the fastgbs file,
+       for example:
+
+           FASTLIBS=Q105:FASTA:fastgbs
+
+
+ (4)   Finally, you will need to edit the fastgbs file.  This is
+       usually the most confusing part of the installation.  An
+       example of this file is shown below; to customize this
+       file for your machine, you will need to change the file
+       names from those provided in the fastgbs file to ones that
+       reflect the directory names and file names you use on your
+       machine. This is explained in more detail below.  In
+       addition, some entries in the fastgbs file refer to other
+       files of file names.  These files of file names (as
+       opposed to actual database files) may also need to be
+       edited.
+
+2.2.  Installing the libraries
+
+2.2.1.  The NBRF protein sequence library
+
+     The FASTA program package does not include any protein or
+DNA sequence libraries.  You can obtain the PIR protein sequence
+database from:
+
+    National  Biomedical Research Foundation
+    Georgetown  University  Medical  Center
+    3900 Reservoir Rd, N.W.
+    Washington, D.C. 20007
+
+In addition, this database is available via anonymous ftp from
+the host "ftp.bchs.uh.edu". It is available in two formats, VMS
+and CODATA format.  The "VMS" format (library type 5 below) can
+be searched much faster, can be easily reformatted for use by the
+"BLAST" rapid searching program, and is compatible with the
+Genetics Computer Group package of programs.  The CODATA format
+is used by the EUGENE/MBIR computing package from Baylor (library
+type 2).
+
+2.2.2.  The GENBANK DNA sequence library
+
+     FASTA, and TFASTA search sequences from the GENBANK
+"flatfile" (not ASN.1) DNA sequence library in the flat-file
+format distributed by the National Center for Biotechnology
+Information and the PIR format used by EBI/EMBL.  CD-ROMs can be
+obtained from:
+
+    Genbank
+    National Center for Biotechnology Information
+    National Library of Medicine
+    National Institutes of Health
+    8600 Rockville Pike
+    Bethesda, MD  20894
+
+
+     The GenBank DNA sequence library is also available via
+anonymous FTP from ncbi.nlm.nih.gov.
+
+2.2.3.  The EBI/EMBL CD-ROM libraries
+
+     The European Bioinformatics Institute (EBI) is now
+distributing the EMBL CD-ROM that contains both the complete EMBL
+DNA sequence database (which should be essentially identical to
+the GenBank DNA sequence database) and the SWISS-PROT protein
+sequence database. SWISS-PROT is derived from the NBRF Protein
+sequence database with additions from the EBI/EMBL DNA sequence
+database.  This CD-ROM is a "best-buy," since it provides both
+DNA and protein sequence libraries.  It is available from:
+
+
+    European Bioinformatics Institute
+    Hinxton Genome Campus, Hinxton Hall
+    Hinxton, Cambridge CB10 1RQ,
+    United Kingdom
+    Tel: +44 1223 4944
+    Fax: +44 1223 494468
+    Email: DATALIB@ebi.ac.uk
+
+
+
+     In addition, the SWISS-PROT protein sequence database is
+available via anonymous FTP from ncbi.nlm.nih.gov.
+
+2.3.  Finding the libraries: FASTLIBS
+
+     FASTA and TFASTA use the environment variable FASTLIBS to
+find the protein and DNA sequence libraries.  The FASTLIBS
+variable contains the name of a file that has the actual
+filenames of the libraries.  The FASTGBS file on is an example of
+a file that can be referred to by FASTLIBS. To use the FASTGBS
+file, type:
+
+    setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
+    or
+    export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)
+
+Then edit the FASTGBS file to indicate where the protein and DNA
+sequence libraries can be found.  If you have a hard disk and
+your protein sequence library is kept in the file
+/usr/lib/aabank.lib and your Genbank DNA sequence library is kept
+in the directory: /usr/lib/genbank, then fastgbs might contain:
+
+    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
+    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
+    GB Primate$1P@/usr/lib/genbank/gpri.nam
+    GB Rodent$1R@/usr/lib/genbank/grod.nam
+    GB Mammal$1M@/usr/lib/genbank/gmammal.nam
+    ^   1    ^^^^       4                   ^     ^
+              23                             (5)
+
+The first line of this file says that there is a copy of the NBRF
+protein sequence database (which is a protein database) that can
+be selected by typing "P" on the command line or when the
+database menu is presented in the file /usr/lib/seq/aabank.lib.
+
+     Note that there are 4 or 5 fields in the lines in fastgbs.
+The first field is the description of the library which will be
+displayed by FASTA; it ends with a '$'.  The second field (1
+character), is a 0 if the library is a protein library and 1 if
+it is a DNA library.  The third field (1 character) is the
+character to be typed to select the library.
+
+     The fourth field is the name of the library file.  In the
+example above, the /usr/lib/seq/aabank.lib file contains the
+entire protein sequence library.  However the DNA library file
+names are preceded by a '@', because these files (gpri.nam,
+grod.nam, gmammal.nam) do not contain the sequences; instead they
+contain the names of the files which contain the sequences.  This
+is done because the GENBANK DNA database is broken down in to a
+large number of smaller files.  In order to search the entire
+primate database, you must search more than a dozen files.
+
+     In addition, an optional fifth field can be used to specify
+the format of the library file.  Alternatively, you can specify
+the library format in a file of file names (a file preceded by an
+'@').  This field must be separated from the file name by a space
+character (' ') from the filename.  In the example above, the
+aabank.lib file is in Pearson/FASTA format, while the swiss.seq
+file is in PIR/VMS format (from the EMBL CD-ROM). Currently,
+FASTA can read the following formats:
+
+    0 Pearson/FASTA (>SEQID - comment/sequence)
+    1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
+    2 NBRF CODATA (ENTRY/SEQUENCE)
+    3 EMBL/SWISS-PROT (ID/DE/SQ)
+    4 Intelligenetics (;comment/SEQID/sequence)
+    5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
+    6 GCG (version 8.0) Unix Protein and DNA (compressed)
+    11 NCBI Blast1.3.2 format  (unix only)
+
+In particular, this version will work with the EMBL and PIR VMS
+formats that are distributed on the EMBL CD-ROM. The latter
+format (PIR VMS) is much faster to search than EMBL format.  This
+release also works with the protein and DNA database formats
+created for the BLASTP and BLASTN programs by SETDB and PRESSDB
+and with the new NCBI search format.  If a library format is not
+specified, for example, because you are just comparing two
+sequences, Pearson/FASTA (format 0) is used by default.  To
+change this default, you may set the LIBTYPE environment variable
+to a number.  For example,
+
+    setenv LIBTYPE 1
+
+would cause the program to use the GenBank LOCUS format by
+default for libraries (or the second sequence file), but the
+Pearson/FASTA format would still be used for the query sequence.
+
+     You can specify a group of library files by putting a '@'
+symbol before a file that contains a list of file names to be
+searched.  For example, if @gmam.nam is in the fastgbs file, the
+file "gmam.nam" might contain the lines:
+
+    </usr/lib/genbank
+    gbpri.seq 1
+    gbrod.seq 1
+    gbmam.seq 1
+
+In this case, the line beginning with a '<' indicates the
+directory the files will be found in.  The remaining lines name
+the actual sequence files.  So the first sequence file to be
+searched would be:
+
+    /usr/lib/genbank/gbpri.seq
+
+The notation "<PIRNAQ:" might be used under the VAX/VMS operating
+system. Under UNIX, the trailing '/' is left off, so the library
+directory might be written as "</usr/seqlib".
+
+     With version 1.4 of the FASTA package, the FASTA and TFASTA
+programs can search a library composed of different files in
+different sequence formats.  For example, you may wish to search
+the Genbank files (in GenBank flat file format) and the EMBL DNA
+sequence database on CD-ROM.  To do this, you simply list the
+names and filetypes of the files to be searched in a file of
+filenames.  For example, to search the mammalian portion of
+Genbank, the unannotated portion of Genbank, and the unannotated
+portion of the EMBL library, you could use the file:
+
+    </usr/lib/DNA
+    gbpri.seq 1
+    #  (this '#' causes the program to display the size of the library)
+    gbrod.seq 1
+    gbmam.seq 1
+    gbuna.seq 1
+    unanno.seq 5
+    #
+
+    You do not need to include library format numbers if  you
+    only use the Pearson/FASTA version of the PIR protein se-
+    quence library.  If no library  type  is  specified,  the
+    program  assumes  that  type  0 is being used (unless you
+    have set LIBTYPE).
+
+Support for the old compressed GenBank files, which have not been
+distributed for more than four years, has been removed from
+programs in the FASTA package.
+
+
+     Test the setup by running FASTA.  Enter the sequence file
+'MUSPLFM.AA' when the program requests it (this file is included
+with the programs).  The program should then ask you to select a
+protein sequence library.  Alternatively, if you run the TFASTA
+program and use the MUSPLFM.AA query sequence, the program should
+show you a selection of DNA sequence libraries.  Once the fastgbs
+file has been set up correctly, you can set FASTLIBS=fastgbs in
+your AUTOEXEC.BAT file, and you will not need to remember where
+the libraries are kept or how they are named.
+
+     FASTA and TFASTA must open a large number of files when
+searching and reporting the results of a GENBANK floppy disk
+format library search.  You may have problems with the large
+number of files under DOS on IBM-PC's (Unix and VMS users will
+not have these problems).  If you are going to search the GENBANK
+floppy disk format DNA sequence library under DOS, you should add
+the line:
+
+    FILES=16
+
+to your CONFIG.SYS file.  (Typically this is already done for
+programs like Windows or WordPerfect.)
+
+3.  Using the FASTA Package
+
+3.1.  Overview
+
+     The FASTA sequence comparison programs all require similar
+information, the name of a query sequence file, a library file,
+and the ktup parameter.  All of the programs can accept arguments
+on the command line, or they will prompt for the file names and
+ktup value.
+
+To use FASTA, simply type:
+
+    FASTA
+    and you will be prompted for :
+         the name of the test sequence file
+         the name of the library file
+         and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
+
+             ktup of 2 is about 5 times faster than ktup = 1.
+             For  a  200  aa sequence against a 10,000,000 aa
+             library, the program takes  about  30  min  with
+             ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286
+             IBM-PC.
+
+
+The program can also be run by typing
+
+    FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
+
+
+Included with the package are the test files, MUSPLFM.AA,
+LCBO.AA, MCHU.AA and BOVPRL.SEQ.  To check to make certain that
+everything is working, you can try:
+
+    fasta musplfm.aa lcbo.aa
+    and
+    tfasta musplfm.aa bovprl.seq
+
+To test the local similarity programs LFASTA and PLFASTA, try:
+
+    lfasta mchu.aa mchu.aa
+    and
+    plfasta mchu.aa mchu.aa (use this only on an IBM-PC with graphics
+    or on a Tektronix terminal under UNIX or VMS)
+
+MCHU (calmodulin) has four duplicated calcium binding sites that
+are clearly detected by LFASTA.  For a more complicated example,
+try MWRTC1.aa, myosin heavy chain.
+
+3.2.  Sequence files
+
+     The FASTA programs know about three kinds of sequence files
+(four under VMS): (1) plain sequence files that can only be used
+as query sequences or for LFASTA, PRDF, and ALIGN. (2) Standard
+library files.  These are the same as plain sequence files, each
+sequence is preceded by a comment line with a '>' in the first
+column. (3) distributed sequence libraries (this is a broad class
+that includes the NBRF/PIR VMS and blocked ascii formats, Genbank
+flat-file format, EMBL flat-file format, and Intelligenetics
+format.  All of the files that you create should be of type (1)
+or (2).  Type (2) files (ones with a be used as query or library
+sequence files by all of the programs.
+
+     I have included several sample test files, *.AA.  The first
+line may begin with a '>'  or ';' followed by a comment.  The
+text after ';' in other lines will  be  ignored.   Spaces  and
+tabs  (and anything else that  is  not  an amino-acid code) are
+ignored.
+
+     Library files should have the form:
+
+    >Sequence name and identifier
+    A F A S Y T .... actual sequence.
+    F S S       .... second line of sequence.
+    >Next sequence name and identifier
+
+This is often referred to as "FASTA" or "Pearson" format.  You
+can build your own library by concatenating several sequence
+files.  Just be sure that each sequence is preceded by a line
+beginning with a '>' with a sequence name.
+
+     The test file should not have lines longer than 120
+characters, and sequences entered with word processors should use
+a document mode, with normal carriage returns at the end of
+lines.
+
+Program Summary
+
+3.3.  Sequence search programs
+
+FASTA     universal sequence comparison. Defaults to comparing
+          protein sequences; if the sequences are > 85% A+C+G+T
+          or the -n option is used, a DNA sequence is assumed.
+
+FASTX     Search a protein sequence library using amino acid
+          sequence comparison to the forward three frames of a
+          translated DNA query sequence. (The reverse frames are
+          specified with the -i option.) Alignment scores allow
+          frameshifts; the final alignment uses a Smith-Waterman
+          type alignment routine (no limit on gaps) that allows
+          frameshifts.
+
+TFASTA    Search DNA library for a protein sequence by
+          translating the DNA sequence to protein in all six
+          frames (three forward frames with the -3 command line
+          option). TFASTA with ktup=2 is about as fast as a DNA
+          FASTA with ktup=4, and is substantially more sensitive.
+          (also reads the GENBANK library)
+
+TFASTX    Search DNA library for a protein sequence by
+          translating the DNA sequence to protein in all six
+          frames (three forward frames with the -3 command line
+          option) calculating similarity scores that allow
+          frameshifts. TFASTX produces an optimal Smith-Waterman
+          alignment of the query and translated-library sequence.
+
+SSEARCH   Universal sequence comparison using the Smith-Waterman
+          algorithm ( T. F. Smith and M. S. Waterman (1981) J.
+          Mol. Biol. 147:195-197).  This program uses code
+          developed by Huang and Miller (X. Huang, R. C.
+          Hardison, W. Miller (1990) CABIOS 6:373-381) for
+          calculating the local similarity score and code from
+          the ALIGN program (see below) for calculating the local
+          alignment.  SSEARCH is about 50-times slower than FASTA
+          with ktup=2 (for proteins).
+
+ALIGN     optimal global alignment of two sequences with no
+          short-cuts.  This program is a slightly modified
+          version of one taken from E.  Myers and W. Miller. The
+          algorithm is described in E. Myers and W.  Miller,
+          "Optimal Alignments in Linear Space" (CABIOS (1988)
+          4:11-17).
+
+3.4.  Local similarity programs
+
+LFASTA    local similarity searches showing local alignments.
+          The algorithm used to calculate the local alignment in
+          a band has been improved (Chao, Pearson, and Miller,
+          submitted).
+
+PLFASTA   local similarity searches with plot output (on the IBM,
+          this program requires that the environment variable
+          BGIDIR be set).
+
+PCLFASTA  (unix only) local similarity searches with plot output
+          using pic commands.
+
+LALIGN    Calculates the N-best local alignments using a rigorous
+          algorithm.  (N=10 by default.) The algorithm was
+          developed by Huang and Miller (X.  Huang and W.  Miller
+          (1991) Adv. Appl. Math. 12:337-357), which is a
+          linear-space version of an algorithm described by M. S.
+          Waterman and M. Eggert (J.  Mol. Biol. 197:723-728).
+          Like SSEARCH, LALIGN is rigorous, but also very slow.
+
+PLALIGN   A version of LALIGN that plots its output to a screen
+          or to a Tektronix terminal emulator.
+
+3.5.  Statistical Significance
+
+     With version 2.0 of the FASTA program distribution, FASTA,
+TFASTA, and SSEARCH now provide estimates of statistical
+significance for library searches.  Work by Altschul, Arratia,
+Karlin, Mott, Waterman, and others (see Altschul et al. (1994)
+Nature Genetics 6:119 for an excellent review) suggests that
+local sequence similarity scores follow the extreme value
+distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u)) where
+u = ln(Kmn)/lambda and m,m are the lengths of the query and
+library sequence. This formula can be rewritten as: 1 - exp(-Kmn
+exp(-lambda x), which shows that the average score for an
+unrelated library sequence increases with the logarithm of the
+length of the library sequence.  FASTA and SSEARCH use simple
+linear regression against the the log of the library sequence
+length to calculate a normalized "z-score" with mean 50,
+regardless of library sequence length, and variance 10.  These
+z-scores can then be used with the extreme value distribution and
+the poisson distribution (to account for the fact that each
+library sequence comparison is an independent test) to calculate
+the number of library sequences to obtain a score greater than or
+equal to the score obtained in the search. The original idea and
+routines to do the linear regression on library sequence length
+were provided Phil Green, U. Washington.  This version of FASTA
+and SSEARCH uses a slightly different strategy for fitting the
+data than those originally provided by Dr. Green.
+
+     The expected number of sequences is plotted in the histogram
+using an "*". Since the parameters for the extreme value
+distribution are not calculated directly from the distribution of
+similarity scores, the pattern of "*'s" in the histogram gives a
+qualitative view of how well the statistical theory fits the
+similarity scores calculated by FASTA and SSEARCH.  For FASTA, if
+optimized scores are calculated for each sequence in the database
+(the default), the agreement between the actual distribution of
+"z-scores" and the expected distribution based on the length
+dependence of the score and the extreme value distribution is
+usually very good.  Likewise, the distribution of SSEARCH Smith-
+Waterman scores typically agrees closely with the actual
+distribution of "z-scores."  The agreement with unoptimized
+scores, ktup=2, is often not very good, with too many high
+scoring sequences and too few low scoring sequences compared with
+the predicted relationship between sequence length and similarity
+score.  In those cases, the expectation values may be
+overestimates.
+
+     The statistical routines assume that the library contains a
+large sample of unrelated sequences.  If this is not the case,
+then the expectation values are meaningless.  Likewise, if there
+are fewer than 20 sequences in the library, the statistical
+calculations are not done.
+
+     For protein searches, library sequences with E() values <
+0.01 for searches of a 10,000 entry protein database are almost
+always homologous. Frequently sequences with E()-values from 1 -
+10 are related as well. Remember, however, that these E() values
+also reflect differences between the amino acid composition of
+the query sequence and that of the "average" library sequence.
+Thus, when searches are done with query sequences with "biased"
+amino-acid composition, unrelated sequences may have
+"significant" scores because of sequence bias.  The programs
+below, PRDF and PRSS, can address this problem by calculating
+similarity scores for random sequences with the same length and
+amino acid composition.
+
+     If optimization is not used ("-o"), E-values for DNA
+sequences overestimate the significance of the scores that are
+obtained and unrelated sequences frequently have E()-values <
+0.0005. With optimization, the agreement between E()-value
+compares favorably with protein sequence comparison.  This is in
+part due to the use of more stringent gap penalties for DNA
+sequence comparison, -16, -4 rather than -12, -2.  With the
+latter penalties, many unrelated sequences appear to have
+significant similarity. Nevertheless, since protein sequence
+comparison is much more sensitive, DNA sequence comparison should
+not be used to identify sequences that encode protein.  Even with
+ktup=6, optimization rarely increases run-times more than 50%
+with mRNA-size query sequences.  Optimization should be used
+whenever possible.
+
+     Similar comments apply to TFASTA, where  higher gap
+penalties (-16,-4) are required for accurate statistical
+estimates.  Because TFASTA produces so many artificial "coding"
+sequences with atypical amino acid compositions, the statistical
+estimates with TFASTA are often over estimates.  With optimized
+scores, ktup=1, and gap penalties of -16, -4, unrelated sequences
+will sometimes have E() values of 0.1.  If initn scores are used,
+unrelated sequences may have have E() values < 0.01.
+
+PRDF      improved version of RDF program that includes accurate
+          probability estimates for all three scoring methods
+          (includes local or window shuffle routine)
+
+PRSS      A version of PRDF that uses the rigorous Smith-Waterman
+          calculation used by SSEARCH.
+
+RANDSEQ   produces a randomly shuffled sequence from a query
+          sequence.
+
+RELATE    significance program described by Dayhoff (Atlas of
+          Protein Sequence and Structure, Vol. 5, Supplement 3).
+          Each chunk of 25 residues in one sequence is compared
+          to every 25 residue fragment of the second sequence.
+          Sequences which are genuinely related will have a large
+          number of scores greater than 3 standard deviations
+          above the mean score of all of the comparisons.
+
+3.6.  Other analysis programs
+
+AACOMP    calculate the amino acid composition and molecular
+          weight of a sequence.
+
+BESTSCOR  calculate the best self-comparison score.
+
+GREASE    Kyte-Doolittle hydropathicity profile
+
+TGREASE   graphic plot of Kyte-Doolittle profile
+
+FROMGB    convert from GenBank LOCUS format (also used by the
+          IBI-Pustell programs) to Pearson/FASTA format.
+
+GARNIER   A secondary structure prediction program using the
+          method of Garnier, Osgusthorpe, and Robson, J. Mol.
+          Biol., (1978) 120:97-120.
+
+3.7.  Options
+
+     These programs have a number of output options, which are
+invoked by the environment variables LINLEN, SHOWALL, and MARKX.
+Alternatively, these values can be controlled by command line
+options.  The number of sequence residues per output line is now
+adjustable by setting the environment variable LINLEN, or the
+command line option -w.  LINLEN is normally 60, to change it set
+LINLEN=80 before running the program or add -w 80 to the command
+line.  LINLEN can be set up to 200.  SHOWALL (-a) determines
+whether all, or just a portion, of the aligned sequences are
+displayed.  Previously, FASTP would show the entire length of
+both sequences in an alignment while FASTN would only show the
+portions of the two sequences that overlapped. Now the default is
+to show only the overlap between the two sequences, to show
+complete sequences, set SHOWALL=1, or use the -a option on the
+command line.
+
+     The differences between the two aligned sequences can be
+highlighted in three different ways by changing the environment
+variable MARKX or the -m option.  Normally (MARKX=0) the program
+uses ':' do denote identities and '.' to denote conservative
+replacements.  If MARKX=1, the program will not mark identities;
+instead conservative replacements are denoted by a 'x' and non-
+conservative substitutions by a 'X'.  If MARKX=2, the residues in
+the second sequence are only shown if they are different from the
+first. MARKX=3 displays the aligned library sequences without the
+query sequence; these can be used to build a primitive multiple
+alignment.  MARKX=4 provides a graphical display of the
+boundaries of the alignments. Thus the five options are:
+
+
+     MARKX=0      MARKX=1       MARKX=2       MARKX=3      MARKX=4
+
+    MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
+    ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
+    MWKSCGYPYT   MWKSCGYPYT
+
+
+(fasta20u4, Feb. 1996) In addition MARKX=10 is a new, parseable
+format for use with other programs.  See the file"readme.v20u4"
+for a more complete description.
+
+3.8.  Command line options
+
+     It is now possible to specify  several options on the
+command line, instead of using environment variables.  The
+command line options are preceded by a dash; the following
+options are available:
+
+-a        same as showall=1
+
+-A        force Smith-Waterman alignments for DNA sequences and
+          TFASA.  By default, only FASTA protein sequence
+          comparisons use Smith-Waterman alignments.
+
+-b #      Number of sequence scores to be shown on output.  In
+          the absence of this option, fasta (and tfasta and
+          ssearch) display all library sequences obtaining
+          similarity scores with expectations less than 10.0 if
+          optimized score are used, or 2.0 if they are not. The
+          -b option can limit the display further, but it will
+          not cause additional sequences to be displayed.
+
+-c #      Threshold score for optimization (OPTCUT).  Set "-c 1"
+          to optimize every sequence in a database.  (This slows
+          the program down about 5-fold).
+
+-E #      Limit the number of scores and alignments shown based
+          on the expected number of scores.  Used to override the
+          expectation value of 10.0 used by default.  When used
+          with -Q, -E 2.0 will show all library sequences with
+          scores with an expectation value <= 2.0.
+
+-d #      Number of alignments to be reported by default. (Used
+          in conjunction with -Q).  No longer necessary, see "-b"
+          above.
+
+-f        Penalty for the first residue in a gap (-12 by default
+          for proteins, -16 for DNA or for TFASTA).
+
+-g        Penalty for additional residues in a gap (-2 by default
+          for proteins, -4 for DNA and TFASTA ).
+
+-h        Penalty for frameshift (FASTX, TFASTX only).
+
+-H        Omit histogram.
+
+-i        Invert (reverse complement) the query sequence if it is
+          DNA.  For TFASTX, search the reverse complement of the
+          library sequence only.
+
+-k #      Threshold for joining init1 segments to build an initn
+          score (GAPCUT).
+
+-l file   Location of library menu file (FASTLIBS).
+
+-L        Display more information about the library sequence in
+          the alignment.
+
+-m #      MARKX = # (0, 1, 2, 3, 4, 10)
+
+-n        Force the query sequence to be treated as a DNA
+          sequence.  This is particularly useful for query
+          sequences that contain a large number of ambiguous
+          residues, e.g. transcription factor binding sites.
+
+-O        Send copy of results to "filename."  Helpful for
+          environments without STDOUT.
+
+-o        Turn off default optimization of all scores greater
+          than OPTCUT. Sort results by "initn" scores.
+
+-Q,-q     Quiet - does not prompt for any input.  Writes scores
+          and alignments to the terminal or standard output file.
+
+-r file   Save a results summary line for every sequence in the
+          sequence library.  The summary line includes the
+          sequence identifier, superfamily number (if available)
+          position in the library, and the similarity scores
+          calculated.  This option can be used to evaluate the
+          sensitivity and selectivity of different search
+          strategies (see W. R. Pearson (1991) Genomics 11:635-
+          650.)
+
+-s file   SMATRIX is read from file.  Several SMATRIX files are
+          provided with the standard distribution.  For protein
+          sequences: codaa.mat - based on minimum mutation
+          matrix; idnaa.mat - identity matrix; pam250.mat - the
+          PAM250 matrix developed by Dayhoff et al (Atlas of
+          Protein Sequence and Structure, vol. 5, suppl. 3,
+          1978); pam120.mat - a PAM120 matrix.  The default
+          scoring matrix is BLOSUM50, PAM250 is available with
+          "-s 250", BLOSUM62 ("-s BL62") is also available.
+
+-v        (LINEVAL) values used for line styles in plfasta
+
+-w #      Line length (width) = number (<200)
+
+-x        Specifies offsets for the beginning of the query and
+          library sequence.  For example, if you are comparing
+          upstream regions for two genes, and the first sequence
+          contains 500 nt of upstream sequence while the second
+          contains 300 nt of upstream sequence, you might try:
+
+              fasta -x "-500 -300" seq1.nt seq2.nt
+
+          If the -x option is not used, FASTA assumes numbering
+          starts with 1.  This option will not work properly with
+          the translated library sequence with tfasta.  (You
+          should double check to be certain the negative
+          numbering works properly.)
+
+-y        Set the width of the band used for calculating
+          "optimized" scores.  For proteins and ktup=2, the width
+          is 16.  For proteins with ktup=1, the width is 32 by
+          default.  For DNA the width is 16.
+
+-z        Turn off statistical calculations.
+
+-1        sort output by init1 score (as FASTP used to do).
+
+-3        (TFASTA, TFASTX only) translate only three forward
+          frames
+
+
+For example:
+
+    fasta -w 80 -a seq1.aa seq.aa
+
+would compare the sequence in seq1.aa to that in seq2.aa and
+display the results with 80 residues on an output line, showing
+all of the residues in both sequences.  Be sure to enter the
+options before entering the file names, or just enter the options
+on the command line, and the program will prompt for the file
+names.
+
+     Not all of these options are appropriate for all of the
+programs.  The options above are used by FASTA and TFASTA. RELATE
+uses the -s option, ALIGN uses the -w, -m, and -s options, and
+the PRDF program uses -c, -f, -k, and -s.
+
+4.  Environment variable summary
+
+     Environment variables allow you to set search parameters
+that will be used frequently when you run a program; for example,
+if you prefer to use the PAM250 scoring matrix, you might "set
+SMATRIX=250."  Command line parameters, if used, always override
+environment variable settings. The following environment
+variables are used by this program:
+
+AABANK    the file name  of the default sequence library.
+
+FASTLIBS  the location of the file which contains the list of
+          library files to be searched.
+
+GAPCUT    threshold used for joining init1 regions in the second
+          step of FASTA.  Normally set based on sequence length
+          and ktup.
+
+LIBTYPE   used to specify the format of the library sequence for
+          FASTA and TFASTA.
+
+LINLEN    output line length - can go up to 200
+
+LINEVAL   used by plfasta to determine the relationship between
+          line style and similarity score (-v).  This should be a
+          string of three numbers, e.g.  "200 100 50"
+
+MARKX     symbol for denoting matches, mismatches. Note that this
+          symbol is only used across the optimized local region;
+          sequences that are outside this region are not marked.
+
+OPTCUT    Set the threshold to be used for optimization in a band
+          around the best initial region.  Normally the OPTCUT
+          value is calculated from the length of the sequence and
+          the ktup value (for a 200 residue sequence, it is about
+          28).  If OPTCUT=1, every sequence in the database will
+          be optimized.  This is the most sensitive option.
+
+PAMFACT   This version of fasta uses a more sensitive method for
+          identifying initial regions. Instead of using a
+          constant factor (fact) for each match in a ktup, it
+          uses the scoring matrix (PAM) scores.  While this works
+          well for protein sequences, it has not been as
+          carefully tested for DNA sequences, so by default, this
+          modification is used for proteins but not for DNA.
+          Setting the PAMFACT environment variable to 1 forces
+          the option on; PAMFACT=0 turns it off.
+
+SHOWALL   on output, show the complete sequence instead of just
+          the overlap of the two aligned sequences.
+
+SMATRIX   alternative scoring matrix file.
+
+TEKPLOT   (IBM-PC only, Unix and VMS versions generate Tektronix
+          graphics by default) Generate Tektronix output.
+          Normally, PLFASTA and TGREASE plot graphs using the
+          Turbo C graphics library.  Unfortunately, often these
+          plots cannot be printed out without special programs.
+          However, if you set TEKPLOT=1, tektronix graphics
+          commands will be used.  Tektronix commands can be used
+          together with the PLOTDEV program, available from
+          Microplot Systems.  They no lonter sell this program,
+          but it can be downloaded from
+          http://iquest.com/~microplt/index1.html.  PLOTDEV also
+          allows you to print out graphics on the screen.
+
+As always, please inform me of bugs as soon as possible.
+
+William R. Pearson
+Department of Biochemistry
+Box 440, Jordan Hall
+U. of Virginia
+Charlottesville, VA
+
+wrp@virginia.EDU