--- /dev/null
+
+ COPYRIGHT NOTICE
+
+Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R.
+Pearson and the University of Virginia. All rights reserved. The
+FASTA program and documentation may not be sold or incorporated
+into a commercial product, in whole or in part, without written
+consent of William R. Pearson and the University of Virginia.
+For further information regarding permission for use or
+reproduction, please contact: David Hudson, Assistant Provost for
+Research, University of Virginia, P.O. Box 9025, Charlottesville,
+VA 22906-9025, (434) 924-6853
+
+
+The FASTA program package
+
+Introduction
+
+ This documentation describes version 2.0x of the FASTA
+program package (see W. R. Pearson and D. J. Lipman (1988),
+"Improved Tools for Biological Sequence Analysis", PNAS 85:2444-
+2448, and W. R. Pearson (1990) "Rapid and Sensitive Sequence
+Comparison with FASTP and FASTA" Methods in Enzymology 183:63-
+98). Version 2.0 modifies version 1.8 to include explicit
+statistical estimates for similarity scores based on the extreme
+value distribution. In addition, FASTA protein alignments now
+use the Smith-Waterman algorithm with no limitation on gap size.
+FASTA and SSEARCH now use the BLOSUM50 matrix by default, with
+options to change gap penalties on the command line. Version 1.7
+replaces rdf2 and rss with prdf and prss, which use the extreme-
+value distribution to calculate accurate probability estimates.
+
+
+Although there are a large number of programs in this package,
+they belong to four groups:
+
+
+ Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH
+
+ Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN
+
+ Statistical significance: PRDF, RELATE, PRSS, RANDSEQ
+
+ Global alignment: ALIGN
+
+
+
+In addition, I have included several programs for protein
+sequence analysis, including a Kyte-Doolittle hydropathicity
+plotting program (GREASE, TGREASE), and a secondary structure
+prediction package (GARNIER).
+
+ The FASTA sequence comparison programs on this disk are
+improved versions of the FASTP program, originally described in
+Science (Lipman and Pearson, (1985) Science 227:1435-1441). We
+have made several improvements. First, the library search
+programs use a more sensitive method for the initial comparison
+of two sequences which allows the scores of several similar
+regions to be combined. As a result, a library search now
+reports three scores: initn (the new initial
+score which may include several similar regions), init1 (the old
+fastp initial score from the best initial region), and opt (the
+old fastp optimized score allowing gaps in a 32 residue wide
+band).
+
+ These programs have also been modified to become "universal"
+(hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or
+FAST-N (nucleotides)); by changing the environment variable
+SMATRIX, the programs can be used to search protein sequences,
+DNA sequences, or whatever you like. By default, FASTA, LFASTA,
+and the PRDF programs automatically recognize protein and DNA
+sequences. Sequences are first read as amino acids, and then
+converted to nucleotides if the sequence is greater than 85%
+A,C,G,T (the '-n' option can be used to indicate DNA sequences).
+TFASTA compares protein sequences to a translated DNA sequence.
+Alternative scoring matrices can also be used. In addition to
+the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices
+based on simple identities or the genetic code can also be used
+for sequence comparisons or evaluation of significance. Several
+different protein sequence matrices have been included;
+instructions for constructing your own scoring matrix are
+included in the file FORMAT.DOC.
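+
+ As a rough illustration of the automatic recognition rule, the
+sketch below (in Python, which is not part of the FASTA package)
+classifies a sequence as DNA when more than 85% of its letters are
+A, C, G, or T. The function name looks_like_dna and the choice to
+count only alphabetic characters are assumptions made for this
+example; the FASTA programs themselves may count residues
+differently.
+
```python
def looks_like_dna(sequence: str, threshold: float = 0.85) -> bool:
    """Return True if more than `threshold` of the letters are A, C, G or T.

    Hypothetical helper: it mirrors the 85% rule described in the text,
    not the actual FASTA source code.
    """
    # Keep letters only; spaces, digits and punctuation are not residues.
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for c in residues if c in "ACGT")
    return acgt / len(residues) > threshold


if __name__ == "__main__":
    print(looks_like_dna("ACGTACGTTAGCNACGT"))    # mostly A/C/G/T
    print(looks_like_dna("MKVLAAGIVGLCAKWTSE"))   # typical protein residues
```
+
+A sequence that falls below the threshold but is known to be DNA
+corresponds to the case where the '-n' option would be given.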
+
+
+The remainder of this document is divided into three sections:
+(1) a brief history of the changes to the FASTA package; (2) a
+guide to installing the programs and databases; (3) a guide to
+using the FASTA programs. The programs are very easy to use, so
+if you are using them on a machine that is administered by
+someone else, you may want to skip to section (3) to learn how to
+use the programs, and then read section (1) to look at some of
+the more recent changes. If you are installing the programs on
+your own machine, you will need to read section (2) carefully.
+
+
+1. Revision History
+
+1.1. Changes with version 2.0u
+
+ Version 2.0u provides several major improvements over
+previous versions of FASTA (and SSEARCH). The most important is
+the incorporation of explicit statistical estimates and
+appropriate normalization of similarity scores. This improvement
+is discussed in more detail below in the section entitled
+Statistical Significance. In addition, all of the protein
+comparison programs now use the BLOSUM50 matrix, with gap
+penalties of -12, -2, by default. BLOSUM50 performs
+significantly better than the older PAM250 matrix. PAM250 can
+still be used with the command line option: -s 250. (DNA
+sequence comparisons use a more stringent gap penalty of -16, -4,
+which produces excellent statistical estimates when optimized
+scores are used. TFASTA uses -16, -4 as well.)
+
+ The quality of the fit of the extreme value distribution to
+the actual distribution of similarity scores is summarized with
+the Kolmogorov-Smirnov statistic. The acceptance limits for this
+statistic can be found in many statistics books. In general,
+values <0.10 (N=30) indicate excellent agreement between the
+actual and theoretical distributions. If this statistic is >
+0.2, consider using a higher (more stringent) gap penalty, e.g.
+-16, -4 rather than -12, -2. The default scoring matrix for DNA
+has been changed to score +5 for an identity and -4 for a
+mismatch. These are the same scores used by BLASTN.
+
+ With explicit expectation calculations, the program now
+shows all scores and alignments with expectations less than 10.0
+(with optimized scores, 2.0 without optimization) when the "-Q"
+(quiet) mode is used. The expectation threshold can be changed
+with the "-E" option.
+
+ Finally, the algorithm used to produce the final alignments
+of protein sequences is now a full Smith-Waterman, with unlimited
+gaps. (The older band-limited alignments are used for DNA
+sequences and TFASTA by default, because Smith-Waterman
+alignments are very slow for long sequences.) Both the optimized
+and Smith-Waterman scores are reported; if the Smith-Waterman
+score is higher, then additional gaps allowed a better alignment
+and similarity score to be calculated.
+
+ FASTA searches now optimize similarity scores by default
+(this slows searches about 2-fold (worst case) for ktup=2). Thus,
+the meaning of the "-o" option has been reversed; "-o" now turns
+off optimization and reports results sorted by "initn" scores.
+Optimization significantly improves the sensitivity of FASTA, so
+that it almost matches Smith-Waterman. With version 2.0, the
+default band width used for optimized calculations can be varied
+with the "-y" option. For proteins with ktup=2, a width of 16
+(-y 16) is used; 16 is also used for DNA sequences. For proteins
+and ktup=1, a width of 32 is used. Searches that disable
+optimization with the "-o" option will work fine for sequences
+that share 25% or more identity in general, but to detect
+evolutionary relationships with 20% - 25% identity, the more
+sensitive default optimization is often required. Optimization
+is required for accurate statistical estimates with either
+protein or DNA sequences.
+
+ The FASTA package now includes FASTX, a program that
+compares a DNA sequence to a protein sequence database by
+translating the DNA sequence in three frames (the reverse frames
+are selected with the -i option) and aligning the three-frame
+translation with the sequences in the protein database.
+Alignment scores allow frameshifts so that a cDNA or EST sequence
+with insertion/deletion errors can be aligned with its homologues
+from beginning to end.
+
+ With release 20u6, there is also a TFASTX program, which is
+a replacement for TFASTA. TFASTA treats each of the six reading
+frames of a DNA library sequence as a different sequence; TFASTX
+compares a protein sequence against only two sequences from each
+DNA sequence - the forward and reverse orientation. For a given
+orientation, TFASTX calculates a similarity score for alignments
+that allow frameshifts, thus considering all possible reading
+frames.
+
+ Another new program is included - randseq - which will
+produce a randomly shuffled sequence (uniform or local shuffle)
+from an input sequence. This randomly shuffled sequence can be
+used to evaluate the statistical estimates produced by FASTA,
+SSEARCH, or BLAST.
+
+1.2. Changes with version 1.7
+
+Version 1.7 has been released to provide the PRDF and PRSS
+programs for shuffling sequences and estimating accurately the
+probabilities of the unshuffled-sequence scores.
+
+PRDF a version of RDF2 that calculates the probability
+ of a similarity score more accurately by using a fit to
+ an extreme value distribution. Code to fit the extreme
+ value distribution parameters and the impetus to update
+ RDF2 was provided by Phil Green, U. of Washington.
+
+PRSS a version of PRDF that uses a rigorous Smith-Waterman
+ calculation to score similarities
+
+1.3. Changes with version 1.6
+
+ FASTA version 1.6 uses a new method for calculating optimal
+scores in a band (the optimization or last step in the FASTA
+algorithm). In addition, it uses a linear-space method for
+calculating the actual alignments. The FASTA v1.6 package
+includes several new programs:
+
+SSEARCH a program to search a sequence database using the
+ rigorous Smith-Waterman algorithm (this program is
+ about 100-fold slower than FASTA with ktup=2 for
+ proteins).
+
+LALIGN A rigorous local sequence alignment program that will
+ display the N-best local alignments (N=10 by default).
+
+PLALIGN a version of lalign that plots the local alignments to
+ a Tektronix display.
+
+FLALIGN a version of lalign that plots the local alignments to
+ a GCG Figure file.
+
+ The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim"
+algorithm described by Huang and Miller (1991) Adv. Appl. Math.
+12:337-357. The SSEARCH and PRSS programs incorporate algorithms
+described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
+
+ LFASTA and PLFASTA now calculate a different number of local
+similarities; they now behave more like LALIGN/PLALIGN. Since
+local alignments of identical sequences produce "mirror-image"
+alignments, lalign and lfasta consider only one-half of the
+potential alignments between sequences from identical file names.
+Thus
+
+ lfasta mchu.aa mchu.aa
+
+displays only two alignments; earlier versions of the program
+would have displayed five, including the identity alignment.
+PLFASTA does display five alignments; when two
+identical filenames are given, it draws the identity alignment,
+calculates the two unique local alignments, draws them, and draws
+their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
+filenames, rather than the actual sequences, to determine whether
+sequences are identical; you can "trick" the programs into
+behaving the old way by putting the same sequence in two
+different files.
+
+1.4. Changes with version 1.5
+
+ FASTA version 1.5 includes a number of substantial revisions
+to improve the performance and sensitivity of the program. It is
+now possible to tell the program to optimize all of the initn
+scores greater than a threshold. The threshold is set at the
+same value as the old FASTA cutoff score. Alternatively, you can
+tell FASTA to sort the results by the init1, rather than the
+initn, score by using the -1 option. FASTA -1 ... will report
+the results the way the older FASTP program did.
+
+ A new method has been provided for selecting libraries. In
+the past, one could enter the name of a sequence file to be
+searched or a single letter that would specify a library from the
+list included in the $FASTLIBS file. Now, you can specify a set
+of library files with a string of letters preceded by a '%'.
+Thus, if the FASTLIBS file has the lines:
+
+ Genbank 70 primates$1P/seqlib/gbpri.seq 1
+ Genbank 70 rodents$1R/seqlib/gbrod.seq 1
+ Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
+ Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1
+
+Then the string: "%PRMB" would tell FASTA to search the four
+libraries listed above. The %PRMB string can be entered either
+on the command line or when the program asks for a filename or
+library letter.
+
+ FASTA1.5 also provides additional flexibility for specifying
+the number of results and alignments to be displayed with the -Q
+(quiet) option. The -b number option allows you to specify the
+number of sequence scores to show when the search is finished.
+Thus
+
+
+ FASTA -b 100 ...
+
+
+tells the program to display the top 100 sequence scores. In the
+past, if you displayed 100 scores (in -Q mode), you would also
+have had to store 100 alignments. The -d option allows you to
+limit the number of alignments shown. FASTA -b 100 -d 20 would
+show 100 scores and 20 alignments.
+
+ Finally, FASTA can provide a complete list of all of the
+sequences and scores calculated to a file with the -r (results)
+option. FASTA -r results.out ... creates a file with a list of
+scores for every sequence in the library. The list is not
+sorted, and only includes those scores calculated during the
+initial scan of the library.
+
+2. Installing the FASTA package
+
+2.1. Installing the programs
+
+2.1.1. Unix version
+
+ The FASTA distribution comes with several makefiles that
+can be used to compile the FASTA programs. Over the years, as
+ATT Unix System 5 and BSD Unix have converged, these files have
+become very similar. To begin with, I recommend using the
+standard Makefile. There are two values in the makefile that
+should be checked against the values used on your system. The
+first is the HZ value, which is the frequency in ticks per second
+used by the times() system call; this value can usually be found
+by running:
+
+ grep HZ /usr/include/sys/*
+
+The second is the function used to return random numbers. If you
+have an nrand48() function that returns a 32-bit random number,
+use it and use the lines:
+
+ NRAND=nrand48
+ RANFLG= -DRAND32
+
+If not, you will need to use the rand() function call and
+determine whether it returns a 16-bit or a 32-bit value. These
+functions are used by PRDF and PRSS. If you have problems
+compiling the programs, you may want to examine the makefile.unx
+and makefile.sun files, to look for differences. I have tried to
+use very standard unix functions in these programs, and they have
+been successfully compiled, with very small changes to the
+Makefile, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS
+machines (under the BSD environment).
+
+2.1.2. IBM-PC/DOS version
+
+ For the IBM-PC/DOS version, the FASTA source code disk
+contains the complete source code to all of the programs on the
+other disks. The programs were compiled with Borland's Turbo
+'C++', using Borland's MAKE utility. The graphics programs
+(PLFASTA, TGREASE) use the graphics device drivers supplied with
+the Turbo 'C' V2.0 package. Also included are the documentation
+files PROGRAMS.DOC and FORMAT.DOC. You do not need any of the
+files on the source code disk to run the programs. The files on
+this disk are identical to the UNIX and VMS versions that run on
+larger machines. Also included is the code to compile
+ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not penalize
+for end-gaps.
+
+ If you have the DOS or Macintosh version of the FASTA
+package, to install the programs you should:
+
+ (1) Make a new directory (folder) for the FASTA programs.
+ This need not be the same as the directory for your
+ sequence databases.
+
+ (2) Copy the files from the FASTA source disk to the new
+ directory.
+
+ (3) (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your
+ PATH command to include the FASTA directory and (b) add
+ the line:
+
+ set FASTLIBS=c:\yourfastadirectory\fastgbs
+
+ On the Macintosh, you may need to edit the "environment"
+ file and change the line that reads:
+
+ FASTLIBS=fastgbs
+
+ to indicate the full directory path for the fastgbs file,
+ for example:
+
+ FASTLIBS=Q105:FASTA:fastgbs
+
+
+ (4) Finally, you will need to edit the fastgbs file. This is
+ usually the most confusing part of the installation. An
+ example of this file is shown below; to customize this
+ file for your machine, you will need to change the file
+ names from those provided in the fastgbs file to ones that
+ reflect the directory names and file names you use on your
+ machine. This is explained in more detail below. In
+ addition, some entries in the fastgbs file refer to other
+ files of file names. These files of file names (as
+ opposed to actual database files) may also need to be
+ edited.
+
+2.2. Installing the libraries
+
+2.2.1. The NBRF protein sequence library
+
+ The FASTA program package does not include any protein or
+DNA sequence libraries. You can obtain the PIR protein sequence
+database from:
+
+ National Biomedical Research Foundation
+ Georgetown University Medical Center
+ 3900 Reservoir Rd, N.W.
+ Washington, D.C. 20007
+
+In addition, this database is available via anonymous ftp from
+the host "ftp.bchs.uh.edu". It is available in two formats, VMS
+and CODATA format. The "VMS" format (library type 5 below) can
+be searched much faster, can be easily reformatted for use by the
+"BLAST" rapid searching program, and is compatible with the
+Genetics Computer Group package of programs. The CODATA format
+is used by the EUGENE/MBIR computing package from Baylor (library
+type 2).
+
+2.2.2. The GENBANK DNA sequence library
+
+ FASTA and TFASTA search sequences from the GENBANK
+"flatfile" (not ASN.1) DNA sequence library in the flat-file
+format distributed by the National Center for Biotechnology
+Information and the PIR format used by EBI/EMBL. CD-ROMs can be
+obtained from:
+
+ Genbank
+ National Center for Biotechnology Information
+ National Library of Medicine
+ National Institutes of Health
+ 8600 Rockville Pike
+ Bethesda, MD 20894
+
+
+ The GenBank DNA sequence library is also available via
+anonymous FTP from ncbi.nlm.nih.gov.
+
+2.2.3. The EBI/EMBL CD-ROM libraries
+
+ The European Bioinformatics Institute (EBI) is now
+distributing the EMBL CD-ROM that contains both the complete EMBL
+DNA sequence database (which should be essentially identical to
+the GenBank DNA sequence database) and the SWISS-PROT protein
+sequence database. SWISS-PROT is derived from the NBRF Protein
+sequence database with additions from the EBI/EMBL DNA sequence
+database. This CD-ROM is a "best-buy," since it provides both
+DNA and protein sequence libraries. It is available from:
+
+
+ European Bioinformatics Institute
+ Hinxton Genome Campus, Hinxton Hall
+ Hinxton, Cambridge CB10 1RQ,
+ United Kingdom
+ Tel: +44 1223 4944
+ Fax: +44 1223 494468
+ Email: DATALIB@ebi.ac.uk
+
+
+
+ In addition, the SWISS-PROT protein sequence database is
+available via anonymous FTP from ncbi.nlm.nih.gov.
+
+2.3. Finding the libraries: FASTLIBS
+
+ FASTA and TFASTA use the environment variable FASTLIBS to
+find the protein and DNA sequence libraries. The FASTLIBS
+variable contains the name of a file that has the actual
+filenames of the libraries. The FASTGBS file is an example of
+a file that can be referred to by FASTLIBS. To use the FASTGBS
+file, type:
+
+ setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
+ or
+ export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)
+
+Then edit the FASTGBS file to indicate where the protein and DNA
+sequence libraries can be found. If you have a hard disk and
+your protein sequence library is kept in the file
+/usr/lib/seq/aabank.lib and your Genbank DNA sequence library is
+kept in the directory /usr/lib/genbank, then fastgbs might contain:
+
+ NBRF Protein$0P/usr/lib/seq/aabank.lib 0
+ SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
+ GB Primate$1P@/usr/lib/genbank/gpri.nam
+ GB Rodent$1R@/usr/lib/genbank/grod.nam
+ GB Mammal$1M@/usr/lib/genbank/gmammal.nam
+ ^   1    ^^^^             4             ^ ^
+           23                             (5)
+
+The first line of this file says that there is a copy of the NBRF
+protein sequence database (which is a protein database) in the
+file /usr/lib/seq/aabank.lib that can be selected by typing "P"
+on the command line or when the database menu is presented.
+
+ Note that there are 4 or 5 fields in the lines in fastgbs.
+The first field is the description of the library which will be
+displayed by FASTA; it ends with a '$'. The second field (1
+character), is a 0 if the library is a protein library and 1 if
+it is a DNA library. The third field (1 character) is the
+character to be typed to select the library.
+
+ The fourth field is the name of the library file. In the
+example above, the /usr/lib/seq/aabank.lib file contains the
+entire protein sequence library. However the DNA library file
+names are preceded by a '@', because these files (gpri.nam,
+grod.nam, gmammal.nam) do not contain the sequences; instead they
+contain the names of the files which contain the sequences. This
+is done because the GENBANK DNA database is broken down into a
+large number of smaller files. In order to search the entire
+primate database, you must search more than a dozen files.
+
+ In addition, an optional fifth field can be used to specify
+the format of the library file. Alternatively, you can specify
+the library format in a file of file names (a file preceded by an
+'@'). This field must be separated from the file name by a space
+character (' '). In the example above, the
+aabank.lib file is in Pearson/FASTA format, while the swiss.seq
+file is in PIR/VMS format (from the EMBL CD-ROM). Currently,
+FASTA can read the following formats:
+
+ 0 Pearson/FASTA (>SEQID - comment/sequence)
+ 1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
+ 2 NBRF CODATA (ENTRY/SEQUENCE)
+ 3 EMBL/SWISS-PROT (ID/DE/SQ)
+ 4 Intelligenetics (;comment/SEQID/sequence)
+ 5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
+ 6 GCG (version 8.0) Unix Protein and DNA (compressed)
+ 11 NCBI Blast1.3.2 format (unix only)
+
+In particular, this version will work with the EMBL and PIR VMS
+formats that are distributed on the EMBL CD-ROM. The latter
+format (PIR VMS) is much faster to search than EMBL format. This
+release also works with the protein and DNA database formats
+created for the BLASTP and BLASTN programs by SETDB and PRESSDB
+and with the new NCBI search format. If a library format is not
+specified, for example, because you are just comparing two
+sequences, Pearson/FASTA (format 0) is used by default. To
+change this default, you may set the LIBTYPE environment variable
+to a number. For example,
+
+ setenv LIBTYPE 1
+
+would cause the program to use the GenBank LOCUS format by
+default for libraries (or the second sequence file), but the
+Pearson/FASTA format would still be used for the query sequence.
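+
+ The field layout of a FASTLIBS line described above can be
+sketched as a small parser. This is an illustrative Python
+example under stated assumptions, not code from the FASTA package:
+the names LibraryEntry and parse_fastlibs_line are hypothetical,
+and the sketch assumes the description contains no '$' other than
+the terminating one.
+
```python
from typing import NamedTuple, Optional


class LibraryEntry(NamedTuple):
    description: str            # field 1: text shown in the library menu
    is_dna: bool                # field 2: '0' = protein, '1' = DNA
    selector: str               # field 3: character typed to pick the library
    filename: str               # field 4: library file, or file of file names
    is_file_list: bool          # True when field 4 was preceded by '@'
    lib_format: Optional[int]   # field 5: library format number, if present


def parse_fastlibs_line(line: str) -> LibraryEntry:
    """Split one FASTLIBS line into the fields described in the text."""
    # Field 1 ends at the first '$'.
    description, rest = line.rstrip("\n").split("$", 1)
    is_dna = rest[0] == "1"
    selector = rest[1]
    # Field 4 runs up to the first space; anything after it is field 5.
    parts = rest[2:].split(" ", 1)
    filename = parts[0]
    has_format = len(parts) > 1 and parts[1].strip() != ""
    lib_format = int(parts[1]) if has_format else None
    is_file_list = filename.startswith("@")
    if is_file_list:
        filename = filename[1:]
    return LibraryEntry(description, is_dna, selector, filename,
                        is_file_list, lib_format)


if __name__ == "__main__":
    print(parse_fastlibs_line("NBRF Protein$0P/usr/lib/seq/aabank.lib 0"))
    print(parse_fastlibs_line("GB Primate$1P@/usr/lib/genbank/gpri.nam"))
```
+
+When lib_format is None, the default described above (format 0, or
+the value of LIBTYPE) would apply.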
+
+ You can specify a group of library files by putting a '@'
+symbol before a file that contains a list of file names to be
+searched. For example, if @gmam.nam is in the fastgbs file, the
+file "gmam.nam" might contain the lines:
+
+ </usr/lib/genbank
+ gbpri.seq 1
+ gbrod.seq 1
+ gbmam.seq 1
+
+In this case, the line beginning with a '<' indicates the
+directory the files will be found in. The remaining lines name
+the actual sequence files. So the first sequence file to be
+searched would be:
+
+ /usr/lib/genbank/gbpri.seq
+
+The notation "<PIRNAQ:" might be used under the VAX/VMS operating
+system. Under UNIX, the trailing '/' is left off, so the library
+directory might be written as "</usr/seqlib".
+
+ With version 1.4 of the FASTA package, the FASTA and TFASTA
+programs can search a library composed of different files in
+different sequence formats. For example, you may wish to search
+the Genbank files (in GenBank flat file format) and the EMBL DNA
+sequence database on CD-ROM. To do this, you simply list the
+names and filetypes of the files to be searched in a file of
+filenames. For example, to search the mammalian portion of
+Genbank, the unannotated portion of Genbank, and the unannotated
+portion of the EMBL library, you could use the file:
+
+ </usr/lib/DNA
+ gbpri.seq 1
+ # (this '#' causes the program to display the size of the library)
+ gbrod.seq 1
+ gbmam.seq 1
+ gbuna.seq 1
+ unanno.seq 5
+ #
+
+ You do not need to include library format numbers if you
+ only use the Pearson/FASTA version of the PIR protein
+ sequence library. If no library type is specified, the
+ program assumes that type 0 is being used (unless you
+ have set LIBTYPE).
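+
+ The way a file of file names is expanded can be sketched as
+follows. This is an illustrative Python example, not code from the
+FASTA package: the function name expand_file_list is hypothetical,
+the directory and file name are joined with '/' (a UNIX
+assumption), and lines beginning with '#' are simply skipped here
+rather than triggering a library-size display.
+
```python
from typing import List, Tuple


def expand_file_list(lines: List[str],
                     default_format: int = 0) -> List[Tuple[str, int]]:
    """Return (path, library format) pairs from a file of file names."""
    directory = ""
    files: List[Tuple[str, int]] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            # '#' lines carry no file name; this sketch ignores them.
            continue
        if line.startswith("<"):
            # A '<' line names the directory for the files that follow.
            directory = line[1:]
            continue
        fields = line.split()
        name = fields[0]
        # A missing format number falls back to the default (type 0).
        lib_format = int(fields[1]) if len(fields) > 1 else default_format
        path = f"{directory}/{name}" if directory else name
        files.append((path, lib_format))
    return files


if __name__ == "__main__":
    example = [
        "</usr/lib/DNA",
        "gbpri.seq 1",
        "#",
        "gbrod.seq 1",
        "gbmam.seq 1",
        "gbuna.seq 1",
        "unanno.seq 5",
        "#",
    ]
    for path, lib_format in expand_file_list(example):
        print(path, lib_format)
```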
+
+Support for the old compressed GenBank files, which have not been
+distributed for more than four years, has been removed from
+programs in the FASTA package.
+
+
+ Test the setup by running FASTA. Enter the sequence file
+'MUSPLFM.AA' when the program requests it (this file is included
+with the programs). The program should then ask you to select a
+protein sequence library. Alternatively, if you run the TFASTA
+program and use the MUSPLFM.AA query sequence, the program should
+show you a selection of DNA sequence libraries. Once the fastgbs
+file has been set up correctly, you can set FASTLIBS=fastgbs in
+your AUTOEXEC.BAT file, and you will not need to remember where
+the libraries are kept or how they are named.
+
+ FASTA and TFASTA must open a large number of files when
+searching and reporting the results of a GENBANK floppy disk
+format library search. You may have problems with the large
+number of files under DOS on IBM-PC's (Unix and VMS users will
+not have these problems). If you are going to search the GENBANK
+floppy disk format DNA sequence library under DOS, you should add
+the line:
+
+ FILES=16
+
+to your CONFIG.SYS file. (Typically this is already done for
+programs like Windows or WordPerfect.)
+
+3. Using the FASTA Package
+
+3.1. Overview
+
+ The FASTA sequence comparison programs all require similar
+information: the name of a query sequence file, a library file,
+and the ktup parameter. All of the programs can accept arguments
+on the command line, or they will prompt for the file names and
+ktup value.
+
+To use FASTA, simply type:
+
+ FASTA
+ and you will be prompted for :
+ the name of the test sequence file
+ the name of the library file
+ and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
+
+ ktup of 2 is about 5 times faster than ktup = 1.
+ For a 200 aa sequence against a 10,000,000 aa
+ library, the program takes about 30 min with
+ ktup = 2, 150 min with ktup = 1, on a 12 MHz 286
+ IBM-PC.
+
+
+The program can also be run by typing
+
+ FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
+
+
+Included with the package are the test files, MUSPLFM.AA,
+LCBO.AA, MCHU.AA and BOVPRL.SEQ. To make certain that
+everything is working, you can try:
+
+ fasta musplfm.aa lcbo.aa
+ and
+ tfasta musplfm.aa bovprl.seq
+
+To test the local similarity programs LFASTA and PLFASTA, try:
+
+ lfasta mchu.aa mchu.aa
+ and
+ plfasta mchu.aa mchu.aa (use this only on an IBM-PC with graphics
+ or on a Tektronix terminal under UNIX or VMS)
+
+MCHU (calmodulin) has four duplicated calcium binding sites that
+are clearly detected by LFASTA. For a more complicated example,
+try MWRTC1.aa, myosin heavy chain.
+
+3.2. Sequence files
+
+ The FASTA programs know about three kinds of sequence files
+(four under VMS): (1) plain sequence files that can only be used
+as query sequences or for LFASTA, PRDF, and ALIGN. (2) Standard
+library files. These are the same as plain sequence files,
+except that each sequence is preceded by a comment line with a
+'>' in the first column. (3) Distributed sequence libraries (this
+is a broad class that includes the NBRF/PIR VMS and blocked ascii
+formats, Genbank flat-file format, EMBL flat-file format, and
+Intelligenetics format). All of the files that you create should
+be of type (1) or (2). Type (2) files (ones with a '>' comment
+line) can be used as query or library sequence files by all of
+the programs.
+
+ I have included several sample test files, *.AA. The first
+line may begin with a '>' or ';' followed by a comment. The
+text after ';' in other lines will be ignored. Spaces and
+tabs (and anything else that is not an amino-acid code) are
+ignored.
+
+ Library files should have the form:
+
+ >Sequence name and identifier
+ A F A S Y T .... actual sequence.
+ F S S .... second line of sequence.
+ >Next sequence name and identifier
+
+This is often referred to as "FASTA" or "Pearson" format. You
+can build your own library by concatenating several sequence
+files. Just be sure that each sequence is preceded by a line
+beginning with a '>' with a sequence name.
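+
+ As an illustration of the library layout above, the following
+Python sketch reads "FASTA" format text into (name, sequence)
+pairs. It is not code from the FASTA package: the function name
+read_fasta is hypothetical, and the sketch assumes that every
+non-letter character is discarded and that text after a ';' is a
+comment.
+
```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for each '>' entry in the given lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A new '>' line closes the previous entry, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Drop ';' comments, then keep letters only, so that
            # spaces, tabs and digits are ignored.
            text = line.split(";", 1)[0]
            chunks.append("".join(c for c in text.upper() if c.isalpha()))
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    library = [
        ">SEQ1 first test sequence",
        "A F A S Y T",
        "F S S",
        ">SEQ2 second test sequence",
        "MKVLA ; trailing comment",
    ]
    for name, sequence in read_fasta(library):
        print(name, sequence)
```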
+
+ The test file should not have lines longer than 120
+characters, and sequences entered with word processors should use
+a document mode, with normal carriage returns at the end of
+lines.
+
+Program Summary
+
+3.3. Sequence search programs
+
+FASTA universal sequence comparison. Defaults to comparing
+ protein sequences; if the sequences are > 85% A+C+G+T
+ or the -n option is used, a DNA sequence is assumed.
+
+FASTX Search a protein sequence library using amino acid
+ sequence comparison to the forward three frames of a
+ translated DNA query sequence. (The reverse frames are
+ specified with the -i option.) Alignment scores allow
+ frameshifts; the final alignment uses a Smith-Waterman
+ type alignment routine (no limit on gaps) that allows
+ frameshifts.
+
+TFASTA Search DNA library for a protein sequence by
+ translating the DNA sequence to protein in all six
+ frames (three forward frames with the -3 command line
+ option). TFASTA with ktup=2 is about as fast as a DNA
+ FASTA with ktup=4, and is substantially more sensitive.
+ (also reads the GENBANK library)
+
+TFASTX Search DNA library for a protein sequence by
+ translating the DNA sequence to protein in all six
+ frames (three forward frames with the -3 command line
+ option) calculating similarity scores that allow
+ frameshifts. TFASTX produces an optimal Smith-Waterman
+ alignment of the query and translated-library sequence.
+
+SSEARCH Universal sequence comparison using the Smith-Waterman
+ algorithm ( T. F. Smith and M. S. Waterman (1981) J.
+ Mol. Biol. 147:195-197). This program uses code
+ developed by Huang and Miller (X. Huang, R. C.
+ Hardison, W. Miller (1990) CABIOS 6:373-381) for
+ calculating the local similarity score and code from
+ the ALIGN program (see below) for calculating the local
+ alignment. SSEARCH is about 50-times slower than FASTA
+ with ktup=2 (for proteins).
+
+ALIGN optimal global alignment of two sequences with no
+ short-cuts. This program is a slightly modified
+ version of one taken from E. Myers and W. Miller. The
+ algorithm is described in E. Myers and W. Miller,
+ "Optimal Alignments in Linear Space" (CABIOS (1988)
+ 4:11-17).
+
+3.4. Local similarity programs
+
+LFASTA local similarity searches showing local alignments.
+ The algorithm used to calculate the local alignment in
+ a band has been improved (Chao, Pearson, and Miller,
+ submitted).
+
+PLFASTA local similarity searches with plot output (on the IBM,
+ this program requires that the environment variable
+ BGIDIR be set).
+
+PCLFASTA (unix only) local similarity searches with plot output
+ using pic commands.
+
+LALIGN Calculates the N-best local alignments using a rigorous
+ algorithm. (N=10 by default.) The algorithm was
+ developed by Huang and Miller (X. Huang and W. Miller
+ (1991) Adv. Appl. Math. 12:337-357), which is a
+ linear-space version of an algorithm described by M. S.
+ Waterman and M. Eggert (J. Mol. Biol. 197:723-728).
+ Like SSEARCH, LALIGN is rigorous, but also very slow.
+
+PLALIGN A version of LALIGN that plots its output to a screen
+ or to a Tektronix terminal emulator.
+
+3.5. Statistical Significance
+
+ With version 2.0 of the FASTA program distribution, FASTA,
+TFASTA, and SSEARCH now provide estimates of statistical
+significance for library searches. Work by Altschul, Arratia,
+Karlin, Mott, Waterman, and others (see Altschul et al. (1994)
+Nature Genetics 6:119 for an excellent review) suggests that
+local sequence similarity scores follow the extreme value
+distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u))),
+where u = ln(Kmn)/lambda and m,n are the lengths of the query and
+library sequence. This formula can be rewritten as: 1 - exp(-Kmn
+exp(-lambda x)), which shows that the average score for an
+unrelated library sequence increases with the logarithm of the
+length of the library sequence. FASTA and SSEARCH use simple
+linear regression against the log of the library sequence
+length to calculate a normalized "z-score" with mean 50,
+regardless of library sequence length, and variance 10. These
+z-scores can then be used with the extreme value distribution and
+the Poisson distribution (to account for the fact that each
+library sequence comparison is an independent test) to calculate
+the number of library sequences expected to obtain a score
+greater than or equal to the score obtained in the search. The
+original idea and routines to do the linear regression on library
+sequence length were provided by Phil Green, U. Washington. This
+version of FASTA and SSEARCH uses a slightly different strategy
+for fitting the data than those originally provided by Dr. Green.
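+
+ The two forms of the extreme value formula above can be
+compared numerically. The following Python sketch is an
+illustration only, not code from the FASTA package: the function
+names are hypothetical, and the values used for lambda and K are
+arbitrary placeholders rather than fitted parameters. Multiplying
+the probability by the number of library sequences is shown as one
+simple approximation to the expected number of sequences scoring
+above x.

```python
import math


def evd_tail(x, m, n, lam, k):
    """P(s > x) = 1 - exp(-K*m*n*exp(-lambda*x))."""
    return -math.expm1(-k * m * n * math.exp(-lam * x))


def evd_tail_location_form(x, m, n, lam, k):
    """Same probability written with u = ln(K*m*n)/lambda."""
    u = math.log(k * m * n) / lam
    return -math.expm1(-math.exp(-lam * (x - u)))


def expected_count(x, m, n, lam, k, library_size):
    """Approximate number of library sequences scoring above x."""
    return library_size * evd_tail(x, m, n, lam, k)


if __name__ == "__main__":
    lam, k = 0.25, 0.1   # placeholder values, not fitted parameters
    m, n = 200, 350      # query and library sequence lengths
    for score in (40, 60, 80):
        p = evd_tail(score, m, n, lam, k)
        e = expected_count(score, m, n, lam, k, library_size=10000)
        print(f"score {score}: P = {p:.3g}, E = {e:.3g}")
```

+ Because K, m, and n enter only through the product Kmn, doubling
+the library sequence length shifts u by ln(2)/lambda, which is the
+logarithmic length dependence described above.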
+
+ The expected number of sequences is plotted in the histogram
+using an "*". Since the parameters for the extreme value
+distribution are not calculated directly from the distribution of
+similarity scores, the pattern of "*'s" in the histogram gives a
+qualitative view of how well the statistical theory fits the
+similarity scores calculated by FASTA and SSEARCH. For FASTA, if
+optimized scores are calculated for each sequence in the database
+(the default), the agreement between the actual distribution of
+"z-scores" and the expected distribution based on the length
+dependence of the score and the extreme value distribution is
+usually very good. Likewise, the distribution of SSEARCH Smith-
+Waterman scores typically agrees closely with the actual
+distribution of "z-scores." The agreement with unoptimized
+scores, ktup=2, is often not very good, with too many high
+scoring sequences and too few low scoring sequences compared with
+the predicted relationship between sequence length and similarity
+score. In those cases, the expectation values may be
+overestimates.
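+
+ The length normalization behind these z-scores can be sketched
+as follows. This is a minimal Python illustration under stated
+assumptions, not the fitting strategy used by FASTA or SSEARCH:
+it uses an ordinary least-squares fit of score against the log of
+the library sequence length, and it treats the value 10 as the
+scale applied to the standardized residual. The function names
+and the sample data are hypothetical.

```python
import math


def fit_log_length(scores, lengths):
    """Least-squares fit of score = a + b*ln(length).

    Returns (a, b, sd), where sd is the residual standard deviation.
    """
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / (count - 2))
    return a, b, sd


def z_score(score, length, a, b, sd, mean=50.0, scale=10.0):
    """Length-corrected score, centered on `mean` (scale is assumed)."""
    predicted = a + b * math.log(length)
    return mean + scale * (score - predicted) / sd


if __name__ == "__main__":
    # Made-up scores that grow with ln(length), plus small deviations.
    lengths = [100, 200, 400, 800, 1600, 3200]
    noise = [1, -1, 1, -1, 1, -1]
    scores = [10 + 5 * math.log(n) + d for n, d in zip(lengths, noise)]
    a, b, sd = fit_log_length(scores, lengths)
    for s, n in zip(scores, lengths):
        print(n, round(z_score(s, n, a, b, sd), 1))
```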
+
+ The statistical routines assume that the library contains a
+large sample of unrelated sequences. If this is not the case,
+then the expectation values are meaningless. Likewise, if there
+are fewer than 20 sequences in the library, the statistical
+calculations are not done.
+
+ For protein searches, library sequences with E() values <
+0.01 for searches of a 10,000 entry protein database are almost
+always homologous. Frequently sequences with E()-values from 1 -
+10 are related as well. Remember, however, that these E() values
+also reflect differences between the amino acid composition of
+the query sequence and that of the "average" library sequence.
+Thus, when searches are done with query sequences with "biased"
+amino-acid composition, unrelated sequences may have
+"significant" scores because of sequence bias. The programs
+below, PRDF and PRSS, can address this problem by calculating
+similarity scores for random sequences with the same length and
+amino acid composition.
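+
+ The idea of generating random sequences with the same length
+and composition can be illustrated with a short Python sketch.
+It is not the PRDF or PRSS implementation; the function names and
+the window size are hypothetical. A uniform shuffle preserves the
+composition of the whole sequence, while a window shuffle
+preserves it within each consecutive window.

```python
import random
from collections import Counter


def shuffle_sequence(seq, rng):
    """Return a random permutation of the residues in seq."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def window_shuffle(seq, window, rng):
    """Shuffle residues only within consecutive windows of seq."""
    pieces = []
    for start in range(0, len(seq), window):
        pieces.append(shuffle_sequence(seq[start:start + window], rng))
    return "".join(pieces)


def same_composition(a, b):
    """True if both sequences contain the same residue counts."""
    return Counter(a) == Counter(b)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so runs are repeatable
    query = "MWRTCGPPYTMWKSCGYPYT"
    shuffled = shuffle_sequence(query, rng)
    local = window_shuffle(query, 10, rng)
    print(shuffled, same_composition(query, shuffled))
    print(local, same_composition(query, local))
```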
+
+ If optimization is not used ("-o"), E-values for DNA
+sequences overestimate the significance of the scores that are
+obtained and unrelated sequences frequently have E()-values <
+0.0005. With optimization, the accuracy of the E()-values
+compares favorably with protein sequence comparison. This is in
+part due to the use of more stringent gap penalties for DNA
+sequence comparison, -16, -4 rather than -12, -2. With the
+latter penalties, many unrelated sequences appear to have
+significant similarity. Nevertheless, since protein sequence
+comparison is much more sensitive, DNA sequence comparison should
+not be used to identify sequences that encode protein. Even with
+ktup=6, optimization rarely increases run-times more than 50%
+with mRNA-size query sequences. Optimization should be used
+whenever possible.
+
+ Similar comments apply to TFASTA, where higher gap
+penalties (-16,-4) are required for accurate statistical
+estimates. Because TFASTA produces so many artificial "coding"
+sequences with atypical amino acid compositions, the statistical
+estimates with TFASTA are often overestimates. With optimized
+scores, ktup=1, and gap penalties of -16, -4, unrelated sequences
+will sometimes have E() values of 0.1. If initn scores are used,
+unrelated sequences may have E() values < 0.01.
+
+PRDF improved version of RDF program that includes accurate
+ probability estimates for all three scoring methods
+ (includes local or window shuffle routine)
+
+PRSS A version of PRDF that uses the rigorous Smith-Waterman
+ calculation used by SSEARCH.
+
+RANDSEQ produces a randomly shuffled sequence from a query
+ sequence.
+
+RELATE significance program described by Dayhoff (Atlas of
+ Protein Sequence and Structure, Vol. 5, Supplement 3).
+ Each chunk of 25 residues in one sequence is compared
+ to every 25 residue fragment of the second sequence.
+ Sequences which are genuinely related will have a large
+ number of scores greater than 3 standard deviations
+ above the mean score of all of the comparisons.
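+
+ The segment comparison described for RELATE can be sketched in
+Python as follows. This is an illustration under simplifying
+assumptions, not the RELATE code: fragments are taken as every
+overlapping 25-residue window in both sequences, and each pair is
+scored by counting identical residues rather than with a scoring
+matrix. The function names are hypothetical.

```python
import statistics


def segment_scores(seq1, seq2, size=25):
    """Score every size-residue fragment of seq1 against every one of seq2."""
    scores = []
    for i in range(len(seq1) - size + 1):
        frag1 = seq1[i:i + size]
        for j in range(len(seq2) - size + 1):
            frag2 = seq2[j:j + size]
            # Identity count stands in for a matrix-based score.
            scores.append(sum(1 for x, y in zip(frag1, frag2) if x == y))
    return scores


def count_high_scores(scores, n_sd=3.0):
    """Count scores more than n_sd standard deviations above the mean."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    cutoff = mean + n_sd * sd
    return sum(1 for s in scores if s > cutoff)


if __name__ == "__main__":
    seq = "ACDEFGHIKLMNPQRSTVWY" * 2
    scores = segment_scores(seq, seq)
    print(len(scores), count_high_scores(scores))
```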
+
+3.6. Other analysis programs
+
+AACOMP calculate the amino acid composition and molecular
+ weight of a sequence.
+
+BESTSCOR calculate the best self-comparison score.
+
+GREASE Kyte-Doolittle hydropathicity profile
+
+TGREASE graphic plot of Kyte-Doolittle profile
+
+FROMGB convert from GenBank LOCUS format (also used by the
+ IBI-Pustell programs) to Pearson/FASTA format.
+
+GARNIER A secondary structure prediction program using the
+ method of Garnier, Osguthorpe, and Robson, J. Mol.
+ Biol., (1978) 120:97-120.
+
+3.7. Options
+
+ These programs have a number of output options, which are
+invoked by the environment variables LINLEN, SHOWALL, and MARKX.
+Alternatively, these values can be controlled by command line
+options. The number of sequence residues per output line is now
+adjustable by setting the environment variable LINLEN, or the
+command line option -w. LINLEN is normally 60; to change it, set
+LINLEN=80 before running the program or add -w 80 to the command
+line. LINLEN can be set up to 200. SHOWALL (-a) determines
+whether all, or just a portion, of the aligned sequences are
+displayed. Previously, FASTP would show the entire length of
+both sequences in an alignment while FASTN would only show the
+portions of the two sequences that overlapped. Now the default is
+to show only the overlap between the two sequences. To show
+complete sequences, set SHOWALL=1, or use the -a option on the
+command line.
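+
+ The precedence between LINLEN and -w can be sketched as
+follows. This Python fragment is illustrative only: the function
+name is hypothetical, the rule that the command line takes
+priority follows the statement in the environment variable
+summary, and limiting the value to 200 is an assumption about how
+an oversized setting would be handled.

```python
def resolve_line_length(argv, environ, default=60, maximum=200):
    """Pick the output line length from -w, then LINLEN, then the default."""
    value = default
    if "LINLEN" in environ:
        value = int(environ["LINLEN"])
    if "-w" in argv:
        # The value follows the option, as in "-w 80".
        value = int(argv[argv.index("-w") + 1])
    return min(value, maximum)


if __name__ == "__main__":
    print(resolve_line_length([], {}))
    print(resolve_line_length([], {"LINLEN": "100"}))
    print(resolve_line_length(["-w", "80", "-a"], {"LINLEN": "100"}))
```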
+
+ The differences between the two aligned sequences can be
+highlighted in several different ways by changing the environment
+variable MARKX or the -m option. Normally (MARKX=0) the program
+uses ':' to denote identities and '.' to denote conservative
+replacements. If MARKX=1, the program will not mark identities;
+instead conservative replacements are denoted by an 'x' and non-
+conservative substitutions by an 'X'. If MARKX=2, the residues in
+the second sequence are only shown if they are different from the
+first. MARKX=3 displays the aligned library sequences without the
+query sequence; these can be used to build a primitive multiple
+alignment. MARKX=4 provides a graphical display of the
+boundaries of the alignments. Thus the five options are:
+
+
+    MARKX=0     MARKX=1     MARKX=2     MARKX=3     MARKX=4
+
+    MWRTCGPPYT  MWRTCGPPYT  MWRTCGPPYT              MWRTCGPPYT
+    ::..:: :::    xx  X     ..KS..Y...  MWKSCGYPYT  ----------
+    MWKSCGYPYT  MWKSCGYPYT
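+
+ The first three marking styles can be reproduced for the
+example above with a short Python sketch. It is not the program's
+own routine: the function name is hypothetical, and the set of
+conservative replacements is a two-pair stand-in chosen to match
+the example rather than anything derived from a scoring matrix.

```python
# Hypothetical stand-in for "conservative replacement" pairs.
CONSERVATIVE = {frozenset("RK"), frozenset("TS")}


def mark_line(query, library, markx):
    """Build the annotation line for MARKX=0, 1, or 2."""
    out = []
    for q, l in zip(query, library):
        same = q == l
        conservative = frozenset((q, l)) in CONSERVATIVE
        if markx == 0:
            # ':' identity, '.' conservative, blank otherwise
            out.append(":" if same else "." if conservative else " ")
        elif markx == 1:
            # identities unmarked, 'x' conservative, 'X' non-conservative
            out.append(" " if same else "x" if conservative else "X")
        elif markx == 2:
            # library residue shown only where it differs
            out.append("." if same else l)
        else:
            raise ValueError("only MARKX 0, 1, and 2 are sketched here")
    return "".join(out)


if __name__ == "__main__":
    for style in (0, 1, 2):
        print(repr(mark_line("MWRTCGPPYT", "MWKSCGYPYT", style)))
```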
+
+
+(fasta20u4, Feb. 1996) In addition MARKX=10 is a new, parseable
+format for use with other programs. See the file "readme.v20u4"
+for a more complete description.
+
+3.8. Command line options
+
+ It is now possible to specify several options on the
+command line, instead of using environment variables. The
+command line options are preceded by a dash; the following
+options are available:
+
+-a same as showall=1
+
+-A force Smith-Waterman alignments for DNA sequences and
+ TFASTA. By default, only FASTA protein sequence
+ comparisons use Smith-Waterman alignments.
+
+-b # Number of sequence scores to be shown on output. In
+ the absence of this option, fasta (and tfasta and
+ ssearch) display all library sequences obtaining
+ similarity scores with expectations less than 10.0 if
+ optimized scores are used, or 2.0 if they are not. The
+ -b option can limit the display further, but it will
+ not cause additional sequences to be displayed.
+
+-c # Threshold score for optimization (OPTCUT). Set "-c 1"
+ to optimize every sequence in a database. (This slows
+ the program down about 5-fold).
+
+-E # Limit the number of scores and alignments shown based
+ on the expected number of scores. Used to override the
+ expectation value of 10.0 used by default. When used
+ with -Q, -E 2.0 will show all library sequences with
+ scores with an expectation value <= 2.0.
+
+-d # Number of alignments to be reported by default. (Used
+ in conjunction with -Q). No longer necessary, see "-b"
+ above.
+
+-f Penalty for the first residue in a gap (-12 by default
+ for proteins, -16 for DNA or for TFASTA).
+
+-g Penalty for additional residues in a gap (-2 by default
+ for proteins, -4 for DNA and TFASTA ).
+
+-h Penalty for frameshift (FASTX, TFASTX only).
+
+-H Omit histogram.
+
+-i Invert (reverse complement) the query sequence if it is
+ DNA. For TFASTX, search the reverse complement of the
+ library sequence only.
+
+-k # Threshold for joining init1 segments to build an initn
+ score (GAPCUT).
+
+-l file Location of library menu file (FASTLIBS).
+
+-L Display more information about the library sequence in
+ the alignment.
+
+-m # MARKX = # (0, 1, 2, 3, 4, 10)
+
+-n Force the query sequence to be treated as a DNA
+ sequence. This is particularly useful for query
+ sequences that contain a large number of ambiguous
+ residues, e.g. transcription factor binding sites.
+
+-O Send copy of results to "filename." Helpful for
+ environments without STDOUT.
+
+-o Turn off default optimization of all scores greater
+ than OPTCUT. Sort results by "initn" scores.
+
+-Q,-q Quiet - does not prompt for any input. Writes scores
+ and alignments to the terminal or standard output file.
+
+-r file Save a results summary line for every sequence in the
+ sequence library. The summary line includes the
+ sequence identifier, superfamily number (if available)
+ position in the library, and the similarity scores
+ calculated. This option can be used to evaluate the
+ sensitivity and selectivity of different search
+ strategies (see W. R. Pearson (1991) Genomics 11:635-
+ 650.)
+
+-s file SMATRIX is read from file. Several SMATRIX files are
+ provided with the standard distribution. For protein
+ sequences: codaa.mat - based on minimum mutation
+ matrix; idnaa.mat - identity matrix; pam250.mat - the
+ PAM250 matrix developed by Dayhoff et al (Atlas of
+ Protein Sequence and Structure, vol. 5, suppl. 3,
+ 1978); pam120.mat - a PAM120 matrix. The default
+ scoring matrix is BLOSUM50, PAM250 is available with
+ "-s 250", BLOSUM62 ("-s BL62") is also available.
+
+-v (LINEVAL) values used for line styles in plfasta
+
+-w # Line length (width) = number (<200)
+
+-x Specifies offsets for the beginning of the query and
+ library sequence. For example, if you are comparing
+ upstream regions for two genes, and the first sequence
+ contains 500 nt of upstream sequence while the second
+ contains 300 nt of upstream sequence, you might try:
+
+ fasta -x "-500 -300" seq1.nt seq2.nt
+
+ If the -x option is not used, FASTA assumes numbering
+ starts with 1. This option will not work properly with
+ the translated library sequence with tfasta. (You
+ should double check to be certain the negative
+ numbering works properly.)
+
+-y Set the width of the band used for calculating
+ "optimized" scores. For proteins and ktup=2, the width
+ is 16. For proteins with ktup=1, the width is 32 by
+ default. For DNA the width is 16.
+
+-z Turn off statistical calculations.
+
+-1 sort output by init1 score (as FASTP used to do).
+
+-3 (TFASTA, TFASTX only) translate only three forward
+ frames
+
+
+For example:
+
+ fasta -w 80 -a seq1.aa seq2.aa
+
+would compare the sequence in seq1.aa to that in seq2.aa and
+display the results with 80 residues on an output line, showing
+all of the residues in both sequences. Be sure to enter the
+options before entering the file names, or just enter the options
+on the command line, and the program will prompt for the file
+names.
+
+ Not all of these options are appropriate for all of the
+programs. The options above are used by FASTA and TFASTA. RELATE
+uses the -s option, ALIGN uses the -w, -m, and -s options, and
+the PRDF program uses -c, -f, -k, and -s.
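+
+ The effect of the -f and -g penalties can be illustrated with a
+small local alignment score calculation. The Python sketch below
+is an illustration under stated assumptions, not the code used by
+FASTA or SSEARCH: it uses a fixed match/mismatch score (+5/-4,
+hypothetical) in place of a scoring matrix, and it charges the -f
+value for the first residue of a gap and the -g value for each
+additional residue, so that a gap of k residues costs f + (k-1)g.
+The function names are hypothetical.

```python
def gap_cost(length, gap_first=-12, gap_extend=-2):
    """Total penalty for a gap of `length` residues (length >= 1)."""
    return gap_first + (length - 1) * gap_extend


def local_score(a, b, match=5, mismatch=-4, gap_first=-12, gap_extend=-2):
    """Best local alignment score with affine gap penalties."""
    neg = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)    # best score ending at the previous row
    f_prev = [neg] * (n + 1)  # previous row, alignment ends with a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h = [0] * (n + 1)
        f = [neg] * (n + 1)
        e = neg               # current row, alignment ends with a gap in a
        for j in range(1, n + 1):
            e = max(h[j - 1] + gap_first, e + gap_extend)
            f[j] = max(h_prev[j] + gap_first, f_prev[j] + gap_extend)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            h[j] = max(0, h_prev[j - 1] + pair, e, f[j])
            best = max(best, h[j])
        h_prev, f_prev = h, f
    return best


if __name__ == "__main__":
    print(gap_cost(1), gap_cost(3))
    print(local_score("AAAAACCCCC", "AAAAAGGCCCCC"))
    # Stricter penalties, as used for DNA and TFASTA by default.
    print(local_score("AAAAACCCCC", "AAAAAGGCCCCC",
                      gap_first=-16, gap_extend=-4))
```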
+
+4. Environment variable summary
+
+ Environment variables allow you to set search parameters
+that will be used frequently when you run a program; for example,
+if you prefer to use the PAM250 scoring matrix, you might "set
+SMATRIX=250." Command line parameters, if used, always override
+environment variable settings. The following environment
+variables are used by this program:
+
+AABANK the file name of the default sequence library.
+
+FASTLIBS the location of the file which contains the list of
+ library files to be searched.
+
+GAPCUT threshold used for joining init1 regions in the second
+ step of FASTA. Normally set based on sequence length
+ and ktup.
+
+LIBTYPE used to specify the format of the library sequence for
+ FASTA and TFASTA.
+
+LINLEN output line length - can go up to 200
+
+LINEVAL used by plfasta to determine the relationship between
+ line style and similarity score (-v). This should be a
+ string of three numbers, e.g. "200 100 50"
+
+MARKX symbol for denoting matches, mismatches. Note that this
+ symbol is only used across the optimized local region;
+ sequences that are outside this region are not marked.
+
+OPTCUT Set the threshold to be used for optimization in a band
+ around the best initial region. Normally the OPTCUT
+ value is calculated from the length of the sequence and
+ the ktup value (for a 200 residue sequence, it is about
+ 28). If OPTCUT=1, every sequence in the database will
+ be optimized. This is the most sensitive option.
+
+PAMFACT This version of fasta uses a more sensitive method for
+ identifying initial regions. Instead of using a
+ constant factor (fact) for each match in a ktup, it
+ uses the scoring matrix (PAM) scores. While this works
+ well for protein sequences, it has not been as
+ carefully tested for DNA sequences, so by default, this
+ modification is used for proteins but not for DNA.
+ Setting the PAMFACT environment variable to 1 forces
+ the option on; PAMFACT=0 turns it off.
+
+SHOWALL on output, show the complete sequence instead of just
+ the overlap of the two aligned sequences.
+
+SMATRIX alternative scoring matrix file.
+
+TEKPLOT (IBM-PC only, Unix and VMS versions generate Tektronix
+ graphics by default) Generate Tektronix output.
+ Normally, PLFASTA and TGREASE plot graphs using the
+ Turbo C graphics library. Unfortunately, often these
+ plots cannot be printed out without special programs.
+ However, if you set TEKPLOT=1, tektronix graphics
+ commands will be used. Tektronix commands can be used
+ together with the PLOTDEV program, available from
+ Microplot Systems. They no longer sell this program,
+ but it can be downloaded from
+ http://iquest.com/~microplt/index1.html. PLOTDEV also
+ allows you to print out graphics on the screen.
+
+As always, please inform me of bugs as soon as possible.
+
+William R. Pearson
+Department of Biochemistry
+Box 440, Jordan Hall
+U. of Virginia
+Charlottesville, VA
+
+wrp@virginia.EDU