--- /dev/null
+(Updated December, 2003)
+
+
+ COPYRIGHT NOTICE
+
+Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William R.
+Pearson and the University of Virginia. All rights reserved. The
+FASTA program and documentation may not be sold or incorporated
+into a commercial product, in whole or in part, without written
+consent of William R. Pearson and the University of Virginia.
+For further information regarding permission for use or
+reproduction, please contact: David Hudson, Assistant Provost for
+Research, University of Virginia, P.O. Box 9025, Charlottesville,
+VA 22906-9025, (434) 924-6853
+
+The FASTA program package
+
+Introduction
+
+ This documentation describes version 3 of the FASTA
+program package (see W. R. Pearson and D. J. Lipman (1988),
+"Improved Tools for Biological Sequence Analysis", PNAS
+85:2444-2448 (Pearson and Lipman, 1988); W. R. Pearson (1996)
+"Effective protein sequence comparison" Meth. Enzymol.
+266:227-258 (Pearson, 1996); Pearson et al. (1997) Genomics
+46:24-36 (Zhang et al., 1997); Pearson, (1999) Meth. in
+Molecular Biology 132:185-219 (Pearson, 2000)). Version 3 of the
+FASTA package contains many programs for searching DNA and
+protein databases and one program (prss3) for evaluating
+statistical significance from randomly shuffled sequences.
+Several additional analysis programs, including programs that
+produce local alignments, are available as part of version 2 of
+the FASTA package, which is still available.
+
+ This document is divided into three sections: (1) A summary
+overview of the programs in the FASTA3 package; (2) A guide to
+installing the programs and databases; (3) A guide to using the
+FASTA programs. The revision history of the programs can be found
+in the readme.v30..v34 files. The programs are easy to use, so
+if you are using them on a machine that is administered by
+someone else, you can skip section (2) and focus on (1) and (3)
+to learn how to use the programs. If you are installing the
+programs on your own machine, you will need to read section (2)
+carefully.
+
+1. An overview of the FASTA programs
+
+ Although there are a large number of programs in this
+package, they belong to three groups: (1) "Conventional" Library
+search programs: FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3,
+TFASTY3, SSEARCH3; (2) Programs for searching with short
+fragments: FASTS3, FASTF3, TFASTS3, TFASTF3; (3) Statistical
+significance: PRSS3. Programs that start with fast search
+protein databases, while tfast programs search translated DNA
+databases. Table I gives a brief description of the programs.
+
+
+ Table I. Comparison programs in the FASTA3 package
+
+---------------------------------------------------------------------------
+fasta3 Compare a protein sequence to a protein sequence
+ database or a DNA sequence to a DNA sequence database
+ using the FASTA algorithm (Pearson and Lipman, 1988,
+ Pearson, 1996). Search speed and selectivity are
+ controlled with the ktup (wordsize) parameter. For
+ protein comparisons, ktup=2 by default; ktup=1 is more
+ sensitive but slower. For DNA comparisons, ktup=6 by
+ default; ktup=3 or ktup=4 provides higher sensitivity;
+ ktup=1 should be used for oligonucleotides (DNA query
+ lengths < 20).
+
+ssearch3 Compare a protein sequence to a protein sequence
+ database or a DNA sequence to a DNA sequence database
+ using the Smith-Waterman algorithm (Smith and Water-
+ man, 1981). ssearch3 is about 10-times slower than
+ FASTA3, but is more sensitive for full-length protein
+ sequence comparison.
+
+fastx3/ fasty3 Compare a DNA sequence to a protein sequence database,
+ by comparing the translated DNA sequence in three
+ frames and allowing gaps and frameshifts. fastx3 uses
+ a simpler, faster algorithm for alignments that allows
+ frameshifts only between codons; fasty3 is slower but
+ produces better alignments with poor quality sequences
+ because frameshifts are allowed within codons.
+
+tfastx3/ tfasty3 Compare a protein sequence to a DNA sequence database,
+ calculating similarities with frameshifts to the for-
+ ward and reverse orientations.
+
+tfasta3 Compare a protein sequence to a DNA sequence database,
+ calculating similarities (without frameshifts) to the three
+ forward and three reverse reading frames. tfastx3 and
+ tfasty3 are preferred because they calculate similarity
+ over frameshifts.
+
+fastf3/tfastf3 Compares an ordered peptide mixture, as would be
+ obtained by Edman degradation of a CNBr cleavage of a
+ protein, against a protein (fastf) or DNA (tfastf)
+ database.
+
+fasts3/tfasts3 Compares a set of short peptide fragments, as would be
+ obtained from mass-spec. analysis of a protein, against
+ a protein (fasts) or DNA (tfasts) database.
+---------------------------------------------------------------------------
+
+2. Installing FASTA and the sequence databases
+
+2.1. Obtaining the libraries
+
+ The FASTA program package does not include any protein or
+DNA sequence libraries. Protein databases are available on
+CD-ROM from the PIR and EMBL (see below), or via anonymous FTP from
+many different sources. As this document is updated in the fall
+of 1999, no DNA databases are available on CD-ROM from the major
+sequence databases: Genbank at the National Center for Biotechnology
+Information (www.ncbi.nlm.nih.gov and ftp://ncbi.nlm.nih.gov) and
+EMBL at the European Bioinformatics Institute (www.ebi.ac.uk).
+However, the databases are available via anonymous FTP from both
+sites.
+
+2.1.1. The GENBANK DNA sequence library
+
+ Because of the large size of DNA databases, you will
+probably want to keep DNA databases in only one, or possibly two,
+formats. The FASTA3 programs that search DNA databases (fasta3,
+tfastx/y3, and tfasta3) can read DNA databases in Genbank flatfile
+(not ASN.1), FASTA, GCG/compressed-binary, BLAST1.4 (pressdb),
+and BLAST2.0 (formatdb) formats, as well as EMBL format. If you
+are also running the GCG suite of sequence analysis programs, you
+should use GCG/compressed-binary format or BLAST2.0 format for
+your fasta3 searches. If not, BLAST2.0 is a good choice. These
+files are considerably more compact than Genbank flat files, and
+are preferred. The NCBI does not provide software for converting
+from Genbank flat files to Blast2.0 DNA databases, but you can
+use the Blast formatdb program to convert ASN.1 formatted Genbank
+files, which are available from the NCBI ftp site.
+
+ The NCBI also provides the nr, swissprot, and several EST
+databases that are used by BLAST in FASTA format from:
+ftp://ncbi.nlm.nih.gov/blast/db. These databases are updated
+nightly.
+
+2.1.2. The NBRF protein sequence library
+
+ You can obtain the PIR protein sequence database (Barker et
+al., 1998) from:
+
+ National Biomedical Research Foundation
+ Georgetown University Medical Center
+ 3900 Reservoir Rd, N.W.
+ Washington, D.C. 20007
+
+or via ftp from nbrf.georgetown.edu or from the NCBI
+(ncbi.nlm.nih.gov/repository/PIR). The data in the ascii
+directory is in PIR Codata format, which is not widely used. I
+recommend the PIR/VMS format data (libtype=5) in the vms
+directory.
+
+2.1.3. The EBI/EMBL CD-ROM libraries
+
+ The European Bioinformatics Institute (EBI) distributes both
+the EMBL DNA database and the SwissProt database on CD-ROM
+(Bairoch and Apweiler, 1996), and they are available from:
+
+ EMBL-Outstation European Bioinformatics Institute
+ Wellcome Trust Genome Campus,
+ Hinxton Hall
+ Hinxton,
+ Cambridge CB10 1SD
+ United Kingdom
+ Tel: +44 (0)1223 494444
+ Fax: +44 (0)1223 494468
+ Email: DATALIB@ebi.ac.uk
+
+In addition, the SWISS-PROT protein sequence database is
+available via anonymous FTP from
+ftp://ftp.expasy.ch/databases/swiss-prot/ (also see
+www.expasy.ch).
+
+2.2. Finding the libraries: FASTLIBS
+
+ The major problem that most new users of the FASTA package
+have is in setting up the program to find the databases and their
+library type. In general, if you cannot get fasta3 to read a
+sequence database, it is likely that something is wrong with the
+FASTLIBS file. A common problem is that the database file is
+found, but either no sequences are read, or an incorrect number
+of entries is read. This is almost always because the library
+format (libtype) is incorrect. Note that a type 5 file (PIR/VMS
+format) can be read as a type 0 (default FASTA) format file, and
+the number of entries will be correct, but the sequence lengths
+will not.
+
+ All the search programs in the FASTA3 package use the
+environment variable FASTLIBS to find the protein and DNA
+sequence libraries. The FASTLIBS variable contains the name of a
+file that has the actual filenames of the libraries. The
+fastlibs file included with the distribution is an example of
+a file that can be referred to by FASTLIBS. To use the fastlibs
+file, type:
+
+ setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
+ or
+ export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)
+
+Then edit the fastlibs file to indicate where the protein and DNA
+sequence libraries can be found. If you have a hard disk and
+your protein sequence library is kept in the file
+/usr/lib/seq/aabank.lib and your Genbank DNA sequence library is kept
+in the directory: /usr/lib/genbank, then fastgbs might contain:
+
+ NBRF Protein$0P/usr/lib/seq/aabank.lib 0
+ SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
+ GB Primate$1P@/usr/lib/genbank/gpri.nam
+ GB Rodent$1R@/usr/lib/genbank/grod.nam
+ GB Mammal$1M@/usr/lib/genbank/gmammal.nam
+ ^ 1 ^^^^ 4 ^ ^
+ 23 (5)
+
+The first line of this file says that there is a copy of the NBRF
+protein sequence database (which is a protein database) in the
+file /usr/lib/seq/aabank.lib, and that it can be selected by
+typing "P" on the command line or when the database menu is
+presented.
+
+ Note that there are 4 or 5 fields in the lines in fastgbs.
+The first field is the description of the library which will be
+displayed by FASTA; it ends with a '$'. The second field (1
+character), is a 0 if the library is a protein library and 1 if
+it is a DNA library. The third field (1 character) is the
+character to be typed to select the library.
+
+ The fourth field is the name of the library file. In the
+example above, the /usr/lib/seq/aabank.lib file contains the
+entire protein sequence library. However the DNA library file
+names are preceded by a '@', because these files (gpri.nam,
+grod.nam, gmammal.nam) do not contain the sequences; instead they
+contain the names of the files which contain the sequences. This
+is done because the GENBANK DNA database is broken down into a
+large number of smaller files. In order to search the entire
+primate database, you must search more than a dozen files.
+
+ In addition, an optional fifth field can be used to specify
+the format of the library file. Alternatively, you can specify
+the library format in a file of file names (a file preceded by an
+'@'). This field must be separated from the file name by a space
+character (' '). In the example above, the
+aabank.lib file is in Pearson/FASTA format, while the swiss.seq
+file is in PIR/VMS format (from the EMBL CD-ROM). Currently,
+FASTA can read the following formats:
+
+ 0 Pearson/FASTA (>SEQID - comment/sequence)
+ 1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
+ 2 NBRF CODATA (ENTRY/SEQUENCE)
+ 3 EMBL/SWISS-PROT (ID/DE/SQ)
+ 4 Intelligenetics (;comment/SEQID/sequence)
+ 5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
+ 6 GCG (version 8.0) Unix Protein and DNA (compressed)
+ 11 NCBI Blast1.3.2 format (unix only)
+ 12 NCBI Blast2.0 format (unix only, fasta32t08 or later)
+
+In particular, this version will work with the EMBL and PIR VMS
+formats that are distributed on the EMBL CD-ROM. The latter
+format (PIR VMS) is much faster to search than EMBL format. This
+release also works with the protein and DNA database formats
+created for the BLASTP and BLASTN programs by SETDB and PRESSDB
+and with the new NCBI search format. If a library format is not
+specified, for example, because you are just comparing two
+sequences, Pearson/FASTA (format 0) is used by default. To
+specify a library type on the command line, add it to the library
+filename and surround the filename and library type in quotes:
+
+ fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"
+
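+The field layout described above can be illustrated with a short
+Python sketch. This is not code from the FASTA package; the names
+LibEntry and parse_fastlibs_line are made up for the illustration,
+and the sketch assumes the description contains no '$' and that
+the optional library type is separated from the file name by a
+single space.
+

```python
from typing import NamedTuple, Optional


class LibEntry(NamedTuple):
    """One line of a FASTLIBS file (hypothetical helper type)."""
    description: str        # text shown in the library menu (before '$')
    is_dna: bool            # second field: '0' = protein, '1' = DNA
    selector: str           # third field: one-character menu key
    path: str               # fourth field, without any leading '@'
    is_name_file: bool      # True if the path was preceded by '@'
    libtype: Optional[int]  # optional fifth field; None if absent


def parse_fastlibs_line(line: str) -> LibEntry:
    """Split one FASTLIBS line into the fields described in the text."""
    description, rest = line.strip().split("$", 1)
    kind, selector, target = rest[0], rest[1], rest[2:]
    if kind not in ("0", "1"):
        raise ValueError(f"expected 0 or 1 after '$', got {kind!r}")
    # The library type, if present, follows the file name after a space.
    name, _, fmt = target.partition(" ")
    return LibEntry(
        description=description,
        is_dna=(kind == "1"),
        selector=selector,
        path=name.lstrip("@"),
        is_name_file=name.startswith("@"),
        libtype=int(fmt) if fmt.strip() else None,
    )


entry = parse_fastlibs_line("SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5")
print(entry.selector, entry.path, entry.libtype)
```

+
+A libtype of None here simply means the field was absent; the
+programs themselves fall back to format 0, or to the formats
+listed inside a file of file names.
+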
+ You can specify a group of library files by putting a '@'
+symbol before a file that contains a list of file names to be
+searched. For example, if @gmam.nam is in the fastgbs file, the
+file "gmam.nam" might contain the lines:
+
+ </seqdb/genbank
+ gbpri1.seq 1
+ gbpri2.seq 1
+ gbpri3.seq 1
+ gbpri4.seq 1
+ gbrod.seq 1
+ gbmam.seq 1
+
+In this case, the line beginning with a '<' indicates the
+directory the files will be found in. The remaining lines name
+the actual sequence files. So the first sequence file to be
+searched would be:
+
+ /seqdb/genbank/gbpri1.seq
+
+The notation "<PIRNAQ:" might be used under the VAX/VMS operating
+system. Under UNIX, the trailing '/' is left off, so the library
+directory might be written as "</usr/seqlib".
+
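+As a rough illustration of how a file of file names expands into
+full paths, here is a small Python sketch. It is not taken from
+the FASTA sources; expand_name_file is a hypothetical helper, and
+the rule for joining the directory (add '/' unless the '<' line
+already ends in '/' or ':') is an assumption based on the UNIX
+and VAX/VMS notes above.
+

```python
from typing import List, Optional, Tuple


def expand_name_file(lines: List[str]) -> List[Tuple[str, int]]:
    """Return (path, libtype) pairs for the lines of a file of file names."""
    directory: Optional[str] = None
    files: List[Tuple[str, int]] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # '#' lines only ask the program to report library size
        if line.startswith("<"):
            directory = line[1:]  # directory for the names that follow
            continue
        parts = line.split()
        name = parts[0]
        libtype = int(parts[1]) if len(parts) > 1 else 0  # type 0 by default
        if directory is None:
            path = name
        elif directory.endswith(("/", ":")):
            path = directory + name        # e.g. a VMS-style "PIRNAQ:" prefix
        else:
            path = directory + "/" + name  # UNIX: '/' is left off the '<' line
        files.append((path, libtype))
    return files


names = ["</seqdb/genbank", "gbpri1.seq 1", "gbpri2.seq 1", "gbmam.seq 1"]
for path, libtype in expand_name_file(names):
    print(path, libtype)
```

+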
+ The FASTA programs can search a database composed of
+different files in different sequence formats. For example, you
+may wish to search the Genbank files (in GenBank flat file
+format) and the EMBL DNA sequence database on CD-ROM. To do
+this, you simply list the names and filetypes of the files to be
+searched in a file of filenames. For example, to search the
+mammalian portion of Genbank, the unannotated portion of Genbank,
+and the unannotated portion of the EMBL library, you could use
+the file:
+
+ </usr/lib/DNA
+ gbpri.seq 1
+ # (this '#' causes the program to display the size of the library)
+ gbrod.seq 1
+ ...
+ gbmam.seq 1
+ ...
+ gbuna.seq 1
+ ...
+ unanno.seq 5
+ #
+
+ You do not need to include library format numbers if you
+ only use the Pearson/FASTA version of the PIR protein se-
+ quence library. If no library type is specified, the
+ program assumes that type 0 is being used.
+
+ Test the setup by running FASTA. Enter the sequence file
+'mgstm1.aa' when the program requests it (this file is included
+with the programs). The program should then ask you to select a
+protein sequence library. Alternatively, if you run the TFASTA
+program and use the mgstm1.aa query sequence, the program should
+show you a selection of DNA sequence libraries. Once the fastgbs
+file has been set up correctly, you can set FASTLIBS=fastgbs in
+your AUTOEXEC.BAT file, and you will not need to remember where
+the libraries are kept or how they are named.
+
+3. Using the FASTA Package
+
+3.1. Overview
+
+ The FASTA sequence comparison programs all require similar
+information: the name of a query sequence file, a library file,
+and the ktup parameter. All of the programs can accept arguments
+on the command line, or they will prompt for the file names and
+ktup value.
+
+To use FASTA, simply type:
+
+ FASTA
+ and you will be prompted for :
+ the name of the test sequence file
+ the name of the library file
+ and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
+ (ktup of 2 is about 5 times faster than ktup = 1)
+
+The program can also be run by typing
+
+ FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
+
+Included with the package are several test files. To check to
+make certain that everything is working, you can try:
+
+ fasta musplfm.aa prot_test.lib
+ and
+ tfastx mgstm1.aa gst.nlib
+
+3.2. Sequence files
+
+ The fasta3 programs know about three kinds of sequence
+files: (1) plain sequence files - files that contain nothing but
+sequence residues - can only be used as query sequences. (2)
+FASTA format files. These are the same as plain sequence files,
+except that each sequence is preceded by a comment line with a
+'>' in the first column. (3) distributed sequence libraries (this
+is a broad class that includes the NBRF/PIR VMS and blocked ascii
+formats, Genbank flat-file format, EMBL flat-file format, and
+Intelligenetics format). All of the files that you create should
+be of type (1) or (2). FASTA format files (ones with a '>' and
+comment before the sequence) are preferred, because they can be
+used as query or library sequence files by all of the programs.
+
+ I have included several sample test files, *.aa and *.seq as
+well as two small sequence libraries, prot_test.lib and gst.nlib.
+The first line may begin with a '>' followed by a comment. Spaces and
+tabs (and anything else that is not an amino-acid code) are
+ignored.
+
+ Library files should have the form:
+
+ >Sequence name and identifier
+ A F A S Y T .... actual sequence.
+ F S S .... second line of sequence.
+ >Next sequence name and identifier
+
+This is often referred to as "FASTA" or "Pearson" format. You can build
+your own library by concatenating several sequence files. Just
+be sure that each sequence is preceded by a line beginning with a
+'>' with a sequence name.
+
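+The library layout above is simple enough to read with a few
+lines of Python. The sketch below is only an illustration
+(read_fasta is a made-up name, not part of the package), and it
+approximates "anything that is not an amino-acid code" by keeping
+alphabetic characters only.
+

```python
from typing import List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (name, sequence) pairs from FASTA-format text."""
    records: List[Tuple[str, List[str]]] = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A '>' in the first column starts a new sequence entry.
            records.append((line[1:].strip(), []))
        elif records:
            # Drop spaces, tabs, digits and punctuation; keep residue letters.
            records[-1][1].append("".join(ch for ch in line if ch.isalpha()))
    return [(name, "".join(chunks)) for name, chunks in records]


library = """>seq1 first test sequence
A F A S Y T
F S S
>seq2 second test sequence
MGCEN
"""
for name, seq in read_fasta(library):
    print(name, seq)
```

+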
+ The test file should not have lines longer than 120
+characters, and sequences entered with word processors should use
+a document mode, with normal carriage returns at the end of
+lines.
+
+ A different format is required to specify the ordered
+peptide mixture for fastf3/tfastf3. For example:
+
+ >mgstm1
+ MGCEN,
+ MIDYP,
+ MLLAY,
+ MLLGY
+
+indicates M in the first position of all four peptides (as from
+CNBr), G, I, L (twice) in the second position (first cycle),
+C,D,L (twice) in the third position, etc. The commas (,) are
+required to indicate the number of fragments in the mixture, but
+there should be no comma after the last residue.
+
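+To make the per-cycle reading of the mixture concrete, the
+following Python sketch splits the comma-separated peptides and
+lists the residues seen at each position. The function name
+mixture_by_cycle is invented for this example, and the sketch
+assumes all peptides have the same length, as in the example
+above.
+

```python
from typing import List, Tuple


def mixture_by_cycle(text: str) -> Tuple[str, List[str], List[str]]:
    """Return (name, peptides, residues at each position) for a mixture."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    name = lines[0].lstrip(">")
    # Commas separate the fragments; there is none after the last one.
    peptides = [p.strip() for p in "".join(lines[1:]).split(",")]
    # Column i collects the residue found at position i of every peptide.
    columns = ["".join(col) for col in zip(*peptides)]
    return name, peptides, columns


name, peptides, columns = mixture_by_cycle(""">mgstm1
MGCEN,
MIDYP,
MLLAY,
MLLGY
""")
print(len(peptides), columns)
# prints: 4 ['MMMM', 'GILL', 'CDLL', 'EYAG', 'NPYY']
```

+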
+ For the fasts3/tfasts3 program, the format is the same,
+except that there is no requirement for the peptides to be the
+same length.
+
+4. Statistical Significance
+
+ All the programs in the FASTA3 package attempt to calculate
+accurate estimates of the statistical significance of a match.
+For fasta3, ssearch3, and fastx3/y3, these estimates are very
+accurate (Pearson, 1998, Zhang et al., 1997). Altschul et al.
+(Altschul et al., 1994) provide an excellent review of the
+statistics of local similarity scores. Local sequence similarity
+scores follow the extreme value distribution, so that P(s > x) =
+1 - exp(-exp(-lambda(x-u))) where u = ln(Kmn)/lambda and m,n are
+the lengths of the query and library sequence. This formula can
+be rewritten as: 1 - exp(-Kmn exp(-lambda x)), which shows that
+the average score for an unrelated library sequence increases
+with the logarithm of the length of the library sequence. The
+fasta3 programs use simple linear regression against the log
+of the library sequence length to calculate a normalized "z-
+score" with mean 50, regardless of library sequence length, and
+variance 10. (Several other estimation methods are available with
+the -z option.) These z-scores can then be used with the extreme
+value distribution and the poisson distribution (to account for
+the fact that each library sequence comparison is an independent
+test) to calculate the number of library sequences expected to obtain a
+score greater than or equal to the score obtained in the search.
+The original idea and routines to do the linear regression on
+library sequence length were provided by Phil Green, U. Washington.
+This version uses a slightly different strategy for fitting the
+data than those originally provided by Dr. Green.
+
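+The two forms of the extreme value formula given above can be
+checked numerically. The Python sketch below is an illustration
+only; the function names and the parameter values (lambda = 0.25,
+K = 0.05, m = 200, n = 300) are arbitrary assumptions chosen for
+the example, not values used by the programs.
+

```python
import math


def evd_location(lam: float, K: float, m: int, n: int) -> float:
    """u = ln(Kmn)/lambda, the location of the extreme value distribution."""
    return math.log(K * m * n) / lam


def evd_p_location(x: float, lam: float, K: float, m: int, n: int) -> float:
    """P(s > x) = 1 - exp(-exp(-lambda (x - u)))."""
    u = evd_location(lam, K, m, n)
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))


def evd_p_direct(x: float, lam: float, K: float, m: int, n: int) -> float:
    """P(s > x) = 1 - exp(-K m n exp(-lambda x)), the rewritten form."""
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * x))


lam, K, m, n = 0.25, 0.05, 200, 300   # made-up parameters
print(evd_p_location(60, lam, K, m, n), evd_p_direct(60, lam, K, m, n))
# Doubling the library sequence length shifts u by ln(2)/lambda,
# i.e. the typical unrelated score grows with log(length).
print(evd_location(lam, K, m, 2 * n) - evd_location(lam, K, m, n))
```

+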
+ The expected number of sequences is plotted in the histogram
+using an "*". Since the parameters for the extreme value
+distribution are not calculated directly from the distribution of
+similarity scores, the pattern of "*'s" in the histogram gives a
+qualitative view of how well the statistical theory fits the
+similarity scores calculated by the programs. For fasta3, if
+optimized scores are calculated for each sequence in the database
+(the default), the agreement between the actual distribution of
+"z-scores" and the expected distribution based on the length
+dependence of the score and the extreme value distribution is
+usually very good. Likewise, the distribution of ssearch3
+Smith-Waterman scores typically agrees closely with the actual
+distribution of "z-scores." The agreement with unoptimized
+scores, ktup=2, is often not very good, with too many high
+scoring sequences and too few low scoring sequences compared with
+the predicted relationship between sequence length and similarity
+score. In those cases, the expectation values may be
+overestimates.
+
+ With version 33t01, all the FASTA programs also report a
+"bit" score, which is equivalent to the bit score reported by
+BLAST2. The FASTA33/BLAST2 bit score is calculated as: (lambda*S
+- ln K)/ln 2, where S is the raw similarity score, lambda and K
+are statistical parameters estimated from the distribution of
+unrelated sequence similarity scores. The statistical
+significance of a given bit score depends on the lengths of the
+query and library sequences and the size of the library, but a 1
+bit increase in score corresponds to a 2-fold reduction in
+expectation; a 10-bit increase implies 1000-fold lower
+expectation, etc.
+
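+The bit score formula and the "1 bit = 2-fold" rule above can be
+worked through in a few lines of Python. This is a sketch with
+invented function names, and the lambda and K values are
+arbitrary placeholders rather than parameters estimated by the
+programs.
+

```python
import math


def bit_score(raw_score: float, lam: float, K: float) -> float:
    """(lambda*S - ln K) / ln 2, as given in the text."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expectation_fold_change(delta_bits: float) -> float:
    """Fold reduction in expectation for a gain of delta_bits bits."""
    return 2.0 ** delta_bits


lam, K = 0.27, 0.04   # made-up statistical parameters
print(round(bit_score(100, lam, K), 1))
print(expectation_fold_change(1))    # 2.0
print(expectation_fold_change(10))   # 1024.0, i.e. roughly 1000-fold
```

+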
+ The statistical routines assume that the library contains a
+large sample of unrelated sequences. If this is not true, then
+statistical parameters can be estimated by using the -z 11-15
+options. -z options greater than 10 calculate a shuffled
+similarity score for each library sequence, in addition to the
+unshuffled score, and estimate the statistical parameters from
+the scores of the shuffled sequences. If there are fewer than 20
+sequences in the library, the statistical calculations are not
+done.
+
+ For protein searches, library sequences with E() values <
+0.01 for searches of a 10,000 entry protein database are almost
+always homologous. Frequently sequences with E()-values from 1 -
+10 are related as well, but unrelated sequences (1 - 10 per
+search) will have scores in this range as well. Remember,
+however, that these E() values also reflect differences between
+the amino acid composition of the query sequence and that of the
+"average" library sequence. Thus, when searches are done with
+query sequences with "biased" amino-acid composition, unrelated
+sequences may have "significant" scores because of sequence bias.
+PRSS3 can address this problem by calculating similarity scores
+for random sequences with the same length and amino acid
+composition.
+
+5. Options
+
+ Command line options are available to change the scoring
+parameters and output display. Command line options must precede
+other program arguments, such as the query and library file
+names.
+
+5.1. Command line options
+
+-a (fasta3, ssearch3 only) show both sequences in their
+ entirety.
+
+-A force Smith-Waterman alignments for fasta3 DNA sequences.
+ By default, only fasta3 protein sequence comparisons use
+ Smith-Waterman alignments.
+
+-B Show normalized score as a z-score, rather than a bit-score
+ in the list of best scores.
+
+-b # Number of sequence scores to be shown on output. In the
+ absence of this option, fasta (and tfasta and ssearch)
+ display all library sequences obtaining similarity scores
+ with expectations less than 10.0 if optimized scores are
+ used, or 2.0 if they are not. The -b option can limit the
+ display further, but it will not cause additional sequences
+ to be displayed.
+
+-c # Threshold score for optimization (OPTCUT). Set "-c 1" to
+ optimize every sequence in a database.
+
+-E # Limit the number of scores and alignments shown based on the
+ expected number of scores. Used to override the expectation
+ value of 10.0 used by default. When used with -Q, -E 2.0
+ will show all library sequences with scores with an
+ expectation value <= 2.0.
+
+-d # Maximum number of alignments to be displayed. Ignored if
+ "-Q" is not used.
+
+-f Penalty for the first residue in a gap (-12 by default for
+ proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).
+
+-F # Limit the number of scores and alignments shown based on the
+ expected number of scores. "-E #" sets the highest E()-value
+ shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001"
+ will not show any matches or alignments with E() < 0.0001.
+ This allows one to skip over close relationships in searches
+ for more distant relationships.
+
+-g Penalty for additional residues in a gap (-2 by default for
+ proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
+
+-h Penalty for frameshift (fastx3/y3, tfastx3/y3 only).
+
+-H Omit histogram.
+
+-i Invert (reverse complement) the query sequence if it is DNA.
+ For tfasta3/x3/y3, search the reverse complement of the
+ library sequence only.
+
+-j # Penalty for frameshift within a codon (fasty3/tfasty3 only).
+
+-l file
+ Location of library menu file (FASTLIBS).
+
+-L Display more information about the library sequence in the
+ alignment.
+
+-M low-high
+ Range of amino acid sequence lengths to be included in the
+ search.
+
+-m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10
+
+     -m 0          -m 1          -m 2          -m 3          -m 4
+  MWRTCGPPYT    MWRTCGPPYT    MWRTCGPPYT                  MWRTCGPPYT
+  ::..:: :::      xx  X       ..KS..Y...    MWKSCGYPYT    ----------
+  MWKSCGYPYT    MWKSCGYPYT
+
+ -m 5 provides a combination of -m 4 and -m 0. -m 6 provides
+ -m 5 plus HTML formatting.
+
+-m 9 provides coordinates and scores with the best score
+ information. A simple "-m 9" extends the normal best score
+ information:
+
+ The best scores are: opt bits E(14548)
+ XURTG4 glutathione transferase (EC 2.5.1.18) 4 - ( 219) 1248 291.7 1.1e-79
+
+ to include the additional information (on the same line,
+ separated by a <tab>):
+
+ %_id %_gid sw alen an0 ax0 pn0 px0 an1 ax1 pn1 px1 gapq gapl fs
+ 0.771 0.771 1248 218 1 218 1 218 1 218 1 219 0 0 0
+
+ -m 9c provides additional information: an encoded alignment
+ string. Thus:
+
+ 10 20 30 40 50 60 70
+ GT8.7 NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
+ :.:: . :: :: . .::: : .: ::.: .: : ..:.. ::: :..:
+ XURTG NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
+ 20 30 40 50 60
+
+ would be encoded:
+
+ =23+9=13-2=10-1=3+1=5
+
+ The alignment encoding is with respect to the alignment, not
+ the sequences. The coordinate of the alignment is given
+ earlier in the "-m 9c" line.
+
+-m 10
+ -m 10 is a new, parseable format for use with other
+ programs. See the file "readme.v20u4" for a more complete
+ description.
+
+ As of version "fa34t23b2", it has become possible to combine
+ independent "-m" options. Thus, one can use "-m 1 -m 6 -m
+ 9".
+
+-M low-high
+ Include library sequences (proteins only) with lengths
+ between low and high.
+
+-n Force the query sequence to be treated as a DNA sequence.
+ This is particularly useful for query sequences that contain
+ a large number of ambiguous residues, e.g. transcription
+ factor binding sites.
+
+-O Send copy of results to "filename." Helpful for
+ environments without STDOUT (mostly for the Macintosh).
+
+-o Turn off default optimization of all scores greater than
+ OPTCUT. Sort results by "initn" scores (reduces the accuracy
+ of statistical estimates).
+
+-p Force query to be treated as protein sequence.
+
+-Q,-q
+ Quiet - does not prompt for any input. Writes scores and
+ alignments to the terminal or standard output file.
+
+-r Specify match/mismatch scores for DNA comparisons. The
+ default is "+5/-4". "+3/-2" can perform better in some
+ cases.
+
+-R file
+ Save a results summary line for every sequence in the
+ sequence library. The summary line includes the sequence
+ identifier, superfamily number (if available) position in
+ the library, and the similarity scores calculated. This
+ option can be used to evaluate the sensitivity and
+ selectivity of different search strategies (Pearson, 1995,
+ Pearson, 1998).
+
+-s file
+ Specify the scoring matrix file. fasta3 uses the same
+ scoring matrices as Blast1.4/2.0. Several scoring matrix
+ files are included in the standard distribution. For
+ protein sequences: codaa.mat - based on minimum mutation
+ matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250
+ matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
+ pam120.mat - a PAM120 matrix. The default scoring matrix is
+ BLOSUM50 ("-s BL50"). Other matrices available from within
+ the program are: PAM250/"-s P250", PAM120/"-s P120",
+ PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40"
+ (MDM are modern PAM matrices from Jones et al. (Jones et
+ al., 1992)), BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s
+ BL80".
+
+-S Treat lower-case characters in the query or library
+ sequences as "low-complexity" ("seg"-ed) residues.
+ Traditionally, the "seg" program (Wootton and
+ Federhen, 1993) is used to remove low complexity regions in
+ DNA sequences by replacing the residues with an "X". When
+ the "-S" option is used, the FASTA33 programs provide a
+ potentially more informative approach. With "-S", lower
+ case characters in the query or database sequences are
+ treated as "X"'s during the initial scan, but are treated as
+ normal residues during the final alignment display. Since
+ statistical significance is calculated from the similarity
+ score calculated during the library search, when the lower
+ case residues are "X"'s, low complexity regions will not
+ produce statistically significant matches. However, if a
+ significant alignment contains low complexity regions, their
+ alignment is shown. With "-S", lower case characters may be
+ included in the alignment to indicate low complexity
+ regions, and the final alignment score may be higher than
+ the score obtained during the search.
+
+ The pseg program can be used to produce databases (or query
+ sequences) with lower case residues indicating low
+ complexity regions using the command:
+
+ pseg database.fasta -z 1 -q > database.lc_seg
+
+ (seg can also be used with some post processing, see
+ readme.v33tx.)
+
+-U Treat the query sequence as an RNA sequence. In addition to
+ selecting a DNA/RNA alphabet, this option causes changes to
+ the scoring matrix so that 'G:A' , 'T:C' or 'U:C' are scored
+ as 'G:G'.
+
+-V str
+ It is now possible to specify some annotation characters
+ that can be included (and will be ignored), in the query
+ sequence file. Thus, one might have a file with:
+ "ACVS*ITRLFT?", where "*" and "?" are used to indicate
+ phosphorylation. By giving the option -V '*?', those
+ characters in the query will be moved to an "annotation
+ string", and alignments that include the annotated residues
+ will be highlighted with the appropriate character above the
+ sequence (on the number line).
+
+-w # Line length (width) = number (<200)
+
+-W # context length (default is 1/2 of line width -w) for
+ alignments, like those from fasta and ssearch, that provide
+ additional sequence context.
+
+-x # Specify the penalty for a match to an 'X', independently of
+ the PAM matrix. Particularly useful for fastx3/fasty3,
+ where termination codons are encoded as 'X'.
+
+-X Specifies offsets for the beginning of the query and library
+ sequence. For example, if you are comparing upstream
+ regions for two genes, and the first sequence contains 500
+ nt of upstream sequence while the second contains 300 nt of
+ upstream sequence, you might try:
+
+ fasta -X "-500 -300" seq1.nt seq2.nt
+
+ If the -X option is not used, FASTA assumes numbering starts
+ with 1. (You should double check to be certain the negative
+ numbering works properly.)
+
+-y Set the width of the band used for calculating "optimized"
+ scores. For proteins and ktup=2, the width is 16. For
+ proteins with ktup=1, the width is 32 by default. For DNA
+ the width is 16.
+
+-z -1,0,1,2,3,4,5
+ -z -1 turns off statistical calculations. -z 0 estimates the
+ significance of the match from the mean and standard
+ deviation of the library scores, without correcting for
+ library sequence length. -z 1 (the default) uses a weighted
+ regression of average score vs library sequence length; -z 2
+ uses maximum likelihood estimates of Lambda and K; -z 3 uses
+ Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 - 5
+ uses two variations on the -z 1 strategy. -z 1 and -z 2 are
+ the best methods, in general.
+
+-z 11,12,14,15
+ Estimate the statistical parameters from shuffled copies of
+ each library sequence. This doubles the time required for a
+ search, but allows accurate statistics to be estimated for
+ libraries composed of a single protein family.
+
+-Z db_size
+ set the apparent size of the database to be used when
+ calculating expectation E() values. If you searched a
+ database with 1,000 sequences, but would like to have the
+ E()-values calculated in the context of a 100,000 sequence
+ database, use '-Z 100000'.
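+
+ As a rough illustration of what -Z changes, the sketch below
+ rescales an expectation value from one database size to
+ another. It assumes that E() is simply proportional to the
+ number of library sequences, which is a simplification. The
+ function name rescale_evalue and its parameters are
+ hypothetical and are not part of the FASTA package.
+

```python
def rescale_evalue(evalue, searched_size, apparent_size):
    """Rescale an E() value, assuming E() is proportional to database size."""
    if searched_size <= 0 or apparent_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * (apparent_size / searched_size)


if __name__ == "__main__":
    # An E() obtained from a 1,000-sequence search, viewed in the
    # context of a 100,000-sequence database (as with '-Z 100000').
    print(rescale_evalue(0.001, 1000, 100000))
```

+
+ Under this assumption, a hit becomes 100-fold less significant
+ when the apparent database is 100 times larger.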
+
+-1 sort output by init1 score (for compatibility with FASTP -
+ do not use).
+
+-3 translate only three forward frames
+
+For example:
+
+ fasta -w 80 -a seq1.aa seq2.aa
+
+would compare the sequence in seq1.aa to that in seq2.aa and
+display the results with 80 residues on an output line, showing
+all of the residues in both sequences. Be sure to enter the
+options before entering the file names, or just enter the options
+on the command line, and the program will prompt for the file
+names.
+
+ (November, 1997) In addition, it is now possible to provide
+the fasta programs with the query sequence (fasta, fasty,
+ssearch, tfastx), or two sequences (prss, lalign, plalign) from
+the unix "stdin" stream. This makes it much easier to set up
+FASTA or PRSS WWW pages. To specify that stdin be used, rather
+than a file, the file name should be specified as '-' or '@' (the
+latter file name makes it possible to specify a subset of the
+sequence). Thus:
+
+ cat query.aa | fasta -q @:25-75 s
+
+would take residues 25-75 from query.aa and search the 's'
+library (see the discussion of FASTLIBS).
+
+5.2. Environment variables
+
+ Because the current version of the program allows the user
+to set virtually every option on the command line (except the
+ktup, which must be set as the third command line argument), only
+the FASTLIBS environment variable is routinely used.
+
+FASTLIBS
+ specifies the location of the file which contains the list
+ of library descriptions, locations, and library types (see
+ section on finding library files).
+
+6. Frequently Asked Questions
+
+ (1) Which program should I use? See Table I.
+
+ (2) How do I search with both DNA strands with fasta3 and
+ fastx3? With version 32 of the FASTA program package, all
+ searches that use DNA queries (e.g. fasta3, fastx3/y3)
+ examine both strands. To revert to earlier FASTA behavior
+ - only looking at the forward or reverse strand - use -3
+ to search only the forward strand and -i -3 to search only
+ the reverse strand.
+
+ (3) When I search Genbank - the program reports: 0 residues in
+ 0 sequences. This typically happens because the program
+ does not know that you are searching a Genbank flatfile
+ database and is looking for a FASTA format database. Be
+ certain to specify the library type ("1" for Genbank
+ flatfile) with the database name.
+
+ (4) What is the difference between fastx3 and fasty3 (or
+ tfastx3 and tfasty3)? [t]fastx3 uses a simpler codon-based
+ model for alignments that does not allow frameshifts
+ in some codon positions (see Zhang et al., 1997).
+ tfastx3 is about 30% faster, but tfasty3 can produce
+ higher quality alignments in some cases.
+
+ (5) When I run fasta3 -q, I see little or no output, but I get
+ lots of scores when I run interactively. With the -q
+ option, the number of high scores displayed is limited by
+ the -E # cutoff, which is 10.0 for protein comparisons,
+ 2.0 for DNA comparisons, and 5.0 for translated
+ DNA:protein comparisons. In interactive mode (without
+ -q), by default you see 20 high scores, regardless of
+ E() value.
+
+ (6) What is ktup? All of the programs with "fast" in their
+ name use a lookup table to speed the search. For proteins
+ with ktup=2, this means that the program does not look at
+ any sequence alignment that does not involve matching two
+ identical adjacent residues in both sequences. Likewise,
+ with DNA and ktup=6, the initial alignment of the
+ sequences looks for 6 identical adjacent nucleotides in
+ both sequences. Because it is less likely that two
+ identical adjacent amino acids will line up by chance in
+ two unrelated proteins, this speeds up the comparison.
+ But very distantly related sequences may never have two
+ identical residues in a row, yet will have single aligned
+ identities. In this case, ktup=1 may find alignments that
+ ktup=2 misses.
+
+ (7) Sometimes, in the list of best scores, the same sequence
+ is shown twice with exactly the same score. Sometimes,
+ the sequence is there twice, but the scores are slightly
+ different. When any of the fasta3 programs searches a long
+ sequence, it breaks the sequence up into overlapping
+ pieces. The length of the piece depends on the length of
+ the query and the particular program being used (it can
+ also be controlled with the -N #### option). Since the
+ pieces overlap by the length of the query sequence (or
+ 3*query_length for fastx/y3 and tfasta/x/y3), if the
+ highest scoring alignment is at the end of one piece, it
+ will be scored again at the beginning of the next piece.
+ If the alignment is not completely included in the
+ overlap region, one of the pieces will give a higher score
+ than the other. These duplications can be detected by
+ looking at the coordinates of the alignment. If either
+ the beginning or end coordinate is identical in two
+ alignments, the alignments are at least partially
+ duplicates.
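+
+The coordinate check described in (7) can be sketched as a small
+script. This is only an illustration: the (seq_id, begin, end,
+score) tuple layout and the function name
+find_partial_duplicates are assumptions made for the example and
+do not correspond to any FASTA output format or program.
+

```python
def find_partial_duplicates(alignments):
    """Return pairs of alignments to the same library sequence that
    share a begin or an end coordinate.

    Each alignment is a (seq_id, begin, end, score) tuple; this layout
    is an assumption made for the example.
    """
    pairs = []
    for i, (id_a, beg_a, end_a, _) in enumerate(alignments):
        for id_b, beg_b, end_b, _ in alignments[i + 1:]:
            if id_a == id_b and (beg_a == beg_b or end_a == end_b):
                pairs.append(((id_a, beg_a, end_a), (id_b, beg_b, end_b)))
    return pairs


if __name__ == "__main__":
    hits = [
        ("chrI", 9850, 10120, 412),   # scored at the end of one piece
        ("chrI", 9900, 10120, 398),   # same end, rescored in the next piece
        ("chrII", 500, 770, 120),     # unrelated hit
    ]
    for a, b in find_partial_duplicates(hits):
        print(a, "shares a coordinate with", b)
```

+
+Pairs reported this way are candidates for the same alignment
+having been scored in two overlapping pieces; the scores may
+still differ, as explained in (7).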
+
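+The lookup-table idea behind ktup, described in (6), can be
+illustrated with a short sketch. It is not the code used by the
+FASTA programs; it only shows how a table of ktup-length words
+restricts the comparison to positions where ktup identical
+adjacent residues match. The function names build_lookup and
+word_hits are hypothetical.
+

```python
from collections import defaultdict


def build_lookup(query, ktup):
    """Map every ktup-length word in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    return table


def word_hits(query, library, ktup):
    """Return (query_pos, library_pos) pairs where ktup identical
    adjacent residues match."""
    table = build_lookup(query, ktup)
    hits = []
    for j in range(len(library) - ktup + 1):
        # .get() avoids adding empty entries to the defaultdict.
        for i in table.get(library[j:j + ktup], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    query, library = "ACVSITRLFT", "MKACVTITRLW"
    print(len(word_hits(query, library, 2)), "hits with ktup=2")
    print(len(word_hits(query, library, 1)), "hits with ktup=1")
```

+
+With ktup=1 every single identity is a hit, so more positions are
+examined; with ktup=2 only runs of two identical residues are
+considered, which is faster but can miss distant relationships.
+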
+As always, please inform me of bugs as soon as possible.
+
+William R. Pearson
+Department of Biochemistry
+Jordan Hall Box 800733
+U. of Virginia
+Charlottesville, VA
+
+wrp@virginia.EDU
+
+7. References
+
+Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C.
+(1994). Issues in searching molecular sequence databases. Nature
+Genet. 6,119-129.
+
+Altschul, S. F. and Gish, W. (1996). Local alignment statistics.
+Methods Enzymol. 266,460-480.
+
+Bairoch, A. and Apweiler, R. (1996). The Swiss-Prot protein
+sequence data bank and its new supplement TrEMBL. Nucleic Acids
+Res. 24,21-25.
+
+Barker, W. C., Garavelli, J. S., Haft, D. H., Hunt, L. T.,
+Marzec, C. R., Orcutt, B. C., Srinivasarao, G. Y., Yeh, L. S. L.,
+Ledley, R. S., Mewes, H. W., Pfeiffer, F., and Tsugita, A.
+(1998). The PIR-International Protein Sequence Database. Nucleic
+Acids Res. 26,27-32.
+
+Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978). A model
+of evolutionary change in proteins. In Atlas of Protein Sequence
+and Structure, vol. 5, supplement 3. M. Dayhoff, ed. (Silver
+Spring, MD: National Biomedical Research Foundation), pp.
+345-352.
+
+Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The
+rapid generation of mutation data matrices from protein
+sequences. Comp. Appl. Biosci. 8,275-282.
+
+Pearson, W. R. (2000). Flexible similarity searching with the
+FASTA3 program package. In Bioinformatics Methods and Protocols,
+S. Misener and S. A. Krawetz, ed. (Totowa, NJ: Humana Press), pp.
+185-219.
+
+Pearson, W. R. and Lipman, D. J. (1988). Improved tools for
+biological sequence comparison. Proc. Natl. Acad. Sci. USA
+85,2444-2448.
+
+Pearson, W. R. (1995). Comparison of methods for searching
+protein sequence databases. Prot. Sci. 4,1145-1160.
+
+Pearson, W. R. (1996). Effective protein sequence comparison.
+Methods Enzymol. 266,227-258.
+
+Pearson, W. R. (1998). Empirical statistical estimates for
+sequence similarity searches. J. Mol. Biol. 276,71-84.
+
+Smith, T. F. and Waterman, M. S. (1981). Identification of common
+molecular subsequences. J. Mol. Biol. 147,195-197.
+
+Wootton, J. C. and Federhen, S. (1993). Statistics of local
+complexity in amino acid sequences and sequence databases.
+Comput. Chem. 17,149-163.
+
+Zhang, Z., Pearson, W. R., and Miller, W. (1997). Aligning a DNA
+sequence with a protein sequence. J. Computational Biology
+4,339-349.
+