+++ /dev/null
-(Updated December, 2003)
-
-
- COPYRIGHT NOTICE
-
-Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William R.
-Pearson and the University of Virginia. All rights reserved. The
-FASTA program and documentation may not be sold or incorporated
-into a commercial product, in whole or in part, without written
-consent of William R. Pearson and the University of Virginia.
-For further information regarding permission for use or
-reproduction, please contact: David Hudson, Assistant Provost for
-Research, University of Virginia, P.O. Box 9025, Charlottesville,
-VA 22906-9025, (434) 924-6853
-
-The FASTA program package
-
-Introduction
-
- This documentation describes the version 3 of the FASTA
-program package (see W. R. Pearson and D. J. Lipman (1988),
-"Improved Tools for Biological Sequence Analysis", PNAS
-85:2444-2448 (Pearson and Lipman, 1988); W. R. Pearson (1996)
-"Effective protein sequence comparison" Meth. Enzymol.
-266:227-258 (Pearson, 1996); Pearson et. al. (1997) Genomics
-46:24-36 (Zhang et al., 1997); Pearson, (1999) Meth. in
-Molecular Biology 132:185-219 (Pearson, 2000). Version 3 of the
-FASTA packages contains many programs for searching DNA and
-protein databases and one program (prss3) for evaluating
-statistical significance from randomly shuffled sequences.
-Several additional analysis programs, including programs that
-produce local alignments, are available as part of version 2 of
-the FASTA package, which is still available.
-
- This document is divided into three sections: (1) A summary
-overview of the programs in the FASTA3 package; (2) A guide to
-installing the programs and databases; (3) A guide to using the
-FASTA programs. The revision history of the programs can be found
-in the readme.v30..v34, files. The programs are easy to use, so
-if you are using them on a machine that is administered by
-someone else, you can skip section (2) and focus on (1) and (3)
-to learn how to use the programsIf you are installing the
-programs on your own machine, you will need to read section (2)
-carefully.
-
-1. An overview of the FASTA programs
-
- Although there are a large number of programs in this
-package, they belong to three groups: (1) "Conventional" Library
-search programs: FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3,
-TFASTY3, SSEARCH3; (2) Programs for searching with short
-fragments: FASTS3, FASTF3, TFASTS3, TFASTF3; (3) Statistical
-significance: PRSS3. Programs that start with fast search
-protein databases, while tfast programs search translated DNA
-databases. Table I gives a brief description of the programs.
-
-
- Table I. Comparison programs in the FASTA3 package
-
----------------------------------------------------------------------------
-fasta3 Compare a protein sequence to a protein sequence
- database or a DNA sequence to a DNA sequence database
- using the FASTA algorithm (Pearson and Lipman, 1988,
- Pearson, 1996). Search speed and selectivity are con-
- trolled with the ktup(wordsize) parameter. For protein
- comparisons, ktup = 2 by default; ktup =1 is more sen-
- sitive but slower. For DNA comparisons, ktup=6 by de-
- fault; ktup=3 or ktup=4 provides higher sensitivity;
- ktup=1 should be used for oligonucleotides (DNA query
- lengths < 20).
-
-ssearch3 Compare a protein sequence to a protein sequence
- database or a DNA sequence to a DNA sequence database
- using the Smith-Waterman algorithm (Smith and Water-
- man, 1981). ssearch3 is about 10-times slower than
- FASTA3, but is more sensitive for full-length protein
- sequence comparison.
-
-fastx3/ fasty3 Compare a DNA sequence to a protein sequence database,
- by comparing the translated DNA sequence in three
- frames and allowing gaps and frameshifts. fastx3 uses
- a simpler, faster algorithm for alignments that allows
- frameshifts only between codons; fasty3 is slower but
- produces better alignments with poor quality sequences
- because frameshifts are allowed within codons.
-
-tfastx3/ tfasty3 Compare a protein sequence to a DNA sequence database,
- calculating similarities with frameshifts to the for-
- ward and reverse orientations.
-
-tfasta3 Compare a protein sequence to a DNA sequence database,
- calculating similarities (without frameshifts) to the 3
- forward and three reverse reading frames. tfastx3 and
- tfasty3 are preferred because they calculate similarity
- over frameshifts.
-
-fastf3/tfastf3 Compares an ordered peptide mixture, as would be ob-
- tained by Edman degredation of a CNBr cleavage of a
- protein, against a protein (fastf) or DNA (tfastf)
- database.
-
-fasts3/tfasts3 Compares set of short peptide fragments, as would be
- obtained from mass-spec. analysis of a protein, against
- a protein (fasts) or DNA (tfasts) database.
----------------------------------------------------------------------------
-
-2. Installing FASTA and the sequence databases
-
-2.1. Obtaining the libraries
-
- The FASTA program package does not include any protein or
-DNA sequence libraries. Protein databases are available on CD-
-ROM from the PIR and EMBL (see below), or via anonymouse FTP from
-many different sources. As this document is updated in the fall
-of 1999, no DNA databases are available on CD-ROM from the major
-sequence databases: Genbank at the National for Biotechnology
-Information (www.ncbi.nlm.nih.gov and ftp://ncbi.nlm.nih.gov) and
-EMBL at the European Bioinformatics Institute (www.ebi.ac.uk).
-However, the databases are available via anonymous FTP from both
-sites.
-
-2.1.1. The GENBANK DNA sequence library
-
- Because of the large size of DNA databases, you will
-probably want to keep DNA databases in only one, or possibly two,
-formats. The FASTA3 programs that search DNA databases - fasta3,
-tfastx/y3, and tfasta3 can read DNA databases in Genbank flatfile
-(not ASN.1), FASTA, GCG/compressed-binary, BLAST1.4 (pressdb),
-and BLAST2.0 (formatdb) formats, as well as EMBL format. If you
-are also running the GCG suite of sequence analysis programs, you
-should use GCG/compressed-binary format or BLAST2.0 format for
-your fasta3 searches. If not, BLAST2.0 is a good choice. These
-files are considerably more compact than Genbank flat files, and
-are preferred. The NCBI does not provide software for converting
-from Genbank flat files to Blast2.0 DNA databases, but you can
-use the Blast formatdb program to convert ASN.1 formated Genbank
-files, which are available from the NCBI ftp site.
-
- The NCBI also provides the nr, swissprot, and several EST
-databases that are used by BLAST in FASTA format from:
-ftp://ncbi.nlm.nih.gov/blast/db. These databases are updated
-nightly.
-
-2.1.2. The NBRF protein sequence library
-
- You can obtain the PIR protein sequence database (Barker et
-al., 1998) from:
-
- National Biomedical Research Foundation
- Georgetown University Medical Center
- 3900 Reservoir Rd, N.W.
- Washington, D.C. 20007
-
-or via ftp from nbrf.georgetown.edu or from the NCBI
-(ncbi.nlm.nih.gov/repository/PIR). The data in the ascii
-directory is in PIR Codata format, which is not widely used. I
-recommend the PIR/VMS format data (libtype=5) in the vms
-directory.
-
-2.1.3. The EBI/EMBL CD-ROM libraries
-
- The European Bioinformatics Institute (EBI) distributes both
-the EMBL DNA database and the SwissProt database on CD-ROM
-(Bairoch and Apweiler, 1996), and they are available from:
-
- EMBL-Outstation European Bioinformatics Institute
- Wellcome Trust Genome Campus,
- Hinxton Hall
- Hinxton,
- Cambridge CB10 1SD
- United Kingdom
- Tel: +44 (0)1223 494444
- Fax: +44 (0)1223 494468
- Email: DATALIB@ebi.ac.uk
-
-In addition, the SWISS-PROT protein sequence database is
-available via anonymous FTP from
-ftp://ftp.expasy.ch/databases/swiss-prot/ (also see
-www.expasy.ch).
-
-2.2. Finding the libraries: FASTLIBS
-
- The major problem that most new users of the FASTA package
-have is in setting up the program to find the databases and their
-library type. In general, if you cannot get fasta3 to read a
-sequence database, it is likely that something is wrong with the
-FASTLIBS file. A common problem is that the database file is
-found, but either no sequences are read, or an incorrect number
-of entries is read. This is almost always because the library
-format (libtype) is incorrect. Note that a type 5 file (PIR/VMS
-format) can be read as a type 0 (default FASTA) format file, and
-the number of entries will be correct, but the sequence lengths
-will not.
-
- All the search programs in the FASTA3 package use the
-environment variable FASTLIBS to find the protein and DNA
-sequence libraries. The FASTLIBS variable contains the name of a
-file that has the actual filenames of the libraries. The
-fastlibs file included with the distribution on is an example of
-a file that can be referred to by FASTLIBS. To use the fastlibs
-file, type:
-
- setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
- or
- export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)
-
-Then edit the fastlibs file to indicate where the protein and DNA
-sequence libraries can be found. If you have a hard disk and
-your protein sequence library is kept in the file
-/usr/lib/aabank.lib and your Genbank DNA sequence library is kept
-in the directory: /usr/lib/genbank, then fastgbs might contain:
-
- NBRF Protein$0P/usr/lib/seq/aabank.lib 0
- SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
- GB Primate$1P@/usr/lib/genbank/gpri.nam
- GB Rodent$1R@/usr/lib/genbank/grod.nam
- GB Mammal$1M@/usr/lib/genbank/gmammal.nam
- ^ 1 ^^^^ 4 ^ ^
- 23 (5)
-
-The first line of this file says that there is a copy of the NBRF
-protein sequence database (which is a protein database) that can
-be selected by typing "P" on the command line or when the
-database menu is presented in the file /usr/lib/seq/aabank.lib.
-
- Note that there are 4 or 5 fields in the lines in fastgbs.
-The first field is the description of the library which will be
-displayed by FASTA; it ends with a '$'. The second field (1
-character), is a 0 if the library is a protein library and 1 if
-it is a DNA library. The third field (1 character) is the
-character to be typed to select the library.
-
- The fourth field is the name of the library file. In the
-example above, the /usr/lib/seq/aabank.lib file contains the
-entire protein sequence library. However the DNA library file
-names are preceded by a '@', because these files (gpri.nam,
-grod.nam, gmammal.nam) do not contain the sequences; instead they
-contain the names of the files which contain the sequences. This
-is done because the GENBANK DNA database is broken down in to a
-large number of smaller files. In order to search the entire
-primate database, you must search more than a dozen files.
-
- In addition, an optional fifth field can be used to specify
-the format of the library file. Alternatively, you can specify
-the library format in a file of file names (a file preceded by an
-'@'). This field must be separated from the file name by a space
-character (' ') from the filename. In the example above, the
-aabank.lib file is in Pearson/FASTA format, while the swiss.seq
-file is in PIR/VMS format (from the EMBL CD-ROM). Currently,
-FASTA can read the following formats:
-
- 0 Pearson/FASTA (>SEQID - comment/sequence)
- 1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
- 2 NBRF CODATA (ENTRY/SEQUENCE)
- 3 EMBL/SWISS-PROT (ID/DE/SQ)
- 4 Intelligenetics (;comment/SEQID/sequence)
- 5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
- 6 GCG (version 8.0) Unix Protein and DNA (compressed)
- 11 NCBI Blast1.3.2 format (unix only)
- 12 NCBI Blast2.0 format (unix only, fasta32t08 or later)
-
-In particular, this version will work with the EMBL and PIR VMS
-formats that are distributed on the EMBL CD-ROM. The latter
-format (PIR VMS) is much faster to search than EMBL format. This
-release also works with the protein and DNA database formats
-created for the BLASTP and BLASTN programs by SETDB and PRESSDB
-and with the new NCBI search format. If a library format is not
-specified, for example, because you are just comparing two
-sequences, Pearson/FASTA (format 0) is used by default. To
-specify a library type on the command line, add it to the library
-filename and surround the filename and library type in quotes:
-
- fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"
-
- You can specify a group of library files by putting a '@'
-symbol before a file that contains a list of file names to be
-searched. For example, if @gmam.nam is in the fastgbs file, the
-file "gmam.nam" might contain the lines:
-
- </seqdb/genbank
- gbpri1.seq 1
- gbpri2.seq 1
- gbpri3.seq 1
- gbpri4.seq 1
- gbrod.seq 1
- gbmam.seq 1
-
-In this case, the line beginning with a '<' indicates the
-directory the files will be found in. The remaining lines name
-the actual sequence files. So the first sequence file to be
-searched would be:
-
- /usr/lib/genbank/gbpri.seq
-
-The notation "<PIRNAQ:" might be used under the VAX/VMS operating
-system. Under UNIX, the trailing '/' is left off, so the library
-directory might be written as "</usr/seqlib".
-
- The FASTA programs can search a database composed of
-different files in different sequence formats. For example, you
-may wish to search the Genbank files (in GenBank flat file
-format) and the EMBL DNA sequence database on CD-ROM. To do
-this, you simply list the names and filetypes of the files to be
-searched in a file of filenames. For example, to search the
-mammalian portion of Genbank, the unannotated portion of Genbank,
-and the unannotated portion of the EMBL library, you could use
-the file:
-
- </usr/lib/DNA
- gbpri.seq 1
- # (this '#' causes the program to display the size of the library)
- gbrod.seq 1
- ...
- gbmam.seq 1
- ...
- gbuna.seq 1
- ...
- unanno.seq 5
- #
-
- You do not need to include library format numbers if you
- only use the Pearson/FASTA version of the PIR protein se-
- quence library. If no library type is specified, the
- program assumes that type 0 is being used.
-
- Test the setup by running FASTA. Enter the sequence file
-'mgstm1.aa' when the program requests it (this file is included
-with the programs). The program should then ask you to select a
-protein sequence library. Alternatively, if you run the TFASTA
-program and use the mgstm1.aa query sequence, the program should
-show you a selection of DNA sequence libraries. Once the fastgbs
-file has been set up correctly, you can set FASTLIBS=fastgbs in
-your AUTOEXEC.BAT file, and you will not need to remember where
-the libraries are kept or how they are named.
-
-3. Using the FASTA Package
-
-3.1. Overview
-
- The FASTA sequence comparison programs all require similar
-information, the name of a query sequence file, a library file,
-and the ktup parameter. All of the programs can accept arguments
-on the command line, or they will prompt for the file names and
-ktup value.
-
-To use FASTA, simply type:
-
- FASTA
- and you will be prompted for :
- the name of the test sequence file
- the name of the library file
- and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
- (ktup of 2 is about 5 times faster than ktup = 1)
-
-The program can also be run by typing
-
- FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
-
-Included with the package are several test files. To check to
-make certain that everything is working, you can try:
-
- fasta musplfm.aa prot_test.lib
- and
- tfastx mgstm1.aa gst.nlib
-
-3.2. Sequence files
-
- The fasta3 programs know about three kinds of sequence
-files: (1) plain sequence files - files that contain nothing but
-sequence residues - can only be used as query sequences. (2)
-FASTA format files. These are the same as plain sequence files,
-each sequence is preceded by a comment line with a '>' in the
-first column. (3) distributed sequence libraries (this is a broad
-class that includes the NBRF/PIR VMS and blocked ascii formats,
-Genbank flat-file format, EMBL flat-file format, and
-Intelligenetics format. All of the files that you create should
-be of type (1) or (2). FASTA format files (ones with a '>' and
-comment before the sequence) are preferred, because they can be
-used as query or library sequence files by all of the programs.
-
- I have included several sample test files, *.aa and *.seq as
-well as two small sequence libraries, prot_test.lib and gst.nlib.
-The first line may begin with a '>' by a comment. Spaces and
-tabs (and anything else that is not an amino-acid code) are
-ignored.
-
- Library files should have the form:
-
- >Sequence name and identifier
- A F A S Y T .... actual sequence.
- F S S .... second line of sequence.
- >Next sequence name and identifier
-
-This is often referred to as "FASTA" or format. You can build
-your own library by concatenating several sequence files. Just
-be sure that each sequence is preceded by a line beginning with a
-'>' with a sequence name.
-
- The test file should not have lines longer than 120
-characters, and sequences entered with word processors should use
-a document mode, with normal carriage returns at the end of
-lines.
-
- A different format is required to specify the ordered
-peptide mixture for fastf3/tfastf3. For example:
-
- >mgstm1
- MGCEN,
- MIDYP,
- MLLAY,
- MLLGY
-
-indicates m in the first position of all three peptides (as from
-CNBr), G, I, L (twice) in the second position (first cycle),
-C,D,L (twice) in the third position, etc. The commas (,) are
-required to indicate the number of fragments in the mixture, but
-there should be no comma after the last residue.
-
- For the fasts3/tfasts3 program, the format is the same,
-except that there is no requirement for the peptides to be the
-same length.
-
-4. Statistical Significance
-
- All the programs in the FASTA3 package attempt to calculate
-accurate estimates of the statistical significance of a match.
-For fasta3, ssearch3, and fastx3/y3, these estimates are very
-accurate (Pearson, 1998, Zhang et al., 1997).. Altschul et al.
-(Altschul et al., 1994) provides an excellent review of the
-statistics of local similarity scores. Local sequence similarity
-scores follow the extreme value distribution, so that P(s > x) =
-1 - exp(-exp(-lambda(x-u)) where u = ln(Kmn)/lambda and m,m are
-the lengths of the query and library sequence. This formula can
-be rewritten as: 1 - exp(-Kmn exp(-lambda x), which shows that
-the average score for an unrelated library sequence increases
-with the logarithm of the length of the library sequence. The
-fasta3 programs use simple linear regression against the the log
-of the library sequence length to calculate a normalized "z-
-score" with mean 50, regardless of library sequence length, and
-variance 10. (Several other estimation methods are available with
-the -z option.) These z-scores can then be used with the extreme
-value distribution and the poisson distribution (to account for
-the fact that each library sequence comparison is an independent
-test) to calculate the number of library sequences to obtain a
-score greater than or equal to the score obtained in the search.
-The original idea and routines to do the linear regression on
-library sequence length were provided Phil Green, U. Washington.
-This version uses a slightly different strategy for fitting the
-data than those originally provided by Dr. Green.
-
- The expected number of sequences is plotted in the histogram
-using an "*". Since the parameters for the extreme value
-distribution are not calculated directly from the distribution of
-similarity scores, the pattern of "*'s" in the histogram gives a
-qualitative view of how well the statistical theory fits the
-similarity scores calculated by the programs. For fasta3, if
-optimized scores are calculated for each sequence in the database
-(the default), the agreement between the actual distribution of
-"z-scores" and the expected distribution based on the length
-dependence of the score and the extreme value distribution is
-usually very good. Likewise, the distribution of ssearch3 Smith-
-Waterman scores typically agrees closely with the <actual
-distribution of "z-scores." The agreement with unoptimized
-scores, ktup=2, is often not very good, with too many high
-scoring sequences and too few low scoring sequences compared with
-the predicted relationship between sequence length and similarity
-score. In those cases, the expectation values may be
-overestimates.
-
- With version 33t01, all the FASTA programs also report a
-"bit" score, which is equivalent to the bit score reported by
-BLAST2. The FASTA33/BLAST2 bit score is calculated as: (lambda*S
-- ln K)/ln 2, where S is the raw similarity score, lambda and K
-are statistical parameters estimated from the distribution of
-unrelated sequence similarity scores. The statistical
-signficance of a given bit score depends on the lengths of the
-query and library sequences and the size of the library, but a 1
-bit increase in score corresponds to a 2-fold reduction in
-expectation; a 10-bit increase implies 1000-fold lower
-expectation, etc.
-
- The statistical routines assume that the library contains a
-large sample of unrelated sequences. If this is not true, then
-statistical parameters can be estimated by using the -z 11-15,
-options. -z options greater than 10 calculate a shuffled
-similarity score for each library sequence, in addition to the
-unshuffled score, and estimate the statistical parameters from
-the scores of the shuffled sequences. If there are fewer than 20
-sequences in the library, the statistical calculations are not
-done.
-
- For protein searches, library sequences with E() values <
-0.01 for searches of a 10,000 entry protein database are almost
-always homologous. Frequently sequences with E()-values from 1 -
-10 are related as well, but unrelated sequences ( 1 - 10 per
-search) will have scores in this renage as well. Remember,
-however, that these E() values also reflect differences between
-the amino acid composition of the query sequence and that of the
-"average" library sequence. Thus, when searches are done with
-query sequences with "biased" amino-acid composition, unrelated
-sequences may have "significant" scores because of sequence bias.
-PRSS3 can address this problem by calculating similarity scores
-for random sequences with the same length and amino acid
-composition.
-
-5. Options
-
- Command line options are available to change the scoring
-parameters and output display. Command line options must preceed
-other program arguments, such as the query and library file
-names.
-
-5.1. Command line options
-
--a (fasta3, ssearch3 only) show both sequences in their
- entirety.
-
--A force Smith-Waterman alignments for fasta3 DNA sequences.
- By default, only fasta3 protein sequence comparisons use
- Smith-Waterman alignments.
-
--B Show normalized score as a z-score, rather than a bit-score
- in the list of best scores.
-
--b # Number of sequence scores to be shown on output. In the
- absence of this option, fasta (and tfasta and ssearch)
- display all library sequences obtaining similarity scores
- with expectations less than 10.0 if optimized score are
- used, or 2.0 if they are not. The -b option can limit the
- display further, but it will not cause additional sequences
- to be displayed.
-
--c # Threshold score for optimization (OPTCUT). Set "-c 1" to
- optimize every sequence in a database.
-
--E # Limit the number of scores and alignments shown based on the
- expected number of scores. Used to override the expectation
- value of 10.0 used by default. When used with -Q, -E 2.0
- will show all library sequences with scores with an
- expectation value <= 2.0.
-
--d # Maximum number of alignments to be displayed. Ignored if
- "-Q" is not used.
-
--f Penalty for the first residue in a gap (-12 by default for
- proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).
-
--F # Limit the number of scores and alignments shown based on the
- expected number of scores. "-E #" sets the highest E()-value
- shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001"
- will not show any matches or alignments with E() < 0.0001.
- This allows one to skip over close relationships in searches
- for more distant relationships.
-
--g Penalty for additional residues in a gap (-2 by default for
- proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
-
--h Penalty for frameshift (fastx3/y3, tfastx3/y3 only).
-
--H Omit histogram.
-
--i Invert (reverse complement) the query sequence if it is DNA.
- For tfasta3/x3/y3, search the reverse complement of the
- library sequence only.
-
--j # Penalty for frameshift within a codon (fasty3/tfasty3 only).
-
--l file
- Location of library menu file (FASTLIBS).
-
--L Display more information about the library sequence in the
- alignment.
-
--M low-high
- Range of amino acid sequence lengths to be included in the
- search.
-
--m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10
-
- -m 0 -m 1 -m 2 -m 3 -m 4
- MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT
- ::..:: ::: xx X ..KS..Y... MWKSCGYPYT ----------
- MWKSCGYPYT MWKSCGYPYT
-
- -m 5 provides a combination of -m 4 and -m 0. -m 6 provides
- -m 5 plus HTML formatting.
-
--m 9 provides coordinates and scores with the best score
- information. A simple " -m 9 extends the normal best score
- information:
-
- The best scores are: opt bits E(14548)
- XURTG4 glutathione transferase (EC 2.5.1.18) 4 - ( 219) 1248 291.7 1.1e-79
-
- to include the additional information (on the same line,
- separated by a <tab>):
-
- %_id %_gid sw alen an0 ax0 pn0 px0 an1 ax1 pn1 px1 gapq gapl fs
- 0.771 0.771 1248 218 1 218 1 218 1 218 1 219 0 0 0
-
- -m 9c provides additional information: an encoded alignment
- string. Thus:
-
- 10 20 30 40 50 60 70
- GT8.7 NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
- :.:: . :: :: . .::: : .: ::.: .: : ..:.. ::: :..:
- XURTG NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
- 20 30 40 50 60
-
- would be encoded:
-
- =23+9=13-2=10-1=3+1=5
-
- The alignment encoding is with repect to the alignment, not
- the sequences. The coordinate of the alignment is given
- earlier in the " -m 9c" line.
-
--m 10
- -m 10 is a new, parseable format for use with other
- programs. See the file "readme.v20u4" for a more complete
- description.
-
- As of version "fa34t23b2", it has become possible to combine
- independent "-m" options. Thus, one can use "-m 1 -m 6 -m
- 9".
-
--M low-high
- Include library sequences (proteins only) with lengths
- between low and high.
-
--n Force the query sequence to be treated as a DNA sequence.
- This is particularly useful for query sequences that contain
- a large number of ambiguous residues, e.g. transcription
- factor binding sites.
-
--O Send copy of results to "filename." Helpful for
- environments without STDOUT (mostly for the Macintosh).
-
--o Turn off default optimization of all scores greater than
- OPTCUT. Sort results by "initn" scores (reduces the accuracy
- of statistical estimates).
-
--p Force query to be treated as protein sequence.
-
--Q,-q
- Quiet - does not prompt for any input. Writes scores and
- alignments to the terminal or standard output file.
-
--r Specify match/mismatch scores for DNA comparisons. The
- default is "+5/-4". "+3/-2" can perform better in some
- cases.
-
--R file
- Save a results summary line for every sequence in the
- sequence library. The summary line includes the sequence
- identifier, superfamily number (if available) position in
- the library, and the similarity scores calculated. This
- option can be used to evaluate the sensitivity and
- selectivity of different search strategies (Pearson, 1995,
- Pearson, 1998).
-
--s file
- Specify the scoring matrix file. fasta3 uses the same
- scoring matrices as Blast1.4/2.0. Several scoring matrix
- files are included in the standard distribution. For
- protein sequences: codaa.mat - based on minimum mutation
- matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250
- matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
- pam120.mat - a PAM120 matrix. The default scoring matrix is
- BLOSUM50 ("-s BL50"). Other matrices available from within
- the program are: PAM250/"-s P250", PAM120/"-s P120",
- PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40"
- (MDM are modern PAM matrices from Jones et al. (Jones et
- al., 1992),), BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s
- BL80".
-
--S Treat lower-case characters in the query or library
- sequences as "low-complexity" ("seg"-ed) residues.
- Traditionally, the "seg" program (Wootton and
- Federhen, 1993) is used to remove low complexity regions in
- DNA sequences by replacing the residues with an "X". When
- the "-S" option is used, the FASTA33 programs provide a
- potentially more informative approach. With "-S", lower
- case characters in the query or database sequences are
- treated as "X"'s during the initial scan, but are treated as
- normal residues during the final alignment display. Since
- statistical significance is calculated from the similarity
- score calculated during the library search, when the lower
- case residues are "X"'s, low complexity regions will not
- produce statistically significant matches. However, if a
- significant alignment contains low complexity regions, their
- alignmen is shown. With "-S", lower case characters may be
- included in the alignment to indicate low complexity
- regions, and the final alignment score may be higher than
- the score obtained during the search.
-
- The pseg program can be used to produce databases (or query
- sequences) with lower case residues indicating low
- complexity regions using the command:
-
- pseg database.fasta -z 1 -q > database.lc_seg
-
- (seg can also be used with some post processing, see
- readme.v33tx.)
-
--U Treat the query sequence an RNA sequence. In addition to
- selecting a DNA/RNA alphabet, this option causes changes to
- the scoring matrix so that 'G:A' , 'T:C' or 'U:C' are scored
- as 'G:G'.
-
--V str
- It is now possible to specify some annotation characters
- that can be included (and will be ignored), in the query
- sequence file. Thus, One might have a file with:
- "ACVS*ITRLFT?", where "*" and "?" are used to indicate
- phosphorylation. By giving the option -V '*?', those
- characters in the query will be moved to an "annotation
- string", and alignments that include the annotated residues
- will be highlighted with the appropriate character above the
- sequence (on the number line).
-
--w # Line length (width) = number (<200)
-
--W # context length (default is 1/2 of line width -w) for
- alignment, like fasta and ssearch, that provide additional
- sequence context.
-
--x # Specify the penalty for a match to an 'X', independently of
- the PAM matrix. Particularly useful for fastx3/fasty3,
- where termination codons are encoded as 'X'.
-
--X Specifies offsets for the beginning of the query and library
- sequence. For example, if you are comparing upstream
- regions for two genes, and the first sequence contains 500
- nt of upstream sequence while the second contains 300 nt of
- upstream sequence, you might try:
-
- fasta -X "-500 -300" seq1.nt seq2.nt
-
- If the -X option is not used, FASTA assumes numbering starts
- with 1. (You should double check to be certain the negative
- numbering works properly.)
-
--y Set the width of the band used for calculating "optimized"
- scores. For proteins and ktup=2, the width is 16. For
- proteins with ktup=1, the width is 32 by default. For DNA
- the width is 16.
-
--z -1,0,1,2,3,4,5
- -z -1 turns off statistical calculations. z 0 estimates the
- significance of the match from the mean and standard
- deviation of the library scores, without correcting for
- library sequence length. -z 1 (the default) uses a weighted
- regression of average score vs library sequence length; -z 2
- uses maximum likelihood estimates of Lambda and K; -z 3 uses
- Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 - 5
- uses two variations on the -z 1 strategy. -z 1 and -z 2 are
- the best methods, in general.
-
--z 11,12,14,15
- estimate the statistical parameters from shuffled copies of
- each library sequence. This doubles the time required for a
- search, but allows accurate statistics to be estimated for
- libraries comprised of a single protein family.
-
--Z db_size
- set the apparent size of the database to be used when
- calculating expectation E() values. If you searched a
- database with 1,000 sequences, but would like to have the
- E()-values calculated in the context of a 100,000 sequence
- database, use '-Z 100000'.
-
--1 sort output by init1 score (for compatibility with FASTP -
- do not use).
-
--3 translate only three forward frames
-
-For example:
-
- fasta -w 80 -a seq1.aa seq.aa
-
-would compare the sequence in seq1.aa to that in seq2.aa and
-display the results with 80 residues on an output line, showing
-all of the residues in both sequences. Be sure to enter the
-options before entering the file names, or just enter the options
-on the command line, and the program will prompt for the file
-names.
-
- (November, 1997) In addition, it is now possible to provide
-the fasta programs with the query sequence (fasta, fasty,
-ssearch, tfastx), or two sequences (prss, lalign, plalign) from
-the unix "stdin" stream. This makes it much easier to set up
-FASTA or PRSS WWW pages. To specify that stdin be used, rather
-than a file, the file name should be specified as '-' or '@' (the
-latter file name makes it possible to specify a subset of the
-sequence). Thus:
-
- cat query.aa | fasta -q @:25-75 s
-
-would take residues 25-75 from query.aa and search the 's'
-library (see the discussion of FASTLIBS).
-
-5.2. Environment variables
-
- Because the current version of the program allows the user
-to set virtually every option on the command line (except the
-ktup, which must be set as the third command line argument), only
-the FASTLIBS environment variable is routinely used.
-
-FASTLIBS
- specifies the location of the file which contains the list
- of library descriptions, locations, and library types (see
- section on finding library files).
-
-6. Frequently Asked Questions
-
- (1) Which program should I use? See Table I.
-
- (2) How do I search with both DNA strands with fasta3 and
- fastx3? With version 32 of the FASTA program package, all
- searches that use DNA queries (e.g. fasta3, fastx3/y3)
- examine both strands. To revert to earlier FASTA behavior
- - only looking at the forward or reverse strand - use -3
- to search only the forward strand and -i -3 to search only
- the reverse strand.
-
- (3) When I search Genbank - the program reports: 0 residues in
- 0 sequences. This typically happens because the program
- does not know that you are searching a Genbank flatfile
- database and is looking for a FASTA format database. Be
- certain to specify the library type ("1" for Genbank
- flatfile) with the database name.
-
- (4) What is the difference between fastx3 and fasty3 (or
- tfastx3 and tfasty3). [t]fastx3 uses a simpler codon
- based model for alignments that does not allow frameshifts
- in some codon positions (see ref. (Zhang et al., 1997)).
- tfastx3 is about 30% faster, but tfasty3 can produce
- higher quality alignments in some cases.
-
- (5) When I run fasta3 -q, I don't see any (or very little)
- output, but I get lots of scores when I run interactively.
- With the -Q option, the number of high scores displayed is
- limited by the -E # cutoff, which is 10.0 for protein
- comparisons, 2.0 for DNA comparisons, and 5.0 for
- translated DNA:protein comparisons. In interactive mode
- (without -Q), by default you see 20 high scores,
- regardless of E() value.
-
- (6) What is ktup - All of the programs with fast in their name
- use a computer science method called a lookup table to
- speed the search. For proteins with ktup=2, this means
- that the program does not look at any sequence alignment
- that does not involve matching two identical residues in
- both sequences. Likewise with DNA and ktup = 6, the
- initial alignment of the sequences looks for 6 identical
- adjacent nucleotides in both sequences. Because it is
- less likely that two identical amino-acids will line up by
- chance in two unrelated proteins, this speeds up the
- comparison. But very distantly related sequences may
- never have two identical residues in a row but will have
- single aligned identities. In this case, ktup = 1 may
- find alignments that ktup=2 misses.
-
- (7) Sometimes, in the list of best scores, the same sequence
- is shown twice with exactly the same score. Sometimes,
- the sequence is there twice, but the scores are slightly
- different. When any of the fasta3 programs searches a long
- sequence, it breaks the sequence up into overlapping
- pieces. The length of the piece depends on the length of
- the query and the particular program being used (it can
- also be controlled with the -N #### option). Since the
- pieces overlap by the length of the query sequence (or
- 3*query_length for fastx/y3 and tfasta/x/y3), if the
- highest scoring alignment is at the end of one piece, it
- will be scored again at the beginning of the next piece.
- If the alignment is not be completely included in the
- overlap region, one of the pieces will give a higher score
- than the other. These duplications can be detected by
- looking at the coordinates of the alignment. If either
- the beginning or end coordinate is identical in two
- alignments, the alignments are at least partially
- duplicates.
-
-As always, please inform me of bugs as soon as possible.
-
-William R. Pearson
-Department of Biochemistry
-Jordan Hall Box 800733
-U. of Virginia
-Charlottesville, VA
-
-wrp@virginia.EDU
-
-7. References
-
-Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C.
-(1994). Issues in searching molecular sequence databases. Nature
-Genet. 6,119-129.
-
-Altschul, S. F. and Gish, W. (1996). Local alignment statistics.
-Methods Enzymol. 266,460-480.
-
-Bairoch, A. and Apweiler, R. (1996). The Swiss-Prot protein
-sequence data bank and its new supplement TrEMBL. Nucleic Acids.
-Res. 24,21-25.
-
-Barker, W. C., Garavelli, J. S., Haft, D. H., Hunt, L. T.,
-Marzec, C. R., Orcutt, B. C., Srinivasarao, G. Y., Yeh, L. S. L.,
-Ledley, R. S., Mewes, H. W., Pfeiffer, F., and Tsugita, A.
-(1998). The PIR-International Protein Sequence Database. Nucleic
-Acids Res 26,27-32.
-
-Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978). A model
-of evolutionary change in proteins. In Atlas of Protein Sequence
-and Structure, vol. 5, supplement 3. M. Dayhoff, ed. (Silver
-Spring, MD: National Biomedical Research Foundation), pp.
-345-352.
-
-Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The
-rapid generation of mutation data matrices from protein
-sequences. Comp. Appl. Biosci. 8,275-282.
-
-Pearson, W. R. (2000). Flexible similarity searching with the
-FASTA3 program package. In Bioinformatics Methods and Protocols,
-S. Misener and S. A. Krawetz, ed. (Totowa, NJ: Humana Press), pp.
-185-219.
-
-Pearson, W. R. and Lipman, D. J. (1988). Improved tools for
-biological sequence comparison. Proc. Natl. Acad. Sci. USA
-85,2444-2448.
-
-Pearson, W. R. (1995). Comparison of methods for searching
-protein sequence databases. Prot. Sci. 4,1145-1160.
-
-Pearson, W. R. (1996). Effective protein sequence comparison.
-Methods Enzymol. 266,227-258.
-
-Pearson, W. R. (1998). Empirical statistical estimates for
-sequence similarity searches. J. Mol. Biol. 276,71-84.
-
-Smith, T. F. and Waterman, M. S. (1981). Identification of common
-molecular subsequences. J. Mol. Biol. 147,195-197.
-
-Wootton, J. C. and Federhen, S. (1993). Statistics of local
-complexity in amino acid sequences and sequence databases.
-Comput. Chem. 17,149-163.
-
-Zhang, Z., Pearson, W. R., and Miller, W. (1997). Aligning a DNA
-sequence with a protein sequence. J. Computational Biology
-4,339-349.
-