1 (Updated December, 2003)
6 Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William R.
7 Pearson and the University of Virginia. All rights reserved. The
8 FASTA program and documentation may not be sold or incorporated
9 into a commercial product, in whole or in part, without written
10 consent of William R. Pearson and the University of Virginia.
11 For further information regarding permission for use or
12 reproduction, please contact: David Hudson, Assistant Provost for
13 Research, University of Virginia, P.O. Box 9025, Charlottesville,
14 VA 22906-9025, (434) 924-6853
16 The FASTA program package
20 This documentation describes the version 3 of the FASTA
21 program package (see W. R. Pearson and D. J. Lipman (1988),
22 "Improved Tools for Biological Sequence Analysis", PNAS
23 85:2444-2448 (Pearson and Lipman, 1988); W. R. Pearson (1996)
24 "Effective protein sequence comparison" Meth. Enzymol.
25 266:227-258 (Pearson, 1996); Pearson et. al. (1997) Genomics
26 46:24-36 (Zhang et al., 1997); Pearson, (1999) Meth. in
27 Molecular Biology 132:185-219 (Pearson, 2000). Version 3 of the
28 FASTA packages contains many programs for searching DNA and
29 protein databases and one program (prss3) for evaluating
30 statistical significance from randomly shuffled sequences.
31 Several additional analysis programs, including programs that
32 produce local alignments, are available as part of version 2 of
33 the FASTA package, which is still available.
35 This document is divided into three sections: (1) A summary
36 overview of the programs in the FASTA3 package; (2) A guide to
37 installing the programs and databases; (3) A guide to using the
38 FASTA programs. The revision history of the programs can be found
39 in the readme.v30..v34, files. The programs are easy to use, so
40 if you are using them on a machine that is administered by
41 someone else, you can skip section (2) and focus on (1) and (3)
42 to learn how to use the programsIf you are installing the
43 programs on your own machine, you will need to read section (2)
46 1. An overview of the FASTA programs
48 Although there are a large number of programs in this
49 package, they belong to three groups: (1) "Conventional" Library
50 search programs: FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3,
51 TFASTY3, SSEARCH3; (2) Programs for searching with short
52 fragments: FASTS3, FASTF3, TFASTS3, TFASTF3; (3) Statistical
53 significance: PRSS3. Programs that start with fast search
54 protein databases, while tfast programs search translated DNA
55 databases. Table I gives a brief description of the programs.
58 Table I. Comparison programs in the FASTA3 package
60 ---------------------------------------------------------------------------
61 fasta3 Compare a protein sequence to a protein sequence
62 database or a DNA sequence to a DNA sequence database
63 using the FASTA algorithm (Pearson and Lipman, 1988,
64 Pearson, 1996). Search speed and selectivity are con-
65 trolled with the ktup(wordsize) parameter. For protein
66 comparisons, ktup = 2 by default; ktup =1 is more sen-
67 sitive but slower. For DNA comparisons, ktup=6 by de-
68 fault; ktup=3 or ktup=4 provides higher sensitivity;
69 ktup=1 should be used for oligonucleotides (DNA query
72 ssearch3 Compare a protein sequence to a protein sequence
73 database or a DNA sequence to a DNA sequence database
74 using the Smith-Waterman algorithm (Smith and Water-
75 man, 1981). ssearch3 is about 10-times slower than
76 FASTA3, but is more sensitive for full-length protein
79 fastx3/ fasty3 Compare a DNA sequence to a protein sequence database,
80 by comparing the translated DNA sequence in three
81 frames and allowing gaps and frameshifts. fastx3 uses
82 a simpler, faster algorithm for alignments that allows
83 frameshifts only between codons; fasty3 is slower but
84 produces better alignments with poor quality sequences
85 because frameshifts are allowed within codons.
87 tfastx3/ tfasty3 Compare a protein sequence to a DNA sequence database,
88 calculating similarities with frameshifts to the for-
89 ward and reverse orientations.
91 tfasta3 Compare a protein sequence to a DNA sequence database,
92 calculating similarities (without frameshifts) to the 3
93 forward and three reverse reading frames. tfastx3 and
94 tfasty3 are preferred because they calculate similarity
97 fastf3/tfastf3 Compares an ordered peptide mixture, as would be ob-
98 tained by Edman degredation of a CNBr cleavage of a
99 protein, against a protein (fastf) or DNA (tfastf)
102 fasts3/tfasts3 Compares set of short peptide fragments, as would be
103 obtained from mass-spec. analysis of a protein, against
104 a protein (fasts) or DNA (tfasts) database.
105 ---------------------------------------------------------------------------
107 2. Installing FASTA and the sequence databases
109 2.1. Obtaining the libraries
111 The FASTA program package does not include any protein or
112 DNA sequence libraries. Protein databases are available on CD-
113 ROM from the PIR and EMBL (see below), or via anonymouse FTP from
114 many different sources. As this document is updated in the fall
115 of 1999, no DNA databases are available on CD-ROM from the major
116 sequence databases: Genbank at the National for Biotechnology
117 Information (www.ncbi.nlm.nih.gov and ftp://ncbi.nlm.nih.gov) and
118 EMBL at the European Bioinformatics Institute (www.ebi.ac.uk).
119 However, the databases are available via anonymous FTP from both
122 2.1.1. The GENBANK DNA sequence library
124 Because of the large size of DNA databases, you will
125 probably want to keep DNA databases in only one, or possibly two,
126 formats. The FASTA3 programs that search DNA databases - fasta3,
127 tfastx/y3, and tfasta3 can read DNA databases in Genbank flatfile
128 (not ASN.1), FASTA, GCG/compressed-binary, BLAST1.4 (pressdb),
129 and BLAST2.0 (formatdb) formats, as well as EMBL format. If you
130 are also running the GCG suite of sequence analysis programs, you
131 should use GCG/compressed-binary format or BLAST2.0 format for
132 your fasta3 searches. If not, BLAST2.0 is a good choice. These
133 files are considerably more compact than Genbank flat files, and
134 are preferred. The NCBI does not provide software for converting
135 from Genbank flat files to Blast2.0 DNA databases, but you can
136 use the Blast formatdb program to convert ASN.1 formated Genbank
137 files, which are available from the NCBI ftp site.
139 The NCBI also provides the nr, swissprot, and several EST
140 databases that are used by BLAST in FASTA format from:
141 ftp://ncbi.nlm.nih.gov/blast/db. These databases are updated
144 2.1.2. The NBRF protein sequence library
146 You can obtain the PIR protein sequence database (Barker et
149 National Biomedical Research Foundation
150 Georgetown University Medical Center
151 3900 Reservoir Rd, N.W.
152 Washington, D.C. 20007
154 or via ftp from nbrf.georgetown.edu or from the NCBI
155 (ncbi.nlm.nih.gov/repository/PIR). The data in the ascii
156 directory is in PIR Codata format, which is not widely used. I
157 recommend the PIR/VMS format data (libtype=5) in the vms
160 2.1.3. The EBI/EMBL CD-ROM libraries
162 The European Bioinformatics Institute (EBI) distributes both
163 the EMBL DNA database and the SwissProt database on CD-ROM
164 (Bairoch and Apweiler, 1996), and they are available from:
166 EMBL-Outstation European Bioinformatics Institute
167 Wellcome Trust Genome Campus,
172 Tel: +44 (0)1223 494444
173 Fax: +44 (0)1223 494468
174 Email: DATALIB@ebi.ac.uk
176 In addition, the SWISS-PROT protein sequence database is
177 available via anonymous FTP from
178 ftp://ftp.expasy.ch/databases/swiss-prot/ (also see
181 2.2. Finding the libraries: FASTLIBS
183 The major problem that most new users of the FASTA package
184 have is in setting up the program to find the databases and their
185 library type. In general, if you cannot get fasta3 to read a
186 sequence database, it is likely that something is wrong with the
187 FASTLIBS file. A common problem is that the database file is
188 found, but either no sequences are read, or an incorrect number
189 of entries is read. This is almost always because the library
190 format (libtype) is incorrect. Note that a type 5 file (PIR/VMS
191 format) can be read as a type 0 (default FASTA) format file, and
192 the number of entries will be correct, but the sequence lengths
195 All the search programs in the FASTA3 package use the
196 environment variable FASTLIBS to find the protein and DNA
197 sequence libraries. The FASTLIBS variable contains the name of a
198 file that has the actual filenames of the libraries. The
199 fastlibs file included with the distribution on is an example of
200 a file that can be referred to by FASTLIBS. To use the fastlibs
203 setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
205 export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)
207 Then edit the fastlibs file to indicate where the protein and DNA
208 sequence libraries can be found. If you have a hard disk and
209 your protein sequence library is kept in the file
210 /usr/lib/aabank.lib and your Genbank DNA sequence library is kept
211 in the directory: /usr/lib/genbank, then fastgbs might contain:
213 NBRF Protein$0P/usr/lib/seq/aabank.lib 0
214 SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
215 GB Primate$1P@/usr/lib/genbank/gpri.nam
216 GB Rodent$1R@/usr/lib/genbank/grod.nam
217 GB Mammal$1M@/usr/lib/genbank/gmammal.nam
221 The first line of this file says that there is a copy of the NBRF
222 protein sequence database (which is a protein database) that can
223 be selected by typing "P" on the command line or when the
224 database menu is presented in the file /usr/lib/seq/aabank.lib.
226 Note that there are 4 or 5 fields in the lines in fastgbs.
227 The first field is the description of the library which will be
228 displayed by FASTA; it ends with a '$'. The second field (1
229 character), is a 0 if the library is a protein library and 1 if
230 it is a DNA library. The third field (1 character) is the
231 character to be typed to select the library.
233 The fourth field is the name of the library file. In the
234 example above, the /usr/lib/seq/aabank.lib file contains the
235 entire protein sequence library. However the DNA library file
236 names are preceded by a '@', because these files (gpri.nam,
237 grod.nam, gmammal.nam) do not contain the sequences; instead they
238 contain the names of the files which contain the sequences. This
239 is done because the GENBANK DNA database is broken down in to a
240 large number of smaller files. In order to search the entire
241 primate database, you must search more than a dozen files.
243 In addition, an optional fifth field can be used to specify
244 the format of the library file. Alternatively, you can specify
245 the library format in a file of file names (a file preceded by an
246 '@'). This field must be separated from the file name by a space
247 character (' ') from the filename. In the example above, the
248 aabank.lib file is in Pearson/FASTA format, while the swiss.seq
249 file is in PIR/VMS format (from the EMBL CD-ROM). Currently,
250 FASTA can read the following formats:
252 0 Pearson/FASTA (>SEQID - comment/sequence)
253 1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
254 2 NBRF CODATA (ENTRY/SEQUENCE)
255 3 EMBL/SWISS-PROT (ID/DE/SQ)
256 4 Intelligenetics (;comment/SEQID/sequence)
257 5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
258 6 GCG (version 8.0) Unix Protein and DNA (compressed)
259 11 NCBI Blast1.3.2 format (unix only)
260 12 NCBI Blast2.0 format (unix only, fasta32t08 or later)
262 In particular, this version will work with the EMBL and PIR VMS
263 formats that are distributed on the EMBL CD-ROM. The latter
264 format (PIR VMS) is much faster to search than EMBL format. This
265 release also works with the protein and DNA database formats
266 created for the BLASTP and BLASTN programs by SETDB and PRESSDB
267 and with the new NCBI search format. If a library format is not
268 specified, for example, because you are just comparing two
269 sequences, Pearson/FASTA (format 0) is used by default. To
270 specify a library type on the command line, add it to the library
271 filename and surround the filename and library type in quotes:
273 fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"
275 You can specify a group of library files by putting a '@'
276 symbol before a file that contains a list of file names to be
277 searched. For example, if @gmam.nam is in the fastgbs file, the
278 file "gmam.nam" might contain the lines:
288 In this case, the line beginning with a '<' indicates the
289 directory the files will be found in. The remaining lines name
290 the actual sequence files. So the first sequence file to be
293 /usr/lib/genbank/gbpri.seq
295 The notation "<PIRNAQ:" might be used under the VAX/VMS operating
296 system. Under UNIX, the trailing '/' is left off, so the library
297 directory might be written as "</usr/seqlib".
299 The FASTA programs can search a database composed of
300 different files in different sequence formats. For example, you
301 may wish to search the Genbank files (in GenBank flat file
302 format) and the EMBL DNA sequence database on CD-ROM. To do
303 this, you simply list the names and filetypes of the files to be
304 searched in a file of filenames. For example, to search the
305 mammalian portion of Genbank, the unannotated portion of Genbank,
306 and the unannotated portion of the EMBL library, you could use
311 # (this '#' causes the program to display the size of the library)
321 You do not need to include library format numbers if you
322 only use the Pearson/FASTA version of the PIR protein se-
323 quence library. If no library type is specified, the
324 program assumes that type 0 is being used.
326 Test the setup by running FASTA. Enter the sequence file
327 'mgstm1.aa' when the program requests it (this file is included
328 with the programs). The program should then ask you to select a
329 protein sequence library. Alternatively, if you run the TFASTA
330 program and use the mgstm1.aa query sequence, the program should
331 show you a selection of DNA sequence libraries. Once the fastgbs
332 file has been set up correctly, you can set FASTLIBS=fastgbs in
333 your AUTOEXEC.BAT file, and you will not need to remember where
334 the libraries are kept or how they are named.
336 3. Using the FASTA Package
340 The FASTA sequence comparison programs all require similar
341 information, the name of a query sequence file, a library file,
342 and the ktup parameter. All of the programs can accept arguments
343 on the command line, or they will prompt for the file names and
346 To use FASTA, simply type:
349 and you will be prompted for :
350 the name of the test sequence file
351 the name of the library file
352 and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
353 (ktup of 2 is about 5 times faster than ktup = 1)
355 The program can also be run by typing
357 FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
359 Included with the package are several test files. To check to
360 make certain that everything is working, you can try:
362 fasta musplfm.aa prot_test.lib
364 tfastx mgstm1.aa gst.nlib
368 The fasta3 programs know about three kinds of sequence
369 files: (1) plain sequence files - files that contain nothing but
370 sequence residues - can only be used as query sequences. (2)
371 FASTA format files. These are the same as plain sequence files,
372 each sequence is preceded by a comment line with a '>' in the
373 first column. (3) distributed sequence libraries (this is a broad
374 class that includes the NBRF/PIR VMS and blocked ascii formats,
375 Genbank flat-file format, EMBL flat-file format, and
376 Intelligenetics format. All of the files that you create should
377 be of type (1) or (2). FASTA format files (ones with a '>' and
378 comment before the sequence) are preferred, because they can be
379 used as query or library sequence files by all of the programs.
381 I have included several sample test files, *.aa and *.seq as
382 well as two small sequence libraries, prot_test.lib and gst.nlib.
383 The first line may begin with a '>' by a comment. Spaces and
384 tabs (and anything else that is not an amino-acid code) are
387 Library files should have the form:
389 >Sequence name and identifier
390 A F A S Y T .... actual sequence.
391 F S S .... second line of sequence.
392 >Next sequence name and identifier
394 This is often referred to as "FASTA" or format. You can build
395 your own library by concatenating several sequence files. Just
396 be sure that each sequence is preceded by a line beginning with a
397 '>' with a sequence name.
399 The test file should not have lines longer than 120
400 characters, and sequences entered with word processors should use
401 a document mode, with normal carriage returns at the end of
404 A different format is required to specify the ordered
405 peptide mixture for fastf3/tfastf3. For example:
413 indicates m in the first position of all three peptides (as from
414 CNBr), G, I, L (twice) in the second position (first cycle),
415 C,D,L (twice) in the third position, etc. The commas (,) are
416 required to indicate the number of fragments in the mixture, but
417 there should be no comma after the last residue.
419 For the fasts3/tfasts3 program, the format is the same,
420 except that there is no requirement for the peptides to be the
423 4. Statistical Significance
425 All the programs in the FASTA3 package attempt to calculate
426 accurate estimates of the statistical significance of a match.
427 For fasta3, ssearch3, and fastx3/y3, these estimates are very
428 accurate (Pearson, 1998, Zhang et al., 1997).. Altschul et al.
429 (Altschul et al., 1994) provides an excellent review of the
430 statistics of local similarity scores. Local sequence similarity
431 scores follow the extreme value distribution, so that P(s > x) =
432 1 - exp(-exp(-lambda(x-u)) where u = ln(Kmn)/lambda and m,m are
433 the lengths of the query and library sequence. This formula can
434 be rewritten as: 1 - exp(-Kmn exp(-lambda x), which shows that
435 the average score for an unrelated library sequence increases
436 with the logarithm of the length of the library sequence. The
437 fasta3 programs use simple linear regression against the the log
438 of the library sequence length to calculate a normalized "z-
439 score" with mean 50, regardless of library sequence length, and
440 variance 10. (Several other estimation methods are available with
441 the -z option.) These z-scores can then be used with the extreme
442 value distribution and the poisson distribution (to account for
443 the fact that each library sequence comparison is an independent
444 test) to calculate the number of library sequences to obtain a
445 score greater than or equal to the score obtained in the search.
446 The original idea and routines to do the linear regression on
447 library sequence length were provided Phil Green, U. Washington.
448 This version uses a slightly different strategy for fitting the
449 data than those originally provided by Dr. Green.
451 The expected number of sequences is plotted in the histogram
452 using an "*". Since the parameters for the extreme value
453 distribution are not calculated directly from the distribution of
454 similarity scores, the pattern of "*'s" in the histogram gives a
455 qualitative view of how well the statistical theory fits the
456 similarity scores calculated by the programs. For fasta3, if
457 optimized scores are calculated for each sequence in the database
458 (the default), the agreement between the actual distribution of
459 "z-scores" and the expected distribution based on the length
460 dependence of the score and the extreme value distribution is
461 usually very good. Likewise, the distribution of ssearch3 Smith-
462 Waterman scores typically agrees closely with the <actual
463 distribution of "z-scores." The agreement with unoptimized
464 scores, ktup=2, is often not very good, with too many high
465 scoring sequences and too few low scoring sequences compared with
466 the predicted relationship between sequence length and similarity
467 score. In those cases, the expectation values may be
470 With version 33t01, all the FASTA programs also report a
471 "bit" score, which is equivalent to the bit score reported by
472 BLAST2. The FASTA33/BLAST2 bit score is calculated as: (lambda*S
473 - ln K)/ln 2, where S is the raw similarity score, lambda and K
474 are statistical parameters estimated from the distribution of
475 unrelated sequence similarity scores. The statistical
476 signficance of a given bit score depends on the lengths of the
477 query and library sequences and the size of the library, but a 1
478 bit increase in score corresponds to a 2-fold reduction in
479 expectation; a 10-bit increase implies 1000-fold lower
482 The statistical routines assume that the library contains a
483 large sample of unrelated sequences. If this is not true, then
484 statistical parameters can be estimated by using the -z 11-15,
485 options. -z options greater than 10 calculate a shuffled
486 similarity score for each library sequence, in addition to the
487 unshuffled score, and estimate the statistical parameters from
488 the scores of the shuffled sequences. If there are fewer than 20
489 sequences in the library, the statistical calculations are not
492 For protein searches, library sequences with E() values <
493 0.01 for searches of a 10,000 entry protein database are almost
494 always homologous. Frequently sequences with E()-values from 1 -
495 10 are related as well, but unrelated sequences ( 1 - 10 per
496 search) will have scores in this renage as well. Remember,
497 however, that these E() values also reflect differences between
498 the amino acid composition of the query sequence and that of the
499 "average" library sequence. Thus, when searches are done with
500 query sequences with "biased" amino-acid composition, unrelated
501 sequences may have "significant" scores because of sequence bias.
502 PRSS3 can address this problem by calculating similarity scores
503 for random sequences with the same length and amino acid
508 Command line options are available to change the scoring
509 parameters and output display. Command line options must preceed
510 other program arguments, such as the query and library file
513 5.1. Command line options
515 -a (fasta3, ssearch3 only) show both sequences in their
518 -A force Smith-Waterman alignments for fasta3 DNA sequences.
519 By default, only fasta3 protein sequence comparisons use
520 Smith-Waterman alignments.
522 -B Show normalized score as a z-score, rather than a bit-score
523 in the list of best scores.
525 -b # Number of sequence scores to be shown on output. In the
526 absence of this option, fasta (and tfasta and ssearch)
527 display all library sequences obtaining similarity scores
528 with expectations less than 10.0 if optimized score are
529 used, or 2.0 if they are not. The -b option can limit the
530 display further, but it will not cause additional sequences
533 -c # Threshold score for optimization (OPTCUT). Set "-c 1" to
534 optimize every sequence in a database.
536 -E # Limit the number of scores and alignments shown based on the
537 expected number of scores. Used to override the expectation
538 value of 10.0 used by default. When used with -Q, -E 2.0
539 will show all library sequences with scores with an
540 expectation value <= 2.0.
542 -d # Maximum number of alignments to be displayed. Ignored if
545 -f Penalty for the first residue in a gap (-12 by default for
546 proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).
548 -F # Limit the number of scores and alignments shown based on the
549 expected number of scores. "-E #" sets the highest E()-value
550 shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001"
551 will not show any matches or alignments with E() < 0.0001.
552 This allows one to skip over close relationships in searches
553 for more distant relationships.
555 -g Penalty for additional residues in a gap (-2 by default for
556 proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
558 -h Penalty for frameshift (fastx3/y3, tfastx3/y3 only).
562 -i Invert (reverse complement) the query sequence if it is DNA.
563 For tfasta3/x3/y3, search the reverse complement of the
564 library sequence only.
566 -j # Penalty for frameshift within a codon (fasty3/tfasty3 only).
569 Location of library menu file (FASTLIBS).
571 -L Display more information about the library sequence in the
575 Range of amino acid sequence lengths to be included in the
578 -m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10
580 -m 0 -m 1 -m 2 -m 3 -m 4
581 MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT
582 ::..:: ::: xx X ..KS..Y... MWKSCGYPYT ----------
583 MWKSCGYPYT MWKSCGYPYT
585 -m 5 provides a combination of -m 4 and -m 0. -m 6 provides
586 -m 5 plus HTML formatting.
588 -m 9 provides coordinates and scores with the best score
589 information. A simple " -m 9 extends the normal best score
592 The best scores are: opt bits E(14548)
593 XURTG4 glutathione transferase (EC 2.5.1.18) 4 - ( 219) 1248 291.7 1.1e-79
595 to include the additional information (on the same line,
596 separated by a <tab>):
598 %_id %_gid sw alen an0 ax0 pn0 px0 an1 ax1 pn1 px1 gapq gapl fs
599 0.771 0.771 1248 218 1 218 1 218 1 218 1 219 0 0 0
601 -m 9c provides additional information: an encoded alignment
605 GT8.7 NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
606 :.:: . :: :: . .::: : .: ::.: .: : ..:.. ::: :..:
607 XURTG NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
612 =23+9=13-2=10-1=3+1=5
614 The alignment encoding is with repect to the alignment, not
615 the sequences. The coordinate of the alignment is given
616 earlier in the " -m 9c" line.
619 -m 10 is a new, parseable format for use with other
620 programs. See the file "readme.v20u4" for a more complete
623 As of version "fa34t23b2", it has become possible to combine
624 independent "-m" options. Thus, one can use "-m 1 -m 6 -m
628 Include library sequences (proteins only) with lengths
629 between low and high.
631 -n Force the query sequence to be treated as a DNA sequence.
632 This is particularly useful for query sequences that contain
633 a large number of ambiguous residues, e.g. transcription
634 factor binding sites.
636 -O Send copy of results to "filename." Helpful for
637 environments without STDOUT (mostly for the Macintosh).
639 -o Turn off default optimization of all scores greater than
640 OPTCUT. Sort results by "initn" scores (reduces the accuracy
641 of statistical estimates).
643 -p Force query to be treated as protein sequence.
646 Quiet - does not prompt for any input. Writes scores and
647 alignments to the terminal or standard output file.
649 -r Specify match/mismatch scores for DNA comparisons. The
650 default is "+5/-4". "+3/-2" can perform better in some
654 Save a results summary line for every sequence in the
655 sequence library. The summary line includes the sequence
656 identifier, superfamily number (if available) position in
657 the library, and the similarity scores calculated. This
658 option can be used to evaluate the sensitivity and
659 selectivity of different search strategies (Pearson, 1995,
663 Specify the scoring matrix file. fasta3 uses the same
664 scoring matrices as Blast1.4/2.0. Several scoring matrix
665 files are included in the standard distribution. For
666 protein sequences: codaa.mat - based on minimum mutation
667 matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250
668 matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
669 pam120.mat - a PAM120 matrix. The default scoring matrix is
670 BLOSUM50 ("-s BL50"). Other matrices available from within
671 the program are: PAM250/"-s P250", PAM120/"-s P120",
672 PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40"
673 (MDM are modern PAM matrices from Jones et al. (Jones et
674 al., 1992),), BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s
677 -S Treat lower-case characters in the query or library
678 sequences as "low-complexity" ("seg"-ed) residues.
679 Traditionally, the "seg" program (Wootton and
680 Federhen, 1993) is used to remove low complexity regions in
681 DNA sequences by replacing the residues with an "X". When
682 the "-S" option is used, the FASTA33 programs provide a
683 potentially more informative approach. With "-S", lower
684 case characters in the query or database sequences are
685 treated as "X"'s during the initial scan, but are treated as
686 normal residues during the final alignment display. Since
687 statistical significance is calculated from the similarity
688 score calculated during the library search, when the lower
689 case residues are "X"'s, low complexity regions will not
690 produce statistically significant matches. However, if a
691 significant alignment contains low complexity regions, their
692 alignmen is shown. With "-S", lower case characters may be
693 included in the alignment to indicate low complexity
694 regions, and the final alignment score may be higher than
695 the score obtained during the search.
697 The pseg program can be used to produce databases (or query
698 sequences) with lower case residues indicating low
699 complexity regions using the command:
701 pseg database.fasta -z 1 -q > database.lc_seg
703 (seg can also be used with some post processing, see
706 -U Treat the query sequence an RNA sequence. In addition to
707 selecting a DNA/RNA alphabet, this option causes changes to
708 the scoring matrix so that 'G:A' , 'T:C' or 'U:C' are scored
712 It is now possible to specify some annotation characters
713 that can be included (and will be ignored), in the query
714 sequence file. Thus, One might have a file with:
715 "ACVS*ITRLFT?", where "*" and "?" are used to indicate
716 phosphorylation. By giving the option -V '*?', those
717 characters in the query will be moved to an "annotation
718 string", and alignments that include the annotated residues
719 will be highlighted with the appropriate character above the
720 sequence (on the number line).
722 -w # Line length (width) = number (<200)
724 -W # context length (default is 1/2 of line width -w) for
725 alignment, like fasta and ssearch, that provide additional
728 -x # Specify the penalty for a match to an 'X', independently of
729 the PAM matrix. Particularly useful for fastx3/fasty3,
730 where termination codons are encoded as 'X'.
732 -X Specifies offsets for the beginning of the query and library
733 sequence. For example, if you are comparing upstream
734 regions for two genes, and the first sequence contains 500
735 nt of upstream sequence while the second contains 300 nt of
736 upstream sequence, you might try:
738 fasta -X "-500 -300" seq1.nt seq2.nt
740 If the -X option is not used, FASTA assumes numbering starts
741 with 1. (You should double check to be certain the negative
742 numbering works properly.)
744 -y Set the width of the band used for calculating "optimized"
745 scores. For proteins and ktup=2, the width is 16. For
746 proteins with ktup=1, the width is 32 by default. For DNA
750 -z -1 turns off statistical calculations. z 0 estimates the
751 significance of the match from the mean and standard
752 deviation of the library scores, without correcting for
753 library sequence length. -z 1 (the default) uses a weighted
754 regression of average score vs library sequence length; -z 2
755 uses maximum likelihood estimates of Lambda and K; -z 3 uses
756 Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 - 5
757 uses two variations on the -z 1 strategy. -z 1 and -z 2 are
758 the best methods, in general.
761 estimate the statistical parameters from shuffled copies of
762 each library sequence. This doubles the time required for a
763 search, but allows accurate statistics to be estimated for
764 libraries comprised of a single protein family.
767 set the apparent size of the database to be used when
768 calculating expectation E() values. If you searched a
769 database with 1,000 sequences, but would like to have the
770 E()-values calculated in the context of a 100,000 sequence
771 database, use '-Z 100000'.
773 -1 sort output by init1 score (for compatibility with FASTP -
776 -3 translate only three forward frames
780 fasta -w 80 -a seq1.aa seq.aa
782 would compare the sequence in seq1.aa to that in seq2.aa and
783 display the results with 80 residues on an output line, showing
784 all of the residues in both sequences. Be sure to enter the
785 options before entering the file names, or just enter the options
786 on the command line, and the program will prompt for the file
789 (November, 1997) In addition, it is now possible to provide
790 the fasta programs with the query sequence (fasta, fasty,
791 ssearch, tfastx), or two sequences (prss, lalign, plalign) from
792 the unix "stdin" stream. This makes it much easier to set up
793 FASTA or PRSS WWW pages. To specify that stdin be used, rather
794 than a file, the file name should be specified as '-' or '@' (the
795 latter file name makes it possible to specify a subset of the
798 cat query.aa | fasta -q @:25-75 s
800 would take residues 25-75 from query.aa and search the 's'
801 library (see the discussion of FASTLIBS).
803 5.2. Environment variables
805 Because the current version of the program allows the user
806 to set virtually every option on the command line (except the
807 ktup, which must be set as the third command line argument), only
808 the FASTLIBS environment variable is routinely used.
811 specifies the location of the file which contains the list
812 of library descriptions, locations, and library types (see
813 section on finding library files).
815 6. Frequently Asked Questions
817 (1) Which program should I use? See Table I.
819 (2) How do I search with both DNA strands with fasta3 and
820 fastx3? With version 32 of the FASTA program package, all
821 searches that use DNA queries (e.g. fasta3, fastx3/y3)
822 examine both strands. To revert to earlier FASTA behavior
823 - only looking at the forward or reverse strand - use -3
824 to search only the forward strand and -i -3 to search only
827 (3) When I search Genbank - the program reports: 0 residues in
828 0 sequences. This typically happens because the program
829 does not know that you are searching a Genbank flatfile
830 database and is looking for a FASTA format database. Be
831 certain to specify the library type ("1" for Genbank
832 flatfile) with the database name.
834 (4) What is the difference between fastx3 and fasty3 (or
835 tfastx3 and tfasty3). [t]fastx3 uses a simpler codon
836 based model for alignments that does not allow frameshifts
837 in some codon positions (see ref. (Zhang et al., 1997)).
838 tfastx3 is about 30% faster, but tfasty3 can produce
839 higher quality alignments in some cases.
841 (5) When I run fasta3 -q, I don't see any (or very little)
842 output, but I get lots of scores when I run interactively.
843 With the -Q option, the number of high scores displayed is
844 limited by the -E # cutoff, which is 10.0 for protein
845 comparisons, 2.0 for DNA comparisons, and 5.0 for
846 translated DNA:protein comparisons. In interactive mode
847 (without -Q), by default you see 20 high scores,
848 regardless of E() value.
850 (6) What is ktup - All of the programs with fast in their name
851 use a computer science method called a lookup table to
852 speed the search. For proteins with ktup=2, this means
853 that the program does not look at any sequence alignment
854 that does not involve matching two identical residues in
855 both sequences. Likewise with DNA and ktup = 6, the
856 initial alignment of the sequences looks for 6 identical
857 adjacent nucleotides in both sequences. Because it is
858 less likely that two identical amino-acids will line up by
859 chance in two unrelated proteins, this speeds up the
860 comparison. But very distantly related sequences may
861 never have two identical residues in a row but will have
862 single aligned identities. In this case, ktup = 1 may
863 find alignments that ktup=2 misses.
865 (7) Sometimes, in the list of best scores, the same sequence
866 is shown twice with exactly the same score. Sometimes,
867 the sequence is there twice, but the scores are slightly
868 different. When any of the fasta3 programs searches a long
869 sequence, it breaks the sequence up into overlapping
870 pieces. The length of the piece depends on the length of
871 the query and the particular program being used (it can
872 also be controlled with the -N #### option). Since the
873 pieces overlap by the length of the query sequence (or
874 3*query_length for fastx/y3 and tfasta/x/y3), if the
875 highest scoring alignment is at the end of one piece, it
876 will be scored again at the beginning of the next piece.
877 If the alignment is not be completely included in the
878 overlap region, one of the pieces will give a higher score
879 than the other. These duplications can be detected by
880 looking at the coordinates of the alignment. If either
881 the beginning or end coordinate is identical in two
882 alignments, the alignments are at least partially
885 As always, please inform me of bugs as soon as possible.
888 Department of Biochemistry
889 Jordan Hall Box 800733
897 Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C.
898 (1994). Issues in searching molecular sequence databases. Nature
901 Altschul, S. F. and Gish, W. (1996). Local alignment statistics.
902 Methods Enzymol. 266,460-480.
904 Bairoch, A. and Apweiler, R. (1996). The Swiss-Prot protein
905 sequence data bank and its new supplement TrEMBL. Nucleic Acids.
908 Barker, W. C., Garavelli, J. S., Haft, D. H., Hunt, L. T.,
909 Marzec, C. R., Orcutt, B. C., Srinivasarao, G. Y., Yeh, L. S. L.,
910 Ledley, R. S., Mewes, H. W., Pfeiffer, F., and Tsugita, A.
911 (1998). The PIR-International Protein Sequence Database. Nucleic
914 Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978). A model
915 of evolutionary change in proteins. In Atlas of Protein Sequence
916 and Structure, vol. 5, supplement 3. M. Dayhoff, ed. (Silver
917 Spring, MD: National Biomedical Research Foundation), pp.
920 Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The
921 rapid generation of mutation data matrices from protein
922 sequences. Comp. Appl. Biosci. 8,275-282.
924 Pearson, W. R. (2000). Flexible similarity searching with the
925 FASTA3 program package. In Bioinformatics Methods and Protocols,
926 S. Misener and S. A. Krawetz, ed. (Totowa, NJ: Humana Press), pp.
929 Pearson, W. R. and Lipman, D. J. (1988). Improved tools for
930 biological sequence comparison. Proc. Natl. Acad. Sci. USA
933 Pearson, W. R. (1995). Comparison of methods for searching
934 protein sequence databases. Prot. Sci. 4,1145-1160.
936 Pearson, W. R. (1996). Effective protein sequence comparison.
937 Methods Enzymol. 266,227-258.
939 Pearson, W. R. (1998). Empirical statistical estimates for
940 sequence similarity searches. J. Mol. Biol. 276,71-84.
942 Smith, T. F. and Waterman, M. S. (1981). Identification of common
943 molecular subsequences. J. Mol. Biol. 147,195-197.
945 Wootton, J. C. and Federhen, S. (1993). Statistics of local
946 complexity in amino acid sequences and sequence databases.
947 Comput. Chem. 17,149-163.
949 Zhang, Z., Pearson, W. R., and Miller, W. (1997). Aligning a DNA
950 sequence with a protein sequence. J. Computational Biology