4 Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R.
5 Pearson and the University of Virginia. All rights reserved. The
6 FASTA program and documentation may not be sold or incorporated
7 into a commercial product, in whole or in part, without written
8 consent of William R. Pearson and the University of Virginia.
9 For further information regarding permission for use or
10 reproduction, please contact: David Hudson, Assistant Provost for
11 Research, University of Virginia, P.O. Box 9025, Charlottesville,
12 VA 22906-9025, (434) 924-6853
15 The FASTA program package
This documentation describes version 2.0x of the FASTA
program package (see W. R. Pearson and D. J. Lipman (1988),
"Improved Tools for Biological Sequence Analysis", PNAS
85:2444-2448, and W. R. Pearson (1990) "Rapid and Sensitive
Sequence Comparison with FASTP and FASTA" Methods in Enzymology
183:63-98). Version 2.0 modifies version 1.8 to include explicit
statistical estimates for similarity scores based on the extreme
value distribution. In addition, FASTA protein alignments now
use the Smith-Waterman algorithm with no limitation on gap size.
FASTA and SSEARCH now use the BLOSUM50 matrix by default, with
options to change gap penalties on the command line. Version 1.7
replaces rdf2 and rss with prdf and prss, which use the
extreme-value distribution to calculate accurate probability
estimates.
34 Although there are a large number of programs in this package,
35 they belong to four groups:
38 Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH
40 Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN
42 Statistical significance: PRDF, RELATE, PRSS, RANDSEQ
44 Global alignment: ALIGN
48 In addition, I have included several programs for protein
49 sequence analysis, including a Kyte-Doolittle hydropathicity
50 plotting program (GREASE, TGREASE), and a secondary structure
51 prediction package (GARNIER).
The FASTA sequence comparison programs on this disk are
improved versions of the FASTP program, originally described in
Science (Lipman and Pearson, (1985) Science 227:1435-1441). We
have made several improvements. First, the library search
programs use a more sensitive method for the initial comparison
of two sequences, which allows the scores of several similar
regions to be combined. As a result, a library search now
reports three scores: initn (the new initial score, which may
include several similar regions), init1 (the old fastp initial
score from the best initial region), and opt (the old fastp
optimized score allowing gaps in a 32 residue wide band).
66 These programs have also been modified to become "universal"
67 (hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or
68 FAST-N (nucleotides)); by changing the environment variable
69 SMATRIX, the programs can be used to search protein sequences,
70 DNA sequences, or whatever you like. By default, FASTA, LFASTA,
71 and the PRDF programs automatically recognize protein and DNA
72 sequences. Sequences are first read as amino acids, and then
73 converted to nucleotides if the sequence is greater than 85%
74 A,C,G,T (the '-n' option can be used to indicate DNA sequences).
75 TFASTA compares protein sequences to a translated DNA sequence.
76 Alternative scoring matrices can also be used. In addition to
77 the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices
78 based on simple identities or the genetic code can also be used
79 for sequence comparisons or evaluation of significance. Several
80 different protein sequence matrices have been included;
81 instructions for constructing your own scoring matrix are
82 included in the file FORMAT.DOC.
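
The automatic recognition of DNA sequences described above can be
sketched in a few lines of Python. This is only an illustration of
the 85% rule, not the package's own C code: the function name
looks_like_dna is hypothetical, and ignoring non-letter characters
before counting is an assumption.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Return True if more than `threshold` of the residues are A, C, G or T.

    Assumption: characters that are not letters (spaces, digits, '*')
    are dropped before counting.
    """
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for ch in residues if ch in "ACGT")
    return acgt / len(residues) > threshold


if __name__ == "__main__":
    print(looks_like_dna("ATGGCGTACGTTAGC"))      # nucleotide-like input
    print(looks_like_dna("MKVLAAGIVGLLLAQPSFA"))  # protein-like input
```

A short peptide that happens to be rich in A, C, G and T would be
misread by a rule like this, which is one reason an explicit option
such as '-n' is useful.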
85 The remainder of this document is divided into three sections:
86 (1) a brief history of the changes to the FASTA package; (2) A
87 guide to installing the programs and databases; (3) A guide to
88 using the FASTA programs. The programs are very easy to use, so
89 if you are using them on a machine that is administered by
90 someone else, you may want to skip to section (3) to learn how to
91 use the programs, and then read section (1) to look at some of
92 the more recent changes. If you are installing the programs on
93 your own machine, you will need to read section (2) carefully.
98 1.1. Changes with version 2.0u
100 Version 2.0u provides several major improvements over
101 previous versions of FASTA (and SSEARCH). The most important is
102 the incorporation of explicit statistical estimates and
103 appropriate normalization of similarity scores. This improvement
104 is discussed in more detail below in the section entitled
105 Statistical Significance. In addition, all of the protein
106 comparison programs now use the BLOSUM50 matrix, with gap
107 penalties of -12, -2, by default. BLOSUM50 performs
108 significantly better than the older PAM250 matrix. PAM250 can
109 still be used with the command line option: -s 250. (DNA
110 sequence comparisons use a more stringent gap penalty of -16, -4,
111 which produces excellent statistical estimates when optimized
112 scores are used. TFASTA uses -16, -4 as well.)
114 The quality of the fit of the extreme value distribution to
115 the actual distribution of similarity scores is summarized with
116 the Kolmogorov-Smirnov statistic. The acceptance limits for this
117 statistic can be found in many statistics books. In general,
118 values <0.10 (N=30) indicate excellent agreement between the
119 actual and theoretical distributions. If this statistic is >
120 0.2, consider using a higher (more stringent) gap penalty, e.g.
121 -16, -4 rather than -12, -2. The default scoring matrix for DNA
122 has been changed to score +5 for an identity and -4 for a
123 mismatch. These are the same scores used by BLASTN.
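
As a rough illustration of the goodness-of-fit check described
above, the following Python sketch computes a Kolmogorov-Smirnov
statistic for a set of scores against an extreme value
distribution. It is not the routine used by FASTA, which is not
shown in this document; the function names, the use of 30 sample
points, and the parameter values lam and u are all assumptions
chosen for the example.

```python
import math
import random


def gumbel_cdf(x, lam, u):
    """Extreme value CDF: P(S <= x) = exp(-exp(-lam * (x - u)))."""
    return math.exp(-math.exp(-lam * (x - u)))


def ks_statistic(scores, cdf):
    """Largest absolute gap between the empirical CDF of `scores` and `cdf`."""
    ordered = sorted(scores)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        theoretical = cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst,
                    abs((i + 1) / n - theoretical),
                    abs(i / n - theoretical))
    return worst


if __name__ == "__main__":
    random.seed(1)
    lam, u = 0.25, 30.0  # illustrative parameters, not fitted values
    # Draw 30 scores from the same distribution by inverting the CDF.
    sample = [u - math.log(-math.log(random.uniform(1e-12, 1 - 1e-12))) / lam
              for _ in range(30)]
    d = ks_statistic(sample, lambda x: gumbel_cdf(x, lam, u))
    print("KS statistic: %.3f" % d)
```

A small value means the observed scores track the theoretical
curve closely; a large value suggests the fit is poor.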
125 With explicit expectation calculations, the program now
126 shows all scores and alignments with expectations less than 10.0
127 (with optimized scores, 2.0 without optimization) when the "-Q"
128 (quiet) mode is used. The expectation threshold can be changed
129 with the "-E" option.
131 Finally, the algorithm used to produce the final alignments
132 of protein sequences is now a full Smith-Waterman, with unlimited
133 gaps. (The older band-limited alignments are used for DNA
134 sequences and TFASTA by default, because Smith-Waterman
135 alignments are very slow for long sequences.) Both the optimized
136 and Smith-Waterman scores are reported; if the Smith-Waterman
137 score is higher, then additional gaps allowed a better alignment
138 and similarity score to be calculated.
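
For readers unfamiliar with the Smith-Waterman calculation, here
is a minimal Python sketch of the local similarity score with
unlimited gaps. It is not the code used by FASTA or SSEARCH. It
assumes that the two gap penalties mean -12 for the first residue
in a gap and -2 for each additional residue, and it uses a simple
match/mismatch scoring function in place of BLOSUM50; both the
function names and the example sequences are hypothetical.

```python
def smith_waterman_score(a, b, score, gap_first=-12, gap_extend=-2):
    """Best local alignment score of a and b with affine gap penalties.

    A gap of k residues costs gap_first + (k - 1) * gap_extend.
    Only the score is returned; no alignment is reconstructed.
    """
    neg = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols    # best score of an alignment ending at (i-1, j)
    f_prev = [neg] * cols  # same, but ending in a gap that skips residues of a
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg] * cols
        e = neg            # best score ending in a gap that skips residues of b
        for j in range(1, cols):
            # Open a new gap from the neighbouring cell or extend the old one.
            e = max(h_cur[j - 1] + gap_first, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_first, f_prev[j] + gap_extend)
            diag = h_prev[j - 1] + score(a[i - 1], b[j - 1])
            # A local alignment may restart anywhere, hence the floor at 0.
            h_cur[j] = max(0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def identity_score(x, y):
    """Illustrative scoring: +5 for an identity, -4 for a mismatch."""
    return 5 if x == y else -4


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "ACGT", identity_score))
    print(smith_waterman_score("AAAAACCCCC", "AAAAAGGGCCCCC", identity_score))
```

In the second call the best alignment has to span a three-residue
gap, which is the situation where an unlimited-gap score can exceed
a band-limited one.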
140 FASTA searches now optimize similarity scores by default
141 (this slows searches about 2-fold (worst case) for ktup=2). Thus,
142 the meaning of the "-o" option has been reversed; "-o" now turns
143 off optimization and reports results sorted by "initn" scores.
144 Optimization significantly improves the sensitivity of FASTA, so
145 that it almost matches Smith-Waterman. With version 2.0, the
146 default band width used for optimized calculations can be varied
147 with the "-y" option. For proteins with ktup=2, a width of 16
148 (-y 16) is used; 16 is also used for DNA sequences. For proteins
149 and ktup=1, a width of 32 is used. Searches that disable
150 optimization with the "-o" option will work fine for sequences
151 that share 25% or more identity in general, but to detect
152 evolutionary relationships with 20% - 25% identity, the more
153 sensitive default optimization is often required. Optimization
154 is required for accurate statistical estimates with either
155 protein or DNA sequences.
157 The FASTA package now includes FASTX, a program that
158 compares a DNA sequence to a protein sequence database by
159 translating the DNA sequence in three frames (the reverse frames
160 are selected with the -i option) and aligning the three-frame
161 translation with the sequences in the protein database.
162 Alignment scores allow frameshifts so that a cDNA or EST sequence
163 with insertion/deletion errors can be aligned with its homologues
164 from beginning to end.
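
The three-frame translation step can be pictured with the
following Python sketch. It only shows how the three forward
frames (or the three reverse frames) of a DNA query are produced;
it does not attempt the frameshift-tolerant scoring that FASTX
performs. The function names are hypothetical, the standard
genetic code is assumed, and codons containing ambiguous bases
are translated as 'X' by assumption.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate `dna` starting at offset `frame` (0, 1 or 2)."""
    dna = dna.upper()
    protein = []
    for start in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[start:start + 3], "X"))
    return "".join(protein)


def three_frames(dna, reverse=False):
    """Translate the three forward frames, or the three reverse frames."""
    if reverse:
        dna = reverse_complement(dna)
    return [translate(dna, frame) for frame in range(3)]


if __name__ == "__main__":
    for frame, protein in enumerate(three_frames("ATGGCCTAA")):
        print(frame, protein)
```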
With release 20u6, there is also a TFASTX program, which is
a replacement for TFASTA. TFASTA treats each of the six reading
frames of a DNA library sequence as a different sequence; TFASTX
compares a protein sequence against only two sequences from each
DNA sequence - the forward and reverse orientation. For a given
orientation, TFASTX calculates a similarity score for alignments
that allow frameshifts, thus considering all possible reading
frames.
Another new program is included - randseq - which will
produce a randomly shuffled sequence (uniform or local shuffle)
from an input sequence. This randomly shuffled sequence can be
used to evaluate the statistical estimates produced by FASTA,
SSEARCH, or
181 1.2. Changes with version 1.7
182 Version 1.7 has been released to provide the PRDF and PRSS
183 programs for shuffling sequences and estimating accurately the
184 probabilities of the unshuffled-sequence scores.
PRDF    a version of RDF2 that calculates the probability
        of a similarity score more accurately by using a fit to
        an extreme value distribution. Code to fit the extreme
        value distribution parameters and the impetus to update
        RDF2 was provided by Phil Green, U. of Washington.

PRSS    a version of PRDF that uses a rigorous Smith-Waterman
        calculation to score similarities
195 1.3. Changes with version 1.6
197 FASTA version 1.6 uses a new method for calculating optimal
198 scores in a band (the optimization or last step in the FASTA
199 algorithm). In addition, it uses a linear-space method for
200 calculating the actual alignments. FASTA v1.6 package includes
201 several new programs:
203 SSEARCH a program to search a sequence database using the
204 rigorous Smith-Waterman algorithm (this program is
205 about 100-fold slower than FASTA with ktup=2 (for
208 LALIGN A rigorous local sequence alignment program that will
209 display the N-best local alignments (N=10 by default).
211 PLALIGN a version of lalign that plots the local alignments to
214 FLALIGN a version of lalign that plots the local alignments to
217 The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim"
218 algorithm described by Huang and Miller (1991) Adv. Appl. Math.
219 12:337-357. The SSEARCH and PRSS programs incorporate algorithms
220 described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
222 LFASTA and PLFASTA now calculate a different number of local
223 similarities; they now behave more like LALIGN/PLALIGN. Since
224 local alignments of identical sequences produce "mirror-image"
225 alignments, lalign and lfasta consider only one-half of the
226 potential alignments between sequences from identical file names.
229 lfasta mchu.aa mchu.aa
displays only two alignments; with earlier versions of the
program, it would have displayed five, including the identity
alignment. PLFASTA does display five alignments; when two
identical filenames are given, it draws the identity alignment,
calculates the two unique local alignments, draws them, and draws
their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
filenames, rather than the actual sequences, to determine whether
sequences are identical; you can "trick" the programs into
behaving the old way by putting the same sequence in two
differently named files.
242 1.4. Changes with version 1.5
244 FASTA version 1.5 includes a number of substantial revisions
245 to improve the performance and sensitivity of the program. It is
246 now possible to tell the program to optimize all of the initn
247 scores greater than a threshold. The threshold is set at the
248 same value as the old FASTA cutoff score. Alternatively, you can
249 tell FASTA to sort the results by the init1, rather than the
250 initn, score by using the -1 option. FASTA -1 ... will report
251 the results the way the older FASTP program did.
253 A new method has been provided for selecting libraries. In
254 the past, one could enter the name of a sequence file to be
255 searched or a single letter that would specify a library from the
256 list included in the $FASTLIBS file. Now, you can specify a set
257 of library files with a string of letters preceded by a '%'.
258 Thus, if the FASTLIBS file has the lines:
260 Genbank 70 primates$1P/seqlib/gbpri.seq 1
261 Genbank 70 rodents$1R/seqlib/gbrod.seq 1
262 Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
263 Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1
265 Then the string: "%PRMB" would tell FASTA to search the four
266 libraries listed above. The %PRMB string can be entered either
267 on the command line or when the program asks for a filename or
270 FASTA1.5 also provides additional flexibility for specifying
271 the number of results and alignments to be displayed with the -Q
272 (quiet) option. The -b number option allows you to specify the
273 number of sequence scores to show when the search is finished.
tells the program to display the top 100 sequence scores. In the
past, if you displayed 100 scores (in -Q mode), you would also
have to store 100 alignments. The -d option allows you to limit
the number of alignments shown. FASTA -b 100 -d 20 would show 100
scores and 20 alignments.
286 Finally, FASTA can provide a complete list of all of the
287 sequences and scores calculated to a file with the -r (results)
288 option. FASTA -r results.out ... creates a file with a list of
289 scores for every sequence in the library. The list is not
290 sorted, and only includes those scores calculated during the
291 initial scan of the library.
293 2. Installing the FASTA package
295 2.1. Installing the programs
The FASTA distribution comes with several makefiles that
can be used to compile the FASTA programs. Over the years, as
AT&T Unix System V and BSD Unix have converged, these files have
become very similar. To begin with, I recommend using the
standard Makefile. There are two values in the makefile that
should be checked against the values used on your system: the HZ
value, which is the frequency in ticks per second used by the
times() system call and can usually be found by running:
308 grep HZ /usr/include/sys/*
310 and the functions available to return random numbers. If you
311 have a rand48() function that returns a 32-bit random number, use
312 it and use the lines:
If not, you will need to use the rand() function call and
determine whether it returns a 16-bit or a 32-bit value. These
functions are used by PRDF and PRSS. If you have problems
compiling the programs, you may want to examine the makefile.unx
and makefile.sun files to look for differences. I have tried to
use very standard Unix functions in these programs, and they have
been successfully compiled, with very small changes to the
Makefile, on Suns (SunOS 4.1), IBM RS/6000s (AIX), and MIPS
machines (under the BSD environment).
327 2.1.2. IBM-PC/DOS version
For the IBM-PC/DOS version, the FASTA source code disk
contains the complete source code to all of the programs on the
other disks. The programs were compiled with Borland's Turbo
'C++', using Borland's MAKE utility. The graphics programs
(PLFASTA, TGREASE) use the graphics device drivers supplied with
the Turbo 'C' V2.0 package. Also included are the documentation
files PROGRAMS.DOC and FORMAT.DOC. You do not need any of the
files on the source code disk to run the programs. The files on
this disk are identical to the UNIX and VMS versions that run on
larger machines. Also included is the code to compile
ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not penalize
342 If you have the DOS or Macintosh version of the FASTA
343 package, to install the programs you should:
345 (1) Make a new directory (folder) for the FASTA programs.
346 This need not be the same as the directory for your
349 (2) Copy the files from the FASTA source disk to the new
352 (3) (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your
353 PATH command to include the FASTA directory and (b) add
356 set FASTLIBS=c:\yourfastadirectory\fastgbs
358 On the Macintosh, you may need to edit the "environment"
359 file and change the line that reads:
363 to indicate the full directory path for the fastgbs file,
366 FASTLIBS=Q105:FASTA:fastgbs
369 (4) Finally, you will need to edit the fastgbs file. This is
370 usually the most confusing part of the installation. An
371 example of this file is shown below; to customize this
372 file for your machine, you will need to change the file
373 names from those provided in the fastgbs file to ones that
374 reflect the directory names and file names you use on your
375 machine. This is explained in more detail below. In
376 addition, some entries in the fastgbs file refer to other
377 files of file names. These files of file names (as
378 opposed to actual database files) may also need to be
381 2.2. Installing the libraries
383 2.2.1. The NBRF protein sequence library
385 The FASTA program package does not include any protein or
386 DNA sequence libraries. You can obtain the PIR protein sequence
389 National Biomedical Research Foundation
390 Georgetown University Medical Center
391 3900 Reservoir Rd, N.W.
392 Washington, D.C. 20007
394 In addition, this database is available via anonymous ftp from
395 the host "ftp.bchs.uh.edu". It is available in two formats, VMS
396 and CODATA format. The "VMS" format (library type 5 below) can
397 be searched much faster, can be easily reformatted for use by the
398 "BLAST" rapid searching program, and is compatible with the
399 Genetics Computer Group package of programs. The CODATA format
400 is used by the EUGENE/MBIR computing package from Baylor (library
403 2.2.2. The GENBANK DNA sequence library
FASTA and TFASTA can search the GENBANK DNA sequence
library in the flat-file (not ASN.1) format distributed by the
National Center for Biotechnology Information, and in the PIR
format used by EBI/EMBL. CD-ROMs can be
412 National Center for Biotechnology Information
413 National Library of Medicine
414 National Institutes of Health
419 The GenBank DNA sequence library is also available via
420 anonymous FTP from ncbi.nlm.nih.gov.
422 2.2.3. The EBI/EMBL CD-ROM libraries
424 The European Bioinformatics Institute (EBI) is now
425 distributing the EMBL CD-ROM that contains both the complete EMBL
426 DNA sequence database (which should be essentially identical to
427 the GenBank DNA sequence database) and the SWISS-PROT protein
428 sequence database. SWISS-PROT is derived from the NBRF Protein
429 sequence database with additions from the EBI/EMBL DNA sequence
430 database. This CD-ROM is a "best-buy," since it provides both
431 DNA and protein sequence libraries. It is available from:
434 European Bioinformatics Institute
435 Hinxton Genome Campus, Hinxton Hall
436 Hinxton, Cambridge CB10 1RQ,
440 Email: DATALIB@ebi.ac.uk
444 In addition, the SWISS-PROT protein sequence database is
445 available via anonymous FTP from ncbi.nlm.nih.gov.
447 2.3. Finding the libraries: FASTLIBS
FASTA and TFASTA use the environment variable FASTLIBS to
find the protein and DNA sequence libraries. The FASTLIBS
variable contains the name of a file that has the actual
filenames of the libraries. The FASTGBS file is an example of
a file that can be referred to by FASTLIBS. To use the FASTGBS
file, set FASTLIBS:
456 setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
458 export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)
Then edit the FASTGBS file to indicate where the protein and DNA
sequence libraries can be found. If you have a hard disk and
your protein sequence library is kept in the file
/usr/lib/seq/aabank.lib and your Genbank DNA sequence library is
kept in the directory /usr/lib/genbank, then fastgbs might
contain:
466 NBRF Protein$0P/usr/lib/seq/aabank.lib 0
467 SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
468 GB Primate$1P@/usr/lib/genbank/gpri.nam
469 GB Rodent$1R@/usr/lib/genbank/grod.nam
470 GB Mammal$1M@/usr/lib/genbank/gmammal.nam
The first line of this file says that a copy of the NBRF protein
sequence database (which is a protein database) is kept in the
file /usr/lib/seq/aabank.lib and can be selected by typing "P" on
the command line or when the database menu is presented.
479 Note that there are 4 or 5 fields in the lines in fastgbs.
480 The first field is the description of the library which will be
481 displayed by FASTA; it ends with a '$'. The second field (1
482 character), is a 0 if the library is a protein library and 1 if
483 it is a DNA library. The third field (1 character) is the
484 character to be typed to select the library.
The fourth field is the name of the library file. In the
example above, the /usr/lib/seq/aabank.lib file contains the
entire protein sequence library. However, the DNA library file
names are preceded by a '@', because these files (gpri.nam,
grod.nam, gmammal.nam) do not contain the sequences; instead they
contain the names of the files which contain the sequences. This
is done because the GENBANK DNA database is broken down into a
large number of smaller files. In order to search the entire
primate database, you must search more than a dozen files.
In addition, an optional fifth field can be used to specify
the format of the library file. Alternatively, you can specify
the library format in a file of file names (a file preceded by an
'@'). This field must be separated from the file name by a space
character (' '). In the example above, the aabank.lib file is in
Pearson/FASTA format, while the swiss.seq file is in PIR/VMS
format (from the EMBL CD-ROM). Currently, FASTA can read the
following formats:
505 0 Pearson/FASTA (>SEQID - comment/sequence)
506 1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
507 2 NBRF CODATA (ENTRY/SEQUENCE)
508 3 EMBL/SWISS-PROT (ID/DE/SQ)
509 4 Intelligenetics (;comment/SEQID/sequence)
510 5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
511 6 GCG (version 8.0) Unix Protein and DNA (compressed)
512 11 NCBI Blast1.3.2 format (unix only)
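
To make the layout of a fastgbs line concrete, the Python sketch
below splits one line into the fields described above and then
picks libraries by their selection letters, as in a string such as
"%PRMB". This is an illustration only, not the parser used by
FASTA: the names LibraryEntry, parse_fastlibs_line and
select_libraries are hypothetical, and matching letters separately
for protein and DNA libraries is an assumption based on the example
file, where "P" appears once for each type.

```python
from collections import namedtuple

LibraryEntry = namedtuple(
    "LibraryEntry", "description is_dna letter path is_file_list fmt")


def parse_fastlibs_line(line):
    """Split 'description$<type><letter><path> [format]' into its fields."""
    description, sep, rest = line.rstrip("\n").partition("$")
    fields = rest[2:].split()
    if not sep or len(rest) < 3 or not fields:
        raise ValueError("not a library line: %r" % line)
    is_dna = rest[0] == "1"          # 0 = protein library, 1 = DNA library
    letter = rest[1]                 # character typed to select the library
    path = fields[0]
    fmt = int(fields[1]) if len(fields) > 1 else None
    is_file_list = path.startswith("@")  # '@' marks a file of file names
    if is_file_list:
        path = path[1:]
    return LibraryEntry(description.strip(), is_dna, letter, path,
                        is_file_list, fmt)


def select_libraries(entries, selector, dna):
    """Return the entries named by a letter or by a '%'-prefixed string."""
    chosen = []
    for letter in selector.lstrip("%"):
        for entry in entries:
            if entry.letter == letter and entry.is_dna == dna:
                chosen.append(entry)
    return chosen


if __name__ == "__main__":
    example = [
        "NBRF Protein$0P/usr/lib/seq/aabank.lib 0",
        "SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5",
        "GB Primate$1P@/usr/lib/genbank/gpri.nam",
        "GB Rodent$1R@/usr/lib/genbank/grod.nam",
    ]
    entries = [parse_fastlibs_line(line) for line in example]
    for entry in select_libraries(entries, "%PR", dna=True):
        print(entry.description, entry.path)
```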
514 In particular, this version will work with the EMBL and PIR VMS
515 formats that are distributed on the EMBL CD-ROM. The latter
516 format (PIR VMS) is much faster to search than EMBL format. This
517 release also works with the protein and DNA database formats
518 created for the BLASTP and BLASTN programs by SETDB and PRESSDB
519 and with the new NCBI search format. If a library format is not
520 specified, for example, because you are just comparing two
521 sequences, Pearson/FASTA (format 0) is used by default. To
522 change this default, you may set the LIBTYPE environment variable
523 to a number. For example,
527 would cause the program to use the GenBank LOCUS format by
528 default for libraries (or the second sequence file), but the
529 Pearson/FASTA format would still be used for the query sequence.
531 You can specify a group of library files by putting a '@'
532 symbol before a file that contains a list of file names to be
533 searched. For example, if @gmam.nam is in the fastgbs file, the
534 file "gmam.nam" might contain the lines:
541 In this case, the line beginning with a '<' indicates the
542 directory the files will be found in. The remaining lines name
543 the actual sequence files. So the first sequence file to be
546 /usr/lib/genbank/gbpri.seq
548 The notation "<PIRNAQ:" might be used under the VAX/VMS operating
549 system. Under UNIX, the trailing '/' is left off, so the library
550 directory might be written as "</usr/seqlib".
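
A file of file names can be read with very little code. The Python
sketch below is an illustration, not the routine used by FASTA. It
assumes that a '<' line sets the directory for the names that
follow, that a name may be followed by a space and a format number,
that lines beginning with '#' are not file names, and that under
UNIX the directory and name are joined with a '/'. The function
names are hypothetical.

```python
def join_library_path(directory, name):
    """Join a '<' directory and a file name.

    Assumption: a directory ending in ':' (VMS style) or '/' is used
    as a plain prefix; otherwise a '/' is inserted, since the trailing
    '/' is left off under UNIX.
    """
    if not directory or directory.endswith((":", "/")):
        return directory + name
    return directory + "/" + name


def read_file_of_filenames(lines):
    """Return a list of (path, format_number_or_None) pairs."""
    directory = ""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                     # blank line or '#' directive
        if line.startswith("<"):
            directory = line[1:].strip()  # directory for the following names
            continue
        fields = line.split()
        fmt = None
        if len(fields) > 1 and fields[1].isdigit():
            fmt = int(fields[1])
        entries.append((join_library_path(directory, fields[0]), fmt))
    return entries


if __name__ == "__main__":
    listing = ["</usr/lib/genbank", "gbpri.seq 1", "gbmam.seq 1"]
    for path, fmt in read_file_of_filenames(listing):
        print(path, fmt)
```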
552 With version 1.4 of the FASTA package, the FASTA and TFASTA
553 programs can search a library composed of different files in
554 different sequence formats. For example, you may wish to search
555 the Genbank files (in GenBank flat file format) and the EMBL DNA
556 sequence database on CD-ROM. To do this, you simply list the
557 names and filetypes of the files to be searched in a file of
558 filenames. For example, to search the mammalian portion of
559 Genbank, the unannotated portion of Genbank, and the unannotated
560 portion of the EMBL library, you could use the file:
564 # (this '#' causes the program to display the size of the library)
You do not need to include library format numbers if you
only use the Pearson/FASTA version of the PIR protein
sequence library. If no library type is specified, the
program assumes that type 0 is being used (unless you
577 Support for the old compressed GenBank files, which have not been
578 distributed for more than four years, has been removed from
579 programs in the FASTA package.
582 Test the setup by running FASTA. Enter the sequence file
583 'MUSPLFM.AA' when the program requests it (this file is included
584 with the programs). The program should then ask you to select a
585 protein sequence library. Alternatively, if you run the TFASTA
586 program and use the MUSPLFM.AA query sequence, the program should
587 show you a selection of DNA sequence libraries. Once the fastgbs
588 file has been set up correctly, you can set FASTLIBS=fastgbs in
589 your AUTOEXEC.BAT file, and you will not need to remember where
590 the libraries are kept or how they are named.
592 FASTA and TFASTA must open a large number of files when
593 searching and reporting the results of a GENBANK floppy disk
594 format library search. You may have problems with the large
595 number of files under DOS on IBM-PC's (Unix and VMS users will
596 not have these problems). If you are going to search the GENBANK
597 floppy disk format DNA sequence library under DOS, you should add
602 to your CONFIG.SYS file. (Typically this is already done for
603 programs like Windows or WordPerfect.)
605 3. Using the FASTA Package
609 The FASTA sequence comparison programs all require similar
610 information, the name of a query sequence file, a library file,
611 and the ktup parameter. All of the programs can accept arguments
612 on the command line, or they will prompt for the file names and
615 To use FASTA, simply type:
618 and you will be prompted for :
619 the name of the test sequence file
620 the name of the library file
621 and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
623 ktup of 2 is about 5 times faster than ktup = 1.
624 For a 200 aa sequence against a 10,000,000 aa
625 library, the program takes about 30 min with
626 ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286
630 The program can also be run by typing
632 FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
635 Included with the package are the test files, MUSPLFM.AA,
636 LCBO.AA, MCHU.AA and BOVPRL.SEQ. To check to make certain that
637 everything is working, you can try:
639 fasta musplfm.aa lcbo.aa
641 tfasta musplfm.aa bovprl.seq
643 To test the local similarity programs LFASTA and PLFASTA, try:
645 lfasta mchu.aa mchu.aa
647 plfasta mchu.aa mchu.aa (use this only on an IBM-PC with graphics
648 or on a Tektronix terminal under UNIX or VMS)
650 MCHU (calmodulin) has four duplicated calcium binding sites that
651 are clearly detected by LFASTA. For a more complicated example,
652 try MWRTC1.aa, myosin heavy chain.
The FASTA programs know about three kinds of sequence files
(four under VMS): (1) Plain sequence files, which can only be
used as query sequences or for LFASTA, PRDF, and ALIGN. (2)
Standard library files. These are the same as plain sequence
files, except that each sequence is preceded by a comment line
with a '>' in the first column. (3) Distributed sequence
libraries. This is a broad class that includes the NBRF/PIR VMS
and blocked ascii formats, Genbank flat-file format, EMBL
flat-file format, and Intelligenetics format. All of the files
that you create should be of type (1) or (2). Type (2) files
(ones with a '>') can be used as query or library sequence files
by all of the programs.
I have included several sample test files, *.AA. The first
line may begin with a '>' or ';' followed by a comment. The
text after ';' in other lines will be ignored. Spaces and
tabs (and anything else that is not an amino-acid code) are
ignored.
674 Library files should have the form:
676 >Sequence name and identifier
677 A F A S Y T .... actual sequence.
678 F S S .... second line of sequence.
679 >Next sequence name and identifier
681 This is often referred to as "FASTA" or "Pearson" format. You
682 can build your own library by concatenating several sequence
683 files. Just be sure that each sequence is preceded by a line
684 beginning with a '>' with a sequence name.
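
The library layout above is simple enough to read with a short
script. The following Python sketch is an illustration rather than
the reader built into the FASTA programs; the function name
read_fasta is hypothetical, and keeping only alphabetic characters
from the sequence lines is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from lines in Pearson/FASTA format.

    A line starting with '>' begins a new entry; all following lines,
    up to the next '>', are joined into one sequence.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Keep residue letters only; spaces and tabs are dropped.
            chunks.append("".join(ch for ch in line if ch.isalpha()).upper())
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    library = [
        ">seq1 first example entry",
        "A F A S Y T",
        "F S S",
        ">seq2 second example entry",
        "MKVLAAGIV",
    ]
    for name, sequence in read_fasta(library):
        print(name, len(sequence), sequence)
```

Passing an open file object instead of a list works the same way,
since the function only iterates over lines.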
686 The test file should not have lines longer than 120
687 characters, and sequences entered with word processors should use
688 a document mode, with normal carriage returns at the end of
693 3.3. Sequence search programs
695 FASTA universal sequence comparison. Defaults to comparing
696 protein sequences; if the sequences are > 85% A+C+G+T
697 or the -n option is used, a DNA sequence is assumed.
699 FASTX Search a protein sequence library using amino acid
700 sequence comparison to the forward three frames of a
701 translated DNA query sequence. (The reverse frames are
702 specified with the -i option.) Alignment scores allow
703 frameshifts; the final alignment uses a Smith-Waterman
704 type alignment routine (no limit on gaps) that allows
707 TFASTA Search DNA library for a protein sequence by
708 translating the DNA sequence to protein in all six
709 frames (three forward frames with the -3 command line
710 option). TFASTA with ktup=2 is about as fast as a DNA
711 FASTA with ktup=4, and is substantially more sensitive.
712 (also reads the GENBANK library)
714 TFASTX Search DNA library for a protein sequence by
715 translating the DNA sequence to protein in all six
716 frames (three forward frames with the -3 command line
717 option) calculating similarity scores that allow
718 frameshifts. TFASTX produces an optimal Smith-Waterman
719 alignment of the query and translated-library sequence.
721 SSEARCH Universal sequence comparison using the Smith-Waterman
722 algorithm ( T. F. Smith and M. S. Waterman (1981) J.
723 Mol. Biol. 147:195-197). This program uses code
724 developed by Huang and Miller (X. Huang, R. C.
725 Hardison, W. Miller (1990) CABIOS 6:373-381) for
726 calculating the local similarity score and code from
727 the ALIGN program (see below) for calculating the local
728 alignment. SSEARCH is about 50-times slower than FASTA
729 with ktup=2 (for proteins).
731 ALIGN optimal global alignment of two sequences with no
732 short-cuts. This program is a slightly modified
733 version of one taken from E. Myers and W. Miller. The
734 algorithm is described in E. Myers and W. Miller,
735 "Optimal Alignments in Linear Space" (CABIOS (1988)
738 3.4. Local similarity programs
740 LFASTA local similarity searches showing local alignments.
741 The algorithm used to calculate the local alignment in
742 a band has been improved (Chao, Pearson, and Miller,
745 PLFASTA local similarity searches with plot output (on the IBM,
746 this program requires that the environment variable
749 PCLFASTA (unix only) local similarity searches with plot output
752 LALIGN Calculates the N-best local alignments using a rigorous
753 algorithm. (N=10 by default.) The algorithm was
754 developed by Huang and Miller (X. Huang and W. Miller
755 (1991) Adv. Appl. Math. 12:337-357), which is a
756 linear-space version of an algorithm described by M. S.
757 Waterman and M. Eggert (J. Mol. Biol. 197:723-728).
758 Like SSEARCH, LALIGN is rigorous, but also very slow.
760 PLALIGN A version of LALIGN that plots its output to a screen
761 or to a Tektronix terminal emulator.
763 3.5. Statistical Significance
765 With version 2.0 of the FASTA program distribution, FASTA,
766 TFASTA, and SSEARCH now provide estimates of statistical
767 significance for library searches. Work by Altschul, Arratia,
768 Karlin, Mott, Waterman, and others (see Altschul et al. (1994)
769 Nature Genetics 6:119 for an excellent review) suggests that
770 local sequence similarity scores follow the extreme value
771 distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u))) where
772 u = ln(Kmn)/lambda and m, n are the lengths of the query and
773 library sequence. This formula can be rewritten as: 1 - exp(-Kmn
774 exp(-lambda x)), which shows that the average score for an
775 unrelated library sequence increases with the logarithm of the
776 length of the library sequence. FASTA and SSEARCH use simple
777 linear regression against the log of the library sequence
778 length to calculate a normalized "z-score" with mean 50,
779 regardless of library sequence length, and variance 10. These
780 z-scores can then be used with the extreme value distribution and
781 the Poisson distribution (to account for the fact that each
782 library sequence comparison is an independent test) to calculate
783 the number of library sequences expected to obtain a score greater
784 than or equal to the score obtained in the search. The original idea and
785 routines to do the linear regression on library sequence length
786 were provided by Phil Green, U. Washington. This version of FASTA
787 and SSEARCH uses a slightly different strategy for fitting the
788 data than those originally provided by Dr. Green.
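As a rough illustration of the extreme value formula above, the following Python sketch evaluates P(s > x) in both of the forms given in the text and converts it to an expected number of library sequences. The function names are hypothetical, and the values of lambda and K used in the usage note are arbitrary placeholders rather than parameters taken from FASTA; the sketch also omits the z-score normalization step.

```python
import math

def evd_pvalue(x, lam, K, m, n):
    """P(s > x) = 1 - exp(-exp(-lambda (x - u))), u = ln(K m n) / lambda."""
    u = math.log(K * m * n) / lam
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))

def evd_pvalue_rewritten(x, lam, K, m, n):
    """The same probability written as 1 - exp(-K m n exp(-lambda x))."""
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * x))

def expected_hits(p, library_size):
    """Expected number of library sequences reaching the score, treating
    each library comparison as an independent test."""
    return library_size * p
```

For example, evd_pvalue(60, 0.25, 0.1, 200, 300) and evd_pvalue_rewritten(60, 0.25, 0.1, 200, 300) agree to within rounding error, and doubling the library sequence length n raises the probability of reaching the same score by chance.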
790 The expected number of sequences is plotted in the histogram
791 using an "*". Since the parameters for the extreme value
792 distribution are not calculated directly from the distribution of
793 similarity scores, the pattern of "*'s" in the histogram gives a
794 qualitative view of how well the statistical theory fits the
795 similarity scores calculated by FASTA and SSEARCH. For FASTA, if
796 optimized scores are calculated for each sequence in the database
797 (the default), the agreement between the actual distribution of
798 "z-scores" and the expected distribution based on the length
799 dependence of the score and the extreme value distribution is
800 usually very good. Likewise, the distribution of SSEARCH Smith-
801 Waterman scores typically agrees closely with the actual
802 distribution of "z-scores." The agreement with unoptimized
803 scores, ktup=2, is often not very good, with too many high
804 scoring sequences and too few low scoring sequences compared with
805 the predicted relationship between sequence length and similarity
806 score. In those cases, the expectation values may be
809 The statistical routines assume that the library contains a
810 large sample of unrelated sequences. If this is not the case,
811 then the expectation values are meaningless. Likewise, if there
812 are fewer than 20 sequences in the library, the statistical
813 calculations are not done.
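The length normalization and the 20-sequence minimum described above can be sketched in Python as follows. This is an assumption-laden illustration, not the fitting strategy actually used by FASTA or the routines provided by Dr. Green: it uses ordinary least squares against the log of the library sequence length and rescales the residuals to the mean (50) and variance (10) stated in the text. The function and parameter names are hypothetical.

```python
import math

MIN_LIBRARY_SIZE = 20   # below this, no statistics are calculated

def length_normalized_zscores(scores, lengths, mean=50.0, variance=10.0):
    """Regress score on ln(length) and return rescaled residuals.

    Returns None when the library is too small for the statistics.
    """
    if len(scores) != len(lengths):
        raise ValueError("scores and lengths must have the same size")
    k = len(scores)
    if k < MIN_LIBRARY_SIZE:
        return None
    xs = [math.log(n) for n in lengths]
    x_bar = sum(xs) / k
    y_bar = sum(scores) / k
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all library sequences have the same length")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    resid_var = sum(r * r for r in residuals) / (k - 2)
    if resid_var == 0.0:
        raise ValueError("scores fit the regression line exactly")
    scale = math.sqrt(variance / resid_var)
    return [mean + scale * r for r in residuals]
```

A library dominated by related sequences would distort the fitted line, which is one way to see why the text warns that the expectation values are meaningless unless the library is a large sample of unrelated sequences.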
815 For protein searches, library sequences with E() values <
816 0.01 for searches of a 10,000 entry protein database are almost
817 always homologous. Frequently sequences with E()-values from 1 -
818 10 are related as well. Remember, however, that these E() values
819 also reflect differences between the amino acid composition of
820 the query sequence and that of the "average" library sequence.
821 Thus, when searches are done with query sequences with "biased"
822 amino-acid composition, unrelated sequences may have
823 "significant" scores because of sequence bias. The programs
824 below, PRDF and PRSS, can address this problem by calculating
825 similarity scores for random sequences with the same length and
826 amino acid composition.
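The idea of comparing against random sequences with the same length and amino acid composition can be sketched in Python. The two functions below are hypothetical illustrations, not the PRDF or PRSS routines: the first shuffles the whole sequence uniformly, and the second shuffles residues only within consecutive windows, which also preserves local composition. The window size is an assumed parameter.

```python
import random

def uniform_shuffle(seq, rng):
    """Return seq with its residues in random order (same composition)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def window_shuffle(seq, window, rng):
    """Shuffle residues within consecutive windows of the given size,
    so that local composition is preserved as well."""
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)
```

Scoring the query against many such shuffled sequences gives a distribution of chance scores that already reflects any composition bias, so a real score can be judged against it rather than against the "average" library sequence.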
828 If optimization is not used ("-o"), E-values for DNA
829 sequences overestimate the significance of the scores that are
830 obtained and unrelated sequences frequently have E()-values <
831 0.0005. With optimization, the accuracy of the E()-values
832 compares favorably with that of protein sequence comparison. This is in
833 part due to the use of more stringent gap penalties for DNA
834 sequence comparison, -16, -4 rather than -12, -2. With the
835 latter penalties, many unrelated sequences appear to have
836 significant similarity. Nevertheless, since protein sequence
837 comparison is much more sensitive, DNA sequence comparison should
838 not be used to identify sequences that encode protein. Even with
839 ktup=6, optimization rarely increases run-times more than 50%
840 with mRNA-size query sequences. Optimization should be used
843 Similar comments apply to TFASTA, where higher gap
844 penalties (-16,-4) are required for accurate statistical
845 estimates. Because TFASTA produces so many artificial "coding"
846 sequences with atypical amino acid compositions, the statistical
847 estimates with TFASTA are often overestimates. With optimized
848 scores, ktup=1, and gap penalties of -16, -4, unrelated sequences
849 will sometimes have E() values of 0.1. If initn scores are used,
850 unrelated sequences may have E() values < 0.01.
852 PRDF improved version of RDF program that includes accurate
853 probability estimates for all three scoring methods
854 (includes local or window shuffle routine)
856 PRSS A version of PRDF that uses the rigorous Smith-Waterman
857 calculation used by SSEARCH.
859 RANDSEQ produces a randomly shuffled sequence from a query
862 RELATE significance program described by Dayhoff (Atlas of
863 Protein Sequence and Structure, Vol. 5, Supplement 3).
864 Each chunk of 25 residues in one sequence is compared
865 to every 25 residue fragment of the second sequence.
866 Sequences which are genuinely related will have a large
867 number of scores greater than 3 standard deviations
868 above the mean score of all of the comparisons.
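The fragment comparison described for RELATE can be sketched in Python as follows. This is only an illustration under stated assumptions: fragments are taken at every position in both sequences, and a simple identity score (1 for a match, 0 otherwise) stands in for the scoring matrix that RELATE would read with the -s option. The function names are hypothetical.

```python
import statistics

def fragment_scores(seq_a, seq_b, frag_len=25, match=1, mismatch=0):
    """Score every frag_len fragment of seq_a against every frag_len
    fragment of seq_b, without gaps."""
    scores = []
    for i in range(len(seq_a) - frag_len + 1):
        a = seq_a[i:i + frag_len]
        for j in range(len(seq_b) - frag_len + 1):
            b = seq_b[j:j + frag_len]
            scores.append(sum(match if x == y else mismatch
                              for x, y in zip(a, b)))
    return scores

def count_high_scores(scores, n_sd=3.0):
    """Count scores more than n_sd standard deviations above the mean
    of all of the comparisons."""
    cutoff = statistics.mean(scores) + n_sd * statistics.pstdev(scores)
    return sum(1 for s in scores if s > cutoff)
```

A large count from count_high_scores, relative to what shuffled sequences produce, is the kind of evidence the RELATE description refers to.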
870 3.6. Other analysis programs
872 AACOMP calculate the amino acid composition and molecular
873 weight of a sequence.
875 BESTSCOR calculate the best self-comparison score.
877 GREASE Kyte-Doolittle hydropathicity profile
879 TGREASE graphic plot of Kyte-Doolittle profile
881 FROMGB convert from GenBank LOCUS format (also used by the
882 IBI-Pustell programs) to Pearson/FASTA format.
884 GARNIER A secondary structure prediction program using the
885 method of Garnier, Osguthorpe, and Robson, J. Mol.
886 Biol., (1978) 120:97-120.
890 These programs have a number of output options, which are
891 invoked by the environment variables LINLEN, SHOWALL, and MARKX.
892 Alternatively, these values can be controlled by command line
893 options. The number of sequence residues per output line is now
894 adjustable by setting the environment variable LINLEN, or the
895 command line option -w. LINLEN is normally 60; to change it, set
896 LINLEN=80 before running the program or add -w 80 to the command
897 line. LINLEN can be set up to 200. SHOWALL (-a) determines
898 whether all, or just a portion, of the aligned sequences are
899 displayed. Previously, FASTP would show the entire length of
900 both sequences in an alignment while FASTN would only show the
901 portions of the two sequences that overlapped. Now the default is
902 to show only the overlap between the two sequences; to show
903 complete sequences, set SHOWALL=1, or use the -a option on the
906 The differences between the two aligned sequences can be
907 highlighted in five different ways by changing the environment
908 variable MARKX or the -m option. Normally (MARKX=0) the program
909 uses ':' to denote identities and '.' to denote conservative
910 replacements. If MARKX=1, the program will not mark identities;
911 instead conservative replacements are denoted by a 'x' and non-
912 conservative substitutions by a 'X'. If MARKX=2, the residues in
913 the second sequence are only shown if they are different from the
914 first. MARKX=3 displays the aligned library sequences without the
915 query sequence; these can be used to build a primitive multiple
916 alignment. MARKX=4 provides a graphical display of the
917 boundaries of the alignments. Thus the five options are:
920 MARKX=0        MARKX=1        MARKX=2        MARKX=3        MARKX=4
922 MWRTCGPPYT     MWRTCGPPYT     MWRTCGPPYT                    MWRTCGPPYT
923 ::..:: :::       xx  X        ..KS..Y...     MWKSCGYPYT     ----------
924 MWKSCGYPYT     MWKSCGYPYT
927 (fasta20u4, Feb. 1996) In addition MARKX=10 is a new, parseable
928 format for use with other programs. See the file "readme.v20u4"
929 for a more complete description.
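The first three MARKX styles described above can be reproduced with a small Python sketch. It is an illustration only: the function name is hypothetical, and the set of conservative replacements is hard-coded to the two pairs needed for the example sequences, whereas the programs would presumably decide this from the scoring matrix.

```python
# Assumed conservative pairs, sufficient for the example sequences only.
CONSERVATIVE = {frozenset("RK"), frozenset("TS")}

def mark_line(query, library, markx):
    """Return the marking line (MARKX 0 or 1) or the second-sequence
    line (MARKX 2) for two aligned, equal-length sequences."""
    out = []
    for q, l in zip(query, library):
        identical = q == l
        conservative = frozenset((q, l)) in CONSERVATIVE
        if markx == 0:
            out.append(":" if identical else "." if conservative else " ")
        elif markx == 1:
            out.append(" " if identical else "x" if conservative else "X")
        elif markx == 2:
            out.append("." if identical else l)
        else:
            raise ValueError("only MARKX 0, 1 and 2 are sketched here")
    return "".join(out)
```

With the sequences MWRTCGPPYT and MWKSCGYPYT, mark_line returns "::..:: :::" for MARKX=0 and "..KS..Y..." for MARKX=2.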
931 3.8. Command line options
933 It is now possible to specify several options on the
934 command line, instead of using environment variables. The
935 command line options are preceded by a dash; the following
936 options are available:
940 -A force Smith-Waterman alignments for DNA sequences and
941 TFASTA. By default, only FASTA protein sequence
942 comparisons use Smith-Waterman alignments.
944 -b # Number of sequence scores to be shown on output. In
945 the absence of this option, fasta (and tfasta and
946 ssearch) display all library sequences obtaining
947 similarity scores with expectations less than 10.0 if
948 optimized scores are used, or 2.0 if they are not. The
949 -b option can limit the display further, but it will
950 not cause additional sequences to be displayed.
952 -c # Threshold score for optimization (OPTCUT). Set "-c 1"
953 to optimize every sequence in a database. (This slows
954 the program down about 5-fold).
956 -E # Limit the number of scores and alignments shown based
957 on the expected number of scores. Used to override the
958 expectation value of 10.0 used by default. When used
959 with -Q, -E 2.0 will show all library sequences with
960 scores with an expectation value <= 2.0.
962 -d # Number of alignments to be reported by default. (Used
963 in conjunction with -Q). No longer necessary, see "-b"
966 -f Penalty for the first residue in a gap (-12 by default
967 for proteins, -16 for DNA or for TFASTA).
969 -g Penalty for additional residues in a gap (-2 by default
970 for proteins, -4 for DNA and TFASTA ).
972 -h Penalty for frameshift (FASTX, TFASTX only).
976 -i Invert (reverse complement) the query sequence if it is
977 DNA. For TFASTX, search the reverse complement of the
978 library sequence only.
980 -k # Threshold for joining init1 segments to build an initn
983 -l file Location of library menu file (FASTLIBS).
985 -L Display more information about the library sequence in
988 -m # MARKX = # (0, 1, 2, 3, 4, 10)
990 -n Force the query sequence to be treated as a DNA
991 sequence. This is particularly useful for query
992 sequences that contain a large number of ambiguous
993 residues, e.g. transcription factor binding sites.
995 -O Send copy of results to "filename." Helpful for
996 environments without STDOUT.
998 -o Turn off default optimization of all scores greater
999 than OPTCUT. Sort results by "initn" scores.
1001 -Q,-q Quiet - does not prompt for any input. Writes scores
1002 and alignments to the terminal or standard output file.
1004 -r file Save a results summary line for every sequence in the
1005 sequence library. The summary line includes the
1006 sequence identifier, superfamily number (if available)
1007 position in the library, and the similarity scores
1008 calculated. This option can be used to evaluate the
1009 sensitivity and selectivity of different search
1010 strategies (see W. R. Pearson (1991) Genomics 11:635-
1013 -s file SMATRIX is read from file. Several SMATRIX files are
1014 provided with the standard distribution. For protein
1015 sequences: codaa.mat - based on minimum mutation
1016 matrix; idnaa.mat - identity matrix; pam250.mat - the
1017 PAM250 matrix developed by Dayhoff et al (Atlas of
1018 Protein Sequence and Structure, vol. 5, suppl. 3,
1019 1978); pam120.mat - a PAM120 matrix. The default
1020 scoring matrix is BLOSUM50; PAM250 is available with
1021 "-s 250", and BLOSUM62 ("-s BL62") is also available.
1023 -v (LINEVAL) values used for line styles in plfasta
1025 -w # Line length (width) = number (<200)
1027 -x Specifies offsets for the beginning of the query and
1028 library sequence. For example, if you are comparing
1029 upstream regions for two genes, and the first sequence
1030 contains 500 nt of upstream sequence while the second
1031 contains 300 nt of upstream sequence, you might try:
1033 fasta -x "-500 -300" seq1.nt seq2.nt
1035 If the -x option is not used, FASTA assumes numbering
1036 starts with 1. This option will not work properly with
1037 the translated library sequence with tfasta. (You
1038 should double check to be certain the negative
1039 numbering works properly.)
1041 -y Set the width of the band used for calculating
1042 "optimized" scores. For proteins and ktup=2, the width
1043 is 16. For proteins with ktup=1, the width is 32 by
1044 default. For DNA the width is 16.
1046 -z Turn off statistical calculations.
1048 -1 sort output by init1 score (as FASTP used to do).
1050 -3 (TFASTA, TFASTX only) translate only three forward
1056 fasta -w 80 -a seq1.aa seq2.aa
1058 would compare the sequence in seq1.aa to that in seq2.aa and
1059 display the results with 80 residues on an output line, showing
1060 all of the residues in both sequences. Be sure to enter the
1061 options before entering the file names, or just enter the options
1062 on the command line, and the program will prompt for the file
1065 Not all of these options are appropriate for all of the
1066 programs. The options above are used by FASTA and TFASTA. RELATE
1067 uses the -s option, ALIGN uses the -w, -m, and -s options, and
1068 the PRDF program uses -c, -f, -k, and -s.
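The way the -b and -E options are described as interacting with the default display thresholds can be sketched in Python. The function name and the (name, expectation) data layout are hypothetical, and the sketch uses "<=" for the cutoff, although the text uses both "less than" and "<=" in different places.

```python
def select_hits(hits, optimized=True, e_cutoff=None, max_shown=None):
    """hits is a list of (name, expectation) pairs.

    e_cutoff plays the role of -E and max_shown the role of -b; when
    e_cutoff is not given, 10.0 is used for optimized scores and 2.0
    otherwise, as described for the -b option.
    """
    if e_cutoff is None:
        e_cutoff = 10.0 if optimized else 2.0
    shown = sorted((h for h in hits if h[1] <= e_cutoff),
                   key=lambda h: h[1])
    if max_shown is not None:
        # -b can only shorten the list, never add sequences to it.
        shown = shown[:max_shown]
    return shown
```

For instance, with hits having expectations 0.001, 1.5, 5.0 and 50.0, the default optimized search would show three of them, an unoptimized search two, and max_shown=1 only the best one.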
1070 4. Environment variable summary
1072 Environment variables allow you to set search parameters
1073 that will be used frequently when you run a program; for example,
1074 if you prefer to use the PAM250 scoring matrix, you might "set
1075 SMATRIX=250." Command line parameters, if used, always override
1076 environment variable settings. The following environment
1077 variables are used by this program:
1079 AABANK the file name of the default sequence library.
1081 FASTLIBS the location of the file which contains the list of
1082 library files to be searched.
1084 GAPCUT threshold used for joining init1 regions in the second
1085 step of FASTA. Normally set based on sequence length
1088 LIBTYPE used to specify the format of the library sequence for
1091 LINLEN output line length - can go up to 200
1093 LINEVAL used by plfasta to determine the relationship between
1094 line style and similarity score (-v). This should be a
1095 string of three numbers, e.g. "200 100 50"
1097 MARKX symbol for denoting matches, mismatches. Note that this
1098 symbol is only used across the optimized local region;
1099 sequences that are outside this region are not marked.
1101 OPTCUT Set the threshold to be used for optimization in a band
1102 around the best initial region. Normally the OPTCUT
1103 value is calculated from the length of the sequence and
1104 the ktup value (for a 200 residue sequence, it is about
1105 28). If OPTCUT=1, every sequence in the database will
1106 be optimized. This is the most sensitive option.
1108 PAMFACT This version of fasta uses a more sensitive method for
1109 identifying initial regions. Instead of using a
1110 constant factor (fact) for each match in a ktup, it
1111 uses the scoring matrix (PAM) scores. While this works
1112 well for protein sequences, it has not been as
1113 carefully tested for DNA sequences, so by default, this
1114 modification is used for proteins but not for DNA.
1115 Setting the PAMFACT environment variable to 1 forces
1116 the option on; PAMFACT=0 turns it off.
1118 SHOWALL on output, show the complete sequence instead of just
1119 the overlap of the two aligned sequences.
1121 SMATRIX alternative scoring matrix file.
1123 TEKPLOT (IBM-PC only, Unix and VMS versions generate Tektronix
1124 graphics by default) Generate Tektronix output.
1125 Normally, PLFASTA and TGREASE plot graphs using the
1126 Turbo C graphics library. Unfortunately, often these
1127 plots cannot be printed out without special programs.
1128 However, if you set TEKPLOT=1, Tektronix graphics
1129 commands will be used. Tektronix commands can be used
1130 together with the PLOTDEV program, available from
1131 Microplot Systems. They no longer sell this program,
1132 but it can be downloaded from
1133 http://iquest.com/~microplt/index1.html. PLOTDEV also
1134 allows you to print out graphics on the screen.
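The rule that command line parameters override environment variables can be sketched in Python, using the output line length as the example. The function name is hypothetical, and rejecting values outside 1 to 200 is an assumption; the text states only that LINLEN is normally 60 and can go up to 200.

```python
import os

DEFAULT_LINLEN = 60    # "LINLEN is normally 60"
MAX_LINLEN = 200       # "can go up to 200"

def resolve_line_length(cli_width=None, environ=None):
    """Pick the line length: the -w value if given, else the LINLEN
    environment variable, else the default."""
    environ = os.environ if environ is None else environ
    if cli_width is not None:
        width = cli_width
    elif "LINLEN" in environ:
        width = int(environ["LINLEN"])
    else:
        width = DEFAULT_LINLEN
    if not 1 <= width <= MAX_LINLEN:
        raise ValueError("line length must be between 1 and 200")
    return width
```

With LINLEN=80 in the environment, resolve_line_length() gives 80, while resolve_line_length(cli_width=100) gives 100 regardless of the environment.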
1136 As always, please inform me of bugs as soon as possible.
1139 Department of Biochemistry
1140 Box 440, Jordan Hall