binaries/src/fasta34/fasta3x.doc

   1 (Updated December, 2003)
   2
   3
   4                         COPYRIGHT NOTICE
   5
   6 Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William R.
   7 Pearson and the University of Virginia.  All rights reserved. The
   8 FASTA program and documentation may not be sold or incorporated
   9 into a commercial product, in whole or in part, without written
  10 consent of William R. Pearson and the University of Virginia.
  11 For further information regarding permission for use or
  12 reproduction, please contact: David Hudson, Assistant Provost for
  13 Research, University of Virginia, P.O. Box 9025, Charlottesville,
  14 VA 22906-9025, (434) 924-6853
  15
  16 The FASTA program package
  17
  18 Introduction
  19
  20      This documentation describes the version 3 of the FASTA
  21 program package (see W. R. Pearson and D. J. Lipman (1988),
  22 "Improved Tools for Biological Sequence Analysis", PNAS
  23 85:2444-2448 (Pearson and Lipman, 1988); W. R.  Pearson (1996)
  24 "Effective protein sequence comparison" Meth. Enzymol.
  25 266:227-258 (Pearson, 1996); Pearson et. al. (1997) Genomics
  26 46:24-36 (Zhang et al., 1997);  Pearson, (1999) Meth. in
  27 Molecular Biology 132:185-219 (Pearson, 2000).  Version 3 of the
  28 FASTA packages contains many programs for searching DNA and
  29 protein databases and one program (prss3) for evaluating
  30 statistical significance from randomly shuffled sequences.
  31 Several additional analysis programs, including programs that
  32 produce local alignments, are available as part of version 2 of
  33 the FASTA package, which is still available.
  34
  35      This document is divided into three sections: (1) A summary
  36 overview of the programs in the FASTA3 package; (2) A guide to
  37 installing the programs and databases; (3) A guide to using the
  38 FASTA programs. The revision history of the programs can be found
  39 in the readme.v30..v34, files. The programs are easy to use, so
  40 if you are using them on a machine that is administered by
  41 someone else, you can skip section (2) and focus on (1) and (3)
  42 to learn how to use the programsIf you are installing the
  43 programs on your own machine, you will need to read section (2)
  44 carefully.
  45
  46 1.  An overview of the FASTA programs
  47
  48      Although there are a large number of programs in this
  49 package, they belong to three groups: (1) "Conventional" Library
  50 search programs: FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3,
  51 TFASTY3, SSEARCH3; (2) Programs for searching with short
  52 fragments: FASTS3, FASTF3, TFASTS3, TFASTF3; (3) Statistical
  53 significance: PRSS3.  Programs that start with fast search
  54 protein databases, while tfast programs search translated DNA
  55 databases.  Table I gives a brief description of the programs.
  56
  57
  58             Table I. Comparison programs in the FASTA3 package
  59
  60 ---------------------------------------------------------------------------
  61 fasta3             Compare  a  protein  sequence  to  a  protein  sequence
  62                    database  or  a DNA sequence to a DNA sequence database
  63                    using the FASTA algorithm  (Pearson  and  Lipman, 1988,
  64                    Pearson, 1996).   Search speed and selectivity are con-
  65                    trolled with the ktup(wordsize) parameter.  For protein
  66                    comparisons,  ktup = 2 by default; ktup =1 is more sen-
  67                    sitive but slower.  For DNA comparisons, ktup=6 by  de-
  68                    fault;  ktup=3  or  ktup=4 provides higher sensitivity;
  69                    ktup=1 should be used for oligonucleotides  (DNA  query
  70                    lengths < 20).
  71
  72 ssearch3           Compare  a  protein  sequence  to  a  protein  sequence
  73                    database or a DNA sequence to a DNA  sequence  database
  74                    using  the  Smith-Waterman  algorithm (Smith and Water-
  75                    man, 1981).  ssearch3 is  about  10-times  slower  than
  76                    FASTA3,  but  is more sensitive for full-length protein
  77                    sequence comparison.
  78
  79 fastx3/ fasty3     Compare a DNA sequence to a protein sequence  database,
  80                    by  comparing  the  translated  DNA  sequence  in three
  81                    frames and allowing gaps and frameshifts.  fastx3  uses
  82                    a  simpler, faster algorithm for alignments that allows
  83                    frameshifts only between codons; fasty3 is  slower  but
  84                    produces  better alignments with poor quality sequences
  85                    because frameshifts are allowed within codons.
  86
  87 tfastx3/ tfasty3   Compare a protein sequence to a DNA sequence  database,
  88                    calculating  similarities  with frameshifts to the for-
  89                    ward and reverse orientations.
  90
  91 tfasta3            Compare a protein sequence to a DNA sequence  database,
  92                    calculating similarities (without frameshifts) to the 3
  93                    forward and three reverse reading frames.  tfastx3  and
  94                    tfasty3 are preferred because they calculate similarity
  95                    over frameshifts.
  96
  97 fastf3/tfastf3     Compares an ordered peptide mixture, as  would  be  ob-
  98                    tained  by  Edman  degredation  of a CNBr cleavage of a
  99                    protein, against a  protein  (fastf)  or  DNA  (tfastf)
 100                    database.
 101
 102 fasts3/tfasts3     Compares  set  of  short peptide fragments, as would be
 103                    obtained from mass-spec. analysis of a protein, against
 104                    a protein (fasts) or DNA (tfasts) database.
 105 ---------------------------------------------------------------------------
 106
 107 2.  Installing FASTA and the sequence databases
 108
 109 2.1.  Obtaining the libraries
 110
 111      The FASTA program package does not include any protein or
 112 DNA sequence libraries.  Protein databases are available on CD-
 113 ROM from the PIR and EMBL (see below), or via anonymouse FTP from
 114 many different sources.  As this document is updated in the fall
 115 of 1999, no DNA databases are available on CD-ROM from the major
 116 sequence databases: Genbank at the National for Biotechnology
 117 Information (www.ncbi.nlm.nih.gov and ftp://ncbi.nlm.nih.gov) and
 118 EMBL at the European Bioinformatics Institute (www.ebi.ac.uk).
 119 However, the databases are available via anonymous FTP from both
 120 sites.
 121
 122 2.1.1.  The GENBANK DNA sequence library
 123
 124      Because of the large size of DNA databases, you will
 125 probably want to keep DNA databases in only one, or possibly two,
 126 formats.  The FASTA3 programs that search DNA databases - fasta3,
 127 tfastx/y3, and tfasta3 can read DNA databases in Genbank flatfile
 128 (not ASN.1), FASTA, GCG/compressed-binary, BLAST1.4 (pressdb),
 129 and BLAST2.0 (formatdb) formats, as well as EMBL format.  If you
 130 are also running the GCG suite of sequence analysis programs, you
 131 should use GCG/compressed-binary format or BLAST2.0 format for
 132 your fasta3 searches.  If not, BLAST2.0 is a good choice.  These
 133 files are considerably more compact than Genbank flat files, and
 134 are preferred.  The NCBI does not provide software for converting
 135 from Genbank flat files to Blast2.0 DNA databases, but you can
 136 use the Blast formatdb program to convert ASN.1 formated Genbank
 137 files, which are available from the NCBI ftp site.
 138
 139      The NCBI also provides the nr, swissprot, and several EST
 140 databases that are used by BLAST in FASTA format from:
 141 ftp://ncbi.nlm.nih.gov/blast/db.  These databases are updated
 142 nightly.
 143
 144 2.1.2.  The NBRF protein sequence library
 145
 146      You can obtain the PIR protein sequence database (Barker et
 147 al., 1998) from:
 148
 149     National  Biomedical Research Foundation
 150     Georgetown  University  Medical  Center
 151     3900 Reservoir Rd, N.W.
 152     Washington, D.C. 20007
 153
 154 or via ftp from nbrf.georgetown.edu or from the NCBI
 155 (ncbi.nlm.nih.gov/repository/PIR). The data in the ascii
 156 directory is in PIR Codata format, which is not widely used.  I
 157 recommend the PIR/VMS format data (libtype=5) in the vms
 158 directory.
 159
 160 2.1.3.  The EBI/EMBL CD-ROM libraries
 161
 162      The European Bioinformatics Institute (EBI) distributes both
 163 the EMBL DNA database and the SwissProt database on CD-ROM
 164 (Bairoch and Apweiler, 1996), and they are available from:
 165
 166     EMBL-Outstation  European Bioinformatics Institute
 167     Wellcome Trust Genome Campus,
 168     Hinxton Hall
 169     Hinxton,
 170     Cambridge CB10 1SD
 171     United Kingdom
 172     Tel: +44 (0)1223 494444
 173     Fax: +44 (0)1223 494468
 174     Email: DATALIB@ebi.ac.uk
 175
 176 In addition, the SWISS-PROT protein sequence database is
 177 available via anonymous FTP from
 178 ftp://ftp.expasy.ch/databases/swiss-prot/ (also see
 179 www.expasy.ch).
 180
 181 2.2.  Finding the libraries: FASTLIBS
 182
 183      The major problem that most new users of the FASTA package
 184 have is in setting up the program to find the databases and their
 185 library type.  In general, if you cannot get fasta3 to read a
 186 sequence database, it is likely that something is wrong with the
 187 FASTLIBS file.  A common problem is that the database file is
 188 found, but either no sequences are read, or an incorrect number
 189 of entries is read.  This is almost always because the library
 190 format (libtype) is incorrect.  Note that a type 5 file (PIR/VMS
 191 format) can be read as a type 0 (default FASTA) format file, and
 192 the number of entries will be correct, but the sequence lengths
 193 will not.
 194
 195      All the search programs in the FASTA3 package use the
 196 environment variable FASTLIBS to find the protein and DNA
 197 sequence libraries.  The FASTLIBS variable contains the name of a
 198 file that has the actual filenames of the libraries.  The
 199 fastlibs file included with the distribution on is an example of
 200 a file that can be referred to by FASTLIBS. To use the fastlibs
 201 file, type:
 202
 203     setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
 204     or
 205     export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)
 206
 207 Then edit the fastlibs file to indicate where the protein and DNA
 208 sequence libraries can be found.  If you have a hard disk and
 209 your protein sequence library is kept in the file
 210 /usr/lib/aabank.lib and your Genbank DNA sequence library is kept
 211 in the directory: /usr/lib/genbank, then fastgbs might contain:
 212
 213     NBRF Protein$0P/usr/lib/seq/aabank.lib 0
 214     SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
 215     GB Primate$1P@/usr/lib/genbank/gpri.nam
 216     GB Rodent$1R@/usr/lib/genbank/grod.nam
 217     GB Mammal$1M@/usr/lib/genbank/gmammal.nam
 218     ^   1    ^^^^       4                   ^     ^
 219               23                             (5)
 220
 221 The first line of this file says that there is a copy of the NBRF
 222 protein sequence database (which is a protein database) that can
 223 be selected by typing "P" on the command line or when the
 224 database menu is presented in the file /usr/lib/seq/aabank.lib.
 225
 226      Note that there are 4 or 5 fields in the lines in fastgbs.
 227 The first field is the description of the library which will be
 228 displayed by FASTA; it ends with a '$'.  The second field (1
 229 character), is a 0 if the library is a protein library and 1 if
 230 it is a DNA library.  The third field (1 character) is the
 231 character to be typed to select the library.
 232
 233      The fourth field is the name of the library file.  In the
 234 example above, the /usr/lib/seq/aabank.lib file contains the
 235 entire protein sequence library.  However the DNA library file
 236 names are preceded by a '@', because these files (gpri.nam,
 237 grod.nam, gmammal.nam) do not contain the sequences; instead they
 238 contain the names of the files which contain the sequences.  This
 239 is done because the GENBANK DNA database is broken down in to a
 240 large number of smaller files.  In order to search the entire
 241 primate database, you must search more than a dozen files.
 242
 243      In addition, an optional fifth field can be used to specify
 244 the format of the library file.  Alternatively, you can specify
 245 the library format in a file of file names (a file preceded by an
 246 '@').  This field must be separated from the file name by a space
 247 character (' ') from the filename.  In the example above, the
 248 aabank.lib file is in Pearson/FASTA format, while the swiss.seq
 249 file is in PIR/VMS format (from the EMBL CD-ROM). Currently,
 250 FASTA can read the following formats:
 251
 252     0 Pearson/FASTA (>SEQID - comment/sequence)
 253     1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
 254     2 NBRF CODATA (ENTRY/SEQUENCE)
 255     3 EMBL/SWISS-PROT (ID/DE/SQ)
 256     4 Intelligenetics (;comment/SEQID/sequence)
 257     5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
 258     6 GCG (version 8.0) Unix Protein and DNA (compressed)
 259     11 NCBI Blast1.3.2 format  (unix only)
 260     12 NCBI Blast2.0 format  (unix only, fasta32t08 or later)
 261
 262 In particular, this version will work with the EMBL and PIR VMS
 263 formats that are distributed on the EMBL CD-ROM. The latter
 264 format (PIR VMS) is much faster to search than EMBL format.  This
 265 release also works with the protein and DNA database formats
 266 created for the BLASTP and BLASTN programs by SETDB and PRESSDB
 267 and with the new NCBI search format.  If a library format is not
 268 specified, for example, because you are just comparing two
 269 sequences, Pearson/FASTA (format 0) is used by default. To
 270 specify a library type on the command line, add it to the library
 271 filename and surround the filename and library type in quotes:
 272
 273     fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"
 274
 275      You can specify a group of library files by putting a '@'
 276 symbol before a file that contains a list of file names to be
 277 searched.  For example, if @gmam.nam is in the fastgbs file, the
 278 file "gmam.nam" might contain the lines:
 279
 280     </seqdb/genbank
 281     gbpri1.seq 1
 282     gbpri2.seq 1
 283     gbpri3.seq 1
 284     gbpri4.seq 1
 285     gbrod.seq 1
 286     gbmam.seq 1
 287
 288 In this case, the line beginning with a '<' indicates the
 289 directory the files will be found in.  The remaining lines name
 290 the actual sequence files.  So the first sequence file to be
 291 searched would be:
 292
 293     /usr/lib/genbank/gbpri.seq
 294
 295 The notation "<PIRNAQ:" might be used under the VAX/VMS operating
 296 system. Under UNIX, the trailing '/' is left off, so the library
 297 directory might be written as "</usr/seqlib".
 298
 299      The FASTA programs can search a database composed of
 300 different files in different sequence formats.  For example, you
 301 may wish to search the Genbank files (in GenBank flat file
 302 format) and the EMBL DNA sequence database on CD-ROM.  To do
 303 this, you simply list the names and filetypes of the files to be
 304 searched in a file of filenames.  For example, to search the
 305 mammalian portion of Genbank, the unannotated portion of Genbank,
 306 and the unannotated portion of the EMBL library, you could use
 307 the file:
 308
 309     </usr/lib/DNA
 310     gbpri.seq 1
 311     #  (this '#' causes the program to display the size of the library)
 312     gbrod.seq 1
 313     ...
 314     gbmam.seq 1
 315     ...
 316     gbuna.seq 1
 317     ...
 318     unanno.seq 5
 319     #
 320
 321     You do not need to include library format numbers if  you
 322     only use the Pearson/FASTA version of the PIR protein se-
 323     quence library.  If no library  type  is  specified,  the
 324     program assumes that type 0 is being used.
 325
 326      Test the setup by running FASTA.  Enter the sequence file
 327 'mgstm1.aa' when the program requests it (this file is included
 328 with the programs).  The program should then ask you to select a
 329 protein sequence library.  Alternatively, if you run the TFASTA
 330 program and use the mgstm1.aa query sequence, the program should
 331 show you a selection of DNA sequence libraries.  Once the fastgbs
 332 file has been set up correctly, you can set FASTLIBS=fastgbs in
 333 your AUTOEXEC.BAT file, and you will not need to remember where
 334 the libraries are kept or how they are named.
 335
 336 3.  Using the FASTA Package
 337
 338 3.1.  Overview
 339
 340      The FASTA sequence comparison programs all require similar
 341 information, the name of a query sequence file, a library file,
 342 and the ktup parameter.  All of the programs can accept arguments
 343 on the command line, or they will prompt for the file names and
 344 ktup value.
 345
 346 To use FASTA, simply type:
 347
 348     FASTA
 349     and you will be prompted for :
 350          the name of the test sequence file
 351          the name of the library file
 352          and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
 353          (ktup of 2 is about 5 times faster than ktup = 1)
 354
 355 The program can also be run by typing
 356
 357     FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
 358
 359 Included with the package are several test files.  To check to
 360 make certain that everything is working, you can try:
 361
 362     fasta musplfm.aa prot_test.lib
 363     and
 364     tfastx mgstm1.aa gst.nlib
 365
 366 3.2.  Sequence files
 367
 368      The fasta3 programs know about three kinds of sequence
 369 files: (1) plain sequence files - files that contain nothing but
 370 sequence residues - can only be used as query sequences. (2)
 371 FASTA format files.  These are the same as plain sequence files,
 372 each sequence is preceded by a comment line with a '>' in the
 373 first column. (3) distributed sequence libraries (this is a broad
 374 class that includes the NBRF/PIR VMS and blocked ascii formats,
 375 Genbank flat-file format, EMBL flat-file format, and
 376 Intelligenetics format.  All of the files that you create should
 377 be of type (1) or (2).  FASTA format files (ones with a '>' and
 378 comment before the sequence) are preferred, because they can be
 379 used as query or library sequence files by all of the programs.
 380
 381      I have included several sample test files, *.aa and *.seq as
 382 well as two small sequence libraries, prot_test.lib and gst.nlib.
 383 The first line may begin with a '>' by a comment.  Spaces and
 384 tabs (and anything else that is not an amino-acid code) are
 385 ignored.
 386
 387      Library files should have the form:
 388
 389     >Sequence name and identifier
 390     A F A S Y T .... actual sequence.
 391     F S S       .... second line of sequence.
 392     >Next sequence name and identifier
 393
 394 This is often referred to as "FASTA" or format.  You can build
 395 your own library by concatenating several sequence files.  Just
 396 be sure that each sequence is preceded by a line beginning with a
 397 '>' with a sequence name.
 398
 399      The test file should not have lines longer than 120
 400 characters, and sequences entered with word processors should use
 401 a document mode, with normal carriage returns at the end of
 402 lines.
 403
 404      A different format is required to specify the ordered
 405 peptide mixture for fastf3/tfastf3. For example:
 406
 407     >mgstm1
 408     MGCEN,
 409     MIDYP,
 410     MLLAY,
 411     MLLGY
 412
 413 indicates m in the first position of all three peptides (as from
 414 CNBr), G, I, L (twice) in the second position (first cycle),
 415 C,D,L (twice) in the third position, etc.  The commas (,) are
 416 required to indicate the number of fragments in the mixture, but
 417 there should be no comma after the last residue.
 418
 419      For the fasts3/tfasts3 program, the format is the same,
 420 except that there is no requirement for the peptides to be the
 421 same length.
 422
 423 4.  Statistical Significance
 424
 425      All the programs in the FASTA3 package attempt to calculate
 426 accurate estimates of the statistical significance of a match.
 427 For fasta3, ssearch3, and fastx3/y3, these estimates are very
 428 accurate (Pearson, 1998, Zhang et al., 1997)..  Altschul et al.
 429 (Altschul et al., 1994) provides an excellent review of the
 430 statistics of local similarity scores.  Local sequence similarity
 431 scores follow the extreme value distribution, so that P(s > x) =
 432 1 - exp(-exp(-lambda(x-u)) where u = ln(Kmn)/lambda and m,m are
 433 the lengths of the query and library sequence. This formula can
 434 be rewritten as: 1 - exp(-Kmn exp(-lambda x), which shows that
 435 the average score for an unrelated library sequence increases
 436 with the logarithm of the length of the library sequence.  The
 437 fasta3 programs use simple linear regression against the the log
 438 of the library sequence length to calculate a normalized "z-
 439 score" with mean 50, regardless of library sequence length, and
 440 variance 10. (Several other estimation methods are available with
 441 the -z option.) These z-scores can then be used with the extreme
 442 value distribution and the poisson distribution (to account for
 443 the fact that each library sequence comparison is an independent
 444 test) to calculate the number of library sequences to obtain a
 445 score greater than or equal to the score obtained in the search.
 446 The original idea and routines to do the linear regression on
 447 library sequence length were provided Phil Green, U. Washington.
 448 This version uses a slightly different strategy for fitting the
 449 data than those originally provided by Dr. Green.
 450
 451      The expected number of sequences is plotted in the histogram
 452 using an "*". Since the parameters for the extreme value
 453 distribution are not calculated directly from the distribution of
 454 similarity scores, the pattern of "*'s" in the histogram gives a
 455 qualitative view of how well the statistical theory fits the
 456 similarity scores calculated by the programs.  For fasta3, if
 457 optimized scores are calculated for each sequence in the database
 458 (the default), the agreement between the actual distribution of
 459 "z-scores" and the expected distribution based on the length
 460 dependence of the score and the extreme value distribution is
 461 usually very good.  Likewise, the distribution of ssearch3 Smith-
 462 Waterman scores typically agrees closely with the <actual
 463 distribution of "z-scores."  The agreement with unoptimized
 464 scores, ktup=2, is often not very good, with too many high
 465 scoring sequences and too few low scoring sequences compared with
 466 the predicted relationship between sequence length and similarity
 467 score.  In those cases, the expectation values may be
 468 overestimates.
 469
 470      With version 33t01, all the FASTA programs also report a
 471 "bit" score, which is equivalent to the bit score reported by
 472 BLAST2.  The FASTA33/BLAST2 bit score is calculated as: (lambda*S
 473 - ln K)/ln 2, where S is the raw similarity score, lambda and K
 474 are statistical parameters estimated from the distribution of
 475 unrelated sequence similarity scores.  The statistical
 476 signficance of a given bit score depends on the lengths of the
 477 query and library sequences and the size of the library, but a 1
 478 bit increase in score corresponds to a 2-fold reduction in
 479 expectation; a 10-bit increase implies 1000-fold lower
 480 expectation, etc.
 481
 482      The statistical routines assume that the library contains a
 483 large sample of unrelated sequences.  If this is not true, then
 484 statistical parameters can be estimated by using the -z 11-15,
 485 options.  -z options greater than 10 calculate a shuffled
 486 similarity score for each library sequence, in addition to the
 487 unshuffled score, and estimate the statistical parameters from
 488 the scores of the shuffled sequences.  If there are fewer than 20
 489 sequences in the library, the statistical calculations are not
 490 done.
 491
 492      For protein searches, library sequences with E() values <
 493 0.01 for searches of a 10,000 entry protein database are almost
 494 always homologous. Frequently sequences with E()-values from 1 -
 495 10 are related as well, but unrelated sequences ( 1 - 10 per
 496 search) will have scores in this renage as well. Remember,
 497 however, that these E() values also reflect differences between
 498 the amino acid composition of the query sequence and that of the
 499 "average" library sequence.  Thus, when searches are done with
 500 query sequences with "biased" amino-acid composition, unrelated
 501 sequences may have "significant" scores because of sequence bias.
 502 PRSS3 can address this problem by calculating similarity scores
 503 for random sequences with the same length and amino acid
 504 composition.
 505
 506 5.  Options
 507
 508      Command line options are available to change the scoring
 509 parameters and output display. Command line options must preceed
 510 other program arguments, such as the query and library file
 511 names.
 512
 513 5.1.  Command line options
 514
 515 -a   (fasta3, ssearch3 only) show both sequences in their
 516      entirety.
 517
 518 -A   force Smith-Waterman alignments for fasta3 DNA sequences.
 519      By default, only fasta3 protein sequence comparisons use
 520      Smith-Waterman alignments.
 521
 522 -B   Show normalized score as a z-score, rather than a bit-score
 523      in the list of best scores.
 524
 525 -b # Number of sequence scores to be shown on output.  In the
 526      absence of this option, fasta (and tfasta and ssearch)
 527      display all library sequences obtaining similarity scores
 528      with expectations less than 10.0 if optimized score are
 529      used, or 2.0 if they are not. The -b option can limit the
 530      display further, but it will not cause additional sequences
 531      to be displayed.
 532
 533 -c # Threshold score for optimization (OPTCUT).  Set "-c 1" to
 534      optimize every sequence in a database.
 535
 536 -E # Limit the number of scores and alignments shown based on the
 537      expected number of scores.  Used to override the expectation
 538      value of 10.0 used by default.  When used with -Q, -E 2.0
 539      will show all library sequences with scores with an
 540      expectation value <= 2.0.
 541
 542 -d # Maximum number of alignments to be displayed.  Ignored if
 543      "-Q" is not used.
 544
 545 -f   Penalty for the first residue in a gap (-12 by default for
 546      proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).
 547
 548 -F # Limit the number of scores and alignments shown based on the
 549      expected number of scores. "-E #" sets the highest E()-value
 550      shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001"
 551      will not show any matches or alignments with E() < 0.0001.
 552      This allows one to skip over close relationships in searches
 553      for more distant relationships.
 554
 555 -g   Penalty for additional residues in a gap (-2 by default for
 556      proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
 557
 558 -h   Penalty for frameshift (fastx3/y3, tfastx3/y3 only).
 559
 560 -H   Omit histogram.
 561
 562 -i   Invert (reverse complement) the query sequence if it is DNA.
 563      For tfasta3/x3/y3, search the reverse complement of the
 564      library sequence only.
 565
 566 -j # Penalty for frameshift within a codon (fasty3/tfasty3 only).
 567
 568 -l file
 569      Location of library menu file (FASTLIBS).
 570
 571 -L   Display more information about the library sequence in the
 572      alignment.
 573
 574 -M low-high
 575      Range of amino acid sequence lengths to be included in the
 576      search.
 577
 578 -m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10
 579
 580              -m 0        -m 1          -m 2          -m 3        -m 4
 581          MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
 582          ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
 583          MWKSCGYPYT   MWKSCGYPYT
 584
 585      -m 5 provides a combination of -m 4 and -m 0. -m 6 provides
 586      -m 5 plus HTML formatting.
 587
 588 -m 9 provides coordinates and scores with the best score
 589      information.  A simple " -m 9 extends the normal best score
 590      information:
 591
 592          The best scores are:                                      opt bits E(14548)
 593          XURTG4 glutathione transferase (EC 2.5.1.18) 4 -   ( 219) 1248 291.7 1.1e-79
 594
 595      to include the additional information (on the same line,
 596      separated by a <tab>):
 597
 598          %_id  %_gid   sw  alen  an0  ax0  pn0  px0  an1  ax1 pn1 px1 gapq gapl  fs
 599          0.771 0.771 1248  218    1  218    1  218    1  218    1  219   0   0   0
 600
 601       -m 9c provides additional information: an encoded alignment
 602      string.  Thus:
 603
 604                 10        20        30        40        50          60         70
 605          GT8.7  NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
 606                 :.::  . :: ::  .   .:::         : .:    ::.:   .: : ..:.. :::  :..:
 607          XURTG  NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
 608                         20        30                 40        50        60
 609
 610      would be encoded:
 611
 612          =23+9=13-2=10-1=3+1=5
 613
 614      The alignment encoding is with repect to the alignment, not
 615      the sequences.  The coordinate of the alignment is given
 616      earlier in the " -m 9c" line.
 617
 618 -m 10
 619      -m 10 is a new, parseable format for use with other
 620      programs.  See the file "readme.v20u4" for a more complete
 621      description.
 622
 623      As of version "fa34t23b2", it has become possible to combine
 624      independent "-m" options.  Thus, one can use "-m 1 -m 6 -m
 625      9".
 626
 627 -M low-high
 628      Include library sequences (proteins only) with lengths
 629      between low and high.
 630
 631 -n   Force the query sequence to be treated as a DNA sequence.
 632      This is particularly useful for query sequences that contain
 633      a large number of ambiguous residues, e.g. transcription
 634      factor binding sites.
 635
 636 -O   Send copy of results to "filename."  Helpful for
 637      environments without STDOUT (mostly for the Macintosh).
 638
 639 -o   Turn off default optimization of all scores greater than
 640      OPTCUT. Sort results by "initn" scores (reduces the accuracy
 641      of statistical estimates).
 642
 643 -p   Force query to be treated as protein sequence.
 644
 645 -Q,-q
 646      Quiet - does not prompt for any input.  Writes scores and
 647      alignments to the terminal or standard output file.
 648
 649 -r   Specify match/mismatch scores for DNA comparisons.  The
 650      default is "+5/-4". "+3/-2" can perform better in some
 651      cases.
 652
 653 -R file
 654      Save a results summary line for every sequence in the
 655      sequence library.  The summary line includes the sequence
 656      identifier, superfamily number (if available) position in
 657      the library, and the similarity scores calculated.  This
 658      option can be used to evaluate the sensitivity and
 659      selectivity of different search strategies (Pearson, 1995,
 660      Pearson, 1998).
 661
 662 -s file
 663      Specify the scoring matrix file.  fasta3 uses the same
 664      scoring matrices as Blast1.4/2.0.  Several scoring matrix
 665      files are included in the standard distribution.  For
 666      protein sequences: codaa.mat - based on minimum mutation
 667      matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250
 668      matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
 669      pam120.mat - a PAM120 matrix.  The default scoring matrix is
 670      BLOSUM50 ("-s BL50"). Other matrices available from within
 671      the program are: PAM250/"-s P250", PAM120/"-s P120",
 672      PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40"
 673      (MDM are modern PAM matrices from Jones et al. (Jones et
 674      al., 1992),), BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s
 675      BL80".
 676
 677 -S   Treat lower-case characters in the query or library
 678      sequences as "low-complexity" ("seg"-ed) residues.
 679      Traditionally, the "seg" program (Wootton and
 680      Federhen, 1993) is used to remove low complexity regions in
 681      DNA sequences by replacing the residues with an "X".  When
 682      the "-S" option is used, the FASTA33 programs provide a
 683      potentially more informative approach.  With "-S", lower
 684      case characters in the query or database sequences are
 685      treated as "X"'s during the initial scan, but are treated as
 686      normal residues during the final alignment display.  Since
 687      statistical significance is calculated from the similarity
 688      score calculated during the library search, when the lower
 689      case residues are "X"'s, low complexity regions will not
 690      produce statistically significant matches.  However, if a
 691      significant alignment contains low complexity regions, their
 692      alignmen is shown.  With "-S", lower case characters may be
 693      included in the alignment to indicate low complexity
 694      regions, and the final alignment score may be higher than
 695      the score obtained during the search.
 696
 697      The pseg program can be used to produce databases (or query
 698      sequences) with lower case residues indicating low
 699      complexity regions using the command:
 700
 701          pseg database.fasta -z 1 -q  > database.lc_seg
 702
 703      (seg can also be used with some post processing, see
 704      readme.v33tx.)
 705
 706 -U   Treat the query sequence an RNA sequence.  In addition to
 707      selecting a DNA/RNA alphabet, this option causes changes to
 708      the scoring matrix so that 'G:A' , 'T:C' or 'U:C' are scored
 709      as 'G:G'.
 710
 711 -V str
 712      It is now possible to specify some annotation characters
 713      that can be included (and will be ignored), in the query
 714      sequence file.  Thus, One might have a file with:
 715      "ACVS*ITRLFT?", where "*" and "?"  are used to indicate
 716      phosphorylation.  By giving the option -V '*?', those
 717      characters in the query will be moved to an "annotation
 718      string", and alignments that include the annotated residues
 719      will be highlighted with the appropriate character above the
 720      sequence (on the number line).
 721
 722 -w # Line length (width) = number (<200)
 723
 724 -W #  context length (default is 1/2 of line width -w) for
 725      alignment, like fasta and ssearch, that provide additional
 726      sequence context.
 727
 728 -x # Specify the penalty for a match to an 'X', independently of
 729      the PAM matrix.  Particularly useful for fastx3/fasty3,
 730      where termination codons are encoded as 'X'.
 731
 732 -X   Specifies offsets for the beginning of the query and library
 733      sequence.  For example, if you are comparing upstream
 734      regions for two genes, and the first sequence contains 500
 735      nt of upstream sequence while the second contains 300 nt of
 736      upstream sequence, you might try:
 737
 738          fasta -X "-500 -300" seq1.nt seq2.nt
 739
 740      If the -X option is not used, FASTA assumes numbering starts
 741      with 1.  (You should double check to be certain the negative
 742      numbering works properly.)
 743
 744 -y   Set the width of the band used for calculating "optimized"
 745      scores.  For proteins and ktup=2, the width is 16.  For
 746      proteins with ktup=1, the width is 32 by default.  For DNA
 747      the width is 16.
 748
 749 -z -1,0,1,2,3,4,5
 750      -z -1 turns off statistical calculations. z 0 estimates the
 751      significance of the match from the mean and standard
 752      deviation of the library scores, without correcting for
 753      library sequence length.  -z 1 (the default) uses a weighted
 754      regression of average score vs library sequence length; -z 2
 755      uses maximum likelihood estimates of Lambda and K; -z 3 uses
 756      Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 - 5
 757      uses two variations on the -z 1 strategy. -z 1 and -z 2 are
 758      the best methods, in general.
 759
 760 -z 11,12,14,15
 761      estimate the statistical parameters from shuffled copies of
 762      each library sequence.  This doubles the time required for a
 763      search, but allows accurate statistics to be estimated for
 764      libraries comprised of a single protein family.
 765
 766 -Z db_size
 767      set the apparent size of the database to be used when
 768      calculating expectation E() values.  If you searched a
 769      database with 1,000 sequences, but would like to have the
 770      E()-values calculated in the context of a 100,000 sequence
 771      database, use '-Z 100000'.
 772
 773 -1   sort output by init1 score (for compatibility with FASTP -
 774      do not use).
 775
 776 -3   translate only three forward frames
 777
 778 For example:
 779
 780     fasta -w 80 -a seq1.aa seq.aa
 781
 782 would compare the sequence in seq1.aa to that in seq2.aa and
 783 display the results with 80 residues on an output line, showing
 784 all of the residues in both sequences.  Be sure to enter the
 785 options before entering the file names, or just enter the options
 786 on the command line, and the program will prompt for the file
 787 names.
 788
 789      (November, 1997) In addition, it is now possible to provide
 790 the fasta programs with the query sequence (fasta, fasty,
 791 ssearch, tfastx), or two sequences (prss, lalign, plalign) from
 792 the unix "stdin" stream.  This makes it much easier to set up
 793 FASTA or PRSS WWW pages.  To specify that stdin be used, rather
 794 than a file, the file name should be specified as '-' or '@' (the
 795 latter file name makes it possible to specify a subset of the
 796 sequence).  Thus:
 797
 798     cat query.aa | fasta -q @:25-75 s
 799
 800 would take residues 25-75 from query.aa and search the 's'
 801 library (see the discussion of FASTLIBS).
 802
 803 5.2.  Environment variables
 804
 805      Because the current version of the program allows the user
 806 to set virtually every option on the command line (except the
 807 ktup, which must be set as the third command line argument), only
 808 the FASTLIBS environment variable is routinely used.
 809
 810 FASTLIBS
 811      specifies the location of the file which contains the list
 812      of library descriptions, locations, and library types (see
 813      section on finding library files).
 814
 815 6.  Frequently Asked Questions
 816
 817  (1)   Which program should I use? See Table I.
 818
 819  (2)   How do I search with both DNA strands with fasta3 and
 820        fastx3? With version 32 of the FASTA program package, all
 821        searches that use DNA queries (e.g. fasta3, fastx3/y3)
 822        examine both strands. To revert to earlier FASTA behavior
 823        - only looking at the forward or reverse strand - use -3
 824        to search only the forward strand and -i -3 to search only
 825        the reverse strand.
 826
 827  (3)   When I search Genbank - the program reports: 0 residues in
 828        0 sequences.  This typically happens because the program
 829        does not know that you are searching a Genbank flatfile
 830        database and is looking for a FASTA format database.  Be
 831        certain to specify the library type ("1" for Genbank
 832        flatfile) with the database name.
 833
 834  (4)   What is the difference between fastx3 and fasty3 (or
 835        tfastx3 and tfasty3).  [t]fastx3 uses a simpler codon
 836        based model for alignments that does not allow frameshifts
 837        in some codon positions (see ref. (Zhang et al., 1997)).
 838        tfastx3 is about 30% faster, but tfasty3 can produce
 839        higher quality alignments in some cases.
 840
 841  (5)   When I run fasta3 -q, I don't see any (or very little)
 842        output, but I get lots of scores when I run interactively.
 843        With the -Q option, the number of high scores displayed is
 844        limited by the -E # cutoff, which is 10.0 for protein
 845        comparisons, 2.0 for DNA comparisons, and 5.0 for
 846        translated DNA:protein comparisons.  In interactive mode
 847        (without -Q), by default you see 20 high scores,
 848        regardless of E() value.
 849
 850  (6)   What is ktup - All of the programs with fast in their name
 851        use a computer science method called a lookup table to
 852        speed the search.  For proteins with ktup=2, this means
 853        that the program does not look at any sequence alignment
 854        that does not involve matching two identical residues in
 855        both sequences.  Likewise with DNA and ktup = 6, the
 856        initial alignment of the sequences looks for 6 identical
 857        adjacent nucleotides in both sequences.  Because it is
 858        less likely that two identical amino-acids will line up by
 859        chance in two unrelated proteins, this speeds up the
 860        comparison.  But very distantly related sequences may
 861        never have two identical residues in a row but will have
 862        single aligned identities.  In this case, ktup = 1 may
 863        find alignments that ktup=2 misses.
 864
 865  (7)   Sometimes, in the list of best scores, the same sequence
 866        is shown twice with exactly the same score.  Sometimes,
 867        the sequence is there twice, but the scores are slightly
 868        different. When any of the fasta3 programs searches a long
 869        sequence, it breaks the sequence up into overlapping
 870        pieces.  The length of the piece depends on the length of
 871        the query and the particular program being used (it can
 872        also be controlled with the -N #### option).  Since the
 873        pieces overlap by the length of the query sequence (or
 874        3*query_length for fastx/y3 and tfasta/x/y3), if the
 875        highest scoring alignment is at the end of one piece, it
 876        will be scored again at the beginning of the next piece.
 877        If the alignment is not be completely included in the
 878        overlap region, one of the pieces will give a higher score
 879        than the other.  These duplications can be detected by
 880        looking at the coordinates of the alignment.  If either
 881        the beginning or end coordinate is identical in two
 882        alignments, the alignments are at least partially
 883        duplicates.
 884
 885 As always, please inform me of bugs as soon as possible.
 886
 887 William R. Pearson
 888 Department of Biochemistry
 889 Jordan Hall Box 800733
 890 U. of Virginia
 891 Charlottesville, VA
 892
 893 wrp@virginia.EDU
 894
 895 7.  References
 896
 897 Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C.
 898 (1994). Issues in searching molecular sequence databases. Nature
 899 Genet. 6,119-129.
 900
 901 Altschul, S. F. and Gish, W. (1996). Local alignment statistics.
 902 Methods Enzymol. 266,460-480.
 903
 904 Bairoch, A. and Apweiler, R. (1996). The Swiss-Prot protein
 905 sequence data bank and its new supplement TrEMBL. Nucleic Acids.
 906 Res. 24,21-25.
 907
 908 Barker, W. C., Garavelli, J. S., Haft, D. H., Hunt, L. T.,
 909 Marzec, C. R., Orcutt, B. C., Srinivasarao, G. Y., Yeh, L. S. L.,
 910 Ledley, R. S., Mewes, H. W., Pfeiffer, F., and Tsugita, A.
 911 (1998). The PIR-International Protein Sequence Database. Nucleic
 912 Acids Res 26,27-32.
 913
 914 Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978). A model
 915 of evolutionary change in proteins. In Atlas of Protein Sequence
 916 and Structure, vol. 5, supplement 3. M. Dayhoff, ed. (Silver
 917 Spring, MD: National Biomedical Research Foundation), pp.
 918 345-352.
 919
 920 Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The
 921 rapid generation of mutation data matrices from protein
 922 sequences. Comp. Appl. Biosci. 8,275-282.
 923
 924 Pearson, W. R. (2000). Flexible similarity searching with the
 925 FASTA3 program package. In Bioinformatics Methods and Protocols,
 926 S. Misener and S. A. Krawetz, ed. (Totowa, NJ: Humana Press), pp.
 927 185-219.
 928
 929 Pearson, W. R. and Lipman, D. J. (1988). Improved tools for
 930 biological sequence comparison. Proc. Natl. Acad. Sci. USA
 931 85,2444-2448.
 932
 933 Pearson, W. R. (1995). Comparison of methods for searching
 934 protein sequence databases. Prot. Sci. 4,1145-1160.
 935
 936 Pearson, W. R. (1996). Effective protein sequence comparison.
 937 Methods Enzymol. 266,227-258.
 938
 939 Pearson, W. R. (1998). Empirical statistical estimates for
 940 sequence similarity searches. J. Mol. Biol. 276,71-84.
 941
 942 Smith, T. F. and Waterman, M. S. (1981). Identification of common
 943 molecular subsequences. J. Mol. Biol. 147,195-197.
 944
 945 Wootton, J. C. and Federhen, S. (1993). Statistics of local
 946 complexity in amino acid sequences and sequence databases.
 947 Comput. Chem. 17,149-163.
 948
 949 Zhang, Z., Pearson, W. R., and Miller, W. (1997). Aligning a DNA
 950 sequence with a protein sequence. J. Computational Biology
 951 4,339-349.
 952