binaries/src/fasta34/fasta3x.me

   1 .nr pp 11
   2 .nr sp 11
   3 .nr tp 11
   4 .nr fp 10
   5 .nr fi 0n
   6 .sz 11
   7 .if t \{
   8 .po 1i
   9 .he 'FASTA3.DOC''Release 3.4, Fall, 2003'
  10 .fo ''- % -''
  11 \}
  12 .if n \{
  13 .po 0
  14 .na
  15 .nh
  16 \}
  17 .ll 6.5i
  18 .ce
  19 \fBCOPYRIGHT NOTICE\fP
  20 .lp
  21 Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William
  22 R. Pearson and the University of Virginia.  All rights reserved. The
  23 FASTA program and documentation may not be sold or incorporated into a
  24 commercial product, in whole or in part, without written consent of
  25 William R. Pearson and the University of Virginia.  For further
  26 information regarding permission for use or reproduction, please
  27 contact: David Hudson, Assistant Provost for Research, University of
  28 Virginia, P.O. Box 9025, Charlottesville, VA 22906-9025, (434)
  29 924-6853
  30 .uh "\s+2The FASTA program package\s0"
  31 .uh "Introduction"
  32 .pp
  33 This documentation describes version 3 of the FASTA program
  34 package (see W. R. Pearson and D. J. Lipman (1988), "Improved Tools
  35 for Biological Sequence Analysis", PNAS 85:2444-2448 [.wrp881.]; W. R.
  36 Pearson (1996), "Effective protein sequence comparison",
  37 Meth. Enzymol. 266:227-258 [.wrp960.]; Pearson et al. (1997) Genomics
  38 46:24-36 [.wrp971.]; Pearson (1999) Meth. in Molecular Biology
  39 132:185-219 [.wrp000.]).  Version 3 of the FASTA package contains many
  40 programs for searching DNA and protein databases and one program
  41 (prss3) for evaluating statistical significance from randomly shuffled
  42 sequences.  Several additional analysis programs, including programs
  43 that produce local alignments, are available as part of version 2 of
  44 the FASTA package, which is still available.
  45 .pp
  46 This document is divided into three sections: (1) A summary overview of
  47 the programs in the FASTA3 package; (2) A guide to installing the
  48 programs and databases; (3) A guide to using the FASTA programs. The
  49 revision history of the programs can be found in the
  50 \fCreadme.v30..v34\fP files. The programs are easy to use, so if
  51 you are using them on a machine that is administered by someone else,
  52 you can skip section (2) and focus on (1) and (3) to learn how to use
  53 the programs.  If you are installing the programs on your own
  54 machine, you will need to read section (2) carefully.
  55 .sh 1 "An overview of the \f(CBFASTA\fP programs"
  56 .pp
  57 Although there are a large number of programs in this package, they
  58 belong to three groups: (1)
  59 "Conventional" Library search programs:
  60 FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3, TFASTY3, SSEARCH3;
  61 (2)
  62 Programs for searching with short fragments:
  63 FASTS3, FASTF3, TFASTS3, TFASTF3;
  64 (3)
  65 Statistical significance: PRSS3.
  66 Programs that start with \f(CBfast\fP search protein
  67 databases, while \f(CBtfast\fP programs search translated DNA databases.
  68 Table I gives a brief description of the programs.
  69 .lp
  70 .(z
  71 .TS
  72 center;
  73 c s
  74 c s
  75 = =
  76 l lw(5.5i).
  77 \d\fBTable I. Comparison programs in the FASTA3 package\fP\u
  78
  79 \fCfasta3\fP    T{
  80 Compare a protein sequence to a protein sequence
  81 database or a DNA sequence to a DNA sequence database using the FASTA
  82 algorithm.[.wrp881,wrp960.]  Search speed and selectivity are
  83 controlled with the \fIktup\fP (wordsize) parameter.  For protein
  84 comparisons, \fIktup\fP = 2 by default; \fIktup\fP =1 is more sensitive
  85 but slower.  For DNA comparisons, \fIktup\fP=6 by default; \fIktup\fP=3 or
  86 \fIktup\fP=4 provides higher sensitivity; \fIktup\fP=1 should be used for
  87 oligonucleotides (DNA query lengths < 20).
  88 T}
  89
  90 \fCssearch3\fP  T{
  91 Compare a protein sequence to a protein sequence
  92 database or a DNA sequence to a DNA sequence database using the
  93 Smith-Waterman algorithm.[.wat815.]  \fCssearch3\fP is about 10-times
  94 slower than FASTA3, but is more sensitive for full-length protein
  95 sequence comparison.
  96 T}
  97
  98 \fCfastx3\fP/ \fCfasty3\fP      T{
  99 Compare a DNA sequence to a protein
 100 sequence database, by comparing the translated DNA sequence in three
 101 frames and allowing gaps and frameshifts.  \fCfastx3\fP uses a
 102 simpler, faster algorithm for alignments that allows frameshifts only
 103 between codons; \fCfasty3\fP is slower but produces better alignments
 104 with poor quality sequences because frameshifts are allowed within
 105 codons.
 106 T}
 107
 108 \fCtfastx3\fP/ \fCtfasty3\fP    T{
 109 Compare a protein sequence to a DNA sequence
 110 database, calculating similarities with frameshifts to the forward and
 111 reverse orientations.
 112 T}
 113
 114 \fCtfasta3\fP   T{
 115 Compare a protein sequence to a DNA sequence database, calculating
 116 similarities (without frameshifts) to the three forward and three reverse
 117 reading frames.  \fCtfastx3\fP and \fCtfasty3\fP are preferred because
 118 they calculate similarity over frameshifts.
 119 T}
 120
 121 \fCfastf3/tfastf3\fP    T{
 122 Compares an ordered peptide mixture, as would be obtained by
 123 Edman degradation of a CNBr cleavage of a protein, against a protein
 124 (\fCfastf\fP) or DNA (\fCtfastf\fP) database.
 125 T}
 126
 127 \fCfasts3/tfasts3\fP    T{
 128 Compares a set of short peptide fragments, as would be obtained
 129 from mass-spec. analysis of a protein, against a
 130 protein (\fCfasts\fP) or DNA (\fCtfasts\fP) database.
 131 T}
 132 =       =
 133 .TE
 134 .)z
 135 .sh 1 "Installing FASTA and the sequence databases"
 136 .sh 2 "Obtaining the libraries"
 137 .pp
 138 The FASTA program package does not include any protein or DNA sequence
 139 libraries.  Protein databases are available on CD-ROM from the PIR and
 140 EMBL (see below), or via anonymous FTP from many different sources.
 141 As this document is updated in the fall of 1999, no DNA databases are
 142 available on CD-ROM from the major sequence databases: Genbank at the
 143 National Center for Biotechnology Information (\fCwww.ncbi.nlm.nih.gov\fP and
 144 \fCftp://ncbi.nlm.nih.gov\fP) and EMBL at the European Bioinformatics
 145 Institute (\fCwww.ebi.ac.uk\fP). However, the databases are available
 146 via anonymous FTP from both sites.
 147 .sh 3 "The GENBANK DNA sequence library"
 148 .pp
 149 Because of the large size of DNA databases, you will probably want to
 150 keep DNA databases in only one, or possibly two, formats.  The FASTA3
 151 programs that search DNA databases - \fCfasta3\fP, \fCtfastx/y3\fP,
 152 and \fCtfasta3\fP - can read DNA databases in Genbank flatfile (not
 153 ASN.1), FASTA, GCG/compressed-binary, BLAST1.4 (\fCpressdb\fP), and
 154 BLAST2.0 (\fCformatdb\fP) formats, as well as EMBL format.  If you are
 155 also running the GCG suite of sequence analysis programs, you should
 156 use GCG/compressed-binary format or BLAST2.0 format for your
 157 \fCfasta3\fP searches.  If not, BLAST2.0 is a good choice.  These
 158 files are considerably more compact than Genbank flat files, and are
 159 preferred.  The NCBI does not provide software for converting from
 160 Genbank flat files to Blast2.0 DNA databases, but you can use the
 161 Blast \fCformatdb\fP program to convert ASN.1 formatted Genbank files,
 162 which are available from the NCBI \fCftp\fP site.
 163 .pp
 164 The NCBI also provides the \fCnr\fP, \fCswissprot\fP, and several EST
 165 databases that are used by BLAST in FASTA format from:
 166 \fCftp://ncbi.nlm.nih.gov/blast/db\fP.  These databases are updated
 167 nightly.
 168 .sh 3 "The NBRF protein sequence library"
 169 .pp
 170 You can obtain the PIR protein sequence database
 171 [.pir980.] from:
 172 .(l
 173 National  Biomedical Research Foundation
 174 Georgetown  University  Medical  Center
 175 3900 Reservoir Rd, N.W.
 176 Washington, D.C. 20007
 177 .)l
 178 or via ftp from \fCnbrf.georgetown.edu\fP or from the NCBI
 179 (\fCncbi.nlm.nih.gov/repository/PIR\fP). The data in the \fCascii\fP
 180 directory is in PIR Codata format, which is not widely used.  I
 181 recommend the PIR/VMS format data (libtype=5) in the \fCvms\fP
 182 directory.
 183 .sh 3 "The EBI/EMBL CD-ROM libraries"
 184 .pp
 185 The European Bioinformatics Institute (EBI) distributes both the EMBL
 186 DNA database and the SwissProt database on CD-ROM,[.apw961.] and they
 187 are available from:
 188 .(l
 189 EMBL-Outstation  European Bioinformatics Institute
 190 Wellcome Trust Genome Campus,
 191 Hinxton Hall
 192 Hinxton,
 193 Cambridge CB10 1SD
 194 United Kingdom
 195 Tel: +44 (0)1223 494444
 196 Fax: +44 (0)1223 494468
 197 Email: DATALIB@ebi.ac.uk
 198 .)l
 199 In addition, the SWISS-PROT protein sequence database is available via
 200 anonymous FTP from \fCftp://ftp.expasy.ch/databases/swiss-prot/\fP
 201 (also see \fCwww.expasy.ch\fP).
 202 .sh 2 "Finding the libraries: FASTLIBS"
 203 .pp
 204 The major problem that most new users of the FASTA package have is in
 205 setting up the program to find the databases and their library type.
 206 In general, if you cannot get \fCfasta3\fP to read a sequence
 207 database, it is likely that something is wrong with the \fCFASTLIBS\fP
 208 file.  A common problem is that the database file is found, but either
 209 no sequences are read, or an incorrect number of entries is read.
 210 This is almost always because the library format (\fClibtype\fP) is
 211 incorrect.  Note that a type 5 file (PIR/VMS format) can be read
 212 as a type 0 (default FASTA) format file, and the number of entries
 213 will be correct, but the sequence lengths will not.
 214 .pp
 215 All the search programs in the FASTA3 package use the environment
 216 variable \fCFASTLIBS\fP to find the protein and DNA sequence libraries.  The
 217 \fCFASTLIBS\fP variable contains the name of a file that has the actual
 218 filenames of the libraries.  The \fCfastlibs\fP file included with the
 219 distribution is an example of a file that can be referred to by
 220 FASTLIBS. To use the \fCfastlibs\fP file, type:
 221 .(l
 222 \fCsetenv FASTLIBS /usr/lib/fasta/fastgbs\fP (BSD UNIX/csh)
 223 or
 224 \fCexport FASTLIBS=/usr/lib/fasta/fastgbs\fP (SysV UNIX/ksh)
 225 .)l
 226 Then edit the \fCfastlibs\fP file to indicate where the protein and DNA
 227 sequence libraries can be found.  If you have a hard disk and your
 228 protein sequence library is kept in the file \fC/usr/lib/seq/aabank.lib\fP and
 229 your Genbank DNA sequence library is kept in the directory:
 230 \fC/usr/lib/genbank\fP, then \fCfastgbs\fP might contain:
 231 .ne 8
 232 .(l
 233 .ft C
 234 NBRF Protein$0P/usr/lib/seq/aabank.lib 0
 235 SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
 236 GB Primate$1P@/usr/lib/genbank/gpri.nam
 237 GB Rodent$1R@/usr/lib/genbank/grod.nam
 238 GB Mammal$1M@/usr/lib/genbank/gmammal.nam
 239 ^   1    ^^^^       4                   ^     ^
 240           23                             (5)
 241 .ft R
 242 .)l
 243 The first line of this file says that there is a copy of the NBRF
 244 protein sequence database (which is a protein database) in the file
 245 \fC/usr/lib/seq/aabank.lib\fP that can be selected by typing "P" on the
 246 command line or when the database menu is presented.
 247 .pp
 248 Note that there are 4 or 5 fields in the lines in \fCfastgbs\fP.  The first
 249 field is the description of the library which will be displayed by
 250 FASTA; it ends with a '$'.  The second field (1 character) is a 0 if
 251 the library is a protein library and 1 if it is a DNA library.  The
 252 third field (1 character) is the character to be typed to select the
 253 library.
 254 .pp
 255 The fourth field is the name of the library file.  In the example
 256 above, the \fC/usr/lib/seq/aabank.lib\fP file contains the entire
 257 protein sequence library.  However the DNA library file names are
 258 preceded by a '@', because these files (\fCgpri.nam, grod.nam,
 259 gmammal.nam\fP) do not contain the sequences; instead they contain the names
 260 of the files which contain the sequences.  This is done because the
 261 GENBANK DNA database is broken down into a large number of smaller
 262 files.  In order to search the entire primate database, you must
 263 search more than a dozen files.
 264 .pp
 265 In addition, an optional fifth field can be used to specify the format
 266 of the library file.  This field must be separated from the file name
 267 by a space character ('\ ').  Alternatively, you can specify the
 268 library format in a file of file names (a file preceded by an '@').
 269 In the example above, the \fCaabank.lib\fP file is
 270 in Pearson/FASTA format, while the \fCswiss.seq\fP file is in PIR/VMS format
 271 (from the EMBL CD-ROM). Currently, FASTA can read the following formats:
 272 .(l I
 273 .ft C
 274 0 Pearson/FASTA (>SEQID - comment/sequence)
 275 1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
 276 2 NBRF CODATA (ENTRY/SEQUENCE)
 277 3 EMBL/SWISS-PROT (ID/DE/SQ)
 278 4 Intelligenetics (;comment/SEQID/sequence)
 279 5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
 280 6 GCG (version 8.0) Unix Protein and DNA (compressed)
 281 11 NCBI Blast1.3.2 format  (unix only)
 282 12 NCBI Blast2.0 format  (unix only, fasta32t08 or later)
 283 .ft R
 284 .)l
 285 In particular, this version will work with the EMBL and PIR VMS
 286 formats that are distributed on the EMBL CD-ROM. The latter format
 287 (PIR VMS) is much faster to search than EMBL format.  This release
 288 also works with the protein and DNA database formats created for the
 289 BLASTP and BLASTN programs by SETDB and PRESSDB and with the new NCBI
 290 search format.  If a library format is not specified, for example,
 291 because you are just comparing two sequences, Pearson/FASTA (format 0)
 292 is used by default. To specify a library type on the command line,
 293 add it to the library filename and surround the filename and library
 294 type in quotes:
 295 .(l
 296 .ft C
 297 fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"
 298 .ft P
 299 .)l
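.pp
As a rough illustration of the field layout described above, the
following Python sketch splits one \fCfastgbs\fP line into its four
or five fields.  The function and class names are hypothetical; they
are not part of the FASTA package, and the programs' own parser may
differ in detail.

```python
from typing import NamedTuple, Optional


class LibEntry(NamedTuple):
    """One FASTLIBS line (hypothetical container, not part of FASTA)."""
    description: str        # field 1: text shown in the menu, ends at '$'
    is_dna: bool            # field 2: '0' = protein, '1' = DNA
    select_char: str        # field 3: character typed to select the library
    filename: str           # field 4: library file, or file of file names
    indirect: bool          # True when field 4 was preceded by '@'
    libtype: Optional[int]  # field 5: format number, None if absent


def parse_fastlibs_line(line: str) -> LibEntry:
    # Everything before the first '$' is the description.
    description, dollar, rest = line.strip().partition("$")
    if not dollar or len(rest) < 3 or rest[0] not in "01":
        raise ValueError("not a FASTLIBS entry: " + repr(line))
    target = rest[2:]
    # A leading '@' marks a file that lists the real sequence files.
    indirect = target.startswith("@")
    if indirect:
        target = target[1:]
    # The optional library type follows the file name after a space.
    filename, _, libtype = target.partition(" ")
    return LibEntry(description, rest[0] == "1", rest[1], filename,
                    indirect, int(libtype) if libtype.strip() else None)


entry = parse_fastlibs_line("SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5")
print(entry.select_char, entry.filename, entry.libtype)
```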
 300 .pp
 301 You can specify a group of library files by putting a '@' symbol
 302 before a file that contains a list of file names to be searched.  For
 303 example, if @gmam.nam is in the fastgbs file, the file "gmam.nam"
 304 might contain the lines:
 305 .(l
 306 .ft C
 307 </seqdb/genbank
 308 gbpri1.seq 1
 309 gbpri2.seq 1
 310 gbpri3.seq 1
 311 gbpri4.seq 1
 312 gbrod.seq 1
 313 gbmam.seq 1
 314 .ft R
 315 .)l
 316 In this case, the line beginning with a '<' indicates the directory
 317 the files will be found in.  The remaining lines name the actual
 318 sequence files.  So the first sequence file to be searched would be:
 319 .(l
 320 .ft C
 321 /seqdb/genbank/gbpri1.seq
 322 .ft R
 323 .)l
 324 The notation "\fC<PIRNAQ:\fP" might be used under the VAX/VMS operating
 325 system. Under UNIX, the trailing '/' is left off, so the library
 326 directory might be written as "\fC</usr/seqlib\fP".
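.pp
The way a file of file names is resolved can be sketched as follows.
This is a minimal Python illustration for UNIX-style paths only; the
function name is hypothetical, and lines beginning with '#' are simply
skipped here because they do not name a sequence file.

```python
import posixpath


def expand_name_file(lines):
    """Return (path, libtype) pairs for the entries in a file of file names."""
    directory = ""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("<"):
            # A '<' line sets the directory for the names that follow.
            directory = line[1:]
            continue
        name, _, libtype = line.partition(" ")
        # With no library type given, type 0 (Pearson/FASTA) is assumed.
        entries.append((posixpath.join(directory, name),
                        int(libtype) if libtype.strip() else 0))
    return entries


for path, libtype in expand_name_file(
        ["</seqdb/genbank", "gbpri1.seq 1", "gbpri2.seq 1", "gbmam.seq 1"]):
    print(path, libtype)
```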
 327 .pp
 328 The FASTA programs can search a database composed of different files
 329 in different sequence formats.  For example, you may wish to search
 330 the Genbank files (in GenBank flat file format) and the EMBL DNA
 331 sequence database on CD-ROM.  To do this, you simply list the names
 332 and filetypes of the files to be searched in a file of filenames.  For
 333 example, to search the mammalian portion of Genbank, the unannotated
 334 portion of Genbank, and the unannotated portion of the EMBL library,
 335 you could use the file:
 336 .(l I
 337 .ft C
 338 </usr/lib/DNA
 339 gbpri.seq 1
 340 \&#  (this '#' causes the program to display the size of the library)
 341 gbrod.seq 1
 342 \&...
 343 gbmam.seq 1
 344 \&...
 345 gbuna.seq 1
 346 \&...
 347 unanno.seq 5
 348 \&#
 349 .ft R
 350 .)l
 351 .(l I F
 352 You do not need to include library format numbers if you only use the
 353 Pearson/FASTA version of the PIR protein sequence library.  If no
 354 library type is specified, the program assumes that type 0 is being
 355 used.
 356 .)l
 357 .pp
 358 Test the setup by running FASTA.  Enter the sequence
 359 file '\fCmgstm1.aa\fP' when the program requests it (this file is
 360 included with the programs).  The program should then ask you to
 361 select a protein sequence library.  Alternatively, if you run the
 362 TFASTA program and use the mgstm1.aa query sequence, the program
 363 should show you a selection of DNA sequence libraries.
 364 Once the fastgbs file has been set up correctly, you can
 365 set FASTLIBS=fastgbs in your AUTOEXEC.BAT file, and you will not need to
 366 remember where the libraries are kept or how they are named.
 367 .ne 8
 368 .sh 1 "Using the FASTA Package"
 369 .sh 2 "Overview"
 370 .pp
 371 The FASTA sequence comparison programs all require similar
 372 information: the name of a query sequence file, a library file, and
 373 the \fIktup\fP parameter.  All of the programs can accept arguments
 374 on the command line, or they will prompt for the file names and
 375 \fIktup\fP value.
 376 .lp
 377 To use FASTA, simply type:
 378 .(l
 379 .ft C
 380 \f(CBFASTA\fP
 381 and you will be prompted for :
 382 .in +0.5i
 383 the name of the test sequence file
 384 the name of the library file
 385 and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
 386 (ktup of 2 is about 5 times faster than ktup = 1)
 387 .ft R
 388 .)l
 389 The program can also be run by typing
 390 .(l
 391 .ft C
 392 FASTA test.aa /lib/bigfile.lib \fIktup\fP (1 or 2)
 393 .ft R
 394 .)l
 395 .lp
 396 Included with the package are several test files.
 397 To check to make certain that everything is working, you can try:
 398 .(l
 399 .ft C
 400 fasta musplfm.aa prot_test.lib
 401 and
 402 tfastx mgstm1.aa gst.nlib
 403 .ft R
 404 .)l
 405 .sh 2 "Sequence files"
 406 .pp
 407 The \fCfasta3\fP programs know about three kinds of sequence files:
 408 (1) plain sequence files - files that contain nothing but
 409 sequence residues - can only be used as query sequences. (2) FASTA
 410 format files.  These are the same as plain sequence files, except that each
 411 sequence is preceded by a comment line with a '>' in the first
 412 column. (3) distributed sequence libraries (this is a broad class that
 413 includes the NBRF/PIR VMS and blocked ascii formats, Genbank flat-file
 414 format, EMBL flat-file format, and Intelligenetics format).  All of the
 415 files that you create should be of type (1) or (2).  FASTA format
 416 files (ones with a '>' and comment before the sequence) are preferred,
 417 because they can be used as query or library sequence files by all of
 418 the programs.
 419 .pp
 420 I have included several sample test files, \fC*.aa\fP and \fC*.seq\fP
 421 as well as two small sequence libraries, \fCprot_test.lib\fP and
 422 \fCgst.nlib\fP.  The first line may begin with a '>' followed by a comment.
 423 Spaces and tabs (and anything else that is not an amino-acid code) are
 424 ignored.
 425 .pp
 426 Library files should have the form:
 427 .(l
 428 .ft C
 429 >Sequence name and identifier
 430 A F A S Y T .... actual sequence.
 431 F S S       .... second line of sequence.
 432 >Next sequence name and identifier
 433 .ft R
 434 .)l
 435 This is often referred to as "FASTA" or "Pearson" format.  You can
 436 build your own library by concatenating several sequence files.  Just
 437 be sure that each sequence is preceded by a line beginning with a '>'
 438 with a sequence name.
 439 .pp
 440 The test file should not have lines longer than 120 characters, and
 441 sequences entered with word processors should use a document
 442 mode, with normal carriage returns at the end of lines.
 443 .pp
 444 A different format is required to specify the ordered peptide mixture for \fCfastf3/tfastf3\fP. For example:
 445 .(l I
 446 .ft C
 447 >mgstm1
 448 MGCEN,
 449 MIDYP,
 450 MLLAY,
 451 MLLGY
 452 .ft P
 453 .)l
 454 indicates \fCM\fP in the first position of all four peptides (as
 455 from CNBr), \fCG, I, L\fP (twice) in the second position (first cycle),
 456 \fCC,D,L\fP (twice) in the third position, etc.  The commas (\fC,\fP)
 457 are required to indicate the number of fragments in the mixture, but
 458 there should be no comma after the last residue.
 459 .pp
 460 For the \fCfasts3/tfasts3\fP programs, the format is the same, except that there
 461 is no requirement for the peptides to be the same length.
 462 .sh 1 "Statistical Significance"
 463 .pp
 464 All the programs in the FASTA3 package attempt to calculate accurate
 465 estimates of the statistical significance of a match. For
 466 \fCfasta3\fP, \fCssearch3\fP, and \fCfastx3/y3\fP, these estimates are
 467 very accurate.[.wrp971,wrp981.]  Altschul et al. [.alt940.] provide
 468 an excellent review of the statistics of local similarity scores.
 469 Local sequence similarity scores follow the extreme value
 470 distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u))) where u =
 471 ln(Kmn)/lambda and m,n are the lengths of the query and library
 472 sequence. This formula can be rewritten as: 1 - exp(-Kmn exp(-lambda
 473 x)), which shows that the average score for an unrelated library
 474 sequence increases with the logarithm of the length of the library
 475 sequence.  The \fCfasta3\fP programs use simple linear regression
 476 against the log of the library sequence length to calculate a
 477 normalized "z-score" with mean 50, regardless of library sequence
 478 length, and variance 10. (Several other estimation methods are
 479 available with the \fC\-z\fP option.) These z-scores can then be used
 480 with the extreme value distribution and the Poisson distribution (to
 481 account for the fact that each library sequence comparison is an
 482 independent test) to calculate the number of library sequences expected to
 483 obtain a score greater than or equal to the score obtained in the
 484 search. The original idea and routines to do the linear regression on
 485 library sequence length were provided by Phil Green, U. Washington.  This
 486 version uses a slightly different strategy for fitting the data than
 487 those originally provided by Dr. Green.
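.pp
The two forms of the extreme value formula given above can be checked
numerically.  The Python sketch below is only an illustration: the
function names are hypothetical, and the values used for lambda and K
are invented for the example rather than taken from any FASTA search.

```python
import math


def evd_pvalue(x, lam, K, m, n):
    """P(s > x) = 1 - exp(-exp(-lam*(x - u))), with u = ln(K*m*n)/lam."""
    u = math.log(K * m * n) / lam
    return -math.expm1(-math.exp(-lam * (x - u)))


def evd_pvalue_rewritten(x, lam, K, m, n):
    """The same probability written as 1 - exp(-K*m*n*exp(-lam*x))."""
    return -math.expm1(-K * m * n * math.exp(-lam * x))


# Hypothetical parameter values, chosen only for illustration.
LAM, K = 0.27, 0.05
for n in (100, 1000, 10000):
    # A longer library sequence makes the same raw score less surprising.
    p1 = evd_pvalue(60, LAM, K, 200, n)
    p2 = evd_pvalue_rewritten(60, LAM, K, 200, n)
    print(n, p1, p2)
```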
 488 .pp
 489 The expected number of sequences is plotted in the histogram using an
 490 "*". Since the parameters for the extreme value distribution are not
 491 calculated directly from the distribution of similarity scores, the
 492 pattern of "*'s" in the histogram gives a qualitative view of how well
 493 the statistical theory fits the similarity scores calculated by the
 494 programs.  For \fCfasta3\fP, if optimized scores are calculated for
 495 each sequence in the database (the default), the agreement between the
 496 actual distribution of "z-scores" and the expected distribution based
 497 on the length dependence of the score and the extreme value
 498 distribution is usually very good.  Likewise, the distribution of
 499 \fCssearch3\fP Smith-Waterman scores typically agrees closely with the
 500 actual distribution of "z-scores."  The agreement with unoptimized
 501 scores, \fIktup=2\fP, is often not very good, with too many high
 502 scoring sequences and too few low scoring sequences compared with the
 503 predicted relationship between sequence length and similarity score.
 504 In those cases, the expectation values may be overestimates.
 505 .pp
 506 With version 33t01, all the FASTA programs also report a "bit" score,
 507 which is equivalent to the bit score reported by BLAST2.  The
 508 FASTA33/BLAST2 bit score is calculated as: (lambda*S - ln K)/ln 2,
 509 where S is the raw similarity score, lambda and K are statistical
 510 parameters estimated from the distribution of unrelated sequence
 511 similarity scores.  The statistical significance of a given bit score
 512 depends on the lengths of the query and library sequences and the size
 513 of the library, but a 1 bit increase in score corresponds to a 2-fold
 514 reduction in expectation; a 10-bit increase implies 1000-fold lower
 515 expectation, etc.
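.pp
The bit score formula and the fold reduction quoted above can be
worked through in a few lines of Python.  This is a sketch under
stated assumptions: the function names are hypothetical and the
lambda and K values are invented for the example.

```python
import math


def bit_score(raw_score, lam, K):
    """Bit score as given in the text: (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def fold_reduction(extra_bits):
    """Factor by which the expectation drops for a given bit increase."""
    return 2.0 ** extra_bits


# Hypothetical statistical parameters, for illustration only.
LAM, K = 0.27, 0.05
print(bit_score(100, LAM, K))
# One extra bit halves the expectation; ten bits give 1024, about 1000-fold.
print(fold_reduction(1), fold_reduction(10))
```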
 516 .pp
 517 The statistical routines assume that the library contains a large
 518 sample of unrelated sequences.  If this is not true, then statistical
 519 parameters can be estimated by using the \fC\-z 11\-15\fP options.
 520 \fC\-z\fP options greater than 10 calculate a shuffled similarity score
 521 for each library sequence, in addition to the unshuffled score, and
 522 estimate the statistical parameters from the scores of the shuffled
 523 sequences.  If there are fewer than 20 sequences in the library, the
 524 statistical calculations are not done.
 525 .pp
 526 For protein searches, library sequences with E() values < 0.01 for
 527 searches of a 10,000 entry protein database are almost always
 528 homologous. Frequently sequences with E()-values from 1 - 10 are
 529 related as well, but unrelated sequences (1 \- 10 per search) will
 530 have scores in this range as well. Remember, however, that these E()
 531 values also reflect differences between the amino acid composition of
 532 the query sequence and that of the "average" library sequence.  Thus,
 533 when searches are done with query sequences with "biased" amino-acid
 534 composition, unrelated sequences may have "significant" scores because
 535 of sequence bias.  \fCPRSS3\fP can address this problem by calculating
 536 similarity scores for random sequences with the same length and amino
 537 acid composition.
 538 .sh 1 "Options"
 539 .pp
 540 Command line options are available to change the scoring parameters
 541 and output display. \fBCommand line options must precede other program
 542 arguments, such as the query and library file names.\fP
 543 .sh 2 "Command line options"
 544 .ip "-a"
 545 (fasta3, ssearch3 only) show both sequences in their entirety.
 546 .ip "-A"
 547 force Smith-Waterman alignments for fasta3 DNA sequences.  By default,
 548 only fasta3 protein sequence comparisons use Smith-Waterman alignments.
 549 .ip "-B"
 550 Show normalized score as a z-score, rather than a bit-score in the list
 551 of best scores.
 552 .ip "-b #"
 553 Number of sequence scores to be shown on output.  In the absence of
 554 this option, fasta (and tfasta and ssearch) display all library
 555 sequences obtaining similarity scores with expectations less than
 556 10.0 if optimized scores are used, or 2.0 if they are not. The -b
 557 option can limit the display further, but it will not cause additional
 558 sequences to be displayed.
 559 .ip "-c #"
 560 Threshold score for optimization (OPTCUT).  Set "-c 1" to
 561 optimize every sequence in a database.
 562 .ip "-E #"
 563 Limit the number of scores and alignments shown based on the
 564 expected number of scores.  Used to override the expectation value of 10.0
 565 used by default.  When used with -Q, -E 2.0 will show all library sequences
 566 with scores with an expectation value <= 2.0.
 567 .ip "-d #"
 568 Maximum number of alignments to be displayed.  Ignored if "-Q" is not
 569 used.
 570 .ip "-f"
 571 Penalty for the first residue in a gap (-12 by default for proteins,
 572 -16 for DNA, -15 for FAST[XY]/TFAST[XY]).
 573 .ip "-F #"
 574 Limit the number of scores and alignments shown based on the expected
 575 number of scores. "-E #" sets the highest E()-value shown; "-F #" sets
 576 the lowest E()-value. Thus, "-F 0.0001" will not show any matches or
 577 alignments with E() < 0.0001.  This allows one to skip over close
 578 relationships in searches for more distant relationships.
 579 .ip "-g"
 580 Penalty for additional residues in a gap (-2 by default for proteins,
 581 -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
 582 .ip "-h"
 583 Penalty for frameshift (fastx3/y3, tfastx3/y3 only).
 584 .ip "-H"
 585 Omit histogram.
 586 .ip "-i"
 587 Invert (reverse complement) the query sequence if it is DNA.  For
 588 tfasta3/x3/y3, search the reverse complement of the library sequence
 589 only.
 590 .ip "-j #"
 591 Penalty for frameshift within a codon (fasty3/tfasty3 only).
 592 .ip "-l file"
 593 Location of library menu file (FASTLIBS).
 594 .ip "-L"
 595 Display more information about the library sequence in the alignment.
 596 .ip "-M low-high"
 597 Range of amino acid sequence lengths to be included in the search.
 598 .ip "-m #"
 599 Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10
 600 .(l I
 601 .ft C
 602     \-m 0        \-m 1          \-m 2          \-m 3        \-m 4
 603 .ft C
 604 MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
 605 ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
 606 MWKSCGYPYT   MWKSCGYPYT
 607 .ft P
 608 .)l
 609 .ip
 610 \fC\-m 5\fP provides a combination of \fC\-m 4\fP and
 611 \fC\-m 0\fP. \fC\-m 6\fP provides \fC\-m 5\fP plus HTML formatting.
 612 .ip "-m 9"
 613 provides coordinates and scores with the best score information.
 614 A simple "\fC -m 9\fP" extends the normal best score information:
 615 .(l
 616 .ft C
 617 The best scores are:                                      opt bits E(14548)
 618 XURTG4 glutathione transferase (EC 2.5.1.18) 4 -   ( 219) 1248 291.7 1.1e-79
 619 .ft P
 620 .)l
 621 to include the additional information (on the same line, separated by
 622 a <tab>):
 623 .(l
 624 .ft C
 625 %_id  %_gid   sw  alen  an0  ax0  pn0  px0  an1  ax1 pn1 px1 gapq gapl  fs
 626 0.771 0.771 1248  218    1  218    1  218    1  218    1  219   0   0   0
 627 .ft P
 628 .)l
 629 \fC -m 9c\fP provides additional information: an encoded alignment string.  Thus:
 630 .(l I
 631 .ft C
 632        10        20        30        40        50          60         70
 633 GT8.7  NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
 634        :.::  . :: ::  .   .:::         : .:    ::.:   .: : ..:.. :::  :..:
 635 XURTG  NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
 636                20        30                 40        50        60
 637 .ft P
 638 .)l
 639 would be encoded:
 640 .(l
 641 .ft C
 642 =23+9=13-2=10-1=3+1=5
 643 .ft P
 644 .)l
 645 The alignment encoding is with respect to the alignment, not the
 646 sequences.  The coordinate of the alignment is given earlier in the
 647 "\fC -m 9c\fP" line.
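.ip
A script that reads the encoded string could decode it along the
following lines.  This Python sketch is an interpretation inferred
from the example above, not a description of the program's own code:
it assumes that "=" counts aligned columns, "+" counts residues of
the top sequence (GT8.7) that face a gap, and "-" counts residues of
the bottom sequence (XURTG) that face a gap.  The function names are
hypothetical, and any other symbol is rejected.

```python
import re


def decode_alignment(encoded):
    """Split a string such as '=23+9=13' into (symbol, run length) pairs."""
    if not re.fullmatch("(?:[=+-][0-9]+)+", encoded):
        raise ValueError("unexpected alignment encoding: " + repr(encoded))
    return [(sym, int(run))
            for sym, run in re.findall("([=+-])([0-9]+)", encoded)]


def summarize(encoded):
    """Count alignment columns by kind (interpretation assumed, see text)."""
    runs = decode_alignment(encoded)

    def total(symbol):
        return sum(run for sym, run in runs if sym == symbol)

    return {
        "columns": sum(run for _, run in runs),
        "aligned": total("="),       # both sequences have a residue
        "top_only": total("+"),      # residue on top, gap below
        "bottom_only": total("-"),   # gap on top, residue below
    }


print(summarize("=23+9=13-2=10-1=3+1=5"))
```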
 648 .ip "-m 10"
 649 \fC\-m 10\fP is a new, parseable format for use
 650 with other programs.  See the file "readme.v20u4" for a more complete
 651 description.
 652 .ip
 653 As of version "fa34t23b2", it has become possible to combine independent
 654 "\fC\-m\fP" options.  Thus, one can use "\fC\-m 1 -m 6 -m 9\fP".
 655 .ip "-M low\-high"
 656 Include library sequences (proteins only) with lengths between low and
 657 high.
 658 .ip "-n"
 659 Force the query sequence to be treated as a DNA sequence.  This is
 660 particularly useful for query sequences that contain a large number of
 661 ambiguous residues, e.g. transcription factor binding sites.
 662 .ip "-O"
 663 Send a copy of results to "filename."  Helpful for environments without
 664 STDOUT (mostly for the Macintosh).
 665 .ip "-o"
 666 Turn off default optimization of all scores greater than OPTCUT. Sort
 667 results by "initn" scores (reduces the accuracy of statistical
 668 estimates).
 669 .ip "-p"
 670 Force query to be treated as protein sequence.
 671 .ip "-Q,-q"
 672 Quiet - does not prompt for any input.  Writes scores and alignments
 673 to the terminal or standard output file.
 674 .ip "-r"
 675 Specify match/mismatch scores for DNA comparisons.  The default is
 676 "+5/-4". "+3/-2" can perform better in some cases.
 677 .ip "-R file"
 678 Save a results summary line for every sequence in the sequence
 679 library.  The summary line includes the sequence identifier,
 680 superfamily number (if available), position
 681 in the library, and the similarity scores calculated.  This option can
 682 be used to evaluate the sensitivity and selectivity of different
 683 search strategies.[.wrp951,wrp981.]
 684 .ip "-s file"
 685 Specify the scoring matrix file.  \fCfasta3\fP uses the same scoring
 686 matrices as Blast1.4/2.0.  Several scoring matrix files are included
 687 in the standard distribution.  For protein sequences: \fCcodaa.mat\fP
 688 - based on minimum mutation matrix; \fCidnaa.mat\fP - identity matrix;
 689 \fCpam250.mat\fP - the PAM250 matrix developed by Dayhoff et
 690 al.;[.day787.]  \fCpam120.mat\fP - a PAM120 matrix.  The default
 691 scoring matrix is BLOSUM50 ("-s BL50"). Other matrices available from
 692 within the program are: PAM250/"-s P250", PAM120/"-s P120", PAM40/"-s
 693 P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 \- M40" (MDM are modern
 694 PAM matrices from Jones et al.,[.tay925.]), BLOSUM50, 62, and 80/"-s
 695 BL50", "-s BL62", "-s BL80".
 696 .ip "-S"
 697 Treat lower-case characters in the query or library sequences as
 698 "low-complexity" ("seg"-ed) residues.  Traditionally, the "seg"
 699 program [.woo935.] is used to remove low complexity regions in protein
 700 sequences by replacing the residues with an "X".  When the "-S" option
 701 is used, the FASTA33 (and later) programs provide a potentially more
 702 informative approach.  With "-S", lower case characters in the query
 703 or database sequences are treated as "X"'s during the initial scan,
 704 but are treated as normal residues during the final alignment display.
 705 Since statistical significance is calculated from the similarity score
 706 calculated during the library search, when the lower case residues are
 707 "X"'s, low complexity regions will not produce statistically
 708 significant matches.  However, if a significant alignment contains low
 709 complexity regions, their alignment is shown.  With "-S", lower case
 710 characters may be included in the alignment to indicate low complexity
 711 regions, and the final alignment score may be higher than the score
 712 obtained during the search.
 713 .ip
 714 The \fCpseg\fP program can be used to produce databases (or query
 715 sequences) with lower case residues indicating low complexity regions
 716 using the command:
 717 .(l I
 718 \fCpseg database.fasta -z 1 -q  > database.lc_seg\fP
 719 .)l
 720 (\fCseg\fP can also be used with some post processing, see readme.v33tx.)
 721 .ip
 722 The \fC-S\fP option should always be used with \fCFASTX/Y\fP and
 723 \fCTFASTX/Y\fP because out of frame translations often generate
 724 low-complexity protein sequences.  However, only lower case characters
 725 in the protein sequence (or protein database) are masked; lower case
 726 DNA sequences are translated into upper case protein sequences, and
 727 not treated as low complexity by the translated alignment programs.
 728 .ip "-t #"
 729 Translation table - tfasta3, fastx3, tfastx3, fasty3, and
 730 tfasty3 now support the BLAST translation tables.  See
 731 \fChttp://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi\fP.
 732 .ip
 733 In addition, "\-t t" or "\-t t#" turns on the addition of an implicit termination
 734 codon to a protein:translated DNA match.  That is, each protein
 735 sequence implicitly ends with "*", which matches the termination codons
 736 for the appropriate genetic code.  "\-t t#" sets implicit termination
 737 and a different genetic code.
 738 .ip "-U"
 739 Treat the query sequence as an RNA sequence.  In addition to selecting a
 740 DNA/RNA alphabet, this option causes changes to the scoring matrix so
 741 that 'G:A', 'T:C' or 'U:C' are scored as 'G:G'.
 742 .ip "-V str"
 743 It is now possible to specify some annotation characters that can be
 744 included (and will be ignored) in the query sequence file.  Thus, one
 745 might have a file with: \fC"ACVS*ITRLFT?"\fP, where "*" and "?" are
 746 used to indicate phosphorylation.  By giving the option \fC\-V '*?'\fP,
 747 those characters in the query will be moved to an "annotation string",
 748 and alignments that include the annotated residues will be highlighted
 749 with the appropriate character above the sequence (on the number line).
 750 .ip "-w #"
 751 Line length (width) = number (<200)
 752 .ip "-W #"
 753 Context length (default is 1/2 of line width -w) for alignment programs,
 754 like fasta and ssearch, that provide additional sequence context.
 755 .ip "-x #match,#mismatch"
 756 Specify the penalty for a match to an 'X', and mismatch to 'X',
 757 independently of the PAM matrix.  Particularly useful for
 758 \fCfastx3/fasty3\fP, where termination codons are encoded as 'X'.
 759 .ip "-X \"off1 off2\""
 760 Specifies offsets for the beginning of the query and library sequence.
 761 For example, if you are comparing upstream regions for two genes, and
 762 the first sequence contains 500 nt of upstream sequence while the
 763 second contains 300 nt of upstream sequence, you might try:
 764 .(l I
 765 \fCfasta -X "-500 -300" seq1.nt seq2.nt\fP
 766 .)l
 767 If the -X option is not used, FASTA assumes numbering starts with 1.
 768 (You should double check to be certain the negative numbering works
 769 properly.)
 770 .ip "-y"
 771 Set the width of the band used for calculating "optimized" scores.
 772 For proteins and ktup=2, the width is 16.  For proteins with ktup=1,
 773 the width is 32 by default.  For DNA the width is 16.
 774 .ip "-z -1,0,1,2,3,4,5"
 775 \fC\-z -1\fP turns off statistical calculations. \fC\-z 0\fP estimates
 776 the significance of the match from the mean and standard deviation of
 777 the library scores, without correcting for library sequence length.
 778 \fC\-z 1\fP (the default) uses a weighted regression of average score
 779 vs library sequence length; \fC\-z 2\fP uses maximum likelihood
 780 estimates of
 781 .if t \(*l
 782 .if n Lambda
 783 and \fIK\fP; \fC\-z 3\fP uses Altschul-Gish
 784 parameters;[.alt960.] \fC\-z 4\fP and \fC\-z 5\fP use two variations on the
 785 \fC\-z 1\fP strategy. \fC\-z 1\fP and \fC\-z 2\fP are the best methods,
 786 in general.
 787 .ip "-z 11,12,14,15"
 788 Estimate the statistical parameters from shuffled copies of each
 789 library sequence.  This doubles the time required for a search, but
 790 allows accurate statistics to be estimated for libraries comprised of
 791 a single protein family.
 792 .ip "-Z db_size"
 793 Set the apparent size of the database to be used when calculating
 794 expectation E() values.  If you searched a database with 1,000
 795 sequences, but would like to have the E()-values calculated in the
 796 context of a 100,000 sequence database, use '-Z 100000'.
 797 .ip "-1"
 798 Sort output by init1 score (for compatibility with FASTP - do not
 799 use).
 800 .ip "-3"
 801 Translate only the three forward frames.
 802 .sp
 803 .lp
 804 For example:
 805 .(l
 806 \fCfasta -w 80 -a seq1.aa seq2.aa\fP
 807 .)l
 808 would compare the sequence in seq1.aa to that in seq2.aa and display the
 809 results with 80 residues on an output line, showing all of the residues
 810 in both sequences.  Be sure to enter the options before entering the file
 811 names, or just enter the options on the command line, and the program will
 812 prompt for the file names.
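.lp
The two-stage treatment of lower-case residues described under the -S
option above can be illustrated with a minimal Python sketch.  This is
not the FASTA source code; the function names \fCmask_for_scan\fP and
\fCresidues_for_display\fP are hypothetical and exist only for this
illustration.  It assumes that masking means replacing each lower-case
character with "X" for the scan, while the original residues are kept
for the alignment display.

```python
def mask_for_scan(seq: str) -> str:
    """Return the sequence as it would be seen during the initial scan:
    lower-case (low-complexity) residues become 'X'.
    Hypothetical helper; an illustration of the -S description only."""
    return "".join("X" if ch.islower() else ch for ch in seq)


def residues_for_display(seq: str) -> str:
    """Return the sequence as shown in the final alignment display:
    residues are unchanged, and lower case still marks low complexity."""
    return seq


if __name__ == "__main__":
    query = "MKVLAAGIVaaaaaaaaaaGLLTKQ"
    print("scan:   ", mask_for_scan(query))
    print("display:", residues_for_display(query))
```

Because the scan only ever sees "X" in the masked region, that region
cannot contribute to a statistically significant score, yet the display
step still has the real residues available.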
 813 .sp
 814 .pp
 815 (November, 1997) In addition, it is now possible to provide the fasta
 816 programs with the query sequence (fasta, fasty, ssearch, tfastx), or
 817 two sequences (prss, lalign, plalign) from the unix "stdin" stream.  This
 818 makes it much easier to set up FASTA or PRSS WWW pages.  To specify
 819 that stdin be used, rather than a file, the file name should be
 820 specified as '-' or '@' (the latter file name makes it possible to
 821 specify a subset of the sequence).
 822 Thus:
 823 .(l
 824 \fCcat query.aa | fasta -q @:25-75 s\fP
 825 .)l
 826 would take residues 25-75 from query.aa and search the 's' library
 827 (see the discussion of FASTLIBS).
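.lp
As a rough illustration of the stdin file names described above, the
following Python sketch interprets '-', '@', and '@:start-end' and
selects the corresponding residues.  It is not the parser used by the
FASTA programs.  The function names are hypothetical, and the sketch
assumes that the range is 1-based and inclusive, as the "residues
25-75" example suggests.

```python
import re


def parse_stdin_spec(spec):
    """Return None for the whole sequence ('-' or '@'), or a
    (start, end) tuple for '@:start-end'.  Hypothetical helper."""
    if spec in ("-", "@"):
        return None
    m = re.fullmatch(r"@:(\d+)-(\d+)", spec)
    if m is None:
        raise ValueError("unrecognized stdin specification: %r" % spec)
    start, end = int(m.group(1)), int(m.group(2))
    if start < 1 or end < start:
        raise ValueError("invalid residue range: %r" % spec)
    return start, end


def select_residues(seq, spec):
    """Apply the specification to a sequence string.
    Assumes 1-based, inclusive coordinates."""
    rng = parse_stdin_spec(spec)
    if rng is None:
        return seq
    start, end = rng
    return seq[start - 1:end]


if __name__ == "__main__":
    print(select_residues("ABCDEFGHIJ", "@:3-5"))
```

A front end such as a WWW form handler could use a check of this kind
to validate a requested range before passing it to the program.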
 828 .sh 2 "Environment variables"
 829 .pp
 830 Because the current version of the program allows the user to set
 831 virtually every option on the command line (except the \fIktup\fP,
 832 which must be set as the third command line argument), only the
 833 \fCFASTLIBS\fP environment variable is routinely used.
 834 .ip "FASTLIBS"
 835 Specifies the location of the file which contains the list of library
 836 descriptions, locations, and library types (see section on finding
 837 library files).
 838 .sh 1 "Frequently Asked Questions (FAQs)"
 839 .np
 840 \fIWhich program should I use?\fP See Table I.
 841 .np
 842 \fIHow do I search with both DNA strands with\fP \fCfasta3\fP \fIand\fP
 843 \fCfastx3\fP? With version 32 of the FASTA program package, all
 844 searches that use DNA queries (e.g. \fCfasta3\fP, \fCfastx3/y3\fP)
 845 examine both strands. To revert to earlier FASTA behavior - only
 846 looking at the forward or reverse strand - use \fC\-3\fP to search only
 847 the forward strand and \fC\-i -3\fP to search only the reverse strand.
 848 .np
 849 \fIWhen I search Genbank - the program reports:\fP \fC0 residues in 0
 850 sequences\fP.  This typically happens because the program does not
 851 know that you are searching a Genbank flatfile database and is looking
 852 for a FASTA format database.  Be certain to specify the library type
 853 ("1" for Genbank flatfile) with the database name.
 854 .np
 855 \fIWhat is the difference between\fP \fCfastx3\fP \fIand\fP \fCfasty3\fP \fI(or\fP
 856 \fCtfastx3\fP \fIand\fP \fCtfasty3\fP\fI)?\fP  \fC[t]fastx3\fP uses a simpler
 857 codon based model for alignments that does not allow frameshifts in
 858 some codon positions (see ref. [.wrp971.]).  \fCtfastx3\fP is about
 859 30% faster, but \fCtfasty3\fP can produce higher quality alignments in
 860 some cases.
 861 .np
 862 \fIWhen I run\fP \fCfasta3 -q\fP\fI, I don't see any (or very little)
 863 output, but I get lots of scores when I run interactively.\fP With the
 864 \fC\-Q\fP option, the number of high scores displayed is limited by the
 865 \fC\-E #\fP cutoff, which is 10.0 for protein comparisons, 2.0 for DNA
 866 comparisons, and 5.0 for translated DNA:protein comparisons.  In
 867 interactive mode (without \fC\-Q\fP), by default you see 20 high
 868 scores, regardless of \fCE()\fP value.
 869 .np
 870 \fIWhat is ktup?\fP  All of the programs with \fCfast\fP in their
 871 name use a computer science method called a lookup table to speed the
 872 search.  For proteins with \fIktup\fP=2, this means that the program
 873 does not look at any sequence alignment that does not involve matching
 874 two adjacent identical residues in both sequences.  Likewise with DNA and
 875 \fIktup\fP=6, the initial alignment of the sequences looks for 6
 876 identical adjacent nucleotides in both sequences.  Because it is less
 877 likely that two adjacent identical amino acids will line up by chance in two
 878 unrelated proteins, this speeds up the comparison.  But very distantly
 879 related sequences may never have two identical residues in a row but
 880 will have single aligned identities.  In this case, \fIktup\fP=1 may
 881 find alignments that \fIktup\fP=2 misses.
 882 .np
 883 \fISometimes, in the list of best scores, the same sequence is shown
 884 twice with exactly the same score.  Sometimes, the sequence is there
 885 twice, but the scores are slightly different.\fP When any of the
 886 \fCfasta3\fP programs searches a long sequence, it breaks the sequence
 887 up into \fIoverlapping\fP pieces.  The length of the piece depends on
 888 the length of the query and the particular program being used (it can
 889 also be controlled with the -N #### option).  Since the pieces overlap
 890 by the length of the query sequence (or 3*query_length for fastx/y3
 891 and tfasta/x/y3), if the highest scoring alignment is at the end of
 892 one piece, it will be scored again at the beginning of the next piece.
 893 If the alignment is not completely included in the overlap region,
 894 one of the pieces will give a higher score than the other.  These
 895 duplications can be detected by looking at the coordinates of the
 896 alignment.  If either the beginning or end coordinate is identical in
 897 two alignments, the alignments are at least partially duplicates.
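.lp
The \fIktup\fP lookup table discussed in the FAQ above can be sketched
in a few lines of Python.  This is a simplified illustration of the
general lookup-table idea, not the implementation used in the FASTA
programs.  The function names are hypothetical, and the sketch only
reports positions of identical words; it does no scoring or joining of
matches.

```python
from collections import defaultdict


def build_lookup(query, ktup):
    """Map every word of length ktup in the query to the list of
    query positions (0-based) where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    return table


def find_ktup_matches(query, library_seq, ktup):
    """Return (query_pos, library_pos) pairs where ktup identical
    adjacent residues occur in both sequences."""
    table = build_lookup(query, ktup)
    hits = []
    for j in range(len(library_seq) - ktup + 1):
        # One dictionary lookup per library word replaces a comparison
        # against every query position.
        for i in table.get(library_seq[j:j + ktup], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    query, library_seq = "ACDEFG", "XXDEXGA"
    print("ktup=2:", find_ktup_matches(query, library_seq, 2))
    print("ktup=1:", find_ktup_matches(query, library_seq, 1))
```

In this toy example \fIktup\fP=2 finds only the shared "DE" word, while
\fIktup\fP=1 also reports the isolated single-residue identities, which
mirrors the sensitivity trade-off described in the FAQ.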
 898 .lp
 899 As always, please inform me of bugs as soon as possible.
 900 .sp
 901 .nf
 902 William R. Pearson
 903 Department of Biochemistry
 904 Jordan Hall Box 800733
 905 U. of Virginia
 906 Charlottesville, VA
 907
 908 wrp@virginia.EDU
 909
 910 .sh 1 "References"
 911 .[]