binaries/src/fasta34/fasta20.doc

   1
   2                         COPYRIGHT NOTICE
   3
   4 Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R.
   5 Pearson and the University of Virginia.  All rights reserved. The
   6 FASTA program and documentation may not be sold or incorporated
   7 into a commercial product, in whole or in part, without written
   8 consent of William R. Pearson and the University of Virginia.
   9 For further information regarding permission for use or
  10 reproduction, please contact: David Hudson, Assistant Provost for
  11 Research, University of Virginia, P.O. Box 9025, Charlottesville,
  12 VA 22906-9025, (434) 924-6853
  13
  14
  15 The FASTA program package
  16
  17 Introduction
  18
  19      This documentation describes the version 2.0x of the FASTA
  20 program package (see W. R. Pearson and D. J. Lipman (1988),
  21 "Improved Tools for Biological Sequence Analysis", PNAS 85:2444-
  22 2448, and W. R.  Pearson (1990) "Rapid and Sensitive Sequence
  23 Comparison with FASTP and FASTA" Methods in Enzymology 183:63-
  24 98). Version 2.0 modifies version 1.8 to include explicit
  25 statistical estimates for similarity scores based on the extreme
  26 value distribution.  In addition, FASTA protein alignments now
  27 use the Smith-Waterman algorithm with no limitation on gap size.
  28 FASTA and SSEARCH now use the BLOSUM50 matrix by default, with
  29 options to change gap penalties on the command line. Version 1.7
  30 replaces rdf2 and rss with prdf and prss, which use the extreme-
  31 value distribution to calculate accurate probability estimates.
  32
  33
  34 Although there are a large number of programs in this package,
  35 they belong to four groups:
  36
  37
  38     Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH
  39
  40     Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN
  41
  42     Statistical significance: PRDF, RELATE, PRSS, RANDSEQ
  43
  44     Global alignment: ALIGN
  45
  46
  47
  48 In addition, I have included several programs for protein
  49 sequence analysis, including a Kyte-Doolittle hydropathicity
  50 plotting program (GREASE, TGREASE), and a secondary structure
  51 prediction package (GARNIER).
  52
  53      The FASTA sequence comparison programs on this disk are
  54 improved versions of the FASTP program, originally described in
  55 Science (Lipman and Pearson, (1985) Science 227:1435-1441).  We
  56 have made several improvements.  First, the library search
  57 programs use a more sensitive method for the initial comparison
  58 of two sequences which allows the scores of several similar
  59 regions to be combined.  As a result, the results of a library
  60 search are now given with three scores, initn (the new initial
  61 score which may include several similar regions), init1 (the old
  62 fastp initial score from the best initial region), and opt (the
  63 old fastp optimized score allowing gaps in a 32 residue wide
  64 band).
  65
  66      These programs have also been modified to become "universal"
  67 (hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or
  68 FAST-N (nucleotides)); by changing the environment variable
  69 SMATRIX, the programs can be used to search protein sequences,
  70 DNA sequences, or whatever you like.  By default, FASTA, LFASTA,
  71 and the PRDF programs automatically recognize protein and DNA
  72 sequences.  Sequences are first read as amino acids, and then
  73 converted to nucleotides if the sequence is greater than 85%
  74 A,C,G,T (the '-n' option can be used to indicate DNA sequences).
  75 TFASTA compares protein sequences to a translated DNA sequence.
  76 Alternative scoring matrices can also be used.  In addition to
  77 the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices
  78 based on simple identities or the genetic code can also be used
  79 for sequence comparisons or evaluation of significance.  Several
  80 different protein sequence matrices have been included;
  81 instructions for constructing your own scoring matrix are
  82 included in the file FORMAT.DOC.
  83
  84
  85 The remainder of this document is divided into three sections:
  86 (1) a brief history of the changes to the FASTA package; (2) A
  87 guide to installing the programs and databases; (3) A guide to
  88 using the FASTA programs. The programs are very easy to use, so
  89 if you are using them on a machine that is administered by
  90 someone else, you may want to skip to section (3) to learn how to
  91 use the programs, and then read section (1) to look at some of
  92 the more recent changes.  If you are installing the programs on
  93 your own machine, you will need to read section (2) carefully.
  94
  95
  96 1.  Revision History
  97
  98 1.1.  Changes with version 2.0u
  99
 100      Version 2.0u provides several major improvements over
 101 previous versions of FASTA (and SSEARCH).  The most important is
 102 the incorporation of explicit statistical estimates and
 103 appropriate normalization of similarity scores. This improvement
 104 is discussed in more detail below in the section entitled
 105 Statistical Significance.  In addition, all of the protein
 106 comparison programs now use the BLOSUM50 matrix, with gap
 107 penalties of -12, -2, by default.  BLOSUM50 performs
 108 significantly better than the older PAM250 matrix.  PAM250 can
 109 still be used with the command line option: -s 250.  (DNA
 110 sequence comparisons use a more stringent gap penalty of -16, -4,
 111 which produces excellent statistical estimates when optimized
 112 scores are used. TFASTA uses -16, -4 as well.)
 113
 114      The quality of the fit of the extreme value distribution to
 115 the actual distribution of similarity scores is summarized with
 116 the Kolmogorov-Smirnov statistic.  The acceptance limits for this
 117 statistic can be found in many statistics books.  In general,
 118 values <0.10 (N=30) indicate excellent agreement between the
 119 actual and theoretical distributions.  If this statistic is >
 120 0.2, consider using a higher (more stringent) gap penalty, e.g.
 121 -16, -4 rather than -12, -2.  The default scoring matrix for DNA
 122 has been changed to score +5 for an identity and -4 for a
 123 mismatch.  These are the same scores used by BLASTN.
 124
 125      With explicit expectation calculations, the program now
 126 shows all scores and alignments with expectations less than 10.0
 127 (with optimized scores, 2.0 without optimization) when the "-Q"
 128 (quiet) mode is used.  The expectation threshold can be changed
 129 with the "-E" option.
 130
 131      Finally, the algorithm used to produce the final alignments
 132 of protein sequences is now a full Smith-Waterman, with unlimited
 133 gaps.  (The older band-limited alignments are used for DNA
 134 sequences and TFASTA by default, because Smith-Waterman
 135 alignments are very slow for long sequences.)  Both the optimized
 136 and Smith-Waterman scores are reported; if the Smith-Waterman
 137 score is higher, then additional gaps allowed a better alignment
 138 and similarity score to be calculated.
 139
 140      FASTA searches now optimize similarity scores by default
 141 (this slows searches about 2-fold (worst case) for ktup=2). Thus,
 142 the meaning of the "-o" option has been reversed; "-o" now turns
 143 off optimization and reports results sorted by "initn" scores.
 144 Optimization significantly improves the sensitivity of FASTA, so
 145 that it almost matches Smith-Waterman.  With version 2.0, the
 146 default band width used for optimized calculations can be varied
 147 with the "-y" option.  For proteins with ktup=2, a width of 16
 148 (-y 16) is used; 16 is also used for DNA sequences.  For proteins
 149 and ktup=1, a width of 32 is used. Searches that disable
 150 optimization with the "-o" option will work fine for sequences
 151 that share 25% or more identity in general, but to detect
 152 evolutionary relationships with 20% - 25% identity, the more
 153 sensitive default optimization is often required.  Optimization
 154 is required for accurate statistical estimates with either
 155 protein or DNA sequences.
 156
 157      The FASTA package now includes FASTX, a program that
 158 compares a DNA sequence to a protein sequence database by
 159 translating the DNA sequence in three frames (the reverse frames
 160 are selected with the -i option) and aligning the three-frame
 161 translation with the sequences in the protein database.
 162 Alignment scores allow frameshifts so that a cDNA or EST sequence
 163 with insertion/deletion errors can be aligned with its homologues
 164 from beginning to end.
 165
 166      With release 20u6, there is also a TFASTX program, which is
 167 a replacement for TFASTA.  TFASTA treats each of the six reading
 168 frames of a DNA library sequence as a different sequence; TFASTX
 169 compares a protein sequence against only two sequences from each
 170 DNA sequence - the forward and reverse orientation.  For a given
 171 orientation, TFASTX calculates a similarity score for alignments
 172 that allow frameshifts, thus considering all possible reading
 173 frames.
 174
 175      Another new program is included - randseq - which will
 176 produce a randomly shuffled (uniform or local shuffle) from an
 177 input sequence.  This randomly shuffled sequence can be used to
 178 evaluate the statistical estimates produced by FASTA, SSEARCH, or
 179 BLAST.
 180
 181 1.2.  Changes with version 1.7
 182 Version 1.7 has been released to provide the PRDF and PRSS
 183 programs for shuffling sequences and estimating accurately the
 184 probabilities of the unshuffled-sequence scores.
 185
 186 PRDF      a version of RDF2 that uses calculates the probability
 187           of a similarity score more accurately by using a fit to
 188           an extreme value distribution.  Code to fit the extreme
 189           value distribution parameters and the impetus to update
 190           RDF2 was provided by Phil Green, U. of Washington.
 191
 192 PRSS      a version of PRDF that uses a rigorous Smith-Waterman
 193           calculation to score similarities
 194
 195 1.3.  Changes with version 1.6
 196
 197      FASTA version 1.6 uses a new method for calculating optimal
 198 scores in a band (the optimization or last step in the FASTA
 199 algorithm). In addition, it uses a linear-space method for
 200 calculating the actual alignments.  FASTA v1.6 package includes
 201 several new programs:
 202
 203 SSEARCH   a program to search a sequence database using the
 204           rigorous Smith-Waterman algorithm (this program is
 205           about 100-fold slower than FASTA with ktup=2 (for
 206           proteins).
 207
 208 LALIGN    A rigorous local sequence alignment program that will
 209           display the N-best local alignments (N=10 by default).
 210
 211 PLALIGN   a version of lalign that plots the local alignments to
 212           a tektronix display.
 213
 214 FLALIGN   a version of lalign that plots the local alignments to
 215           a GCG Figure file.
 216
 217      The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim"
 218 algorithm described by Huang and Miller (1991) Adv. Appl. Math.
 219 12:337-357.  The SSEARCH and PRSS programs incorporate algorithms
 220 described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
 221
 222      LFASTA and PLFASTA now calculate a different number of local
 223 similarities; they now behave more like LALIGN/PLALIGN.  Since
 224 local alignments of identical sequences produce "mirror-image"
 225 alignments, lalign and lfasta consider only one-half of the
 226 potential alignments between sequences from identical file names.
 227 Thus
 228
 229     lfasta mchu.aa mchu.aa
 230
 231 Displays only two alignments, with earlier versions of the
 232 program, it would have displayed five, including the identity
 233 alignment.  PLFASTA does display five alignments; when two
 234 identical filenames are given, it draws the identity alignment,
 235 calculates the two unique local alignments, draws them, and draws
 236 their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
 237 filenames, rather than the actual sequences, to determine whether
 238 sequences are identical; you can "trick" the programs into
 239 behaving the old way by putting the same sequence in two
 240 different files.
 241
 242 1.4.  Changes with version 1.5
 243
 244      FASTA version 1.5 includes a number of substantial revisions
 245 to improve the performance and sensitivity of the program.  It is
 246 now possible to tell the program to optimize all of the initn
 247 scores greater than a threshold.  The threshold is set at the
 248 same value as the old FASTA cutoff score.  Alternatively, you can
 249 tell FASTA to sort the results by the init1, rather than the
 250 initn, score by using the -1 option.  FASTA -1 ... will report
 251 the results the way the older FASTP program did.
 252
 253      A new method has been provided for selecting libraries. In
 254 the past, one could enter the name of a sequence file to be
 255 searched or a single letter that would specify a library from the
 256 list included in the $FASTLIBS file. Now, you can specify a set
 257 of library files with a string of letters preceded by a '%'.
 258 Thus, if the FASTLIBS file has the lines:
 259
 260     Genbank 70 primates$1P/seqlib/gbpri.seq 1
 261     Genbank 70 rodents$1R/seqlib/gbrod.seq 1
 262     Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
 263     Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1
 264
 265 Then the string: "%PRMB" would tell FASTA to search the four
 266 libraries listed above.  The %PRMB string can be entered either
 267 on the command line or when the program asks for a filename or
 268 library letter.
 269
 270      FASTA1.5 also provides additional flexibility for specifying
 271 the number of results and alignments to be displayed with the -Q
 272 (quiet) option.  The -b number option allows you to specify the
 273 number of sequence scores to show when the search is finished.
 274 Thus
 275
 276
 277     FASTA -b 100 ...
 278
 279
 280 tells the program to display the top 100 sequence scores. In the
 281 past, if you displayed 100 scores (in -Q mode), you would also
 282 have store 100 alignments. The -d option allows you to limit the
 283 number of alignments shown.  FASTA -b 100 -d 20 would show 100
 284 scores and 20 alignments.
 285
 286      Finally, FASTA can provide a complete list of all of the
 287 sequences and scores calculated to a file with the -r (results)
 288 option.  FASTA -r results.out ... creates a file with a list of
 289 scores for every sequence in the library.  The list is not
 290 sorted, and only includes those scores calculated during the
 291 initial scan of the library.
 292
 293 2.  Installing the FASTA package
 294
 295 2.1.  Installing the programs
 296
 297 2.1.1.  Unix version
 298
 299      The FASTA distribution comes with several makefile's that
 300 can be used to compile the FASTA programs.  Over the years, as
 301 ATT Unix System 5 and BSD unix have converged, these files have
 302 become very similar. To begin with, I recommend using the
 303 standard Makefile.  There are two values in the makefile that
 304 should be checked against the values used on your system: the HZ
 305 value, which is the frequency in ticks per second used by the
 306 times() system call, this value can usually be found by running:
 307
 308     grep HZ /usr/include/sys/*
 309
 310 and the functions available to return random numbers.  If you
 311 have a rand48() function that returns a 32-bit random number, use
 312 it and use the lines:
 313
 314     NRAND=nrand48
 315     RANFLG= -DRAND32
 316
 317 If not, you will need to use the rand() function call and
 318 determine whether it returns a 16-bit or a 32-bit value.  These
 319 functions are used by PRDF and PRSS.  If you have problems
 320 compiling the programs, you may want to examine the makefile.unx
 321 and makefile.sun files, to look for differences.  I have tried to
 322 use very standard unix functions in these programs, and they have
 323 been successfully compiled, with very small changes to the
 324 Makefile, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS
 325 machines (under the BSD environment).
 326
 327 2.1.2.  IBM-PC/DOS version
 328
 329      For the IBM-PC/DOS version, the FASTA source code disk
 330 contains the complete source code to all of the programs on the
 331 other disks.  The programs were compiled with Borland's Turbo
 332 'C++', using Borland's MAKE utility.  The graphics programs
 333 (PLFASTA, TGREASE) use the graphics device drivers supplied with
 334 the Turbo 'C' V2.0 package.  Also included are the documentation
 335 files PROGRAMS.DOC and FORMAT.DOC.  You do not need any of the
 336 files the source code disk to run the programs.  The files on
 337 this disk are identical to the UNIX and VMS versions that run on
 338 larger machines.  Also included is the code to compile
 339 ALIGN0.EXE.  ALIGN0 is the same as ALIGN, but does not penalize
 340 for end-gaps.
 341
 342      If you have the DOS or Macintosh version of the FASTA
 343 package, to install the programs you should:
 344
 345  (1)   Make a new directory (folder) for the FASTA programs.
 346        This need not be the same as the directory for your
 347        sequence databases.
 348
 349  (2)   Copy the files from the FASTA source disk to the new
 350        directory.
 351
 352  (3)   (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your
 353        PATH command to include the FASTA directory and (b) add
 354        the line:
 355
 356            set FASTLIBS=c:\yourfastadirectory\fastgbs
 357
 358        On the Macintosh, you may need to edit the "environment"
 359        file and change the line that reads:
 360
 361            FASTLIBS=fastgbs
 362
 363        to indicate the full directory path for the fastgbs file,
 364        for example:
 365
 366            FASTLIBS=Q105:FASTA:fastgbs
 367
 368
 369  (4)   Finally, you will need to edit the fastgbs file.  This is
 370        usually the most confusing part of the installation.  An
 371        example of this file is shown below; to customize this
 372        file for your machine, you will need to change the file
 373        names from those provided in the fastgbs file to ones that
 374        reflect the directory names and file names you use on your
 375        machine. This is explained in more detail below.  In
 376        addition, some entries in the fastgbs file refer to other
 377        files of file names.  These files of file names (as
 378        opposed to actual database files) may also need to be
 379        edited.
 380
 381 2.2.  Installing the libraries
 382
 383 2.2.1.  The NBRF protein sequence library
 384
 385      The FASTA program package does not include any protein or
 386 DNA sequence libraries.  You can obtain the PIR protein sequence
 387 database from:
 388
 389     National  Biomedical Research Foundation
 390     Georgetown  University  Medical  Center
 391     3900 Reservoir Rd, N.W.
 392     Washington, D.C. 20007
 393
 394 In addition, this database is available via anonymous ftp from
 395 the host "ftp.bchs.uh.edu". It is available in two formats, VMS
 396 and CODATA format.  The "VMS" format (library type 5 below) can
 397 be searched much faster, can be easily reformatted for use by the
 398 "BLAST" rapid searching program, and is compatible with the
 399 Genetics Computer Group package of programs.  The CODATA format
 400 is used by the EUGENE/MBIR computing package from Baylor (library
 401 type 2).
 402
 403 2.2.2.  The GENBANK DNA sequence library
 404
 405      FASTA, and TFASTA search sequences from the GENBANK
 406 "flatfile" (not ASN.1) DNA sequence library in the flat-file
 407 format distributed by the National Center for Biotechnology
 408 Information and the PIR format used by EBI/EMBL.  CD-ROMs can be
 409 obtained from:
 410
 411     Genbank
 412     National Center for Biotechnology Information
 413     National Library of Medicine
 414     National Institutes of Health
 415     8600 Rockville Pike
 416     Bethesda, MD  20894
 417
 418
 419      The GenBank DNA sequence library is also available via
 420 anonymous FTP from ncbi.nlm.nih.gov.
 421
 422 2.2.3.  The EBI/EMBL CD-ROM libraries
 423
 424      The European Bioinformatics Institute (EBI) is now
 425 distributing the EMBL CD-ROM that contains both the complete EMBL
 426 DNA sequence database (which should be essentially identical to
 427 the GenBank DNA sequence database) and the SWISS-PROT protein
 428 sequence database. SWISS-PROT is derived from the NBRF Protein
 429 sequence database with additions from the EBI/EMBL DNA sequence
 430 database.  This CD-ROM is a "best-buy," since it provides both
 431 DNA and protein sequence libraries.  It is available from:
 432
 433
 434     European Bioinformatics Institute
 435     Hinxton Genome Campus, Hinxton Hall
 436     Hinxton, Cambridge CB10 1RQ,
 437     United Kingdom
 438     Tel: +44 1223 4944
 439     Fax: +44 1223 494468
 440     Email: DATALIB@ebi.ac.uk
 441
 442
 443
 444      In addition, the SWISS-PROT protein sequence database is
 445 available via anonymous FTP from ncbi.nlm.nih.gov.
 446
 447 2.3.  Finding the libraries: FASTLIBS
 448
 449      FASTA and TFASTA use the environment variable FASTLIBS to
 450 find the protein and DNA sequence libraries.  The FASTLIBS
 451 variable contains the name of a file that has the actual
 452 filenames of the libraries.  The FASTGBS file on is an example of
 453 a file that can be referred to by FASTLIBS. To use the FASTGBS
 454 file, type:
 455
 456     setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
 457     or
 458     export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)
 459
 460 Then edit the FASTGBS file to indicate where the protein and DNA
 461 sequence libraries can be found.  If you have a hard disk and
 462 your protein sequence library is kept in the file
 463 /usr/lib/aabank.lib and your Genbank DNA sequence library is kept
 464 in the directory: /usr/lib/genbank, then fastgbs might contain:
 465
 466     NBRF Protein$0P/usr/lib/seq/aabank.lib 0
 467     SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
 468     GB Primate$1P@/usr/lib/genbank/gpri.nam
 469     GB Rodent$1R@/usr/lib/genbank/grod.nam
 470     GB Mammal$1M@/usr/lib/genbank/gmammal.nam
 471     ^   1    ^^^^       4                   ^     ^
 472               23                             (5)
 473
 474 The first line of this file says that there is a copy of the NBRF
 475 protein sequence database (which is a protein database) that can
 476 be selected by typing "P" on the command line or when the
 477 database menu is presented in the file /usr/lib/seq/aabank.lib.
 478
 479      Note that there are 4 or 5 fields in the lines in fastgbs.
 480 The first field is the description of the library which will be
 481 displayed by FASTA; it ends with a '$'.  The second field (1
 482 character), is a 0 if the library is a protein library and 1 if
 483 it is a DNA library.  The third field (1 character) is the
 484 character to be typed to select the library.
 485
 486      The fourth field is the name of the library file.  In the
 487 example above, the /usr/lib/seq/aabank.lib file contains the
 488 entire protein sequence library.  However the DNA library file
 489 names are preceded by a '@', because these files (gpri.nam,
 490 grod.nam, gmammal.nam) do not contain the sequences; instead they
 491 contain the names of the files which contain the sequences.  This
 492 is done because the GENBANK DNA database is broken down in to a
 493 large number of smaller files.  In order to search the entire
 494 primate database, you must search more than a dozen files.
 495
 496      In addition, an optional fifth field can be used to specify
 497 the format of the library file.  Alternatively, you can specify
 498 the library format in a file of file names (a file preceded by an
 499 '@').  This field must be separated from the file name by a space
 500 character (' ') from the filename.  In the example above, the
 501 aabank.lib file is in Pearson/FASTA format, while the swiss.seq
 502 file is in PIR/VMS format (from the EMBL CD-ROM). Currently,
 503 FASTA can read the following formats:
 504
 505     0 Pearson/FASTA (>SEQID - comment/sequence)
 506     1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
 507     2 NBRF CODATA (ENTRY/SEQUENCE)
 508     3 EMBL/SWISS-PROT (ID/DE/SQ)
 509     4 Intelligenetics (;comment/SEQID/sequence)
 510     5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
 511     6 GCG (version 8.0) Unix Protein and DNA (compressed)
 512     11 NCBI Blast1.3.2 format  (unix only)
 513
 514 In particular, this version will work with the EMBL and PIR VMS
 515 formats that are distributed on the EMBL CD-ROM. The latter
 516 format (PIR VMS) is much faster to search than EMBL format.  This
 517 release also works with the protein and DNA database formats
 518 created for the BLASTP and BLASTN programs by SETDB and PRESSDB
 519 and with the new NCBI search format.  If a library format is not
 520 specified, for example, because you are just comparing two
 521 sequences, Pearson/FASTA (format 0) is used by default.  To
 522 change this default, you may set the LIBTYPE environment variable
 523 to a number.  For example,
 524
 525     setenv LIBTYPE 1
 526
 527 would cause the program to use the GenBank LOCUS format by
 528 default for libraries (or the second sequence file), but the
 529 Pearson/FASTA format would still be used for the query sequence.
 530
 531      You can specify a group of library files by putting a '@'
 532 symbol before a file that contains a list of file names to be
 533 searched.  For example, if @gmam.nam is in the fastgbs file, the
 534 file "gmam.nam" might contain the lines:
 535
 536     </usr/lib/genbank
 537     gbpri.seq 1
 538     gbrod.seq 1
 539     gbmam.seq 1
 540
 541 In this case, the line beginning with a '<' indicates the
 542 directory the files will be found in.  The remaining lines name
 543 the actual sequence files.  So the first sequence file to be
 544 searched would be:
 545
 546     /usr/lib/genbank/gbpri.seq
 547
 548 The notation "<PIRNAQ:" might be used under the VAX/VMS operating
 549 system. Under UNIX, the trailing '/' is left off, so the library
 550 directory might be written as "</usr/seqlib".
 551
 552      With version 1.4 of the FASTA package, the FASTA and TFASTA
 553 programs can search a library composed of different files in
 554 different sequence formats.  For example, you may wish to search
 555 the Genbank files (in GenBank flat file format) and the EMBL DNA
 556 sequence database on CD-ROM.  To do this, you simply list the
 557 names and filetypes of the files to be searched in a file of
 558 filenames.  For example, to search the mammalian portion of
 559 Genbank, the unannotated portion of Genbank, and the unannotated
 560 portion of the EMBL library, you could use the file:
 561
 562     </usr/lib/DNA
 563     gbpri.seq 1
 564     #  (this '#' causes the program to display the size of the library)
 565     gbrod.seq 1
 566     gbmam.seq 1
 567     gbuna.seq 1
 568     unanno.seq 5
 569     #
 570
 571     You do not need to include library format numbers if  you
 572     only use the Pearson/FASTA version of the PIR protein se-
 573     quence library.  If no library  type  is  specified,  the
 574     program  assumes  that  type  0 is being used (unless you
 575     have set LIBTYPE).
 576
 577 Support for the old compressed GenBank files, which have not been
 578 distributed for more than four years, has been removed from
 579 programs in the FASTA package.
 580
 581
 582      Test the setup by running FASTA.  Enter the sequence file
 583 'MUSPLFM.AA' when the program requests it (this file is included
 584 with the programs).  The program should then ask you to select a
 585 protein sequence library.  Alternatively, if you run the TFASTA
 586 program and use the MUSPLFM.AA query sequence, the program should
 587 show you a selection of DNA sequence libraries.  Once the fastgbs
 588 file has been set up correctly, you can set FASTLIBS=fastgbs in
 589 your AUTOEXEC.BAT file, and you will not need to remember where
 590 the libraries are kept or how they are named.
 591
 592      FASTA and TFASTA must open a large number of files when
 593 searching and reporting the results of a GENBANK floppy disk
 594 format library search.  You may have problems with the large
 595 number of files under DOS on IBM-PC's (Unix and VMS users will
 596 not have these problems).  If you are going to search the GENBANK
 597 floppy disk format DNA sequence library under DOS, you should add
 598 the line:
 599
 600     FILES=16
 601
 602 to your CONFIG.SYS file.  (Typically this is already done for
 603 programs like Windows or WordPerfect.)
 604
 605 3.  Using the FASTA Package
 606
 607 3.1.  Overview
 608
 609      The FASTA sequence comparison programs all require similar
 610 information, the name of a query sequence file, a library file,
 611 and the ktup parameter.  All of the programs can accept arguments
 612 on the command line, or they will prompt for the file names and
 613 ktup value.
 614
 615 To use FASTA, simply type:
 616
 617     FASTA
 618     and you will be prompted for :
 619          the name of the test sequence file
 620          the name of the library file
 621          and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
 622
 623              ktup of 2 is about 5 times faster than ktup = 1.
 624              For  a  200  aa sequence against a 10,000,000 aa
 625              library, the program takes  about  30  min  with
 626              ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286
 627              IBM-PC.
 628
 629
 630 The program can also be run by typing
 631
 632     FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
 633
 634
 635 Included with the package are the test files, MUSPLFM.AA,
 636 LCBO.AA, MCHU.AA and BOVPRL.SEQ.  To check to make certain that
 637 everything is working, you can try:
 638
 639     fasta musplfm.aa lcbo.aa
 640     and
 641     tfasta musplfm.aa bovprl.seq
 642
 643 To test the local similarity programs LFASTA and PLFASTA, try:
 644
 645     lfasta mchu.aa mchu.aa
 646     and
 647     plfasta mchu.aa mchu.aa (use this only on an IBM-PC with graphics
 648     or on a Tektronix terminal under UNIX or VMS)
 649
 650 MCHU (calmodulin) has four duplicated calcium binding sites that
 651 are clearly detected by LFASTA.  For a more complicated example,
 652 try MWRTC1.aa, myosin heavy chain.
 653
 654 3.2.  Sequence files
 655
 656      The FASTA programs know about three kinds of sequence files
 657 (four under VMS): (1) plain sequence files that can only be used
 658 as query sequences or for LFASTA, PRDF, and ALIGN. (2) Standard
 659 library files.  These are the same as plain sequence files, each
 660 sequence is preceded by a comment line with a '>' in the first
 661 column. (3) distributed sequence libraries (this is a broad class
 662 that includes the NBRF/PIR VMS and blocked ascii formats, Genbank
 663 flat-file format, EMBL flat-file format, and Intelligenetics
 664 format.  All of the files that you create should be of type (1)
 665 or (2).  Type (2) files (ones with a be used as query or library
 666 sequence files by all of the programs.
 667
 668      I have included several sample test files, *.AA.  The first
 669 line may begin with a '>'  or ';' followed by a comment.  The
 670 text after ';' in other lines will  be  ignored.   Spaces  and
 671 tabs  (and anything else that  is  not  an amino-acid code) are
 672 ignored.
 673
 674      Library files should have the form:
 675
 676     >Sequence name and identifier
 677     A F A S Y T .... actual sequence.
 678     F S S       .... second line of sequence.
 679     >Next sequence name and identifier
 680
 681 This is often referred to as "FASTA" or "Pearson" format.  You
 682 can build your own library by concatenating several sequence
 683 files.  Just be sure that each sequence is preceded by a line
 684 beginning with a '>' with a sequence name.
 685
 686      The test file should not have lines longer than 120
 687 characters, and sequences entered with word processors should use
 688 a document mode, with normal carriage returns at the end of
 689 lines.
 690
 691 Program Summary
 692
 693 3.3.  Sequence search programs
 694
 695 FASTA     universal sequence comparison. Defaults to comparing
 696           protein sequences; if the sequences are > 85% A+C+G+T
 697           or the -n option is used, a DNA sequence is assumed.
 698
 699 FASTX     Search a protein sequence library using amino acid
 700           sequence comparison to the forward three frames of a
 701           translated DNA query sequence. (The reverse frames are
 702           specified with the -i option.) Alignment scores allow
 703           frameshifts; the final alignment uses a Smith-Waterman
 704           type alignment routine (no limit on gaps) that allows
 705           frameshifts.
 706
 707 TFASTA    Search DNA library for a protein sequence by
 708           translating the DNA sequence to protein in all six
 709           frames (three forward frames with the -3 command line
 710           option). TFASTA with ktup=2 is about as fast as a DNA
 711           FASTA with ktup=4, and is substantially more sensitive.
 712           (also reads the GENBANK library)
 713
 714 TFASTX    Search DNA library for a protein sequence by
 715           translating the DNA sequence to protein in all six
 716           frames (three forward frames with the -3 command line
 717           option) calculating similarity scores that allow
 718           frameshifts. TFASTX produces an optimal Smith-Waterman
 719           alignment of the query and translated-library sequence.
 720
 721 SSEARCH   Universal sequence comparison using the Smith-Waterman
 722           algorithm ( T. F. Smith and M. S. Waterman (1981) J.
 723           Mol. Biol. 147:195-197).  This program uses code
 724           developed by Huang and Miller (X. Huang, R. C.
 725           Hardison, W. Miller (1990) CABIOS 6:373-381) for
 726           calculating the local similarity score and code from
 727           the ALIGN program (see below) for calculating the local
 728           alignment.  SSEARCH is about 50-times slower than FASTA
 729           with ktup=2 (for proteins).
 730
 731 ALIGN     optimal global alignment of two sequences with no
 732           short-cuts.  This program is a slightly modified
 733           version of one taken from E.  Myers and W. Miller. The
 734           algorithm is described in E. Myers and W.  Miller,
 735           "Optimal Alignments in Linear Space" (CABIOS (1988)
 736           4:11-17).
 737
 738 3.4.  Local similarity programs
 739
 740 LFASTA    local similarity searches showing local alignments.
 741           The algorithm used to calculate the local alignment in
 742           a band has been improved (Chao, Pearson, and Miller,
 743           submitted).
 744
 745 PLFASTA   local similarity searches with plot output (on the IBM,
 746           this program requires that the environment variable
 747           BGIDIR be set).
 748
 749 PCLFASTA  (unix only) local similarity searches with plot output
 750           using pic commands.
 751
 752 LALIGN    Calculates the N-best local alignments using a rigorous
 753           algorithm.  (N=10 by default.) The algorithm was
 754           developed by Huang and Miller (X.  Huang and W.  Miller
 755           (1991) Adv. Appl. Math. 12:337-357), which is a
 756           linear-space version of an algorithm described by M. S.
 757           Waterman and M. Eggert (J.  Mol. Biol. 197:723-728).
 758           Like SSEARCH, LALIGN is rigorous, but also very slow.
 759
 760 PLALIGN   A version of LALIGN that plots its output to a screen
 761           or to a Tektronix terminal emulator.
 762
 763 3.5.  Statistical Significance
 764
 765      With version 2.0 of the FASTA program distribution, FASTA,
 766 TFASTA, and SSEARCH now provide estimates of statistical
 767 significance for library searches.  Work by Altschul, Arratia,
 768 Karlin, Mott, Waterman, and others (see Altschul et al. (1994)
 769 Nature Genetics 6:119 for an excellent review) suggests that
 770 local sequence similarity scores follow the extreme value
 771 distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u)) where
 772 u = ln(Kmn)/lambda and m,m are the lengths of the query and
 773 library sequence. This formula can be rewritten as: 1 - exp(-Kmn
 774 exp(-lambda x), which shows that the average score for an
 775 unrelated library sequence increases with the logarithm of the
 776 length of the library sequence.  FASTA and SSEARCH use simple
 777 linear regression against the the log of the library sequence
 778 length to calculate a normalized "z-score" with mean 50,
 779 regardless of library sequence length, and variance 10.  These
 780 z-scores can then be used with the extreme value distribution and
 781 the poisson distribution (to account for the fact that each
 782 library sequence comparison is an independent test) to calculate
 783 the number of library sequences to obtain a score greater than or
 784 equal to the score obtained in the search. The original idea and
 785 routines to do the linear regression on library sequence length
 786 were provided Phil Green, U. Washington.  This version of FASTA
 787 and SSEARCH uses a slightly different strategy for fitting the
 788 data than those originally provided by Dr. Green.
 789
 790      The expected number of sequences is plotted in the histogram
 791 using an "*". Since the parameters for the extreme value
 792 distribution are not calculated directly from the distribution of
 793 similarity scores, the pattern of "*'s" in the histogram gives a
 794 qualitative view of how well the statistical theory fits the
 795 similarity scores calculated by FASTA and SSEARCH.  For FASTA, if
 796 optimized scores are calculated for each sequence in the database
 797 (the default), the agreement between the actual distribution of
 798 "z-scores" and the expected distribution based on the length
 799 dependence of the score and the extreme value distribution is
 800 usually very good.  Likewise, the distribution of SSEARCH Smith-
 801 Waterman scores typically agrees closely with the actual
 802 distribution of "z-scores."  The agreement with unoptimized
 803 scores, ktup=2, is often not very good, with too many high
 804 scoring sequences and too few low scoring sequences compared with
 805 the predicted relationship between sequence length and similarity
 806 score.  In those cases, the expectation values may be
 807 overestimates.
 808
 809      The statistical routines assume that the library contains a
 810 large sample of unrelated sequences.  If this is not the case,
 811 then the expectation values are meaningless.  Likewise, if there
 812 are fewer than 20 sequences in the library, the statistical
 813 calculations are not done.
 814
 815      For protein searches, library sequences with E() values <
 816 0.01 for searches of a 10,000 entry protein database are almost
 817 always homologous. Frequently sequences with E()-values from 1 -
 818 10 are related as well. Remember, however, that these E() values
 819 also reflect differences between the amino acid composition of
 820 the query sequence and that of the "average" library sequence.
 821 Thus, when searches are done with query sequences with "biased"
 822 amino-acid composition, unrelated sequences may have
 823 "significant" scores because of sequence bias.  The programs
 824 below, PRDF and PRSS, can address this problem by calculating
 825 similarity scores for random sequences with the same length and
 826 amino acid composition.
 827
 828      If optimization is not used ("-o"), E-values for DNA
 829 sequences overestimate the significance of the scores that are
 830 obtained and unrelated sequences frequently have E()-values <
 831 0.0005. With optimization, the agreement between E()-value
 832 compares favorably with protein sequence comparison.  This is in
 833 part due to the use of more stringent gap penalties for DNA
 834 sequence comparison, -16, -4 rather than -12, -2.  With the
 835 latter penalties, many unrelated sequences appear to have
 836 significant similarity. Nevertheless, since protein sequence
 837 comparison is much more sensitive, DNA sequence comparison should
 838 not be used to identify sequences that encode protein.  Even with
 839 ktup=6, optimization rarely increases run-times more than 50%
 840 with mRNA-size query sequences.  Optimization should be used
 841 whenever possible.
 842
 843      Similar comments apply to TFASTA, where  higher gap
 844 penalties (-16,-4) are required for accurate statistical
 845 estimates.  Because TFASTA produces so many artificial "coding"
 846 sequences with atypical amino acid compositions, the statistical
 847 estimates with TFASTA are often over estimates.  With optimized
 848 scores, ktup=1, and gap penalties of -16, -4, unrelated sequences
 849 will sometimes have E() values of 0.1.  If initn scores are used,
 850 unrelated sequences may have have E() values < 0.01.
 851
 852 PRDF      improved version of RDF program that includes accurate
 853           probability estimates for all three scoring methods
 854           (includes local or window shuffle routine)
 855
 856 PRSS      A version of PRDF that uses the rigorous Smith-Waterman
 857           calculation used by SSEARCH.
 858
 859 RANDSEQ   produces a randomly shuffled sequence from a query
 860           sequence.
 861
 862 RELATE    significance program described by Dayhoff (Atlas of
 863           Protein Sequence and Structure, Vol. 5, Supplement 3).
 864           Each chunk of 25 residues in one sequence is compared
 865           to every 25 residue fragment of the second sequence.
 866           Sequences which are genuinely related will have a large
 867           number of scores greater than 3 standard deviations
 868           above the mean score of all of the comparisons.
 869
 870 3.6.  Other analysis programs
 871
 872 AACOMP    calculate the amino acid composition and molecular
 873           weight of a sequence.
 874
 875 BESTSCOR  calculate the best self-comparison score.
 876
 877 GREASE    Kyte-Doolittle hydropathicity profile
 878
 879 TGREASE   graphic plot of Kyte-Doolittle profile
 880
 881 FROMGB    convert from GenBank LOCUS format (also used by the
 882           IBI-Pustell programs) to Pearson/FASTA format.
 883
 884 GARNIER   A secondary structure prediction program using the
 885           method of Garnier, Osgusthorpe, and Robson, J. Mol.
 886           Biol., (1978) 120:97-120.
 887
 888 3.7.  Options
 889
 890      These programs have a number of output options, which are
 891 invoked by the environment variables LINLEN, SHOWALL, and MARKX.
 892 Alternatively, these values can be controlled by command line
 893 options.  The number of sequence residues per output line is now
 894 adjustable by setting the environment variable LINLEN, or the
 895 command line option -w.  LINLEN is normally 60, to change it set
 896 LINLEN=80 before running the program or add -w 80 to the command
 897 line.  LINLEN can be set up to 200.  SHOWALL (-a) determines
 898 whether all, or just a portion, of the aligned sequences are
 899 displayed.  Previously, FASTP would show the entire length of
 900 both sequences in an alignment while FASTN would only show the
 901 portions of the two sequences that overlapped. Now the default is
 902 to show only the overlap between the two sequences, to show
 903 complete sequences, set SHOWALL=1, or use the -a option on the
 904 command line.
 905
 906      The differences between the two aligned sequences can be
 907 highlighted in three different ways by changing the environment
 908 variable MARKX or the -m option.  Normally (MARKX=0) the program
 909 uses ':' do denote identities and '.' to denote conservative
 910 replacements.  If MARKX=1, the program will not mark identities;
 911 instead conservative replacements are denoted by a 'x' and non-
 912 conservative substitutions by a 'X'.  If MARKX=2, the residues in
 913 the second sequence are only shown if they are different from the
 914 first. MARKX=3 displays the aligned library sequences without the
 915 query sequence; these can be used to build a primitive multiple
 916 alignment.  MARKX=4 provides a graphical display of the
 917 boundaries of the alignments. Thus the five options are:
 918
 919
 920      MARKX=0      MARKX=1       MARKX=2       MARKX=3      MARKX=4
 921
 922     MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
 923     ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
 924     MWKSCGYPYT   MWKSCGYPYT
 925
 926
 927 (fasta20u4, Feb. 1996) In addition MARKX=10 is a new, parseable
 928 format for use with other programs.  See the file"readme.v20u4"
 929 for a more complete description.
 930
 931 3.8.  Command line options
 932
 933      It is now possible to specify  several options on the
 934 command line, instead of using environment variables.  The
 935 command line options are preceded by a dash; the following
 936 options are available:
 937
 938 -a        same as showall=1
 939
 940 -A        force Smith-Waterman alignments for DNA sequences and
 941           TFASA.  By default, only FASTA protein sequence
 942           comparisons use Smith-Waterman alignments.
 943
 944 -b #      Number of sequence scores to be shown on output.  In
 945           the absence of this option, fasta (and tfasta and
 946           ssearch) display all library sequences obtaining
 947           similarity scores with expectations less than 10.0 if
 948           optimized score are used, or 2.0 if they are not. The
 949           -b option can limit the display further, but it will
 950           not cause additional sequences to be displayed.
 951
 952 -c #      Threshold score for optimization (OPTCUT).  Set "-c 1"
 953           to optimize every sequence in a database.  (This slows
 954           the program down about 5-fold).
 955
 956 -E #      Limit the number of scores and alignments shown based
 957           on the expected number of scores.  Used to override the
 958           expectation value of 10.0 used by default.  When used
 959           with -Q, -E 2.0 will show all library sequences with
 960           scores with an expectation value <= 2.0.
 961
 962 -d #      Number of alignments to be reported by default. (Used
 963           in conjunction with -Q).  No longer necessary, see "-b"
 964           above.
 965
 966 -f        Penalty for the first residue in a gap (-12 by default
 967           for proteins, -16 for DNA or for TFASTA).
 968
 969 -g        Penalty for additional residues in a gap (-2 by default
 970           for proteins, -4 for DNA and TFASTA ).
 971
 972 -h        Penalty for frameshift (FASTX, TFASTX only).
 973
 974 -H        Omit histogram.
 975
 976 -i        Invert (reverse complement) the query sequence if it is
 977           DNA.  For TFASTX, search the reverse complement of the
 978           library sequence only.
 979
 980 -k #      Threshold for joining init1 segments to build an initn
 981           score (GAPCUT).
 982
 983 -l file   Location of library menu file (FASTLIBS).
 984
 985 -L        Display more information about the library sequence in
 986           the alignment.
 987
 988 -m #      MARKX = # (0, 1, 2, 3, 4, 10)
 989
 990 -n        Force the query sequence to be treated as a DNA
 991           sequence.  This is particularly useful for query
 992           sequences that contain a large number of ambiguous
 993           residues, e.g. transcription factor binding sites.
 994
 995 -O        Send copy of results to "filename."  Helpful for
 996           environments without STDOUT.
 997
 998 -o        Turn off default optimization of all scores greater
 999           than OPTCUT. Sort results by "initn" scores.
1000
1001 -Q,-q     Quiet - does not prompt for any input.  Writes scores
1002           and alignments to the terminal or standard output file.
1003
1004 -r file   Save a results summary line for every sequence in the
1005           sequence library.  The summary line includes the
1006           sequence identifier, superfamily number (if available)
1007           position in the library, and the similarity scores
1008           calculated.  This option can be used to evaluate the
1009           sensitivity and selectivity of different search
1010           strategies (see W. R. Pearson (1991) Genomics 11:635-
1011           650.)
1012
1013 -s file   SMATRIX is read from file.  Several SMATRIX files are
1014           provided with the standard distribution.  For protein
1015           sequences: codaa.mat - based on minimum mutation
1016           matrix; idnaa.mat - identity matrix; pam250.mat - the
1017           PAM250 matrix developed by Dayhoff et al (Atlas of
1018           Protein Sequence and Structure, vol. 5, suppl. 3,
1019           1978); pam120.mat - a PAM120 matrix.  The default
1020           scoring matrix is BLOSUM50, PAM250 is available with
1021           "-s 250", BLOSUM62 ("-s BL62") is also available.
1022
1023 -v        (LINEVAL) values used for line styles in plfasta
1024
1025 -w #      Line length (width) = number (<200)
1026
1027 -x        Specifies offsets for the beginning of the query and
1028           library sequence.  For example, if you are comparing
1029           upstream regions for two genes, and the first sequence
1030           contains 500 nt of upstream sequence while the second
1031           contains 300 nt of upstream sequence, you might try:
1032
1033               fasta -x "-500 -300" seq1.nt seq2.nt
1034
1035           If the -x option is not used, FASTA assumes numbering
1036           starts with 1.  This option will not work properly with
1037           the translated library sequence with tfasta.  (You
1038           should double check to be certain the negative
1039           numbering works properly.)
1040
1041 -y        Set the width of the band used for calculating
1042           "optimized" scores.  For proteins and ktup=2, the width
1043           is 16.  For proteins with ktup=1, the width is 32 by
1044           default.  For DNA the width is 16.
1045
1046 -z        Turn off statistical calculations.
1047
1048 -1        sort output by init1 score (as FASTP used to do).
1049
1050 -3        (TFASTA, TFASTX only) translate only three forward
1051           frames
1052
1053
1054 For example:
1055
1056     fasta -w 80 -a seq1.aa seq.aa
1057
1058 would compare the sequence in seq1.aa to that in seq2.aa and
1059 display the results with 80 residues on an output line, showing
1060 all of the residues in both sequences.  Be sure to enter the
1061 options before entering the file names, or just enter the options
1062 on the command line, and the program will prompt for the file
1063 names.
1064
1065      Not all of these options are appropriate for all of the
1066 programs.  The options above are used by FASTA and TFASTA. RELATE
1067 uses the -s option, ALIGN uses the -w, -m, and -s options, and
1068 the PRDF program uses -c, -f, -k, and -s.
1069
1070 4.  Environment variable summary
1071
1072      Environment variables allow you to set search parameters
1073 that will be used frequently when you run a program; for example,
1074 if you prefer to use the PAM250 scoring matrix, you might "set
1075 SMATRIX=250."  Command line parameters, if used, always override
1076 environment variable settings. The following environment
1077 variables are used by this program:
1078
1079 AABANK    the file name  of the default sequence library.
1080
1081 FASTLIBS  the location of the file which contains the list of
1082           library files to be searched.
1083
1084 GAPCUT    threshold used for joining init1 regions in the second
1085           step of FASTA.  Normally set based on sequence length
1086           and ktup.
1087
1088 LIBTYPE   used to specify the format of the library sequence for
1089           FASTA and TFASTA.
1090
1091 LINLEN    output line length - can go up to 200
1092
1093 LINEVAL   used by plfasta to determine the relationship between
1094           line style and similarity score (-v).  This should be a
1095           string of three numbers, e.g.  "200 100 50"
1096
1097 MARKX     symbol for denoting matches, mismatches. Note that this
1098           symbol is only used across the optimized local region;
1099           sequences that are outside this region are not marked.
1100
1101 OPTCUT    Set the threshold to be used for optimization in a band
1102           around the best initial region.  Normally the OPTCUT
1103           value is calculated from the length of the sequence and
1104           the ktup value (for a 200 residue sequence, it is about
1105           28).  If OPTCUT=1, every sequence in the database will
1106           be optimized.  This is the most sensitive option.
1107
1108 PAMFACT   This version of fasta uses a more sensitive method for
1109           identifying initial regions. Instead of using a
1110           constant factor (fact) for each match in a ktup, it
1111           uses the scoring matrix (PAM) scores.  While this works
1112           well for protein sequences, it has not been as
1113           carefully tested for DNA sequences, so by default, this
1114           modification is used for proteins but not for DNA.
1115           Setting the PAMFACT environment variable to 1 forces
1116           the option on; PAMFACT=0 turns it off.
1117
1118 SHOWALL   on output, show the complete sequence instead of just
1119           the overlap of the two aligned sequences.
1120
1121 SMATRIX   alternative scoring matrix file.
1122
1123 TEKPLOT   (IBM-PC only, Unix and VMS versions generate Tektronix
1124           graphics by default) Generate Tektronix output.
1125           Normally, PLFASTA and TGREASE plot graphs using the
1126           Turbo C graphics library.  Unfortunately, often these
1127           plots cannot be printed out without special programs.
1128           However, if you set TEKPLOT=1, tektronix graphics
1129           commands will be used.  Tektronix commands can be used
1130           together with the PLOTDEV program, available from
1131           Microplot Systems.  They no lonter sell this program,
1132           but it can be downloaded from
1133           http://iquest.com/~microplt/index1.html.  PLOTDEV also
1134           allows you to print out graphics on the screen.
1135
1136 As always, please inform me of bugs as soon as possible.
1137
1138 William R. Pearson
1139 Department of Biochemistry
1140 Box 440, Jordan Hall
1141 U. of Virginia
1142 Charlottesville, VA
1143
1144 wrp@virginia.EDU