Delete unneeded directory

[jabaws.git] / website / archive / binaries / mac / src / fasta34 / fasta3x.me
diff --git a/website/archive/binaries/mac/src/fasta34/fasta3x.me b/website/archive/binaries/mac/src/fasta34/fasta3x.me

deleted file mode 100644 (file)

index df315be..0000000
--- a/website/archive/binaries/mac/src/fasta34/fasta3x.me
+++ /dev/null
@@ -1,911 +0,0 @@
-.nr pp 11
-.nr sp 11
-.nr tp 11
-.nr fp 10
-.nr fi 0n
-.sz 11
-.if t \{
-.po 1i
-.he 'FASTA3.DOC''Release 3.4, Fall, 2003'
-.fo ''- % -''
-\}
-.if n \{
-.po 0
-.na
-.nh
-\}
-.ll 6.5i
-.ce
-\fBCOPYRIGHT NOTICE\fP
-.lp
-Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William
-R. Pearson and the University of Virginia.  All rights reserved. The
-FASTA program and documentation may not be sold or incorporated into a
-commercial product, in whole or in part, without written consent of
-William R. Pearson and the University of Virginia.  For further
-information regarding permission for use or reproduction, please
-contact: David Hudson, Assistant Provost for Research, University of
-Virginia, P.O. Box 9025, Charlottesville, VA 22906-9025, (434)
-924-6853
-.uh "\s+2The FASTA program package\s0"
-.uh "Introduction"
-.pp
-This documentation describes the version 3 of the FASTA program
-package (see W. R. Pearson and D. J. Lipman (1988), "Improved Tools
-for Biological Sequence Analysis", PNAS 85:2444-2448 [.wrp881.]; W. R.
-Pearson (1996) "Effective protein sequence comparison"
-Meth. Enzymol. 266:227-258;[.wrp960.] Pearson et. al. (1997) Genomics
-46:24-36;[.wrp971.]  Pearson, (1999) Meth. in Molecular Biology
-132:185-219.[.wrp000.]  Version 3 of the FASTA packages contains many
-programs for searching DNA and protein databases and one program
-(prss3) for evaluating statistical significance from randomly shuffled
-sequences.  Several additional analysis programs, including programs
-that produce local alignments, are available as part of version 2 of
-the FASTA package, which is still available.
-.pp
-This document is divided into three sections: (1) A summary overview of
-the programs in the FASTA3 package; (2) A guide to installing the
-programs and databases; (3) A guide to using the FASTA programs. The
-revision history of the programs can be found in the
-\fCreadme.v30..v34\fP, files. The programs are easy to use, so if
-you are using them on a machine that is administered by someone else,
-you can skip section (2) and focus on (1) and (3) to learn how to use
-the programsIf you are installing the programs on your own
-machine, you will need to read section (2) carefully.
-.sh 1 "An overview of the \f(CBFASTA\fP programs"
-.pp
-Although there are a large number of programs in this package, they
-belong to three groups: (1) 
-"Conventional" Library search programs:
-FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3, TFASTY3, SSEARCH3;
-(2)
-Programs for searching with short fragments:
-FASTS3, FASTF3, TFASTS3, TFASTF3;
-(3)
-Statistical significance: PRSS3.
-Programs that start with \f(CBfast\fP search protein
-databases, while \f(CBtfast\fP programs search translated DNA databases.
-Table I gives a brief description of the programs.
-.lp
-.(z
-.TS
-center;
-c s
-c s
-= =
-l lw(5.5i).
-\d\fBTable I. Comparison programs in the FASTA3 package\fP\u
-
-\fCfasta3\fP   T{
-Compare a protein sequence to a protein sequence 
-database or a DNA sequence to a DNA sequence database using the FASTA 
-algorithm.[.wrp881,wrp960.]  Search speed and selectivity are 
-controlled with the \fIktup\fP(wordsize) parameter.  For protein 
-comparisons, \fIktup\fP = 2 by default; \fIktup\fP =1 is more sensitive 
-but slower.  For DNA comparisons, \fIktup\fP=6 by default; \fIktup\fP=3 or 
-\fIktup\fP=4 provides higher sensitivity; \fIktup\fP=1 should be used for 
-oligonucleotides (DNA query lengths < 20).
-T}
-
-\fCssearch3\fP T{
-Compare a protein sequence to a protein sequence 
-database or a DNA sequence to a DNA sequence database using the 
-Smith-Waterman algorithm.[.wat815.]  \fCssearch3\fP is about 10-times 
-slower than FASTA3, but is more sensitive for full-length protein 
-sequence comparison.
-T}
-
-\fCfastx3\fP/ \fCfasty3\fP     T{
-Compare a DNA sequence to a protein
-sequence database, by comparing the translated DNA sequence in three
-frames and allowing gaps and frameshifts.  \fCfastx3\fP uses a
-simpler, faster algorithm for alignments that allows frameshifts only
-between codons; \fCfasty3\fP is slower but produces better alignments
-with poor quality sequences because frameshifts are allowed within
-codons.
-T}
-
-\fCtfastx3\fP/ \fCtfasty3\fP   T{
-Compare a protein sequence to a DNA sequence
-database, calculating similarities with frameshifts to the forward and
-reverse orientations.
-T}
-
-\fCtfasta3\fP  T{
-Compare a protein sequence to a DNA sequence database, calculating 
-similarities (without frameshifts) to the 3 forward and three reverse 
-reading frames.  \fCtfastx3\fP and \fCtfasty3\fP are preferred because 
-they calculate similarity over frameshifts.
-T}
-
-\fCfastf3/tfastf3\fP   T{
-Compares an ordered peptide mixture, as would be obtained by
-Edman degredation of a CNBr cleavage of a protein, against a protein
-(\fCfastf\fP) or DNA (\fCtfastf\fP) database.
-T}
-
-\fCfasts3/tfasts3\fP   T{
-Compares set of short peptide fragments, as would be obtained
-from mass-spec. analysis of a protein, against a
-protein (\fCfasts\fP) or DNA (\fCtfasts\fP) database.
-T}
-=      =
-.TE
-.)z
-.sh 1 "Installing FASTA and the sequence databases"
-.sh 2 "Obtaining the libraries"
-.pp
-The FASTA program package does not include any protein or DNA sequence
-libraries.  Protein databases are available on CD-ROM from the PIR and
-EMBL (see below), or via anonymouse FTP from many different sources.
-As this document is updated in the fall of 1999, no DNA databases are
-available on CD-ROM from the major sequence databases: Genbank at the
-National for Biotechnology Information (\fCwww.ncbi.nlm.nih.gov\fP and
-\fCftp://ncbi.nlm.nih.gov\fP) and EMBL at the European Bioinformatics
-Institute (\fCwww.ebi.ac.uk\fP). However, the databases are available
-via anonymous FTP from both sites.
-.sh 3 "The GENBANK DNA sequence library"
-.pp
-Because of the large size of DNA databases, you will probably want to
-keep DNA databases in only one, or possibly two, formats.  The FASTA3
-programs that search DNA databases - \fCfasta3\fP, \fCtfastx/y3\fP,
-and \fCtfasta3\fP can read DNA databases in Genbank flatfile (not
-ASN.1), FASTA, GCG/compressed-binary, BLAST1.4 (\fCpressdb\fP), and
-BLAST2.0 (\fCformatdb\fP) formats, as well as EMBL format.  If you are
-also running the GCG suite of sequence analysis programs, you should
-use GCG/compressed-binary format or BLAST2.0 format for your
-\fCfasta3\fP searches.  If not, BLAST2.0 is a good choice.  These
-files are considerably more compact than Genbank flat files, and are
-preferred.  The NCBI does not provide software for converting from
-Genbank flat files to Blast2.0 DNA databases, but you can use the
-Blast \fCformatdb\fP program to convert ASN.1 formated Genbank files,
-which are available from the NCBI \fCftp\fP site.
-.pp
-The NCBI also provides the \fCnr\fP, \fCswissprot\fP, and several EST
-databases that are used by BLAST in FASTA format from:
-\fCftp://ncbi.nlm.nih.gov/blast/db\fP.  These databases are updated
-nightly.
-.sh 3 "The NBRF protein sequence library"
-.pp
-You can obtain the PIR protein sequence database
-[.pir980.] from:
-.(l
-National  Biomedical Research Foundation
-Georgetown  University  Medical  Center
-3900 Reservoir Rd, N.W.
-Washington, D.C. 20007
-.)l
-or via ftp from \fCnbrf.georgetown.edu\fP or from the NCBI
-(\fCncbi.nlm.nih.gov/repository/PIR\fP). The data in the \fCascii\fP
-directory is in PIR Codata format, which is not widely used.  I
-recommend the PIR/VMS format data (libtype=5) in the \fCvms\fP
-directory.
-.sh 3 "The EBI/EMBL CD-ROM libraries"
-.pp
-The European Bioinformatics Institute (EBI) distributes both the EMBL
-DNA database and the SwissProt database on CD-ROM,[.apw961.] and they
-are available from:
-.(l
-EMBL-Outstation  European Bioinformatics Institute
-Wellcome Trust Genome Campus,
-Hinxton Hall
-Hinxton,
-Cambridge CB10 1SD
-United Kingdom
-Tel: +44 (0)1223 494444
-Fax: +44 (0)1223 494468
-Email: DATALIB@ebi.ac.uk
-.)l
-In addition, the SWISS-PROT protein sequence database is available via
-anonymous FTP from \fCftp://ftp.expasy.ch/databases/swiss-prot/\fP
-(also see \fCwww.expasy.ch\fP).
-.sh 2 "Finding the libraries: FASTLIBS"
-.pp
-The major problem that most new users of the FASTA package have is in
-setting up the program to find the databases and their library type.
-In general, if you cannot get \fCfasta3\fP to read a sequence
-database, it is likely that something is wrong with the \fCFASTLIBS\fP
-file.  A common problem is that the database file is found, but either
-no sequences are read, or an incorrect number of entries is read.
-This is almost always because the library format (\fClibtype\fP) is
-incorrect.  Note that a type 5 file (PIR/VMS format) can be read
-as a type 0 (default FASTA) format file, and the number of entries
-will be correct, but the sequence lengths will not.
-.pp
-All the search programs in the FASTA3 package use the environment
-variable \fCFASTLIBS\fP to find the protein and DNA sequence libraries.  The
-\fCFASTLIBS\fP variable contains the name of a file that has the actual
-filenames of the libraries.  The \fCfastlibs\fP file included with the
-distribution on is an example of a file that can be referred to by
-FASTLIBS. To use the \fCfastlibs\fP file, type:
-.(l
-\fCsetenv FASTLIBS /usr/lib/fasta/fastgbs\fP (BSD UNIX/csh)
-or
-\fCexport FASTLIBS=/usr/lib/fasta/fastgbs\fP (SysV UNIX/ksh)
-.)l
-Then edit the \fCfastlibs\fP file to indicate where the protein and DNA
-sequence libraries can be found.  If you have a hard disk and your
-protein sequence library is kept in the file \fC/usr/lib/aabank.lib\fP and
-your Genbank DNA sequence library is kept in the directory:
-\fC/usr/lib/genbank\fP, then \fCfastgbs\fP might contain:
-.ne 8
-.(l
-.ft C
-NBRF Protein$0P/usr/lib/seq/aabank.lib 0
-SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
-GB Primate$1P@/usr/lib/genbank/gpri.nam
-GB Rodent$1R@/usr/lib/genbank/grod.nam 
-GB Mammal$1M@/usr/lib/genbank/gmammal.nam
-^   1    ^^^^       4                   ^     ^
-          23                             (5)
-.ft R
-.)l
-The first line of this file says that there is a copy of the NBRF
-protein sequence database (which is a protein database) that can be
-selected by typing "P" on the command line or when the database menu
-is presented in the file \fC/usr/lib/seq/aabank.lib\fP.
-.pp
-Note that there are 4 or 5 fields in the lines in \fCfastgbs\fP.  The first
-field is the description of the library which will be displayed by
-FASTA; it ends with a '$'.  The second field (1 character), is a 0 if
-the library is a protein library and 1 if it is a DNA library.  The
-third field (1 character) is the character to be typed to select the
-library.
-.pp
-The fourth field is the name of the library file.  In the example
-above, the \fC/usr/lib/seq/aabank.lib\fP file contains the entire
-protein sequence library.  However the DNA library file names are
-preceded by a '@', because these files (\fCgpri.nam, grod.nam,
-gmammal.nam\fP) do not contain the sequences; instead they contain the names
-of the files which contain the sequences.  This is done because the
-GENBANK DNA database is broken down in to a large number of smaller
-files.  In order to search the entire primate database, you must
-search more than a dozen files.
-.pp
-In addition, an optional fifth field can be used to specify the format
-of the library file.  Alternatively, you can specify the library
-format in a file of file names (a file preceded by an '@').  This
-field must be separated from the file name by a space character ('\ ')
-from the filename.  In the example above, the \fCaabank.lib\fP file is
-in Pearson/FASTA format, while the \fCswiss.seq\fP file is in PIR/VMS format
-(from the EMBL CD-ROM). Currently, FASTA can read the following formats:
-.(l I
-.ft C
-0 Pearson/FASTA (>SEQID - comment/sequence)
-1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
-2 NBRF CODATA (ENTRY/SEQUENCE)
-3 EMBL/SWISS-PROT (ID/DE/SQ)
-4 Intelligenetics (;comment/SEQID/sequence)
-5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
-6 GCG (version 8.0) Unix Protein and DNA (compressed)
-11 NCBI Blast1.3.2 format  (unix only)
-12 NCBI Blast2.0 format  (unix only, fasta32t08 or later)
-.ft R
-.)l
-In particular, this version will work with the EMBL and PIR VMS
-formats that are distributed on the EMBL CD-ROM. The latter format
-(PIR VMS) is much faster to search than EMBL format.  This release
-also works with the protein and DNA database formats created for the
-BLASTP and BLASTN programs by SETDB and PRESSDB and with the new NCBI
-search format.  If a library format is not specified, for example,
-because you are just comparing two sequences, Pearson/FASTA (format 0)
-is used by default. To specify a library type on the command line,
-add it to the library filename and surround the filename and library
-type in quotes:
-.(l
-.ft C
-fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"
-.ft P
-.)l
-.pp
-You can specify a group of library files by putting a '@' symbol
-before a file that contains a list of file names to be searched.  For
-example, if @gmam.nam is in the fastgbs file, the file "gmam.nam"
-might contain the lines:
-.(l
-.ft C
-</seqdb/genbank
-gbpri1.seq 1
-gbpri2.seq 1
-gbpri3.seq 1
-gbpri4.seq 1
-gbrod.seq 1
-gbmam.seq 1
-.ft R
-.)l
-In this case, the line beginning with a '<' indicates the directory
-the files will be found in.  The remaining lines name the actual
-sequence files.  So the first sequence file to be searched would be:
-.(l
-.ft C
-/usr/lib/genbank/gbpri.seq
-.ft R
-.)l
-The notation "\fC<PIRNAQ:\fP" might be used under the VAX/VMS operating
-system. Under UNIX, the trailing '/' is left off, so the library
-directory might be written as "\fC</usr/seqlib\fP".
-.pp
-The FASTA programs can search a database composed of different files
-in different sequence formats.  For example, you may wish to search
-the Genbank files (in GenBank flat file format) and the EMBL DNA
-sequence database on CD-ROM.  To do this, you simply list the names
-and filetypes of the files to be searched in a file of filenames.  For
-example, to search the mammalian portion of Genbank, the unannotated
-portion of Genbank, and the unannotated portion of the EMBL library,
-you could use the file:
-.(l I
-.ft C
-</usr/lib/DNA
-gbpri.seq 1
-\&#  (this '#' causes the program to display the size of the library)
-gbrod.seq 1
-\&...
-gbmam.seq 1
-\&...
-gbuna.seq 1
-\&...
-unanno.seq 5
-\&#
-.ft R
-.)l
-.(l I F
-You do not need to include library format numbers if you only use the
-Pearson/FASTA version of the PIR protein sequence library.  If no
-library type is specified, the program assumes that type 0 is being
-used.
-.)l 
-.pp
-Test the setup by running FASTA.  Enter the sequence
-file '\fCmgstm1.aa\fP' when the program requests it (this file is
-included with the programs).  The program should then ask you to
-select a protein sequence library.  Alternatively, if you run the
-TFASTA program and use the mgstm1.aa query sequence, the program
-should show you a selection of DNA sequence libraries.
-Once the fastgbs file has been set up correctly, you can
-set FASTLIBS=fastgbs in your AUTOEXEC.BAT file, and you will not need to
-remember where the libraries are kept or how they are named.
-.ne 8
-.sh 1 "Using the FASTA Package"
-.sh 2 "Overview"
-.pp
-The FASTA sequence comparison programs all require similar
-information, the name of a query sequence file, a library file, and
-the \fIktup\fP parameter.  All of the programs can accept arguments
-on the command line, or they will prompt for the file names and
-\fIktup\fP value.
-.lp
-To use FASTA, simply type:
-.(l
-.ft C
-\f(CBFASTA\fP
-and you will be prompted for :
-.in +0.5i
-the name of the test sequence file
-the name of the library file
-and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
-(ktup of 2 is about 5 times faster than ktup = 1)
-.ft R
-.)l
-The program can also be run by typing
-.(l
-.ft C
-FASTA test.aa /lib/bigfile.lib \fIktup\fP (1 or 2)
-.ft R
-.)l
-.lp
-Included with the package are several test files.
-To check to make certain that everything is working, you can try:
-.(l
-.ft C
-fasta musplfm.aa prot_test.lib
-and
-tfastx mgstm1.aa gst.nlib
-.ft R
-.)l
-.sh 2 "Sequence files"
-.pp
-The \fCfasta3\fP programs know about three kinds of sequence files:
-(1) plain sequence files - files that contain nothing but
-sequence residues - can only be used as query sequences. (2) FASTA
-format files.  These are the same as plain sequence files, each
-sequence is preceded by a comment line with a '>' in the first
-column. (3) distributed sequence libraries (this is a broad class that
-includes the NBRF/PIR VMS and blocked ascii formats, Genbank flat-file
-format, EMBL flat-file format, and Intelligenetics format.  All of the
-files that you create should be of type (1) or (2).  FASTA format
-files (ones with a '>' and comment before the sequence) are preferred,
-because they can be used as query or library sequence files by all of
-the programs.
-.pp
-I have included several sample test files, \fC*.aa\fP and \fC*.seq\fP
-as well as two small sequence libraries, \fCprot_test.lib\fP and
-\fCgst.nlib\fP.  The first line may begin with a '>' by a comment.
-Spaces and tabs (and anything else that is not an amino-acid code) are
-ignored.
-.pp
-Library files should have the form:
-.(l
-.ft C
->Sequence name and identifier
-A F A S Y T .... actual sequence.
-F S S       .... second line of sequence.
->Next sequence name and identifier
-.ft R
-.)l
-This is often referred to as "FASTA" or format.  You can
-build your own library by concatenating several sequence files.  Just
-be sure that each sequence is preceded by a line beginning with a '>'
-with a sequence name.
-.pp
-The test file should not have lines longer than 120 characters, and
-sequences entered with word processors should use a document
-mode, with normal carriage returns at the end of lines.
-.pp
-A different format is required to specify the ordered peptide mixture for \fCfastf3/tfastf3\fP. For example:
-.(l I
-.ft C
->mgstm1
-MGCEN,
-MIDYP,
-MLLAY,
-MLLGY
-.ft P
-.)l
-indicates \fCm\fP in the first position of all three peptides (as
-from CNBr), \fCG, I, L\fP (twice) in the second position (first cycle),
-\fCC,D,L\fP (twice) in the third position, etc.  The commas (\fC,\fP)
-are required to indicate the number of fragments in the mixture, but
-there should be no comma after the last residue.
-.pp
-For the \fCfasts3/tfasts3\fP program, the format is the same, except that there
-is no requirement for the peptides to be the same length.
-.sh 1 "Statistical Significance"
-.pp
-All the programs in the FASTA3 package attempt to calculate accurate
-estimates of the statistical significance of a match. For
-\fCfasta3\fP, \fCssearch3\fP, and \fCfastx3/y3\fP, these estimates are
-very accurate.[.wrp971,wrp981.].  Altschul et al. [.alt940.] provides
-an excellent review of the statistics of local similarity scores.
-Local sequence similarity scores follow the extreme value
-distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u)) where u =
-ln(Kmn)/lambda and m,m are the lengths of the query and library
-sequence. This formula can be rewritten as: 1 - exp(-Kmn exp(-lambda
-x), which shows that the average score for an unrelated library
-sequence increases with the logarithm of the length of the library
-sequence.  The \fCfasta3\fP programs use simple linear regression
-against the the log of the library sequence length to calculate a
-normalized "z-score" with mean 50, regardless of library sequence
-length, and variance 10. (Several other estimation methods are
-available with the \fC\-z\fP option.) These z-scores can then be used
-with the extreme value distribution and the poisson distribution (to
-account for the fact that each library sequence comparison is an
-independent test) to calculate the number of library sequences to
-obtain a score greater than or equal to the score obtained in the
-search. The original idea and routines to do the linear regression on
-library sequence length were provided Phil Green, U. Washington.  This
-version uses a slightly different strategy for fitting the data than
-those originally provided by Dr. Green.
-.pp
-The expected number of sequences is plotted in the histogram using an
-"*". Since the parameters for the extreme value distribution are not
-calculated directly from the distribution of similarity scores, the
-pattern of "*'s" in the histogram gives a qualitative view of how well
-the statistical theory fits the similarity scores calculated by the
-programs.  For \fCfasta3\fP, if optimized scores are calculated for
-each sequence in the database (the default), the agreement between the
-actual distribution of "z-scores" and the expected distribution based
-on the length dependence of the score and the extreme value
-distribution is usually very good.  Likewise, the distribution of
-\fCssearch3\fP Smith-Waterman scores typically agrees closely with the
-<actual distribution of "z-scores."  The agreement with unoptimized
-scores, \fIktup=2\fP, is often not very good, with too many high
-scoring sequences and too few low scoring sequences compared with the
-predicted relationship between sequence length and similarity score.
-In those cases, the expectation values may be overestimates.
-.pp
-With version 33t01, all the FASTA programs also report a "bit" score,
-which is equivalent to the bit score reported by BLAST2.  The
-FASTA33/BLAST2 bit score is calculated as: (lambda*S - ln K)/ln 2,
-where S is the raw similarity score, lambda and K are statistical
-parameters estimated from the distribution of unrelated sequence
-similarity scores.  The statistical signficance of a given bit score
-depends on the lengths of the query and library sequences and the size
-of the library, but a 1 bit increase in score corresponds to a 2-fold
-reduction in expectation; a 10-bit increase implies 1000-fold lower
-expectation, etc.
-.pp
-The statistical routines assume that the library contains a large
-sample of unrelated sequences.  If this is not true, then statistical
-parameters can be estimated by using the \fC\-z 11\-15\fP, options.
-\fC\-z\fP options greater than 10 calculate a shuffled similarity score
-for each library sequence, in addition to the unshuffled score, and
-estimate the statistical parameters from the scores of the shuffled
-sequences.  If there are fewer than 20 sequences in the library, the
-statistical calculations are not done.
-.pp
-For protein searches, library sequences with E() values < 0.01 for
-searches of a 10,000 entry protein database are almost always
-homologous. Frequently sequences with E()-values from 1 - 10 are
-related as well, but unrelated sequences ( 1 \- 10 per search) will
-have scores in this renage as well. Remember, however, that these E()
-values also reflect differences between the amino acid composition of
-the query sequence and that of the "average" library sequence.  Thus,
-when searches are done with query sequences with "biased" amino-acid
-composition, unrelated sequences may have "significant" scores because
-of sequence bias.  \fCPRSS3\fP can address this problem by calculating
-similarity scores for random sequences with the same length and amino
-acid composition. 
-.sh 1 "Options"
-.pp
-Command line options are available to change the scoring parameters
-and output display. \fBCommand line options must preceed other program
-arguments, such as the query and library file names.\fP
-.sh 2 "Command line options"
-.ip "-a"
-(fasta3, ssearch3 only) show both sequences in their entirety.
-.ip "-A"
-force Smith-Waterman alignments for fasta3 DNA sequences.  By default,
-only fasta3 protein sequence comparisons use Smith-Waterman alignments.
-.ip "-B"
-Show normalized score as a z-score, rather than a bit-score in the list
-of best scores.
-.ip "-b #"
-Number of sequence scores to be shown on output.  In the absence of
-this option, fasta (and tfasta and ssearch) display all library
-sequences obtaining similarity scores with expectations less than
-10.0 if optimized score are used, or 2.0 if they are not. The -b
-option can limit the display further, but it will not cause additional
-sequences to be displayed.
-.ip "-c #"
-Threshold score for optimization (OPTCUT).  Set "-c 1" to
-optimize every sequence in a database.
-.ip "-E #"
-Limit the number of scores and alignments shown based on the
-expected number of scores.  Used to override the expectation value of 10.0
-used by default.  When used with -Q, -E 2.0 will show all library sequences
-with scores with an expectation value <= 2.0.
-.ip "-d #"
-Maximum number of alignments to be displayed.  Ignored if "-Q" is not
-used.
-.ip "-f"
-Penalty for the first residue in a gap (-12 by default for proteins,
--16 for DNA, -15 for FAST[XY]/TFAST[XY]).
-.ip "-F #"
-Limit the number of scores and alignments shown based on the expected
-number of scores. "-E #" sets the highest E()-value shown; "-F #" sets
-the lowest E()-value. Thus, "-F 0.0001" will not show any matches or
-alignments with E() < 0.0001.  This allows one to skip over close
-relationships in searches for more distant relationships.
-.ip "-g"
-Penalty for additional residues in a gap (-2 by default for proteins,
--4 for DNA, -3 for FAST[XY]/TFAST[XY]).
-.ip "-h"
-Penalty for frameshift (fastx3/y3, tfastx3/y3 only).
-.ip "-H"
-Omit histogram.
-.ip "-i"
-Invert (reverse complement) the query sequence if it is DNA.  For
-tfasta3/x3/y3, search the reverse complement of the library sequence
-only.
-.ip "-j #"
-Penalty for frameshift within a codon (fasty3/tfasty3 only).
-.ip "-l file"
-Location of library menu file (FASTLIBS).
-.ip "-L"
-Display more information about the library sequence in the alignment.
-.ip "-M low-high"
-Range of amino acid sequence lengths to be included in the search.
-.ip "-m #"
-Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10
-.(l I
-.ft C
-    \-m 0        \-m 1          \-m 2          \-m 3        \-m 4
-.ft C
-MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
-::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
-MWKSCGYPYT   MWKSCGYPYT
-.ft P
-.)l
-.ip 
-\fC\-m 5\fP provides a combination of \fC\-m 4\fP and
-\fC\-m 0. \fC\-m 6 provides \fC\-m 5\fP plus HTML formatting.
-.ip "-m 9"
-provides coordinates and scores with the best score information.
-A simple "\fC -m 9\fP extends the normal best score information:
-.(l
-.ft C
-The best scores are:                                      opt bits E(14548)
-XURTG4 glutathione transferase (EC 2.5.1.18) 4 -   ( 219) 1248 291.7 1.1e-79
-.ft P
-.)l
-to include the additional information (on the same line, separated by
-a <tab>):
-.(l
-.ft C
-%_id  %_gid   sw  alen  an0  ax0  pn0  px0  an1  ax1 pn1 px1 gapq gapl  fs
-0.771 0.771 1248  218    1  218    1  218    1  218    1  219   0   0   0
-.ft P
-.)l
-\fC -m 9c\fP provides additional information: an encoded alignment string.  Thus:
-.(l I
-.ft C
-       10        20        30        40        50          60         70  
-GT8.7  NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
-       :.::  . :: ::  .   .:::         : .:    ::.:   .: : ..:.. :::  :..:
-XURTG  NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
-               20        30                 40        50        60        
-.ft P
-.)l
-would be encoded:
-.(l
-.ft C
-=23+9=13-2=10-1=3+1=5
-.ft P
-.)l
-The alignment encoding is with repect to the alignment, not the
-sequences.  The coordinate of the alignment is given earlier in the
-"\fC -m 9c\fP" line.
-.ip "-m 10"
-\fC\-m 10\fP is a new, parseable format for use
-with other programs.  See the file "readme.v20u4" for a more complete
-description. 
-.ip
-As of version "fa34t23b2", it has become possible to combine independent
-"\fC\-m\fP" options.  Thus, one can use "\fC\-m 1 -m 6 -m 9\fP".
-.ip "-M low\-high"
-Include library sequences (proteins only) with lengths between low and
-high.
-.ip "-n"
-Force the query sequence to be treated as a DNA sequence.  This is
-particularly useful for query sequences that contain a large number of
-ambiguous residues, e.g. transcription factor binding sites.
-.ip "-O"
-Send copy of results to "filename."  Helpful for environments without
-STDOUT (mostly for the Macintosh).
-.ip "-o "
-Turn off default optimization of all scores greater than OPTCUT. Sort
-results by "initn" scores (reduces the accuracy of statistical
-estimates).
-.ip "-p"
-Force query to be treated as protein sequence.
-.ip "-Q,-q"
-Quiet - does not prompt for any input.  Writes scores and alignments
-to the terminal or standard output file.
-.ip "-r"
-Specify match/mismatch scores for DNA comparisons.  The default is
-"+5/-4". "+3/-2" can perform better in some cases.
-.ip "-R file"
-Save a results summary line for every sequence in the sequence
-library.  The summary line includes the sequence identifier,
-superfamily number (if available) position 
-in the library, and the similarity scores calculated.  This option can
-be used to evaluate the sensitivity and selectivity of different
-search strategies.[.wrp951,wrp981.]
-.ip "-s file"
-Specify the scoring matrix file.  \fCfasta3\fP uses the same scoring
-matrices as Blast1.4/2.0.  Several scoring matrix files are included
-in the standard distribution.  For protein sequences: \fCcodaa.mat\fP
-- based on minimum mutation matrix; \fCidnaa.mat\fP - identity matrix;
-\fCpam250.mat\fP - the PAM250 matrix developed by Dayhoff et
-al.;[.day787.]  \fCpam120.mat\fP - a PAM120 matrix.  The default
-scoring matrix is BLOSUM50 ("-s BL50"). Other matrices available from
-within the program are: PAM250/"-s P250", PAM120/"-s P120", PAM40/"-s
-P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 \- M40" (MDM are modern
-PAM matrices from Jones et al.,[.tay925.]), BLOSUM50, 62, and 80/"-s
-BL50", "-s BL62", "-s BL80".
-.ip "-S"
-Treat lower-case characters in the query or library sequences as
-"low-complexity" ("seg"-ed) residues.  Traditionally, the "seg"
-program [.woo935.] is used to remove low complexity regions in DNA
-sequences by replacing the residues with an "X".  When the "-S" option
-is used, the FASTA33 (and later) programs provide a potentially more
-informative approach.  With "-S", lower case characters in the query
-or database sequences are treated as "X"'s during the initial scan,
-but are treated as normal residues during the final alignment display.
-Since statistical significance is calculated from the similarity score
-calculated during the library search, when the lower case residues are
-"X"'s, low complexity regions will not produce statistically
-significant matches.  However, if a significant alignment contains low
-complexity regions, their alignmen is shown.  With "-S", lower case
-characters may be included in the alignment to indicate low complexity
-regions, and the final alignment score may be higher than the score
-obtained during the search.
-.ip
-The \fCpseg\fP program can be used to produce databases (or query
-sequences) with lower case residues indicating low complexity regions
-using the command:
-.(l I
-\fCpseg database.fasta -z 1 -q  > database.lc_seg\fP
-.)l
-(\fCseg\fP can also be used with some post processing, see readme.v33tx.)
-.ip
-The \fC-S\fP option should always be used with \fCFASTX/Y\fCP and
-\fCTFASTX/Y\fP because out of frame translations often generate
-low-complexity protein sequences.  However, only lower case characters
-in the protein sequence (or protein database) are masked; lower case
-DNA sequences are translated into upper case protein sequences, and
-not treated as low complexity by the translated alignment programs.
-.ip "-t #"
-Translation table - tfasta3, fastx3, tfastx3, fasty3, and
-tfasty3 now support the BLAST tranlation tables.  See
-\fChttp://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi\fP.
-.ip
-In addition, "\-t t" or "\-t t#" turns on the addition of an implicit termination
-codon to a protein:translated DNA match.  That is, each protein
-sequence implicitly ends with "*", which matches the termination codes
-for the appropriate genetic code.  "\-t t#" sets implicit termination
-and a different genetic code.
-.ip "-U"
-Treat the query sequence an RNA sequence.  In addition to selecting a
-DNA/RNA alphabet, this option causes changes to the scoring matrix so
-that 'G:A' , 'T:C' or 'U:C' are scored as 'G:G'.
-.ip "-V str"
-It is now possible to specify some annotation characters that can be
-included (and will be ignored), in the query sequence file.  Thus, One
-might have a file with: \fC"ACVS*ITRLFT?"\fP, where "*" and "?"  are
-used to indicate phosphorylation.  By giving the option \fC\-V '*?'\fP,
-those characters in the query will be moved to an "annotation string",
-and alignments that include the annotated residues will be highlighted
-with the appropriate character above the sequence (on the number line).
-.ip "-w #"
-Line length (width) = number (<200)
-.ip "-W #"
- context length (default is 1/2 of line width -w) for alignment,
-like fasta and ssearch, that provide additional sequence context.
-.ip "-x #match,#mismatch"
-Specify the penalty for a match to an 'X', and mismatch to 'X',
-independently of the PAM matrix.  Particularly useful for
-\fCfastx3/fasty3\fP, where termination codons are encoded as 'X'.
-.ip "-X \"off1 off2\""
-Specifies offsets for the beginning of the query and library sequence.
-For example, if you are comparing upstream regions for two genes, and
-the first sequence contains 500 nt of upstream sequence while the
-second contains 300 nt of upstream sequence, you might try:
-.(l I
-\fCfasta -X "-500 -300" seq1.nt seq2.nt\fP
-.)l
-If the -X option is not used, FASTA assumes numbering starts with 1.
-(You should double check to be certain the negative numbering works
-properly.)
-.ip "-y"
-Set the width of the band used for calculating "optimized" scores.
-For proteins and ktup=2, the width is 16.  For proteins with ktup=1,
-the width is 32 by default.  For DNA the width is 16.
-.ip "-z -1,0,1,2,3,4,5"
-\fC\-z -1\fP turns off statistical calculations. \fCz 0\fP estimates
-the significance of the match from the mean and standard deviation of
-the library scores, without correcting for library sequence length.
-\fC\-z 1\fP (the default) uses a weighted regression of average score
-vs library sequence length; \fC\-z 2\fP uses maximum likelihood
-estimates of
-.if t \(*l
-.if n Lambda
-and \fIK\fP; \fC\-z 3\fP uses Altschul-Gish
-parameters;[.alt960.] \fC\-z 4 \- 5\fP uses two variations on the
-\fC\-z 1\fP strategy. \fC\-z 1\fP and \fC\-z 2\fP are the best methods,
-in general.
-.ip "-z 11,12,14,15" 
-estimate the statistical parameters from shuffled copies of each
-library sequence.  This doubles the time required for a search, but
-allows accurate statistics to be estimated for libraries comprised of
-a single protein family.
-.ip "-Z db_size"
-set the apparent size of the database to be used when calculating
-expectation E() values.  If you searched a database with 1,000
-sequences, but would like to have the E()-values calculated in the
-context of a 100,000 sequence database, use '-Z 100000'.
-.ip "-1"
-sort output by init1 score (for compatibility with FASTP - do not
-use).
-.ip "-3"
-translate only three forward frames
-.sp
-.lp
-For example:
-.(l
-\fCfasta -w 80 -a seq1.aa seq.aa\fP
-.)l
-would compare the sequence in seq1.aa to that in seq2.aa and display the
-results with 80 residues on an output line, showing all of the residues
-in both sequences.  Be sure to enter the options before entering the file
-names, or just enter the options on the command line, and the program will
-prompt for the file names.
-.sp
-.pp
-(November, 1997) In addition, it is now possible to provide the fasta
-programs with the query sequence (fasta, fasty, ssearch, tfastx), or
-two sequences (prss, lalign, plalign) from the unix "stdin" stream.  This
-makes it much easier to set up FASTA or PRSS WWW pages.  To specify
-that stdin be used, rather than a file, the file name should be
-specified as '-' or '@' (the latter file name makes it possible to
-specify a subset of the sequence).
-Thus:
-.(l
-cat query.aa | fasta -q @:25-75 s
-.)l
-would take residues 25-75 from query.aa and search the 's' library
-(see the discussion of FASTLIBS).
-.sh 2 "Environment variables"
-.pp
-Because the current version of the program allows the user to set
-virtually every option on the command line (except the \fIktup\fP,
-which must be set as the third command line argument), only the
-\fCFASTLIBS\fP environment variable is routinely used.
-.ip "FASTLIBS"
-specifies the location of the file which contains the list of library
-descriptions, locations, and library types (see section on finding
-library files).
-.sh 1 "Frequently Asked Questions (FAQs)"
-.np
-\fIWhich program should I use?\fP See Table I.
-.np
-\fIHow do I search with both DNA strands with\fP \fCfasta3\fP \fIand\fP
-\fCfastx3\fP? With version 32 of the FASTA program package, all
-searches that use DNA queries (e.g. \fCfasta3\fP, \fCfastx3/y3\fP)
-examine both strands. To revert to earlier FASTA behavior - only
-looking at the forward or reverse strand - use \fC\-3\fP to search only
-the forward strand and \fC\-i -3\fP to search only the reverse strand.
-.np
-\fIWhen I search Genbank - the program reports:\fP \fC0 residues in 0
-sequences\fP.  This typically happens because the program does not
-know that you are searching a Genbank flatfile database and is looking
-for a FASTA format database.  Be certain to specify the library type
-("1" for Genbank flatfile) with the database name.
-.np
-What is the difference between \fCfastx3\fP and \fCfasty3\fP (or
-\fCtfastx3\fP and \fCtfasty3\fP).  \fC[t]fastx3\fP uses a simpler
-codon based model for alignments that does not allow frameshifts in
-some codon positions (see ref. [.wrp971.]).  \fCtfastx3\fP is about
-30% faster, but \fCtfasty3\fP can produce higher quality alignments in
-some cases.
-.np
-\fIWhen I run\fP \fCfasta3 -q\fP, I don't see any (or very little)
-output, but I get lots of scores when I run interactively. With the
-\fC\-Q\fP option, the number of high scores displayed is limited by the
-\fC\-E #\fP cutoff, which is 10.0 for protein comparisons, 2.0 for DNA
-comparisons, and 5.0 for translated DNA:protein comparisons.  In
-interactive mode (without \fC\-Q\fP), by default you see 20 high
-scores, regardless of \fCE()\fP value.
-.np
-\fIWhat is ktup\fP \- All of the programs with \fCfast\fP in their
-name use a computer science method called a lookup table to speed the
-search.  For proteins with \fIktup\fP=2, this means that the program
-does not look at any sequence alignment that does not involve matching
-two identical residues in both sequences.  Likewise with DNA and
-\fIktup\fP = 6, the initial alignment of the sequences looks for 6
-identical adjacent nucleotides in both sequences.  Because it is less
-likely that two identical amino-acids will line up by chance in two
-unrelated proteins, this speeds up the comparison.  But very distantly
-related sequences may never have two identical residues in a row but
-will have single aligned identities.  In this case, \fIktup\fP = 1 may
-find alignments that \fIktup\fP=2 misses.
-.np
-\fISometimes, in the list of best scores, the same sequence is shown
-twice with exactly the same score.  Sometimes, the sequence is there
-twice, but the scores are slightly different.\fP When any of the
-\fCfasta3\fP programs searches a long sequence, it breaks the sequence
-up into \fIoverlapping\fP pieces.  The length of the piece depends on
-the length of the query and the particular program being used (it can
-also be controlled with the -N #### option).  Since the pieces overlap
-by the length of the query sequence (or 3*query_length for fastx/y3
-and tfasta/x/y3), if the highest scoring alignment is at the end of
-one piece, it will be scored again at the beginning of the next piece.
-If the alignment is not be completely included in the overlap region,
-one of the pieces will give a higher score than the other.  These
-duplications can be detected by looking at the coordinates of the
-alignment.  If either the beginning or end coordinate is identical in
-two alignments, the alignments are at least partially duplicates.
-.lp
-As always, please inform me of bugs as soon as possible.
-.sp
-.nf
-William R. Pearson
-Department of Biochemistry
-Jordan Hall Box 800733
-U. of Virginia
-Charlottesville, VA
-
-wrp@virginia.EDU
-
-.sh 1 "References"
-.[]