binaries/src/fasta34/prss3.1

.TH PRSS3 1 local
.SH NAME
prss \- test a protein sequence similarity for significance
.SH SYNOPSIS
.B prss34
\&[-Q -A -f # -g # -H -O file -s SMATRIX -w # -Z #
.I -k # -v #
]
sequence-file-1 sequence-file-2
[
.I #-of-shuffles
]

.B prfx34
\&[-Q -A -f # -g # -H -O file -s SMATRIX -w # -z 1,3 -Z #
.I -k # -v #
]
sequence-file-1 sequence-file-2
[
.I ktup
]
[
.I #-of-shuffles
]

.B prss34(_t)/prfx34(_t)
[-AfghksvwzZ]
\- interactive mode
.SH DESCRIPTION
.B prss34
and
.B prfx34
are used to evaluate the significance of a protein:protein or DNA:DNA
.RB ( prss34 ),
or translated-DNA:protein
.RB ( prfx34 ),
sequence similarity score
by comparing two sequences and calculating optimal similarity scores,
and then repeatedly shuffling the second sequence and calculating
optimal similarity scores using the Smith-Waterman algorithm.  An
extreme value distribution is then fit to the shuffled-sequence
scores.  The characteristic parameters of the extreme value
distribution are then used to estimate the probability that each of
the unshuffled sequence scores would be obtained by chance in one
sequence, or in a number of sequences equal to the number of shuffles.
This program is derived from
.B rdf2\c
\&, described by Pearson and Lipman, PNAS (1988) 85:2444-2448, and
Pearson (Meth. Enz.  183:63-98).  Use of the extreme value
distribution for estimating the probabilities of similarity scores was
described by Karlin and Altschul, PNAS (1990) 87:2264-2268.  The
'z-values' calculated by rdf2 are not as informative as the P-values
and expectations calculated by prdf.
.B prss34
calculates optimal scores using the same rigorous Smith-Waterman
algorithm (Smith and Waterman, J. Mol. Biol. (1981) 147:195-197) used by the
.B ssearch34
program.
.B prfx34
calculates scores using the FASTX algorithm (Pearson et al. (1997) Genomics 46:24-36).
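.PP
The step from shuffled-sequence scores to a probability can be sketched
as follows.  This is a minimal illustration only, assuming a
method-of-moments fit of a Gumbel (type I extreme value) distribution;
it is not taken from the prss34 source, and the function names and the
sample scores are hypothetical.
.nf
```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel location and scale from the sample mean and variance."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    scale = sd * math.sqrt(6.0) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


def p_value(score, location, scale):
    """P(S >= score) for a single comparison under the fitted distribution."""
    z = (score - location) / scale
    # 1 - exp(-exp(-z)), written with expm1 to keep precision for small p
    return -math.expm1(-math.exp(-z))


def expectation(p, n_sequences):
    """Expected number of scores >= the observed one among n_sequences."""
    return p * n_sequences


# Hypothetical shuffled-sequence scores, for illustration only.
shuffled = [31, 28, 35, 30, 27, 33, 29, 32, 36, 30, 28, 34]
loc, scale = fit_gumbel_moments(shuffled)
p = p_value(60, loc, scale)
print(p, expectation(p, 200))
```
.fi
.PP
Multiplying the single-comparison probability by a number of sequences
gives an expectation for that many comparisons; the value 200 above is
simply the default number of shuffles.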
.PP
.B prss34
and
.B prfx34
also allow a more sophisticated shuffling method: residues can be shuffled
within a local window, so that the order of residues 1-10, 11-20, etc.,
is destroyed but a residue in the first ten is never swapped with a residue
outside the first ten, and so on for each local window.
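.PP
The local window shuffle described above can be sketched as follows.
This is an illustrative reading of the description, not the prss34
implementation: the function name, the sample sequence, and the
handling of a final window shorter than the window size are all
assumptions.
.nf
```python
import random


def window_shuffle(sequence, window, rng=None):
    """Shuffle residues only within consecutive blocks of `window` residues."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if rng is None:
        rng = random.Random()
    shuffled = []
    for start in range(0, len(sequence), window):
        block = list(sequence[start:start + window])
        rng.shuffle(block)  # residues never leave their own block
        shuffled.extend(block)
    return "".join(shuffled)


# Hypothetical 25-residue sequence, for illustration only.
seq = "MKTAYIAKQRQISFVKSHFSRQLEE"
result = window_shuffle(seq, 10, random.Random(1))
print(result)
```
.fi
.PP
Because each block is shuffled independently, local composition is
preserved while residue order within each window is randomized.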
.SH EXAMPLES
.TP
(1)
.B prss34
\& -v 10 musplfm.aa lcbo.aa
.PP
Compare the amino acid sequence in the file musplfm.aa with that
in lcbo.aa, then shuffle lcbo.aa 200 times using a local shuffle with
a window of 10.  Report the significance of the
unshuffled musplfm/lcbo comparison scores with respect to the shuffled
scores.
.TP
(2)
.B prss34
musplfm.aa lcbo.aa 1000
.PP
Compare the amino acid sequence in the file musplfm.aa with the sequence
in the file lcbo.aa, shuffling \fClcbo.aa\fP 1000 times.  The number of
shuffles can also be specified with the -k # option.
.TP
(3)
.B prfx34
mgstm1.esq xurt8c.aa 2 1000
.PP
Translate the DNA sequence in the \fCmgstm1.esq\fP file in all six
frames and compare it to the amino acid sequence in the file
\fCxurt8c.aa\fP, using ktup=2 and shuffling \fCxurt8c.aa\fP 1000
times.  Each comparison considers the best forward or reverse
alignment with frameshifts, using the FASTX algorithm (Pearson et al.
(1997) Genomics 46:24-36).
.TP
(4)
.B prss34/prfx34
.PP
Run prss34 or prfx34 in interactive mode.  The program will prompt for
the file names of the two sequence files and the number of shuffles to be
used.
.SH OPTIONS
.PP
.B prss34/prfx34
can be directed to change the scoring matrix, gap penalties, and
shuffle parameters by entering options on the command line (preceded
by a `\-'). All of the options should precede the file names and the
number of shuffles.
.TP
\-A
Show unshuffled alignment.
.TP
\-f #
Penalty for opening a gap (-10 by default for proteins).
.TP
\-g #
Penalty for additional residues in a gap (-2 by default for proteins).
.TP
\-H
Do not display histogram of similarity scores.
.TP
\-k #
Number of shuffles (200 is the default).
.TP
\-Q -q
"quiet" - do not prompt for file names.
.TP
\-O filename
Send a copy of the results to "filename".
.TP
\-s str
Specify the scoring matrix.  BLOSUM50 is used by default for proteins;
+5/-4 is used by default for DNA.
.B prss34
recognizes the same scoring matrices as fasta34, ssearch34, fastx34, etc.;
e.g. BL50, P250, BL62, BL80, MD10, MD20, and other matrices in BLAST1.4
matrix format.
.TP
\-v #
Use a local window shuffle with a window size of #.
.TP
\-z #
Calculate statistical significance using the mean/variance
(moments) approach used by fasta34/ssearch or from maximum likelihood
estimates of lambda and K.
.TP
\-Z #
Present statistical significance as if a '#' entry database had
been searched (e.g. "-Z 50000" presents statistical significance as if
50,000 sequences had been compared).
.SH ENVIRONMENT VARIABLES
.PP
.B (SMATRIX)
the filename of an alternative scoring matrix file.  For protein
sequences, BLOSUM50 is used by default; PAM250 can be used with the
command line option
.B -s P250
(or with -s pam250.mat).  BLOSUM62 (-s BL62) and PAM120 (-s P120)
can be selected in the same way.
.SH "SEE ALSO"
ssearch3(1), fasta3(1).
.SH AUTHOR
Bill Pearson
.br
wrp@virginia.EDU