.SH NAME
prss \- test a protein sequence similarity for significance
.SH SYNOPSIS
.B prss34
\&[-Q -A -f # -g # -H -O file -s SMATRIX -w # -Z #]
sequence-file-1 sequence-file-2
[#-of-shuffles]
.PP
.B prfx34
\&[-Q -A -f # -g # -H -O file -s SMATRIX -w # -z 1,3 -Z #]
sequence-file-1 sequence-file-2
[ktup]
[#-of-shuffles]
.PP
.B prss34(_t)/prfx34(_t)
\- interactive mode
.SH DESCRIPTION
.B prss34
and
.B prfx34
are used to evaluate the significance of a protein:protein, DNA:DNA
(\fBprss34\fP), or translated-DNA:protein (\fBprfx34\fP) sequence similarity score
by comparing two sequences and calculating optimal similarity scores,
and then repeatedly shuffling the second sequence, and calculating
optimal similarity scores using the Smith-Waterman algorithm. An
extreme value distribution is then fit to the shuffled-sequence
scores. The characteristic parameters of the extreme value
distribution are then used to estimate the probability that each of
the unshuffled sequence scores would be obtained by chance in one
sequence, or in a number of sequences equal to the number of shuffles.
This program is derived from \fBrdf2\fP,
described by Pearson and Lipman, PNAS (1988) 85:2444-2448, and
Pearson (Meth. Enz. 183:63-98). Use of the extreme value
distribution for estimating the probabilities of similarity scores was
described by Karlin and Altschul, PNAS (1990) 87:2264-2268.
The 'z-values' calculated by rdf2 are not as informative as the P-values
and expectations calculated by prdf.
.B prss34
calculates optimal scores using the same rigorous Smith-Waterman
algorithm (Smith and Waterman, J. Mol. Biol. (1981) 147:195-197) used by the
.B ssearch34
program.
.B prfx34
calculates scores using the FASTX algorithm (Pearson et al. (1997) Genomics 46:24-36).
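.PP
The significance estimate described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the program's own code: it assumes the extreme value (Gumbel) distribution is fit by the method of moments, that the shuffled-sequence scores have already been computed, and the names \fCfit_evd_moments\fP, \fCp_single\fP and \fCp_in_n\fP are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_evd_moments(shuffled_scores):
    """Estimate Gumbel parameters (lam, mu) from the sample mean and variance."""
    mean = statistics.fmean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)  # needs >= 2 scores that are not all equal
    lam = math.pi / (sd * math.sqrt(6.0))   # scale parameter lambda
    mu = mean - EULER_GAMMA / lam           # location parameter
    return lam, mu


def p_single(score, lam, mu):
    """P(S >= score) for one comparison: 1 - exp(-exp(-lam * (score - mu)))."""
    exponent = min(-lam * (score - mu), 700.0)  # clamp to avoid overflow in exp()
    return -math.expm1(-math.exp(exponent))


def p_in_n(p_one, n):
    """Probability of seeing at least one score this high in n comparisons."""
    if p_one >= 1.0:
        return 1.0
    return -math.expm1(n * math.log1p(-p_one))
```

Here \fCp_single\fP corresponds to the chance of the score arising in one sequence, and \fCp_in_n\fP, called with n equal to the number of shuffles, to the chance of it arising in that many sequences.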
.PP
.B prss34
and
.B prfx34
also allow a more sophisticated shuffling method: residues can be shuffled
within a local window, so that the order of residues 1-10, 11-20, etc.,
is destroyed but a residue in the first 10 is never swapped with a residue
outside the first 10, and so on for each local window.
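.PP
The local window shuffle can be illustrated with a short Python sketch. It is an assumption-laden example rather than the program's implementation: the function name \fCwindow_shuffle\fP and its parameters are hypothetical, and the sequence is treated as a plain string of residues.

```python
import random


def window_shuffle(seq, window, rng=None):
    """Shuffle residues only within consecutive blocks of `window` residues."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if rng is None:
        rng = random.Random()
    shuffled = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])  # last block may be shorter
        rng.shuffle(block)                       # order inside the block is destroyed
        shuffled.extend(block)                   # residues never leave their block
    return "".join(shuffled)
```

With a window of 10, residues 1-10 are permuted among themselves, as are residues 11-20, so local composition is preserved while local order is lost.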
.SH EXAMPLES
.B prss34
\& -v 10 musplfm.aa lcbo.aa
.PP
Compare the amino acid sequence in the file musplfm.aa with that
in lcbo.aa, then shuffle lcbo.aa 200 times using a local shuffle with
a window of 10. Report the significance of the
unshuffled musplfm/lcbo comparison scores with respect to the shuffled
scores.
.PP
.B prss34
\& musplfm.aa lcbo.aa 1000
.PP
Compare the amino acid sequence in the file musplfm.aa with the sequences
in the file lcbo.aa, shuffling \fClcbo.aa\fP 1000 times.
The number of shuffles can also be specified with the -k # option.
.PP
.B prfx34
\& mgstm1.esq xurt8c.aa 2 1000
.PP
Translate the DNA sequence in the \fCmgstm1.esq\fP file in all six
frames and compare it to the amino acid sequence in the file
\fCxurt8c.aa\fP, using ktup=2 and shuffling \fCxurt8c.aa\fP 1000
times. Each comparison considers the best forward or reverse
alignment with frameshifts, using the fastx algorithm (Pearson et al.
(1997) Genomics 46:24-36).
.PP
.B prss34
.PP
Run prss in interactive mode. The program will prompt for the file
names of the two query sequence files and the number of shuffles to be
used.
.SH OPTIONS
.PP
.B prss34/prfx34
can be directed to change the scoring matrix, gap penalties, and
shuffle parameters by entering options on the command line (preceded
by a `\-'). All of the options should precede the file names and number of
shuffles.
.TP
\-A
Show unshuffled alignment.
.TP
\-f #
Penalty for opening a gap (-10 by default for proteins).
.TP
\-g #
Penalty for additional residues in a gap (-2 by default for proteins).
.TP
\-H
Do not display histogram of similarity scores.
.TP
\-k #
Number of shuffles (200 is the default).
.TP
\-Q
"quiet" - do not prompt for filename.
.TP
\-O filename
Send a copy of the results to "filename".
.TP
\-s SMATRIX
Specify the scoring matrix. BLOSUM50 is used by default for proteins;
+5/-4 is used by default for DNA.
.B prss34/prfx34
recognizes the same scoring matrices as fasta34, ssearch34, fastx34, etc.;
e.g. BL50, P250, BL62, BL80, MD10, MD20, and other matrices in BLAST1.4
matrix format.
.TP
\-v #
Use a local window shuffle with a window size of #.
.TP
\-z #
Calculate statistical significance using the mean/variance
(moments) approach used by fasta34/ssearch34 or from maximum likelihood
estimates of lambda and K.
.TP
\-Z #
Present statistical significance as if a '#' entry database had
been searched (e.g. "-Z 50000" presents statistical significance as if
50,000 sequences had been compared).
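.PP
The effect of presenting significance "as if" a database of a given size had been searched can be sketched as follows. This is a hedged illustration only: it assumes the expectation is the single-comparison probability multiplied by the number of database entries, with a Poisson approximation for the corresponding probability, and the function name \fCscale_to_database\fP is hypothetical.

```python
import math


def scale_to_database(p_one, db_size):
    """Return (expectation, probability) for db_size independent comparisons."""
    expect = db_size * p_one     # expected number of scores this high by chance
    p_db = -math.expm1(-expect)  # chance of at least one such score (Poisson approx.)
    return expect, p_db


# A single-comparison probability of 1e-6, presented as if 50,000
# sequences had been compared, gives an expectation of about 0.05.
expect, p_db = scale_to_database(1e-6, 50000)
```

Under this assumption a larger database size makes the same raw score look less significant, because more chance matches are expected.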
.SH ENVIRONMENT VARIABLES
.TP
SMATRIX
the filename of an alternative scoring matrix file. For protein
sequences, BLOSUM50 is used by default; PAM250 can be used with the
command line option
.B \-s P250
(or with -s pam250.mat). BLOSUM62 (-s BL62) and PAM120 (-s P120) can be selected in the same way.
.SH "SEE ALSO"
ssearch3(1), fasta3(1).