binaries/src/fasta34/pvcomp.1

.TH PVCOMPFA/PVCOMPSW/v3.4 1 "January, 2003"
.SH NAME
.B pv34compfa
\- scan a protein or DNA sequence library for similar
sequences using the FASTA algorithm in parallel on a network of
machines running pvm3.

.B pv34compsw
\- scan a protein or DNA sequence library for similar
sequences using the Smith-Waterman algorithm in parallel on a network
of machines running pvm3.

.B ps34compfa
\- evaluate sequence comparison parameters using the FASTA
algorithm and super-family-annotated libraries.

.B ps34compsw
\- evaluate sequence comparison parameters using the
Smith-Waterman algorithm and super-family-annotated libraries.

.SH SYNOPSIS
.B pv34compfa
[-Q|q -B -b # -d # -E # -f # -g # -H -i -J M:N -n -o -p #
\& -R
.I STATFILE
\& -r "+n/-m" \& -S -s
.I SMATRIX
\& -w # -1 ] query-library reference-library [
.I ktup
]

.B pv34compfa
[\-QBbcefgHiJnopRrSsw1] \- interactive mode

.B pv34compsw
[-Q|q -B -b # -e -f delval -g gapval -i
\& -n -p # -R
.I STATFILE
\& -r "+n/-m" \& -S -s
.I SMATRIX
\& ] query-library reference-library [
.I ktup
]

.B pv34compsw
[\-QBbefgnpRrsS] \- interactive mode

.SH DESCRIPTION
.B pv34compfa
and
.B pv34compsw
compare all of the sequences in one DNA or protein sequence library
(the query library) with all of the entries in a reference sequence
library using the FASTA (pv34compfa) or Smith-Waterman (pv34compsw)
algorithms.  For example,
.B pv34compfa
can compare a library of protein sequences to all of the sequences in
the NBRF PIR protein sequence database.
.B pv34compfa
and
.B pv34compsw
are designed to run in parallel on networks of Unix workstations using
the PVM parallel programming system. (For more information on PVM,
send email to "netlib@ornl.gov" with the message "send index for pvm3").
.PP
.B pv34compfa
uses the rapid sequence comparison algorithm
described in Pearson and Lipman, Proc. Natl. Acad. Sci. USA, (1988) 85:2444.
The program can be invoked either with command line arguments or in
interactive mode.  The optional third argument,
.I ktup\c
\&, sets the sensitivity and speed of the search.  If
.I ktup=2,
similar regions in the two sequences being compared are found by
looking at pairs of aligned residues; if
.I ktup=1,
single aligned amino acids are examined.
.I ktup
can be set to 2 or 1 for protein sequences, or from 1 to 6 for DNA sequences.
If
.I ktup
is not specified, the default is 2 for proteins and 6 for DNA.
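.PP
As a rough illustration of why a smaller ktup is more sensitive but
slower, the Python sketch below lists the positions at which two
sequences share an identical run of ktup residues.  It is only a
sketch and not the code used by pv34compfa; the function name
shared_ktuples and the two example sequences are hypothetical.
.in +0.5i
.nf

```python
from collections import defaultdict


def shared_ktuples(query, target, ktup):
    """Return sorted (query_pos, target_pos) pairs where ktup residues match."""
    # Index every ktup-length word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Look up each ktup-length word of the target in that index.
    hits = []
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], []):
            hits.append((i, j))
    return sorted(hits)


# Hypothetical example sequences, chosen only for illustration.
query = "MKTAYIAK"
target = "TAYIMKQ"
print(len(shared_ktuples(query, target, 1)))  # single-residue matches
print(len(shared_ktuples(query, target, 2)))  # residue-pair matches
```

.fi
.in -0.5i
.PP
With ktup=1 every shared residue is a candidate match, so more
positions must be examined; with ktup=2 only shared residue pairs
are reported.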
.PP
.B pv34compfa
compares a library of query sequences (there need be only one) to a
reference sequence library.  Normally
.B pv34compfa
sorts the output by the
.I initn
score.  With the
.I \-1
option, sequences are ranked by their
.I init1
score.  Alternatively, the
.I \-o
option causes optimized scores to be calculated for every sequence
scoring above a threshold, and the output to be sorted by the optimized
scores.
.PP
.B pv34compsw
uses the rigorous Smith-Waterman algorithm to compare protein or
DNA sequences.  The gap penalties and scoring matrices can be
modified with the
.I \-f\c
\&,
.I \-g\c
\&, and
.I \-s
options.
.PP
.B pv34compfa
(and
.B pv34compsw\c
\&) will automatically decide whether the query sequence is DNA or
protein by reading the query sequence as protein and determining
whether the `amino-acid composition' is more than 85% A+C+G+T.
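.PP
The composition test described above can be sketched in Python as
follows.  This is an illustration under stated assumptions, not the
programs' own code: the function name looks_like_dna, the folding of
lower case to upper case, and the skipping of non-letter characters
are all assumptions.
.in +0.5i
.nf

```python
def looks_like_dna(sequence, threshold=0.85):
    """Guess DNA when more than threshold of the residues are A, C, G, or T."""
    # Assumption: case is ignored and non-letter characters are skipped.
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for c in residues if c in "ACGT")
    return acgt / len(residues) > threshold


print(looks_like_dna("ACGTTGCAACGTNACGT"))  # 16 of 17 residues are A/C/G/T
print(looks_like_dna("MKTAYIAKQRQISFVK"))   # 3 of 16 residues are A/C/G/T
```

.fi
.in -0.5i
.PP
A protein that happens to be rich in alanine, cysteine, glycine, and
threonine could be misclassified by such a test, which is one reason
the \-n option exists to force DNA parameters.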
.PP
.B ps34compfa
and
.B ps34compsw
are versions of
.B pv34compfa
and
.B pv34compsw
that evaluate the quality of a search by reporting how many
high-scoring related sequences and low-scoring unrelated sequences
were found.  These programs require that both the query library and
the reference library be annotated with superfamily numbers for every
sequence in the library.
.SH OPTIONS
.LP
.B pv34compfa
and
.B pv34compsw
now support all the options of the fasta3(_t) programs.
.TP
\-B
Report z-score, rather than bit-score, in list of best hits.
.TP
\-b #
The number of similarity scores to be shown (10 by default).
.TP
\-E #
Expectation value limit for displaying best scores.
.TP
\-d #
The number of alignments to be shown.
.TP
\-f #
(delval) Penalty for the first residue in a gap (\-12 by default for proteins).
.TP
\-g #
(gapval) Penalty for additional residues in a gap after the first (\-2
by default for proteins).
.TP
\-H
Turn on histogram display (off by default).
.TP
\-i
Invert (reverse complement) DNA sequence.
.TP
\-J M:N
Start at the M-th sequence in the query library and continue to the
N-th.  By default, J=1 and the search begins with the first sequence
and ends with the last, but sometimes it makes sense to start in the
middle of the query library if a run partially completed, and to
finish "early" if the analysis will be run on several parallel
clusters.
.TP
\-n
Force the program to use DNA sequence parameters.
.TP
\-p #
Number of "slave" processors to use.  With
.B pv34compfa\c
\&, this is typically one less than the number of processors
available, so that one processor can be used to collate results.  With
.B pv34compsw\c
\&, it is more efficient to use every processor as a slave and
not use this option.
.TP
\-Q \-q
Quiet option.  The programs will not prompt for input.
.TP
\-R file
(STATFILE) Causes
.B pv34compfa
and
.B pv34compsw
to write out the sequence identifier, superfamily number (if available),
and similarity scores to
.I STATFILE
for every sequence in the library.  These results are not sorted.
.TP
\-r
Specify the DNA match/mismatch scores in the form "+n/-m", for example "+3/-2".
The default is "+5/-4".  The "+" and "-" are required.
.TP
\-S
Treat lower case residues as low complexity regions.
.TP
\-s file
The filename of an alternative scoring matrix file.
.LP
.B pv34compfa
only
.TP
\-1
Sort similarity scores by
.I init1
scores instead of
.I initn
scores.
.TP
\-c #
(OPTCUT) The threshold for optimization with the
.B \-o
option.
.TP
\-o
(no-optimize) Causes
.B pv34compfa
not to perform the default optimization on all of the sequences in the library
with
.B initn
scores greater than
.B OPTCUT\c
\&.
.TP
\-y #
Width for limited optimization (32 by default).
.SH FILES
.LP
Query library files must be in Pearson/FASTA format, e.g.
.in +0.5i
.nf
>seq-id | sfnum descriptive line
tmlyrghi... (sequence)

.fi
.in -0.5i
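.PP
A header line in the layout shown above could be split into its parts
with a few lines of Python.  This is a minimal sketch, not the parser
used by these programs: the function name parse_header and the example
header are hypothetical, and keeping sfnum as a string and treating a
header without "|" as unannotated are assumptions.
.in +0.5i
.nf

```python
def parse_header(line):
    """Split '>seq-id | sfnum descriptive line' into (seq_id, sfnum, description)."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    head, bar, rest = line[1:].partition("|")
    if not bar:
        # Assumption: no "|" means the sequence has no superfamily number.
        parts = head.split(None, 1)
        seq_id = parts[0] if parts else ""
        description = parts[1].strip() if len(parts) > 1 else ""
        return seq_id, None, description
    fields = rest.split(None, 1)
    sfnum = fields[0] if fields else None
    description = fields[1].strip() if len(fields) > 1 else ""
    return head.strip(), sfnum, description


# Hypothetical header, for illustration only.
print(parse_header(">GT8.7 | 40001 glutathione transferase"))
```

.fi
.in -0.5i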
.PP
.B pv34compfa
and
.B pv34compsw
recognize the following library formats: 0 - Pearson/FASTA; 1 - GenBank tape;
2 - NBRF/PIR Codata; 3 - EMBL/SWISS-PROT; 5 - NBRF/PIR VMS.
.PP
.I Scoring matrices \-
These programs use a different format for the scoring (PAM) matrix
file from FASTA: they use the PAM matrix file format that is used by BLASTP
and produced by Altschul's "pam.c" program in the BLAST package.
.SH BUGS
The program has been tested extensively only with type 0 and type 5
files.  This documentation file may not be up to date.
.SH AUTHOR
Bill Pearson
.br
wrp@virginia.EDU