.TH PVCOMPFA/PVCOMPSW/v3.4 1 "January, 2003"
.SH NAME
.B pv34compfa
\- scan a protein or DNA sequence library for similar
sequences using the FASTA algorithm in parallel on a network of
machines running pvm3.

.B pv34compsw
\- scan a protein or DNA sequence library for similar
sequences using the Smith-Waterman algorithm in parallel on a network
of machines running pvm3.

.B ps34compfa
\- evaluate sequence comparison parameters using the FASTA
algorithm and super-family-annotated libraries.

.B ps34compsw
\- evaluate sequence comparison parameters using the
Smith-Waterman algorithm and super-family-annotated libraries.

.SH SYNOPSIS
.B pv34compfa
[-Q|q -B -b # -d # -E # -f # -g # -H -i J # -n -o -p #
\& -R
.I STATFILE
\& -r "+n/-m" \& -S -s
.I SMATRIX
\& -w # -1 ] query-library reference-library [
.I ktup
]
.B pv34compfa
[\-QBbcefgHiJnopRrSsw1] \- interactive mode

.B pv34compsw
[-Q|q -B -b # -e -f delval -g gapval -i
\& -n -p # -R -R
.I STATFILE
\& -r "+n/-m" \& -S -s
\& -s
.I SMATRIX
 ] query-library reference-library [
.I ktup
]

.B pv34compsw
[\-QBbefgnpRrsS] \- interactive mode

.SH DESCRIPTION
.B pv34compfa
and
.B pv34compsw
compare all of the sequences in one DNA or protein sequence library
(the query library) with to all of the entries in a reference sequence
library using the FASTA (pv34compfa) or Smith-Waterman (pv34compsw)
algorithms.  For example,
.B pv34compfa
can compare a library of protein sequences to all of the sequences in
the NBRF PIR protein sequence database.
.B pv34compfa
and
.B pv34compsw
are designed to run in parallel on networks of unix workstations using
the PVM parallel programming system. (For more information on PVM,
send email to "netlib@ornl.gov" with the message "send index for pvm3").
.PP
.B pv34compfa
uses the rapid sequence comparison algorithm
described in Pearson and Lipman, Proc. Natl. Acad. USA, (1988) 85:2444.
The program can be invoked either with command line arguments or in
interactive mode.  The optional third argument,
.I ktup
sets the sensitivity and speed of the search.  If
.I ktup=2,
similar regions in the two sequences being compared are found by
looking at pairs of aligned residues; if
.I ktup=1,
single aligned amino acids are examined.
.I ktup
can be set to 2 or 1 for protein sequences, or from 1 to 6 for DNA sequences.
The default if
.I
ktup
is not specified is 2 for proteins and 6 for DNA.
.PP
.B pv34compfa
compares a library of query sequences (there need be only one) to a
reference sequence library.  Normally
.B pv34compfa
sorts the output by the
.I initn
score.  By using the
.I \-1
option, sequences are ranked by their
.B init1
score.  Alternative, the
.I \-o
option causes optimized scores to be calculated for every sequence
greater than a threshold and the output to be sorted by the optimized
scores.
.PP
.B pv34compsw
uses the rigorous Smith-Waterman algorithm to compare protein or
DNA sequences. The gap penalties and scoring matrices can be
modified with the 
.I -f\c
\&, 
.I -k\c
\&, and 
.I -s
options.
.PP
.B pv34compfa
(and
.B pv34compsw\c
\&) will automatically decide whether the query sequence is DNA or
protein by reading the query sequence as protein and determining
whether the `amino-acid composition' is more than 85% A+C+G+T.
.PP
.B ps34compfa
and
.B ps34compsw
are versions of
.B pv34compfa
and
.B pv34compsw
that evaluate the quality of a search by reporting how many
high-scoring related sequences and low-scoring unrelated sequences
were found.  These programs require that both the query library and
the reference library be annotated with superfamily numbers for every
sequence in the library.
.SH OPTIONS
.LP
.B Pv34compfa
and
.B pv34compsw
now support all the options of the fasta3(_t) programs.
.TP
\-B
Report z-score, rather than bit-score, in list of best hits.
.TP
\-b #
The number of similarity scores to be shown (10 by default).
.TP
\-E #
Expectation value limit for displaying best scores.
.TP
\-d #
The number of alignments to be shown.
.TP
\-f #
(delval) penalty for the first residue in a gap. -12 by default for proteins.
.TP
\-g #
(gapval) penalty for additional residues in a gap after the first. -2
by default for proteins.
.TP
\-H #
turn on histogram display (off by default).
.TP
\-i
invert (reverse complement) DNA sequence.
.TP
\-J M:N
start at the M-th sequence in the query library and continue to the
"N-th".  By default, J=1 and the search begins with the first sequence
and ends with the last, but sometimes it makes sense to start in the
middle of the query library if a run partially completed, and to
finish "early" if the analysis will be run on several parallel
clusters.
.TP
\-n
Force the program to use DNA sequence parameters.
.TP
\-p #
Number of "slave" processors to use.  Typically, one less than
the number of processors available with
.B pv34compfa
so that one processor can be used to collate results.  With
.B pv34compsw\c
\&, it is more efficient to use every processor as a slave and
not use this option.
.TP
\-Q \-q
Quiet option.  The programs will not prompt for input.
.TP
\-R file
(STATFILE) Causes
.B pv34compfa
and
.B pv34compsw
to write out the sequence identifier, superfamily number (if available),
and similarity scores to 
.I STATFILE
for every sequence in the library.  These results are not sorted.
.TP
\-r
specify DNA match/mismatch ratio as "+3/-2".  Default is "+5/-4".
The "+" and "-" are required.
.TP
\-S
Treat lower case residues as low complexity regions.
.TP
\-s file
the filename of an alternative scoring matrix file.
.LP
.B
pv34compfa
only
.TP
\-1
sort similarity scores by
.I init1
scores instead of
.I initn
scores.
.TP
\-c #
(OPTCUT) the threshold for optimization with the
.B -o
option.
.TP
\-o
(no-optimize); causes 
.B pv34compfa
not to perform the default optimization on all of the sequences in the library
with
.B initn
scores greater than
.B OPTCUT\c
\&.
.TP
\-y #
Width for limited optimization (32 by default).
.SH FILES
.LP
Query library files must be in Pearson/FASTA format, e.g.
.in +0.5i
.nf
>seq-id | sfnum descriptive line
tmlyrghi... (sequence)

.fi
.in -0.5i
.PP
.B pv34compfa
and
.B pv34compsw
recognize the following library formats: 0 - Pearson/FASTA; 1 - Genbank tape;
2 - NBRF/PIR Codata; 3 - EMBL/SWISS-PROT; 5 - NBRF/PIR VMS.
.PP
.I Scoring matrices \-
These programs use a different format for the scoring (PAM) matrix
file from FASTA; they use the PAM matrix file that is used by BLASTP
and produced by Altshul's "pam.c" program in the BLAST package.
.SH BUGS
The program has been tested extensively only with type 0 and type 5
files.  This documentation file may not be up to date.
.SH AUTHOR
Bill Pearson
.br
wrp@virginia.EDU