RIO - Phylogenomic Protein Function Analysis

____________________________________________


RIO/FORESTER : http://www.genetics.wustl.edu/eddy/forester/
RIO webserver: http://www.rio.wustl.edu/

Reference: Zmasek C.M. and Eddy S.R. (2002)
           RIO: Analyzing proteomes by automated phylogenomics using
           resampled inference of orthologs.
           BMC Bioinformatics 3:14
           http://www.biomedcentral.com/1471-2105/3/14/

           It is highly recommended that you read this paper before
           installing and/or using RIO. (Included in the RIO 
           distribution as PDF: "RIO.pdf".)
           

Preconditions: A Unix system, Java 1.2 or higher, Perl, gcc or cc,  
               ... and some experience with Perl and Unix.


1. Compilation
______________


This describes how to compile the various components of RIO.


  "gunzip RIO1.x.tar.gz"
  
  "tar -xvf RIO1.x.tar"


in directory "RIO1.x/C":

 "make"


in directory "RIO1.x/hmmer" (version of HMMER is "2.2g"):

(if you already have a local copy of HMMER 2.2g installed, this step
is not necessary, but in this case you need to change variables "$HMMALIGN",
"$HMMSEARCH", "$HMMBUILD", "$HMMFETCH", and "$SFE" to point to the 
corresponding HMMER programs)

 "./configure"

 "make"


in directory "RIO1.x/java" (requires JDK 1.2 or greater):

 "javac forester/tools/*java"

 "javac ATVapp.java"


in directory "RIO1.x/puzzle_dqo":

 "./configure"

 "make"


in directory "RIO1.x/puzzle_mod":

 "./configure"

 "make"


in directory "RIO1.x/phylip_mod/src":
 
  "make install"


2. Setting the variables in "RIO1.x/perl/rio_module.pm"
_______________________________________________________


Most global variables used in "RIO1.x/perl/rio.pl" are set in
the perl module "RIO1.x/perl/rio_module.pm".
This module pretty much "controls everything".

It is necessary to set the variables which point to:

-- the rio directory itself: $PATH_TO_FORESTER

   (example: $PATH_TO_FORESTER = "/home/czmasek/linux/RIO1.1/";)


-- your Java virtual machine: $JAVA

   (example: $JAVA = "/home/czmasek/linux/j2sdk1.4.0/bin/java";) 


-- a directory where temporary files can be created: $TEMP_DIR_DEFAULT 


   Example:
   Now that $PATH_TO_FORESTER, $JAVA, $TEMP_DIR_DEFAULT are set,
   it is posssible to run rio.pl based on the example precalculated distances
   in "/example_data/":

     % RIO1.1/perl/rio.pl 1 A=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out0 p I

   To use RIO to analyze your protein sequences, please continue setting
   variables and preparing data......


-- your local copy of the Pfam database (see http://pfam.wustl.edu/)
   (if only precalculated distances are being used, these variables do not
    matter):

   $PFAM_FULL_DIRECTORY -- the directory containing the "full" alignments
                           (Pfam-A.full) see below (3.)

   $PFAM_SEED_DIRECTORY -- the directory containing the "seed" alignments
                           (Pfam-A.seed) see below (3.)

   $PFAM_HMM_DB         -- the Pfam HMM library file (Pfam_ls)
                           see below (3.)


-- $TREMBL_ACDEOS_FILE and $SWISSPROT_ACDEOS_FILE: see below (4. and 5.).


-- list of species (SWISS-PROT codes) which can be analyzed: $SPECIES_NAMES_FILE
   (for most purposes $PATH_TO_FORESTER."data/species/tree_of_life_bin_1-4_species_list"
   should be sufficient, hence this variable does not necessarly need to be changed)


-- a default species tree in NHX format: $SPECIES_TREE_FILE_DEFAULT
   (for most purposes $PATH_TO_FORESTER."data/species/tree_of_life_bin_1-4.nhx"
   should be sufficient, hence this variable does not necessarly need to be changed)


-- Only if precalculated distances are being used:
   $MATRIX_FOR_PWD, $RIO_PWD_DIRECTORY, $RIO_BSP_DIRECTORY,
   $RIO_NBD_DIRECTORY, $RIO_ALN_DIRECTORY, and $RIO_HMM_DIRECTORY:
   please see below (6.) 


IMPORTANT:  Need to redo steps 3., 4., 5., and 6. if species
            in the master species tree and/or the species list
            are added and/or changed or if a new version of Pfam is used!!


3. Downloading and processing of Pfam
_____________________________________


Please note: Even if you already have a local copy of the
Pfam database, you still need to perform steps c. through k. 

a. download 
   - "Pfam_ls" (PFAM HMM library, glocal alignment models)
   - "Pfam-A.full" (full alignments of the curated families)
   - "Pfam-A.seed" (seed alignments of the curated families)
     [and ideally "prior.tar.gz"]
   from http://pfam.wustl.edu/ or ftp.genetics.wustl.edu/pub/eddy/pfam-x/

b. "gunzip" and "tar -xvf" these downloaded files, if necessary

c. create a new directory named "Full" and move "Pfam-A.full" into it

d. in directory "Full" execute "RIO1.x/perl/pfam2slx.pl Pfam-A.full"

e. set variable $PFAM_FULL_DIRECTORY in "RIO1.x/perl/rio_module.pm"
   to point to this "Full" directory

f. create a new directory named "Seed" and move "Pfam-A.seed" into it

g. in directory "Seed" execute "RIO1.x/perl/pfam2slx.pl Pfam-A.seed"

h. set variable $PFAM_SEED_DIRECTORY in "RIO1.x/perl/rio_module.pm"
   to point to this "Seed" directory

i. execute "RIO1.x/hmmer/binaries/hmmindex Pfam_ls" (in same
   directory as "Pfam_ls") resulting in "Pfam_ls.ssi"

j. set environment variable HMMERDB to point to the directory where
   "Pfam_ls" and "Pfam_ls.ssi" reside
   (for example "setenv HMMERDB /home/czmasek/PFAM7.3/")

k. set variable $PFAM_HMM_DB in "RIO1.x/perl/rio_module.pm"
   to point to the "Pfam_ls" file
   (for example $PFAM_HMM_DB = "/home/czmasek/PFAM7.3/Pfam_ls";)


4. Extraction of ID, DE, and species from a SWISS-PROT sprot.dat file
_____________________________________________________________________


This creates the file from which RIO will get the sequence descriptions for
sequences from SWISS-PROT.
(RIO1.x/data/ does not contain an example for this, since SWISS-PROT is
copyrighted.)


a. download SWISS-PROT "sprotXX.dat" from
   "ftp://ca.expasy.org/databases/swiss-prot/release/"

b. "extractSWISS-PROT.pl <infile> <outfile> [species list]"

    ("extractSWISS-PROT.pl" is in "RIO1.x/perl")

    example:
    "extractSWISS-PROT.pl sprot40.dat sp40_ACDEOS RIO1.x/data/species/tree_of_life_bin_1-4_species_list"

c. the output file should be placed in "RIO1.x/data" and the
   variable $SWISSPROT_ACDEOS_FILE in "RIO1.x/perl/rio_module.pm" should point
   to this output.


5. Extraction of AC, DE, and species from a TrEMBL trembl.dat file
__________________________________________________________________


This creates the file from which RIO will get the sequence descriptions for
sequences from TrEMBL.
(RIO1.x/data/ already contains an example: "trembl20_ACDEOS_1-4")

a. download TrEMBL "trembl.dat.gz" from
   "ftp://ca.expasy.org/databases/sp_tr_nrdb/"

b. "gunzip trembl.dat.gz"

c. "extractTrembl.pl <infile> <outfile> [species list]"

    ("extractTrembl.pl" is in "RIO1.x/perl")

    example:
    "extractTrembl.pl trembl.dat trembl17.7_ACDEOS_1-4 RIO1.x/data/species/tree_of_life_bin_1-4_species_list"

d. the output file should be placed in "RIO1.x/data/" and the
   variable $TREMBL_ACDEOS_FILE in "RIO1.x/perl/rio_module.pm" should point
   to this output.


Now, you could go to directly to 7. to run the examples......


6. Precalculation of pairwise distances (optional): pfam2pwd.pl
_______________________________________________________________


This step is of course only necessary if you want to use RIO on
precalculated pairwise distances. The precalculation is time consuming
(range of one or two weeks on ten processors).
It is best to run it on a few machines, dividing up the input data.

The program to do this, is "RIO1.x/perl/pfam2pwd.pl".

Please note: "pfam2pwd.pl" creates a logfile in the same directory
             where is places the pairwise distance output ($MY_RIO_PWD_DIRECTORY).
              

The following variables in "RIO1.x/perl/pfam2pwd.pl" need to be set
("pfam2pwd.pl" gets most of its information from "rio_module.pm"):


"$MY_PFAM_FULL_DIRECTORY":
  This is the directory where the Pfam full alignments reside, processed
  as described in 3.a to 3.d.


"$ALGNS_TO_USE_LIST_FILE":
  If left empty, all alignments in $MY_PFAM_FULL_DIRECTORY are being
  used the calculate pairwise distances from.
  If this points to a file listing names of Pfam alignments,
  only those listed are being used.
  The file can either be a simple new-line deliminated list, or can have
  the same format as the "Summary of changes" list
  ("FI   PF03214 RGP   NEW  SEED HMM_ls HMM_fs FULL DESC")
  which is part of the Pfam distribution.
  One purpose of this is to use the list of "too large" alignments
  in the logfile produced by "pfam2pwd.pl" to run "pfam2pwd.pl" with 
  a smaller species list (as can be set with "$MY_SPECIES_NAMES_FILE")
  on large alignments.

  
"$MY_SPECIES_NAMES_FILE" -- Dealing with too large alignments:

  This is most important. It determines the species whose sequences 
  are being used (sequences from species not listed in $MY_SPECIES_NAMES_FILE
  are ignored). Normally, one would use the same list as RIO uses
  ($SPECIES_NAMES_FILE in "rio_module.pm"):

  my $MY_SPECIES_NAMES_FILE = $SPECIES_NAMES_FILE;

  For certain large families (such as protein kinases, one must use
  a species file which contains less species in order to be able to finish
  the calculations in reasonable time:

  my $MY_SPECIES_NAMES_FILE = $PATH_TO_FORESTER."data/tree_of_life_bin_1-4_species_list_NO_RAT_RABBIT_MONKEYS_APES_SHEEP_GOAT_HAMSTER
  
  An additional way to reduce the number of sequences in an alignment is
  to only use sequences originating from SWISS-PROT. This is done by
  placing the following line of code into pfam2pwd.pl:

  $TREMBL_ACDEOS_FILE = $PATH_TO_FORESTER."data/NO_TREMBL"; 


"$MY_RIO_PWD_DIRECTORY",
"$MY_RIO_BSP_DIRECTORY",
"$MY_RIO_NBD_DIRECTORY",
"$MY_RIO_ALN_DIRECTORY",
"$MY_RIO_HMM_DIRECTORY":
  These determine where to place the output.
  After all the data has been calculated, the corresponding variables
  in RIO1.x/perl/rio_module.pm ("$RIO_PWD_DIRECTORY", etc.) need to be set
  so that they point to the appropriate values. Having different variables
  allows to precalculate distances and at the same time use RIO on
  previously precalculated distances.


"$MY_TEMP_DIR":
  A directory to create temporary files in.


"$MIN_SEQS":
  Alignments in which the number of sequences after pruning (determined
  by "$MY_SPECIES_NAMES_FILE") is lower than $MIN_SEQS, are ignored
  (no calculation of pwds).


"$MAX_SEQS":
  Alignments in which the number of sequences after pruning (determined
  by "$MY_SPECIES_NAMES_FILE") is greater than $MAX_SEQS, are ignored
  (no calculation of pwds).


"$MY_SEED":
  Seed for the random number generator for bootstrapping (must be 4n+1).


"$MY_MATRIX":
  This is used to choose the model to be used for the (ML)
  distance calculation:
  0 = JTT
  2 = BLOSUM 62
  3 = mtREV24
  5 = VT
  6 = WAG
  PAM otherwise
  After all the data has been calculated, variable "$MATRIX_FOR_PWD"
  in RIO1.x/perl/rio_module.pm needs to be set to the same value.


Once pairwise distances are calculated, the following variables in
"rio_module.pm" need to be set accordingly:
$MATRIX_FOR_PWD     : corresponds to $MY_MATRIX in pfam2pwd.pl
$RIO_PWD_DIRECTORY  : corresponds to $MY_RIO_PWD_DIRECTORY in pfam2pwd.pl
$RIO_BSP_DIRECTORY  : corresponds to $MY_RIO_BSP_DIRECTORY in pfam2pwd.pl
$RIO_NBD_DIRECTORY  : corresponds to $MY_RIO_NBD_DIRECTORY in pfam2pwd.pl
$RIO_ALN_DIRECTORY  : corresponds to $MY_RIO_ALN_DIRECTORY in pfam2pwd.pl
$RIO_HMM_DIRECTORY  : corresponds to $MY_RIO_HMM_DIRECTORY in pfam2pwd.pl
...of course, if Pfam has been updated, the corresponding variables in rio_module.pm
($PFAM_FULL_DIRECTORY, etc.) need to be updated, too.


IMPORTANT:  Need to redo steps 3., 4., 5., and 6. if species
            in the master species tree and/or the species list
            are added and/or changed or if a new version of Pfam is used!


7. Example of a phylogenomic analysis using "rio.pl"
____________________________________________________


Without using precalculated distances (for this, all the variables above
need to point to the correct loctions, in particular to your local and processed
Pfam database): 

  % RIO1.1/perl/rio.pl 3 A=/path/to/my/pfam/Full/aconitase H=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out3 p I C E


Without using precalculated distances (for this, all the variables above
need to point to the correct loctions, in particular to your local and processed
Pfam database) using a query sequence which is already in the alignment:

  % RIO1.1/perl/rio.pl 4 A=/path/to/my/pfam/Full/aconitase N=LEU2_LACLA/5-449 O=out4 p I C E


Using the example precalculated distances in "/example_data/"
($RIO_PWD_DIRECTORY, etc. need to point to $PATH_TO_FORESTER."example_data/"):

  % RIO1.1/perl/rio.pl 1 A=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out1 p I C E


Using a query sequence which is already in the precalculated distances in "/example_data/"
($RIO_PWD_DIRECTORY, etc. need to point to $PATH_TO_FORESTER."example_data/"):

  % RIO1.1/perl/rio.pl 2 A=aconitase N=LEU2_LACLA/5-449 O=out2 p I C E


for detailed instructions on how to use rio.pl see the source code, 
or type "rio.pl" without any arguments


Christian Zmasek 
zmasek@genetics.wustl.edu
05/26/02