RIO - Phylogenomic Protein Function Analysis ____________________________________________ RIO/FORESTER : http://www.genetics.wustl.edu/eddy/forester/ RIO webserver: http://www.rio.wustl.edu/ Reference: Zmasek C.M. and Eddy S.R. (2002) RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 3:14 http://www.biomedcentral.com/1471-2105/3/14/ It is highly recommended that you read this paper before installing and/or using RIO. (Included in the RIO distribution as PDF: "RIO.pdf".) Preconditions: A Unix system, Java 1.2 or higher, Perl, gcc or cc, ... and some experience with Perl and Unix. 1. Compilation ______________ This describes how to compile the various components of RIO. "gunzip RIO1.x.tar.gz" "tar -xvf RIO1.x.tar" in directory "RIO1.x/C": "make" in directory "RIO1.x/hmmer" (version of HMMER is "2.2g"): (if you already have a local copy of HMMER 2.2g installed, this step is not necessary, but in this case you need to change variables "$HMMALIGN", "$HMMSEARCH", "$HMMBUILD", "$HMMFETCH", and "$SFE" to point to the corresponding HMMER programs) "./configure" "make" in directory "RIO1.x/java" (requires JDK 1.2 or greater): "javac forester/tools/*java" "javac ATVapp.java" in directory "RIO1.x/puzzle_dqo": "./configure" "make" in directory "RIO1.x/puzzle_mod": "./configure" "make" in directory "RIO1.x/phylip_mod/src": "make install" 2. Setting the variables in "RIO1.x/perl/rio_module.pm" _______________________________________________________ Most global variables used in "RIO1.x/perl/rio.pl" are set in the perl module "RIO1.x/perl/rio_module.pm". This module pretty much "controls everything". It is necessary to set the variables which point to: -- the rio directory itself: $PATH_TO_FORESTER (example: $PATH_TO_FORESTER = "/home/czmasek/linux/RIO1.1/";) -- your Java virtual machine: $JAVA (example: $JAVA = "/home/czmasek/linux/j2sdk1.4.0/bin/java";) -- a directory where temporary files can be created: $TEMP_DIR_DEFAULT Example: Now that $PATH_TO_FORESTER, $JAVA, $TEMP_DIR_DEFAULT are set, it is posssible to run rio.pl based on the example precalculated distances in "/example_data/": % RIO1.1/perl/rio.pl 1 A=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out0 p I To use RIO to analyze your protein sequences, please continue setting variables and preparing data...... -- your local copy of the Pfam database (see http://pfam.wustl.edu/) (if only precalculated distances are being used, these variables do not matter): $PFAM_FULL_DIRECTORY -- the directory containing the "full" alignments (Pfam-A.full) see below (3.) $PFAM_SEED_DIRECTORY -- the directory containing the "seed" alignments (Pfam-A.seed) see below (3.) $PFAM_HMM_DB -- the Pfam HMM library file (Pfam_ls) see below (3.) -- $TREMBL_ACDEOS_FILE and $SWISSPROT_ACDEOS_FILE: see below (4. and 5.). -- list of species (SWISS-PROT codes) which can be analyzed: $SPECIES_NAMES_FILE (for most purposes $PATH_TO_FORESTER."data/species/tree_of_life_bin_1-4_species_list" should be sufficient, hence this variable does not necessarly need to be changed) -- a default species tree in NHX format: $SPECIES_TREE_FILE_DEFAULT (for most purposes $PATH_TO_FORESTER."data/species/tree_of_life_bin_1-4.nhx" should be sufficient, hence this variable does not necessarly need to be changed) -- Only if precalculated distances are being used: $MATRIX_FOR_PWD, $RIO_PWD_DIRECTORY, $RIO_BSP_DIRECTORY, $RIO_NBD_DIRECTORY, $RIO_ALN_DIRECTORY, and $RIO_HMM_DIRECTORY: please see below (6.) IMPORTANT: Need to redo steps 3., 4., 5., and 6. if species in the master species tree and/or the species list are added and/or changed or if a new version of Pfam is used!! 3. Downloading and processing of Pfam _____________________________________ Please note: Even if you already have a local copy of the Pfam database, you still need to perform steps c. through k. a. download - "Pfam_ls" (PFAM HMM library, glocal alignment models) - "Pfam-A.full" (full alignments of the curated families) - "Pfam-A.seed" (seed alignments of the curated families) [and ideally "prior.tar.gz"] from http://pfam.wustl.edu/ or ftp.genetics.wustl.edu/pub/eddy/pfam-x/ b. "gunzip" and "tar -xvf" these downloaded files, if necessary c. create a new directory named "Full" and move "Pfam-A.full" into it d. in directory "Full" execute "RIO1.x/perl/pfam2slx.pl Pfam-A.full" e. set variable $PFAM_FULL_DIRECTORY in "RIO1.x/perl/rio_module.pm" to point to this "Full" directory f. create a new directory named "Seed" and move "Pfam-A.seed" into it g. in directory "Seed" execute "RIO1.x/perl/pfam2slx.pl Pfam-A.seed" h. set variable $PFAM_SEED_DIRECTORY in "RIO1.x/perl/rio_module.pm" to point to this "Seed" directory i. execute "RIO1.x/hmmer/binaries/hmmindex Pfam_ls" (in same directory as "Pfam_ls") resulting in "Pfam_ls.ssi" j. set environment variable HMMERDB to point to the directory where "Pfam_ls" and "Pfam_ls.ssi" reside (for example "setenv HMMERDB /home/czmasek/PFAM7.3/") k. set variable $PFAM_HMM_DB in "RIO1.x/perl/rio_module.pm" to point to the "Pfam_ls" file (for example $PFAM_HMM_DB = "/home/czmasek/PFAM7.3/Pfam_ls";) 4. Extraction of ID, DE, and species from a SWISS-PROT sprot.dat file _____________________________________________________________________ This creates the file from which RIO will get the sequence descriptions for sequences from SWISS-PROT. (RIO1.x/data/ does not contain an example for this, since SWISS-PROT is copyrighted.) a. download SWISS-PROT "sprotXX.dat" from "ftp://ca.expasy.org/databases/swiss-prot/release/" b. "extractSWISS-PROT.pl [species list]" ("extractSWISS-PROT.pl" is in "RIO1.x/perl") example: "extractSWISS-PROT.pl sprot40.dat sp40_ACDEOS RIO1.x/data/species/tree_of_life_bin_1-4_species_list" c. the output file should be placed in "RIO1.x/data" and the variable $SWISSPROT_ACDEOS_FILE in "RIO1.x/perl/rio_module.pm" should point to this output. 5. Extraction of AC, DE, and species from a TrEMBL trembl.dat file __________________________________________________________________ This creates the file from which RIO will get the sequence descriptions for sequences from TrEMBL. (RIO1.x/data/ already contains an example: "trembl20_ACDEOS_1-4") a. download TrEMBL "trembl.dat.gz" from "ftp://ca.expasy.org/databases/sp_tr_nrdb/" b. "gunzip trembl.dat.gz" c. "extractTrembl.pl [species list]" ("extractTrembl.pl" is in "RIO1.x/perl") example: "extractTrembl.pl trembl.dat trembl17.7_ACDEOS_1-4 RIO1.x/data/species/tree_of_life_bin_1-4_species_list" d. the output file should be placed in "RIO1.x/data/" and the variable $TREMBL_ACDEOS_FILE in "RIO1.x/perl/rio_module.pm" should point to this output. Now, you could go to directly to 7. to run the examples...... 6. Precalculation of pairwise distances (optional): pfam2pwd.pl _______________________________________________________________ This step is of course only necessary if you want to use RIO on precalculated pairwise distances. The precalculation is time consuming (range of one or two weeks on ten processors). It is best to run it on a few machines, dividing up the input data. The program to do this, is "RIO1.x/perl/pfam2pwd.pl". Please note: "pfam2pwd.pl" creates a logfile in the same directory where is places the pairwise distance output ($MY_RIO_PWD_DIRECTORY). The following variables in "RIO1.x/perl/pfam2pwd.pl" need to be set ("pfam2pwd.pl" gets most of its information from "rio_module.pm"): "$MY_PFAM_FULL_DIRECTORY": This is the directory where the Pfam full alignments reside, processed as described in 3.a to 3.d. "$ALGNS_TO_USE_LIST_FILE": If left empty, all alignments in $MY_PFAM_FULL_DIRECTORY are being used the calculate pairwise distances from. If this points to a file listing names of Pfam alignments, only those listed are being used. The file can either be a simple new-line deliminated list, or can have the same format as the "Summary of changes" list ("FI PF03214 RGP NEW SEED HMM_ls HMM_fs FULL DESC") which is part of the Pfam distribution. One purpose of this is to use the list of "too large" alignments in the logfile produced by "pfam2pwd.pl" to run "pfam2pwd.pl" with a smaller species list (as can be set with "$MY_SPECIES_NAMES_FILE") on large alignments. "$MY_SPECIES_NAMES_FILE" -- Dealing with too large alignments: This is most important. It determines the species whose sequences are being used (sequences from species not listed in $MY_SPECIES_NAMES_FILE are ignored). Normally, one would use the same list as RIO uses ($SPECIES_NAMES_FILE in "rio_module.pm"): my $MY_SPECIES_NAMES_FILE = $SPECIES_NAMES_FILE; For certain large families (such as protein kinases, one must use a species file which contains less species in order to be able to finish the calculations in reasonable time: my $MY_SPECIES_NAMES_FILE = $PATH_TO_FORESTER."data/tree_of_life_bin_1-4_species_list_NO_RAT_RABBIT_MONKEYS_APES_SHEEP_GOAT_HAMSTER An additional way to reduce the number of sequences in an alignment is to only use sequences originating from SWISS-PROT. This is done by placing the following line of code into pfam2pwd.pl: $TREMBL_ACDEOS_FILE = $PATH_TO_FORESTER."data/NO_TREMBL"; "$MY_RIO_PWD_DIRECTORY", "$MY_RIO_BSP_DIRECTORY", "$MY_RIO_NBD_DIRECTORY", "$MY_RIO_ALN_DIRECTORY", "$MY_RIO_HMM_DIRECTORY": These determine where to place the output. After all the data has been calculated, the corresponding variables in RIO1.x/perl/rio_module.pm ("$RIO_PWD_DIRECTORY", etc.) need to be set so that they point to the appropriate values. Having different variables allows to precalculate distances and at the same time use RIO on previously precalculated distances. "$MY_TEMP_DIR": A directory to create temporary files in. "$MIN_SEQS": Alignments in which the number of sequences after pruning (determined by "$MY_SPECIES_NAMES_FILE") is lower than $MIN_SEQS, are ignored (no calculation of pwds). "$MAX_SEQS": Alignments in which the number of sequences after pruning (determined by "$MY_SPECIES_NAMES_FILE") is greater than $MAX_SEQS, are ignored (no calculation of pwds). "$MY_SEED": Seed for the random number generator for bootstrapping (must be 4n+1). "$MY_MATRIX": This is used to choose the model to be used for the (ML) distance calculation: 0 = JTT 2 = BLOSUM 62 3 = mtREV24 5 = VT 6 = WAG PAM otherwise After all the data has been calculated, variable "$MATRIX_FOR_PWD" in RIO1.x/perl/rio_module.pm needs to be set to the same value. Once pairwise distances are calculated, the following variables in "rio_module.pm" need to be set accordingly: $MATRIX_FOR_PWD : corresponds to $MY_MATRIX in pfam2pwd.pl $RIO_PWD_DIRECTORY : corresponds to $MY_RIO_PWD_DIRECTORY in pfam2pwd.pl $RIO_BSP_DIRECTORY : corresponds to $MY_RIO_BSP_DIRECTORY in pfam2pwd.pl $RIO_NBD_DIRECTORY : corresponds to $MY_RIO_NBD_DIRECTORY in pfam2pwd.pl $RIO_ALN_DIRECTORY : corresponds to $MY_RIO_ALN_DIRECTORY in pfam2pwd.pl $RIO_HMM_DIRECTORY : corresponds to $MY_RIO_HMM_DIRECTORY in pfam2pwd.pl ...of course, if Pfam has been updated, the corresponding variables in rio_module.pm ($PFAM_FULL_DIRECTORY, etc.) need to be updated, too. IMPORTANT: Need to redo steps 3., 4., 5., and 6. if species in the master species tree and/or the species list are added and/or changed or if a new version of Pfam is used! 7. Example of a phylogenomic analysis using "rio.pl" ____________________________________________________ Without using precalculated distances (for this, all the variables above need to point to the correct loctions, in particular to your local and processed Pfam database): % RIO1.1/perl/rio.pl 3 A=/path/to/my/pfam/Full/aconitase H=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out3 p I C E Without using precalculated distances (for this, all the variables above need to point to the correct loctions, in particular to your local and processed Pfam database) using a query sequence which is already in the alignment: % RIO1.1/perl/rio.pl 4 A=/path/to/my/pfam/Full/aconitase N=LEU2_LACLA/5-449 O=out4 p I C E Using the example precalculated distances in "/example_data/" ($RIO_PWD_DIRECTORY, etc. need to point to $PATH_TO_FORESTER."example_data/"): % RIO1.1/perl/rio.pl 1 A=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out1 p I C E Using a query sequence which is already in the precalculated distances in "/example_data/" ($RIO_PWD_DIRECTORY, etc. need to point to $PATH_TO_FORESTER."example_data/"): % RIO1.1/perl/rio.pl 2 A=aconitase N=LEU2_LACLA/5-449 O=out2 p I C E for detailed instructions on how to use rio.pl see the source code, or type "rio.pl" without any arguments Christian Zmasek zmasek@genetics.wustl.edu 05/26/02