2 RIO - Phylogenomic Protein Function Analysis
4 ____________________________________________
9 RIO/FORESTER : http://www.genetics.wustl.edu/eddy/forester/
10 RIO webserver: http://www.rio.wustl.edu/
12 Reference: Zmasek C.M. and Eddy S.R. (2002)
13 RIO: Analyzing proteomes by automated phylogenomics using
14 resampled inference of orthologs.
15 BMC Bioinformatics 3:14
16 http://www.biomedcentral.com/1471-2105/3/14/
18 It is highly recommended that you read this paper before
19 installing and/or using RIO. (Included in the RIO
20 distribution as PDF: "RIO.pdf".)
23 Preconditions: A Unix system, Java 1.2 or higher, Perl, gcc or cc,
24 ... and some experience with Perl and Unix.
32 This describes how to compile the various components of RIO.
35 "gunzip RIO1.x.tar.gz"
42 in directory "RIO1.x/C":
48 in directory "RIO1.x/hmmer" (version of HMMER is "2.2g"):
50 (if you already have a local copy of HMMER 2.2g installed, this step
51 is not necessary, but in this case you need to change variables "$HMMALIGN",
52 "$HMMSEARCH", "$HMMBUILD", "$HMMFETCH", and "$SFE" to point to the
53 corresponding HMMER programs)
61 in directory "RIO1.x/java" (requires JDK 1.2 or greater):
63 "javac forester/tools/*java"
70 in directory "RIO1.x/puzzle_dqo":
78 in directory "RIO1.x/puzzle_mod":
86 in directory "RIO1.x/phylip_mod/src":
93 2. Setting the variables in "RIO1.x/perl/rio_module.pm"
94 _______________________________________________________
97 Most global variables used in "RIO1.x/perl/rio.pl" are set in
98 the perl module "RIO1.x/perl/rio_module.pm".
99 This module pretty much "controls everything".
101 It is necessary to set the variables which point to:
103 -- the rio directory itself: $PATH_TO_FORESTER
105 (example: $PATH_TO_FORESTER = "/home/czmasek/linux/RIO1.1/";)
108 -- your Java virtual machine: $JAVA
110 (example: $JAVA = "/home/czmasek/linux/j2sdk1.4.0/bin/java";)
113 -- a directory where temporary files can be created: $TEMP_DIR_DEFAULT
118 Now that $PATH_TO_FORESTER, $JAVA, $TEMP_DIR_DEFAULT are set,
119 it is possible to run rio.pl based on the example precalculated distances
122 % RIO1.1/perl/rio.pl 1 A=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out0 p I
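
Before the first run it can help to confirm that the three paths just set actually exist. The following is only a minimal sketch: "check_rio_setup" is a hypothetical helper, not part of the RIO distribution, and it assumes that rio.pl lives in the "perl" subdirectory of the directory $PATH_TO_FORESTER points to.

```shell
# Hypothetical helper (not shipped with RIO): sanity-check the values that
# are about to be entered in rio_module.pm.
check_rio_setup() {
    forester_dir=$1   # value intended for $PATH_TO_FORESTER
    java_bin=$2       # value intended for $JAVA
    temp_dir=$3       # value intended for $TEMP_DIR_DEFAULT
    status=0

    # Assumption: rio.pl sits in the "perl" subdirectory of the RIO directory.
    if [ ! -f "$forester_dir/perl/rio.pl" ]; then
        echo "rio.pl not found under $forester_dir/perl" >&2
        status=1
    fi
    # The Java virtual machine must be a regular, executable file.
    if [ ! -f "$java_bin" ] || [ ! -x "$java_bin" ]; then
        echo "not an executable file: $java_bin" >&2
        status=1
    fi
    # Temporary files are created here, so the directory must be writable.
    if [ ! -d "$temp_dir" ] || [ ! -w "$temp_dir" ]; then
        echo "not a writable directory: $temp_dir" >&2
        status=1
    fi
    return $status
}
```

A call such as check_rio_setup /home/czmasek/linux/RIO1.1/ /home/czmasek/linux/j2sdk1.4.0/bin/java /tmp returns 0 only if all three checks pass (the "/tmp" argument is just a placeholder).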
124 To use RIO to analyze your protein sequences, please continue setting
125 variables and preparing data......
129 -- your local copy of the Pfam database (see http://pfam.wustl.edu/)
130 (if only precalculated distances are being used, these variables do not
133 $PFAM_FULL_DIRECTORY -- the directory containing the "full" alignments
134 (Pfam-A.full) see below (3.)
136 $PFAM_SEED_DIRECTORY -- the directory containing the "seed" alignments
137 (Pfam-A.seed) see below (3.)
139 $PFAM_HMM_DB -- the Pfam HMM library file (Pfam_ls)
143 -- $TREMBL_ACDEOS_FILE and $SWISSPROT_ACDEOS_FILE: see below (4. and 5.).
146 -- list of species (SWISS-PROT codes) which can be analyzed: $SPECIES_NAMES_FILE
147 (for most purposes $PATH_TO_FORESTER."data/species/tree_of_life_bin_1-4_species_list"
148 should be sufficient, hence this variable does not necessarily need to be changed)
151 -- a default species tree in NHX format: $SPECIES_TREE_FILE_DEFAULT
152 (for most purposes $PATH_TO_FORESTER."data/species/tree_of_life_bin_1-4.nhx"
153 should be sufficient, hence this variable does not necessarily need to be changed)
156 -- Only if precalculated distances are being used:
157 $MATRIX_FOR_PWD, $RIO_PWD_DIRECTORY, $RIO_BSP_DIRECTORY,
158 $RIO_NBD_DIRECTORY, $RIO_ALN_DIRECTORY, and $RIO_HMM_DIRECTORY:
159 please see below (6.)
166 IMPORTANT: Need to redo steps 3., 4., 5., and 6. if species
167 in the master species tree and/or the species list
168 are added and/or changed or if a new version of Pfam is used!!
174 3. Downloading and processing of Pfam
175 _____________________________________
179 Please note: Even if you already have a local copy of the
180 Pfam database, you still need to perform steps c. through k.
183 - "Pfam_ls" (Pfam HMM library, glocal alignment models)
184 - "Pfam-A.full" (full alignments of the curated families)
185 - "Pfam-A.seed" (seed alignments of the curated families)
186 [and ideally "prior.tar.gz"]
187 from http://pfam.wustl.edu/ or ftp.genetics.wustl.edu/pub/eddy/pfam-x/
189 b. "gunzip" and "tar -xvf" these downloaded files, if necessary
191 c. create a new directory named "Full" and move "Pfam-A.full" into it
193 d. in directory "Full" execute "RIO1.x/perl/pfam2slx.pl Pfam-A.full"
195 e. set variable $PFAM_FULL_DIRECTORY in "RIO1.x/perl/rio_module.pm"
196 to point to this "Full" directory
198 f. create a new directory named "Seed" and move "Pfam-A.seed" into it
200 g. in directory "Seed" execute "RIO1.x/perl/pfam2slx.pl Pfam-A.seed"
202 h. set variable $PFAM_SEED_DIRECTORY in "RIO1.x/perl/rio_module.pm"
203 to point to this "Seed" directory
205 i. execute "RIO1.x/hmmer/binaries/hmmindex Pfam_ls" (in same
206 directory as "Pfam_ls") resulting in "Pfam_ls.ssi"
208 j. set environment variable HMMERDB to point to the directory where
209 "Pfam_ls" and "Pfam_ls.ssi" reside
210 (for example "setenv HMMERDB /home/czmasek/PFAM7.3/")
212 k. set variable $PFAM_HMM_DB in "RIO1.x/perl/rio_module.pm"
213 to point to the "Pfam_ls" file
214 (for example $PFAM_HMM_DB = "/home/czmasek/PFAM7.3/Pfam_ls";)
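
Steps c., d., f., g., i., and j. can be strung together roughly as follows. This is a sketch, not a script shipped with RIO: "run_in" and "prepare_pfam" are hypothetical names, and DRY_RUN is an assumed switch that prints the commands instead of running them. Note that "setenv" in step j. is csh syntax; the sketch uses the Bourne-shell form. Steps e., h., and k. still have to be done by editing "rio_module.pm".

```shell
# Run a command inside a directory, or only print it when DRY_RUN=1.
run_in() {
    dir=$1
    shift
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "(cd $dir && $*)"
    else
        ( cd "$dir" && "$@" )
    fi
}

# prepare_pfam RIO_DIR PFAM_DIR
#   RIO_DIR  - absolute path to the unpacked RIO1.x directory
#   PFAM_DIR - absolute path to the directory holding Pfam_ls,
#              Pfam-A.full and Pfam-A.seed
prepare_pfam() {
    rio_dir=$1
    pfam_dir=$2

    # steps c. and d.: "Full" directory, then pfam2slx.pl on the full alignments
    run_in "$pfam_dir" mkdir Full &&
    run_in "$pfam_dir" mv Pfam-A.full Full/ &&
    run_in "$pfam_dir/Full" "$rio_dir/perl/pfam2slx.pl" Pfam-A.full &&
    # steps f. and g.: the same for the seed alignments
    run_in "$pfam_dir" mkdir Seed &&
    run_in "$pfam_dir" mv Pfam-A.seed Seed/ &&
    run_in "$pfam_dir/Seed" "$rio_dir/perl/pfam2slx.pl" Pfam-A.seed &&
    # step i.: index the HMM library, producing Pfam_ls.ssi
    run_in "$pfam_dir" "$rio_dir/hmmer/binaries/hmmindex" Pfam_ls || return 1

    # step j.: Bourne-shell equivalent of "setenv HMMERDB ..."
    # (set even in dry-run mode, since it changes no files)
    HMMERDB=$pfam_dir
    export HMMERDB
}
```

Running it first as DRY_RUN=1 prepare_pfam /path/to/RIO1.x /path/to/pfam shows the seven commands without touching any files. HMMERDB is exported only in the shell that calls the function, so it also needs to go into the shell startup file.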
219 4. Extraction of ID, DE, and species from a SWISS-PROT sprot.dat file
220 _____________________________________________________________________
223 This creates the file from which RIO will get the sequence descriptions for
224 sequences from SWISS-PROT.
225 (RIO1.x/data/ does not contain an example for this, since SWISS-PROT is
229 a. download SWISS-PROT "sprotXX.dat" from
230 "ftp://ca.expasy.org/databases/swiss-prot/release/"
232 b. "extractSWISS-PROT.pl <infile> <outfile> [species list]"
234 ("extractSWISS-PROT.pl" is in "RIO1.x/perl")
237 "extractSWISS-PROT.pl sprot40.dat sp40_ACDEOS RIO1.x/data/species/tree_of_life_bin_1-4_species_list"
239 c. the output file should be placed in "RIO1.x/data" and the
240 variable $SWISSPROT_ACDEOS_FILE in "RIO1.x/perl/rio_module.pm" should point
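
Before starting the extraction, a quick look at the downloaded flat file can confirm that it is complete and uncompressed. The sketch below relies only on the SWISS-PROT flat-file layout, in which every entry starts with an "ID" line and description and organism lines carry the codes "DE" and "OS". The function names are hypothetical, and the sketch makes no claim about how "extractSWISS-PROT.pl" itself parses the file.

```shell
# Count the entries in a SWISS-PROT style flat file: each entry begins
# with a line whose two-letter code is "ID", followed by three spaces.
count_sprot_entries() {
    grep -c '^ID   ' "$1"
}

# Show the first few ID, DE and OS lines (default: 12 lines), i.e. the
# kinds of lines that carry identifier, description and species.
show_id_de_os() {
    grep -E '^(ID|DE|OS)   ' "$1" | head -n "${2:-12}"
}
```

For example, count_sprot_entries sprot40.dat prints the number of entries in the file, and show_id_de_os sprot40.dat prints the first twelve matching lines.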
246 5. Extraction of AC, DE, and species from a TrEMBL trembl.dat file
247 __________________________________________________________________
250 This creates the file from which RIO will get the sequence descriptions for
251 sequences from TrEMBL.
252 (RIO1.x/data/ already contains an example: "trembl20_ACDEOS_1-4")
254 a. download TrEMBL "trembl.dat.gz" from
255 "ftp://ca.expasy.org/databases/sp_tr_nrdb/"
257 b. "gunzip trembl.dat.gz"
259 c. "extractTrembl.pl <infile> <outfile> [species list]"
261 ("extractTrembl.pl" is in "RIO1.x/perl")
264 "extractTrembl.pl trembl.dat trembl17.7_ACDEOS_1-4 RIO1.x/data/species/tree_of_life_bin_1-4_species_list"
266 d. the output file should be placed in "RIO1.x/data/" and the
267 variable $TREMBL_ACDEOS_FILE in "RIO1.x/perl/rio_module.pm" should point
272 Now, you could go directly to 7. to run the examples......
276 6. Precalculation of pairwise distances (optional): pfam2pwd.pl
277 _______________________________________________________________
280 This step is of course only necessary if you want to use RIO on
281 precalculated pairwise distances. The precalculation is time consuming
282 (on the order of one to two weeks on ten processors).
283 It is best to run it on a few machines, dividing up the input data.
285 The program to do this is "RIO1.x/perl/pfam2pwd.pl".
287 Please note: "pfam2pwd.pl" creates a logfile in the same directory
288 where it places the pairwise distance output ($MY_RIO_PWD_DIRECTORY).
292 The following variables in "RIO1.x/perl/pfam2pwd.pl" need to be set
293 ("pfam2pwd.pl" gets most of its information from "rio_module.pm"):
296 "$MY_PFAM_FULL_DIRECTORY":
297 This is the directory where the Pfam full alignments reside, processed
298 as described in 3.a to 3.d.
302 "$ALGNS_TO_USE_LIST_FILE":
303 If left empty, all alignments in $MY_PFAM_FULL_DIRECTORY are
304 used to calculate pairwise distances.
305 If this points to a file listing names of Pfam alignments,
306 only those listed are used.
307 The file can either be a simple newline-delimited list, or can have
308 the same format as the "Summary of changes" list
309 ("FI PF03214 RGP NEW SEED HMM_ls HMM_fs FULL DESC")
310 which is part of the Pfam distribution.
311 One purpose of this is to use the list of "too large" alignments
312 in the logfile produced by "pfam2pwd.pl" to run "pfam2pwd.pl" with
313 a smaller species list (as can be set with "$MY_SPECIES_NAMES_FILE")
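
Since the list file may be a plain list with one alignment name per line, it is also a convenient way to divide the input data between machines. The following sketch deals the names of one list out round-robin into several smaller lists. "split_alignment_list" and the chunk file naming are hypothetical and not part of RIO. Each chunk file would then be used as "$ALGNS_TO_USE_LIST_FILE" on one machine.

```shell
# split_alignment_list LIST_FILE N_CHUNKS PREFIX
# Distributes the non-empty lines of LIST_FILE round-robin over
# PREFIX.0, PREFIX.1, ..., PREFIX.(N_CHUNKS-1).
split_alignment_list() {
    list_file=$1   # list of alignment names, one per line
    n_chunks=$2    # number of machines / chunk files
    prefix=$3      # chunk files are named PREFIX.<number>
    awk -v n="$n_chunks" -v prefix="$prefix" '
        NF > 0 {
            file = prefix "." (count % n)   # target chunk for this name
            print > file
            count++
        }
    ' "$list_file"
}
```

For example, split_alignment_list all_alignments 3 chunk writes "chunk.0", "chunk.1", and "chunk.2" (here "all_alignments" stands for a list you have prepared yourself). Round-robin assignment is a simple choice; it does not balance the chunks by alignment size.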
318 "$MY_SPECIES_NAMES_FILE" -- Dealing with too large alignments:
320 This is most important. It determines the species whose sequences
321 are being used (sequences from species not listed in $MY_SPECIES_NAMES_FILE
322 are ignored). Normally, one would use the same list as RIO uses
323 ($SPECIES_NAMES_FILE in "rio_module.pm"):
325 my $MY_SPECIES_NAMES_FILE = $SPECIES_NAMES_FILE;
327 For certain large families (such as protein kinases), one must use
328 a species file which contains fewer species in order to be able to finish
329 the calculations in reasonable time:
331 my $MY_SPECIES_NAMES_FILE = $PATH_TO_FORESTER."data/tree_of_life_bin_1-4_species_list_NO_RAT_RABBIT_MONKEYS_APES_SHEEP_GOAT_HAMSTER
333 An additional way to reduce the number of sequences in an alignment is
334 to only use sequences originating from SWISS-PROT. This is done by
335 placing the following line of code into pfam2pwd.pl:
337 $TREMBL_ACDEOS_FILE = $PATH_TO_FORESTER."data/NO_TREMBL";
341 "$MY_RIO_PWD_DIRECTORY",
342 "$MY_RIO_BSP_DIRECTORY",
343 "$MY_RIO_NBD_DIRECTORY",
344 "$MY_RIO_ALN_DIRECTORY",
345 "$MY_RIO_HMM_DIRECTORY":
346 These determine where to place the output.
347 After all the data has been calculated, the corresponding variables
348 in RIO1.x/perl/rio_module.pm ("$RIO_PWD_DIRECTORY", etc.) need to be set
349 so that they point to the appropriate values. Having different variables
350 allows you to precalculate distances and at the same time use RIO on
351 previously precalculated distances.
356 A directory to create temporary files in.
361 Alignments in which the number of sequences after pruning (determined
362 by "$MY_SPECIES_NAMES_FILE") is lower than $MIN_SEQS, are ignored
363 (no calculation of pwds).
368 Alignments in which the number of sequences after pruning (determined
369 by "$MY_SPECIES_NAMES_FILE") is greater than $MAX_SEQS, are ignored
370 (no calculation of pwds).
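
Taken together, the two limits describe an inclusive range: an alignment with exactly $MIN_SEQS or exactly $MAX_SEQS sequences after pruning is still processed. As a sketch of that reading (the function name is hypothetical, and this is not the code used in "pfam2pwd.pl"):

```shell
# alignment_is_used N_SEQS MIN_SEQS MAX_SEQS
# Succeeds if an alignment with N_SEQS sequences (after pruning) would be
# processed, i.e. it is neither below MIN_SEQS nor above MAX_SEQS.
alignment_is_used() {
    n_seqs=$1
    min_seqs=$2
    max_seqs=$3
    [ "$n_seqs" -ge "$min_seqs" ] && [ "$n_seqs" -le "$max_seqs" ]
}
```

With illustrative limits of 5 and 100, alignment_is_used 5 5 100 and alignment_is_used 100 5 100 both succeed, while alignment_is_used 4 5 100 and alignment_is_used 101 5 100 fail.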
375 Seed for the random number generator for bootstrapping (must be 4n+1).
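
"4n+1" means that the seed leaves a remainder of 1 when divided by 4 (1, 5, 9, 13, ...). A small check along these lines can catch a wrong value before a long run is started. The function name is hypothetical, and the sketch assumes the seed is written as a plain decimal number without leading zeros.

```shell
# is_valid_seed SEED
# Succeeds if SEED is a non-negative decimal integer of the form 4n+1.
is_valid_seed() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ $(( $1 % 4 )) -eq 1 ]
}
```

For example, is_valid_seed 41 succeeds (41 = 4*10 + 1), while is_valid_seed 42 fails.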
380 This is used to choose the model to be used for the (ML)
381 distance calculation:
388 After all the data has been calculated, variable "$MATRIX_FOR_PWD"
389 in RIO1.x/perl/rio_module.pm needs to be set to the same value.
393 Once pairwise distances are calculated, the following variables in
394 "rio_module.pm" need to be set accordingly:
395 $MATRIX_FOR_PWD : corresponds to $MY_MATRIX in pfam2pwd.pl
396 $RIO_PWD_DIRECTORY : corresponds to $MY_RIO_PWD_DIRECTORY in pfam2pwd.pl
397 $RIO_BSP_DIRECTORY : corresponds to $MY_RIO_BSP_DIRECTORY in pfam2pwd.pl
398 $RIO_NBD_DIRECTORY : corresponds to $MY_RIO_NBD_DIRECTORY in pfam2pwd.pl
399 $RIO_ALN_DIRECTORY : corresponds to $MY_RIO_ALN_DIRECTORY in pfam2pwd.pl
400 $RIO_HMM_DIRECTORY : corresponds to $MY_RIO_HMM_DIRECTORY in pfam2pwd.pl
401 ...of course, if Pfam has been updated, the corresponding variables in rio_module.pm
402 ($PFAM_FULL_DIRECTORY, etc.) need to be updated, too.
409 IMPORTANT: Need to redo steps 3., 4., 5., and 6. if species
410 in the master species tree and/or the species list
411 are added and/or changed or if a new version of Pfam is used!
416 7. Example of a phylogenomic analysis using "rio.pl"
417 ____________________________________________________
420 Without using precalculated distances (for this, all the variables above
421 need to point to the correct locations, in particular to your local and processed
422 Pfam database):
424 % RIO1.1/perl/rio.pl 3 A=/path/to/my/pfam/Full/aconitase H=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out3 p I C E
428 Without using precalculated distances (for this, all the variables above
429 need to point to the correct locations, in particular to your local and processed
430 Pfam database) using a query sequence which is already in the alignment:
432 % RIO1.1/perl/rio.pl 4 A=/path/to/my/pfam/Full/aconitase N=LEU2_LACLA/5-449 O=out4 p I C E
436 Using the example precalculated distances in "/example_data/"
437 ($RIO_PWD_DIRECTORY, etc. need to point to $PATH_TO_FORESTER."example_data/"):
439 % RIO1.1/perl/rio.pl 1 A=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out1 p I C E
443 Using a query sequence which is already in the precalculated distances in "/example_data/"
444 ($RIO_PWD_DIRECTORY, etc. need to point to $PATH_TO_FORESTER."example_data/"):
446 % RIO1.1/perl/rio.pl 2 A=aconitase N=LEU2_LACLA/5-449 O=out2 p I C E
450 For detailed instructions on how to use rio.pl, see the source code,
451 or type "rio.pl" without any arguments.
457 zmasek@genetics.wustl.edu