forester/archive/RIO/RIO_INSTALL

   1
   2 RIO - Phylogenomic Protein Function Analysis
   3
   4 ____________________________________________
   5
   6
   7
   8
   9 RIO/FORESTER : http://www.genetics.wustl.edu/eddy/forester/
  10 RIO webserver: http://www.rio.wustl.edu/
  11
  12 Reference: Zmasek C.M. and Eddy S.R. (2002)
  13            RIO: Analyzing proteomes by automated phylogenomics using
  14            resampled inference of orthologs.
  15            BMC Bioinformatics 3:14
  16            http://www.biomedcentral.com/1471-2105/3/14/
  17
  18            It is highly recommended that you read this paper before
  19            installing and/or using RIO. (Included in the RIO
  20            distribution as PDF: "RIO.pdf".)
  21
  22
  23 Preconditions: A Unix system, Java 1.2 or higher, Perl, gcc or cc,
  24                ... and some experience with Perl and Unix.
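
The preconditions above can be checked before compiling. The following
is a minimal sketch and not part of the RIO distribution: the function
names "have_tool" and "check_prereqs" are hypothetical, and the check only
tests that each tool is on the PATH, not which version is installed.

```shell
#!/bin/sh
# Hypothetical helper, not part of RIO: report which prerequisite tools are on PATH.

# have_tool NAME: succeeds if NAME can be found as a command.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# check_prereqs: print one line per tool; return non-zero if anything is missing.
check_prereqs() {
    missing=0
    for tool in java javac perl make; do
        if have_tool "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Either gcc or cc is sufficient as the C compiler.
    if have_tool gcc || have_tool cc; then
        echo "found:   C compiler (gcc or cc)"
    else
        echo "missing: C compiler (gcc or cc)"
        missing=1
    fi
    return "$missing"
}
```

Possible usage: check_prereqs || echo "install the missing tools first"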
  25
  26
  27
  28 1. Compilation
  29 ______________
  30
  31
  32 This describes how to compile the various components of RIO.
  33
  34
  35   "gunzip RIO1.x.tar.gz"
  36
  37   "tar -xvf RIO1.x.tar"
  38
  39
  40
  41
  42 in directory "RIO1.x/C":
  43
  44  "make"
  45
  46
  47
  48 in directory "RIO1.x/hmmer" (version of HMMER is "2.2g"):
  49
  50 (if you already have a local copy of HMMER 2.2g installed, this step
  51 is not necessary, but in this case you need to change variables "$HMMALIGN",
  52 "$HMMSEARCH", "$HMMBUILD", "$HMMFETCH", and "$SFE" to point to the
  53 corresponding HMMER programs)
  54
  55  "./configure"
  56
  57  "make"
  58
  59
  60
  61 in directory "RIO1.x/java" (requires JDK 1.2 or greater):
  62
  63  "javac forester/tools/*java"
  64
  65  "javac ATVapp.java"
  66
  67
  68
  69
  70 in directory "RIO1.x/puzzle_dqo":
  71
  72  "./configure"
  73
  74  "make"
  75
  76
  77
  78 in directory "RIO1.x/puzzle_mod":
  79
  80  "./configure"
  81
  82  "make"
  83
  84
  85
  86 in directory "RIO1.x/phylip_mod/src":
  87
  88   "make install"
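
The compilation steps above can be collected into one script. This is
only a sketch under assumptions: "run_in", "build_rio", and the DRY_RUN
switch are hypothetical names, not part of RIO, and the distribution is
assumed to be unpacked already.

```shell
#!/bin/sh
# Hypothetical wrapper around the compilation steps listed above; not part of RIO.

# run_in DIR COMMAND: run COMMAND (a single string) inside DIR,
# or only print it when DRY_RUN=1.
run_in() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cd $1 && $2"
    else
        (cd "$1" && sh -c "$2")
    fi
}

# build_rio ROOT: compile the components in the order given above.
# ROOT is the unpacked "RIO1.x" directory.
build_rio() {
    root=$1
    run_in "$root/C" "make" || return 1
    # The two hmmer lines can be dropped if a local HMMER 2.2g is already installed.
    run_in "$root/hmmer" "./configure" || return 1
    run_in "$root/hmmer" "make" || return 1
    run_in "$root/java" "javac forester/tools/*java" || return 1
    run_in "$root/java" "javac ATVapp.java" || return 1
    for component in puzzle_dqo puzzle_mod; do
        run_in "$root/$component" "./configure" || return 1
        run_in "$root/$component" "make" || return 1
    done
    run_in "$root/phylip_mod/src" "make install"
}
```

Running "DRY_RUN=1 build_rio RIO1.x" only prints the commands, which makes
it possible to review them before anything is compiled.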
  89
  90
  91
  92
  93 2. Setting the variables in "RIO1.x/perl/rio_module.pm"
  94 _______________________________________________________
  95
  96
  97 Most global variables used in "RIO1.x/perl/rio.pl" are set in
  98 the perl module "RIO1.x/perl/rio_module.pm".
  99 This module controls nearly all of RIO's configuration.
 100
 101 It is necessary to set the variables which point to:
 102
 103 -- the rio directory itself: $PATH_TO_FORESTER
 104
 105    (example: $PATH_TO_FORESTER = "/home/czmasek/linux/RIO1.1/";)
 106
 107
 108 -- your Java virtual machine: $JAVA
 109
 110    (example: $JAVA = "/home/czmasek/linux/j2sdk1.4.0/bin/java";)
 111
 112
 113 -- a directory where temporary files can be created: $TEMP_DIR_DEFAULT
 114
 115
 116
 117    Example:
 118    Now that $PATH_TO_FORESTER, $JAVA, and $TEMP_DIR_DEFAULT are set,
 119    it is possible to run rio.pl based on the example precalculated distances
 120    in "/example_data/":
 121
 122      % RIO1.1/perl/rio.pl 1 A=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out0 p I
 123
 124    To use RIO to analyze your own protein sequences, continue setting the
 125    variables and preparing the data as described below.
 126
 127
 128
 129 -- your local copy of the Pfam database (see http://pfam.wustl.edu/)
 130    (if only precalculated distances are being used, these variables do not
 131     matter):
 132
 133    $PFAM_FULL_DIRECTORY -- the directory containing the "full" alignments
 134                            (Pfam-A.full) see below (3.)
 135
 136    $PFAM_SEED_DIRECTORY -- the directory containing the "seed" alignments
 137                            (Pfam-A.seed) see below (3.)
 138
 139    $PFAM_HMM_DB         -- the Pfam HMM library file (Pfam_ls)
 140                            see below (3.)
 141
 142
 143 -- $TREMBL_ACDEOS_FILE and $SWISSPROT_ACDEOS_FILE: see below (4. and 5.).
 144
 145
 146 -- list of species (SWISS-PROT codes) which can be analyzed: $SPECIES_NAMES_FILE
 147    (for most purposes $PATH_TO_FORESTER."data/species/tree_of_life_bin_1-4_species_list"
 148    should be sufficient, hence this variable does not necessarily need to be changed)
 149
 150
 151 -- a default species tree in NHX format: $SPECIES_TREE_FILE_DEFAULT
 152    (for most purposes $PATH_TO_FORESTER."data/species/tree_of_life_bin_1-4.nhx"
 153    should be sufficient, hence this variable does not necessarily need to be changed)
 154
 155
 156 -- Only if precalculated distances are being used:
 157    $MATRIX_FOR_PWD, $RIO_PWD_DIRECTORY, $RIO_BSP_DIRECTORY,
 158    $RIO_NBD_DIRECTORY, $RIO_ALN_DIRECTORY, and $RIO_HMM_DIRECTORY:
 159    please see below (6.)
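
After editing, it can help to list what is currently set. The helper
below is a sketch, not part of RIO; "show_perl_vars" is a hypothetical
name, and it assumes that each assignment sits on a single line in the
form shown in the examples above ($NAME = "value";).

```shell
#!/bin/sh
# Hypothetical helper: print the assignment lines for selected variables
# in a Perl file. Assumes one assignment per line: $NAME = "value";

# show_perl_vars FILE NAME...: print matching lines, report names not found.
show_perl_vars() {
    file=$1
    shift
    status=0
    for name in "$@"; do
        # Optional leading whitespace and "my "/"our ", a literal "$NAME", then "=".
        if ! grep -E "^[[:space:]]*(our |my )?[$]${name}[[:space:]]*=" "$file"; then
            echo "not set: \$$name" >&2
            status=1
        fi
    done
    return "$status"
}
```

Possible usage:
show_perl_vars RIO1.x/perl/rio_module.pm PATH_TO_FORESTER JAVA TEMP_DIR_DEFAULT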
 160
 161
 162
 163
 164
 165
 166 IMPORTANT:  Need to redo steps 3., 4., 5., and 6. if species
 167             in the master species tree and/or the species list
 168             are added and/or changed or if a new version of Pfam is used!!
 169
 170
 171
 172
 173
 174 3. Downloading and processing of Pfam
 175 _____________________________________
 176
 177
 178
 179 Please note: Even if you already have a local copy of the
 180 Pfam database, you still need to perform steps c. through k.
 181
 182 a. download
 183    - "Pfam_ls" (Pfam HMM library, glocal alignment models)
 184    - "Pfam-A.full" (full alignments of the curated families)
 185    - "Pfam-A.seed" (seed alignments of the curated families)
 186      [and ideally "prior.tar.gz"]
 187    from http://pfam.wustl.edu/ or ftp.genetics.wustl.edu/pub/eddy/pfam-x/
 188
 189 b. "gunzip" and "tar -xvf" these downloaded files, if necessary
 190
 191 c. create a new directory named "Full" and move "Pfam-A.full" into it
 192
 193 d. in directory "Full" execute "RIO1.x/perl/pfam2slx.pl Pfam-A.full"
 194
 195 e. set variable $PFAM_FULL_DIRECTORY in "RIO1.x/perl/rio_module.pm"
 196    to point to this "Full" directory
 197
 198 f. create a new directory named "Seed" and move "Pfam-A.seed" into it
 199
 200 g. in directory "Seed" execute "RIO1.x/perl/pfam2slx.pl Pfam-A.seed"
 201
 202 h. set variable $PFAM_SEED_DIRECTORY in "RIO1.x/perl/rio_module.pm"
 203    to point to this "Seed" directory
 204
 205 i. execute "RIO1.x/hmmer/binaries/hmmindex Pfam_ls" (in same
 206    directory as "Pfam_ls") resulting in "Pfam_ls.ssi"
 207
 208 j. set environment variable HMMERDB to point to the directory where
 209    "Pfam_ls" and "Pfam_ls.ssi" reside
 210    (for example "setenv HMMERDB /home/czmasek/PFAM7.3/")
 211
 212 k. set variable $PFAM_HMM_DB in "RIO1.x/perl/rio_module.pm"
 213    to point to the "Pfam_ls" file
 214    (for example $PFAM_HMM_DB = "/home/czmasek/PFAM7.3/Pfam_ls";)
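
Steps c., d., f., g., i., and j. above can be sketched as a short script.
This is an illustration under assumptions, not part of RIO: the names
"prepare_pfam" and DRY_RUN are hypothetical, both arguments are assumed
to be absolute paths without spaces, and steps e., h., and k. (editing
"rio_module.pm") still have to be done by hand.

```shell
#!/bin/sh
# Hypothetical sketch of steps c., d., f., g., i., and j. above; not part of RIO.

# prepare_pfam PFAM_DIR RIO_ROOT
#   PFAM_DIR: directory holding the unpacked Pfam-A.full, Pfam-A.seed, and Pfam_ls
#   RIO_ROOT: the unpacked "RIO1.x" directory
# With DRY_RUN=1 the commands are only printed.
prepare_pfam() {
    pfam_dir=$1
    rio_root=$2
    for cmd in \
        "mkdir -p $pfam_dir/Full && mv $pfam_dir/Pfam-A.full $pfam_dir/Full/" \
        "cd $pfam_dir/Full && $rio_root/perl/pfam2slx.pl Pfam-A.full" \
        "mkdir -p $pfam_dir/Seed && mv $pfam_dir/Pfam-A.seed $pfam_dir/Seed/" \
        "cd $pfam_dir/Seed && $rio_root/perl/pfam2slx.pl Pfam-A.seed" \
        "cd $pfam_dir && $rio_root/hmmer/binaries/hmmindex Pfam_ls"
    do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            (sh -c "$cmd") || return 1
        fi
    done
    # Bourne-shell equivalent of the csh "setenv HMMERDB ..." line in step j.;
    # it is printed so that it can be added to a shell startup file.
    echo "export HMMERDB=$pfam_dir/"
}
```

Note that "setenv" in step j. is csh syntax; in sh or bash the
corresponding form is the "export" line printed at the end.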
 215
 216
 217
 218
 219 4. Extraction of ID, DE, and species from a SWISS-PROT sprot.dat file
 220 _____________________________________________________________________
 221
 222
 223 This creates the file from which RIO will get the sequence descriptions for
 224 sequences from SWISS-PROT.
 225 (RIO1.x/data/ does not contain an example for this, since SWISS-PROT is
 226 copyrighted.)
 227
 228
 229 a. download SWISS-PROT "sprotXX.dat" from
 230    "ftp://ca.expasy.org/databases/swiss-prot/release/"
 231
 232 b. "extractSWISS-PROT.pl <infile> <outfile> [species list]"
 233
 234     ("extractSWISS-PROT.pl" is in "RIO1.x/perl")
 235
 236     example:
 237     "extractSWISS-PROT.pl sprot40.dat sp40_ACDEOS RIO1.x/data/species/tree_of_life_bin_1-4_species_list"
 238
 239 c. the output file should be placed in "RIO1.x/data" and the
 240    variable $SWISSPROT_ACDEOS_FILE in "RIO1.x/perl/rio_module.pm" should point
 241    to this output.
 242
 243
 244
 245
 246 5. Extraction of AC, DE, and species from a TrEMBL trembl.dat file
 247 __________________________________________________________________
 248
 249
 250 This creates the file from which RIO will get the sequence descriptions for
 251 sequences from TrEMBL.
 252 (RIO1.x/data/ already contains an example: "trembl20_ACDEOS_1-4")
 253
 254 a. download TrEMBL "trembl.dat.gz" from
 255    "ftp://ca.expasy.org/databases/sp_tr_nrdb/"
 256
 257 b. "gunzip trembl.dat.gz"
 258
 259 c. "extractTrembl.pl <infile> <outfile> [species list]"
 260
 261     ("extractTrembl.pl" is in "RIO1.x/perl")
 262
 263     example:
 264     "extractTrembl.pl trembl.dat trembl17.7_ACDEOS_1-4 RIO1.x/data/species/tree_of_life_bin_1-4_species_list"
 265
 266 d. the output file should be placed in "RIO1.x/data/" and the
 267    variable $TREMBL_ACDEOS_FILE in "RIO1.x/perl/rio_module.pm" should point
 268    to this output.
 269
 270
 271
 272 You can now go directly to 7. to run the examples.
 273
 274
 275
 276 6. Precalculation of pairwise distances (optional): pfam2pwd.pl
 277 _______________________________________________________________
 278
 279
 280 This step is only necessary if you want to use RIO on
 281 precalculated pairwise distances. The precalculation is time-consuming
 282 (on the order of one to two weeks on ten processors).
 283 It is best to run it on a few machines, dividing up the input data.
 284
 285 The program to do this is "RIO1.x/perl/pfam2pwd.pl".
 286
 287 Please note: "pfam2pwd.pl" creates a logfile in the same directory
 288              where it places the pairwise distance output ($MY_RIO_PWD_DIRECTORY).
 289
 290
 291
 292 The following variables in "RIO1.x/perl/pfam2pwd.pl" need to be set
 293 ("pfam2pwd.pl" gets most of its information from "rio_module.pm"):
 294
 295
 296 "$MY_PFAM_FULL_DIRECTORY":
 297   This is the directory where the Pfam full alignments reside, processed
 298   as described in 3.a to 3.d.
 299
 300
 301
 302 "$ALGNS_TO_USE_LIST_FILE":
 303   If left empty, all alignments in $MY_PFAM_FULL_DIRECTORY are used
 304   to calculate pairwise distances.
 305   If this points to a file listing names of Pfam alignments,
 306   only those listed are used.
 307   The file can either be a simple newline-delimited list, or can have
 308   the same format as the "Summary of changes" list
 309   ("FI   PF03214 RGP   NEW  SEED HMM_ls HMM_fs FULL DESC")
 310   which is part of the Pfam distribution.
 311   One purpose of this is to use the list of "too large" alignments
 312   in the logfile produced by "pfam2pwd.pl" to run "pfam2pwd.pl" with
 313   a smaller species list (as can be set with "$MY_SPECIES_NAMES_FILE")
 314   on large alignments.
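
Since the list file can be a simple list of alignment names, one per
line, it can also serve to divide the work between machines. The helper
below is a sketch, not part of RIO: "split_alignment_list" is a
hypothetical name, and it assumes that each machine runs its own copy of
"pfam2pwd.pl" with $ALGNS_TO_USE_LIST_FILE pointing to one of the parts.

```shell
#!/bin/sh
# Hypothetical helper: divide a newline-delimited list of Pfam alignment
# names into N roughly equal parts, one per machine.

# split_alignment_list LIST N PREFIX: write PREFIX.1 ... PREFIX.N,
# assigning names to the parts in round-robin order.
split_alignment_list() {
    list=$1
    parts=$2
    prefix=$3
    # Create every part file empty first, so all N files exist afterwards.
    i=1
    while [ "$i" -le "$parts" ]; do
        : > "$prefix.$i"
        i=$((i + 1))
    done
    # Skip blank lines; the k-th remaining name goes to part ((k-1) mod N) + 1.
    awk -v n="$parts" -v p="$prefix" \
        'NF { k++; print > (p "." ((k - 1) % n + 1)) }' "$list"
}
```

Possible usage: split_alignment_list all_alignments 10 part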
 315
 316
 317
 318 "$MY_SPECIES_NAMES_FILE" -- Dealing with too large alignments:
 319
 320   This is most important. It determines the species whose sequences
 321   are being used (sequences from species not listed in $MY_SPECIES_NAMES_FILE
 322   are ignored). Normally, one would use the same list as RIO uses
 323   ($SPECIES_NAMES_FILE in "rio_module.pm"):
 324
 325   my $MY_SPECIES_NAMES_FILE = $SPECIES_NAMES_FILE;
 326
 327   For certain large families (such as protein kinases), one must use
 328   a species file which contains fewer species in order to be able to finish
 329   the calculations in reasonable time:
 330
 331   my $MY_SPECIES_NAMES_FILE = $PATH_TO_FORESTER."data/tree_of_life_bin_1-4_species_list_NO_RAT_RABBIT_MONKEYS_APES_SHEEP_GOAT_HAMSTER
 332
 333   An additional way to reduce the number of sequences in an alignment is
 334   to only use sequences originating from SWISS-PROT. This is done by
 335   placing the following line of code into pfam2pwd.pl:
 336
 337   $TREMBL_ACDEOS_FILE = $PATH_TO_FORESTER."data/NO_TREMBL";
 338
 339
 340
 341 "$MY_RIO_PWD_DIRECTORY",
 342 "$MY_RIO_BSP_DIRECTORY",
 343 "$MY_RIO_NBD_DIRECTORY",
 344 "$MY_RIO_ALN_DIRECTORY",
 345 "$MY_RIO_HMM_DIRECTORY":
 346   These determine where to place the output.
 347   After all the data has been calculated, the corresponding variables
 348   in RIO1.x/perl/rio_module.pm ("$RIO_PWD_DIRECTORY", etc.) need to be set
 349   so that they point to the appropriate values. Having separate variables
 350   makes it possible to precalculate distances while at the same time using RIO on
 351   previously precalculated distances.
 352
 353
 354
 355 "$MY_TEMP_DIR":
 356   A directory to create temporary files in.
 357
 358
 359
 360 "$MIN_SEQS":
 361   Alignments in which the number of sequences after pruning (determined
 362   by "$MY_SPECIES_NAMES_FILE") is lower than $MIN_SEQS are ignored
 363   (no pairwise distances are calculated).
 364
 365
 366
 367 "$MAX_SEQS":
 368   Alignments in which the number of sequences after pruning (determined
 369   by "$MY_SPECIES_NAMES_FILE") is greater than $MAX_SEQS are ignored
 370   (no pairwise distances are calculated).
 371
 372
 373
 374 "$MY_SEED":
 375   Seed for the random number generator for bootstrapping (must be 4n+1).
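
"4n+1" means that the seed leaves a remainder of 1 when divided by 4
(1, 5, 9, 13, ...). A candidate value can be checked with a small
helper such as the following sketch; "is_valid_seed" is a hypothetical
name and not part of RIO.

```shell
#!/bin/sh
# Hypothetical helper: check that a seed has the form 4n+1.

# is_valid_seed SEED: succeeds if SEED is a non-negative integer with SEED mod 4 = 1.
is_valid_seed() {
    case $1 in
        ''|*[!0-9]*|0?*) return 1 ;;   # reject empty, non-numeric, or zero-padded input
    esac
    [ $(($1 % 4)) -eq 1 ]
}
```

Possible usage: is_valid_seed 41 && echo "ok"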
 376
 377
 378
 379 "$MY_MATRIX":
 380   This is used to choose the model to be used for the (ML)
 381   distance calculation:
 382   0 = JTT
 383   2 = BLOSUM 62
 384   3 = mtREV24
 385   5 = VT
 386   6 = WAG
 387   PAM otherwise
 388   After all the data has been calculated, variable "$MATRIX_FOR_PWD"
 389   in RIO1.x/perl/rio_module.pm needs to be set to the same value.
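
The mapping above, including the fallback to PAM for every other value,
can be written down as a lookup. This is only an illustration of the
table; "matrix_name" is a hypothetical name and not part of RIO.

```shell
#!/bin/sh
# Hypothetical helper: translate a $MY_MATRIX value into the model name
# listed above. Any value not listed selects PAM.

matrix_name() {
    case $1 in
        0) echo "JTT" ;;
        2) echo "BLOSUM 62" ;;
        3) echo "mtREV24" ;;
        5) echo "VT" ;;
        6) echo "WAG" ;;
        *) echo "PAM" ;;
    esac
}
```

For example, "matrix_name 2" prints "BLOSUM 62", and "matrix_name 1"
prints "PAM", since 1 is not among the listed values.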
 390
 391
 392
 393 Once pairwise distances are calculated, the following variables in
 394 "rio_module.pm" need to be set accordingly:
 395 $MATRIX_FOR_PWD     : corresponds to $MY_MATRIX in pfam2pwd.pl
 396 $RIO_PWD_DIRECTORY  : corresponds to $MY_RIO_PWD_DIRECTORY in pfam2pwd.pl
 397 $RIO_BSP_DIRECTORY  : corresponds to $MY_RIO_BSP_DIRECTORY in pfam2pwd.pl
 398 $RIO_NBD_DIRECTORY  : corresponds to $MY_RIO_NBD_DIRECTORY in pfam2pwd.pl
 399 $RIO_ALN_DIRECTORY  : corresponds to $MY_RIO_ALN_DIRECTORY in pfam2pwd.pl
 400 $RIO_HMM_DIRECTORY  : corresponds to $MY_RIO_HMM_DIRECTORY in pfam2pwd.pl
 401 If Pfam has been updated, the corresponding variables in rio_module.pm
 402 ($PFAM_FULL_DIRECTORY, etc.) need to be updated as well.
 403
 404
 405
 406
 407
 408
 409 IMPORTANT:  Need to redo steps 3., 4., 5., and 6. if species
 410             in the master species tree and/or the species list
 411             are added and/or changed or if a new version of Pfam is used!
 412
 413
 414
 415
 416 7. Example of a phylogenomic analysis using "rio.pl"
 417 ____________________________________________________
 418
 419
 420 Without using precalculated distances (for this, all the variables above
 421 need to point to the correct locations, in particular to your local and processed
 422 Pfam database):
 423
 424   % RIO1.1/perl/rio.pl 3 A=/path/to/my/pfam/Full/aconitase H=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out3 p I C E
 425
 426
 427
 428 Without using precalculated distances (for this, all the variables above
 429 need to point to the correct locations, in particular to your local and processed
 430 Pfam database) using a query sequence which is already in the alignment:
 431
 432   % RIO1.1/perl/rio.pl 4 A=/path/to/my/pfam/Full/aconitase N=LEU2_LACLA/5-449 O=out4 p I C E
 433
 434
 435
 436 Using the example precalculated distances in "/example_data/"
 437 ($RIO_PWD_DIRECTORY, etc. need to point to $PATH_TO_FORESTER."example_data/"):
 438
 439   % RIO1.1/perl/rio.pl 1 A=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out1 p I C E
 440
 441
 442
 443 Using a query sequence which is already in the precalculated distances in "/example_data/"
 444 ($RIO_PWD_DIRECTORY, etc. need to point to $PATH_TO_FORESTER."example_data/"):
 445
 446   % RIO1.1/perl/rio.pl 2 A=aconitase N=LEU2_LACLA/5-449 O=out2 p I C E
 447
 448
 449
 450 For detailed instructions on how to use rio.pl, see the source code
 451 or run "rio.pl" without any arguments.
 452
 453
 454
 455
 456 Christian Zmasek
 457 zmasek@genetics.wustl.edu
 458 05/26/02
 459