wiki/PhyloBioRuby.wiki

#summary Tutorial for multiple sequence alignments and phylogenetic methods in BioRuby -- under development!



= Introduction =

Under development!

Tutorial for multiple sequence alignments and phylogenetic methods in [http://bioruby.open-bio.org/ BioRuby].

Eventually, this is expected to be placed on the official !BioRuby page.

Author: [http://www.cmzmasek.net/ Christian M Zmasek], Sanford-Burnham Medical Research Institute


Copyright (C) 2011 Christian M Zmasek. All rights reserved.


= Multiple Sequence Alignment =


== Multiple Sequence Alignment Input and Output ==

=== Reading in a Multiple Sequence Alignment from a File ===

The following example shows how to read in a *ClustalW*-formatted multiple sequence alignment.

{{{
#!/usr/bin/env ruby
require 'bio'

# Reads in a ClustalW-formatted multiple sequence alignment
# from a file named "infile_clustalw.aln" and stores it in 'report'.
report = Bio::ClustalW::Report.new(File.read('infile_clustalw.aln'))

# Accesses the actual alignment.
msa = report.alignment

# Goes through all sequences in 'msa' and prints the
# actual molecular sequence.
msa.each do |entry|
  puts entry.seq
end
}}}



=== Writing a Multiple Sequence Alignment to a File ===


The following example shows how to write a multiple sequence alignment in *FASTA* format. It first opens a file named "outfile.fasta" for writing ('w') and then writes the multiple sequence alignment referred to by variable 'msa' to it in FASTA format (':fasta').

{{{
#!/usr/bin/env ruby
require 'bio'

# 'msa' is a multiple sequence alignment, for instance one
# obtained via 'report.alignment' as in the previous example.

# Creates a new file named "outfile.fasta" and writes
# multiple sequence alignment 'msa' to it in FASTA format.
File.open('outfile.fasta', 'w') do |f|
  f.write(msa.output(:fasta))
end
}}}

==== Setting the Output Format ====

The following symbols determine the output format:

  * `:clustal` for ClustalW
  * `:fasta` for FASTA
  * `:phylip` for PHYLIP interleaved (will truncate sequence names to no more than 10 characters)
  * `:phylipnon` for PHYLIP non-interleaved (will truncate sequence names to no more than 10 characters)
  * `:msf` for MSF
  * `:molphy` for Molphy


For example, the following line, placed inside the `File.open` block of the previous example, writes 'msa' in PHYLIP's non-interleaved format:

{{{
f.write(msa.output(:phylipnon))
}}}
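
Because both PHYLIP formats keep no more than 10 characters of each sequence name, names that differ only after the tenth character can no longer be told apart in the output. The following is a minimal plain-Ruby sketch of a check that could be run before writing. It does not use any !BioRuby API; the method name `truncated_name_collisions` and the sample names are made up for illustration.

{{{
#!/usr/bin/env ruby

# Hypothetical helper, not part of BioRuby: groups sequence names by
# their first 'max_len' characters and returns only those groups in
# which two or more distinct names share the same truncated form.
def truncated_name_collisions(names, max_len = 10)
  names.uniq
       .group_by { |name| name[0, max_len] }
       .select { |_short, full_names| full_names.size > 1 }
end

# Made-up example names.
names = ['Homo_sapiens_BCL2', 'Homo_sapiens_BAX', 'Mus_musculus_BCL2']

# Prints each truncated name together with the full names it stands for.
truncated_name_collisions(names).each do |short, full_names|
  puts "#{short}: #{full_names.join(', ')}"
end
}}}

If the returned hash is empty, truncation to 10 characters leaves all names distinct.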


=== Formatting of Individual Sequences ===

_... to be done_

!BioRuby can format molecular sequences in a variety of formats.
Individual sequences can be formatted to (e.g.) Genbank format as shown in the following examples.

For Sequence objects:
{{{
seq.to_seq.output(:genbank)
}}}

For Bio::!FlatFile entries:
{{{
entry.to_biosequence.output(:genbank)
}}}

The following symbols determine the output format:

  * `:genbank` for Genbank
  * `:fasta` for FASTA
  * `:embl` for EMBL


== Calculating Multiple Sequence Alignments ==

!BioRuby can be used to execute a variety of multiple sequence alignment
programs (such as [http://mafft.cbrc.jp/alignment/software/ MAFFT], [http://probcons.stanford.edu/ Probcons], [http://www.clustal.org/ ClustalW], [http://www.drive5.com/muscle/ Muscle], and [http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html T-Coffee]).
The following sections show examples for MAFFT and Muscle.


=== MAFFT ===

The following example uses the MAFFT program to align four sequences
and then prints the result to the screen.
Note that if the MAFFT executable can be found via the system's search path, the program name alone (`mafft = Bio::MAFFT.new('mafft', options)`) can be used instead of explicitly indicating the path as in the example.

{{{
#!/usr/bin/env ruby
require 'bio'

# 'seqs' is either an array of sequences or a multiple sequence
# alignment. In general this is read in from a file as described in ?.
# For the purpose of this tutorial, it is generated in code.
seqs = ['MFQIPEFEPSEQEDSSSAER',
        'MGTPKQPSLAPAHALGLRKS',
        'PKQPSLAPAHALGLRKS',
        'MCSTSGCDLE']

# Calculates the alignment using the MAFFT program on the local
# machine with options '--maxiterate 1000 --localpair'
# and stores the result in 'report'.
options = ['--maxiterate', '1000', '--localpair']
mafft = Bio::MAFFT.new('path/to/mafft', options)
report = mafft.query_align(seqs)

# Accesses the actual alignment.
align = report.alignment

# Prints each sequence to the console.
align.each { |s| puts s.to_s }
}}}

References:

 * Katoh, Toh (2008) "Recent developments in the MAFFT multiple sequence alignment program" Briefings in Bioinformatics 9:286-298

 * Katoh, Toh (2010) "Parallelization of the MAFFT multiple sequence alignment program" Bioinformatics 26:1899-1900



=== Muscle ===

The following example uses the Muscle program to align the same four sequences
and then prints the result to the screen.

{{{
#!/usr/bin/env ruby
require 'bio'

# 'seqs' is either an array of sequences or a multiple sequence
# alignment. In general this is read in from a file as described in ?.
# For the purpose of this tutorial, it is generated in code.
seqs = ['MFQIPEFEPSEQEDSSSAER',
        'MGTPKQPSLAPAHALGLRKS',
        'PKQPSLAPAHALGLRKS',
        'MCSTSGCDLE']

# Calculates the alignment using the Muscle program on the local
# machine with options '-quiet -maxiters 64'
# and stores the result in 'report'.
options = ['-quiet', '-maxiters', '64']
muscle = Bio::Muscle.new('path/to/muscle', options)
report = muscle.query_align(seqs)

# Accesses the actual alignment.
align = report.alignment

# Prints each sequence to the console.
align.each { |s| puts s.to_s }
}}}

References:

 * Edgar, R.C. (2004) "MUSCLE: multiple sequence alignment with high accuracy and high throughput" Nucleic Acids Res 32(5):1792-1797

=== Other Programs ===

_need more detail here..._

[http://probcons.stanford.edu/ Probcons], [http://www.clustal.org/ ClustalW], and [http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html T-Coffee] can be used in the same manner as the programs above.



== Manipulating Multiple Sequence Alignments ==

Oftentimes, multiple sequence alignments to be used for phylogenetic inference are 'cleaned up' in some manner. For instance, some researchers prefer to delete columns with more than 50% gaps. The following code is an example of how to do that in !BioRuby.


_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}
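
Independent of any particular !BioRuby API, the column-filtering idea itself can be sketched in plain Ruby. The sketch below assumes that the aligned sequences are available as an array of equal-length strings with '-' as the gap character (for instance collected while iterating over an alignment with `each`); the method name `remove_gappy_columns` and the sample alignment are hypothetical.

{{{
#!/usr/bin/env ruby

# Hypothetical helper, not part of BioRuby: removes every column in
# which the fraction of gap characters exceeds 'max_gap_fraction'.
# 'rows' is an array of aligned sequences (strings of equal length).
def remove_gappy_columns(rows, max_gap_fraction = 0.5, gap_char = '-')
  return [] if rows.empty?

  length = rows.first.length
  unless rows.all? { |row| row.length == length }
    raise ArgumentError, 'aligned sequences must have equal length'
  end

  # Collects the indices of the columns to keep.
  keep = (0...length).select do |col|
    gaps = rows.count { |row| row[col] == gap_char }
    gaps.to_f / rows.size <= max_gap_fraction
  end

  # Rebuilds each sequence from the kept columns only.
  rows.map { |row| keep.map { |col| row[col] }.join }
end

# Made-up alignment of four sequences.
rows = ['MK-LV-A',
        'MK-LVQA',
        'M--LV-A',
        'MKTLV-A']

puts remove_gappy_columns(rows)
}}}

Columns with exactly 50% gaps are kept, since only columns with more than 50% gaps are to be deleted.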


----

= Phylogenetic Trees =


== Phylogenetic Tree Input and Output ==


=== Reading in of Phylogenetic Trees ===



==== Newick or New Hampshire Format ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

==== phyloXML Format ====

Partially copied from [https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation Diana Jaunzeikare's documentation].

In addition to !BioRuby, a libxml Ruby binding is also required. This can be installed with the following command:

{{{
% gem install -r libxml-ruby
}}}

This example opens file "example.xml" with a [http://www.phyloxml.org/ phyloXML] parser, referred to by variable 'trees', through which the trees stored in the file can be accessed.

{{{
#!/usr/bin/env ruby
require 'bio'

# Creates a new phyloXML parser for the file "example.xml".
trees = Bio::PhyloXML::Parser.open('example.xml')

# Prints the names of all trees in the file.
trees.each do |tree|
  puts tree.name
end

# If there are several trees in the file, a specific one can be
# accessed via its index (counting starts at 0).
tree = trees[3]
}}}


==== Nexus Format ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

=== Writing of Phylogenetic Trees ===

==== Newick or New Hampshire Format ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

==== phyloXML Format ====

Partially copied from [https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation Diana Jaunzeikare's documentation].

In addition to !BioRuby, a libxml Ruby binding is also required. This can be installed with the following command:

{{{
% gem install -r libxml-ruby
}}}

This example writes trees 'tree1' and 'tree2' to file "tree.xml" in [http://www.phyloxml.org/ phyloXML] format.

{{{
#!/usr/bin/env ruby
require 'bio'

# 'tree1' and 'tree2' are assumed to be already existing tree objects.

# Creates a new phyloXML writer for the file "tree.xml".
writer = Bio::PhyloXML::Writer.new('tree.xml')

# Writes the first tree to the file "tree.xml".
writer.write(tree1)

# Adds another tree to the file.
writer.write(tree2)
}}}


==== Nexus Format ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}


= Phylogenetic Inference =

_Currently, !BioRuby does not contain wrappers for phylogenetic inference programs; I am therefore in the process of writing a RAxML wrapper, to be followed by a wrapper for FastME..._

== Optimality Criteria Based on Character Data ==

Character-data-based methods work directly on molecular sequences and thus do not require the calculation of pairwise distances, but they tend to be time-consuming and sensitive to errors in the multiple sequence alignment.

=== Maximum Likelihood ===

==== RAxML ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}


==== PhyML ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

=== Maximum Parsimony ===

Currently no direct support in !BioRuby.


=== Bayesian Inference ===

Currently no direct support in !BioRuby.


== Pairwise Distance Based Methods ==

=== Pairwise Sequence Distance Estimation ===

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}
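
To illustrate what a pairwise distance is, the following plain-Ruby sketch computes the simplest such measure, the uncorrected p-distance (the fraction of differing positions), for every pair of aligned sequences. It is only an illustration under stated assumptions, not a !BioRuby API and not a substitute for model-based distance estimation: the method names `p_distance` and `distance_matrix`, the treatment of gaps (columns with a gap in either sequence are skipped), and the sample sequences are all choices made for this example.

{{{
#!/usr/bin/env ruby

# Hypothetical helper, not part of BioRuby: the uncorrected
# p-distance, i.e. the fraction of differing positions among the
# columns in which neither sequence has a gap.
def p_distance(seq_a, seq_b, gap_char = '-')
  unless seq_a.length == seq_b.length
    raise ArgumentError, 'sequences must be aligned (equal length)'
  end

  compared = 0
  different = 0
  seq_a.chars.zip(seq_b.chars).each do |a, b|
    next if a == gap_char || b == gap_char
    compared += 1
    different += 1 if a != b
  end
  raise ArgumentError, 'no gap-free columns to compare' if compared.zero?

  different.to_f / compared
end

# Builds the symmetric matrix of all pairwise p-distances.
def distance_matrix(rows)
  rows.map { |a| rows.map { |b| p_distance(a, b) } }
end

# Made-up aligned sequences.
rows = ['ACGTACGT',
        'ACGTACGA',
        'AC-TTCGA']

# Prints the matrix with three decimal places per entry.
distance_matrix(rows).each do |row|
  puts row.map { |d| format('%.3f', d) }.join(' ')
end
}}}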


=== Optimality Criteria Based on Pairwise Distances ===


==== Minimum Evolution: FastME ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

=== Algorithmic Methods Based on Pairwise Distances ===

==== Neighbor Joining and Related Methods ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}


== Support Calculation? ==

=== Bootstrap Resampling? ===
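
In nonparametric bootstrap resampling, pseudo-replicate alignments are created by sampling the columns of the original alignment with replacement; a tree is then inferred from each replicate. The resampling step alone can be sketched in plain Ruby as follows. This is not a !BioRuby API: the method name `bootstrap_replicate`, the representation of the alignment as an array of equal-length strings, and the sample sequences are assumptions made for this example.

{{{
#!/usr/bin/env ruby

# Hypothetical helper, not part of BioRuby: creates one bootstrap
# pseudo-replicate by sampling alignment columns with replacement
# until the original alignment length is reached.
def bootstrap_replicate(rows, rng = Random.new)
  return [] if rows.empty?

  length = rows.first.length
  # Draws 'length' column indices, each uniformly from 0...length.
  columns = Array.new(length) { rng.rand(length) }
  rows.map { |row| columns.map { |col| row[col] }.join }
end

# Made-up aligned sequences.
rows = ['ACGTACGT',
        'ACGTACGA',
        'AC-TTCGA']

# A fixed seed makes the resampling reproducible.
rng = Random.new(42)
3.times do |i|
  puts "Replicate #{i + 1}:"
  puts bootstrap_replicate(rows, rng)
end
}}}

The same column indices are applied to every sequence, so each column of a replicate is an intact column of the original alignment.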


----

= Analyzing Phylogenetic Trees =

== PAML ==


== Gene Duplication Inference ==

_need to further test and then import GSoC 'SDI' work..._


== Others? ==


----

= Putting It All Together =

Example of a small "pipeline"-type program running a minimal phylogenetic analysis: starting with a set of sequences and ending with a phylogenetic tree.
