wiki/PhyloBioRuby.wiki

   1 #summary Tutorial for multiple sequence alignments and phylogenetic methods in BioRuby -- under development!
   2
   3
   4
   5 = Introduction =
   6
   7 Under development!
   8
   9 Tutorial for multiple sequence alignments and phylogenetic methods in [http://bioruby.open-bio.org/ BioRuby].
  10
  11 Eventually, this is expected to be placed on the official !BioRuby page.
  12
  13 Author: [http://www.cmzmasek.net/ Christian M Zmasek], Sanford-Burnham Medical Research Institute
  14
  15
  16 Copyright (C) 2011 Christian M Zmasek. All rights reserved.
  17
  18
  19 = Multiple Sequence Alignment =
  20
  21
  22 == Multiple Sequence Alignment Input and Output ==
  23
  24 === Reading in a Multiple Sequence Alignment from a File ===
  25
  26 This automatically determines the format
  27 {{{
  28 #!/usr/bin/env ruby
  29 require 'bio'
  30
  31 seq_ary = Array.new
  32 ff = Bio::FlatFile.auto('bcl2.fasta')
  33 ff.each_entry do |entry|
  34   seq_ary.push(entry)
  35   puts entry.entry_id          # prints the identifier of the entry
  36   puts entry.definition        # prints the definition of the entry
  37   puts entry.seq               # prints the sequence data of the entry
  38 end
  39
  40 # Creates a multiple sequence alignment (possibly unaligned) named
  41 # 'seqs' from array 'seq_ary'.
  42 seqs = Bio::Alignment.new(seq_ary)
  43 seqs.each { |seq| puts seq.to_s }
  44
  45 # Writes multiple sequence alignment (possibly unaligned) 'seqs'
  46 # to a file in PHYLIP format.
  47 File.open('out0.phylip', 'w') do |f|
  48   f.write(seqs.output(:phylip))
  49 end
  50
  51 # Writes multiple sequence alignment (possibly unaligned) 'seqs'
  52 # to a file in FASTA format.
  53 File.open('out0.fasta', 'w') do |f|
  54   f.write(seqs.output(:fasta))
  55 end
  56 }}}
  57
  58
  59 ==== ClustalW Format ====
  60
  61 The following example shows how to read in a *ClustalW*-formatted multiple sequence alignment.
  62
  63 {{{
  64 #!/usr/bin/env ruby
  65 require 'bio'
  66
  67 # Reads in a ClustalW-formatted multiple sequence alignment
  68 # from a file named "infile_clustalw.aln" and stores it in 'report'.
  69 report = Bio::ClustalW::Report.new(File.read('infile_clustalw.aln'))
  70
  71 # Accesses the actual alignment.
  72 msa = report.alignment
  73
  74 # Goes through all sequences in 'msa' and prints the
  75 # actual molecular sequence.
  76 msa.each do |entry|
  77   puts entry.seq
  78 end
  79 }}}
  80
  81 ==== FASTA Format ====
  82
  83 The following example shows how to read in a *FASTA*-formatted multiple sequence file. (_This seems a little clumsy, I wonder if there is a more direct way, avoiding the creation of an array.)
  84 {{{
  85 #!/usr/bin/env ruby
  86 require 'bio'
  87
  88 # Reads in a FASTA-formatted multiple sequence alignment (which does
  89 # not have to be aligned, though) and stores its sequences in
  90 # array 'seq_ary'.
  91 seq_ary = Array.new
  92 fasta_seqs = Bio::Alignment::MultiFastaFormat.new(File.open('infile.fasta').read)
  93 fasta_seqs.entries.each do |seq|
  94   seq_ary.push(seq)
  95 end
  96
  97 # Creates a multiple sequence alignment (possibly unaligned) named
  98 # 'seqs' from array 'seq_ary'.
  99 seqs = Bio::Alignment.new(seq_ary)
 100
 101 # Prints each sequence to the console.
 102 seqs.each { |seq| puts seq.to_s }
 103
 104 # Writes multiple sequence alignment (possibly unaligned) 'seqs'
 105 # to a file in PHYLIP format.
 106 File.open('outfile.phylip', 'w') do |f|
 107   f.write(seqs.output(:phylip))
 108 end
 109 }}}
 110
 111 Relevant API documentation:
 112
 113  * [http://bioruby.open-bio.org/rdoc/classes/Bio/ClustalW/Report.html Bio::ClustalW::Report]
 114  * [http://bioruby.open-bio.org/rdoc/classes/Bio/Alignment.html Bio::Alignment]
 115  * [http://bioruby.open-bio.org/rdoc/classes/Bio/Sequence.html Bio::Sequence]
 116
 117 === Writing a Multiple Sequence Alignment to a File ===
 118
 119
 120 The following example shows how to write a multiple sequence alignment in *FASTA*-format. It first creates a file named "outfile.fasta" for writing ('w') and then writes the multiple sequence alignment referred to by variable 'msa' to it in FASTA-format (':fasta').
 121
 122 {{{
 123 #!/usr/bin/env ruby
 124 require 'bio'
 125
 126 # Creates a new file named "outfile.fasta" and writes
 127 # multiple sequence alignment 'msa' to it in fasta format.
 128 File.open('outfile.fasta', 'w') do |f|
 129   f.write(msa.output(:fasta))
 130 end
 131 }}}
 132
 133 ==== Setting the Output Format ====
 134
 135 The following symbols determine the output format:
 136
 137   * `:clustal` for ClustalW
 138   * `:fasta` for FASTA
 139   * `:phylip` for PHYLIP interleaved (will truncate sequence names to no more than 10 characters)
 140   * `:phylipnon` for PHYLIP non-interleaved (will truncate sequence names to no more than 10 characters)
 141   * `:msf` for MSF
 142   * `:molphy` for Molphy
 143
 144
 145 For example, the following writes in PHYLIP's non-interleaved format:
 146
 147 {{{
 148 f.write(align.output(:phylipnon))
 149 }}}
 150
 151
 152 === Formatting of Individual Sequences ===
 153
 154 !BioRuby can format molecular sequences in a variety of formats.
 155 Individual sequences can be formatted to (e.g.) Genbank format as shown in the following examples.
 156
 157 For Sequence objects:
 158 {{{
 159 seq.to_seq.output(:genbank)
 160 }}}
 161
 162 For Bio::!FlatFile entries:
 163 {{{
 164 entry.to_biosequence.output(:genbank)
 165 }}}
 166
 167 The following symbols determine the output format:
 168   * `:genbank` for Genbank
 169   * `:embl` for EMBL
 170   * `:fasta` for FASTA
 171   * `:fasta_ncbi` for NCBI-type FASTA
 172   * `:raw` for raw sequence
 173   * `:fastq` for FASTQ (includes quality scores)
 174   * `:fastq_sanger` for Sanger-type FASTQ
 175   * `:fastq_solexa` for Solexa-type FASTQ
 176   * `:fastq_illumina` for Illumina-type FASTQ
 177
 178 == Calculating Multiple Sequence Alignments ==
 179
 180 !BioRuby can be used to execute a variety of multiple sequence alignment
 181 programs (such as [http://mafft.cbrc.jp/alignment/software/ MAFFT], [http://probcons.stanford.edu/ Probcons], [http://www.clustal.org/ ClustalW], [http://www.drive5.com/muscle/ Muscle], and [http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html T-Coffee]).
 182 In the following, examples for using the MAFFT and Muscle are shown.
 183
 184
 185 === MAFFT ===
 186
 187 The following example uses the MAFFT program to align four sequences
 188 and then prints the result to the screen.
 189 Please note that if the path to the MAFFT executable is properly set `mafft=Bio::MAFFT.new(options)` can be used instead of explicitly indicating the path as in the example.
 190
 191 {{{
 192 #!/usr/bin/env ruby
 193 require 'bio'
 194
 195 # 'seqs' is either an array of sequences or a multiple sequence
 196 # alignment. In general this is read in from a file as described in ?.
 197 # For the purpose of this tutorial, it is generated in code.
 198 seqs = ['MFQIPEFEPSEQEDSSSAER',
 199         'MGTPKQPSLAPAHALGLRKS',
 200         'PKQPSLAPAHALGLRKS',
 201         'MCSTSGCDLE']
 202
 203
 204 # Calculates the alignment using the MAFFT program on the local
 205 # machine with options '--maxiterate 1000 --localpair'
 206 # and stores the result in 'report'.
 207 options = ['--maxiterate', '1000', '--localpair']
 208 mafft = Bio::MAFFT.new('path/to/mafft', options)
 209 report = mafft.query_align(seqs)
 210
 211 # Accesses the actual alignment.
 212 align = report.alignment
 213
 214 # Prints each sequence to the console.
 215 align.each { |s| puts s.to_s }
 216
 217 }}}
 218
 219 References:
 220
 221  * Katoh, Toh (2008) "Recent developments in the MAFFT multiple sequence alignment program" Briefings in Bioinformatics 9:286-298
 222
 223  * Katoh, Toh 2010 (2010) "Parallelization of the MAFFT multiple sequence alignment program" Bioinformatics 26:1899-1900
 224
 225
 226
 227 === Muscle ===
 228
 229 {{{
 230 #!/usr/bin/env ruby
 231 require 'bio'
 232
 233 # 'seqs' is either an array of sequences or a multiple sequence
 234 # alignment. In general this is read in from a file as described in ?.
 235 # For the purpose of this tutorial, it is generated in code.
 236 seqs = ['MFQIPEFEPSEQEDSSSAER',
 237         'MGTPKQPSLAPAHALGLRKS',
 238         'PKQPSLAPAHALGLRKS',
 239         'MCSTSGCDLE']
 240
 241 # Calculates the alignment using the Muscle program on the local
 242 # machine with options '-quiet -maxiters 64'
 243 # and stores the result in 'report'.
 244 options = ['-quiet', '-maxiters', '64']
 245 muscle = Bio::Muscle.new('path/to/muscle', options)
 246 report = muscle.query_align(seqs)
 247
 248 # Accesses the actual alignment.
 249 align = report.alignment
 250
 251 # Prints each sequence to the console.
 252 align.each { |s| puts s.to_s }
 253
 254 }}}
 255
 256 References:
 257
 258  * Edgar, R.C. (2004) "MUSCLE: multiple sequence alignment with high accuracy and high throughput" Nucleic Acids Res 32(5):1792-1797
 259
 260 === Other Programs ===
 261
 262 _need more detail here..._
 263
 264 [http://probcons.stanford.edu/ Probcons], [http://www.clustal.org/ ClustalW], and [http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html T-Coffee] can be used in the same manner as the programs above.
 265
 266
 267
 268 == Manipulating Multiple Sequence Alignments ==
 269
 270 Oftentimes, multiple sequence to be used for phylogenetic inference are 'cleaned up' in some manner. For instance, some researchers prefer to delete columns with more than 50% gaps. The following code is an example of how to do that in !BioRuby.
 271
 272
 273 _... to be done_
 274
 275 {{{
 276 #!/usr/bin/env ruby
 277 require 'bio'
 278
 279 }}}
 280
 281
 282 ----
 283
 284 = Phylogenetic Trees =
 285
 286
 287 == Phylogenetic Tree Input and Output ==
 288
 289
 290 === Reading in of Phylogenetic Trees ===
 291
 292
 293
 294 ====Newick or New Hampshire Format====
 295
 296 _... to be done_
 297
 298 {{{
 299 #!/usr/bin/env ruby
 300 require 'bio'
 301
 302 }}}
 303
 304 ====phyloXML Format====
 305
 306 Partially copied from [https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation Diana Jaunzeikare's documentation].
 307
 308 In addition to !BioRuby, a libxml Ruby binding is also required. This can be installed with the following command:
 309
 310 {{{
 311 % gem install -r libxml-ruby
 312 }}}
 313
 314 This example reads file "example.xml" and stores its [http://www.phyloxml.org/ phyloXML]-formatted trees in variable 'trees'.
 315
 316 {{{
 317 #!/usr/bin/env ruby
 318 require 'bio'
 319
 320 # This creates new phyloXML parser.
 321 trees = Bio::PhyloXML::Parser.new('example.xml')
 322
 323 # This prints the names of all trees in the file.
 324 trees.each do |tree|
 325   puts tree.name
 326 end
 327
 328 # If there are several trees in the file, you can access the one you wish via index.
 329 tree = trees[3]
 330
 331 }}}
 332
 333
 334 ====Nexus  Format====
 335
 336 _... to be done_
 337
 338 {{{
 339 #!/usr/bin/env ruby
 340 require 'bio'
 341
 342 }}}
 343
 344 === Writing of Phylogenetic Trees ===
 345
 346 ====Newick or New Hampshire Format====
 347
 348 _... to be done_
 349
 350 {{{
 351 #!/usr/bin/env ruby
 352 require 'bio'
 353
 354 }}}
 355
 356 ====phyloXML Format====
 357
 358 Partially copied from [https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation Diana Jaunzeikare's documentation].
 359
 360 In addition to !BioRuby, a libxml Ruby binding is also required. This can be installed with the following command:
 361
 362 {{{
 363 % gem install -r libxml-ruby
 364 }}}
 365
 366 This example writes trees 'tree1' and 'tree2' to file "tree.xml" in [http://www.phyloxml.org/ phyloXML] format.
 367
 368 {{{
 369 #!/usr/bin/env ruby
 370 require 'bio'
 371
 372 # this creates new phyloXML writer.
 373 writer = Bio::PhyloXML::Writer.new('tree.xml')
 374
 375 # Writes tree to the file "tree.xml".
 376 writer.write(tree1)
 377
 378 # Adds another tree to the file.
 379 writer.write(tree2)
 380
 381
 382 }}}
 383
 384
 385 ====Nexus  Format====
 386
 387 _... to be done_
 388
 389 {{{
 390 #!/usr/bin/env ruby
 391 require 'bio'
 392
 393 }}}
 394
 395
 396 = Phylogenetic Inference =
 397
 398 _Currently !BioRuby does not contain wrappers for phylogenetic inference programs, thus I am progress of writing a RAxML wrapper followed by a wrapper for FastME..._
 399
 400 == Optimality Criteria Based on Character Data ==
 401
 402 Character data based methods work directly on molecular sequences and thus do not require the calculation of pairwise distances but tend to be time consuming and sensitive to errors in the multiple sequence alignment.
 403
 404 === Maximum Likelihood ===
 405
 406 ==== RAxML ====
 407
 408 _... to be done_
 409
 410 {{{
 411 #!/usr/bin/env ruby
 412 require 'bio'
 413
 414 }}}
 415
 416
 417 ==== PhyML ====
 418
 419 _... to be done_
 420
 421 {{{
 422 #!/usr/bin/env ruby
 423 require 'bio'
 424
 425 }}}
 426
 427 === Maximum Parsimony ===
 428
 429 Currently no direct support in !BioRuby.
 430
 431
 432 === Bayesian Inference ===
 433
 434 Currently no direct support in !BioRuby.
 435
 436
 437 == Pairwise Distance Based Methods ==
 438
 439 === Pairwise Sequence Distance Estimation ===
 440
 441 _... to be done_
 442
 443 {{{
 444 #!/usr/bin/env ruby
 445 require 'bio'
 446
 447 }}}
 448
 449
 450 === Optimality Criteria Based on Pairwise Distances ===
 451
 452
 453 ==== Minimal Evolution: FastME ====
 454
 455 _... to be done_
 456
 457 {{{
 458 #!/usr/bin/env ruby
 459 require 'bio'
 460
 461 }}}
 462
 463 === Algorithmic Methods Based on Pairwise Distances ===
 464
 465 ==== Neighbor Joining and Related Methods ====
 466
 467 _... to be done_
 468
 469 {{{
 470 #!/usr/bin/env ruby
 471 require 'bio'
 472
 473 }}}
 474
 475
 476
 477
 478
 479
 480
 481 == Support Calculation? ==
 482
 483 === Bootstrap Resampling? ===
 484
 485
 486 ----
 487
 488 = Analyzing Phylogenetic Trees =
 489
 490 == PAML ==
 491
 492
 493 == Gene Duplication Inference ==
 494
 495 _need to further test and then import GSoC 'SDI' work..._
 496
 497
 498 == Others? ==
 499
 500
 501 ----
 502
 503 = Putting It All Together =
 504
 505 Example of a small "pipeline"-type program running a mininal phyogenetic analysis: starting with a set of sequences and ending with a phylogenetic tree.
 506