#summary Tutorial for multiple sequence alignments and phylogenetic methods in BioRuby -- under development!

= Introduction =

Under development!

Tutorial for multiple sequence alignments and phylogenetic methods in [http://bioruby.open-bio.org/ BioRuby].

Eventually, this is expected to be placed on the official !BioRuby page.

Author: [http://www.cmzmasek.net/ Christian M Zmasek], Sanford-Burnham Medical Research Institute

Copyright (C) 2011 Christian M Zmasek. All rights reserved.

= Multiple Sequence Alignment =

== Multiple Sequence Alignment Input and Output ==

=== Reading in a Multiple Sequence Alignment from a File ===

`Bio::FlatFile.auto` determines the file format automatically:
{{{
#!/usr/bin/env ruby
require 'bio'

# Opens "bcl2.fasta", detecting its format automatically.
ff = Bio::FlatFile.auto('bcl2.fasta')
ff.each_entry do |entry|
  puts entry.entry_id          # identifier of the entry
  puts entry.definition        # definition of the entry
  puts entry.seq               # sequence data of the entry
end
}}}

==== ClustalW Format ====

The following example shows how to read in a *ClustalW*-formatted multiple sequence alignment.

{{{
#!/usr/bin/env ruby
require 'bio'

# Reads in a ClustalW-formatted multiple sequence alignment
# from a file named "infile_clustalw.aln" and stores it in 'report'.
report = Bio::ClustalW::Report.new(File.read('infile_clustalw.aln'))

# Accesses the actual alignment.
msa = report.alignment

# Goes through all sequences in 'msa' and prints the
# actual molecular sequence.
msa.each do |entry|
  puts entry.seq
end
}}}

==== FASTA Format ====

The following example shows how to read in a *FASTA*-formatted multiple sequence file. (_This seems a little clumsy; I wonder whether there is a more direct way that avoids creating an array._)
{{{
#!/usr/bin/env ruby
require 'bio'

# Reads in a FASTA-formatted multiple sequence alignment (which does
# not have to be aligned, though) and stores its sequences in
# array 'seq_ary'.
seq_ary = Array.new
fasta_seqs = Bio::Alignment::MultiFastaFormat.new(File.read('infile.fasta'))
fasta_seqs.entries.each do |seq|
  seq_ary.push(seq)
end

# Creates a multiple sequence alignment (possibly unaligned) named
# 'seqs' from array 'seq_ary'.
seqs = Bio::Alignment.new(seq_ary)

# Prints each sequence to the console.
seqs.each { |seq| puts seq.to_s }

# Writes multiple sequence alignment (possibly unaligned) 'seqs'
# to a file in PHYLIP format.
File.open('outfile.phylip', 'w') do |f|
  f.write(seqs.output(:phylip))
end
}}}
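
The note above asks for a more direct route. One candidate is to let the parser build the alignment itself, without an intermediate array. This is a sketch only, and it assumes that the installed !BioRuby version provides `Bio::Alignment::MultiFastaFormat#alignment`. The FASTA text is given inline only to keep the sketch self-contained; the sequence names and residues are made up for illustration.

{{{
#!/usr/bin/env ruby
require 'bio'

# Illustrative FASTA text (made-up sequences); in practice this would
# come from File.read('infile.fasta').
fasta_text = <<FASTA
>seq_a
MFQIPEFEPSEQ
>seq_b
MGTPKQ-SLAPA
FASTA

# Parses the FASTA text and asks the parser for an alignment object
# directly (assumes MultiFastaFormat#alignment is available).
seqs = Bio::Alignment::MultiFastaFormat.new(fasta_text).alignment

# Prints each sequence to the console.
seqs.each { |seq| puts seq.to_s }
}}}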

Relevant API documentation:

 * [http://bioruby.open-bio.org/rdoc/classes/Bio/ClustalW/Report.html Bio::ClustalW::Report]
 * [http://bioruby.open-bio.org/rdoc/classes/Bio/Alignment.html Bio::Alignment]
 * [http://bioruby.open-bio.org/rdoc/classes/Bio/Sequence.html Bio::Sequence]

=== Writing a Multiple Sequence Alignment to a File ===

The following example shows how to write a multiple sequence alignment in *FASTA* format. It first creates a file named "outfile.fasta" for writing ('w') and then writes the multiple sequence alignment referred to by variable 'msa' to it in FASTA format (':fasta').

{{{
#!/usr/bin/env ruby
require 'bio'

# 'msa' is a multiple sequence alignment, read in as described above.
# Creates a new file named "outfile.fasta" and writes
# multiple sequence alignment 'msa' to it in FASTA format.
File.open('outfile.fasta', 'w') do |f|
  f.write(msa.output(:fasta))
end
}}}

==== Setting the Output Format ====

The following symbols determine the output format:

  * `:clustal` for ClustalW
  * `:fasta` for FASTA
  * `:phylip` for PHYLIP interleaved (will truncate sequence names to no more than 10 characters)
  * `:phylipnon` for PHYLIP non-interleaved (will truncate sequence names to no more than 10 characters)
  * `:msf` for MSF
  * `:molphy` for Molphy

For example, the following writes in PHYLIP's non-interleaved format:

{{{
f.write(align.output(:phylipnon))
}}}
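
Because the PHYLIP writers shorten names to at most 10 characters, it can be convenient to assign short identifiers up front and keep a lookup table for mapping labels back later. The following is a minimal sketch in plain Ruby; `short_id_table` and the example names are hypothetical and not part of !BioRuby, and the sketch does not reproduce !BioRuby's own truncation.

{{{
#!/usr/bin/env ruby

# Hypothetical helper (not part of BioRuby): maps a unique 10-character
# identifier ("seq0000001", "seq0000002", ...) to each original name.
def short_id_table(names)
  names.each_with_index.map { |name, i| [format('seq%07d', i + 1), name] }.to_h
end

# Made-up sequence names, for illustration only.
names = ['Homo_sapiens_BCL2', 'Homo_sapiens_BAX', 'Mus_musculus_BCL2']
table = short_id_table(names)

# Prints the lookup table, e.g. for saving alongside the PHYLIP file.
table.each { |short_id, name| puts "#{short_id}\t#{name}" }
}}}

The short identifiers would then serve as sequence names when the alignment is built, and the table would be consulted when labeling the resulting tree.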

=== Formatting of Individual Sequences ===

!BioRuby can format molecular sequences in a variety of formats.
Individual sequences can be formatted to (e.g.) GenBank format as shown in the following examples.

For Sequence objects:
{{{
seq.to_seq.output(:genbank)
}}}

For Bio::!FlatFile entries:
{{{
entry.to_biosequence.output(:genbank)
}}}

The following symbols determine the output format:
  * `:genbank` for GenBank
  * `:embl` for EMBL
  * `:fasta` for FASTA
  * `:fasta_ncbi` for NCBI-type FASTA
  * `:raw` for raw sequence
  * `:fastq` for FASTQ (includes quality scores)
  * `:fastq_sanger` for Sanger-type FASTQ
  * `:fastq_solexa` for Solexa-type FASTQ
  * `:fastq_illumina` for Illumina-type FASTQ

== Calculating Multiple Sequence Alignments ==

!BioRuby can be used to execute a variety of multiple sequence alignment
programs (such as [http://mafft.cbrc.jp/alignment/software/ MAFFT], [http://probcons.stanford.edu/ Probcons], [http://www.clustal.org/ ClustalW], [http://www.drive5.com/muscle/ Muscle], and [http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html T-Coffee]).
In the following, examples for using MAFFT and Muscle are shown.

=== MAFFT ===

The following example uses the MAFFT program to align four sequences
and then prints the result to the screen.
Please note that if the MAFFT executable is on the search path, `mafft = Bio::MAFFT.new('mafft', options)` can be used instead of explicitly indicating the full path as in the example.

{{{
#!/usr/bin/env ruby
require 'bio'

# 'seqs' is either an array of sequences or a multiple sequence
# alignment. In general this is read in from a file as described above.
# For the purpose of this tutorial, it is generated in code.
seqs = ['MFQIPEFEPSEQEDSSSAER',
        'MGTPKQPSLAPAHALGLRKS',
        'PKQPSLAPAHALGLRKS',
        'MCSTSGCDLE']

# Calculates the alignment using the MAFFT program on the local
# machine with options '--maxiterate 1000 --localpair'
# and stores the result in 'report'.
options = ['--maxiterate', '1000', '--localpair']
mafft = Bio::MAFFT.new('path/to/mafft', options)
report = mafft.query_align(seqs)

# Accesses the actual alignment.
align = report.alignment

# Prints each sequence to the console.
align.each { |s| puts s.to_s }
}}}

References:

 * Katoh, Toh (2008) "Recent developments in the MAFFT multiple sequence alignment program" Briefings in Bioinformatics 9:286-298

 * Katoh, Toh (2010) "Parallelization of the MAFFT multiple sequence alignment program" Bioinformatics 26:1899-1900

=== Muscle ===

The following example uses the Muscle program to align the same four sequences.

{{{
#!/usr/bin/env ruby
require 'bio'

# 'seqs' is either an array of sequences or a multiple sequence
# alignment. In general this is read in from a file as described above.
# For the purpose of this tutorial, it is generated in code.
seqs = ['MFQIPEFEPSEQEDSSSAER',
        'MGTPKQPSLAPAHALGLRKS',
        'PKQPSLAPAHALGLRKS',
        'MCSTSGCDLE']

# Calculates the alignment using the Muscle program on the local
# machine with options '-quiet -maxiters 64'
# and stores the result in 'report'.
options = ['-quiet', '-maxiters', '64']
muscle = Bio::Muscle.new('path/to/muscle', options)
report = muscle.query_align(seqs)

# Accesses the actual alignment.
align = report.alignment

# Prints each sequence to the console.
align.each { |s| puts s.to_s }
}}}

References:

 * Edgar, R.C. (2004) "MUSCLE: multiple sequence alignment with high accuracy and high throughput" Nucleic Acids Res 32(5):1792-1797

=== Other Programs ===

_need more detail here..._

[http://probcons.stanford.edu/ Probcons], [http://www.clustal.org/ ClustalW], and [http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html T-Coffee] can be used in the same manner as the programs above.

== Manipulating Multiple Sequence Alignments ==

Oftentimes, multiple sequence alignments to be used for phylogenetic inference are 'cleaned up' in some manner. For instance, some researchers prefer to delete columns with more than 50% gaps. The following code is an example of how to do that in !BioRuby.

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}
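
Until the example above is filled in, the idea of deleting columns with more than 50% gaps can be sketched in plain Ruby on an array of aligned strings. This is only an illustration: `remove_gappy_columns` is a hypothetical helper, not a !BioRuby method, and the sequences are made up.

{{{
#!/usr/bin/env ruby

# Hypothetical helper (not part of BioRuby): drops every alignment column
# in which the fraction of gap characters exceeds 'max_gap_fraction'.
def remove_gappy_columns(seqs, max_gap_fraction = 0.5, gap_char = '-')
  return [] if seqs.empty?

  length = seqs.first.length
  unless seqs.all? { |s| s.length == length }
    raise ArgumentError, 'sequences differ in length'
  end

  # Indices of the columns that are kept.
  keep = (0...length).select do |col|
    gaps = seqs.count { |s| s[col] == gap_char }
    gaps.to_f / seqs.size <= max_gap_fraction
  end

  seqs.map { |s| keep.map { |col| s[col] }.join }
end

# Made-up aligned sequences, for illustration only.
aligned = ['MF-Q-PE',
           'MG-Q-PS',
           'M--QAPS',
           'MC---PS']

cleaned = remove_gappy_columns(aligned)
cleaned.each { |s| puts s }
}}}

The cleaned strings could then be handed to `Bio::Alignment.new`, in the same way as the sequence arrays used earlier in this tutorial.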

----

= Phylogenetic Trees =

== Phylogenetic Tree Input and Output ==

=== Reading in of Phylogenetic Trees ===

==== Newick or New Hampshire Format ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}
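
As a stopgap for the unfinished example above, the following minimal sketch parses a Newick string with `Bio::Newick` and lists the leaf names. The tree string is made up and given inline to keep the sketch self-contained; reading it from a file named "infile.nwk" is an assumption shown only in the comment.

{{{
#!/usr/bin/env ruby
require 'bio'

# Illustrative Newick text; in practice this would come from
# File.read('infile.nwk') (hypothetical file name).
newick_text = '((A:0.1,B:0.2):0.3,C:0.4);'

# Parses the Newick text and accesses the resulting Bio::Tree.
tree = Bio::Newick.new(newick_text).tree

# Collects and prints the names of all external nodes (leaves).
leaf_names = tree.leaves.map { |node| node.name }
puts leaf_names.sort
}}}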

==== phyloXML Format ====

Partially copied from [https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation Diana Jaunzeikare's documentation].

In addition to !BioRuby, a libxml Ruby binding is also required. This can be installed with the following command:

{{{
% gem install -r libxml-ruby
}}}

This example reads file "example.xml" and stores its [http://www.phyloxml.org/ phyloXML]-formatted trees in variable 'trees'.

{{{
#!/usr/bin/env ruby
require 'bio'

# Creates a new phyloXML parser for file "example.xml".
trees = Bio::PhyloXML::Parser.open('example.xml')

# Prints the names of all trees in the file.
trees.each do |tree|
  puts tree.name
end

# If there are several trees in the file, a specific one can be accessed
# via its index. The parser reads the file sequentially, so a fresh parser
# is opened here because the loop above has already consumed all trees.
trees = Bio::PhyloXML::Parser.open('example.xml')
tree = trees[3]
}}}

==== Nexus Format ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

=== Writing of Phylogenetic Trees ===

==== Newick or New Hampshire Format ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}
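
Pending the example above, here is a minimal sketch of writing a tree in Newick format using `Bio::Tree#output_newick`. The tree is built from a made-up Newick string so that the sketch is self-contained, and the output file name "outfile.nwk" is an assumption.

{{{
#!/usr/bin/env ruby
require 'bio'

# Builds a small illustrative tree (made-up topology and branch lengths).
tree = Bio::Newick.new('((A:0.1,B:0.2):0.3,C:0.4);').tree

# Converts the tree to Newick-formatted text.
newick_text = tree.output_newick

# Writes the text to a file named "outfile.nwk" (hypothetical name).
File.open('outfile.nwk', 'w') do |f|
  f.write(newick_text)
end
}}}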

==== phyloXML Format ====

Partially copied from [https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation Diana Jaunzeikare's documentation].

In addition to !BioRuby, a libxml Ruby binding is also required. This can be installed with the following command:

{{{
% gem install -r libxml-ruby
}}}

This example writes trees 'tree1' and 'tree2' to file "tree.xml" in [http://www.phyloxml.org/ phyloXML] format.

{{{
#!/usr/bin/env ruby
require 'bio'

# 'tree1' and 'tree2' are phyloXML trees, e.g. read in as described above.
# Creates a new phyloXML writer.
writer = Bio::PhyloXML::Writer.new('tree.xml')

# Writes tree to the file "tree.xml".
writer.write(tree1)

# Adds another tree to the file.
writer.write(tree2)
}}}

==== Nexus Format ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

= Phylogenetic Inference =

_Currently !BioRuby does not contain wrappers for phylogenetic inference programs; I am therefore in the process of writing a RAxML wrapper, to be followed by a wrapper for FastME..._

== Optimality Criteria Based on Character Data ==

Character data based methods work directly on molecular sequences and thus do not require the calculation of pairwise distances, but they tend to be time consuming and sensitive to errors in the multiple sequence alignment.

=== Maximum Likelihood ===

==== RAxML ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

==== PhyML ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

=== Maximum Parsimony ===

Currently no direct support in !BioRuby.

=== Bayesian Inference ===

Currently no direct support in !BioRuby.

== Pairwise Distance Based Methods ==

=== Pairwise Sequence Distance Estimation ===

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}
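
To give a rough idea of what pairwise distance estimation involves, the following plain-Ruby sketch computes the uncorrected p-distance (the proportion of differing positions) for every pair of aligned sequences. It is not the method this tutorial will eventually describe: `p_distance` is a hypothetical helper, no substitution model is applied, and the sequences are made up.

{{{
#!/usr/bin/env ruby

# Hypothetical helper (not part of BioRuby): uncorrected p-distance, i.e.
# the proportion of differing positions between two aligned sequences.
# Columns in which either sequence has a gap are skipped.
def p_distance(seq_a, seq_b, gap_char = '-')
  unless seq_a.length == seq_b.length
    raise ArgumentError, 'sequences differ in length'
  end

  compared = 0
  different = 0
  seq_a.chars.zip(seq_b.chars).each do |a, b|
    next if a == gap_char || b == gap_char
    compared += 1
    different += 1 if a != b
  end
  raise ArgumentError, 'no gap-free columns to compare' if compared.zero?

  different.to_f / compared
end

# Made-up aligned sequences, for illustration only.
aligned = { 'seq_a' => 'ACGT-A',
            'seq_b' => 'ACTTGA',
            'seq_c' => 'TCTTGA' }

# Computes the distance for every pair of sequences.
distances = {}
aligned.keys.combination(2) do |x, y|
  distances[[x, y]] = p_distance(aligned[x], aligned[y])
end

distances.each { |(x, y), d| puts format('%s %s %.3f', x, y, d) }
}}}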

=== Optimality Criteria Based on Pairwise Distances ===

==== Minimum Evolution: FastME ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

=== Algorithmic Methods Based on Pairwise Distances ===

==== Neighbor Joining and Related Methods ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

== Support Calculation? ==

=== Bootstrap Resampling? ===

----

= Analyzing Phylogenetic Trees =

== PAML ==

== Gene Duplication Inference ==

_need to further test and then import GSoC 'SDI' work..._

== Others? ==

----

= Putting It All Together =

Example of a small "pipeline"-type program running a minimal phylogenetic analysis: starting with a set of sequences and ending with a phylogenetic tree.