#summary Tutorial for multiple sequence alignments and phylogenetic methods in BioRuby -- under development!

= Introduction =

Under development!

Tutorial for multiple sequence alignments and phylogenetic methods in [http://bioruby.open-bio.org/ BioRuby].

Eventually, this is expected to be placed on the official !BioRuby page.

Author: [http://www.cmzmasek.net/ Christian M Zmasek], Sanford-Burnham Medical Research Institute

Copyright (C) 2011 Christian M Zmasek. All rights reserved.

= Multiple Sequence Alignment =

== Multiple Sequence Alignment Input and Output ==

=== Reading in a Multiple Sequence Alignment from a File ===

The following example shows how to read in a *ClustalW*-formatted multiple sequence alignment.

{{{
#!/usr/bin/env ruby
require 'bio'

# Reads in a ClustalW-formatted multiple sequence alignment
# from a file named "infile_clustalw.aln" and stores it in 'report'.
report = Bio::ClustalW::Report.new(File.read('infile_clustalw.aln'))

# Accesses the actual alignment.
msa = report.alignment

# Goes through all sequences in 'msa' and prints the
# actual molecular sequence.
msa.each do |entry|
  puts entry.seq
end
}}}

=== Writing a Multiple Sequence Alignment to a File ===

The following example shows how to write a multiple sequence alignment in *FASTA* format. It creates a file named "outfile.fasta" for writing ('w') and then writes the multiple sequence alignment referred to by variable 'msa' to it in FASTA format (':fasta').

{{{
#!/usr/bin/env ruby
require 'bio'

# 'msa' is a multiple sequence alignment, for example one read in
# as shown in the previous section.

# Creates a new file named "outfile.fasta" and writes
# multiple sequence alignment 'msa' to it in FASTA format.
File.open('outfile.fasta', 'w') do |f|
  f.write(msa.output(:fasta))
end
}}}

==== Setting the Output Format ====

The following symbols determine the output format:

  * `:clustal` for ClustalW
  * `:fasta` for FASTA
  * `:phylip` for PHYLIP interleaved (truncates sequence names to at most 10 characters)
  * `:phylipnon` for PHYLIP non-interleaved (truncates sequence names to at most 10 characters)
  * `:msf` for MSF
  * `:molphy` for Molphy

For example, the following writes the alignment 'msa' in PHYLIP's non-interleaved format:

{{{
f.write(msa.output(:phylipnon))
}}}
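
Because both PHYLIP formats cut sequence names to 10 characters, names that share their first 10 characters can no longer be told apart in the output. The following is a minimal sketch in plain Ruby of how such clashes could be detected before writing. It does not call !BioRuby, and `truncated_name_collisions` is a hypothetical helper introduced only for this illustration.

```ruby
# Hypothetical helper (not part of BioRuby): groups sequence names by their
# first 'width' characters and returns only the groups with more than one
# member, i.e. the names that would clash after truncation.
def truncated_name_collisions(names, width = 10)
  names.group_by { |name| name[0, width] }
       .select { |_prefix, group| group.size > 1 }
end

names = ['Homo_sapiens_BCL2', 'Homo_sapiens_BAX', 'Mus_musculus_BCL2']

truncated_name_collisions(names).each do |prefix, group|
  puts "#{prefix}: #{group.join(', ')}"
end
```

If the returned hash is not empty, renaming the affected sequences before calling `output(:phylip)` or `output(:phylipnon)` avoids ambiguous labels in the resulting file.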


=== Formatting of Individual Sequences ===

!BioRuby can write molecular sequences in a variety of formats.
Individual sequences can be written in, for example, !GenBank format, as shown in the following examples.

For Sequence objects:
{{{
seq.to_seq.output(:genbank)
}}}

For Bio::!FlatFile entries:
{{{
entry.to_biosequence.output(:genbank)
}}}

The following symbols determine the output format:
  * `:genbank` for !GenBank
  * `:embl` for EMBL
  * `:fasta` for FASTA
  * `:fasta_ncbi` for NCBI-type FASTA
  * `:raw` for raw sequence
  * `:fastq` for FASTQ (includes quality scores)
  * `:fastq_sanger` for Sanger-type FASTQ
  * `:fastq_solexa` for Solexa-type FASTQ
  * `:fastq_illumina` for Illumina-type FASTQ

== Calculating Multiple Sequence Alignments ==

!BioRuby can be used to execute a variety of multiple sequence alignment programs (such as [http://mafft.cbrc.jp/alignment/software/ MAFFT], [http://probcons.stanford.edu/ Probcons], [http://www.clustal.org/ ClustalW], [http://www.drive5.com/muscle/ Muscle], and [http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html T-Coffee]).
Examples for MAFFT and Muscle are shown in the following sections.


=== MAFFT ===

The following example uses the MAFFT program to align four sequences and then prints the result to the screen.
Please note that if the MAFFT executable is on the search path, the bare program name can be given instead of an explicit path, as in `mafft = Bio::MAFFT.new('mafft', options)`.

{{{
#!/usr/bin/env ruby
require 'bio'

# 'seqs' is either an array of sequences or a multiple sequence
# alignment. In general this is read in from a file as described in
# "Reading in a Multiple Sequence Alignment from a File".
# For the purpose of this tutorial, it is generated in code.
seqs = ['MFQIPEFEPSEQEDSSSAER',
        'MGTPKQPSLAPAHALGLRKS',
        'PKQPSLAPAHALGLRKS',
        'MCSTSGCDLE']

# Calculates the alignment using the MAFFT program on the local
# machine with options '--maxiterate 1000 --localpair'
# and stores the result in 'report'.
options = ['--maxiterate', '1000', '--localpair']
mafft = Bio::MAFFT.new('path/to/mafft', options)
report = mafft.query_align(seqs)

# Accesses the actual alignment.
align = report.alignment

# Prints each sequence to the console.
align.each { |s| puts s.to_s }
}}}
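
Whether an aligner can be started by its bare name depends on the search path of the current environment. As a rough sketch of how a script could check this before running an alignment, the plain-Ruby helper below looks for an executable file in each directory of `PATH`. `locate_program` is a hypothetical name used only for this illustration and is not part of !BioRuby.

```ruby
# Hypothetical helper (not part of BioRuby): returns the full path of the
# first executable file named 'program' found in the directories listed in
# 'search_path', or nil if there is none.
def locate_program(program, search_path = ENV['PATH'].to_s)
  search_path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, program)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

mafft_path = locate_program('mafft')
if mafft_path
  puts "Found MAFFT at #{mafft_path}"
else
  puts 'MAFFT not found on the search path; pass its full path explicitly.'
end
```

The same check applies to any of the other alignment programs by changing the program name.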

References:

 * Katoh, Toh (2008) "Recent developments in the MAFFT multiple sequence alignment program" Briefings in Bioinformatics 9:286-298

 * Katoh, Toh (2010) "Parallelization of the MAFFT multiple sequence alignment program" Bioinformatics 26:1899-1900


=== Muscle ===

The following example uses the Muscle program to align the same four sequences and then prints the result to the screen.

{{{
#!/usr/bin/env ruby
require 'bio'

# 'seqs' is either an array of sequences or a multiple sequence
# alignment. In general this is read in from a file as described in
# "Reading in a Multiple Sequence Alignment from a File".
# For the purpose of this tutorial, it is generated in code.
seqs = ['MFQIPEFEPSEQEDSSSAER',
        'MGTPKQPSLAPAHALGLRKS',
        'PKQPSLAPAHALGLRKS',
        'MCSTSGCDLE']

# Calculates the alignment using the Muscle program on the local
# machine with options '-quiet -maxiters 64'
# and stores the result in 'report'.
options = ['-quiet', '-maxiters', '64']
muscle = Bio::Muscle.new('path/to/muscle', options)
report = muscle.query_align(seqs)

# Accesses the actual alignment.
align = report.alignment

# Prints each sequence to the console.
align.each { |s| puts s.to_s }
}}}

References:

 * Edgar, R.C. (2004) "MUSCLE: multiple sequence alignment with high accuracy and high throughput" Nucleic Acids Res 32(5):1792-1797

=== Other Programs ===

_need more detail here..._

[http://probcons.stanford.edu/ Probcons], [http://www.clustal.org/ ClustalW], and [http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html T-Coffee] can be used in the same manner as the programs above.


== Manipulating Multiple Sequence Alignments ==

Oftentimes, multiple sequence alignments to be used for phylogenetic inference are 'cleaned up' in some manner. For instance, some researchers prefer to delete columns with more than 50% gaps. The following code is an example of how to do that in !BioRuby.

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}
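
Until the !BioRuby example is written, the column filter described above can be sketched in plain Ruby on an array of aligned sequence strings. This is an illustration under stated assumptions, not !BioRuby API: `remove_gappy_columns` is a hypothetical helper, and the characters `-` and `.` are assumed to be the gap symbols.

```ruby
# Hypothetical helper (plain Ruby, not a BioRuby method): removes every
# column in which more than 'max_gap_fraction' of the characters are gaps.
# 'rows' is an array of aligned sequence strings of equal length.
def remove_gappy_columns(rows, max_gap_fraction = 0.5, gap_chars = '-.')
  return [] if rows.empty?
  length = rows.first.length
  unless rows.all? { |row| row.length == length }
    raise ArgumentError, 'aligned sequences must have equal length'
  end

  # Indices of the columns whose gap fraction does not exceed the limit.
  keep = (0...length).select do |col|
    gaps = rows.count { |row| gap_chars.include?(row[col]) }
    gaps.to_f / rows.size <= max_gap_fraction
  end

  # Rebuilds each sequence from the retained columns only.
  rows.map { |row| keep.map { |col| row[col] }.join }
end

aligned = ['MK-A-T',
           'MKLA--',
           'M--A-T',
           'MK-AGT']
puts remove_gappy_columns(aligned)
```

In this input the third and fifth columns are 75% gaps and are dropped, so the script prints `MKAT`, `MKA-`, `M-AT`, and `MKAT`. A column with exactly 50% gaps is kept, matching the "more than 50%" wording above.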


----

= Phylogenetic Trees =

== Phylogenetic Tree Input and Output ==

=== Reading in of Phylogenetic Trees ===

====Newick or New Hampshire Format====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

====phyloXML Format====

Partially copied from [https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation Diana Jaunzeikare's documentation].

In addition to !BioRuby, a libxml Ruby binding is also required. This can be installed with the following command:

{{{
% gem install -r libxml-ruby
}}}

This example reads file "example.xml" and makes its [http://www.phyloxml.org/ phyloXML]-formatted trees available through variable 'trees'.

{{{
#!/usr/bin/env ruby
require 'bio'

# Creates a new phyloXML parser for file "example.xml".
trees = Bio::PhyloXML::Parser.open('example.xml')

# Prints the names of all trees in the file.
trees.each do |tree|
  puts tree.name
end

# If there are several trees in the file, a particular one
# can be accessed via its index.
tree = trees[3]
}}}

====Nexus Format====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

=== Writing of Phylogenetic Trees ===

====Newick or New Hampshire Format====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

====phyloXML Format====

Partially copied from [https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation Diana Jaunzeikare's documentation].

In addition to !BioRuby, a libxml Ruby binding is also required. This can be installed with the following command:

{{{
% gem install -r libxml-ruby
}}}

This example writes trees 'tree1' and 'tree2' to file "tree.xml" in [http://www.phyloxml.org/ phyloXML] format.

{{{
#!/usr/bin/env ruby
require 'bio'

# 'tree1' and 'tree2' are phylogenetic trees, for example trees
# read in as shown in the section on reading phyloXML.

# Creates a new phyloXML writer.
writer = Bio::PhyloXML::Writer.new('tree.xml')

# Writes a tree to the file "tree.xml".
writer.write(tree1)

# Adds another tree to the file.
writer.write(tree2)
}}}

====Nexus Format====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}


= Phylogenetic Inference =

_Currently !BioRuby does not contain wrappers for phylogenetic inference programs, thus I am in the process of writing a RAxML wrapper, followed by a wrapper for FastME..._

== Optimality Criteria Based on Character Data ==

Character-data-based methods work directly on molecular sequences and thus do not require the calculation of pairwise distances, but they tend to be time-consuming and sensitive to errors in the multiple sequence alignment.

=== Maximum Likelihood ===

==== RAxML ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

==== PhyML ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

=== Maximum Parsimony ===

Currently no direct support in !BioRuby.

=== Bayesian Inference ===

Currently no direct support in !BioRuby.

== Pairwise Distance Based Methods ==

=== Pairwise Sequence Distance Estimation ===

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}
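
To make the idea of a pairwise distance concrete while this section is still to be written, the following plain-Ruby sketch computes the simplest estimator, the uncorrected p-distance (the fraction of compared alignment columns at which two sequences differ). It is an illustration only: `p_distance` and `distance_matrix` are hypothetical helpers, not !BioRuby methods, and distances used for real analyses are usually corrected with a substitution model.

```ruby
# Hypothetical helper (plain Ruby, not a BioRuby method): uncorrected
# p-distance between two aligned sequences of equal length. Columns with a
# gap in either sequence are skipped.
def p_distance(seq_a, seq_b, gap = '-')
  unless seq_a.length == seq_b.length
    raise ArgumentError, 'sequences must have equal length'
  end
  compared = 0
  different = 0
  seq_a.chars.zip(seq_b.chars).each do |a, b|
    next if a == gap || b == gap
    compared += 1
    different += 1 if a != b
  end
  raise ArgumentError, 'no comparable columns' if compared.zero?
  different.to_f / compared
end

# Builds the symmetric matrix of all pairwise p-distances.
def distance_matrix(rows)
  rows.map { |a| rows.map { |b| p_distance(a, b) } }
end

aligned = ['ACGTACGT', 'ACGTACGA', 'AC-TTCGA']
distance_matrix(aligned).each do |row|
  puts row.map { |d| format('%.3f', d) }.join(' ')
end
```

A matrix of this kind is the input expected by the distance-based tree-building methods listed in the following sections.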


=== Optimality Criteria Based on Pairwise Distances ===

==== Minimum Evolution: FastME ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}

=== Algorithmic Methods Based on Pairwise Distances ===

==== Neighbor Joining and Related Methods ====

_... to be done_

{{{
#!/usr/bin/env ruby
require 'bio'

}}}


== Support Calculation? ==

=== Bootstrap Resampling? ===
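
As a placeholder for this section, the core step of bootstrap resampling can be sketched in plain Ruby: alignment columns are drawn at random with replacement until a pseudo-replicate of the original length is obtained. `bootstrap_replicate` is a hypothetical helper for illustration and is not part of !BioRuby. Tree inference on each replicate and the counting of recovered branches are not shown.

```ruby
# Hypothetical helper (plain Ruby, not a BioRuby method): builds one
# bootstrap pseudo-replicate by sampling alignment columns with replacement.
# 'rows' is an array of aligned sequence strings of equal length; 'rng' can
# be passed in to make the sampling reproducible.
def bootstrap_replicate(rows, rng = Random.new)
  return [] if rows.empty?
  length = rows.first.length
  # Draws 'length' column indices, each uniformly from 0...length.
  columns = Array.new(length) { rng.rand(length) }
  rows.map { |row| columns.map { |col| row[col] }.join }
end

aligned = ['ACGTAC', 'ACGTTC', 'ACCTAC']
3.times do |i|
  puts "Replicate #{i + 1}:"
  puts bootstrap_replicate(aligned)
end
```

Because the same column indices are applied to every sequence, each replicate remains a valid alignment with the same number of sequences and columns as the input.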


----

= Analyzing Phylogenetic Trees =

== PAML ==

== Gene Duplication Inference ==

_need to further test and then import GSoC 'SDI' work..._

== Others? ==

----

= Putting It All Together =

Example of a small "pipeline"-type program running a minimal phylogenetic analysis: starting with a set of sequences and ending with a phylogenetic tree.