1 # Copyright 2009 by Peter Cock. All rights reserved.
2 # This code is part of the Biopython distribution and governed by its
3 # license. Please see the LICENSE file that should have been included
4 # as part of this package.
6 # This module is for reading and writing FASTQ and QUAL format files as
7 # SeqRecord objects, and is expected to be used via the Bio.SeqIO API.
9 """Bio.SeqIO support for the "fastq" and "qual" file formats.
11 Note that you are expected to use this code via the Bio.SeqIO interface, as
14 The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute
15 to bundle a FASTA sequence and its PHRED quality data (integers between 0 and
16 90). Rather than using a single FASTQ file, often paired FASTA and QUAL files
17 are used containing the sequence and the quality information separately.
19 The PHRED software reads DNA sequencing trace files, calls bases, and
20 assigns a quality value between 0 and 90 to each called base using a logged
21 transformation of the error probability, Q = -10 log10( Pe ), for example::
27 Pe = 0.00000001, Q = 80
28 Pe = 0.000000001, Q = 90
30 In the QUAL format these quality values are held as space separated text in
31 a FASTA like file format. In the FASTQ format, each quality values is encoded
32 with a single ASCI character using chr(Q+33), meaning zero maps to the
33 character "!" and for example 80 maps to "q". The sequences and quality are
34 then stored in pairs in a FASTA like format.
36 Unfortunately there is no official document describing the FASTQ file format,
37 and worse, several related but different variants exist. Reasonable
38 documentation exists at: http://maq.sourceforge.net/fastq.shtml
40 Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can
41 be negative or easily exceed 90. PHRED scores and Solexa scores are NOT
42 interchangeable (but a reasonable mapping can be achieved between them).
43 Confusingly Solexa produces a FASTQ like file but using their own score
46 Also note that Roche 454 sequencers can output files in the QUAL format, and
47 thankfully they use PHREP style scores like Sanger. To extract QUAL files from
48 a Roche 454 SFF binary file, use the Roche off instrument command line tool
49 "sffinfo" with the -q or -qual argument. You can extract a matching FASTA file
50 using the -s or -seq argument instead.
52 You are expected to use this module via the Bio.SeqIO functions, with the
53 following format names:
54 - "fastq" means Sanger style FASTQ files using PHRED scores.
55 - "fastq-solexa" means Solexa/Illumina style FASTQ files.
56 - "qual" means simple quality files using PHRED scores.
58 For example, consider the following short FASTQ file (extracted from a real
61 @EAS54_6_R1_2_1_413_324
62 CCCTTCTTGTCTTCAGCGTTTCTCC
64 ;;3;;;;;;;;;;;;7;;;;;;;88
65 @EAS54_6_R1_2_1_540_792
66 TTGGCAGGCCAAGGCCGATGGATCA
68 ;;;;;;;;;;;7;;;;;-;;;3;83
69 @EAS54_6_R1_2_1_443_348
70 GTTGCTTCTGGCGTGGGTGGGGGGG
72 ;;;;;;;;;;;9;7;;.7;393333
74 This contains three reads of length 25. From the read length these were
75 probably originally from an early Solexa/Illumina sequencer but NCBI have
76 followed the Sanger FASTQ convention and this actually uses PHRED style
77 qualities. This means we can parse this file using Bio.SeqIO using "fastq"
80 >>> from Bio import SeqIO
81 >>> for record in SeqIO.parse(open("Quality/example.fastq"), "fastq") :
82 ... print record.id, record.seq
83 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
84 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
85 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
87 The qualities are held as a list of integers in each record's annotation:
90 ID: EAS54_6_R1_2_1_443_348
91 Name: EAS54_6_R1_2_1_443_348
92 Description: EAS54_6_R1_2_1_443_348
94 Per letter annotation for: phred_quality
95 Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet())
96 >>> print record.letter_annotations["phred_quality"]
97 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
99 You can use the SeqRecord format method you can show this in the QUAL format:
101 >>> print record.format("qual")
102 >EAS54_6_R1_2_1_443_348
103 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
107 Or go back to the FASTQ format,
109 >>> print record.format("fastq")
110 @EAS54_6_R1_2_1_443_348
111 GTTGCTTCTGGCGTGGGTGGGGGGG
113 ;;;;;;;;;;;9;7;;.7;393333
116 You can also get Biopython to convert the scores and show a Solexa style
119 >>> print record.format("fastq-solexa")
120 @EAS54_6_R1_2_1_443_348
121 GTTGCTTCTGGCGTGGGTGGGGGGG
123 ZZZZZZZZZZZXZVZZMVZRXRRRR
126 If you wanted to trim your sequences (perhaps to remove low quality regions,
127 or to remove a primer sequence), try slicing the SeqRecord objects. e.g.
129 >>> sub_rec = record[5:15]
131 ID: EAS54_6_R1_2_1_443_348
132 Name: EAS54_6_R1_2_1_443_348
133 Description: EAS54_6_R1_2_1_443_348
134 Number of features: 0
135 Per letter annotation for: phred_quality
136 Seq('TTCTGGCGTG', SingleLetterAlphabet())
137 >>> print sub_rec.letter_annotations["phred_quality"]
138 [26, 26, 26, 26, 26, 26, 24, 26, 22, 26]
139 >>> print sub_rec.format("fastq")
140 @EAS54_6_R1_2_1_443_348
146 If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:
148 >>> from Bio import SeqIO
149 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
150 >>> out_handle = open("Quality/temp.qual", "w")
151 >>> SeqIO.write(record_iterator, out_handle, "qual")
153 >>> out_handle.close()
155 You can of course read in a QUAL file, such as the one we just created:
157 >>> from Bio import SeqIO
158 >>> for record in SeqIO.parse(open("Quality/temp.qual"), "qual") :
159 ... print record.id, record.seq
160 EAS54_6_R1_2_1_413_324 ?????????????????????????
161 EAS54_6_R1_2_1_540_792 ?????????????????????????
162 EAS54_6_R1_2_1_443_348 ?????????????????????????
164 Notice that QUAL files don't have a proper sequence present! But the quality
165 information is there:
168 ID: EAS54_6_R1_2_1_443_348
169 Name: EAS54_6_R1_2_1_443_348
170 Description: EAS54_6_R1_2_1_443_348
171 Number of features: 0
172 Per letter annotation for: phred_quality
173 UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?')
174 >>> print record.letter_annotations["phred_quality"]
175 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
177 Just to keep things tidy, if you are following this example yourself, you can
178 delete this temporary file now:
181 >>> os.remove("Quality/temp.qual")
183 Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL
184 files. Because the Bio.SeqIO system is designed for reading single files, you
185 would have to read the two in separately and then combine the data. However,
186 since this is such a common thing to want to do, there is a helper iterator
187 defined in this module that does this for you - PairedFastaQualIterator.
189 Alternatively, if you have enough RAM to hold all the records in memory at once,
190 then a simple dictionary approach would work:
192 >>> from Bio import SeqIO
193 >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta"))
194 >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual") :
195 ... reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"]
197 You can then access any record by its key, and get both the sequence and the
200 >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq")
201 @EAS54_6_R1_2_1_540_792
202 TTGGCAGGCCAAGGCCGATGGATCA
204 ;;;;;;;;;;;7;;;;;-;;;3;83
207 It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are
208 using ("fastq" for the Sanger standard using PHRED values, or "fastq-solexa"
209 for the Solexa/Illumina variant), as this cannot be detected reliably
212 __docformat__ = "epytext en" #Don't just use plain text in epydoc API pages!
214 #See also http://blog.malde.org/index.php/2008/09/09/the-fastq-file-format-for-sequences/
216 from Bio.Alphabet import single_letter_alphabet
217 from Bio.Seq import Seq, UnknownSeq
218 from Bio.SeqRecord import SeqRecord
219 from Interfaces import SequentialSequenceWriter
222 # define score offsets. See discussion for differences between Sanger and
224 SANGER_SCORE_OFFSET = 33
225 SOLEXA_SCORE_OFFSET = 64
227 def solexa_quality_from_phred(phred_quality) :
228 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality.
230 This will return a floating point number, it is up to you to round this to
231 the nearest integer if appropriate. e.g.
233 >>> print "%0.2f" % round(solexa_quality_from_phred(80),2)
235 >>> print "%0.2f" % round(solexa_quality_from_phred(50),2)
237 >>> print "%0.2f" % round(solexa_quality_from_phred(20),2)
239 >>> print "%0.2f" % round(solexa_quality_from_phred(10),2)
241 >>> print "%0.2f" % round(solexa_quality_from_phred(1),2)
244 return 10*log(10**(phred_quality/10.0) - 1, 10)
246 def phred_quality_from_solexa(solexa_quality) :
247 """Convert a Solexa quality (which can be negative) to a PHRED quality.
249 This will return a floating point number, it is up to you to round this to
250 the nearest integer if appropriate. e.g.
252 >>> print "%0.2f" % round(phred_quality_from_solexa(80),2)
254 >>> print "%0.2f" % round(phred_quality_from_solexa(20),2)
256 >>> print "%0.2f" % round(phred_quality_from_solexa(10),2)
258 >>> print "%0.2f" % round(phred_quality_from_solexa(0),2)
260 >>> print "%0.2f" % round(phred_quality_from_solexa(-10),2)
263 return 10*log(10**(solexa_quality/10.0) + 1, 10)
265 def _get_phred_quality(record) :
266 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE).
268 If there are no PHRED qualities, but there are Solexa qualities, those are
269 used instead after conversion.
272 return record.letter_annotations["phred_quality"]
276 return [phred_quality_from_solexa(q) for \
277 q in record.letter_annotations["solexa_quality"]]
279 raise ValueError("No suitable quality scores found in letter_annotations "
280 "of SeqRecord (id=%s)." % record.id)
282 def _get_solexa_quality(record) :
283 """Extract Solexa qualities from a SeqRecord's letter_annotations (PRIVATE).
285 If there are no Solexa qualities, but there are PHRED qualities, those are
286 used instead after conversion.
289 return record.letter_annotations["solexa_quality"]
293 return [solexa_quality_from_phred(q) for \
294 q in record.letter_annotations["phred_quality"]]
296 raise ValueError("No suitable quality scores found in letter_annotation "
297 "of SeqRecord (id=%s)." % record.id)
300 #TODO - Default to nucleotide or even DNA?
301 def FastqGeneralIterator(handle) :
302 """Iterate over Fastq records as string tuples (not as SeqRecord objects).
304 This code does not try to interpret the quality string numerically. It
305 just returns tuples of the title, sequence and quality as strings. For
306 the sequence and quality, any whitespace (such as new lines) is removed.
308 Our SeqRecord based FASTQ iterators call this function internally, and then
309 turn the strings into a SeqRecord objects, mapping the quality string into
310 a list of numerical scores. If you want to do a custom quality mapping,
311 then you might consider calling this function directly.
313 For parsing FASTQ files, the title string from the "@" line at the start
314 of each record can optionally be omitted on the "+" lines. If it is
315 repeated, it must be identical.
317 The sequence string and the quality string can optionally be split over
318 multiple lines, although several sources discourage this. In comparison,
319 for the FASTA file format line breaks between 60 and 80 characters are
322 WARNING - Because the "@" character can appear in the quality string,
323 this can cause problems as this is also the marker for the start of
324 a new sequence. In fact, the "+" sign can also appear as well. Some
325 sources recommended having no line breaks in the quality to avoid this,
326 but even that is not enough, consider this example::
328 @071113_EAS56_0053:1:1:998:236
329 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
330 +071113_EAS56_0053:1:1:998:236
331 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
332 @071113_EAS56_0053:1:1:182:712
333 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
335 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
336 @071113_EAS56_0053:1:1:153:10
337 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
339 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
340 @071113_EAS56_0053:1:3:990:501
347 This is four PHRED encoded FASTQ entries originally from an NCBI source
348 (given the read length of 36, these are probably Solexa Illumna reads where
349 the quality has been mapped onto the PHRED values).
351 This example has been edited to illustrate some of the nasty things allowed
352 in the FASTQ format. Firstly, on the "+" lines most but not all of the
353 (redundant) identifiers are ommited. In real files it is likely that all or
354 none of these extra identifiers will be present.
356 Secondly, while the first three sequences have been shown without line
357 breaks, the last has been split over multiple lines. In real files any line
358 breaks are likely to be consistent.
360 Thirdly, some of the quality string lines start with an "@" character. For
361 the second record this is unavoidable. However for the fourth sequence this
362 only happens because its quality string is split over two lines. A naive
363 parser could wrongly treat any line starting with an "@" as the beginning of
364 a new sequence! This code copes with this possible ambiguity by keeping track
365 of the length of the sequence which gives the expected length of the quality
368 Using this tricky example file as input, this short bit of code demonstrates
369 what this parsing function would return:
371 >>> handle = open("Quality/tricky.fastq", "rU")
372 >>> for (title, sequence, quality) in FastqGeneralIterator(handle) :
374 ... print sequence, quality
375 071113_EAS56_0053:1:1:998:236
376 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
377 071113_EAS56_0053:1:1:182:712
378 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
379 071113_EAS56_0053:1:1:153:10
380 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
381 071113_EAS56_0053:1:3:990:501
382 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
385 Finally we note that some sources state that the quality string should
386 start with "!" (which using the PHRED mapping means the first letter always
387 has a quality score of zero). This rather restrictive rule is not widely
388 observed, so is therefore ignored here. One plus point about this "!" rule
389 is that (provided there are no line breaks in the quality sequence) it
390 would prevent the above problem with the "@" character.
392 #Skip any text before the first record (e.g. blank lines, comments?)
394 line = handle.readline()
395 if line == "" : return #Premature end of file, or just empty?
401 raise ValueError("Records in Fastq files should start with '@' character")
402 title_line = line[1:].rstrip()
405 line = handle.readline()
408 raise ValueError("End of file without quality information.")
410 #The title here is optional, but if present must match!
411 if line[1:].rstrip() and line[1:].rstrip() != title_line :
412 raise ValueError("Sequence and quality captions differ.")
414 seq_lines.extend(line.split()) #removes any whitespace
415 line = handle.readline()
417 seq_string = "".join(seq_lines)
421 line = handle.readline()
425 #This COULD be the start of a new sequence. However, it MAY just
426 #be a line of quality data which starts with a "@" character. We
427 #should be able to check this by looking at the sequence length
428 #and the amount of quality data found so far.
429 if len("".join(quality_lines)) >= len(seq_string) :
430 #We expect it to be equal if this is the start of a new record.
431 #If the quality data is longer, we'll raise an error below.
433 #Continue - its just some (more) sequence data.
435 quality_lines.extend(line.split()) #removes any whitespace
436 line = handle.readline()
438 quality_string = "".join(quality_lines)
441 if len(seq_string) != len(quality_string) :
442 raise ValueError("Lengths of sequence and quality values differs "
443 " for %s (%i and %i)." \
444 % (title_line, len(seq_string), len(quality_string)))
446 #Return the record and then continue...
447 yield (title_line, seq_string, quality_string)
448 if not line : return #StopIteration at end of file
449 assert False, "Should not reach this line"
451 #This is a generator function!
452 def FastqPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
453 """Generator function to iterate over FASTQ records (as SeqRecord objects).
455 - handle - input file
456 - alphabet - optional alphabet
457 - title2ids - A function that, when given the title line from the FASTQ
458 file (without the beginning >), will return the id, name and
459 description (in that order) for the record as a tuple of
460 strings. If this is not given, then the entire title line
461 will be used as the description, and the first word as the
464 Note that use of title2ids matches that of Bio.SeqIO.FastaIO.
466 For each sequence in a (Sanger style) FASTQ file there is a matching string
467 encoding the PHRED qualities (integers between 0 and about 90) using ASCII
468 values with an offset of 33.
470 For example, consider a file containing three short reads::
472 @EAS54_6_R1_2_1_413_324
473 CCCTTCTTGTCTTCAGCGTTTCTCC
475 ;;3;;;;;;;;;;;;7;;;;;;;88
476 @EAS54_6_R1_2_1_540_792
477 TTGGCAGGCCAAGGCCGATGGATCA
479 ;;;;;;;;;;;7;;;;;-;;;3;83
480 @EAS54_6_R1_2_1_443_348
481 GTTGCTTCTGGCGTGGGTGGGGGGG
483 ;;;;;;;;;;;9;7;;.7;393333
485 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
486 string encoding the PHRED qualities using a ASCI values with an offset of
487 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").
489 Using this module directly you might run:
491 >>> handle = open("Quality/example.fastq", "rU")
492 >>> for record in FastqPhredIterator(handle) :
493 ... print record.id, record.seq
494 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
495 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
496 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
499 Typically however, you would call this via Bio.SeqIO instead with "fastq" as
502 >>> from Bio import SeqIO
503 >>> handle = open("Quality/example.fastq", "rU")
504 >>> for record in SeqIO.parse(handle, "fastq") :
505 ... print record.id, record.seq
506 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
507 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
508 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
511 If you want to look at the qualities, they are record in each record's
512 per-letter-annotation dictionary as a simple list of integers:
514 >>> print record.letter_annotations["phred_quality"]
515 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
517 for title_line, seq_string, quality_string in FastqGeneralIterator(handle) :
519 id, name, descr = title2ids(title_line)
522 id = descr.split()[0]
524 record = SeqRecord(Seq(seq_string, alphabet),
525 id=id, name=name, description=descr)
527 assert SANGER_SCORE_OFFSET == ord("!")
528 #According to BioPerl documentation at least, the first character should
529 #be an "!" (and therefore quality zero). This seems crazy - what if the
530 #sequence has been trimmed to remove any poor quality sequence? In any
531 #case real examples from the NCBI don't follow this practice, so we
532 #won't enforce it here.
533 #e.g. ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA000271/fastq/200x36x36-071113_EAS56_0053-s_1_1.fastq.gz
535 #if quality_string[0] != "!" :
536 # raise ValueError("The quality string should always start with a ! character.")
537 qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string]
539 if min(qualities) < 0 or max(qualities) > 90 :
540 raise ValueError("Quality score outside 0 to 90 found - these are perhaps "
541 "in a Solexa/Illumina format, not the Sanger FASTQ format "
542 "which uses PHRED scores.")
543 record.letter_annotations["phred_quality"] = qualities
546 #This is a generator function!
547 def FastqSolexaIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
548 """Parsing the Solexa/Illumina FASTQ like files (which differ in the quality mapping).
550 The optional arguments are the same as those for the FastqPhredIterator.
552 For each sequence in Solexa/Illumina FASTQ files there is a matching string
553 encoding the Solexa integer qualities using ASCII values with an offset
554 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython
555 will NOT perform any automatic conversion when loading.
557 For example, consider a file containing these five records::
559 @SLXA-B3_649_FC8437_R1_1_1_610_79
560 GATGTGCAATACCTTTGTAGAGGAA
561 +SLXA-B3_649_FC8437_R1_1_1_610_79
562 YYYYYYYYYYYYYYYYYYWYWYYSU
563 @SLXA-B3_649_FC8437_R1_1_1_397_389
564 GGTTTGAGAAAGAGAAATGAGATAA
565 +SLXA-B3_649_FC8437_R1_1_1_397_389
566 YYYYYYYYYWYYYYWWYYYWYWYWW
567 @SLXA-B3_649_FC8437_R1_1_1_850_123
568 GAGGGTGTTGATCATGATGATGGCG
569 +SLXA-B3_649_FC8437_R1_1_1_850_123
570 YYYYYYYYYYYYYWYYWYYSYYYSY
571 @SLXA-B3_649_FC8437_R1_1_1_362_549
572 GGAAACAAAGTTTTTCTCAACATAG
573 +SLXA-B3_649_FC8437_R1_1_1_362_549
574 YYYYYYYYYYYYYYYYYYWWWWYWY
575 @SLXA-B3_649_FC8437_R1_1_1_183_714
576 GTATTATTTAATGGCATACACTCAA
577 +SLXA-B3_649_FC8437_R1_1_1_183_714
578 YYYYYYYYYYWYYYYWYWWUWWWQQ
580 Using this module directly you might run:
582 >>> handle = open("Quality/solexa_example.fastq", "rU")
583 >>> for record in FastqSolexaIterator(handle) :
584 ... print record.id, record.seq
585 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
586 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
587 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
588 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
589 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
592 Typically however, you would call this via Bio.SeqIO instead with "fastq" as
595 >>> from Bio import SeqIO
596 >>> handle = open("Quality/solexa_example.fastq", "rU")
597 >>> for record in SeqIO.parse(handle, "fastq-solexa") :
598 ... print record.id, record.seq
599 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
600 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
601 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
602 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
603 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
606 If you want to look at the qualities, they are recorded in each record's
607 per-letter-annotation dictionary as a simple list of integers:
609 >>> print record.letter_annotations["solexa_quality"]
610 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]
612 These scores aren't very good, but they are high enough that they map
613 almost exactly onto PHRED scores:
615 >>> print "%0.2f" % phred_quality_from_solexa(25)
618 Let's look at another example read which is even worse, where there are
619 more noticeable differences between the Solexa and PHRED scores::
622 ACAAAAATCACAAGCATTCTTATACACC
624 ??????????????????:??<?<-6%.
626 Again, you would typically use Bio.SeqIO to read this file in (rather than
627 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will
628 contain thousands of reads, so you would normally use Bio.SeqIO.parse()
629 as shown above. This example has only as one entry, so instead we can
630 use the Bio.SeqIO.read() function:
632 >>> from Bio import SeqIO
633 >>> handle = open("Quality/solexa.fastq", "rU")
634 >>> record = SeqIO.read(handle, "fastq-solexa")
636 >>> print record.id, record.seq
637 slxa_0013_1_0001_24 ACAAAAATCACAAGCATTCTTATACACC
638 >>> print record.letter_annotations["solexa_quality"]
639 [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -6, -1, -1, -4, -1, -4, -19, -10, -27, -18]
641 These quality scores are so low that when converted from the Solexa scheme
642 into PHRED scores they look quite different:
644 >>> print "%0.2f" % phred_quality_from_solexa(-1)
647 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
648 method to output the record(s):
650 >>> print record.format("fastq-solexa")
652 ACAAAAATCACAAGCATTCTTATACACC
654 ??????????????????:??<?<-6%.
657 Note this output is slightly different from the input file as Biopython
658 has left out the optional repetition of the sequence identifier on the "+"
659 line. If you want the to use PHRED scores, use "fastq" or "qual" as the
660 output format instead, and Biopython will do the conversion for you:
662 >>> print record.format("fastq")
664 ACAAAAATCACAAGCATTCTTATACACC
666 $$$$$$$$$$$$$$$$$$"$$"$"!!!!
669 >>> print record.format("qual")
671 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 0 0 0 0
674 for title_line, seq_string, quality_string in FastqGeneralIterator(handle) :
676 id, name, descr = title_line
679 id = descr.split()[0]
681 record = SeqRecord(Seq(seq_string, alphabet),
682 id=id, name=name, description=descr)
683 qualities = [ord(letter)-SOLEXA_SCORE_OFFSET for letter in quality_string]
684 #DO NOT convert these into PHRED qualities automatically!
685 record.letter_annotations["solexa_quality"] = qualities
688 def QualPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
689 """For QUAL files which include PHRED quality scores, but no sequence.
691 For example, consider this short QUAL file::
693 >EAS54_6_R1_2_1_413_324
694 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
696 >EAS54_6_R1_2_1_540_792
697 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
699 >EAS54_6_R1_2_1_443_348
700 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
703 Using this module directly you might run:
705 >>> handle = open("Quality/example.qual", "rU")
706 >>> for record in QualPhredIterator(handle) :
707 ... print record.id, record.seq
708 EAS54_6_R1_2_1_413_324 ?????????????????????????
709 EAS54_6_R1_2_1_540_792 ?????????????????????????
710 EAS54_6_R1_2_1_443_348 ?????????????????????????
713 Typically however, you would call this via Bio.SeqIO instead with "qual"
716 >>> from Bio import SeqIO
717 >>> handle = open("Quality/example.qual", "rU")
718 >>> for record in SeqIO.parse(handle, "qual") :
719 ... print record.id, record.seq
720 EAS54_6_R1_2_1_413_324 ?????????????????????????
721 EAS54_6_R1_2_1_540_792 ?????????????????????????
722 EAS54_6_R1_2_1_443_348 ?????????????????????????
725 Becase QUAL files don't contain the sequence string itself, the seq
726 property is set to an UnknownSeq object. As no alphabet was given, this
727 has defaulted to a generic single letter alphabet and the character "?"
730 By specifying a nucleotide alphabet, "N" is used instead:
732 >>> from Bio import SeqIO
733 >>> from Bio.Alphabet import generic_dna
734 >>> handle = open("Quality/example.qual", "rU")
735 >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna) :
736 ... print record.id, record.seq
737 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
738 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
739 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN
742 However, the quality scores themselves are available as a list of integers
743 in each record's per-letter-annotation:
745 >>> print record.letter_annotations["phred_quality"]
746 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
748 You can still slice one of these SeqRecord objects with an UnknownSeq:
750 >>> sub_record = record[5:10]
751 >>> print sub_record.id, sub_record.letter_annotations["phred_quality"]
752 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]
754 #Skip any text before the first record (e.g. blank lines, comments)
756 line = handle.readline()
757 if line == "" : return #Premature end of file, or just empty?
763 raise ValueError("Records in Fasta files should start with '>' character")
765 id, name, descr = title2ids(line[1:].rstrip())
767 descr = line[1:].rstrip()
768 id = descr.split()[0]
772 line = handle.readline()
775 if line[0] == ">": break
776 qualities.extend([int(word) for word in line.split()])
777 line = handle.readline()
780 if min(qualities) < 0 or max(qualities) > 90 :
781 raise ValueError(("Quality score range for %s is %i to %i, outside the " \
782 +"expected 0 to 90. Perhaps these are Solexa/Illumina " \
783 +"scores, and not PHRED scores?") \
784 % (id, min(qualities), max(qualities)))
786 #Return the record and then continue...
787 record = SeqRecord(UnknownSeq(len(qualities), alphabet),
788 id = id, name = name, description = descr)
789 record.letter_annotations["phred_quality"] = qualities
792 if not line : return #StopIteration
793 assert False, "Should not reach this line"
795 class FastqPhredWriter(SequentialSequenceWriter):
796 """Class to write FASTQ format files (using PHRED quality scores).
798 Although you can use this class directly, you are strongly encouraged
799 to use the Bio.SeqIO.write() function instead. For example, this code
800 reads in a FASTQ (PHRED) file and re-saves it as another FASTQ (PHRED)
803 >>> from Bio import SeqIO
804 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
805 >>> out_handle = open("Quality/temp.fastq", "w")
806 >>> SeqIO.write(record_iterator, out_handle, "fastq")
808 >>> out_handle.close()
810 You might want to do this if the original file included extra line breaks,
811 which while valid may not be supported by all tools. The output file from
812 Biopython will have each sequence on a single line, and each quality
813 string on a single line (which is considered desirable for maximum
816 In this next example, a Solexa FASTQ file is converted into a standard
817 Sanger style FASTQ file using PHRED qualities:
819 >>> from Bio import SeqIO
820 >>> record_iterator = SeqIO.parse(open("Quality/solexa.fastq"), "fastq-solexa")
821 >>> out_handle = open("Quality/temp.fastq", "w")
822 >>> SeqIO.write(record_iterator, out_handle, "fastq")
824 >>> out_handle.close()
826 This code is also called if you use the .format("fastq") method of a
829 P.S. To avoid cluttering up your working directory, you can delete this
833 >>> os.remove("Quality/temp.fastq")
836 def write_record(self, record):
837 """Write a single FASTQ record to the file."""
838 assert self._header_written
839 assert not self._footer_written
840 self._record_written = True
842 #TODO - Is an empty sequence allowed in FASTQ format?
843 assert SANGER_SCORE_OFFSET == ord("!")
844 #This rounds to the nearest integer:
845 qualities = "".join([chr(int(round(q+SANGER_SCORE_OFFSET,0))) for q \
846 in _get_phred_quality(record)])
847 if record.seq is None:
848 raise ValueError("No sequence for record %s" % record.id)
849 if len(qualities) != len(record) :
850 raise ValueError("Record %s has sequence length %i but %i quality scores" \
851 % (record.id, len(record), len(qualities)))
853 title = self.clean(record.id) #TODO - add the description too? cf Fasta output
854 self.handle.write("@%s\n%s\n+\n%s\n" % (title, record.seq, qualities))
856 class QualPhredWriter(SequentialSequenceWriter):
857 """Class to write QUAL format files (using PHRED quality scores).
859 Although you can use this class directly, you are strongly encouraged
860 to use the Bio.SeqIO.write() function instead. For example, this code
861 reads in a FASTQ file and saves the quality scores into a QUAL file:
863 >>> from Bio import SeqIO
864 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
865 >>> out_handle = open("Quality/temp.qual", "w")
866 >>> SeqIO.write(record_iterator, out_handle, "qual")
868 >>> out_handle.close()
870 This code is also called if you use the .format("qual") method of a
873 P.S. Don't forget to clean up the temp file if you don't need it anymore:
876 >>> os.remove("Quality/temp.qual")
878 def __init__(self, handle, wrap=60, record2title=None):
879 """Create a QUAL writer.
882 - handle - Handle to an output file, e.g. as returned
883 by open(filename, "w")
884 - wrap - Optional line length used to wrap sequence lines.
885 Defaults to wrapping the sequence at 60 characters
886 Use zero (or None) for no wrapping, giving a single
887 long line for the sequence.
888 - record2title - Optional function to return the text to be
889 used for the title line of each record. By default
890 a combination of the record.id and record.description
891 is used. If the record.description starts with the
892 record.id, then just the record.description is used.
894 The record2title argument is present for consistency with the
895 Bio.SeqIO.FastaIO writer class.
897 SequentialSequenceWriter.__init__(self, handle)
898 #self.handle = handle
904 self.record2title = record2title
906 def write_record(self, record):
907 """Write a single QUAL record to the file."""
908 assert self._header_written
909 assert not self._footer_written
910 self._record_written = True
912 if self.record2title :
913 title=self.clean(self.record2title(record))
915 id = self.clean(record.id)
916 description = self.clean(record.description)
918 #if description[:len(id)]==id :
919 if description and description.split(None,1)[0]==id :
920 #The description includes the id at the start
923 title = "%s %s" % (id, description)
925 assert "\n" not in title
926 assert "\r" not in title
927 self.handle.write(">%s\n" % title)
929 #This rounds to the nearest integer.
930 #TODO - can we put a float in a qual file?
931 qualities = [("%i" % round(q,0)) for q in _get_phred_quality(record)]
935 line=qualities.pop(0)
937 and len(line) + 1 + len(qualities[0]) < self.wrap :
938 line += " " + qualities.pop(0)
939 self.handle.write(line + "\n")
941 data = " ".join(qualities)
942 self.handle.write(data + "\n")
944 class FastqSolexaWriter(SequentialSequenceWriter):
945 """Class to write FASTQ format files (using Solexa quality scores).
947 Although you can use this class directly, you are strongly encouraged
948 to use the Bio.SeqIO.write() function instead. For example, this code
949 reads in a FASTQ file and re-saves it as another FASTQ file:
951 >>> from Bio import SeqIO
952 >>> record_iterator = SeqIO.parse(open("Quality/solexa.fastq"), "fastq-solexa")
953 >>> out_handle = open("Quality/temp.fastq", "w")
954 >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa")
956 >>> out_handle.close()
958 You might want to do this if the original file included extra line
959 breaks, which (while valid) may not be supported by all tools. The
960 output file from Biopython will have each sequence on a single line, and
961 each quality string on a single line (which is considered desirable for
962 maximum compatibility).
964 This code is also called if you use the .format("fastq-solexa") method of
967 P.S. Don't forget to delete the temp file if you don't need it anymore:
970 >>> os.remove("Quality/temp.fastq")
972 def write_record(self, record):
973 """Write a single FASTQ record to the file."""
974 assert self._header_written
975 assert not self._footer_written
976 self._record_written = True
978 #TODO - Is an empty sequence allowed in FASTQ format?
979 qualities = "".join([chr(int(round(q+SOLEXA_SCORE_OFFSET,0))) for q \
980 in _get_solexa_quality(record)])
981 if record.seq is None:
982 raise ValueError("No sequence for record %s" % record.id)
983 if len(qualities) != len(record) :
984 raise ValueError("Record %s has sequence length %i but %i quality scores" \
985 % (record.id, len(record), len(qualities)))
987 title = self.clean(record.id) #TODO - add the description too? cf Fasta output
988 self.handle.write("@%s\n%s\n+\n%s\n" % (title, record.seq, qualities))
990 def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet = single_letter_alphabet, title2ids = None) :
991 """Iterate over matched FASTA and QUAL files as SeqRecord objects.
993 For example, consider this short QUAL file::
995 >EAS54_6_R1_2_1_413_324
996 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
998 >EAS54_6_R1_2_1_540_792
999 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1001 >EAS54_6_R1_2_1_443_348
1002 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1005 And a matching FASTA file::
1007 >EAS54_6_R1_2_1_413_324
1008 CCCTTCTTGTCTTCAGCGTTTCTCC
1009 >EAS54_6_R1_2_1_540_792
1010 TTGGCAGGCCAAGGCCGATGGATCA
1011 >EAS54_6_R1_2_1_443_348
1012 GTTGCTTCTGGCGTGGGTGGGGGGG
1014 You can parse these separately using Bio.SeqIO with the "qual" and
1015 "fasta" formats, but then you'll get a group of SeqRecord objects with
1016 no sequence, and a matching group with the sequence but not the
1017 qualities. Because it only deals with one input file handle, Bio.SeqIO
1018 can't be used to read the two files together - but this function can!
1021 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1022 ... open("Quality/example.qual", "rU"))
1023 >>> for record in rec_iter :
1024 ... print record.id, record.seq
1025 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1026 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1027 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1029 As with the FASTQ or QUAL parsers, if you want to look at the qualities,
1030 they are in each record's per-letter-annotation dictionary as a simple
1033 >>> print record.letter_annotations["phred_quality"]
1034 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1036 If you have access to data as a FASTQ format file, using that directly
1037 would be simpler and more straight forward. Note that you can easily use
1038 this function to convert paired FASTA and QUAL files into FASTQ files:
1040 >>> from Bio import SeqIO
1041 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1042 ... open("Quality/example.qual", "rU"))
1043 >>> out_handle = open("Quality/temp.fastq", "w")
1044 >>> SeqIO.write(rec_iter, out_handle, "fastq")
1046 >>> out_handle.close()
1048 And don't forget to clean up the temp file if you don't need it anymore:
1051 >>> os.remove("Quality/temp.fastq")
1053 from Bio.SeqIO.FastaIO import FastaIterator
1054 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, \
1055 title2ids=title2ids)
1056 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, \
1057 title2ids=title2ids)
1059 #Using zip(...) would create a list loading everything into memory!
1060 #It would also not catch any extra records found in only one file.
1063 f_rec = fasta_iter.next()
1064 except StopIteration :
1067 q_rec = qual_iter.next()
1068 except StopIteration :
1070 if f_rec is None and q_rec is None :
1074 raise ValueError("FASTA file has more entries than the QUAL file.")
1076 raise ValueError("QUAL file has more entries than the FASTA file.")
1077 if f_rec.id != q_rec.id :
1078 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." \
1079 % (f_rec.id, q_rec.id))
1080 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]) :
1081 raise ValueError("Sequence length and number of quality scores disagree for %s" \
1084 f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations["phred_quality"]
1090 """Run the Bio.SeqIO module's doctests.
1092 This will try and locate the unit tests directory, and run the doctests
1093 from there in order that the relative paths used in the examples work.
1097 if os.path.isdir(os.path.join("..","..","Tests")) :
1098 print "Runing doctests..."
1099 cur_dir = os.path.abspath(os.curdir)
1100 os.chdir(os.path.join("..","..","Tests"))
1101 assert os.path.isfile("Quality/example.fastq")
1102 assert os.path.isfile("Quality/example.fasta")
1103 assert os.path.isfile("Quality/example.qual")
1104 assert os.path.isfile("Quality/tricky.fastq")
1105 assert os.path.isfile("Quality/solexa.fastq")
1111 if __name__ == "__main__" :