1 % --------------------------------------------------------------
3 % SRE, Wed Jul 14 17:54:59 1999
5 % --------------------------------------------------------------
7 \chapter {Sequence file formats}
11 The software can handle a number of different file formats. By
12 default, it autodetects the file format, so you don't have to worry
13 about converting formats. Most common file formats are recognized,
14 including FASTA, Genbank, EMBL, Swissprot, PIR, and FASTA for
15 unaligned sequences, and GCG MSF, Clustal, Phylip, and Stockholm
16 format for multiple sequence alignments. Some parts of the source code
17 call the autodetector the ``Babelfish''.
19 The Babelfish has three drawbacks. First, it takes a small amount of
20 time to do the autodetection. Second, the Babelfish is aggressive, and
21 it makes mistakes when a file isn't one of the known formats -- in
22 particular, it can recognize plain text files as SELEX alignments,
23 because the SELEX format is so free-form. Third, because the Babelfish
24 works by reading the first part of the file then rewinding it before
25 starting to process it, you can't use the Babelfish on a nonrewindable
26 stream: e.g. when you're taking sequence input from a UNIX pipe
27 instead of a file, or when the file is gzipped and has to be
28 decompressed before processing. In normal use, when you're using the
29 software interactively from the command line on sequence files that
30 you're familiar with, the Babelfish is very convenient and
33 However, you'll find that there are times when you want to override
34 the Babelfish -- particularly in high-throughput analysis, when you
35 know the format your files are supposed to be in, and you'd rather
36 increase robustness and sacrifice interactive flexibility. All the
37 programs have an \verb+--informat <format>+ option that lets you
38 specify the format and shut off the Babelfish. You \emph{must} use
39 \verb+--informat+ to use compressed files, or to read sequence from a
40 UNIX pipe... see below for more details on these tricks.
42 \section{Formats recognized by the Babelfish}
44 Recognized unaligned sequence file formats:
46 \begin{tabular}{ll}\hline
47 Format name & Note \\ \hline
48 fasta & BLAST flatfile databases, etc.\\
49 genbank & NCBI Genbank flat file format.\\
50 embl & Includes both EMBL (DNA) and SWISSPROT (protein) databases.\\
51 pir & Protein Information Resource database (NBRF/Georgetown)\\
52 gcg & Wisconsin Genetics Computer Group; only allows one sequence per file.\\
53 gcgdata & I think this GCG database format is obsolete now.\\ \hline
56 Recognized multiple sequence alignment file formats:
58 \begin{tabular}{ll}\hline
59 Format name & Note \\ \hline
60 stockholm & Pfam format. Allows databases of more than one alignment per file\\
61 selex & Old NeXagen RNA alignment format, adopted by early HMMER releases.\\
62 msf & GCG's alignment format.\\
63 clustal & ClustalV, ClustalW, and friends.\\
64 a2m & Aligned FASTA format; see comment below.\\
65 phylip & Format used by Felsenstein's PHYLIP phylogenetic inference software\\\hline
68 Aligned FASTA format (here called ``A2M'', though I believe that what
69 Haussler's group at UCSC started calling A2M is yet another variant of
70 aligned FASTA that's incompatible with this A2M) is only autodetected
71 when an alignment file is expected. Otherwise an A2M file will be
72 recognized as unaligned FASTA, and its gap characters (if any) will be
73 parsed as sequence characters -- often not what you want.
75 Alignment files may be used when unaligned files are expected -- the
76 sequences will silently be de-aligned and read sequentially. The
77 converse is not true; you can't give an unaligned sequence format when
78 an alignment is expected (makes sense, right?).
80 There is no provision for enforcing that single unaligned sequence
81 formats really do contain just a single sequence. An attempt to
82 convert a multisequence file to GCG format will silently ``succeed'',
83 and the file may look ok to your eye, but that multisequence ``GCG''
84 file is illegal. The data will be corrupted if you try to read that
85 file back in, possibly without generating any error messages.
87 It turns out that other formats work too, but they're undocumented,
88 not subjected to any quality control testing at software release time,
89 and prone to change without notice at my slightest whim. (In other
90 words, even less supported than the software already is.) The brave,
91 curious, or desperate are invited to peruse
92 \prog{seqio.c} and \prog{squid.h}.
94 \section{Special tricks}
96 \subsection{Reading from standard input (probably UNIX-only)}
98 If you give ``-'' as a sequence filename, the software will read the
99 sequences from standard input rather than from a file. You will need
100 to specify the format of the incoming data using the
101 \verb+--informat+ option.
102 Any format except SELEX can be read from standard input. This lets you
103 use any program downstream in a standard UNIX pipe.
105 There is one limitation: you can't use ``-'' more than once on a
106 command line, for obvious reasons. (How is it supposed to read more
107 than one file from one standard input stream?) If you do, behavior of
108 the software is undefined -- in other words, the software don't check
109 for whether you're making this mistake, so God help you if you do.
111 \subsection{Reading from gzip'ed files (probably UNIX-only)}
113 A sequence file in any format except SELEX can be compressed by gzip,
114 and read in its compressed form. The software looks for the suffix
115 \prog{.gz} to detect gzip'ed files. This allows you to save disk space
116 by keeping sequence files gzip'ed, if you like. gzip is not built in;
117 the software needs to find a gzip executable in your current PATH.
119 If for some reason you name a file with a \prog{.gz} suffix and it's
120 \emph{not} a gzip-compressed file, the software will still try to
121 decompress it, and peculiar things may happen.
123 \section{FASTA format, the recommended unaligned format}
125 FASTA is probably the simplest of formats for unaligned sequences.
126 FASTA files are easily created in a text editor. Each sequence is
127 preceded by a line starting with \verb+>+. The first word on this line
128 is the name of the sequence. The rest of the line is a description of
129 the sequence (free format). The remaining lines contain the sequence
130 itself. You can put as many letters on a sequence line as you want.
132 \textbf{Example of a simple FASTA file:}
134 >seq1 This is the description of my first sequence.
135 AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA CGACGTAGATGCTAGCTGACTCGATGC
136 >seq2 This is a description of my second sequence.
137 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG CATCGTCAGTTACTGCATGCTCG
140 For better or worse, FASTA is not a documented standard. Minor (and
141 major) variants are in widespread use in the bioinformatics community,
142 all of which are called ``FASTA format''. My software attempts to
143 cater to all of them, and is tolerant of common deviations in FASTA
144 format. Certainly anything that is accepted by the database formatting
145 programs in NCBI BLAST or WU-BLAST (e.g. setdb, pressdb, xdformat)
146 will also be accepted by my software. Blank lines in a FASTA file are
147 ignored, and so are spaces or other gap symbols (dashes, underscores,
148 periods) in a sequence. Other non-amino or non-nucleic acid symbols in
149 the sequence are also silently ignored, mostly because some people
150 seem to think that ``*'' or ``.'' should be added to protein sequences
151 to (redundantly) indicate the end of the sequence. The parser will
152 also accept unlimited line lengths, which allows it to accomodate the
153 enormous description lines in the NCBI NR databases.
155 On the other hand, any FASTA files \emph{generated} by my software
156 adhere closely to community standards, and should be usable by other
157 software packages (BLAST, FASTA, etc.) that are more picky about
158 parsing their input files. That means you can run a sloppy FASTA file
159 thru \prog{sreformat} to clean it up.
161 Partly because of this tolerance, the software may have a difficult
162 time dealing with files that are \textit{not} in FASTA format,
163 especially if you're relying on the Babelfish to do format
164 autodetection. Some (now mercifully uncommon) file formats are so
165 similar to FASTA format that they be erroneously called FASTA by the
166 Babelfish and then quietly but lethally misparsed. An example is the
167 old NBRF file format. If you're using \verb+--informat+, things will
168 be more robust, and the software should simply refuse to accept a
169 non-FASTA file -- but you shouldn't count on this, because files
170 perversely similar to FASTA will still confuse the parser. (The gist
171 of these caveats applies to all formats, not just FASTA.)
173 \section{SELEX, the quick and dirty alignment format}
175 An example of a simple SELEX alignment file:
189 SELEX is an interleaved multiple alignment format that arose as a
190 simple, intuitive format that was easy to write and manipulate
191 manually in a text editor. It is usually easy to convert other
192 alignment formats into SELEX format, even with a couple of lines of
193 Perl, but it can be harder to go the other way, since SELEX is more
194 free-format than other alignment formats. For instance, GCG's MSF
195 format and the output of the CLUSTALV multiple alignment program are
196 similar interleaved formats that can be converted to SELEX just by
197 stripping a small number of non-sequence lines out. Because SELEX
198 evolved to accomodate different user input styles, it is very tolerant
199 of various inconsistencies such as different gap symbols, varying line
202 Each line contains a name, followed by the aligned sequence. A space,
203 dash, underscore, or period denotes a gap. If the alignment is too
204 long to fit on one line, the alignment is split into multiple blocks,
205 separated by blank lines. The number of sequences, their order, and
206 their names must be the same in every block (even if a sequence has no
207 residues in a given block!) Other blank lines are ignored. You can add
208 comments to the file on lines starting with a \verb+#+.
210 SELEX stands for ``Systematic Evolution of Ligands by Exponential
211 Enrichment'' -- it refers to the Tuerk and Gold technology for
212 evolving families of small RNAs for particular functions
213 \cite{Tuerk90b}. SELEX files were what we used to keep track of
214 alignments of these small RNA families, at a company then called
215 NeXagen, in Boulder. It's an interesting piece of historical baggage.
216 With the development of HMMER and more need for annotated alignments
217 in Pfam, SELEX format later evolved into ``extended SELEX'', with a
218 reserved comment style that allowed structural markup and other
219 annotations, but that became unwieldy. We now use Stockholm format
220 (see below) for highly annotated alignments. (Extended SELEX is
221 deprecated and undocumented.) Still, the basic SELEX format remains a
222 useful ``lowest common denominator'' alignment format, and has been
225 \subsubsection {Detailed specification of a SELEX file}
229 Any line beginning with a \verb+#=+ as the first two characters is a
230 parsed machine comment in extended SELEX, and is now deprecated.
233 All other lines beginning with a \verb+%+ or \verb+#+ as the first
234 character are user comments. User comments are ignored by all
235 software. Anything may appear on these lines. Any number of comments
236 may be included in a SELEX file, and at any point.
239 Lines of data consist of a name followed by a sequence. The total
240 length of the line must be smaller than 4096 characters.
243 Names must be a single word. Any non-whitespace characters are
244 accepted. No spaces are tolerated in names: names MUST be a
245 single word. Names must be less than 32 characters long.
247 \item In the sequence, any of the characters \verb+-_.+ or a space are
248 recognized as gaps. Any other characters are interpreted as sequence.
249 Sequence is case-sensitive. There is a common assumption by my
250 software that upper-case symbols are used for consensus (match)
251 positions and lower-case symbols are used for inserts. This language
252 of ``match'' versus ``insert'' comes from the hidden Markov model
253 formalism \cite{Krogh94}. To almost all of my software, this isn't
254 important, and it immediately converts the sequence to all upper-case
258 Multiple different sequences are grouped in a block of data lines.
259 Blocks are separated by blank lines. No blank lines are tolerated
260 between the sequence lines in a block. Each block in a multi-block
261 file of a long alignment must have its sequences in the same order in
262 each block. The names are checked to verify that this is the case; if
263 not, only a warning is generated. (In manually constructed files, some
264 users may wish to use shorthand names in subsequent blocks after an
265 initial block with full names -- but this isn't recommended.)
268 \section{Stockholm, the recommended multiple sequence alignment format}
270 While we recommend a community standard format (FASTA) for unaligned
271 sequence files, the recommended multiple alignment file format is not
272 a community standard. The Pfam Consortium developed a format (based
273 on extended SELEX) called ``Stockholm format''. The reasons for this
274 are two-fold. First, there really is no standard accepted format for
275 multiple sequence alignment files, so we don't feel guilty about
276 inventing a new one. Second, the formats of popular multiple alignment
277 software (e.g. CLUSTAL, GCG MSF, PHYLIP) do not support rich
278 documentation and markup of the alignment. Stockholm format was
279 developed to support extensible markup of multiple sequence
280 alignments, and we use this capability extensively in both RNA work
281 (with structural markup) and the Pfam database (with extensive use of
282 both annotation and markup).
284 \subsection{A minimal Stockholm file}
298 The simplest Stockholm file is pretty intuitive, easily generated in a
299 text editor. It is usually easy to convert alignment formats into a
300 ``least common denominator'' Stockholm format. For instance, SELEX,
301 GCG's MSF format, and the output of the CLUSTALV multiple alignment
302 program are all similar interleaved formats.
304 The first line in the file must be \verb+# STOCKHOLM 1.x+, where
305 \verb+x+ is a minor version number for the format specification
306 (and which currently has no effect on my parsers). This line allows a
307 parser to instantly identify the file format.
309 In the alignment, each line contains a name, followed by the aligned
310 sequence. A dash or period denotes a gap. If the alignment is too long
311 to fit on one line, the alignment may be split into multiple blocks,
312 with blocks separated by blank lines. The number of sequences, their
313 order, and their names must be the same in every block. Within a given
314 block, each (sub)sequence (and any associated \verb+#=GR+ and
315 \verb+#=GC+ markup, see below) is of equal length, called the
316 \textit{block length}. Block lengths may differ from block to block;
317 the block length must be at least one residue, and there is no
320 Other blank lines are ignored. You can add comments to the file on
321 lines starting with a \verb+#+.
323 All other annotation is added using a tag/value comment style. The
324 tag/value format is inherently extensible, and readily made
325 backwards-compatible; unrecognized tags will simply be ignored. Extra
326 annotation includes consensus and individual RNA or protein secondary
327 structure, sequence weights, a reference coordinate system for the
328 columns, and database source information including name, accession
329 number, and coordinates (for subsequences extracted from a longer
330 source sequence) See below for details.
332 \subsection{Syntax of Stockholm markup}
334 There are four types of Stockholm markup annotation, for per-file,
335 per-sequence, per-column, and per-residue annotation:
338 \item {\emprog{#=GF <tag> <s>}}
339 Per-file annotation. \prog{<s>} is a free format text line
340 of annotation type \prog{<tag>}. For example, \prog{#=GF DATE
341 April 1, 2000}. Can occur anywhere in the file, but usually
342 all the \prog{#=GF} markups occur in a header.
344 \item {\emprog{#=GS <seqname> <tag> <s>}}
345 Per-sequence annotation. \prog{<s>} is a free format text line
346 of annotation type \prog{tag} associated with the sequence
347 named \prog{<seqname>}. For example, \prog{#=GS seq1
348 SPECIES_SOURCE Caenorhabditis elegans}. Can occur anywhere
349 in the file, but in single-block formats (e.g. the Pfam
350 distribution) will typically follow on the line after the
351 sequence itself, and in multi-block formats (e.g. HMMER
352 output), will typically occur in the header preceding the
353 alignment but following the \prog{#=GF} annotation.
355 \item {\emprog{#=GC <tag> <...s...>}
356 Per-column annotation. \prog{<...s...>} is an aligned text line
357 of annotation type \prog{<tag>}.
358 \verb+#=GC+ lines are
359 associated with a sequence alignment block; \prog{<...s...>}
360 is aligned to the residues in the alignment block, and has
361 the same length as the rest of the block.
362 Typically \verb+#=GC+ lines are placed at the end of each block.
364 \item {\emprog{#=GR <seqname> <tag> <.....s.....>}
365 Per-residue annotation. \prog{<...s...>} is an aligned text line
366 of annotation type \prog{<tag>}, associated with the sequence
367 named \prog{<seqname>}.
368 \verb+#=GR+ lines are
369 associated with one sequence in a sequence alignment block;
371 is aligned to the residues in that sequence, and has
372 the same length as the rest of the block.
374 \verb+#=GR+ lines are placed immediately following the
375 aligned sequence they annotate.
378 \subsection{Semantics of Stockholm markup}
380 Any Stockholm parser will accept syntactically correct files, but is
381 not obligated to do anything with the markup lines. It is up to the
382 application whether it will attempt to interpret the meaning (the
383 semantics) of the markup in a useful way. At the two extremes are the
384 Belvu alignment viewer and the HMMER profile hidden Markov model
387 Belvu simply reads Stockholm markup and displays it, without trying to
388 interpret it at all. The tag types (\prog{#=GF}, etc.) are sufficient
389 to tell Belvu how to display the markup: whether it is attached to the
390 whole file, sequences, columns, or residues.
392 HMMER uses Stockholm markup to pick up a variety of information from
393 the Pfam multiple alignment database. The Pfam consortium therefore
394 agrees on additional syntax for certain tag types, so HMMER can parse
395 some markups for useful information. This additional syntax is imposed
396 by Pfam, HMMER, and other software of mine, not by Stockholm format
397 per se. You can think of Stockholm as akin to XML, and what my
398 software reads as akin to an XML DTD, if you're into that sort of
399 structured data format lingo.
401 The Stockholm markup tags that are parsed semantically by my software
404 \subsubsection{Recognized #=GF annotations}
406 \item [\emprog{ID <s>}]
407 Identifier. \emprog{<s>} is a name for the alignment;
408 e.g. ``rrm''. One word. Unique in file.
410 \item [\emprog{AC <s>}]
411 Accession. \emprog{<s>} is a unique accession number for the
413 ``PF00001''. Used by the Pfam database, for instance.
414 Often a alphabetical prefix indicating the database
415 (e.g. ``PF'') followed by a unique numerical accession.
416 One word. Unique in file.
418 \item [\emprog{DE <s>}]
419 Description. \emprog{<s>} is a free format line giving
420 a description of the alignment; e.g.
421 ``RNA recognition motif proteins''. One line. Unique in file.
423 \item [\emprog{AU <s>}]
424 Author. \emprog{<s>} is a free format line listing the
425 authors responsible for an alignment; e.g.
426 ``Bateman A''. One line. Unique in file.
428 \item [\emprog{GA <f> <f>}]
429 Gathering thresholds. Two real numbers giving HMMER bit score
430 per-sequence and per-domain cutoffs used in gathering the
431 members of Pfam full alignments. See Pfam and HMMER
432 documentation for more detail.
434 \item [\emprog{NC <f> <f>}]
435 Noise cutoffs. Two real numbers giving HMMER bit score
436 per-sequence and per-domain cutoffs, set according to the
437 highest scores seen for unrelated sequences when gathering
438 members of Pfam full alignments. See Pfam and HMMER
439 documentation for more detail.
441 \item [\emprog{TC <f> <f>}]
442 Trusted cutoffs. Two real numbers giving HMMER bit score
443 per-sequence and per-domain cutoffs, set according to the
444 lowest scores seen for true homologous sequences that
445 were above the GA gathering thresholds, when gathering
446 members of Pfam full alignments. See Pfam and HMMER
447 documentation for more detail.
450 \subsection{Recognized #=GS annotations}
453 \item [\emprog{WT <f>}]
454 Weight. \emprog{<f>} is a positive real number giving the
455 relative weight for a sequence, usually used to compensate
456 for biased representation by downweighting similar sequences.
457 Usually the weights average 1.0 (e.g. the weights sum to
458 the number of sequences in the alignment) but this is not
459 required. Either every sequence must have a weight annotated,
462 \item [\emprog{AC <s>}]
463 Accession. \emprog{<s>} is a database accession number for
464 this sequence. (Compare the \prog{#=GF AC} markup, which gives
465 an accession for the whole alignment.) One word.
467 \item [\emprog{DE <s>}]
468 Description. \emprog{<s>} is one line giving a description for
469 this sequence. (Compare the \prog{#=GF DE} markup, which gives
470 a description for the whole alignment.)
474 \subsection{Recognized #=GC annotations}
478 Reference line. Any character is accepted as a markup for a
479 column. The intent is to allow labeling the columns with some
482 \item [\emprog{SS_cons}]
483 Secondary structure consensus. For protein alignments,
484 DSSP codes or gaps are accepted as markup: [HGIEBTSCX.-_], where
485 H is alpha helix, G is 3/10-helix, I is p-helix, E is extended
486 strand, B is a residue in an isolated b-bridge, T is a turn,
487 S is a bend, C is a random coil or loop, and X is unknown
488 (for instance, a residue that was not resolved in a crystal
489 structure). For RNA alignments
490 the symbols \verb+>+ and \verb+<+ are
491 used for base pairs (pairs point at each other). \verb-+- indicate
492 definitely single-stranded positions, and any gap symbol indicates
493 unassigned bases or single-stranded positions. This description
494 roughly follows \cite{Konings89}.
495 RNA pseudoknots are represented by alphabetic characters, with upper
496 case letters representing the 5' side of the helix and lower case
497 letters representing the 3' side. Note that this limits the
498 annotation to a maximum of 26 pseudoknots per sequence.
501 \item [\emprog{SA_cons}]
502 Surface accessibility consensus. 0-9, gap symbols, or X are
503 accepted as markup. 0 means <10\% accessible residue surface
504 area, 1 means <20\%, 9 means <100\%, etc. X means unknown
508 \subsection{Recognized #=GR annotations}
512 Secondary structure consensus. See \prog{#=GC SS_cons} above.
514 Surface accessibility consensus. See \prog{#=GC SA_cons} above.