forester/archive/RIO/others/hmmer/squid/Docs/formats.tex

   1 % --------------------------------------------------------------
   2 % squid:formats.tex
   3 % SRE, Wed Jul 14 17:54:59 1999
   4 % $CVS Id$
   5 % --------------------------------------------------------------
   6
   7 \chapter {Sequence file formats}
   8
   9 \section{Summary}
  10
  11 The software can handle a number of different file formats. By
  12 default, it autodetects the file format, so you don't have to worry
  13 about converting formats. Most common file formats are recognized,
  14 including FASTA, Genbank, EMBL, Swissprot, PIR, and FASTA for
  15 unaligned sequences, and GCG MSF, Clustal, Phylip, and Stockholm
  16 format for multiple sequence alignments. Some parts of the source code
  17 call the autodetector the ``Babelfish''.
  18
  19 The Babelfish has three drawbacks. First, it takes a small amount of
  20 time to do the autodetection. Second, the Babelfish is aggressive, and
  21 it makes mistakes when a file isn't one of the known formats -- in
  22 particular, it can recognize plain text files as SELEX alignments,
  23 because the SELEX format is so free-form. Third, because the Babelfish
  24 works by reading the first part of the file then rewinding it before
  25 starting to process it, you can't use the Babelfish on a nonrewindable
  26 stream: e.g. when you're taking sequence input from a UNIX pipe
  27 instead of a file, or when the file is gzipped and has to be
  28 decompressed before processing.  In normal use, when you're using the
  29 software interactively from the command line on sequence files that
  30 you're familiar with, the Babelfish is very convenient and
  31 (relatively) safe.
  32
  33 However, you'll find that there are times when you want to override
  34 the Babelfish -- particularly in high-throughput analysis, when you
  35 know the format your files are supposed to be in, and you'd rather
  36 increase robustness and sacrifice interactive flexibility. All the
  37 programs have an \verb+--informat <format>+ option that lets you
  38 specify the format and shut off the Babelfish. You \emph{must} use
  39 \verb+--informat+ to use compressed files, or to read sequence from a
  40 UNIX pipe... see below for more details on these tricks.
  41
  42 \section{Formats recognized by the Babelfish}
  43
  44 Recognized unaligned sequence file formats:
  45
  46 \begin{tabular}{ll}\hline
  47 Format name   & Note \\ \hline
  48 fasta         & BLAST flatfile databases, etc.\\
  49 genbank       & NCBI Genbank flat file format.\\
  50 embl          & Includes both EMBL (DNA) and SWISSPROT (protein) databases.\\
  51 pir           & Protein Information Resource database (NBRF/Georgetown)\\
  52 gcg           & Wisconsin Genetics Computer Group; only allows one sequence per file.\\
  53 gcgdata       & I think this GCG database format is obsolete now.\\ \hline
  54 \end{tabular}
  55
  56 Recognized multiple sequence alignment file formats:
  57
  58 \begin{tabular}{ll}\hline
  59 Format name   & Note \\ \hline
  60 stockholm     & Pfam format. Allows databases of more than one alignment per file\\
  61 selex         & Old NeXagen RNA alignment format, adopted by early HMMER releases.\\
  62 msf           & GCG's alignment format.\\
  63 clustal       & ClustalV, ClustalW, and friends.\\
  64 a2m           & Aligned FASTA format; see comment below.\\
  65 phylip        & Format used by Felsenstein's PHYLIP phylogenetic inference software\\\hline
  66 \end{tabular}
  67
  68 Aligned FASTA format (here called ``A2M'', though I believe that what
  69 Haussler's group at UCSC started calling A2M is yet another variant of
  70 aligned FASTA that's incompatible with this A2M) is only autodetected
  71 when an alignment file is expected. Otherwise an A2M file will be
  72 recognized as unaligned FASTA, and its gap characters (if any) will be
  73 parsed as sequence characters -- often not what you want.
  74
  75 Alignment files may be used when unaligned files are expected -- the
  76 sequences will silently be de-aligned and read sequentially. The
  77 converse is not true; you can't give an unaligned sequence format when
  78 an alignment is expected (makes sense, right?).
  79
  80 There is no provision for enforcing that single unaligned sequence
  81 formats really do contain just a single sequence.  An attempt to
  82 convert a multisequence file to GCG format will silently ``succeed'',
  83 and the file may look ok to your eye, but that multisequence ``GCG''
  84 file is illegal. The data will be corrupted if you try to read that
  85 file back in, possibly without generating any error messages.
  86
  87 It turns out that other formats work too, but they're undocumented,
  88 not subjected to any quality control testing at software release time,
  89 and prone to change without notice at my slightest whim. (In other
  90 words, even less supported than the software already is.) The brave,
  91 curious, or desperate are invited to peruse
  92 \prog{seqio.c} and \prog{squid.h}.
  93
  94 \section{Special tricks}
  95
  96 \subsection{Reading from standard input (probably UNIX-only)}
  97
  98 If you give ``-'' as a sequence filename, the software will read the
  99 sequences from standard input rather than from a file. You will need
 100 to specify the format of the incoming data using the
 101 \verb+--informat+ option.
 102 Any format except SELEX can be read from standard input. This lets you
 103 use any program downstream in a standard UNIX pipe.
 104
 105 There is one limitation: you can't use ``-'' more than once on a
 106 command line, for obvious reasons. (How is it supposed to read more
 107 than one file from one standard input stream?) If you do, behavior of
 108 the software is undefined -- in other words, the software don't check
 109 for whether you're making this mistake, so God help you if you do.
 110
 111 \subsection{Reading from gzip'ed files (probably UNIX-only)}
 112
 113 A sequence file in any format except SELEX can be compressed by gzip,
 114 and read in its compressed form. The software looks for the suffix
 115 \prog{.gz} to detect gzip'ed files. This allows you to save disk space
 116 by keeping sequence files gzip'ed, if you like. gzip is not built in;
 117 the software needs to find a gzip executable in your current PATH.
 118
 119 If for some reason you name a file with a \prog{.gz} suffix and it's
 120 \emph{not} a gzip-compressed file, the software will still try to
 121 decompress it, and peculiar things may happen.
 122
 123 \section{FASTA format, the recommended unaligned format}
 124
 125 FASTA is probably the simplest of formats for unaligned sequences.
 126 FASTA files are easily created in a text editor.  Each sequence is
 127 preceded by a line starting with \verb+>+. The first word on this line
 128 is the name of the sequence. The rest of the line is a description of
 129 the sequence (free format). The remaining lines contain the sequence
 130 itself. You can put as many letters on a sequence line as you want.
 131
 132 \textbf{Example of a simple FASTA file:}
 133 \begin{verbatim}
 134 >seq1 This is the description of my first sequence.
 135 AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA CGACGTAGATGCTAGCTGACTCGATGC
 136 >seq2 This is a description of my second sequence.
 137 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG CATCGTCAGTTACTGCATGCTCG
 138 \end{verbatim}
 139
 140 For better or worse, FASTA is not a documented standard. Minor (and
 141 major) variants are in widespread use in the bioinformatics community,
 142 all of which are called ``FASTA format''. My software attempts to
 143 cater to all of them, and is tolerant of common deviations in FASTA
 144 format. Certainly anything that is accepted by the database formatting
 145 programs in NCBI BLAST or WU-BLAST (e.g. setdb, pressdb, xdformat)
 146 will also be accepted by my software. Blank lines in a FASTA file are
 147 ignored, and so are spaces or other gap symbols (dashes, underscores,
 148 periods) in a sequence. Other non-amino or non-nucleic acid symbols in
 149 the sequence are also silently ignored, mostly because some people
 150 seem to think that ``*'' or ``.'' should be added to protein sequences
 151 to (redundantly) indicate the end of the sequence. The parser will
 152 also accept unlimited line lengths, which allows it to accomodate the
 153 enormous description lines in the NCBI NR databases.
 154
 155 On the other hand, any FASTA files \emph{generated} by my software
 156 adhere closely to community standards, and should be usable by other
 157 software packages (BLAST, FASTA, etc.) that are more picky about
 158 parsing their input files. That means you can run a sloppy FASTA file
 159 thru \prog{sreformat} to clean it up.
 160
 161 Partly because of this tolerance, the software may have a difficult
 162 time dealing with files that are \textit{not} in FASTA format,
 163 especially if you're relying on the Babelfish to do format
 164 autodetection.  Some (now mercifully uncommon) file formats are so
 165 similar to FASTA format that they be erroneously called FASTA by the
 166 Babelfish and then quietly but lethally misparsed. An example is the
 167 old NBRF file format. If you're using \verb+--informat+, things will
 168 be more robust, and the software should simply refuse to accept a
 169 non-FASTA file -- but you shouldn't count on this, because files
 170 perversely similar to FASTA will still confuse the parser.  (The gist
 171 of these caveats applies to all formats, not just FASTA.)
 172
 173 \section{SELEX, the quick and dirty alignment format}
 174
 175 An example of a simple SELEX alignment file:
 176
 177 \begin{verbatim}
 178 # Example selex file
 179
 180 seq1     ACGACGACGACG.
 181 seq2     ..GGGAAAGG.GA
 182 seq3     UUU..AAAUUU.A
 183
 184 seq1  ..ACG
 185 seq2  AAGGG
 186 seq3  AA...UUU
 187 \end{verbatim}
 188
 189 SELEX is an interleaved multiple alignment format that arose as a
 190 simple, intuitive format that was easy to write and manipulate
 191 manually in a text editor. It is usually easy to convert other
 192 alignment formats into SELEX format, even with a couple of lines of
 193 Perl, but it can be harder to go the other way, since SELEX is more
 194 free-format than other alignment formats. For instance, GCG's MSF
 195 format and the output of the CLUSTALV multiple alignment program are
 196 similar interleaved formats that can be converted to SELEX just by
 197 stripping a small number of non-sequence lines out. Because SELEX
 198 evolved to accomodate different user input styles, it is very tolerant
 199 of various inconsistencies such as different gap symbols, varying line
 200 lengths, etc.
 201
 202 Each line contains a name, followed by the aligned sequence. A space,
 203 dash, underscore, or period denotes a gap. If the alignment is too
 204 long to fit on one line, the alignment is split into multiple blocks,
 205 separated by blank lines. The number of sequences, their order, and
 206 their names must be the same in every block (even if a sequence has no
 207 residues in a given block!) Other blank lines are ignored. You can add
 208 comments to the file on lines starting with a \verb+#+.
 209
 210 SELEX stands for ``Systematic Evolution of Ligands by Exponential
 211 Enrichment'' -- it refers to the Tuerk and Gold technology for
 212 evolving families of small RNAs for particular functions
 213 \cite{Tuerk90b}. SELEX files were what we used to keep track of
 214 alignments of these small RNA families, at a company then called
 215 NeXagen, in Boulder. It's an interesting piece of historical baggage.
 216 With the development of HMMER and more need for annotated alignments
 217 in Pfam, SELEX format later evolved into ``extended SELEX'', with a
 218 reserved comment style that allowed structural markup and other
 219 annotations, but that became unwieldy. We now use Stockholm format
 220 (see below) for highly annotated alignments. (Extended SELEX is
 221 deprecated and undocumented.) Still, the basic SELEX format remains a
 222 useful ``lowest common denominator'' alignment format, and has been
 223 retained.
 224
 225 \subsubsection {Detailed specification of a SELEX file}
 226
 227 \begin{enumerate}
 228 \item
 229 Any line beginning with a \verb+#=+ as the first two characters is a
 230 parsed machine comment in extended SELEX, and is now deprecated.
 231
 232 \item
 233 All other lines beginning with a \verb+%+ or \verb+#+ as the first
 234 character are user comments.  User comments are ignored by all
 235 software. Anything may appear on these lines. Any number of comments
 236 may be included in a SELEX file, and at any point.
 237
 238 \item
 239 Lines of data consist of a name followed by a sequence. The total
 240 length of the line must be smaller than 4096 characters.
 241
 242 \item
 243 Names must be a single word. Any non-whitespace characters are
 244 accepted.  No spaces are tolerated in names: names MUST be a
 245 single word. Names must be less than 32 characters long.
 246
 247 \item In the sequence, any of the characters \verb+-_.+ or a space are
 248 recognized as gaps. Any other characters are interpreted as sequence.
 249 Sequence is case-sensitive. There is a common assumption by my
 250 software that upper-case symbols are used for consensus (match)
 251 positions and lower-case symbols are used for inserts. This language
 252 of ``match'' versus ``insert'' comes from the hidden Markov model
 253 formalism \cite{Krogh94}. To almost all of my software, this isn't
 254 important, and it immediately converts the sequence to all upper-case
 255 after it's read.
 256
 257 \item
 258 Multiple different sequences are grouped in a block of data lines.
 259 Blocks are separated by blank lines. No blank lines are tolerated
 260 between the sequence lines in a block. Each block in a multi-block
 261 file of a long alignment must have its sequences in the same order in
 262 each block. The names are checked to verify that this is the case; if
 263 not, only a warning is generated. (In manually constructed files, some
 264 users may wish to use shorthand names in subsequent blocks after an
 265 initial block with full names -- but this isn't recommended.)
 266 \end{enumerate}
 267
 268 \section{Stockholm, the recommended multiple sequence alignment format}
 269
 270 While we recommend a community standard format (FASTA) for unaligned
 271 sequence files, the recommended multiple alignment file format is not
 272 a community standard.  The Pfam Consortium developed a format (based
 273 on extended SELEX) called ``Stockholm format''.  The reasons for this
 274 are two-fold. First, there really is no standard accepted format for
 275 multiple sequence alignment files, so we don't feel guilty about
 276 inventing a new one. Second, the formats of popular multiple alignment
 277 software (e.g. CLUSTAL, GCG MSF, PHYLIP) do not support rich
 278 documentation and markup of the alignment.  Stockholm format was
 279 developed to support extensible markup of multiple sequence
 280 alignments, and we use this capability extensively in both RNA work
 281 (with structural markup) and the Pfam database (with extensive use of
 282 both annotation and markup).
 283
 284 \subsection{A minimal Stockholm file}
 285 \begin{verbatim}
 286 # STOCKHOLM 1.0
 287
 288 seq1  ACDEF...GHIKL
 289 seq2  ACDEF...GHIKL
 290 seq3  ...EFMNRGHIKL
 291
 292 seq1  MNPQTVWY
 293 seq2  MNPQTVWY
 294 seq3  MNPQT...
 295
 296 \end{verbatim}
 297
 298 The simplest Stockholm file is pretty intuitive, easily generated in a
 299 text editor. It is usually easy to convert alignment formats into a
 300 ``least common denominator'' Stockholm format. For instance, SELEX,
 301 GCG's MSF format, and the output of the CLUSTALV multiple alignment
 302 program are all similar interleaved formats.
 303
 304 The first line in the file must be \verb+# STOCKHOLM 1.x+, where
 305 \verb+x+ is a minor version number for the format specification
 306 (and which currently has no effect on my parsers). This line allows a
 307 parser to instantly identify the file format.
 308
 309 In the alignment, each line contains a name, followed by the aligned
 310 sequence. A dash or period denotes a gap. If the alignment is too long
 311 to fit on one line, the alignment may be split into multiple blocks,
 312 with blocks separated by blank lines. The number of sequences, their
 313 order, and their names must be the same in every block. Within a given
 314 block, each (sub)sequence (and any associated \verb+#=GR+ and
 315 \verb+#=GC+ markup, see below) is of equal length, called the
 316 \textit{block length}. Block lengths may differ from block to block;
 317 the block length must be at least one residue, and there is no
 318 maximum.
 319
 320 Other blank lines are ignored. You can add comments to the file on
 321 lines starting with a \verb+#+.
 322
 323 All other annotation is added using a tag/value comment style. The
 324 tag/value format is inherently extensible, and readily made
 325 backwards-compatible; unrecognized tags will simply be ignored. Extra
 326 annotation includes consensus and individual RNA or protein secondary
 327 structure, sequence weights, a reference coordinate system for the
 328 columns, and database source information including name, accession
 329 number, and coordinates (for subsequences extracted from a longer
 330 source sequence) See below for details.
 331
 332 \subsection{Syntax of Stockholm markup}
 333
 334 There are four types of Stockholm markup annotation, for per-file,
 335 per-sequence, per-column, and per-residue annotation:
 336
 337 \begin{wideitem}
 338 \item {\emprog{#=GF <tag> <s>}}
 339         Per-file annotation. \prog{<s>} is a free format text line
 340         of annotation type \prog{<tag>}. For example, \prog{#=GF DATE
 341         April 1, 2000}. Can occur anywhere in the file, but usually
 342         all the \prog{#=GF} markups occur in a header.
 343
 344 \item {\emprog{#=GS <seqname> <tag> <s>}}
 345         Per-sequence annotation. \prog{<s>} is a free format text line
 346         of annotation type \prog{tag} associated with the sequence
 347         named \prog{<seqname>}. For example, \prog{#=GS seq1
 348         SPECIES_SOURCE Caenorhabditis elegans}. Can occur anywhere
 349         in the file, but in single-block formats (e.g. the Pfam
 350         distribution) will typically follow on the line after the
 351         sequence itself, and in multi-block formats (e.g. HMMER
 352         output), will typically occur in the header preceding the
 353         alignment but following the \prog{#=GF} annotation.
 354
 355 \item {\emprog{#=GC <tag> <...s...>}
 356         Per-column annotation. \prog{<...s...>} is an aligned text line
 357         of annotation type \prog{<tag>}.
 358         \verb+#=GC+ lines are
 359         associated with a sequence alignment block; \prog{<...s...>}
 360         is aligned to the residues in the alignment block, and has
 361         the same length as the rest of the block.
 362         Typically \verb+#=GC+ lines are placed at the end of each block.
 363
 364 \item {\emprog{#=GR <seqname> <tag> <.....s.....>}
 365         Per-residue annotation. \prog{<...s...>} is an aligned text line
 366         of annotation type \prog{<tag>}, associated with the sequence
 367         named \prog{<seqname>}.
 368         \verb+#=GR+ lines are
 369         associated with one sequence in a sequence alignment block;
 370         \prog{<...s...>}
 371         is aligned to the residues in that sequence, and has
 372         the same length as the rest of the block.
 373         Typically
 374         \verb+#=GR+ lines are placed immediately following the
 375         aligned sequence they annotate.
 376 \end{wideitem}
 377
 378 \subsection{Semantics of Stockholm markup}
 379
 380 Any Stockholm parser will accept syntactically correct files, but is
 381 not obligated to do anything with the markup lines. It is up to the
 382 application whether it will attempt to interpret the meaning (the
 383 semantics) of the markup in a useful way. At the two extremes are the
 384 Belvu alignment viewer and the HMMER profile hidden Markov model
 385 software package.
 386
 387 Belvu simply reads Stockholm markup and displays it, without trying to
 388 interpret it at all. The tag types (\prog{#=GF}, etc.) are sufficient
 389 to tell Belvu how to display the markup: whether it is attached to the
 390 whole file, sequences, columns, or residues.
 391
 392 HMMER uses Stockholm markup to pick up a variety of information from
 393 the Pfam multiple alignment database. The Pfam consortium therefore
 394 agrees on additional syntax for certain tag types, so HMMER can parse
 395 some markups for useful information. This additional syntax is imposed
 396 by Pfam, HMMER, and other software of mine, not by Stockholm format
 397 per se. You can think of Stockholm as akin to XML, and what my
 398 software reads as akin to an XML DTD, if you're into that sort of
 399 structured data format lingo.
 400
 401 The Stockholm markup tags that are parsed semantically by my software
 402 are as follows:
 403
 404 \subsubsection{Recognized #=GF annotations}
 405 \begin{wideitem}
 406 \item [\emprog{ID  <s>}]
 407         Identifier. \emprog{<s>} is a name for the alignment;
 408         e.g. ``rrm''. One word. Unique in file.
 409
 410 \item [\emprog{AC  <s>}]
 411         Accession. \emprog{<s>} is a unique accession number for the
 412         alignment; e.g.
 413         ``PF00001''. Used by the Pfam database, for instance.
 414         Often a alphabetical prefix indicating the database
 415         (e.g. ``PF'') followed by a unique numerical accession.
 416         One word. Unique in file.
 417
 418 \item [\emprog{DE  <s>}]
 419         Description. \emprog{<s>} is a free format line giving
 420         a description of the alignment; e.g.
 421         ``RNA recognition motif proteins''. One line. Unique in file.
 422
 423 \item [\emprog{AU  <s>}]
 424         Author. \emprog{<s>} is a free format line listing the
 425         authors responsible for an alignment; e.g.
 426         ``Bateman A''. One line. Unique in file.
 427
 428 \item [\emprog{GA  <f> <f>}]
 429         Gathering thresholds. Two real numbers giving HMMER bit score
 430         per-sequence and per-domain cutoffs used in gathering the
 431         members of Pfam full alignments. See Pfam and HMMER
 432         documentation for more detail.
 433
 434 \item [\emprog{NC  <f> <f>}]
 435         Noise cutoffs. Two real numbers giving HMMER bit score
 436         per-sequence and per-domain cutoffs, set according to the
 437         highest scores seen for unrelated sequences when gathering
 438         members of Pfam full alignments. See Pfam and HMMER
 439         documentation for more detail.
 440
 441 \item [\emprog{TC  <f> <f>}]
 442         Trusted cutoffs. Two real numbers giving HMMER bit score
 443         per-sequence and per-domain cutoffs, set according to the
 444         lowest scores seen for true homologous sequences that
 445         were above the GA gathering thresholds, when gathering
 446         members of Pfam full alignments. See Pfam and HMMER
 447         documentation for more detail.
 448 \end{wideitem}
 449
 450 \subsection{Recognized #=GS annotations}
 451
 452 \begin{wideitem}
 453 \item [\emprog{WT  <f>}]
 454         Weight. \emprog{<f>} is a positive real number giving the
 455         relative weight for a sequence, usually used to compensate
 456         for biased representation by downweighting similar sequences.
 457         Usually the weights average 1.0 (e.g. the weights sum to
 458         the number of sequences in the alignment) but this is not
 459         required. Either every sequence must have a weight annotated,
 460         or none of them can.
 461
 462 \item [\emprog{AC  <s>}]
 463         Accession. \emprog{<s>} is a database accession number for
 464         this sequence. (Compare the \prog{#=GF AC} markup, which gives
 465         an accession for the whole alignment.) One word.
 466
 467 \item [\emprog{DE  <s>}]
 468         Description. \emprog{<s>} is one line giving a description for
 469         this sequence. (Compare the \prog{#=GF DE} markup, which gives
 470         a description for the whole alignment.)
 471 \end{wideitem}
 472
 473
 474 \subsection{Recognized #=GC annotations}
 475
 476 \begin{wideitem}
 477 \item [\emprog{RF}]
 478         Reference line. Any character is accepted as a markup for a
 479         column. The intent is to allow labeling the columns with some
 480         sort of mark.
 481
 482 \item [\emprog{SS_cons}]
 483         Secondary structure consensus. For protein alignments,
 484         DSSP codes or gaps are accepted as markup: [HGIEBTSCX.-_], where
 485         H is alpha helix, G is 3/10-helix, I is p-helix, E is extended
 486         strand, B is a residue in an isolated b-bridge, T is a turn,
 487         S is a bend, C is a random coil or loop, and X is unknown
 488         (for instance, a residue that was not resolved in a crystal
 489         structure). For RNA alignments
 490         the symbols \verb+>+ and \verb+<+ are
 491         used for base pairs (pairs point at each other).  \verb-+- indicate
 492         definitely single-stranded positions, and any gap symbol indicates
 493         unassigned bases or single-stranded positions.  This description
 494         roughly follows \cite{Konings89}.
 495         RNA pseudoknots are represented by alphabetic characters, with upper
 496         case letters representing the 5' side of the helix and lower case
 497         letters representing the 3' side. Note that this limits the
 498         annotation to a maximum of 26 pseudoknots per sequence.
 499
 500
 501 \item [\emprog{SA_cons}]
 502         Surface accessibility consensus. 0-9, gap symbols, or X are
 503         accepted as markup. 0 means <10\% accessible residue surface
 504         area, 1 means <20\%, 9 means <100\%, etc. X means unknown
 505         structure.
 506 \end{wideitem}
 507
 508 \subsection{Recognized #=GR annotations}
 509
 510 \begin{wideitem}
 511 \item [\emprog{SS}]
 512         Secondary structure consensus. See \prog{#=GC SS_cons} above.
 513 \item [\emprog{SA}]
 514         Surface accessibility consensus. See \prog{#=GC SA_cons} above.
 515 \end{wideitem}
 516
 517