\section{ SELEX alignment file format }

\subsection{ Example of a simple SELEX format file}

\begin{verbatim}
# Example selex file

seq1     ACGACGACGACG.
seq2     ..GGGAAAGG.GA
seq3     UUU..AAAUUU.A

seq1  ..ACG
seq2  AAGGG
seq3  AA...UUU
\end{verbatim}

SELEX is an interleaved multiple alignment format that evolved to be
intuitive to read and edit.  SELEX files are easy to write and
manipulate manually with a text editor.  It is usually easy to convert
other alignment formats into SELEX format; the output of the CLUSTALV
multiple alignment program and GCG's MSF format are similar
interleaved formats. Because it evolved to accommodate different user
input styles, it is very tolerant of inconsistencies such as
different gap symbols and varying line lengths.

As the format evolved, more features were added. To maintain
compatibility with older alignment files, the new features use a
reserved comment style. These extra features are usually
maintained by automated SELEX-generating software, such as the
{\tt koala} sequence alignment editor or my {\tt cove} and {\tt hmm}
sequence analysis packages. The extra information includes consensus and
individual RNA or protein secondary structure, per-sequence weights, a
reference coordinate system for the columns, and database source
information including name, accession number, and coordinates (for
subsequences extracted from a longer source sequence).

\subsection {Specification of a SELEX file}

\begin{enumerate}
\item
Any line whose first two characters are \verb+#=+ is a
machine ``comment''.  \verb+#=+ comments are reserved for additional
data about the alignment. Usually these features are maintained by
software such as the {\tt koala} editor, not by hand.

\item
Any other line whose first character is \verb+%+ or \verb+#+ is a
user comment.  User comments are ignored by all
software. Any number of comments may be included.

\item
Lines of data consist of a name followed by a sequence. The total
length of the line must be less than 1024 characters.

\item
A name must be a single word. Any non-whitespace characters are
accepted, but no spaces are tolerated in names.

\item
In the sequence, any of the characters \verb+-_.+ or a space is
recognized as a gap. Gaps are converted to \verb+.+. Any other character
is interpreted as sequence.  Sequence is case-sensitive. My software
commonly assumes that upper-case symbols are used for
consensus (match) positions and lower-case symbols are used for
inserts. This language of ``match'' versus ``insert'' comes from the
hidden Markov model formalism \cite{Krogh94}. To almost all of my
software, the distinction isn't important, and the sequence is
converted to all upper-case immediately after it is read.

\item
The lines for the different sequences are grouped in a block of data
lines. Blocks are separated by blank lines. No blank lines are tolerated
between the sequence lines in a block. In a multi-block
file of a long alignment, every block must list its sequences in the
same order. The names are checked to verify that this is the case; if
they do not match, only a warning is generated. (In manually
constructed files, some users may wish to give full names in the first
block and shorthand names in later blocks, but this isn't recommended.)
\end{enumerate}

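
To illustrate the comment and gap rules above, here is a small
hand-made fragment. The sequence names and residues are invented for
this example; it mixes both user comment styles and three of the
accepted gap symbols.

\begin{verbatim}
% a user comment
# another user comment
seq1   ACG-ACG_ACG
seq2   A.GAACG-ACG
\end{verbatim}

Following the gap rule, a reader would be expected to hold these two
sequences with every gap written as \verb+.+:

\begin{verbatim}
seq1   ACG.ACG.ACG
seq2   A.GAACG.ACG
\end{verbatim}
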
\subsection {Special comments}

\subsubsection {Secondary structure}

I use one-letter codes to indicate secondary structures. Secondary
structure strings are aligned to sequence blocks just like additional
sequences.

For RNA secondary structure, the symbols \verb+>+ and \verb+<+ are
used for base pairs (pairs point at each other), \verb-+- indicates
other single-stranded positions, and {\tt .} indicates unassigned bases.
This description follows \cite{Konings89}.  For protein secondary
structure, I use {\tt E} to indicate residues in $\beta$-sheet,
{\tt H} for those in $\alpha$-helix, {\tt L} for those in loops, and
{\tt .} for unassigned residues.

RNA pseudoknots are represented by alphabetic characters, with upper
case letters representing the 5' side of the helix and lower case
letters representing the 3' side. Note that this restricts the
annotation to a maximum of 26 pseudoknots per sequence.

Lines beginning with \verb+#=SS+ or \verb+#=CS+ carry individual or
consensus secondary structure data, respectively.  An individual
\verb+#=SS+ secondary structure line must immediately follow the
sequence it is associated with, and there can be only one \verb+#=SS+
per sequence. A \verb+#=CS+ consensus secondary structure prediction
precedes all the sequences in each block. There can be only one
\verb+#=CS+ annotation per file.

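
As a sketch of how these lines fit together, here is an invented
single-block RNA alignment of a short hairpin. The names and sequences
are made up for illustration. The \verb+#=CS+ line comes first, and
each \verb+#=SS+ line directly follows the sequence it annotates.

\begin{verbatim}
#=CS   >>>>++++<<<<
seq1   GGCAUUCGUGCC
#=SS   >>>>++++<<<<
seq2   GGCGUUCGCGCC
#=SS   >>>>++++<<<<
\end{verbatim}

Each \verb+>+ pairs with the \verb+<+ at the mirror position, so the
first G pairs with the last C in both sequences, and the four
\verb-+- columns mark the single-stranded loop.
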
\subsubsection {Reference coordinate system}

Alignments are usually numbered by some reference coordinate system,
often that of a canonical molecule. For instance, tRNA positions are
numbered by reference to the positions of yeast tRNA-Phe.

A line beginning with \verb+#=RF+ preceding the sequences in a block
gives a reference coordinate system. Any non-gap symbol in the
\verb+#=RF+ line indicates that sequence positions in that column are
numbered. For instance, the \verb+#=RF+ lines for a tRNA alignment
would have 76 non-gap symbols for the canonical numbered columns; they
might be the aligned tRNA-Phe sequence itself, or they might be just
X's.

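
A toy example, much shorter than a real tRNA alignment and with
invented sequences, shows the idea:

\begin{verbatim}
#=RF   XXX..XXXX
seq1   ACGuuACGU
seq2   ACG..ACGU
\end{verbatim}

Here the \verb+#=RF+ line has seven non-gap symbols, so seven columns
are numbered. The two columns under the gaps, which hold a lower-case
insert in {\tt seq1}, carry no reference number.
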
\subsubsection {Sequence header}

Additional per-sequence information can be placed in a header before
any blocks appear. These lines, one per sequence and in exactly the
same order as the sequences appear in the alignment, are formatted
like {\tt \#=SQ <seqname> <weight> <database source name> <database
accession> <source coordinates as start..stop::original length>
<description>}.

This information includes a sequence weight (for compensating for
biased representation of subfamilies of sequences in the alignment);
source information, if the sequence came from a database, consisting
of identifier, accession number, and source coordinates; and a
description of the sequence.

If a \verb+#=SQ+ line is present, all the fields must be present.  If
no information is available for a field, use \verb+-+ for any field
except the source coordinates, which are given as \verb+0+.

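
As an illustration, a header for a two-sequence alignment might look
like the lines below. Every value is invented: the identifier
{\tt EXAMPLE\_ID}, the accession {\tt X00000}, the coordinates, and the
description are placeholders. Writing the empty coordinates as
\verb+0..0::0+ is my assumption about how the zero placeholder fits the
start..stop::length layout.

\begin{verbatim}
#=SQ seq1 1.0000 - - 0..0::0 -
#=SQ seq2 0.5000 EXAMPLE_ID X00000 101..176::1500 hypothetical tRNA fragment
\end{verbatim}

The first line carries only a weight. The second describes a
76-residue subsequence taken from positions 101 to 176 of a source
sequence of length 1500.
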
\subsubsection {Author}

The first non-comment, non-blank line of the file may be a \verb+#=AU+
``author'' line. There is a programmatic interface for
alignment-generating programs to record a short comment like
\verb+11 November 1993, by Feng-Doolittle v. 2.1.1+,
and this comment will be
recorded on the \verb+#=AU+ line by \verb+WriteSELEX()+.

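
With the comment quoted above, the top of a file might then read as
follows. The single space after \verb+#=AU+ and the sequence line are
assumptions made for this example.

\begin{verbatim}
# Example selex file
#=AU 11 November 1993, by Feng-Doolittle v. 2.1.1

seq1     ACGACGACGACG.
\end{verbatim}
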