% Mon Dec  5 15:23:18 1994

\section{GSI format}

{\tt GSI} (``generic sequence index'') is a format for indexing
sequence databases.  Database retrieval programs such as {\tt sfetch}
read a GSI file, when one is available, to retrieve a sequence
quickly from a large database.

GSI files are created from sequence databases by Perl scripts.
Scripts are currently provided for indexing GenBank, SwissProt,
GenPept, FASTA, and PIR formatted databases.

\subsection{GSI programmatic details}

A single GSI file indexes one or more files in a sequence database.
It is a binary file consisting of a number of fixed-length records.
There are three types of records: one header record, one file record
for every file in the database, and one keyword record for every
sequence retrieval key. (The retrieval key is usually the sequence
name, but may also be a database accession number.)

Every GSI record is 38 bytes long and contains three fields: 32 bytes
of text (up to 31 characters followed by a NUL byte), a 2-byte
network short, and a 4-byte network long. (``Network short'' and
``network long'' refer to portable integer variables of fixed size
and big-endian byte order. See the Perl documentation for {\tt pack}
for more details.)
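
As a minimal sketch of this record layout, the Python fragment below
packs and unpacks one 38-byte record with the standard {\tt struct}
module. It is not part of SQUID: the helper names \verb+pack_record+
and \verb+unpack_record+ are hypothetical, and treating both integer
fields as unsigned is an assumption.

\begin{verbatim}
import struct

# "!" selects network (big-endian) byte order with no padding.
# 32s = 32-byte text field, H = 2-byte short, L = 4-byte long.
# Assumption: both integer fields are unsigned.
GSI_RECORD = struct.Struct("!32sHL")

def pack_record(text, short_field, long_field):
    """Build one 38-byte record from its three fields."""
    raw = text.encode("ascii")
    if len(raw) > 31:
        raise ValueError("text field is limited to 31 bytes")
    # struct pads the 32s field with NUL bytes on the right.
    return GSI_RECORD.pack(raw, short_field, long_field)

def unpack_record(record):
    """Split one 38-byte record into (text, short, long)."""
    raw, short_field, long_field = GSI_RECORD.unpack(record)
    text = raw.split(b"\0", 1)[0].decode("ascii")
    return text, short_field, long_field
\end{verbatim}

Because {\tt struct} is told to use network order explicitly, the
packed bytes do not depend on the byte order of the machine that
writes the index.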

The first record is a header.  It contains a short identifying text
string (``GSI''), then the number of files indexed ({\tt nfiles}), and
the number of keywords indexed ({\tt nkeys}).

The next {\tt nfiles} records (records 1..{\tt nfiles}) map file
numbers onto file names. The three fields are \verb+<filename> <file
number> <file format>+. These records must be in numerical order
according to their file numbers.  Because of the 31-character
restriction on filename lengths, the sequence files will generally
have to be in the same directory as the GSI index file. The file
format number is defined in {\tt squid.h}:

\begin{tabular}{rl}
0   & Unknown \\
1   & Intelligenetics \\
2   & GenBank \\
4   & EMBL \\
5   & GCG single sequence \\
6   & Strider \\
7   & FASTA \\
8   & Zuker \\
9   & Idraw \\
12  & PIR \\
13  & Raw \\
14  & SQUID \\
16  & GCG data library \\
101 & Stockholm alignment \\
102 & SELEX alignment \\
103 & GCG MSF alignment \\
104 & Clustal alignment \\
105 & A2M (aligned FASTA) alignment \\
106 & Phylip \\
\end{tabular}
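
Reading the header and the file records can be sketched in Python as
follows. This is an illustration under stated assumptions, not SQUID
code: \verb+read_record+, \verb+read_file_table+ and
\verb+FORMAT_NAMES+ are hypothetical names, the index is assumed to
be held in memory as a byte string, and the integer fields are
assumed to be unsigned.

\begin{verbatim}
import struct

# 38 bytes: 32-byte text, network short, network long.
GSI_RECORD = struct.Struct("!32sHL")

# A few of the format codes from the table above.
FORMAT_NAMES = {2: "GenBank", 4: "EMBL", 7: "FASTA", 12: "PIR"}

def read_record(index, i):
    """Return (text, short, long) for record number i."""
    raw, short_field, long_field = GSI_RECORD.unpack_from(
        index, i * GSI_RECORD.size)
    text = raw.split(b"\0", 1)[0].decode("ascii")
    return text, short_field, long_field

def read_file_table(index):
    """Return (nkeys, {file_number: (filename, format_code)})."""
    tag, nfiles, nkeys = read_record(index, 0)   # header record
    if tag != "GSI":
        raise ValueError("not a GSI index")
    table = {}
    for i in range(1, nfiles + 1):               # file records
        name, number, fmt = read_record(index, i)
        table[number] = (name, fmt)
    return nkeys, table
\end{verbatim}

Since every record has the same length, record {\tt i} always starts
at byte offset $38 \times {\tt i}$, so no scanning is needed to reach
it.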

The remaining records ({\tt nfiles}+1..{\tt nfiles}+{\tt nkeys}) map
keys onto files and disk offsets. The three fields are
\verb+<keyword> <file number> <disk offset>+. These records must be
sorted in alphabetic order by their retrieval keys, because the
function {\tt GSIGetOffset()} locates a keyword in the index file by a
binary search.
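
The binary search over the keyword records can be sketched in Python
as below. This is not the SQUID implementation of
{\tt GSIGetOffset()}: \verb+get_offset+ is a hypothetical helper, the
index is assumed to be held in memory as a byte string, and
``alphabetic order'' is assumed to mean plain byte-wise comparison of
the keys.

\begin{verbatim}
import struct

# 38 bytes: 32-byte text, network short, network long.
GSI_RECORD = struct.Struct("!32sHL")

def get_offset(index, key):
    """Return (file_number, offset) for key, or None."""
    # The header record holds nfiles and nkeys.
    _, nfiles, nkeys = GSI_RECORD.unpack_from(index, 0)
    wanted = key.encode("ascii")
    # First and last keyword record numbers.
    lo, hi = nfiles + 1, nfiles + nkeys
    while lo <= hi:
        mid = (lo + hi) // 2
        raw, file_number, offset = GSI_RECORD.unpack_from(
            index, mid * GSI_RECORD.size)
        found = raw.split(b"\0", 1)[0]
        if found == wanted:
            return file_number, offset
        if found < wanted:      # byte-wise comparison
            lo = mid + 1
        else:
            hi = mid - 1
    return None
\end{verbatim}

The search only terminates correctly if the index writer sorted the
keys with the same comparison that the reader uses, which is why the
sort order is part of the format.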

\subsection{Relevant functions}
\begin{description}
\item[GSIOpen()]
      Opens a GSI index file.
\item[GSIGetRecord()]
      Reads the three fields of the current record.
\item[GSIGetOffset()]
      Looks up a keyword in a GSI index and returns the filename,
      file format, and disk offset of the matching sequence.
\item[SeqfilePosition()]
      Repositions an open sequence file to a given disk offset.
\item[GSIClose()]
      Closes an open GSI index file.
\end{description}