sources/readseq/Readseq.help

   1
   2  * ReadSeq.Help -- 30 Dec 92
   3  *
   4  * Reads and writes nucleic/protein sequences in various
   5  * formats. Data files may have multiple sequences.
   6  *
   7  * Copyright 1990 by d.g.gilbert
   8  * biology dept., indiana university, bloomington, in 47405
   9  * e-mail: gilbertd@bio.indiana.edu
  10  *
  11  * This program may be freely copied and used by anyone.
  12  * Developers are encouraged to incorporate parts in their
  13  * programs, rather than devise their own private sequence
  14  * format.
  15  *
  16  * This should compile and run with any ANSI C compiler.
  17  * Please advise me of any bugs, additions or corrections.
  18
  19 Readseq is particularly useful as it automatically detects many
  20 sequence formats, and interconverts among them.
  21
  22 Formats which readseq currently understands:
  23
  24   * IG/Stanford, used by Intelligenetics and others
  25   * GenBank/GB, genbank flatfile format
  26   * NBRF format
  27   * EMBL, EMBL flatfile format
  28   * GCG, single sequence format of GCG software
  29   * DNAStrider, for common Mac program
  30   * Fitch format, limited use
  31   * Pearson/Fasta, a common format used by Fasta programs and others
  32   * Zuker format, limited use. Input only.
  33   * Olsen, format printed by Olsen VMS sequence editor. Input only.
  34   * Phylip3.2, sequential format for Phylip programs
  35   * Phylip, interleaved format for Phylip programs (v3.3, v3.4)
  36   * Plain/Raw, sequence data only (no name, document, numbering)
  37   + MSF multi sequence format used by GCG software
  38   + PAUP's multiple sequence (NEXUS) format
  39   + PIR/CODATA format used by PIR
  40   + ASN.1 format used by NCBI
  41   + Pretty print with various options for nice looking output. Output only.
  42
  43 See the included "Formats" file for detail on file formats.
  44
  45
  46 Example usage:
  47   readseq
  48       -- for interactive use
  49
  50   readseq my.1st.seq  my.2nd.seq  -all  -format=genbank  -output=my.gb
  51       -- convert all of two input files to one genbank format output file
  52
  53   readseq my.seq -all -form=pretty -nameleft=3 -numleft -numright -numtop -match
  54       -- output to standard output a file in a pretty format
  55
  56   readseq my.seq -item=9,8,3,2 -degap -CASE -rev -f=msf -out=my.rev
  57       -- select 4 items from input, degap, reverse, and uppercase them
  58
  59   cat *.seq | readseq -pipe -all -format=asn > bunch-of.asn
  60       -- pipe a bunch of data thru readseq, converting all to asn
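
The sequence edits named in the third example (-degap, -rev, -CASE) can be
sketched in a few lines. This is only an illustration of what those options
mean, not readseq's own C code: the helper names are hypothetical, the gap
symbol is assumed to be "-", and ambiguity codes are left unchanged.

```python
# Hypothetical helpers illustrating -degap, -rev and -CASE.
# Not taken from readseq; a minimal sketch for plain DNA/RNA letters.

# Complement table for A, C, G, T and U in both cases (U pairs with A).
COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def degap(seq, gap="-"):
    """Remove every occurrence of the gap symbol."""
    return seq.replace(gap, "")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def process(seq):
    """Degap, reverse-complement and uppercase one sequence."""
    return reverse_complement(degap(seq)).upper()


if __name__ == "__main__":
    print(process("aag-c"))
```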
  61
  62
  63 The brief usage of readseq is as follows. The "[]" denote
  64 optional parts of the syntax:
  65
  66 readseq -help
  67 readSeq (27Dec92), multi-format molbio sequence reader.
  68 usage: readseq [-options] in.seq > out.seq
  69  options
  70     -a[ll]         select All sequences
  71     -c[aselower]   change to lower case
  72     -C[ASEUPPER]   change to UPPER CASE
  73     -degap[=-]     remove gap symbols
  74     -i[tem=2,3,4]  select Item number(s) from several
  75     -l[ist]        List sequences only
  76     -o[utput=]out.seq  redirect Output
  77     -p[ipe]        Pipe (command line, <stdin, >stdout)
  78     -r[everse]     change to Reverse-complement
  79     -v[erbose]     Verbose progress
  80     -f[ormat=]#    Format number for output,  or
  81     -f[ormat=]Name Format name for output:
  82          1. IG/Stanford           10. Olsen (in-only)
  83          2. GenBank/GB            11. Phylip3.2
  84          3. NBRF                  12. Phylip
  85          4. EMBL                  13. Plain/Raw
  86          5. GCG                   14. PIR/CODATA
  87          6. DNAStrider            15. MSF
  88          7. Fitch                 16. ASN.1
  89          8. Pearson/Fasta         17. PAUP
  90          9. Zuker                 18. Pretty (out-only)
  91
  92    Pretty format options:
  93     -wid[th]=#            sequence line width
  94     -tab=#                left indent
  95     -col[space]=#         column space within sequence line on output
  96     -gap[count]           count gap chars in sequence numbers
  97     -nameleft, -nameright[=#]   name on left/right side [=max width]
  98     -nametop              name at top/bottom
  99     -numleft, -numright   seq index on left/right side
 100     -numtop, -numbot      index on top/bottom
 101     -match[=.]            use match base for 2..n species
 102     -inter[line=#]        blank line(s) between sequence blocks
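
As a rough illustration of the -match option, the sketch below replaces each
base in sequences 2..n that equals the base at the same position in the first
sequence with the match symbol. It assumes that this is what "match base"
means and that the comparison is case-sensitive; the function name is
hypothetical and the code is not taken from readseq.

```python
def apply_match(seqs, match_char="."):
    """Return seqs with bases identical to the first sequence masked.

    The first sequence is kept as is; it serves as the reference.
    """
    if not seqs:
        return []
    ref = seqs[0]
    out = [ref]
    for seq in seqs[1:]:
        masked = "".join(
            match_char if i < len(ref) and base == ref[i] else base
            for i, base in enumerate(seq)
        )
        out.append(masked)
    return out


if __name__ == "__main__":
    for line in apply_match(["ACGT", "ACCT", "TCGA"]):
        print(line)
```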
 103
 104
 105 Notes:
 106
 107 In use, readseq will respond to command line arguments, or to
 108 interactive use.  Command line arguments cannot be combined
 109 but must each follow a switch character (-).  In this release,
 110 the command line options are now words, with an equals (=)
 111 to separate parameter(s) from the command.  You cannot put a
 112 space between a command and its parameter, as is usual for
 113 Unix programs (this is to preserve compatibility with VMS).
 114 The command line syntax of the earlier versions is still
 115 supported.
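
The word-style syntax described above (one switch per argument, "=" between
option and parameter, no space) can be modelled with a short sketch. It is not
readseq's parser: parse_args is a hypothetical name, option abbreviations are
not expanded, and the older -f2 / -ooutfile style is not handled.

```python
def parse_args(argv):
    """Split arguments into input files and -name[=value] options.

    Options without "=" are stored as True; the value is everything
    after the first "=", so "-degap=-" yields {"degap": "-"}.
    """
    files = []
    options = {}
    for arg in argv:
        if arg.startswith("-") and len(arg) > 1:
            name, sep, value = arg[1:].partition("=")
            options[name] = value if sep else True
        else:
            files.append(arg)
    return files, options


if __name__ == "__main__":
    print(parse_args(["my.seq", "-all", "-format=genbank", "-item=9,8,3,2"]))
```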
 116
 117 See the file Formats for details of the sequence formats which
 118 are supported by readseq.  The auto-detection feature of
 119 readseq which distinguishes these formats looks for some of the
 120 unique keywords and symbols that are found in each format. It
 121 is not infallible at this, though it attempts to exclude unknown
 122 formats.  In general, if you feed to readseq a sequence file that
 123 you know is one of these common formats, you are okay.  If you feed
 124 it data that might be oddball formats, or non-sequence data,
 125 you might well get garbage results.  Also, different developers
 126 are always thinking up minor twists on these common formats
 127 (like PAUP requiring a blank line between blocks of Phylip format,
 128 or IG adding form feeds between sequences), which may cause hassles.
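
Keyword-based detection of the kind described above might look like the
following sketch. The rules shown (a "#NEXUS" header for PAUP, a leading
"LOCUS" for GenBank, "ID   " for EMBL, ">" for Pearson/Fasta, ">XX;" for NBRF)
are common conventions for those formats and are assumptions here; they are
not readseq's actual tests, which cover more formats and more edge cases.

```python
def guess_format(text):
    """Guess a sequence format from the first non-blank line.

    Returns a format name, or "unknown" when no rule applies.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith("#NEXUS"):
        return "PAUP"
    if first.startswith("LOCUS"):
        return "GenBank/GB"
    if first.startswith("ID   "):
        return "EMBL"
    if first.startswith(">"):
        # NBRF headers carry a two-letter type code and ";", e.g. ">P1;name".
        if len(first) > 3 and first[3] == ";":
            return "NBRF"
        return "Pearson/Fasta"
    return "unknown"


if __name__ == "__main__":
    print(guess_format(">seq1 test\nACGT\n"))
```

Anything that falls through to "unknown" is exactly the oddball or
non-sequence input that the paragraph above warns about.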
 129
 130 In general, output supports only minimal subsets of each format
 131 needed for sequence data exchanges.  Features, descriptions
 132 and other format-unique information are discarded.
 133
 134 The pretty format requires additional options to generate a
 135 nice output.  Try the various pretty options to see what you like.
 136 Pretty format is OUTPUT only; readseq cannot read a Pretty format
 137 file.
 138
 139 Readseq is NOT optimized for LARGE files.  It generally makes several
 140 reads thru each input file (one per sequence output at present, future
 141 version may optimize this).  It should handle input and output files
 142 and sequences of any size, but will slow down quite a bit for very large
 143 (multi megabyte) sized files. It is NOT recommended for converting
 144 databanks or large subsets thereof.  It is primarily directed at the
 145 small files that researchers use to maintain their personal data, which
 146 they frequently need to interconvert for the various analysis programs
 147 which so frequently require a special format.
 148
 149 Note for users of the Olsen multi sequence editor (VMS):  The Olsen format
 150 here is produced with the print command:
 151   print/out=some.file
 152 Use Genbank output from readseq to produce a format that this
 153 editor can read, and use the command
 154   load/genbank some.file
 155 Dan Davison has a VMS program that will convert to/from the
 156 Olsen native binary data format.  E-mail davison@uh.edu
 157
 158 Warning: Phylip format input is now supported (30Dec92), however the
 159 auto-detection of Phylip format is very probabilistic and messy,
 160 especially distinguishing sequential from interleaved versions. It
 161 is not recommended that one use readseq to convert files from Phylip
 162 format to others unless essential.
 163
 164
 165 This program is available thru Internet gopher, as
 166
 167   gopher ftp.bio.indiana.edu
 168   browse into the IUBio-Software+Data/molbio/readseq/ folder
 169   select the readseq.shar document
 170
 171 Or thru anonymous FTP in this manner:
 172   my_computer> ftp  ftp.bio.indiana.edu  (or IP address 129.79.224.25)
 173     username:  anonymous
 174     password:  my_username@my_computer
 175   ftp> cd molbio/readseq
 176   ftp> get readseq.shar
 177   ftp> bye
 178
 179 readseq.shar is a Unix shell archive of the readseq files.
 180 This file can be edited by any text editor to reconstitute the
 181 original files, for those who do not have a Unix system or an
 182 Unshar program.  Read the top of this .shar file for further
 183 instructions.
 184
 185 There are also pre-compiled executables for the following computers:
 186 Silicon Graphics Iris, Sparc (Sun Sparcstation & clones), VMS-Vax,
 187 Macintosh. Use binary ftp to transfer these, except Macintosh.  The
 188 Mac version is just the command-line program in a window, not very
 189 handy.
 190
 191 C source files:
 192   readseq.c ureadseq.c ureadasn.c ureadseq.h
 193
 194 Document files:
 195   Readme (this doc)
 196   Formats (description of sequence file formats)
 197   add.gdemenu (GDE program users can add this to the .GDEmenu file)
 198   Stdfiles -- test sequence files
 199   Makefile -- Unix make file
 200   Make.com -- VMS make file
 201   *.std    -- files for testing validity of readseq
 202
 203
 204 Recent changes (see also readseq.c for all history of changes):
 205
 206 4 May 92
 207 + added 32 bit CRC checksum as alternative to GCG 6.5bit checksum
 208 Aug 92
 209 = fixed Olsen format input to handle files w/ more sequences,
 210   not to mess up when more than one seq has same identifier,
 211   and to convert number masks to symbols.
 212 = IG format fix to understand ^L
 213 30 Dec 92
 214 * revised command-line & interactive interface.  Suggested form is now
 215     readseq infile -format=genbank -output=outfile -item=1,3,4 ...
 216   but remains compatible with prior commandlines:
 217     readseq infile -f2 -ooutfile -i3 ...
 218 + added GCG MSF multi sequence file format
 219 + added PIR/CODATA format
 220 + added NCBI ASN.1 sequence file format
 221 + added Pretty, multi sequence pretty output (only)
 222 + added PAUP multi seq format
 223 + added degap option
 224 + added Gary Williams (GWW, G.Williams@CRC.AC.UK) reverse-complement option.
 225 + added support for reading Phylip formats (interleave & sequential)
 226 * string fixes, dropped need for compiler flags NOSTR, FIXTOUPPER, NEEDSTRCASECMP
 227 * changed 32bit checksum to default, -DSMALLCHECKSUM for GCG version
 228
 229