2 * ReadSeq.Help -- 30 Dec 92
4 * Reads and writes nucleic/protein sequences in various
5 * formats. Data files may have multiple sequences.
7 * Copyright 1990 by d.g.gilbert
8 * biology dept., indiana university, bloomington, in 47405
9 * e-mail: gilbertd@bio.indiana.edu
11 * This program may be freely copied and used by anyone.
12 * Developers are encouraged to incorporate parts in their
13 * programs, rather than devise their own private sequence
16 * This should compile and run with any ANSI C compiler.
17 * Please advise me of any bugs, additions or corrections.
19 Readseq is particularly useful as it automatically detects many
20 sequence formats, and interconverts among them.
22 Formats which readseq currently understands:
24 * IG/Stanford, used by Intelligenetics and others
25 * GenBank/GB, genbank flatfile format
27 * EMBL, EMBL flatfile format
28 * GCG, single sequence format of GCG software
29 * DNAStrider, for common Mac program
30 * Fitch format, limited use
31 * Pearson/Fasta, a common format used by Fasta programs and others
32 * Zuker format, limited use. Input only.
33 * Olsen, format printed by Olsen VMS sequence editor. Input only.
34 * Phylip3.2, sequential format for Phylip programs
35 * Phylip, interleaved format for Phylip programs (v3.3, v3.4)
36 * Plain/Raw, sequence data only (no name, document, numbering)
37 + MSF multi sequence format used by GCG software
38 + PAUP's multiple sequence (NEXUS) format
39 + PIR/CODATA format used by PIR
40 + ASN.1 format used by NCBI
41 + Pretty print with various options for nice looking output. Output only.
43 See the included "Formats" file for detail on file formats.
48 -- for interactive use
50 readseq my.1st.seq my.2nd.seq -all -format=genbank -output=my.gb
51 -- convert all of two input files to one genbank format output file
53 readseq my.seq -all -form=pretty -nameleft=3 -numleft -numright -numtop -match
54 -- output to standard output a file in a pretty format
56 readseq my.seq -item=9,8,3,2 -degap -CASE -rev -f=msf -out=my.rev
57 -- select 4 items from input, degap, reverse, and uppercase them
59 cat *.seq | readseq -pipe -all -format=asn > bunch-of.asn
60 -- pipe a bunch of data thru readseq, converting all to asn
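
The sequence edits requested in the third example (-degap, -rev, -CASE)
can be sketched in a few lines of Python. This is an illustration only and
not readseq's own code: the function names are hypothetical, only the plain
A/C/G/T/U bases are complemented (ambiguity codes pass through unchanged),
and the order in which readseq itself applies the edits is an assumption.

```python
# Illustrative sketch only; function names are hypothetical.
# Maps each plain base to its complement; U is treated like T.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def degap(seq, gap="-"):
    """Remove every occurrence of the gap symbol."""
    return seq.replace(gap, "")


def reverse_complement(seq):
    """Complement each base, then reverse the whole string."""
    return seq.translate(_COMPLEMENT)[::-1]


def transform(seq):
    """Degap, reverse-complement, and uppercase (assumed order)."""
    return reverse_complement(degap(seq)).upper()


if __name__ == "__main__":
    print(transform("aa-cc"))
```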
63 The brief usage of readseq is as follows. The "[]" denote
64 optional parts of the syntax:
67 readSeq (27Dec92), multi-format molbio sequence reader.
68 usage: readseq [-options] in.seq > out.seq
70 -a[ll] select All sequences
71 -c[aselower] change to lower case
72 -C[ASEUPPER] change to UPPER CASE
73 -degap[=-] remove gap symbols
74 -i[tem=2,3,4] select Item number(s) from several
75 -l[ist] List sequences only
76 -o[utput=]out.seq redirect Output
77 -p[ipe] Pipe (command line, <stdin, >stdout)
78 -r[everse] change to Reverse-complement
79 -v[erbose] Verbose progress
80 -f[ormat=]# Format number for output, or
81 -f[ormat=]Name Format name for output:
82 1. IG/Stanford 10. Olsen (in-only)
83 2. GenBank/GB 11. Phylip3.2
89 8. Pearson/Fasta 17. PAUP
90 9. Zuker 18. Pretty (out-only)
92 Pretty format options:
93 -wid[th]=# sequence line width
95 -col[space]=# column space within sequence line on output
96 -gap[count] count gap chars in sequence numbers
97 -nameleft, -nameright[=#] name on left/right side [=max width]
98 -nametop name at top/bottom
99 -numleft, -numright seq index on left/right side
100 -numtop, -numbot index on top/bottom
101 -match[=.] use match base for 2..n species
102 -inter[line=#] blank line(s) between sequence blocks
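
One reading of the -match option is that, in sequences 2..n, a residue
identical to the first sequence's residue in the same column is printed as
the match symbol. A minimal Python sketch of that reading follows. The
function name is hypothetical, and the case-insensitive comparison is an
assumption rather than documented readseq behaviour.

```python
# Illustrative sketch only; not taken from readseq.
def apply_match(seqs, match="."):
    """Return the sequences with matches to the first one replaced.

    The first sequence is returned unchanged. In every later sequence,
    a residue equal (ignoring case) to the first sequence's residue in
    the same column becomes the match symbol.
    """
    if not seqs:
        return []
    ref = seqs[0]
    out = [ref]
    for seq in seqs[1:]:
        row = [
            match if i < len(ref) and ch.upper() == ref[i].upper() else ch
            for i, ch in enumerate(seq)
        ]
        out.append("".join(row))
    return out


if __name__ == "__main__":
    for line in apply_match(["ACGT", "ACGA", "TCGT"]):
        print(line)
```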
107 In use, readseq will respond to command line arguments, or to
108 interactive use. Command line arguments cannot be combined
109 but must each follow a switch character (-). In this release,
110 the command line options are now words, with an equals (=)
111 to separate parameter(s) from the command. You cannot put a
112 space between a command and its parameter, as is usual for
113 Unix programs (this is to preserve compatibility with VMS).
114 The command line syntax of the earlier versions is still supported.
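
The word-style syntax described above (a switch character, an option word,
and an optional "=parameter" with no space) can be split apart as in the
Python sketch below. It is a sketch under stated assumptions, not readseq's
parser: the function names are hypothetical, and neither abbreviation
matching (-f, -form, -format) nor the older syntax is handled.

```python
# Illustrative sketch only; not readseq's own argument parser.
def parse_option(arg):
    """Split one argument such as '-format=genbank' into (name, value).

    The value is None when the argument carries no '=' part.
    """
    if not arg.startswith("-") or len(arg) < 2:
        raise ValueError("not an option: %r" % arg)
    name, sep, value = arg[1:].partition("=")
    return name, (value if sep else None)


def split_command_line(args):
    """Separate options (leading '-') from input file names."""
    options, files = {}, []
    for arg in args:
        if arg.startswith("-"):
            name, value = parse_option(arg)
            options[name] = value
        else:
            files.append(arg)
    return options, files


if __name__ == "__main__":
    print(split_command_line(
        ["my.1st.seq", "my.2nd.seq", "-all", "-format=genbank", "-output=my.gb"]
    ))
```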
117 See the file Formats for details of the sequence formats which
118 are supported by readseq. The auto-detection feature of
119 readseq which distinguishes these formats looks for some of the
120 unique keywords and symbols that are found in each format. It
121 is not infallible at this, though it attempts to exclude unknown
122 formats. In general, if you feed to readseq a sequence file that
123 you know is one of these common formats, you are okay. If you feed
124 it data that might be oddball formats, or non-sequence data,
125 you might well get garbage results. Also, different developers
126 are always thinking up minor twists on these common formats
127 (like PAUP requiring a blank line between blocks of Phylip format,
128 or IG adding form feeds between sequences), which may cause hassles.
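
To give a feel for keyword-based detection, here is a small Python sketch
that inspects only the first non-blank line of a file. It is not readseq's
detection logic, which is in the C sources. The function name and the
character set accepted as plain sequence are assumptions, and only a few of
the formats listed earlier are covered.

```python
# Illustrative sketch only; readseq's real detection is more thorough.
# Characters accepted as plain sequence data (an assumption).
RAW_CHARS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ-*.?")


def guess_format(text):
    """Guess a sequence format from leading keywords and symbols."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    if first.upper().startswith("#NEXUS"):
        return "PAUP"            # NEXUS files open with #NEXUS
    if first.startswith("LOCUS"):
        return "GenBank/GB"      # flatfile entries open with LOCUS
    if first.startswith("ID   "):
        return "EMBL"            # flatfile entries open with an ID line
    if first.startswith(">"):
        return "Pearson/Fasta"   # header lines open with '>'
    if first.startswith(";"):
        return "IG/Stanford"     # comment lines open with ';'
    if all(set("".join(ln.split()).upper()) <= RAW_CHARS for ln in lines):
        return "Plain/Raw"       # nothing but sequence symbols
    return "unknown"


if __name__ == "__main__":
    print(guess_format(">seq1\nACGT\n"))
```

A file that merely resembles one of these openings is misclassified by a
check this simple, which is the kind of failure the paragraph above warns
about.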
130 In general, output supports only minimal subsets of each format
131 needed for sequence data exchanges. Features, descriptions
132 and other format-unique information are discarded.
134 The pretty format requires additional options to generate a
135 nice output. Try the various pretty options to see what you like.
136 Pretty format is OUTPUT only; readseq cannot read a Pretty format
139 Readseq is NOT optimized for LARGE files. It generally makes several
140 reads thru each input file (one per sequence output at present, future
141 version may optimize this). It should handle input and output files
142 and sequences of any size, but will slow down quite a bit for very large
143 (multi megabyte) sized files. It is NOT recommended for converting
144 databanks or large subsets thereof. It is primarily directed at the
145 small files that researchers use to maintain their personal data, which
146 they frequently need to interconvert for the various analysis programs
147 which so frequently require a special format.
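
Why one read per output sequence slows down on large files can be seen in
the Python sketch below. It is an illustration and not readseq's code: the
function names are hypothetical, and Pearson/Fasta-style ">" header lines
stand in for whatever record boundaries the real input format uses. Each
item is found by scanning again from the top of the input, so the total
number of lines scanned grows much faster than the input itself.

```python
# Illustrative sketch only; models "one read of the input per sequence".
def read_item(lines, wanted):
    """Scan from the top for record number `wanted` (1-based).

    Returns (record_lines, lines_scanned).
    """
    record, index, scanned = [], 0, 0
    for line in lines:
        scanned += 1
        if line.startswith(">"):
            if index == wanted:
                break            # header of the next record: stop here
            index += 1
        if index == wanted:
            record.append(line)
    return record, scanned


def convert_all(lines):
    """Fetch every record with a fresh scan; return (records, total_scanned)."""
    count = sum(1 for ln in lines if ln.startswith(">"))
    records, total = [], 0
    for item in range(1, count + 1):
        rec, scanned = read_item(lines, item)
        records.append(rec)
        total += scanned
    return records, total


if __name__ == "__main__":
    data = [">a", "AC", ">b", "GT", ">c", "TT"]
    print(convert_all(data))
```

For the six-line, three-record input in the example, the three scans touch
14 lines in total.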
149 Users of Olsen multi sequence editor (VMS). The Olsen format
150 here is produced with the print command:
152 Use Genbank output from readseq to produce a format that this
153 editor can read, and use the command
154 load/genbank some.file
155 Dan Davison has a VMS program that will convert to/from the
156 Olsen native binary data format. E-mail davison@uh.edu
158 Warning: Phylip format input is now supported (30Dec92), however the
159 auto-detection of Phylip format is very probabilistic and messy,
160 especially distinguishing sequential from interleaved versions. It
161 is not recommended that one use readseq to convert files from Phylip
162 format to others unless essential.
165 This program is available thru Internet gopher, as
167 gopher ftp.bio.indiana.edu
168 browse into the IUBio-Software+Data/molbio/readseq/ folder
169 select the readseq.shar document
171 Or thru anonymous FTP in this manner:
172 my_computer> ftp ftp.bio.indiana.edu (or IP address 129.79.224.25)
174 password: my_username@my_computer
175 ftp> cd molbio/readseq
176 ftp> get readseq.shar
179 readseq.shar is a Unix shell archive of the readseq files.
180 This file can be edited by any text editor to reconstitute the
181 original files, for those who do not have a Unix system or an
182 Unshar program. Read the top of this .shar file for further
185 There are also pre-compiled executables for the following computers:
186 Silicon Graphics Iris, Sparc (Sun Sparcstation & clones), VMS-Vax,
187 Macintosh. Use binary ftp to transfer these, except Macintosh. The
188 Mac version is just the command-line program in a window, not very
192 readseq.c ureadseq.c ureadasn.c ureadseq.h
196 Formats (description of sequence file formats)
197 add.gdemenu (GDE program users can add this to the .GDEmenu file)
198 Stdfiles -- test sequence files
199 Makefile -- Unix make file
200 Make.com -- VMS make file
201 *.std -- files for testing validity of readseq
204 Recent changes (see also readseq.c for all history of changes):
207 + added 32 bit CRC checksum as alternative to GCG 6.5bit checksum
209 = fixed Olsen format input to handle files w/ more sequences,
210 not to mess up when more than one seq has same identifier,
211 and to convert number masks to symbols.
212 = IG format fix to understand ^L
214 * revised command-line & interactive interface. Suggested form is now
215 readseq infile -format=genbank -output=outfile -item=1,3,4 ...
216 but remains compatible with prior commandlines:
217 readseq infile -f2 -ooutfile -i3 ...
218 + added GCG MSF multi sequence file format
219 + added PIR/CODATA format
220 + added NCBI ASN.1 sequence file format
221 + added Pretty, multi sequence pretty output (only)
222 + added PAUP multi seq format
224 + added Gary Williams (GWW, G.Williams@CRC.AC.UK) reverse-complement option.
225 + added support for reading Phylip formats (interleave & sequential)
226 * string fixes, dropped need for compiler flags NOSTR, FIXTOUPPER, NEEDSTRCASECMP
227 * changed 32bit checksum to default, -DSMALLCHECKSUM for GCG version