1 ||||||||||| ReadSeq supported formats (revised 30Dec92)
2 --------------------------------------------------------
4 -f[ormat=]Name Format name for output:
5 1. IG/Stanford 10. Olsen (in-only)
6 2. GenBank/GB 11. Phylip3.2
12 8. Pearson/Fasta 17. PAUP
13 9. Zuker (in-only) 18. Pretty (out-only)
In general, output supports only the minimal subset of each format
needed for sequence data exchange. Features, descriptions,
and other format-unique information are discarded.
Note for users of the Olsen multi-sequence editor (VMS): the Olsen
format read here is the one produced by the editor's /print command.
To get data back into the editor, write GenBank output from readseq,
which the editor can read, and load it with the command
    load/genbank some.file
Dan Davison has a VMS program that converts to and from the
Olsen native binary data format. E-mail davison@uh.edu
Warning: Phylip format input is now supported (30Dec92). However, the
auto-detection of Phylip format is heuristic and unreliable,
especially when distinguishing sequential from interleaved versions.
Using readseq to convert files from Phylip format to other formats is
not recommended unless essential.
36 ||||||||||| ReadSeq usage (revised 11Nov91)
37 --------------------------------------------------------
A. determine file format:

    short skiplines;  /* result: number of header lines to skip (or 0) */
    short error;      /* error result or 0 */
    short format;     /* resulting format code, see ureadseq.h */
    char  *filename = "Mysequence.file";

    format = seqFileFormat( filename, &skiplines, &error);


B. read number and list of sequences (optional)

    short numseqs;    /* resulting number of sequences found in file */
    char  *seqlist;   /* list of sequence names, newline separated, 0 terminated */

    seqlist = listSeqs( filename, skiplines, format, &numseqs, &error);
    if (error==0) display (seqlist);   /* error is 0 on success */


C. read individual sequences as desired

    short seqIndex;   /* sequence index #, or == kListSeqs for listSeqs equivalent */
    long  seqlen;     /* length of seq */
    char  seqid[256]; /* sequence name */
    char  *seq;       /* sequence, 0 terminated, free when done */

    seq = readSeq( seqIndex, filename, skiplines, format,
                   &seqlen, &numseqs, &error, seqid);
    if (error==0) manipulate(seq);     /* error is 0 on success */


D. write sequences as desired

    int   nlines;     /* number of lines of sequence written */
    FILE* fout;       /* open file pointer (stdout or other) */
    short outform;    /* output format, see ureadseq.h */

    nlines = writeSeq( fout, seq, seqlen, format, outform, seqid);
Note (30Dec92): Some processing is done by the main program (in readseq.c)
rather than in the subroutines (in ureadseq.c). In particular, for interleaved
output formats the writeSeq subroutine does not handle interleaving, nor some of
the formatting at the top and end of output files. While the seqFileFormat,
listSeqs, and readSeq subroutines are fairly self-contained, writeSeq depends
heavily on auxiliary processing. At some point, this may be revised so that
writeSeq is self-contained.
84 Note 2: The NCBI toolkit (ftp from ncbi.nlm.nih.gov) is needed for the ASN.1 format
85 reading (see ureadasn.c). A bastard (but workable I hope) ASN.1 format is written
90 ||||||||||| sequence formats....
91 ---------------------------------------------------
98 efgh1 (or 2 = terminator)
104 ; Dro5s-T.Seq Length: 120 April 6, 1989 21:22 Check: 9487 ..
106 GCCAACGACCAUACCACGCUGAAUACAUCGGUUCUCGUCCGAUCACCGAAAUUAAGCAGCGUCGCGGGCG
107 GUUAGUACUUAGAUGGGGGACCGCUUGGGAACACCGCGUGUUGUUGGCCU1
109 ; TOIG of: Dro5srna.Seq check: 9487 from: 1 to: 120
110 ---------------------------------------------------
116 123456789abcdefg....(1st 9 columns are formatting)
124 ---------------------------------------------------
126 NBRF format: (from uwgcg ToNBRF)
128 Iubio$Dua0:[Gilbertd.Gcg]Dro5srna.Seq;2 => DRO5SRNA
130 51 AAUUAAGCAG CGUCGCGGGC GGUUAGUACU UAGAUGGGGG ACCGCUUGGG
131 101 AACACCGCGU GUUGUUGGCC U
133 ---------------------------------------------------
136 ID345 seq1 id (the 345 are spaces)
138 SQ345Sequence (the 3,4,5 are spaces)
141 // (! this is proper end string: 12Oct90)
148 ---------------------------------------------------
151 comments of any form, up to ".." signal
152 signal line has seq id, and " Check: #### .."
155 -- e.g. --- (GCG from GenBank)
156 LOCUS DROEST6 1819 bp ss-mRNA INV 31-AUG-1987
158 ORIGIN 1 bp upstream of EcoRI site; chromosome BK9 region 69A1.
160 INVERTEBRATE:DROEST6 Length: 1819 January 9, 1989 16:48 Check: 8008 ..
162 1 GAATTCGCCG GAGTGAGGAG CAACATGAAC TACGTGGGAC TGGGACTTAT
164 51 CATTGTGCTG AGCTGCCTTT GGCTCGGTTC GAACGCGAGT GATACAGATG
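
The "Check: ####" value on the GCG signal line is not defined anywhere in
this document. The sketch below uses the position-weighted sum that is
commonly described for GCG checksums (weights cycle from 1 to 57, the total
is taken modulo 10000). Treat that scheme, and the function name
gcg_checksum, as assumptions and verify the result against files written
by GCG itself before relying on it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Position-weighted checksum over a sequence (assumed GCG scheme):
   each residue is upper-cased and multiplied by a weight that cycles
   1..57; the running total is kept modulo 10000. */
static int gcg_checksum(const char *seq, size_t seqlen)
{
    long check = 0;
    int weight = 0;
    size_t i;

    assert(seq != NULL || seqlen == 0);
    for (i = 0; i < seqlen; i++) {
        weight++;
        check = (check + (long)weight * toupper((unsigned char)seq[i])) % 10000;
        if (weight == 57)
            weight = 0;          /* restart the weight cycle */
    }
    return (int)check;
}
```

Reducing modulo 10000 at every step keeps the running total small, so
long sequences cannot overflow a 32-bit long.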
167 ---------------------------------------------------
169 DNAStrider (Mac) = modified Stanford:
170 ; ### from DNA Strider Friday, April 7, 1989 11:04:24 PM
171 ; DNA sequence pBR322 4363 b.p. complete sequence
176 ---------------------------------------------------
180 GCC AAC GAC CAU ACC ACG CUG AAU ACA UCG GUU CUC GUC CGA UCA CCG AAA UUA AGC AGC
181 GUC GCG GGC GGU UAG UAC UUA GAU GGG GGA CCG CUU GGG AAC ACC GCG UGU UGU UGG CCU
183 GAA TTC GCC GGA GTG AGG AGC AAC ATG AAC TAC GTG GGA CTG GGA CTT ATC ATT GTG CTG
184 AGC TGC CTT TGG CTC GGT TCG AAC GCG AGT GAT ACA GAT GAC CCT CTG TTG GTG CAG CTG
185 ---------------------------------------------------
187 W.Pearson/Fasta format:
188 >BOVPRL GenBank entry BOVPRL from omam file. 907 nucleotides.
189 TGCTTGGCTGAGGAGCCATAGGACGAGAGCTTCCTGGTGAAGTGTGTTTCTTGAAATCAT
191 ---------------------------------------------------
192 Phylip version 3.2 format (e.g., DNAML):
194 5 13 YF (# seqs, #bases, YF)
196 aaaagggccc... (continued sp. alpha)
198 aaaagggccc... (continued sp. beta)
200 aaaagggccc... (continued sp. Gamma)
201 1234567890^-- bases must start in col 11, and run 'til #bases
202 (spaces & newlines are okay)
203 ---------------------------------------------------
204 Phylip version 3.3 format (e.g., DNAML):
206 5 42 YF (# seqs, #bases, YF)
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT
1234567890^-- bases must start in col 11
213 !! this version interleaves the species -- contrary to
214 all other output formats.
216 GAGCCCGGGC AATACAGGGT AT
217 GAGCCGTGGC CGGGCACGGT AT
218 ACAGGTTGGC CGTTCAGGGT AA
219 AAACCGAGGC CGGGACACTC AT
220 AAACCATTGC CGGTACGCTT AA
222 ---------------------------------------------------
223 Phylip version 3.4 format (e.g., DNAML)
224 -- Both Interleaved and sequential are permitted
226 5 13 (# seqs, #bases)
228 aaaagggccc... (continued sp. alpha)
230 aaaagggccc... (continued sp. beta)
232 aaaagggccc... (continued sp. Gamma)
233 1234567890^-- bases must start in col 11, and run 'til #bases
(spaces, newlines and numbers are ignored)
236 ---------------------------------------------------
237 Gary Olsen (multiple) sequence editor /print format:
239 !---------------------
240 !17Oct91 -- error in original copy of olsen /print format, shifted right 1 space
241 ! here is correct copy:
242 301 40 Tb.thiop CGCAGCGAAA----------GCUNUGCUAAUACCGCAUA-CGnCCUG----------------------------------------------------- Tb.thiop
243 123456789012345678901
244 301 42 Rhc.purp CGUAGCGAAA----------GUUACGCUAAUACCGCAUA-UUCUGUG----------------------------------------------------- Rhc.purp
246 301 44 Rhc.gela nnngnCGAAA----------GCCGGAUUAAUACCGCAUA-CGACCUA----------------------------------------------------- Rhc.gela
247 !---------------------
249 RNase P RNA components. on 20-FEB-90 17:23:58
251 1 (E.c. pr ): Base pairing in Escherichia coli RNase P RNA.
252 2 (chrom ): Chromatium
254 12 (B.brevis): Bacillus brevis RNase P RNA, B. James.
255 13 ( 90% con): 90% conserved
256 14 (100% con): 100% conserved
257 15 (gram+ pr): pairing
260 RNase P RNA components. on 20-FEB-90 17:23:58
263 tion: identity: Data:
265 1 1 E.c. pr <<<<<<<<<< {{{{{{{{<<:<<<<<<<<<<^<<<<<<====>>>> E.c. pr
266 1 2 chrom GGAGUCGGCCAGACAGUCGCUUCCGUCCU------------------ chrom
268 1 12 B.brevis AUGCAGGAAAUGCGGGUAGCCGCUGCCGCAAUCGUCU------------- B.brevis
269 1234567890123456789012 <! this should be 21 not 22,
270 ! this example must be inset on left by 1 space from olsen /print files !
271 1 13 90% con G C G A CGC GC - - 90% con
272 1 14 100% con G A CGC 100% con
273 1 15 gram+ pr <<<<<<<<<< {{{{{{{{<<<<<<<<<<<<<=============== gram+ pr
275 60 1 E.c. pr >>>>>>^>>^>>>>:>> <<<^<<<< {{{{{ E.c. pr
276 60 2 chrom -----GGUG-ACGGGGGAGGAAAGUCCGG-GCUCCAU------------- chrom
278 60 10 B.stearo ----UU-CG-GCCGUAGAGGAAAGUCCAUGCUCGCACGGUGCUGAGAUGC B.stearo
281 ---------------------------------------------------
285 picorna.msf MSF: 100 Type: P January 17, 1991 17:53 Check: 541
287 Name: Cb3 Len: 100 Check: 7009 Weight: 1.00
288 Name: E Len: 100 Check: 60 Weight: 1.00
293 Cb3 ...gpvedai .......t.. aaigr..vad tvgtgptnse aipaltaaet
294 E gvenae.kgv tentna.tad fvaqpvylpe .nqt...... kv.affynrs
298 Cb3 ghtsqvvpgd tmqtrhvkny hsrsestien flcrsacvyf teykn.....
299 E ...spi.gaf tvks...... gs.lesgfap .fsngtc.pn sviltpgpqf
301 ---------------------------------------------------
303 This is NBRF-PIR MAILSERVER version 1.45
304 Command-> get PIR3:A31391
306 ENTRY A31391 #Type Protein
307 TITLE *Esterase-6 - Fruit fly (Drosophila melanogaster)
309 DATE 03-Aug-1992 #Sequence 03-Aug-1992 #Text 03-Aug-1992
310 PLACEMENT 0.0 0.0 0.0 0.0 0.0
311 COMMENT *This entry is not verified.
312 SOURCE Drosophila melanogaster
315 #Authors Cooke P.H., Oakeshott J.G.
316 #Citation submitted to GenBank, April 1989
317 #Reference-number A31391
319 #Cross-reference GB:J04167
321 SUMMARY #Molecular-weight 61125 #Length 544 #Checksum 1679
324 1 M N Y V G L G L I I V L S C L W L G S N A S D T D D P L L V
325 31 Q L P Q G K L R G R D N G S Y Y S Y E S I P Y A E P P T G D
326 61 L R F E A P E P Y K Q K W S D I F D A T K T P V A C L Q W D
327 91 Q F T P G A N K L V G E E D C L T V S V Y K P K N S K R N S
328 121 F P V V A H I H G G A F M F G A A W Q N G H E N V M R E G K
329 151 F I L V K I S Y R L G P L G F V S T G D R D L P G N Y G L K
330 181 D Q R L A L K W I K Q N I A S F G G E P Q N V L L V G H S A
331 211 G G A S V H L Q M L R E D F G Q L A R A A F S F S G N A L D
332 241 P W V I Q K G A R G R A F E L G R N V G C E S A E D S T S L
333 271 K K C L K S K P A S E L V T A V R K F L I F S Y V P F A P F
334 301 S P V L E P S D A P D A I I T Q D P R D V I K S G K F G Q V
335 331 P W A V S Y V T E D G G Y N A A L L L K E R K S G I V I D D
336 361 L N E R W L E L A P Y L L F Y R D T K T K K D M D D Y S R K
337 391 I K Q E Y I G N Q R F D I E S Y S E L Q R L F T D I L F K N
338 421 S T Q E S L D L H R K Y G K S P A Y A Y V Y D N P A E K G I
339 451 A Q V L A N R T D Y D F G T V H G D D Y F L I F E N F V R D
340 481 V E M R P D E Q I I S R N F I N M L A D F A S S D N G S L K
341 511 Y G E C D F K D N V G S E K F Q L L A I Y I D G C Q N R Q H
345 ---------------------------------------------------
349 Every block starts with "BEGIN blockname;" and ends with "END;".
350 Each block is composed of one or more statements, each
351 terminated by a semicolon (;).
353 Comments may be included in NEXUS files by enclosing them within
354 square brackets, as in "[This is a comment]."
356 NEXUS-conforming files are identified by a "#NEXUS" directive at
357 the very beginning of the file (line 1, column 1). If the
358 #NEXUS is omitted PAUP issues a warning but continues
361 NEXUS files are entirely free-format. Blanks, tabs, and
362 newlines may be placed anywhere in the file. Unless RESPECTCASE
363 is requested, commands and data may be entered in upper case,
364 lower case, or a mixture of upper and lower case.
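
Two of the rules above, the "#NEXUS" directive at line 1, column 1 and
the square-bracket comments, are easy to check in code. This is only a
sketch: the function names are hypothetical, the directive is matched
without regard to case, nested brackets are tracked by depth, and quoted
strings get no special handling. None of those choices is taken from PAUP.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the text starts with "#NEXUS" at line 1, column 1. */
static int has_nexus_directive(const char *text)
{
    static const char tag[] = "#NEXUS";
    size_t i;

    assert(text != NULL);
    for (i = 0; tag[i] != '\0'; i++) {
        /* A short text hits its terminator here and fails the match. */
        if (toupper((unsigned char)text[i]) != tag[i])
            return 0;
    }
    return 1;
}

/* Copy src to dst without [bracketed comments].
   dst must have room for strlen(src) + 1 bytes. */
static void strip_nexus_comments(const char *src, char *dst)
{
    int depth = 0;               /* current bracket nesting level */

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        if (*src == '[')
            depth++;
        else if (*src == ']' && depth > 0)
            depth--;
        else if (depth == 0)
            *dst++ = *src;       /* outside any comment: keep it */
    }
    *dst = '\0';
}
```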
366 The following conventions are used in the syntax descriptions of
367 the various blocks. Upper-case items are entered exactly as
368 shown. Lower-case items inside of angle brackets -- e.g., <x>
369 -- represent items to be substituted by the user. Items inside
370 of square brackets -- e.g., [X] -- are optional. Items inside
371 of curly braces and separated by vertical bars -- e.g., { X | Y
372 | Z } -- are mutually exclusive options.
377 The DATA block contains the data matrix and other associated
378 information. Its syntax is:
381 DIMENSIONS NTAX=<number of taxa> NCHAR=<number of characters>;
382 [ FORMAT [ MISSING=<missing-symbol> ]
383 [ LABELPOS={ LEFT | RIGHT } ]
384 [ SYMBOLS="<symbols-list>" ]
386 [ MATCHCHAR=<match-symbol> ]
387 [ EQUATE="<symbol>=<expansion> [<symbol>=<expansion>...]" ]
390 [ DATATYPE = { STANDARD | DNA | RNA | PROTEIN } ]; ]
391 [ OPTIONS [ IGNORE={ INVAR | UNINFORM } ]
392 [ MSTAXA = { UNCERTAIN | POLYMORPH | VARIABLE } ]
393 [ ZAP = "<list of zapped characters>" ] ; ]
[ CHARLABELS <label_1> <label_2> ... <label_NCHAR> ; ]
395 [ TAXLABELS <label1_1> <label1_2> <label1_NTAX> ; ]
396 [ STATELABELS <currently ignored by PAUP> ; ]
397 MATRIX <data-matrix> ;
400 --- example PAUP file
404 [!Brown et al. (1982) primate mitochondrial DNA]
407 dimensions ntax=5 nchar=896;
408 format datatype=dna matchchar=. interleave missing='-';
412 human aagcttcaccggcgcagtca ttctcataatcgcccacggR cttacatcctcattactatt ctgcctagcaaactcaaact acgaacgcactcacagtcgc
413 chimp ................a.t. .c.................a ...............t.... ..................t. .t........c.........
414 gorilla ..................tg ....t.....t........a ........a......t.... .................... .......a..c.....c...
415 orang ................ac.. cc.....g..t.....t..a ..c........cc....g.. .................... .......a..c.....c...
416 gibbon ......t..a..t...ac.g .c.................a ..a..c..t..cc.g..... ......t............. .......a........c...
421 human cttccccacaacaatattca tgtgcctagaccaagaagtt attatctcgaactgacactg agccacaacccaaacaaccc agctctccctaagctt
422 chimp t................... .a................c. ........a.....g..... ...a................ ................
423 gorilla ..................tc .a................c. ........a.g......... ...a.............tt. .a..............
424 orang ta....a...........t. .c.......ga......acc ..cg..a.a......tg... .a.a..c.....g...cta. .a.....a........
425 gibbon a..t.......t........ ....ac...........acc .....t..a........... .a.tg..........gctag .a..............
428 ---------------------------------------------------
435 ||||||||||| Sample SMTP mail header
436 ---------------------------------------------------
439 From GenBank-Retrieval-System@genbank.bio.net Sun Nov 10 17:28:56 1991
440 Received: from genbank.bio.net by sunflower.bio.indiana.edu
441 (4.1/9.5jsm) id AA19328; Sun, 10 Nov 91 17:28:55 EST
442 Received: by genbank.bio.net (5.65/IG-2.0)
443 id AA14458; Sun, 10 Nov 91 14:30:03 -0800
444 Date: Sun, 10 Nov 91 14:30:03 -0800
445 Message-Id: <9111102230.AA14458@genbank.bio.net>
446 From: Database Server <GenBank-Retrieval-System@genbank.bio.net>
447 To: gilbertd@sunflower.bio.indiana.edu
448 Subject: Results of Query for drorna
451 No matches on drorna.
453 From GenBank-Retrieval-System@genbank.bio.net Sun Nov 10 17:28:49 1991
454 Received: from genbank.bio.net by sunflower.bio.indiana.edu
455 (4.1/9.5jsm) id AA19323; Sun, 10 Nov 91 17:28:47 EST
456 Received: by genbank.bio.net (5.65/IG-2.0)
457 id AA14461; Sun, 10 Nov 91 14:30:03 -0800
458 Date: Sun, 10 Nov 91 14:30:03 -0800
459 Message-Id: <9111102230.AA14461@genbank.bio.net>
460 From: Database Server <GenBank-Retrieval-System@genbank.bio.net>
461 To: gilbertd@sunflower.bio.indiana.edu
462 Subject: Results of Query for droest6
465 LOCUS DROEST6 1819 bp ss-mRNA INV 31-AUG-1987
466 DEFINITION D.melanogaster esterase-6 mRNA, complete cds.
480 ||||||||||| GCG manual discussion of sequence symbols:
481 ---------------------------------------------------
486 GCG programs allow all upper and lower case letters, periods (.),
487 asterisks (*), pluses (+), ampersands (&), and ats (@) as symbols in
488 biological sequences. Nucleotide symbols, their complements, and the
489 standard one-letter amino acid symbols are shown below in separate lists.
490 The meanings of the symbols +, &, and @ have not been assigned at this
491 writing (March, 1989).
493 GCG uses the letter codes for amino acid codes and nucleotide
494 ambiguity proposed by IUB (Nomenclature Committee, 1985,
495 Eur. J. Biochem. 150; 1-5). These codes are compatible with the codes
496 used by the EMBL, GenBank, and NBRF data libraries.
501 The meaning of each symbol, its complement, and the Cambridge and
502 Stanford equivalents are shown below. Cambridge files can be converted
503 into GCG files and vice versa with the programs FROMSTADEN and TOSTADEN.
504 IntelliGenetics sequence files can be interconverted with the programs
507 IUB/GCG Meaning Complement Staden/Sanger Stanford
519 V A or C or G B not supported N
520 H A or C or T D not supported N
521 D A or G or T H not supported N
522 B C or G or T V not supported N
523 X/N G or A or T or C X -/X N
524 . not G or A or T or C . not supported ?
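
The complement column can be applied mechanically. The sketch below
reverse-complements a sequence using the standard IUB pairs (A/T, C/G,
M/K, R/Y, V/B, H/D; W, S and N are their own complements). The function
names are hypothetical. Two choices are assumptions rather than GCG
behaviour: U is complemented to A while A is always complemented to T
(DNA output), and symbols outside the table, such as the period, pass
through unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Complement of one IUB nucleotide symbol; letter case is preserved.
   Symbols not in the table are returned unchanged. */
static char iub_complement(char base)
{
    static const char from[] = "ACGTUMRWSYKVHDBNX";
    static const char to[]   = "TGCAAKYWSRMBDHVNX";
    char up = (char)toupper((unsigned char)base);
    const char *hit;
    char comp;

    if (up == '\0')
        return base;             /* strchr would match the terminator */
    hit = strchr(from, up);
    if (hit == NULL)
        return base;
    comp = to[hit - from];
    return islower((unsigned char)base)
        ? (char)tolower((unsigned char)comp) : comp;
}

/* Reverse-complement a 0-terminated sequence in place. */
static void reverse_complement(char *seq)
{
    size_t i, n;

    assert(seq != NULL);
    n = strlen(seq);
    for (i = 0; i < n / 2; i++) {
        char left = iub_complement(seq[i]);
        seq[i] = iub_complement(seq[n - 1 - i]);
        seq[n - 1 - i] = left;
    }
    if (n % 2 == 1)              /* middle symbol of an odd-length sequence */
        seq[n / 2] = iub_complement(seq[n / 2]);
}
```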
527 The frame ambiguity codes used by Staden are not supported by GCG
528 and are translated by FROMSTADEN as the lower case single base
531 Staden Code Meaning GCG
545 Here is a list of the standard one-letter amino acid codes and their
546 three-letter equivalents. The synonymous codons and their depiction in
547 the IUB codes are shown. You should recognize that the codons following
548 semicolons (;) are not sufficiently specific to define a single amino
549 acid even though they represent the best possible back translation into
550 the IUB codes! All of the relationships in this list can be redefined by
551 the user in a local data file described below.
554 Symbol 3-letter Meaning Codons Depiction
555 A Ala Alanine GCT,GCC,GCA,GCG !GCX
557 Asparagine GAT,GAC,AAT,AAC !RAY
558 C Cys Cysteine TGT,TGC !TGY
559 D Asp Aspartic GAT,GAC !GAY
560 E Glu Glutamic GAA,GAG !GAR
561 F Phe Phenylalanine TTT,TTC !TTY
562 G Gly Glycine GGT,GGC,GGA,GGG !GGX
563 H His Histidine CAT,CAC !CAY
564 I Ile Isoleucine ATT,ATC,ATA !ATH
565 K Lys Lysine AAA,AAG !AAR
566 L Leu Leucine TTG,TTA,CTT,CTC,CTA,CTG
568 M Met Methionine ATG !ATG
569 N Asn Asparagine AAT,AAC !AAY
570 P Pro Proline CCT,CCC,CCA,CCG !CCX
571 Q Gln Glutamine CAA,CAG !CAR
572 R Arg Arginine CGT,CGC,CGA,CGG,AGA,AGG
574 S Ser Serine TCT,TCC,TCA,TCG,AGT,AGC !TCX,AGY;WSX
575 T Thr Threonine ACT,ACC,ACA,ACG !ACX
576 V Val Valine GTT,GTC,GTA,GTG !GTX
577 W Trp Tryptophan TGG !TGG
579 Y Tyr Tyrosine TAT, TAC !TAY
581 Glutamine GAA,GAG,CAA,CAG !SAR
582 * End Terminator TAA, TAG, TGA !TAR,TRA;TRR
591 ||||||||||| docs from PSC on sequence formats:
592 ---------------------------------------------------
595 Nucleic Acid and Protein Sequence File Formats
598 It will probably save you some time if you have your data in a usable
599 format before you send it to us. However, we do have the University of
600 Wisconsin Genetics Computing Group programs running on our VAXen and
601 this package includes several reformatting utilities. Our programs
602 usually recognize any of several standard formats, including GenBank,
603 EMBL, NBRF, and MolGen/Stanford. For the purposes of annotating an
604 analysis we find the GenBank and EMBL formats most useful, particularly
605 if you have already received an accession number from one of these
606 organizations for your sequence.
608 Our programs do not require that all of the line types available in
609 GenBank, EMBL, or NBRF file formats be present for the file format to
610 be recognized and processed. The following pages outline the essential
611 details required for correct processing of files by our programs.
612 Additional information may be present but will generally be ignored.
619 1. The first line in the file must have "GENETIC SEQUENCE DATA BANK"
620 in spaces 20 through 46 (see LINE 1, below).
621 2. The next 8 lines may contain arbitrary text. They are ignored but
622 are required to maintain the GenBank format (see LINE 2 - LINE 9).
624 Sequence Data Entries
626 3. Each sequence entry in the file should have the following format.
627 a) first line: Must have LOCUS in the first 5 spaces. The
628 genetic locus name or identifier must be in spaces
629 13 - 22. The length of the sequences is right
630 justified in spaces 23 through 29 (see LINE 10).
631 b) second line: Must have DEFINITION in the first 10 spaces.
632 Spaces 13 - 80 are free form text to identify the
633 sequence (see LINE 11).
634 c) third line: Must have ACCESSION in the first 9 spaces. Spaces
635 13 - 18 must hold the primary accession number
637 d) fourth line: Must have ORIGIN in the first 6 spaces. Nothing
638 else is required on this line, it indicates that
639 the nucleic acid sequence begins on the next line
641 e) fifth line: Begins the nucleotide sequence. The first 9
642 spaces of each sequence line may either be blank
643 or may contain the position in the sequence of the
644 first nucleotide on the line. The next 66 spaces
645 hold the nucleotide sequence in six blocks of ten
646 nucleotides. Each of the six blocks begins with a
647 blank space followed by ten nucleotides. Thus the
648 first nucleotide is in space eleven of the line while
649 the last is in space 75 (see LINE 14, LINE 15).
650 f) last line: Must have // in the first 2 spaces to indicate
651 termination of the sequence (see LINE 16).
653 NOTE: Multiple sequences may appear in each file. To begin another
654 sequence go back to a) and start again.
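
The column rules in item e) can be sketched as a small line formatter:
nine spaces for the position, then up to six blocks of a blank plus ten
nucleotides, so the first base lands in space 11 and the last in space 75.
The function name and buffer size are hypothetical, and the sketch always
writes the position number rather than leaving the first nine spaces blank.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write one GenBank-style sequence line into buf (at least 80 bytes).
   start is the 0-based offset of the first residue on the line.
   Returns the number of residues placed on the line (at most 60). */
static size_t genbank_seq_line(const char *seq, size_t seqlen,
                               size_t start, char *buf)
{
    size_t i, n, pos;

    assert(seq != NULL && buf != NULL && start <= seqlen);
    n = seqlen - start;
    if (n > 60)
        n = 60;
    /* Spaces 1-9: right-justified 1-based position of the first base. */
    pos = (size_t)sprintf(buf, "%9lu", (unsigned long)(start + 1));
    for (i = 0; i < n; i++) {
        if (i % 10 == 0)
            buf[pos++] = ' ';    /* each block of ten begins with a blank */
        buf[pos++] = seq[start + i];
    }
    buf[pos] = '\0';
    return n;
}
```

Calling it in a loop, advancing start by the returned count until it
returns 0, produces the sequence lines of an entry. The terminating //
line is written separately.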
660 LINE 1 : GENETIC SEQUENCE DATA BANK
669 LINE 10 :LOCUS L_Name Length BP
670 LINE 11 :DEFINITION Describe the sequence any way you want
671 LINE 12 :ACCESSION Accession Number
673 LINE 14 : 1 acgtacgtac gtacgtacgt acgtacgtac gtacgtacgt a...
681 Unlike the GenBank file format the EMBL file format does not require
682 a series of header lines. Thus the first line in the file begins
683 the first sequence entry of the file.
685 1. The first line of each sequence entry contains the two letters ID
686 in the first two spaces. This is followed by the EMBL identifier
687 in spaces 6 through 14. (See LINE 1).
689 2. The second line of each sequence entry has the two letters AC in
690 the first two spaces. This is followed by the accession number in
691 spaces 6 through 11. (See LINE 2).
693 3. The third line of each sequence entry has the two letters DE in the
694 first two spaces. This is followed by a free form text definition
695 in spaces 6 through 72. (See LINE 3).
697 4. The fourth line in each sequence entry has the two letters SQ in
698 the first two spaces. This is followed by the length of the
699 sequence beginning at or after space 13. After the sequence length
700 there is a blank space and the two letters BP. (See LINE 4).
702 5. The nucleotide sequence begins on the fifth line of the sequence
703 entry. Each line of sequence begins with four blank spaces. The
704 next 66 spaces hold the nucleotide sequence in six blocks of ten
705 nucleotides. Each of the six blocks begins with a blank space
706 followed by ten nucleotides. Thus the first nucleotide is in space
707 6 of the line while the last is in space 70. (See LINE 5 -
710 6. The last line of each sequence entry in the file is a terminator
711 line which has the two characters // in the first two spaces.
714 7. Multiple sequences may appear in each file. To begin another
715 sequence go back to item 1 and start again.
721 LINE 2 :AC Accession number
722 LINE 3 :DE Describe the sequence any way you want
724 LINE 5 : ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTA...
730 NBRF (protein or nucleic acid) File Format
732 1. The first line of each sequence entry begins with a greater than
733 symbol, >. This is immediately followed by the two character
734 sequence type specifier. Space four must contain a semi-colon.
735 Beginning in space five is the sequence name or identification code
736 for the NBRF database. The code is from four to six letters and
737 numbers. (See LINE 1).
739 !!!! >> add these to readseq
740 Specifier Sequence type
748 N1 functional RNA, other than tRNA
751 2. The second line of each sequence entry contains two kinds of
752 information. First is the sequence name which is separated from
753 the organism or organelle name by the three character sequence
754 blank space, dash, blank space, " - ". There is no special
755 character marking the beginning of this line. (See LINE 2).
757 3. Either the amino acid or nucleic acid sequence begins on line three
758 and can begin in any space, including the first. The sequence is
759 free format and may be interrupted by blanks for ease of reading.
Protein sequences may contain special punctuation to indicate
761 various indeterminacies in the sequence. In the NBRF data files
762 all lines may be up to 500 characters long. However some PSC
763 programs currently have a limit of 130 characters per line
764 (including blanks), and BitNet will not accept lines of over eighty
765 characters. (See LINE 3, LINE 4, and LINE 5).
The last character in the sequence must be an asterisk, *.
772 LINE 2 :Cytochrome b - Rat mitochondrion (SGC1)
773 LINE 3 :M T N I R K S H P L F K I I N H S F I D L P A P S
774 LINE 4 : VTHICRDVN Y GWL IRY
775 LINE 5 :TWIGGQPVEHPFIIIGQLASISYFSIILILMPISGIVEDKMLKWN*
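
A minimal sketch of the two structural checks described above: the
header line (">", a two-character type specifier, a semi-colon in space
four, then the code) and the closing asterisk. The function names and the
sample header ">P1;CBRT" used with them are hypothetical. The rule that
the code is four to six characters long is not enforced here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Split an NBRF header such as ">P1;CBRT" into its type specifier and
   code. Returns 1 on success, 0 if the line does not fit the layout
   or the code does not fit in codesize bytes. */
static int parse_nbrf_header(const char *line, char type[3],
                             char *code, size_t codesize)
{
    size_t n;

    assert(line != NULL && type != NULL && code != NULL && codesize > 0);
    if (line[0] != '>' || line[1] == '\0' || line[2] == '\0' || line[3] != ';')
        return 0;
    type[0] = line[1];
    type[1] = line[2];
    type[2] = '\0';
    n = strcspn(line + 4, " \t\r\n");    /* code runs to the first blank */
    if (n == 0 || n >= codesize)
        return 0;
    memcpy(code, line + 4, n);
    code[n] = '\0';
    return 1;
}

/* Return 1 if the last non-blank character of the sequence text is '*'. */
static int nbrf_sequence_terminated(const char *seqtext)
{
    size_t n;

    assert(seqtext != NULL);
    n = strlen(seqtext);
    while (n > 0 && isspace((unsigned char)seqtext[n - 1]))
        n--;
    return n > 0 && seqtext[n - 1] == '*';
}
```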
779 MolGen/Stanford File Format
781 1. The first line in a sequence file is a comment line. This line
782 begins with a semi-colon in the first space. This line need
783 not be present. If it is present it holds descriptive text.
There may be as many comment lines as desired at the start of
the sequence file. (See LINE 1).
787 2. The second line must be present and contains an identifier or
788 name for the sequence in the first ten spaces. (See LINE 2).
790 3. The sequence begins on the third line and occupies up to eighty
791 spaces. Spaces may be included in the sequence for ease of
reading. The sequence continues for as many lines as needed
793 and is terminated with a 1 or 2. 1 indicates a linear sequence
794 while 2 marks a circular sequence. (See LINE 3 and LINE 4).
796 Example MolGen/Stanford file
798 LINE 1 :; Describe the sequence any way you want
800 LINE 3 :ACGCACGTAC ACGTACGTAC A C G T C C G T ACG TAC GTA CGT
801 LINE 4 : GCTTA GG G C T A1
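
The three rules above can be sketched as a reader for one entry held in
memory: skip leading ";" comment lines, take the name from the first ten
spaces of the next line, then collect sequence symbols until a 1 or 2.
The function name and its return conventions are hypothetical, and the
sketch handles a single entry only.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Parse one MolGen/Stanford entry from text.
   name receives up to 10 characters, seq the residues without blanks,
   *circular is set to 1 for a '2' terminator and 0 for a '1'.
   Returns the sequence length, or -1 on a malformed entry. */
static long parse_stanford(const char *text, char name[11], char *seq,
                           size_t seqsize, int *circular)
{
    const char *p = text;
    size_t n = 0, len = 0;

    assert(text != NULL && name != NULL && seq != NULL);
    assert(circular != NULL && seqsize > 0);

    /* Comment lines start with a semi-colon in the first space. */
    while (*p == ';') {
        while (*p != '\0' && *p != '\n')
            p++;
        if (*p == '\n')
            p++;
    }
    /* Identifier line: the name sits in the first ten spaces. */
    while (*p != '\0' && *p != '\n' && *p != '\r') {
        if (n < 10)
            name[n++] = *p;
        p++;
    }
    while (n > 0 && isspace((unsigned char)name[n - 1]))
        n--;                         /* trim trailing blanks */
    name[n] = '\0';
    if (n == 0)
        return -1;

    /* Sequence lines: blanks are skipped, '1' or '2' ends the entry. */
    for (; *p != '\0'; p++) {
        if (isspace((unsigned char)*p))
            continue;
        if (*p == '1' || *p == '2') {
            *circular = (*p == '2');
            seq[len] = '\0';
            return (long)len;
        }
        if (len + 1 >= seqsize)
            return -1;               /* caller's buffer is too small */
        seq[len++] = *p;
    }
    return -1;                       /* no terminator found */
}
```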
806 ||||||||||| Phylip file format
807 ---------------------------------------------------
809 Phylip 3.3 File Format (DNA sequences)
812 The input and output formats for PROTPARS and for RESTML are described in
813 their document files. In general their input formats are similar to those
814 described here, except that the one-letter codes for data are specific to those
815 programs and are described in those document files. Since the input formats
816 for the eight DNA sequence programs apply to all eight, they are described
817 here. Their input formats are standard: the data have A's, G's, C's and T's
818 (or U's). The first line of the input file contains the number of species and
819 the number of sites. As with the other programs, options information may
820 follow this. In the case of DNAML, DNAMLK, and DNADIST an additional line
(described in the document file for these programs) may follow the first one.
822 Following this, each species starts on a new line. The first 10 characters of
823 that line are the species name. There then follows the base sequence of that
824 species, each character being one of the letters A, B, C, D, G, H, K, M, N, O,
825 R, S, T, U, V, W, X, Y, ?, or - (a period was also previously allowed but it is
no longer allowed, because it sometimes is used in aligned sequences to mean
827 "the same as the sequence above"). Blanks will be ignored, and so will
828 numerical digits. This allows GENBANK and EMBL sequence entries to be read
829 with minimum editing.
831 These characters can be either upper or lower case. The algorithms
832 convert all input characters to upper case (which is how they are treated).
833 The characters constitute the IUPAC (IUB) nucleic acid code plus some slight
834 extensions. They enable input of nucleic acid sequences taking full account of
835 any ambiguities in the sequence.
837 The sequences can continue over multiple lines; when this is done the sequences
838 must be either in "interleaved" format, similar to the output of alignment
839 programs, or "sequential" format. These are described in the main document
840 file. In sequential format all of one sequence is given, possibly on multiple
841 lines, before the next starts. In interleaved format the first part of the
842 file should contain the first part of each of the sequences, then possibly a
843 line containing nothing but a carriage-return character, then the second part
844 of each sequence, and so on. Only the first parts of the sequences should be
845 preceded by names. Here is a hypothetical example of interleaved format:
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT
854 GAGCCCGGGC AATACAGGGT AT
855 GAGCCGTGGC CGGGCACGGT AT
856 ACAGGTTGGC CGTTCAGGGT AA
857 AAACCGAGGC CGGGACACTC AT
858 AAACCATTGC CGGTACGCTT AA
860 while in sequential format the same sequences would be:
Turkey    AAGCTNGGGC ATTTCAGGGT
          GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
          GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
          ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
          AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
          AAACCATTGC CGGTACGCTT AA
875 Note, of course, that a portion of a sequence like this:
877 300 AAGCGTGAAC GTTGTACTAA TRCAG
879 is perfectly legal, assuming that the species name has gone before, and is
880 filled out to full length by blanks. The above digits and blanks will be
881 ignored, the sequence being taken as starting at the first base symbol (in this
884 The present versions of the programs may sometimes have difficulties with
885 the blank lines between groups of lines, and if so you might want to retype
886 those lines, making sure that they have only a carriage-return and no blank
887 characters on them, or you may perhaps have to eliminate them. The symptoms of
888 this problem are that the programs complain that the sequences are not properly
889 aligned, and you can find no other cause for this complaint.
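
The reading rules described above (a 10-character species name, then
bases with blanks and digits ignored and case folded to upper) can be
sketched as follows. The function names are hypothetical, and this is not
the PHYLIP code itself. It does not validate the symbols against the
allowed letter list.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy the base symbols of one sequence chunk into dst, skipping
   blanks and digits and converting to upper case.
   dst must have room for strlen(chunk) + 1 bytes.
   Returns the number of symbols copied. */
static size_t phylip_clean_chunk(const char *chunk, char *dst)
{
    size_t n = 0;

    assert(chunk != NULL && dst != NULL);
    for (; *chunk != '\0'; chunk++) {
        unsigned char c = (unsigned char)*chunk;
        if (isspace(c) || isdigit(c))
            continue;                /* blanks and digits are ignored */
        dst[n++] = (char)toupper(c);
    }
    dst[n] = '\0';
    return n;
}

/* Split the first line of a species into its name (first 10
   characters) and its cleaned bases. Returns the number of bases. */
static size_t phylip_species_line(const char *line, char name[11], char *dst)
{
    size_t i = 0;

    assert(line != NULL && name != NULL && dst != NULL);
    while (i < 10 && line[i] != '\0' && line[i] != '\n') {
        name[i] = line[i];
        i++;
    }
    name[i] = '\0';
    return phylip_clean_chunk(line + i, dst);
}
```

Continuation lines, in either interleaved or sequential layout, carry no
name and can be passed straight to phylip_clean_chunk.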
891 ------------------------------------------------
894 ||||||||||| ASN.1 file format
895 ---------------------------------------------------
898 ASN.1 -- see NCBI toolkit docs, source and examples (ncbi.nlm.nih.gov)
900 Example asn.1 sequence file----
905 id { local id 1 } , -- id essential
906 descr { title "Dummy sequence data from nowhere" } , -- optional
907 inst { -- inst essential
913 iupacna "GAATTCATTTTTGAAACAAATCGACCTGACGACGGAATGGTACTCGAATTA
914 TGGGCCAAAGGGTTTTATGGGACAAATTAATAGGTGTTCATTATATGCCACTTTCGGAGATTAGATACAGCAATGCAG
915 TGGATTCAAAGCAATAGAGTTGTTCTT"
920 descr { title "Dummy sequence 2 data from somewhere else" } ,
927 iupacna "TTTTTTTTTTTTGAAACAAATCGACCTGACGACGGAATGGTACTCGAATTA
928 TGGGCCAAAGGGTTTTATGGGACAAATTAATAGGTGTTCATTATATGCCACTTTCGGAGATTAGATACAGCAATGCAG
929 TGGATTCAAAGCAATAGAGTT"
936 partial ASN.1 description from toolkit
938 Bioseq ::= SEQUENCE {
939 id SET OF Seq-id , -- equivalent identifiers
940 descr Seq-descr OPTIONAL , -- descriptors
941 inst Seq-inst , -- the sequence data
942 annot SET OF Seq-annot OPTIONAL }
944 Seq-inst ::= SEQUENCE { -- the sequence data itself
945 repr ENUMERATED { -- representation class
946 not-set (0) , -- empty
947 virtual (1) , -- no seq data
948 raw (2) , -- continuous sequence
949 seg (3) , -- segmented sequence
950 const (4) , -- constructed sequence
951 ref (5) , -- reference to another sequence
952 consen (6) , -- consensus sequence or pattern
953 map (7) , -- ordered map (genetic, restriction)
955 mol ENUMERATED { -- molecule class in living organism
956 not-set (0) , -- > cdna = rna
960 na (4) , -- just a nucleic acid
962 length INTEGER OPTIONAL , -- length of sequence in residues
963 fuzz Int-fuzz OPTIONAL , -- length uncertainty
964 topology ENUMERATED { -- topology of molecule
968 tandem (3) , -- some part of tandem repeat
969 other (255) } DEFAULT linear ,
970 strand ENUMERATED { -- strandedness in living organism
972 ss (1) , -- single strand
973 ds (2) , -- double strand
975 other (255) } OPTIONAL , -- default ds for DNA, ss for RNA, pept
976 seq-data Seq-data OPTIONAL , -- the sequence
977 ext Seq-ext OPTIONAL , -- extensions for special types
978 hist Seq-hist OPTIONAL } -- sequence history
980 ------------------------------------------------