.TH "sreformat" 1 "@RELEASEDATE@" "@PACKAGE@ @RELEASE@" "@PACKAGE@ Manual"
sreformat - convert sequence file to different format
reads the sequence file
in any supported format, reformats it
into a new format specified by
then prints the reformatted text.
Supported input formats include (but are not limited to) the unaligned
formats FASTA, Genbank, EMBL, SWISS-PROT, PIR, and GCG, and the
aligned formats Stockholm, Clustal, GCG MSF, and Phylip.
Available unaligned output file format codes
(EMBL/SWISSPROT format);
(GCG single sequence format);
(GCG flatfile database format);
(Intelligenetics format);
(PIR/CODATA flatfile format);
(an undocumented St. Louis format);
(raw sequence, no other information).
The available aligned output file format
(PFAM/Stockholm format);
(aligned FASTA format, called A2M by the UC Santa Cruz
(Felsenstein's PHYLIP format); and
(old SELEX/HMMER/Pfam annotated alignment format);
All these codes are interpreted case-insensitively
(e.g. MSF, Msf, or msf all work).
Unaligned format files cannot be reformatted to
However, aligned formats can be reformatted
to unaligned formats -- gap characters are
This program was originally named
but that name clashes with a GCG program of the same name.
Enable alignment reformatting. By default, sreformat treats
the input file as an unaligned input
file (even if it is an alignment), and it will not allow you
to convert an unaligned file to an alignment (for obvious
This may seem silly; surely if sreformat can autodetect and parse
alignment file formats as input, it can figure out when it has an
alignment! There are two reasons. One is just the historical
structure of the code. The other is that FASTA unaligned format and
A2M aligned format (aligned FASTA) are impossible to tell apart with
DNA; convert U's to T's, to make sure a nucleic acid
sequence is shown as DNA not RNA. See
Print brief help; includes version number and summary of
all options, including expert options.
Lowercase; convert all sequence residues to lower case.
RNA; convert T's to U's, to make sure a nucleic acid
sequence is shown as RNA not DNA. See
Uppercase; convert all sequence residues to upper case.
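The DNA, RNA, lowercase, and uppercase conversions described above are simple character substitutions. The following Python fragment is only an illustrative sketch of that behavior, not sreformat's own code; the function names are hypothetical, and preserving the case of converted residues is an assumption.

```python
# Illustrative sketch only; function names are hypothetical and
# case preservation during U/T conversion is an assumption.

def to_dna(seq: str) -> str:
    """Convert U's to T's so the sequence reads as DNA."""
    return seq.translate(str.maketrans("Uu", "Tt"))


def to_rna(seq: str) -> str:
    """Convert T's to U's so the sequence reads as RNA."""
    return seq.translate(str.maketrans("Tt", "Uu"))


if __name__ == "__main__":
    print(to_dna("ACGUacgu"))          # U/u replaced by T/t
    print(to_rna("ACGTacgt"))          # T/t replaced by U/u
    print(to_rna("acgt").upper())      # case conversion is a separate step
```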
For DNA sequences, convert non-IUPAC characters (such as X's) to N's.
This is for compatibility with benighted people who insist on using X
instead of the IUPAC ambiguity character N. (X is for ambiguity
in an amino acid residue).
Warning: the code doesn't
check that you are actually giving it DNA. It simply
converts non-IUPAC DNA symbols to N. So
if you accidentally give it protein sequence, it will
happily convert almost every amino acid residue to an N.
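As a rough illustration of the conversion described above, the following Python sketch replaces any symbol that is not an IUPAC nucleotide code with N. It is not sreformat's implementation: the function name, the set of gap characters left untouched, and the preservation of lower case are all assumptions made for the example.

```python
# Illustrative sketch only. IUPAC nucleotide codes are standard; the gap
# character set and the lower-case handling are assumptions.

IUPAC_NUCLEOTIDE = set("ACGTURYMKSWHBVDN")
GAP_CHARS = set("-._~")  # assumed gap symbols, passed through unchanged


def mask_non_iupac(seq: str) -> str:
    """Replace every non-IUPAC, non-gap symbol with N (or n)."""
    out = []
    for ch in seq:
        if ch in GAP_CHARS or ch.upper() in IUPAC_NUCLEOTIDE:
            out.append(ch)
        else:
            out.append("n" if ch.islower() else "N")
    return "".join(out)


if __name__ == "__main__":
    print(mask_non_iupac("ACGTXXACxt"))
```

Like the option itself, the sketch makes no attempt to verify that its input is really nucleic acid.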
(Babelfish). Autodetect and read a sequence file format other than the
default (FASTA). Almost any common sequence file format is recognized
(including Genbank, EMBL, SWISS-PROT, PIR, and GCG unaligned sequence
formats, and Stockholm, GCG MSF, and Clustal alignment formats). See
the printed documentation for a complete list of supported formats.
.BI --informat " <s>"
Specify that the sequence file is in format
rather than the default FASTA format.
Common examples include Genbank, EMBL, GCG,
PIR, Stockholm, Clustal, MSF, or PHYLIP;
see the printed documentation for a complete list
of accepted format names.
This option overrides the default format (FASTA)
Babelfish autodetection option.
is an alignment, remove any columns that consist entirely of gap
characters, minimizing the overall length of the alignment.
(Often useful if you've extracted a subset of aligned
sequences from a larger alignment.)
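The column removal described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program's own code: the function name is hypothetical, the alignment is represented as a list of equal-length strings, and the set of characters treated as gaps is assumed.

```python
# Illustrative sketch only; GAP_CHARS is an assumed gap alphabet.

GAP_CHARS = set("-._~")


def remove_all_gap_columns(rows: list[str]) -> list[str]:
    """Drop every alignment column in which all sequences have a gap."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("aligned sequences must all have the same length")
    # Keep a column if at least one sequence has a residue there.
    keep = [c for c in range(width)
            if any(row[c] not in GAP_CHARS for row in rows)]
    return ["".join(row[c] for c in keep) for row in rows]


if __name__ == "__main__":
    subset = ["AC--GT", "A---GT", "AC-.G-"]
    for row in remove_all_gap_columns(subset):
        print(row)
```

In the example, the third and fourth columns contain only gap characters and are removed; columns with at least one residue are kept.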
For SELEX alignment output format only, put the entire
alignment in one block (don't wrap into multiple blocks).
This is close to the format used internally by Pfam
in Stockholm and Cambridge.
Try to convert gap characters to UC Santa Cruz SAM style, where a .
means a gap in an insert column, and a - means a
deletion in a consensus/match column. This only
works for converting aligned file formats, and only
if the alignment already adheres to the SAM convention
of upper case for residues in consensus/match columns,
and lower case for residues in insert columns. This is
true, for instance, of all alignments produced by old
versions of HMMER. (HMMER2 produces alignments
that adhere to SAM's conventions even in gap character choice.)
This option was added to allow Pfam alignments to be
reformatted into something more suitable for profile HMM
construction using the UCSC SAM software.
Try to convert the alignment gap characters and
residue cases to UC Santa Cruz SAM style, where a .
means a gap in an insert column and a - means a
deletion in a consensus/match column, and
upper case means match/consensus residues and
lower case means inserted residues. This will only
work for converting aligned file formats, but unlike the
option, it will work regardless of whether the file adheres
to the upper/lower case residue convention. Instead, any
column containing more than a fraction
of gap characters is interpreted as an insert column,
and all other columns are interpreted as match columns.
This option was added to allow Pfam alignments to be
reformatted into something more suitable for profile HMM
construction using the UCSC SAM software.
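The gap-fraction rule described above can be illustrated with a short Python sketch. It is not the program's implementation: the function name, the parameter name max_gap_frac, its default of 0.5, and the set of characters treated as gaps are all assumptions chosen for the example.

```python
# Illustrative sketch only; names, the 0.5 default, and GAP_CHARS are
# assumptions, not taken from sreformat itself.

GAP_CHARS = set("-._~")


def to_sam_style(rows: list[str], max_gap_frac: float = 0.5) -> list[str]:
    """Rewrite gaps and residue case following the SAM convention.

    A column whose gap fraction exceeds max_gap_frac is an insert column
    (lower-case residues, '.' gaps); every other column is a match column
    (upper-case residues, '-' gaps).
    """
    if not rows:
        return []
    nseq = len(rows)
    width = len(rows[0])
    is_insert = []
    for c in range(width):
        ngap = sum(1 for row in rows if row[c] in GAP_CHARS)
        is_insert.append(ngap / nseq > max_gap_frac)

    out = []
    for row in rows:
        chars = []
        for ch, insert in zip(row, is_insert):
            if ch in GAP_CHARS:
                chars.append("." if insert else "-")
            else:
                chars.append(ch.lower() if insert else ch.upper())
        out.append("".join(chars))
    return out


if __name__ == "__main__":
    for row in to_sam_style(["ACg-T", "A-.-T", "AC--t"]):
        print(row)
```

In the example, the third and fourth columns are more than half gaps, so they are treated as insert columns; the remaining columns are treated as match columns regardless of the case used in the input.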
@PACKAGE@ and its documentation is @COPYRIGHT@
HMMER - Biological sequence analysis with profile HMMs
Copyright (C) 1992-1999 Washington University School of Medicine
This source code is distributed under the terms of the
GNU General Public License. See the files COPYING and LICENSE
See COPYING in the source code distribution for more details, or contact me.
Washington Univ. School of Medicine
St Louis, MO 63110 USA
Phone: 1-314-362-7666
Email: eddy@genetics.wustl.edu