DOCUMENTATION OF SEG (FROM 'MAN' PAGE)
--------------------------------------

seg - segment sequence(s) by local complexity

seg sequence [ W ] [ K(1) ] [ K(2) ] [ -x ] [ options ]

seg divides sequences into contrasting segments of low-complexity
and high-complexity. Low-complexity segments defined by the
algorithm represent "simple sequences" or "compositionally-biased

Locally-optimized low-complexity segments are produced at defined
levels of stringency, based on formal definitions of local
compositional complexity (Wootton & Federhen, 1993). The segment
lengths and the number of segments per sequence are determined
automatically by the algorithm.

The input is a FASTA-formatted sequence file, or a database file
containing many FASTA-formatted sequences. seg is tuned for amino
acid sequences. For nucleotide sequences, see EXAMPLES OF
PARAMETER SETS below.

The stringency of the search for low-complexity segments is
determined by three user-defined parameters: trigger window length
[ W ], trigger complexity [ K(1) ] and extension complexity [ K(2) ]
(see below under PARAMETERS). The defaults provided are suitable
for low-complexity masking of database search query sequences [ -x
option required, see below ].

OUTPUTS AND APPLICATIONS
------------------------

(1) Readable segmented sequence [Default]. Regions of contrasting
complexity are displayed in "tree format". See EXAMPLES.

(2) Low-complexity masking (see Altschul et al., 1994). Produce a
masked FASTA-formatted file, ready for input as a query sequence for
database search programs such as BLAST or FASTA. The amino acids in
low-complexity regions are replaced with "x" characters [-x option].

(3) Database construction. Produce FASTA-formatted files containing
low-complexity segments [-l option], or high-complexity segments
[-h option], or both [-a option]. Each segment is a separate
sequence entry with an informative header line.

ALGORITHM
---------

The SEG algorithm has two stages: first, identification of
approximate raw segments of low-complexity; second, local
optimization.

At the first stage, the stringency and resolution of the search for
low-complexity segments are determined by the W, K(1) and K(2)
parameters. All windows of length W with complexity less than or
equal to K(1), including overlapping windows, are identified as
trigger windows. "Complexity" here is defined by equation (3) of
Wootton & Federhen (1993). Each trigger window is then extended into
a contig in both directions by merging with extension windows, which
are overlapping windows of length W and complexity less than or
equal to K(2). Each contig is a raw segment.
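
The first stage can be illustrated with a minimal Python sketch. It is
not the seg source code. It assumes that "complexity" is the Shannon
entropy of the window composition in bits, which is consistent with
the stated maximum of log[base 2]20, but the exact definition is
equation (3) of Wootton & Federhen (1993) and may differ. The names
complexity() and raw_segments() are hypothetical.

```python
from collections import Counter
from math import log2


def complexity(window):
    """Shannon entropy of the residue composition, in bits (assumed measure)."""
    n = len(window)
    return -sum(c / n * log2(c / n) for c in Counter(window).values())


def raw_segments(seq, w=12, k1=2.2, k2=2.5):
    """Return raw segments as 0-based, end-exclusive (start, end) pairs."""
    # Complexity of every overlapping window of length w.
    k = [complexity(seq[i:i + w]) for i in range(len(seq) - w + 1)]
    segments = []
    i = 0
    while i < len(k):
        if k[i] <= k1:  # trigger window
            lo = hi = i
            # Extend in both directions over windows that satisfy K(2).
            while lo > 0 and k[lo - 1] <= k2:
                lo -= 1
            while hi + 1 < len(k) and k[hi + 1] <= k2:
                hi += 1
            segments.append((lo, hi + w))  # contig of merged windows
            i = hi + 1
        else:
            i += 1
    return segments
```

With a short window for illustration,
raw_segments("ACDEAAAAAAFGHI", w=4, k1=0.0, k2=1.0) returns [(3, 11)],
that is, the contig "EAAAAAAF" built around the "AAAA" trigger windows.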

At the second stage, each raw segment is reduced to a single
optimal low-complexity segment, which may be the entire raw
segment but is usually a subsequence. The optimal subsequence has
the lowest value of the probability P(0) (equation (5) of Wootton
& Federhen, 1993).
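
The second stage can be sketched as an exhaustive search over
subsequences of a raw segment. This is an illustration only, not the
seg implementation. The form of P(0) used below is an assumption and
should be checked against equation (5) of the paper: the number of
sequences sharing the window's composition, times the number of
compositions sharing its set of residue counts, divided by N^L, where
N is the alphabet size and L the window length. The max_trim argument
mirrors the documented -t option; log_p0() and optimal_subsequence()
are hypothetical names.

```python
from collections import Counter
from math import lgamma, log


def log_p0(window, alphabet_size=20):
    """Natural log of the assumed P(0) for one window.

    Assumes the window uses at most alphabet_size distinct residue types.
    """
    length = len(window)
    counts = Counter(window)
    # ln of the number of distinct sequences with this composition.
    ln_omega = lgamma(length + 1) - sum(lgamma(n + 1) for n in counts.values())
    # r[k] = number of residue types that occur exactly k times (k = 0 included).
    r = Counter(counts.values())
    r[0] = alphabet_size - len(counts)
    # ln of the number of compositions with this set of counts.
    ln_f = lgamma(alphabet_size + 1) - sum(lgamma(m + 1) for m in r.values())
    return ln_omega + ln_f - length * log(alphabet_size)


def optimal_subsequence(raw, max_trim=100, alphabet_size=20):
    """Return (start, end), 0-based and end-exclusive, with the lowest P(0)."""
    n = len(raw)
    # Subsequences max_trim or more residues shorter than raw are skipped.
    min_len = max(1, n - max_trim + 1)
    best, best_score = None, float("inf")
    for length in range(n, min_len - 1, -1):
        for start in range(n - length + 1):
            score = log_p0(raw[start:start + length], alphabet_size)
            if score < best_score:
                best, best_score = (start, start + length), score
    return best
```

For the toy raw segment "QAAAAAAAAAAK" this returns (1, 11), trimming
the flanking Q and K and keeping the run of ten A residues. The search
is quadratic in the number of windows, which is why a trim limit is
useful for long raw segments.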

PARAMETERS
----------

These three numeric parameters are in obligatory order after the
sequence file name.

Trigger window length [ W ]. An integer greater than zero [ Default
12 ].

Trigger complexity [ K1 ]. The maximum complexity of a trigger
window in units of bits. K1 must be equal to or greater than zero.
The maximum value is 4.322 (log[base 2]20) for amino acid
sequences [ Default 2.2 ].

Extension complexity [ K2 ]. The maximum complexity of an extension
window in units of bits. Only values greater than K1 are effective
in extending triggered windows. Range of possible values is as for
K1 [ Default 2.5 ].
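
The constraints on W, K1 and K2 can be checked before seg is run. The
following Python sketch is a hypothetical helper, not part of seg. It
assumes that the upper bound for both complexity parameters is
log[base 2] of the alphabet size (20 for amino acids, 4 for
nucleotides) and that K2 shares the range of K1.

```python
from math import log2


def check_parameters(w, k1, k2, alphabet_size=20):
    """Return a list of messages; an empty list means no problem was found."""
    k_max = log2(alphabet_size)  # 4.322 for amino acids, 2.0 for nucleotides
    problems = []
    if not (isinstance(w, int) and w > 0):
        problems.append("W must be an integer greater than zero")
    if not 0.0 <= k1 <= k_max:
        problems.append(f"K1 must lie between 0 and {k_max:.3f} bits")
    if not 0.0 <= k2 <= k_max:
        problems.append(f"K2 must lie between 0 and {k_max:.3f} bits")
    if k2 <= k1:
        # Allowed, but such a K2 does not extend triggered windows.
        problems.append("K2 is not greater than K1, so it has no extending effect")
    return problems
```

check_parameters(12, 2.2, 2.5) returns an empty list for the amino
acid defaults, while the same values with alphabet_size=4 are flagged
because they exceed the 2-bit maximum for nucleotide sequences.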

The following options may be placed in any order in the command
line after the W, K1 and K2 parameters:

-a Output both low-complexity and high-complexity segments in a
FASTA-formatted file, as a set of separate entries with header
lines.

-c [characters-per-line] Number of sequence characters per line of
output [Default 60]. Other characters, such as residue numbers,

-h Output only the high-complexity segments in a FASTA-formatted
file, as a set of separate entries with header lines.

-l Output only the low-complexity segments in a FASTA-formatted
file, as a set of separate entries with header lines.

-m [length] Minimum length in residues for a high-complexity
segment [default 0]. Shorter segments are merged with adjacent
low-complexity segments.

-o Show all overlapping, independently-triggered low-complexity
segments [these are merged by default].

-q Produce an output format with the sequence in a numbered block
with markings to assist residue counting. The low-complexity and
high-complexity segments are in lower- and upper-case characters
respectively.

-t [length] "Maximum trim length" parameter [default 100]. This
controls the search space (and search time) during the
optimization of raw segments (see ALGORITHM above). By default,
subsequences 100 or more residues shorter than the raw segment are
omitted from the search. This parameter may be increased to give
a more extensive search if raw segments are longer than 100 residues.

-x The masking option for amino acid sequences. Each input
sequence is represented by a single output sequence in FASTA format
with low-complexity regions replaced by strings of "x" characters.
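
The effect of -x on a single sequence can be sketched in Python as
follows. This is an illustration of the masking step only, assuming
the low-complexity segments are already known as 1-based, inclusive
residue ranges; mask_low_complexity() is a hypothetical name and not
part of seg.

```python
def mask_low_complexity(seq, segments, mask_char="x"):
    """Replace residues inside each 1-based, inclusive (start, end) range."""
    residues = list(seq)
    for start, end in segments:
        # Clip the range to the sequence so partial input is handled.
        for pos in range(start - 1, min(end, len(residues))):
            residues[pos] = mask_char
    return "".join(residues)


# First 60 residues of the PRIO_HUMAN example used later in this page.
first_line = "MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP"
masked = mask_low_complexity(first_line, [(50, 94)])
```

Here masked keeps residues 1-49 unchanged and replaces residues 50-60
with "x" characters.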

EXAMPLES OF PARAMETER SETS
--------------------------

Default parameters are given by 'seg sequence' (equivalent to 'seg
sequence 12 2.2 2.5'). These parameters are appropriate for
low-complexity masking of many amino acid sequences [with -x option].

Database-database comparisons:
-----------------------------
More stringent (lower) complexity parameters are suitable when
masked sequences are compared with masked sequences. For example,
for BLAST or FASTA searches that compare two amino acid sequence
databases, the following masking may be applied to both databases:

    seg database 12 1.8 2.0 -x

Homopolymer analysis:
--------------------
To examine all homopolymeric subsequences of length (for example)

Non-globular regions of protein sequences:
-----------------------------------------
Many long non-globular domains may be diagnosed at longer window

    seg sequence 45 3.4 3.75

For some shorter non-globular domains, the following set is

    seg sequence 25 3.0 3.3

Nucleotide sequences:
--------------------
The maximum value of the complexity parameters is 2 (log[base 2]4).
For masking, the following is approximately equivalent in effect
to the default parameters for amino acid sequences:

    seg sequence.na 21 1.4 1.6

EXAMPLES
--------

The following is a file named 'prion' in FASTA format:

>PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP
HGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGA
VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
NITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV

gives the standard output below:

>PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR

                                1-49    MANLGCWMLVLFVATWSDLGLCKKRPKPGG
ppqggggwgqphgggwgqphgggwgqphgg  50-94
                                95-112  THSQWNKPSKPKTNMKHM
       agaaaagavvgglggymlgsams  113-135
                                136-187 RPIIHFGSDYEDRYYRENMHRYPNQVYYRP
                                        MDEYSNQNNFVHDCVNITIKQH
                tvttttkgenftet  188-201
                                202-236 DVKMMERVVEQMCITQYERESQAYYQRGSS
              sppvillisflifliv  237-252

The low-complexity sequences are on the left (lower case) and
high-complexity sequences are on the right (upper case). All
sequence segments read from left to right and their order in the
sequence is from top to bottom, as shown by the central column of

gives the following FASTA-formatted file:

>PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxTHSQWNKPSKPKTNMKHMxxxxxxxx
xxxxxxxxxxxxxxxRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
NITIKQHxxxxxxxxxxxxxxDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSxxxx

segn, blast, saps, xnu

John Wootton: wootton@ncbi.nlm.nih.gov
Scott Federhen: federhen@ncbi.nlm.nih.gov

National Center for Biotechnology Information
Building 38A, Room 8N805
National Library of Medicine
National Institutes of Health
Bethesda, Maryland, MD 20894

Wootton, J.C., Federhen, S. (1993) Statistics of local complexity
in amino acid sequences and sequence databases. Computers &
Chemistry 17: 149-163.

Wootton, J.C. (1994) Non-globular domains in protein sequences:
automated segmentation using complexity measures. Computers &
Chemistry 18: (in press).

Altschul, S.F., Boguski, M., Gish, W., Wootton, J.C. (1994) Issues
in searching molecular sequence databases. Nature Genetics 6:

Wootton, J.C. (1994) Simple sequences of protein and DNA. In:
Nucleic Acid and Protein Sequence Analysis: A Practical Approach.
(Second Edition, Chapter 8, Bishop, M.J. and Rawlings, C.R. Eds.
IRL Press, Oxford) (In press).