
   1
   2 DOCUMENTATION OF SEG (FROM 'MAN' PAGE)
   3 --------------------------------------
   4
   5
   6 NAME
   7 ----
   8      seg - segment sequence(s) by local complexity
   9
  10
  11 SYNOPSIS
  12 --------
  13      seg sequence [ W ] [ K(1) ] [ K(2) ] [ -x ] [ options ]
  14
  15
  16 DESCRIPTION
  17 -----------
  18
  19 seg divides sequences into contrasting segments of low complexity
  20 and high complexity.  Low-complexity segments defined by the
  21 algorithm represent "simple sequences" or "compositionally-biased
  22 regions".
  23
  24 Locally-optimized low-complexity segments are produced at defined
  25 levels of stringency, based on formal definitions of local
  26 compositional complexity (Wootton & Federhen, 1993).  The segment
  27 lengths and the number of segments per sequence are determined
  28 automatically by the algorithm.
  29
  30 The input is a FASTA-formatted sequence file, or a database file
  31 containing many FASTA-formatted sequences.  seg is tuned for amino
  32 acid sequences.  For nucleotide sequences, see EXAMPLES OF
  33 PARAMETER SETS below.
  34
  35 The stringency of the search for low-complexity segments is
  36 determined by three user-defined parameters, trigger window length
  37 [ W ], trigger complexity [ K(1) ] and extension complexity [ K(2) ]
  38 (written K1 and K2 under PARAMETERS below).  The defaults are suitable
  39 for low-complexity masking of database search query sequences [ -x
  40 option required, see OPTIONS below ].
  41
  42
  43 OUTPUTS AND APPLICATIONS
  44 ------------------------
  45
  46 (1) Readable segmented sequence [Default].  Regions of contrasting
  47 complexity are displayed in "tree format".  See EXAMPLES.
  48
  49 (2) Low-complexity masking (see Altschul et al., 1994).  Produce a
  50 masked FASTA-formatted file, ready for input as a query sequence for
  51 database search programs such as BLAST or FASTA.  The amino acids in
  52 low-complexity regions are replaced with "x" characters [-x option].
  53 See EXAMPLES.
  54
  55 (3) Database construction.  Produce FASTA-formatted files containing
  56 low-complexity segments [-l option], or high-complexity segments
  57 [-h option], or both [-a option].  Each segment is a separate
  58 sequence entry with an informative header line.
  59
  60
  61 ALGORITHM
  62 ---------
  63
  64 The SEG algorithm has two stages: first, identification of
  65 approximate raw segments of low complexity; second, local
  66 optimization.
  67
  68 At the first stage, the stringency and resolution of the search for
  69 low-complexity segments are determined by the W, K(1) and K(2)
  70 parameters.  All trigger windows, including overlapping windows, of
  71 length W and complexity less than or equal to K(1) are identified.
  72 "Complexity" here is defined by equation (3) of Wootton & Federhen
  73 (1993).  Each trigger window is then extended into a contig in both
  74 directions by merging with extension windows, which are overlapping
  75 windows of length W and complexity less than or equal to K(2).
  76 Each contig is a raw segment.
  77
  78 At the second stage, each raw segment is reduced to a single
  79 optimal low-complexity segment, which may be the entire raw
  80 segment but is usually a subsequence.  The optimal subsequence has
  81 the lowest value of the probability P(0) (equation (5) of Wootton
  82 & Federhen, 1993).
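
The first stage described above can be sketched in Python.  This is an
illustration only, not the seg source.  It assumes that the complexity
of a window is the Shannon entropy of its residue composition in bits,
and that K(2) is only effective when it is at least K(1).  The function
names window_entropy and raw_segments are hypothetical.  The
second-stage optimization by P(0) is not shown.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window.

    Assumption: this stands in for the complexity measure used by seg.
    """
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def raw_segments(seq, w=12, k1=2.2, k2=2.5):
    """Return raw segments as 0-based, half-open (start, end) residue spans.

    A window with entropy <= k1 triggers a segment.  The trigger is then
    extended in both directions through overlapping windows whose
    entropy is <= the extension threshold.
    """
    if len(seq) < w:
        return []
    # Entropy of every overlapping window of length w.
    ent = [window_entropy(seq[i:i + w]) for i in range(len(seq) - w + 1)]
    ext = max(k1, k2)  # assumption: K2 below K1 has no extending effect
    segments = []
    i = 0
    while i < len(ent):
        if ent[i] <= k1:
            lo = i
            while lo > 0 and ent[lo - 1] <= ext:
                lo -= 1
            hi = i
            while hi + 1 < len(ent) and ent[hi + 1] <= ext:
                hi += 1
            # The contig covers the first residue of window lo through
            # the last residue of window hi.
            segments.append((lo, hi + w))
            i = hi + 1
        else:
            i += 1
    return segments


if __name__ == "__main__":
    example = "MKVLDEFGHIRS" + "Q" * 10 + "TWYNPCMKVLDE"
    print(raw_segments(example, w=7, k1=0, k2=0))
```

Residue spans returned by this sketch may overlap when two contigs are
separated by fewer than W residues; how such overlaps are resolved is
left out here.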
  83
  84 PARAMETERS
  85 ----------
  86
  87 The three numeric parameters must be given in the following order
  88 after the sequence file name.
  89
  90 Trigger window length [ W ].  An integer greater than zero [ Default
  91 12 ].
  92
  93 Trigger complexity [ K1 ].  The maximum complexity of a trigger
  94 window in units of bits.  K1 must be equal to or greater than zero.
  95 The maximum value is 4.322 (log[base 2]20) for amino acid
  96 sequences [ Default 2.2 ].
  97
  98 Extension complexity [ K2 ].  The maximum complexity of an extension
  99 window in units of bits.  Only values greater than K1 are effective
 100 in extending triggered windows.  Range of possible values is as for
 101 K1 [ Default 2.5 ].
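
The ranges stated above can be checked with a few lines of Python.
This is a sketch of the documented constraints only; the function
check_parameters and its alphabet_size argument are hypothetical and
are not part of seg.

```python
import math


def check_parameters(w, k1, k2, alphabet_size=20):
    """Return a list of notes about W, K1 and K2 (empty if none apply).

    alphabet_size is 20 for amino acids and 4 for nucleotides, giving
    a maximum complexity of log2(alphabet_size) bits.
    """
    max_k = math.log2(alphabet_size)
    notes = []
    if not (isinstance(w, int) and w > 0):
        notes.append("W must be an integer greater than zero")
    if not 0 <= k1 <= max_k:
        notes.append(f"K1 must lie between 0 and {max_k:.3f}")
    if not 0 <= k2 <= max_k:
        notes.append(f"K2 must lie between 0 and {max_k:.3f}")
    if k2 <= k1:
        notes.append("K2 is not greater than K1: triggers are not extended")
    return notes


if __name__ == "__main__":
    print(round(math.log2(20), 3))           # maximum for amino acids
    print(check_parameters(12, 2.2, 2.5))    # the documented defaults
```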
 102
 103
 104 OPTIONS
 105 -------
 106
 107 The following options may be placed in any order in the command
 108 line after the W, K1 and K2 parameters:
 109
 110 -a  Output both low-complexity and high-complexity segments in a
 111     FASTA-formatted file, as a set of separate entries with header
 112     lines.
 113
 114 -c  [characters-per-line] Number of sequence characters per line of
 115     output [Default 60].  Other characters, such as residue numbers,
 116     are additional.
 117
 118 -h  Output only the high-complexity segments in a FASTA-formatted
 119     file, as a set of separate entries with header lines.
 120
 121 -l  Output only the low-complexity segments in a FASTA-formatted
 122     file, as a set of separate entries with header lines.
 123
 124 -m  [length] Minimum length in residues for a high-complexity
 125     segment [default 0].  Shorter segments are merged with adjacent
 126     low-complexity segments.
 127
 128 -o  Show all overlapping, independently-triggered low-complexity
 129     segments [these are merged by default].
 130
 131 -q  Produce an output format with the sequence in a numbered block
 132     with markings to assist residue counting.  The low-complexity and
 133     high-complexity segments are in lower- and upper-case characters
 134     respectively.
 135
 136 -t  [length] "Maximum trim length" parameter [default 100]. This
 137     controls the search space (and search time) during the
 138     optimization of raw segments (see ALGORITHM above).  By default,
 139     subsequences 100 or more residues shorter than the raw segment are
 140     omitted from the search. This parameter may be increased to give
 141     a more extensive search if raw segments are longer than 100 residues.
 142
 143 -x  The masking option for amino acid sequences.  Each input
 144     sequence is represented by a single output sequence in FASTA-format
 145     with low-complexity regions replaced by strings of "x" characters.
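
The merging behaviour described for the -m option above can be
sketched as follows.  This is an assumption-laden illustration, not the
seg implementation: the function merge_short_high is hypothetical, and
treating a short high-complexity segment at either end of the sequence
the same way as an internal one is a guess.

```python
def merge_short_high(low_ranges, seq_len, min_high=0):
    """Merge high-complexity gaps shorter than min_high into low ranges.

    low_ranges: sorted, non-overlapping, 1-based inclusive (start, end).
    Returns a new list of 1-based inclusive low-complexity ranges.
    """
    if not low_ranges:
        return []
    merged = [list(low_ranges[0])]
    # Leading high-complexity segment (assumption: handled like any other).
    if merged[0][0] - 1 < min_high:
        merged[0][0] = 1
    for start, end in low_ranges[1:]:
        gap = start - merged[-1][1] - 1  # residues between two low ranges
        if gap < min_high:
            merged[-1][1] = end
        else:
            merged.append([start, end])
    # Trailing high-complexity segment.
    if seq_len - merged[-1][1] < min_high:
        merged[-1][1] = seq_len
    return [tuple(r) for r in merged]


if __name__ == "__main__":
    low = [(50, 94), (113, 135), (188, 201), (237, 252)]
    print(merge_short_high(low, 253, min_high=2))
```

With the default of 0 no gap is shorter than the minimum, so the input
ranges are returned unchanged.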
 146
 147
 148 EXAMPLES OF PARAMETER SETS
 149 --------------------------
 150
 151 Default parameters are given by 'seg sequence' (equivalent to 'seg
 152 sequence 12 2.2 2.5').  These parameters are appropriate for
 153 low-complexity masking of many amino acid sequences [with -x option].
 154
 155 Database-database comparisons:
 156 -----------------------------
 157 More stringent (lower) complexity parameters are suitable when
 158 masked sequences are compared with masked sequences.  For example,
 159 for BLAST or FASTA searches that compare two amino acid sequence
 160 databases, the following masking may be applied to both databases:
 161
 162   seg database 12 1.8 2.0 -x
 163
 164 Homopolymer analysis:
 165 --------------------
 166 To examine all homopolymeric subsequences of length (for example)
 167 7 or greater:
 168
 169   seg sequence 7 0 0
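
A complexity of zero bits means that a window contains a single
residue type, so this parameter set selects homopolymeric runs.  The
following Python sketch finds such runs directly, for comparison; it is
not how seg works internally, and the function homopolymer_runs is
hypothetical.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=7):
    """Return (start, end, residue) for each run of min_len or more
    identical residues, using 1-based inclusive positions."""
    runs = []
    pos = 0
    for residue, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos + 1, pos + length, residue))
        pos += length
    return runs


if __name__ == "__main__":
    print(homopolymer_runs("MK" + "Q" * 8 + "LAAAG"))
```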
 170
 171 Non-globular regions of protein sequences:
 172 -----------------------------------------
 173 Many long non-globular domains may be diagnosed at longer window
 174 lengths, typically:
 175
 176   seg sequence 45 3.4 3.75
 177
 178 For some shorter non-globular domains, the following set is
 179 appropriate:
 180
 181   seg sequence 25 3.0 3.3
 182
 183 Nucleotide sequences:
 184 --------------------
 185 The maximum value of the complexity parameters is 2 (log[base 2]4).
 186 For masking, the following is approximately equivalent in effect
 187 to the default parameters for amino acid sequences:
 188
 189   seg sequence.na 21 1.4 1.6
 190
 191 EXAMPLES
     --------

 192 The following is a file named 'prion' in FASTA format:
 193
 194 >PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
 195 MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP
 196 HGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGA
 197 VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
 198 NITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV
 199 ILLISFLIFLIVG
 200
 201 The command line:
 202
 203   seg prion
 204
 205 gives the standard output below:
 206
 207
 208 >PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
 209
 210                                   1-49   MANLGCWMLVLFVATWSDLGLCKKRPKPGG
 211                                          WNTGGSRYPGQGSPGGNRY
 212 ppqggggwgqphgggwgqphgggwgqphgg   50-94
 213                gwgqphgggwgqggg
 214                                  95-112  THSQWNKPSKPKTNMKHM
 215        agaaaagavvgglggymlgsams  113-135
 216                                 136-187  RPIIHFGSDYEDRYYRENMHRYPNQVYYRP
 217                                          MDEYSNQNNFVHDCVNITIKQH
 218                 tvttttkgenftet  188-201
 219                                 202-236  DVKMMERVVEQMCITQYERESQAYYQRGSS
 220                                          MVLFS
 221               sppvillisflifliv  237-252
 222                                 253-253  G
 223
 224 The low-complexity sequences are on the left (lower case) and
 225 high-complexity sequences are on the right (upper case).  All
 226 sequence segments read from left to right and their order in the
 227 sequence is from top to bottom, as shown by the central column of
 228 residue numbers.
 229
 230 The command line:
 231
 232   seg prion -x
 233
 234 gives the following FASTA-formatted file:
 235
 236 >PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
 237 MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYxxxxxxxxxxx
 238 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxTHSQWNKPSKPKTNMKHMxxxxxxxx
 239 xxxxxxxxxxxxxxxRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
 240 NITIKQHxxxxxxxxxxxxxxDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSxxxx
 241 xxxxxxxxxxxxG
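
The relationship between the two outputs above can be reproduced with
a short Python sketch: replacing the residue ranges shown in lower case
in the tree format with "x" gives the masked sequence.  The ranges are
copied from the tree output above; the functions mask_segments and wrap
are hypothetical helpers, not part of seg.

```python
def mask_segments(seq, low_ranges, mask_char="x"):
    """Replace each 1-based inclusive (start, end) range with mask_char."""
    residues = list(seq)
    for start, end in low_ranges:
        for pos in range(start - 1, end):
            residues[pos] = mask_char
    return "".join(residues)


def wrap(seq, width=60):
    """Split a sequence into lines of at most width characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


# The 'prion' sequence from the example file.
PRION = (
    "MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP"
    "HGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGA"
    "VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV"
    "NITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV"
    "ILLISFLIFLIVG"
)

# Low-complexity ranges read from the central column of the tree output.
LOW_COMPLEXITY = [(50, 94), (113, 135), (188, 201), (237, 252)]

masked = mask_segments(PRION, LOW_COMPLEXITY)

if __name__ == "__main__":
    print(">PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR")
    for line in wrap(masked):
        print(line)
```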
 242
 243
 244
 245 SEE ALSO
 246 --------
 247
 248 segn, blast, saps, xnu
 249
 250
 251 AUTHORS
 252 -------
 253
 254 John Wootton:     wootton@ncbi.nlm.nih.gov
 255 Scott Federhen:   federhen@ncbi.nlm.nih.gov
 256
 257 National Center for Biotechnology Information
 258 Building 38A, Room 8N805
 259 National Library of Medicine
 260 National Institutes of Health
 261 Bethesda, MD 20894
 262 U.S.A.
 263
 264
 265 PRIMARY REFERENCE
 266 -----------------
 267
 268 Wootton, J.C., Federhen, S. (1993)  Statistics of local complexity
 269 in amino acid sequences and sequence databases.  Computers &
 270 Chemistry 17: 149-163.
 271
 272
 273 OTHER REFERENCES
 274 ----------------
 275
 276 Wootton, J.C. (1994)  Non-globular domains in protein sequences:
 277 automated segmentation using complexity measures.  Computers &
 278 Chemistry 18: (in press).
 279
 280 Altschul, S.F., Boguski, M., Gish, W., Wootton, J.C. (1994)  Issues
 281 in searching molecular sequence databases.  Nature Genetics 6:
 282 119-129.
 283
 284 Wootton, J.C. (1994)  Simple sequences of protein and DNA. In:
 285 Nucleic Acid and Protein Sequence Analysis: A Practical Approach.
 286 (Second Edition, Chapter 8, Bishop, M.J. and Rawlings, C.R. Eds.
 287 IRL Press, Oxford) (In press).
 288
 289