sources/oc/oc_manual.txt

   1
   2 OC - A Cluster Analysis Program
   3 -------------------------------
   4
   5 Usage notes - 2/April/1997
   6 --------------------------
   7
   8 Release 2.0 - 4 August 2002
   9 --------------------------
  10
  11 Release 2.1a (Version 2.1a) - 11 April 2005
  12 -----------------------------
  13
  14 Author: Geoffrey  J. Barton
  15
  16 email:  geoff@compbio.dundee.ac.uk
  17
  18
  19 Notes
  20 -----
  21
  22 If you use OC in your work, please remember to cite it.
  23
  24 If you modify it please send me a copy of your changes.  I will try to
  25 incorporate them into a common source.  However, if you plan any major
  26 changes, please contact me first.  It may be that I have already done
  27 what you need.
  28
  29
  30 Release notes
  31 -------------
  32
  33 Release 2.1a - Make version number consistent with Release number.
  34 Remove some regex code that fails on SunOS.
  35
  36 Release 2.1 (Version 1.3): The postscript output options in OC were
  37 never intended to be very pretty, but we needed to be able to print
  38 out large trees in postscript and so some small changes were made to
  39 help with this.  Firstly, ps output is now encapsulated postscript
  40 (eps) rather than simple ps.  Secondly, the minside and maxside
  41 command line options have been documented.  All postscript output is
  42 PORTRAIT and minside is the narrow side of the page and maxside the
  43 long side (in inches).  For example, define the (minside 11 maxside
  44 52), run OC then run the utility epssplit from
  45 http://home.clara.net/nox/software/epssplit.
  46
  47 Release 2.0 (Version 1.0): The main oc program is unchanged, but two
  48 new callable versions are in the distributions that start with 'noc'.
  49 Functions gjnoc.c and gjnoc2.c are callable from C and return data
  50 structures with the information on the clusters formed.  Of the two,
  51 gjnoc2 is probably most useful.  gjnoc was written to feed other
  52 programs we distribute (notably STAMP).
  53
  54 Introduction
  55 ------------
  56
  57 OC is a general purpose hierarchical cluster analysis program.  It was
  58 written to have NO software defined limits.  The maximum number of
  59 entities that you can cluster with OC is only limited by available
  60 memory (and time!!).  OC was written in 1993 to cluster large families
  61 of protein sequences.  Though developed for this purpose, there is
  62 nothing specific to proteins in its implementation, so it can be used
  63 to cluster any sort of data.
  64
  65 It has been used extensively in my group most notably in the 3Dee
  66 domains database (http://www.compbio.dundee.ac.uk) and also in the STAMP
  67 structure comparison package.  Over the years I have given the program
  68 to a few people but never got around to writing up any instructions or
  69 packaging it as a proper distribution.  These notes will hopefully
  70 make the program a little easier for others to use and understand.
  71
  72 OC requires a complete upper diagonal distance or similarity matrix as
  73 input and implements three simple clustering methods:
  74
  75 1. single linkage.
  76 2. complete linkage.
  77 3. means linkage - UPGMA.
  78
  79 These are explained further below.
  80
  81 Output is a list of the clusters as they form and optionally a
  82 dendrogram drawn in PostScript.  The dendrogram drawing functions are
  83 fairly primitive.  The program will also, optionally, output all
  84 clusters above some threshold.  See further details below.
  85
  86 Input file
  87 ----------
  88
  89 First line: An integer showing the number (N) of entities you are
  90 clustering.  Lines 2 to N+1: Labels for the entities.  These should be
  91 strings, without spaces.  Following N*(N-1)/2 lines: The upper
  92 diagonal matrix (free format).
  93
  94 For example, see file test.dis.
  95
  96 7
  97 First
  98 Second
  99 Third
 100 Fourth
 101 Fifth
 102 Sixth
 103 Seventh
 104 100.0
 105 100.0
 106 50.0
 107 33.0
 108 25.0
 109 20.0
 110 100.0
 111 50.0
 112 50.0
 113 33.0
 114 33.0
 115 100.0
 116 33.0
 117 20.0
 118 25.0
 119 33.0
 120 20.0
 121 25.0
 122 100.0
 123 100.0
 124 100.0
 125
 126 By upper diagonal, I mean that the numbers are the similarities (or distances)
 127 from comparison of entity 1 v 2, 1 v 3, 1 v 4, 1 v 5, 1 v 6, 1 v 7,
 128                                  2 v 3, 2 v 4, 2 v 5, 2 v 6, 2 v 7,
 129                                         3 v 4, 3 v 5, 3 v 6, 3 v 7,
 130                                                4 v 5, 4 v 6, 4 v 7,
 131                                                       5 v 6, 5 v 7,
 132                                                              6 v 7.
 133 In that order.
 134
 135 Running the program
 136 -------------------
 137
 138 The program expects the data file to be on standard input, thus you type:
 139
 140 oc  [arguments]  < test.dis > results.ocout
 141
 142 The first argument is one of 'sim' or 'dis' and defines whether the
 143 data are similarities, i.e. bigger numbers mean that the entities are
 144 more like each other, or distances, i.e. smaller numbers mean the
 145 entities are more like each other.
 146
 147 The second argument defines the clustering type, in explaining this I
 148 will assume that we are clustering on distances, not similarities.
 149
 150 Hierarchical clustering works by first finding the two entites that
 151 have the minimum distance between them.  Having joined those two
 152 entities into a cluster, the method then searches for the minimum
 153 distance between two entites, but taking those entities that are
 154 already clustered as a single unit. This process is repeated until
 155 there are no more entities to cluster.  What differs in the different
 156 clustering methods is how the 'distance' is calculated between two
 157 clusters that have already formed.
 158
 159 In 'single' linkage, the MINIMUM distance between the two clusters is
 160 taken as the distance.  This is the simplest cluster analysis method.
 161 It has the disadvantage that an outlier can cause two groups of
 162 entities to be clustered when most of the entities are really quite
 163 distant.  Single linkage tends to form a few big clusters early on in
 164 the hierarchy.
 165
 166 In 'complete' linkage, the MAXIMUM distance between the clusters is
 167 taken.  This has the advantage over single linkage in that  within a
 168 cluster, all pairs of entities will be within the distance at which
 169 the cluster was formed.  Complete linkage tends to form many smaller
 170 clusters.  The dendrograms have more structure to them and are often
 171 more informative than single linkage dendrograms.
 172
 173 In 'means' linkage as the name suggests, the mean distance between
 174 clusters is calculated.  This can be a good compromise between the
 175 extremes of single and complete linkage, but the distances at which
 176 clusters are formed are averages, not real distances.  As a
 177 consequence, the clusters can be more difficult to interpret.
 178
 179 There are many other clustering methods and most should be simple to
 180 add as options to oc.
 181
 182
 183 Examples
 184 --------
 185
 186 <orac:78>oc dis complete < test.dis
 187 Reading Upper Diagonal
 188 Read: 7 Entries
 189 CPU time: 0.000000 seconds
 190 Setting up unclust
 191 Setting up notparent
 192 Setting up clust
 193 CPU time: 0.000000 seconds
 194 Complete linkage on distance
 195 Doing Cluster Analysis...
 196 ## 0 20 2
 197 Entity Score: 20  Number of members: 2
 198  0 6
 199 ## 1 20 2
 200 Entity Score: 20  Number of members: 2
 201  2 5
 202 ## 2 33 2
 203 Entity Score: 33  Number of members: 2
 204  3 4
 205 ## 3 50 3
 206 Entity Score: 50  Number of members: 3
 207  1 3 4
 208 ## 4 100 5
 209 Entity Score: 100  Number of members: 5
 210  2 5 1 3 4
 211 ## 5 100 7
 212 Entity Score: 100  Number of members: 7
 213  0 6 2 5 1 3 4
 214 Total CPU time: 0.010000 seconds
 215
 216
 217 Complete linkage on the test.dis file.  There are 7 entities in the
 218 file.  They are numbered by the program 0 to 6.  As each cluster is
 219 formed, the program writes a double hash (##), then the cluster
 220 number, the score at which the cluster is formed and the number of
 221 entites in the cluster.  This is followed by a line, repeating the
 222 entity score and number of members in the cluster.  Finally, a line is
 223 written with the numbers of the entities in the cluster.
 224
 225
 226 For example:
 227
 228 ## 3 50 3
 229 Entity Score: 50  Number of members: 3
 230  1 3 4
 231
 232 means cluster number 3 was formed at a score (distance) of 50 and has
 233 three members, the entities 1,3 and 4.
 234
 235
 236 Since entity numbers are simply the order in which the data was read
 237 in from the file, it is usually better to refer to the entities by
 238 their labels.  To do this, just add the 'id' option to the command
 239 line.
 240
 241
 242 <orac:79>oc dis complete id < test.dis
 243 Reading Upper Diagonal
 244 Read: 7 Entries
 245 CPU time: 0.000000 seconds
 246 Setting up unclust
 247 Setting up notparent
 248 Setting up clust
 249 CPU time: 0.000000 seconds
 250 Complete linkage on distance
 251 Doing Cluster Analysis...
 252 ## 0 20 2
 253  First Seventh
 254 ## 1 20 2
 255  Third Sixth
 256 ## 2 33 2
 257  Fourth Fifth
 258 ## 3 50 3
 259  Second Fourth Fifth
 260 ## 4 100 5
 261  Third Sixth Second Fourth Fifth
 262 ## 5 100 7
 263  First Seventh Third Sixth Second Fourth Fifth
 264 Total CPU time: 0.000000 seconds
 265
 266 The output is similar to above, but the entities are labelled rather
 267 than numbered.
 268
 269 You can obtain an encapsulated PostScript dendrogram of the clustering by adding the
 270 argument ps <filename>.  For example:
 271
 272
 273 <orac:81>oc dis complete id ps test < test.dis
 274 Reading Upper Diagonal
 275 Read: 7 Entries
 276 CPU time: 0.000000 seconds
 277 Setting up unclust
 278 Setting up notparent
 279 Setting up clust
 280 CPU time: 0.000000 seconds
 281 Complete linkage on distance
 282 Doing Cluster Analysis...
 283 ## 0 20 2
 284  First Seventh
 285 ## 1 20 2
 286  Third Sixth
 287 ## 2 33 2
 288  Fourth Fifth
 289 ## 3 50 3
 290  Second Fourth Fifth
 291 ## 4 100 5
 292  Third Sixth Second Fourth Fifth
 293 ## 5 100 7
 294  First Seventh Third Sixth Second Fourth Fifth
 295 Total CPU time: 0.010000 seconds
 296
 297 This will create a file called test.eps which can be printed on a
 298 PostScript printer, or viewed with GhostView or some other PostScript
 299 viewer.  The dendrogram is scaled to fit on a single A4 page which is
 300 a little inconvenient if you have thousands of entities.  You can
 301 specify different dimensions for the plot by providing the "minside"
 302 and "maxside" commands to OC.  For example, to make a plot that is 11
 303 inches wide (the long side of A4) by 40 inches high do:
 304
 305 oc dis complete id ps test minside 11 maxside 40 < test.dis
 306
 307 The test.eps file that results will need some further processing with
 308 a utility to tile the postscript over multiple pages.  You could do
 309 this in Adobe Illustrator (but my experience is that this is a touch
 310 slow) it is better to use the "epssplit" utility from:
 311
 312 http://home.clara.net/nox/software/epssplit.  See the instructions
 313 there on how to use the epssplit program for this.
 314
 315 Adding the 'cut' option allows OC to output only clusters that form
 316 above that threshold.
 317
 318 For example,
 319
 320 <orac:84>oc dis complete id cut 20 < test.dis
 321 Reading Upper Diagonal
 322 Read: 7 Entries
 323 CPU time: 0.000000 seconds
 324 Setting up unclust
 325 Setting up notparent
 326 Setting up clust
 327 CPU time: 0.000000 seconds
 328 Complete linkage on distance
 329 Doing Cluster Analysis...
 330 ## 1 20 2
 331  Third Sixth
 332 ## 0 20 2
 333  First Seventh
 334
 335 UNCLUSTERED ENTITIES
 336 Second
 337 Fourth
 338 Fifth
 339 Total CPU time: 0.000000 seconds
 340
 341
 342 ... only outputs clusters that are formed above the threshold of 20 in
 343 this case.  Entities that are not clustered with any other entity are
 344 listed as UNCLUSTERED ENTITIES.  Note: You could get the same
 345 information from the standard OC output file with a little post
 346 processing.
 347
 348 Note:  if you use the gjnoc2 function, then the unclustered entities are added
 349 to the returned structure as if they were clusters of size 1.  The score for
 350 forming these clusters is reported as the 'cut' score.
 351
 352
 353 Other options:
 354 --------------
 355
 356 'log'           Takes logs before clustering.  This is not recommended -
 357                 I've not put any traps in for illegal numbersa.
 358 'timeclus'      Outputs the time to form each cluster.
 359 'amps'          Outputs .tree and .tord files for use
 360                 with the AMPS multiple alignment package.  oc simply replaces the
 361                 more limited program ORDER.
 362 'minside <dim>' Specify the length of the short (horizontal) side for the
 363                 postscript plot (in inches).
 364 'maxside <dim>' Specify the length of the long (vertical) side for the
 365                 postscript plot (in inches).
 366
 367
 368
 369 Installation Notes
 370 ------------------
 371
 372 OC is written in ANSI C and has four source code files.
 373
 374 oc.c            The oc program.
 375 oc.h            The oc header file.
 376 gjutil.c        Utility routines.
 377 gjutil.h        Header for the utility routines.
 378
 379 To compile under Linux using GCC do:
 380
 381 gcc -o oc -O3  oc.c gjutil.c gjtimes.c -I./gjutil -I./ -lm
 382
 383 For other operating systems and/or compilers you need to ensure that the compiler
 384 searches the current directory for header files and that you link the
 385 maths library.
 386
 387 Put the oc executable somewhere on your path.  Typing oc gets you:
 388
 389
 390 <orac:75>oc
 391 Cluster analysis program
 392
 393 Author:  Geoffrey J. Barton
 394 Copyright:  Geoffrey J. Barton (1993, 1997)
 395
 396 Please cite: OC a cluster analysis program, Barton, G. J.  1993
 397
 398 Usage: oc <sim/dis> <single/complete/means> <ps> <cut N>
 399
 400 Version 1.0 - Requires a file to be piped to standard input
 401 Format:  Line   1:  Number (N) of entities to cluster (e.g. 10)
 402 Format:  Lines 2 to 2+N-1:  Identifier codes for the entities (e.g. Entity1)
 403 Format:  N*(N-1)/2:  Distances, or similarities - ie the upper diagonal
 404
 405 Options:
 406 sim = similarity /  dis = distances
 407 method = single/complete/means
 408 ps <file> = plot out dendrogram to <file.ps>
 409 log = take logs before calculation
 410 cut = only show clusters above/below the cutoff
 411 id = output identifier codes rather than indexes for entities
 412 timeclus = output times to generate each cluster
 413 amps <file> = produce amps <file>.tree and <file>.tord files
 414
 415 See README file for brief instructions
 416
 417 The gjnoc and gjnoc2 functions come with two test programs to check that all is
 418 well.  There are also files that show the compile line needed to make these
 419 test programs.
 420
 421
 422 --------------------------------------------------------------------------------
 423 End of OC usage notes.
 424