2 OC - A Cluster Analysis Program
3 -------------------------------
5 Usage notes - 2/April/1997
6 --------------------------
8 Release 2.0 - 4 August 2002
9 --------------------------
11 Release 2.1a (Version 2.1a) - 11 April 2005
12 -----------------------------
14 Author: Geoffrey J. Barton
16 email: geoff@compbio.dundee.ac.uk
22 If you use OC in your work, please remember to cite it.
24 If you modify it please send me a copy of your changes. I will try to
25 incorporate them into a common source. However, if you plan any major
26 changes, please contact me first. It may be that I have already done
33 Release 2.1a - Make version number consistent with Release number.
34 Remove some regex code that fails on SunOS.
36 Release 2.1 (Version 1.3): The postscript output options in OC were
37 never intended to be very pretty, but we needed to be able to print
38 out large trees in postscript and so some small changes were made to
39 help with this. Firstly, ps output is now encapsulated postscript
40 (eps) rather than simple ps. Secondly, the minside and maxside
41 command line options have been documented. All postscript output is
42 PORTRAIT and minside is the narrow side of the page and maxside the
43 long side (in inches). For example, define the (minside 11 maxside
44 52), run OC then run the utility epssplit from
45 http://home.clara.net/nox/software/epssplit.
47 Release 2.0 (Version 1.0): The main oc program is unchanged, but two
48 new callable versions are in the distributions that start with 'noc'.
49 Functions gjnoc.c and gjnoc2.c are callable from C and return data
50 structures with the information on the clusters formed. Of the two,
51 gjnoc2 is probably most useful. gjnoc was written to feed other
52 programs we distribute (notably STAMP).
57 OC is a general purpose hierarchical cluster analysis program. It was
58 written to have NO software defined limits. The maximum number of
59 entities that you can cluster with OC is only limited by available
60 memory (and time!!). OC was written in 1993 to cluster large families
61 of protein sequences. Though developed for this purpose, there is
62 nothing specific to proteins in its implementation, so it can be used
63 to cluster any sort of data.
65 It has been used extensively in my group most notably in the 3Dee
66 domains database (http://www.compbio.dundee.ac.uk) and also in the STAMP
67 structure comparison package. Over the years I have given the program
68 to a few people but never got around to writing up any instructions or
69 packaging it as a proper distribution. These notes will hopefully
70 make the program a little easier for others to use and understand.
72 OC requires a complete upper diagonal distance or similarity matrix as
73 input and implements three simple clustering methods:
77 3. means linkage - UPGMA.
79 These are explained further below.
81 Output is a list of the clusters as they form and optionally a
82 dendrogram drawn in PostScript. The dendrogram drawing functions are
83 fairly primitive. The program will also, optionally, output all
84 clusters above some threshold. See further details below.
89 First line: An integer showing the number (N) of entities you are
90 clustering. Lines 2 to N+1: Labels for the entities. These should be
91 strings, without spaces. Following N*(N-1)/2 lines: The upper
92 diagonal matrix (free format).
94 For example, see file test.dis.
126 By upper diagonal, I mean that the numbers are the similarities (or distances)
127 from comparison of entity 1 v 2, 1 v 3, 1 v 4, 1 v 5, 1 v 6, 1 v 7,
128 2 v 3, 2 v 4, 2 v 5, 2 v 6, 2 v 7,
129 3 v 4, 3 v 5, 3 v 6, 3 v 7,
138 The program expects the data file to be on standard input, thus you type:
140 oc [arguments] < test.dis > results.ocout
142 The first argument is one of 'sim' or 'dis' and defines whether the
143 data are similarities, i.e. bigger numbers mean that the entities are
144 more like each other, or distances, i.e. smaller numbers mean the
145 entities are more like each other.
147 The second argument defines the clustering type, in explaining this I
148 will assume that we are clustering on distances, not similarities.
150 Hierarchical clustering works by first finding the two entites that
151 have the minimum distance between them. Having joined those two
152 entities into a cluster, the method then searches for the minimum
153 distance between two entites, but taking those entities that are
154 already clustered as a single unit. This process is repeated until
155 there are no more entities to cluster. What differs in the different
156 clustering methods is how the 'distance' is calculated between two
157 clusters that have already formed.
159 In 'single' linkage, the MINIMUM distance between the two clusters is
160 taken as the distance. This is the simplest cluster analysis method.
161 It has the disadvantage that an outlier can cause two groups of
162 entities to be clustered when most of the entities are really quite
163 distant. Single linkage tends to form a few big clusters early on in
166 In 'complete' linkage, the MAXIMUM distance between the clusters is
167 taken. This has the advantage over single linkage in that within a
168 cluster, all pairs of entities will be within the distance at which
169 the cluster was formed. Complete linkage tends to form many smaller
170 clusters. The dendrograms have more structure to them and are often
171 more informative than single linkage dendrograms.
173 In 'means' linkage as the name suggests, the mean distance between
174 clusters is calculated. This can be a good compromise between the
175 extremes of single and complete linkage, but the distances at which
176 clusters are formed are averages, not real distances. As a
177 consequence, the clusters can be more difficult to interpret.
179 There are many other clustering methods and most should be simple to
180 add as options to oc.
186 <orac:78>oc dis complete < test.dis
187 Reading Upper Diagonal
189 CPU time: 0.000000 seconds
193 CPU time: 0.000000 seconds
194 Complete linkage on distance
195 Doing Cluster Analysis...
197 Entity Score: 20 Number of members: 2
200 Entity Score: 20 Number of members: 2
203 Entity Score: 33 Number of members: 2
206 Entity Score: 50 Number of members: 3
209 Entity Score: 100 Number of members: 5
212 Entity Score: 100 Number of members: 7
214 Total CPU time: 0.010000 seconds
217 Complete linkage on the test.dis file. There are 7 entities in the
218 file. They are numbered by the program 0 to 6. As each cluster is
219 formed, the program writes a double hash (##), then the cluster
220 number, the score at which the cluster is formed and the number of
221 entites in the cluster. This is followed by a line, repeating the
222 entity score and number of members in the cluster. Finally, a line is
223 written with the numbers of the entities in the cluster.
229 Entity Score: 50 Number of members: 3
232 means cluster number 3 was formed at a score (distance) of 50 and has
233 three members, the entities 1,3 and 4.
236 Since entity numbers are simply the order in which the data was read
237 in from the file, it is usually better to refer to the entities by
238 their labels. To do this, just add the 'id' option to the command
242 <orac:79>oc dis complete id < test.dis
243 Reading Upper Diagonal
245 CPU time: 0.000000 seconds
249 CPU time: 0.000000 seconds
250 Complete linkage on distance
251 Doing Cluster Analysis...
261 Third Sixth Second Fourth Fifth
263 First Seventh Third Sixth Second Fourth Fifth
264 Total CPU time: 0.000000 seconds
266 The output is similar to above, but the entities are labelled rather
269 You can obtain an encapsulated PostScript dendrogram of the clustering by adding the
270 argument ps <filename>. For example:
273 <orac:81>oc dis complete id ps test < test.dis
274 Reading Upper Diagonal
276 CPU time: 0.000000 seconds
280 CPU time: 0.000000 seconds
281 Complete linkage on distance
282 Doing Cluster Analysis...
292 Third Sixth Second Fourth Fifth
294 First Seventh Third Sixth Second Fourth Fifth
295 Total CPU time: 0.010000 seconds
297 This will create a file called test.eps which can be printed on a
298 PostScript printer, or viewed with GhostView or some other PostScript
299 viewer. The dendrogram is scaled to fit on a single A4 page which is
300 a little inconvenient if you have thousands of entities. You can
301 specify different dimensions for the plot by providing the "minside"
302 and "maxside" commands to OC. For example, to make a plot that is 11
303 inches wide (the long side of A4) by 40 inches high do:
305 oc dis complete id ps test minside 11 maxside 40 < test.dis
307 The test.eps file that results will need some further processing with
308 a utility to tile the postscript over multiple pages. You could do
309 this in Adobe Illustrator (but my experience is that this is a touch
310 slow) it is better to use the "epssplit" utility from:
312 http://home.clara.net/nox/software/epssplit. See the instructions
313 there on how to use the epssplit program for this.
315 Adding the 'cut' option allows OC to output only clusters that form
316 above that threshold.
320 <orac:84>oc dis complete id cut 20 < test.dis
321 Reading Upper Diagonal
323 CPU time: 0.000000 seconds
327 CPU time: 0.000000 seconds
328 Complete linkage on distance
329 Doing Cluster Analysis...
339 Total CPU time: 0.000000 seconds
342 ... only outputs clusters that are formed above the threshold of 20 in
343 this case. Entities that are not clustered with any other entity are
344 listed as UNCLUSTERED ENTITIES. Note: You could get the same
345 information from the standard OC output file with a little post
348 Note: if you use the gjnoc2 function, then the unclustered entities are added
349 to the returned structure as if they were clusters of size 1. The score for
350 forming these clusters is reported as the 'cut' score.
356 'log' Takes logs before clustering. This is not recommended -
357 I've not put any traps in for illegal numbersa.
358 'timeclus' Outputs the time to form each cluster.
359 'amps' Outputs .tree and .tord files for use
360 with the AMPS multiple alignment package. oc simply replaces the
361 more limited program ORDER.
362 'minside <dim>' Specify the length of the short (horizontal) side for the
363 postscript plot (in inches).
364 'maxside <dim>' Specify the length of the long (vertical) side for the
365 postscript plot (in inches).
372 OC is written in ANSI C and has four source code files.
375 oc.h The oc header file.
376 gjutil.c Utility routines.
377 gjutil.h Header for the utility routines.
379 To compile under Linux using GCC do:
381 gcc -o oc -O3 oc.c gjutil.c gjtimes.c -I./gjutil -I./ -lm
383 For other operating systems and/or compilers you need to ensure that the compiler
384 searches the current directory for header files and that you link the
387 Put the oc executable somewhere on your path. Typing oc gets you:
391 Cluster analysis program
393 Author: Geoffrey J. Barton
394 Copyright: Geoffrey J. Barton (1993, 1997)
396 Please cite: OC a cluster analysis program, Barton, G. J. 1993
398 Usage: oc <sim/dis> <single/complete/means> <ps> <cut N>
400 Version 1.0 - Requires a file to be piped to standard input
401 Format: Line 1: Number (N) of entities to cluster (e.g. 10)
402 Format: Lines 2 to 2+N-1: Identifier codes for the entities (e.g. Entity1)
403 Format: N*(N-1)/2: Distances, or similarities - ie the upper diagonal
406 sim = similarity / dis = distances
407 method = single/complete/means
408 ps <file> = plot out dendrogram to <file.ps>
409 log = take logs before calculation
410 cut = only show clusters above/below the cutoff
411 id = output identifier codes rather than indexes for entities
412 timeclus = output times to generate each cluster
413 amps <file> = produce amps <file>.tree and <file>.tord files
415 See README file for brief instructions
417 The gjnoc and gjnoc2 functions come with two test programs to check that all is
418 well. There are also files that show the compile line needed to make these
422 --------------------------------------------------------------------------------
423 End of OC usage notes.