Summer Scholarship Programme

	Summer Scholarship Programme Project Summary

[EPCC home] [SSP home] [2001 projects] [2000 projects] [1999 projects] [1998 projects] [1997 projects] [1996 projects] [1995 projects] [1994 projects] [1993 projects]

An Approach to Clustering in Multidimensional Datasets Using Genetic Algorithms

Student

Daniel Chen, Harvard University, USA

Supervisors

Harvey Richardson, Thinking Machines Corp.

David Henty, EPCC

The aim of the project is to draw together a range of techniques to approach the analysis of clustering in multidimensional data sets. The project is divided as follows:

Clustering:

The method is to pick a number of cluster centres and then divide the volume based on the cluster centres. The simplest method is by bisection - the second by choosing a boundary equidistant between closest cluster centres. A good set of artificial test examples in two dimensions is essential.

The optimal clustering is found by use of a GA. The gene encodes the cluster centres (and number of clusters) and the evaluation function is a combination of the sum of distances to all points in the local space and the number of data points contained in this cluster. Much of the work would include investigation of "good" evaluation functions. Other optimisation techniques could also be studied.

Visualisation:

Utilize multidimensional visualization techniques to visualise the clusters found. Either AVS or the "OpenGL" graphics library could be used. If produced early enough, this visualisation would allow a good way of investigating the efficacy of the GA. Many packages already exist to produce the Voronoi diagrams which result from this type of spatial subdivision.

[Diagram cut]

Data:

Initially the project should develop clustering algorithms on a number of well defined, though artificial test cases.
Once working in conjuntion with GA (or other optimisation techniques), the method would be applied to a social science data set: The HPCI Human Systems Consortium have large datasets of population densities. They are interested in using clustering to define "cities". This dataset would have the advantage of being 2D, with simple visualisation possible, and would act as the first real data test case.

Further work:

Develop efficient clustering algorithms given a set of centre points.
Investigate different evaluation functions paying particular attention to how an optimal number of clusters could be defined.
Investigate different optimisation techniques e.g. steepest descent methods etc. and compare with the GA approach.
Test on a multidimensional dataset. A more demanding test would be to study data on a range of jobs from an HPC system, using the clustering algorithm on a multidimensional space. Such a data set possibly may be available from the T3D at Edinburgh. Each job will have a number of attributes:

User affiliation: EPCC, Edinburgh, SERC ...
Project age: months
Job submission date
Job memory requested: MB (maybe power of 2 scale)
Job cpu requested: minutes
Job memory used: MB (maybe power of 2 scale)
Job cpu used: minutes
PE number: npes (log scale)

Notes:

It is expected that data will be coded so as not no identify individuals
Data and results may be sensitive and their may be legal implications in the use of such data.

The final report for this project is available here.

Webpage maintained by mario@epcc.ed.ac.uk