|
Summer Scholarship Programme Project Summary
|
|
An Approach to Clustering in Multidimensional Datasets Using Genetic
Algorithms
Student
Daniel Chen, Harvard University, USA
Supervisors
Harvey Richardson, Thinking Machines Corp.
David Henty, EPCC
The aim of the project is to draw together a range of techniques to
approach the analysis of clustering in multidimensional data sets. The
project is divided as follows:
Clustering:
The method is to pick a number of cluster centres and then divide the
volume based on the cluster centres. The simplest method is by bisection
- the second by choosing a boundary equidistant between closest cluster
centres. A good set of artificial test examples in two dimensions is
essential.
The optimal clustering is found by use of a GA. The gene encodes the
cluster centres (and number of clusters) and the evaluation function is
a combination of the sum of distances to all points in the local space
and the number of data points contained in this cluster. Much of the
work would include investigation of "good" evaluation functions. Other
optimisation techniques could also be studied.
Visualisation:
Utilize multidimensional visualization techniques to visualise the
clusters found. Either AVS or the "OpenGL" graphics library could be
used. If produced early enough, this visualisation would allow a good
way of investigating the efficacy of the GA. Many packages already exist
to produce the Voronoi diagrams which result from this type of spatial
subdivision.
[Diagram cut]
Data:
- Initially the project should develop clustering algorithms on a
number of well defined, though artificial test cases.
- Once working in conjuntion with GA (or other optimisation
techniques), the method would be applied to a social science data set:
The HPCI Human Systems Consortium have large datasets of population
densities. They are interested in using clustering to define "cities".
This dataset would have the advantage of being 2D, with simple
visualisation possible, and would act as the first real data test
case.
Further work:
- Develop efficient clustering algorithms given a set of centre
points.
- Investigate different evaluation functions paying particular
attention to how an optimal number of clusters could be defined.
- Investigate different optimisation techniques e.g. steepest descent
methods etc. and compare with the GA approach.
- Test on a multidimensional dataset. A more demanding test would be
to study data on a range of jobs from an HPC system, using the
clustering algorithm on a multidimensional space. Such a data set
possibly may be available from the T3D at Edinburgh. Each job will have
a number of attributes:
- User affiliation: EPCC, Edinburgh, SERC ...
- Project age: months
- Job submission date
- Job memory requested: MB (maybe power of 2 scale)
- Job cpu requested: minutes
- Job memory used: MB (maybe power of 2 scale)
- Job cpu used: minutes
- PE number: npes (log scale)
Notes:
- It is expected that data will be coded so as not no identify
individuals
- Data and results may be sensitive and their may be legal implications
in the use of such data.
The final report for this project is available here.