Summer Scholarship Programme
Project Summary
[EPCC home] [SSP home] [2001 projects] [2000 projects] [1999 projects] [1998 projects] [1997 projects] [1996 projects] [1995 projects] [1994 projects] [1993 projects]

An Approach to Clustering in Multidimensional Datasets Using Genetic Algorithms

Student

Daniel Chen, Harvard University, USA

Supervisors

Harvey Richardson, Thinking Machines Corp.

David Henty, EPCC


The aim of the project is to draw together a range of techniques to approach the analysis of clustering in multidimensional data sets. The project is divided as follows:

Clustering:

The method is to pick a number of cluster centres and then divide the volume based on the cluster centres. The simplest method is by bisection - the second by choosing a boundary equidistant between closest cluster centres. A good set of artificial test examples in two dimensions is essential.

The optimal clustering is found by use of a GA. The gene encodes the cluster centres (and number of clusters) and the evaluation function is a combination of the sum of distances to all points in the local space and the number of data points contained in this cluster. Much of the work would include investigation of "good" evaluation functions. Other optimisation techniques could also be studied.

Visualisation:

Utilize multidimensional visualization techniques to visualise the clusters found. Either AVS or the "OpenGL" graphics library could be used. If produced early enough, this visualisation would allow a good way of investigating the efficacy of the GA. Many packages already exist to produce the Voronoi diagrams which result from this type of spatial subdivision.

[Diagram cut]

Data:

Further work: Notes:
  1. It is expected that data will be coded so as not no identify individuals
  2. Data and results may be sensitive and their may be legal implications in the use of such data.

The final report for this project is available here.
Webpage maintained by mario@epcc.ed.ac.uk