Finding Median Partitions Using Information-Theoretical-Based Genetic Algorithms
Dana Cristofor (Department of Computer Science, University of Massachusetts at Boston, USA)
Dan Simovici (Department of Computer Science, University of Massachusetts at Boston, USA)
Abstract: In a database with categorical attributes, each attribute defines a partition whose classes can be regarded as natural clusters of rows. In this paper we focus on finding a partition of the rows of a given database, that is as close as possible to the partitions associated to each attribute. We evaluate the closeness of two partitions by using a generalization of the classical conditional entropy. From this perspective, we wish to construct a partition (referred to as the median partition) such that the sum of the dissimilarities between this partition and all the partitions determined by the attributes of the database is minimal. Then, the problem of finding the median partition is an optimization problem, over the space of all partitions of the rows of the database, for which we give an approximative solution. To search more e#ciently the large space of possible partitions we use a genetic algorithm where the partitions are represented by chromosomes. Our genetic algorithm obtains better clustering results than the classical k-means algorithm.
1.) C. S. Calude, K. Salomaa, S. Yu (eds.). Advances and Trends in Automata and Formal Languages. A Collection of Papers in Honour of the 60th Birthday of Helmut Jürgensen.
Keywords: Gini index, Shannon entropy, categorical attributes, clustering, genetic algorithm, median partition, partitioning
Categories: E.4, H.2