Scalable model-based clustering by working on data summaries
Document Type
Conference paper
Source Publication
Proceedings - IEEE International Conference on Data Mining, ICDM
Publication Date
1-1-2003
First Page
91
Last Page
98
Abstract
The scalability problem in data mining concerns the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework. First, a large data set is summarized into sub-clusters. Then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly embodies the cardinality, mean, and covariance information of each sub-cluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy, and that they generate significantly better results than other model-based clustering systems using similar computational resources.
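The two-phase idea described in the abstract can be illustrated with a minimal sketch. This is not the EMADS algorithm from the paper. It is a simplified one-dimensional stand-in under stated assumptions: the summarization step is plain fixed-width grid binning, the E-step evaluates each component density only at the sub-cluster mean, and the function names `summarize` and `em_on_summaries`, the cell width, and the initialization rule are all hypothetical choices made for illustration.

```python
import math
import random


def summarize(data, width):
    """Phase 1 (assumed grid binning): reduce 1-D points to one
    (count, mean, variance) triple per non-empty cell."""
    cells = {}
    for x in data:
        cells.setdefault(math.floor(x / width), []).append(x)
    summaries = []
    for pts in cells.values():
        n = len(pts)
        mean = sum(pts) / n
        var = sum((p - mean) ** 2 for p in pts) / n
        summaries.append((n, mean, var))
    return summaries


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_on_summaries(summaries, k, iters=100, var_floor=1e-6):
    """Phase 2 (simplified): fit a k-component Gaussian mixture using only
    the sub-cluster summaries, never the raw points."""
    total = sum(n for n, _m, _v in summaries)
    centers = sorted(m for _n, m, _v in summaries)
    # Assumed initialization: spread the component means over the sorted cell means.
    mus = [centers[(2 * j + 1) * len(centers) // (2 * k)] for j in range(k)]
    grand_mean = sum(n * m for n, m, _v in summaries) / total
    grand_var = sum(n * (v + (m - grand_mean) ** 2) for n, m, v in summaries) / total
    variances = [max(grand_var, var_floor)] * k
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: one responsibility vector per sub-cluster, computed at its mean.
        resp = []
        for _n, m, _v in summaries:
            dens = [weights[j] * normal_pdf(m, mus[j], variances[j]) for j in range(k)]
            norm = sum(dens)
            resp.append([d / norm if norm > 0 else 1.0 / k for d in dens])

        # M-step: each sub-cluster counts n times; its within-cell variance v
        # is added to the squared offset of its mean from the component mean.
        for j in range(k):
            mass = sum(n * r[j] for (n, _m, _v), r in zip(summaries, resp))
            if mass <= 0:
                continue
            weights[j] = mass / total
            mus[j] = sum(n * r[j] * m for (n, m, _v), r in zip(summaries, resp)) / mass
            spread = sum(n * r[j] * (v + (m - mus[j]) ** 2)
                         for (n, m, v), r in zip(summaries, resp))
            variances[j] = max(var_floor, spread / mass)
    return weights, mus, variances


# Synthetic demo data (not from the paper): two well-separated Gaussians.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)] + \
       [rng.gauss(10.0, 1.0) for _ in range(2000)]
summaries = summarize(data, width=0.5)
weights, mus, variances = em_on_summaries(summaries, k=2)
print(len(data), "points reduced to", len(summaries), "sub-clusters")
print("weights:", weights, "means:", mus, "variances:", variances)
```

The point of the sketch is that the EM loop iterates over sub-clusters rather than raw points, so its per-iteration cost depends on the number of summaries, while the cardinality, mean, and variance of each sub-cluster still enter the parameter updates.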
DOI
10.1109/ICDM.2003.1250907
Print ISSN
1550-4786
Publisher Statement
Copyright © 2003 IEEE. Access to external full text or publisher's version may require subscription.
Additional Information
Paper presented at the 3rd IEEE International Conference on Data Mining, Nov 19-22, 2003, Melbourne, Florida.
ISBN of the source publication: 9780769519784
Full-text Version
Publisher’s Version
Language
English