Staff Publications

Scalable model-based clustering by working on data summaries

Huidong JIN, Department of Information Systems, Lingnan University, Tuen Mun, N.T, Hong Kong; Department of Computer Sci. and Eng., Chinese University of Hong Kong, Shatin, N.T, Hong Kong
Man Leung WONG, Department of Information Systems, Lingnan University, Tuen Mun, N.T, Hong Kong
Kwong Sak LEUNG, Department of Computer Sci. and Eng., Chinese University of Hong Kong, Shatin, N.T, Hong Kong

Document Type

Conference paper

Source Publication

Proceedings - IEEE International Conference on Data Mining, ICDM

Publication Date

1-1-2003

Abstract

The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework: first, a large data set is summarized into sub-clusters; then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly incorporates the cardinality, mean, and covariance information of each sub-cluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They also generate significantly better results than other model-based clustering systems using similar computational resources.
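
The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the EMADS algorithm from the paper: it assumes a simplified E-step that evaluates each Gaussian component only at the sub-cluster means, and an M-step that recombines the per-sub-cluster cardinality, mean, and covariance through the law of total covariance. The function names `summarize` and `em_on_summaries`, the farthest-point initialization, and the `reg` regularization term are all hypothetical choices made for illustration.

```python
import numpy as np

def summarize(X, labels):
    """Phase 1 (assumed): reduce each sub-cluster to (count, mean, covariance)."""
    n, mu, S = [], [], []
    for i in np.unique(labels):
        pts = X[labels == i]
        m = pts.mean(axis=0)
        diff = pts - m
        n.append(len(pts))
        mu.append(m)
        # Divide by the count (not count - 1) so moments recombine exactly.
        S.append(diff.T @ diff / len(pts))
    return np.array(n, dtype=float), np.array(mu), np.array(S)

def log_gauss(x, mean, cov):
    """Log density of N(mean, cov) at each row of x."""
    d = x.shape[-1]
    diff = x - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=-1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def em_on_summaries(n, mu, S, k, iters=50, reg=1e-6):
    """Phase 2 (simplified sketch): fit a k-component Gaussian mixture
    using only the summary statistics, never the raw data."""
    M, d = mu.shape
    N = n.sum()
    # Hypothetical initialization: largest sub-cluster, then farthest points.
    chosen = [int(np.argmax(n))]
    while len(chosen) < k:
        dist = np.linalg.norm(mu[:, None, :] - mu[chosen][None, :, :], axis=2)
        chosen.append(int(np.argmax(dist.min(axis=1))))
    means = mu[chosen].copy()
    # Start every component from the global covariance implied by the summaries.
    dm = mu - (n[:, None] * mu).sum(axis=0) / N
    gcov = (n[:, None, None] * (S + dm[:, :, None] * dm[:, None, :])).sum(axis=0) / N
    covs = np.repeat((gcov + reg * np.eye(d))[None], k, axis=0)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sub-cluster,
        # evaluated at the sub-cluster mean (a simplifying assumption).
        logp = np.stack(
            [np.log(w[j]) + log_gauss(mu, means[j], covs[j]) for j in range(k)],
            axis=1,
        )
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: each sub-cluster contributes in proportion to its cardinality.
        nk = (n[:, None] * r).sum(axis=0) + 1e-12
        w = nk / nk.sum()
        means = (n[:, None] * r).T @ mu / nk[:, None]
        for j in range(k):
            dj = mu - means[j]
            scatter = S + dj[:, :, None] * dj[:, None, :]
            weights = (n * r[:, j])[:, None, None]
            covs[j] = (weights * scatter).sum(axis=0) / nk[j] + reg * np.eye(d)
    return w, means, covs
```

Because the loop touches only one record per sub-cluster, its cost per iteration depends on the number of sub-clusters rather than the number of data items, which is the source of the speed-up the abstract refers to.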

DOI

10.1109/ICDM.2003.1250907

Print ISSN

1550-4786

Additional Information

Paper presented at the 3rd IEEE International Conference on Data Mining, Nov 19-22, 2003, Melbourne, Florida.

ISBN of the source publication: 9780769519784

Full-text Version

Publisher’s Version

Language

English

