Clustering in massive data sets

Murtagh, Fionn

(2002)

Murtagh, Fionn (2002) Clustering in massive data sets
In: Handbook of Massive Data Sets. Kluwer, Norwell, MA, USA.

Our Full Text Deposits

Full text access: Open

Full Text - 456.13 KB

Links to Copies of this Item Held Elsewhere


Abstract

We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case-studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Theoretical results developed as far back as the 1960s still very often remain topical. More recent work is also covered in this article. This includes a solution for the statistical question of how many clusters there are in a dataset. We also look at one line of inquiry in the use of clustering for human-computer user interfaces. Finally, the visualization of data leads to the consideration of data arrays as images, and we speculate on future results to be expected here.

Information about this Version

This is a Published version
This version's date is: 2002
This item is not peer reviewed

Link to this Version

https://repository.royalholloway.ac.uk/items/44dda43c-f5a2-e725-0175-b47808ac9ca0/1/

Item TypeBook Item
TitleClustering in massive data sets
AuthorsMurtagh, Fionn
DepartmentsFaculty of Science\Computer Science

Identifiers

Deposited by () on 23-Dec-2009 in Royal Holloway Research Online.Last modified on 23-Dec-2009

References

1 D. Allard and C. Fraley. Non-parametric maximum likelihood estimation of features in spatial point processes using Voronoi tessellation. Journal of the American Statistical Association, 92:1485-1493, 1997.

2 P. Arabie and L. J. Hubert. An overview of combinatorial data analysis, pages 5-63. In , Arabie et al. (1996), 1996.

3 P. Arabie, L. J. Hubert, and G. De Soete, editors. Clustering and Classification . Singapore: World Scientific, 1996.

4 S. Banerjee and A. Rosenfeld. Model-based cluster analysis. Pattern Recognition, 26:963-974, 1993.

5 J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803-821, 1993.

6 K. P. Bennett, U. Fayyad, and D. Geiger. Density-based indexing for approximate nearest neighbor queries. Technical report, Microsoft, 1999. Microsoft Research Technical Report MSR-TR-98-58.

7 J. L. Bentley and J. H. Friedman. Fast algorithms for constructing minimal spanning trees in coordinate spaces. IEEE Transactions on Computers, C-27:97-105, 1978.

8 Jon Louis Bentley , Bruce W. Weide , Andrew C. Yao, Optimal Expected-Time Algorithms for Closest Point Problems, ACM Transactions on Mathematical Software (TOMS), v.6 n.4, p.563-580, Dec. 1980

9 Michael W. Berry , Zlatko Drmac , Elizabeth R. Jessup, Matrices, Vector Spaces, and Information Retrieval, SIAM Review, v.41 n.2, p.335-362, June 1999

10 M. W. Berry, B. Hendrickson, and P. Raghavan. Sparse matrix reordering schemes for browsing hypertext. In J. Renegar, M. Shub, and S. Smale, editors, Lectures in Applied Mathematics (LAM) Vol. 32: The Mathematics of Numerical Analysis, pages 99-123. American Mathematical Society, 1996.

11 Kevin S. Beyer , Jonathan Goldstein , Raghu Ramakrishnan , Uri Shaft, When Is ''Nearest Neighbor'' Meaningful?, Proceeding of the 7th International Conference on Database Theory, p.217-235, January 10-12, 1999

12 Allan Borodin , Rafail Ostrovsky , Yuval Rabani, Subquadratic approximation algorithms for clustering problems in high dimensional spaces, Proceedings of the thirty-first annual ACM symposium on Theory of computing, p.435-444, May 01-04, 1999, Atlanta, Georgia, United States

13 A. J. Broder, Strategies for efficient incremental nearest neighbor search, Pattern Recognition, v.23 n.1-2, p.171-178, January 1990

14 A. Z. Broder. On the resemblance and containment of documents, pages 21-29. IEEE Computer Society, 1998.

15 Andrei Z. Broder , Steven C. Glassman , Mark S. Manasse , Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States

16 M. Bruynooghe. Méthodes nouvelles en classification automatique des données taxinomiques nombreuses. Statistique et Analyse des Données, (3):24-42, 1977.

17 W. A. Burkhard , R. M. Keller, Some approaches to best-match file searching, Communications of the ACM, v.16 n.4, p.230-236, April 1973

18 S. D. Byers and A. E. Raftery. Nearest neighbor clutter removal for estimating features in spatial point processes. Journal of the American Statistical Association, 93:577-584, 1998.

19 J. G. Campbell, C. Fraley, D. Stanford, F. Murtagh, and A. E. Raftery. Model-based methods for textile fault detection. International Journal of Imaging Science and Technology, 10:339-346, 1999.

20 Cartia. Mapping the information landscape, client-server software system, 1999. Caria, Inc.

21 G. Celeux and G. Covaert. Gaussian parsimonious clustering models. Pattern Recognition, 28:781-793, 1995.

22 D. Cheriton and D. E. Tarjan. Finding minimum spanning trees. SIAM Journal on Computing, 5:724-742, 1976.

23 K. W. Church and J. I. Helfman. Dotplot: a program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, 2:153-174, 1993.

24 W. B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society for Information Science, 28:341-344, 1977.

25 C. Darken, J. Chang, and J. Moody. Learning rate schedules for faster stochastic gradient search. In Neural Networks for Signal Processing 2, Proceedings of the 1992 IEEE Workshop, Piscataway, 1992. IEEE Press.

26 Christian Darken , John Moody, Note on learning rate schedules for stochastic optimization, Proceedings of the 1990 conference on Advances in neural information processing systems 3, p.832-838, October 1990, Denver, Colorado, United States

27 C. Darken and J. Moody. Towards faster stochastic gradient search. In Hanson Moody and Lippmann, editors, Advances in Neural Information Processing Systems 4, San Mateo, 1992. Morgan Kaufman.

28 B. V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, New York, 1991.

29 A. Dasgupta and A. E. Raftery. Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93:294-302, 1998.

30 W. H. E. Day and H. Edelsbrunner. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification, 1:7-24, 1984.

31 C. de Rham. La classification hiérarchique ascendante selon la méthode des voisins réciproques. Les Cahiers de l'Analyse des Données, V: 135-144, 1980.

32 D. Defays. An efficient algorithm for a complete link method. Computer Journal, 20:364-366, 1977.

33 C. Delannoy. Un algorithme rapide de recherche de plus proches voisins. RAIRO Informatique/Computer Science, 14:275-286, 1980.

34 A. R. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-22, 1977.

35 A. Dobrzycki, H. Ebeling, K. Glotfelty, P. Freeman, F. Damiani, M. Elvis, and T. Calderwood. Chandra Detect 1.0 User Guide. Chandra X-Ray Center, 1999. Smithsonian Astrophysical Observatory, Version 0.9.

36 Lauren B. Doyle, Semantic Road Maps for Literature Searchers, Journal of the ACM (JACM), v.8 n.4, p.553-578, Oct. 1961

37 C. M. Eastman and S. F. Weiss. Tree structures for high dimensionality nearest neighbor searching. Information Systems, 7:115-122, 1982.

38 H. Ebeling and G. Wiedenmann. Detecting structure in two dimensions combining voronoi tessellation and percolation. Physical Review E, 47:704-714, 1993.

39 E. Forgy. Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21:768, 1965.

40 Chris Fraley, Algorithms for Model-Based Gaussian Hierarchical Clustering, SIAM Journal on Scientific Computing, v.20 n.1, p.270-281, Aug. 1998

41 C. Fraley and A. E. Raftery. How many clusters? which clustering method? answers via model-based cluster analysis. The Computer Journal, 41:578-588, 1999.

42 J. H. Friedman, F. Baskett, and L. J. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, C-24:1000-1006, 1975.

43 Jerome H. Freidman , Jon Louis Bentley , Raphael Ari Finkel, An Algorithm for Finding Best Matches in Logarithmic Expected Time, ACM Transactions on Mathematical Software (TOMS), v.3 n.3, p.209-226, Sept. 1977

44 B. Fritzke. Some competitive learning methods, 1997.

45 K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, C-24:750-753, 1975.

46 V. J. Gillet, D. J. Wild, P. Willett, and J. Bradshaw. Similarity and dissimilarity methods for processing chemical structure databases. The Computer Journal, 41:547-558, 1998.

47 A. D. Gordon. Classification. Champman and Hall, 2 edition, 1999.

48 A. Griffiths, L. A. Robinson, and P. Willett. Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation, 40:175-205, 1984.

49 D. Guillaume and F. Murtagh. Clustering of XML documents. Computer Physics Communications, 1999. submitted.

50 M. E. Hodgson. Reducing the computational requirements of the minimum-distance classifier. Remote Sensing of Environment, 25:117-128, 1988.

51 Ellis Horowitz , Sartaj Sahni, Fundamentals of Computer Alori, W. H. Freeman & Co., New York, NY, 1978

52 Anil K. Jain , Richard C. Dubes, Algorithms for clustering data, Prentice-Hall, Inc., Upper Saddle River, NJ, 1988

53 G. Jammal and A. Bijaoui. Multiscale image restoration for photon imaging systems. In SPIE Conference on Signal and Image Processing: Wavelet Applications in Signal and Image Processing, VII, July 1999.

54 J. Juan. Programme de classification hiérarchique par l'algorithme de la recherche en chaîne des voisins réciproques. Les Cahiers de l'Analyse des Données, VII:219-225, 1982.

55 R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773-795, 1995.

56 J. Kittler. A method for determining k-nearest neighbors. Kybernetes, 7:313-315, 1978.

57 E. D. Kolaczyk. Nonparametric estimation of gamma-ray burst intensities using haar wavelets. Astrophysical Journal, 483:340-349, 1997.

58 Eyal Kushilevitz , Rafail Ostrovsky , Yuval Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.614-623, May 24-26, 1998, Dallas, Texas, United States

59 P. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 1982. Technical note, Bell Laboratories, 1957.

60 J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297, Berkeley, 1976. University of California Press.

61 R. B. Marimont and M. B. Shapiro. Nearest neighbor searches and the curse of dimensionality. Journal of the Institute of Mathematics and Its Applications, 24:59-70, 1979.

62 L. Micó, J. Oncina, and E. Vidal. An algorithm for finding nearest neighbors in constant average time with a linear space complexity. In The 11th International Conference on Pattern Recognition, volume II, pages 557-560, New York, 1992. IEEE Computer Science Press.

63 Andrew W. Moore, Very fast EM-based mixture model clustering using multiresolution kd-trees, Proceedings of the 1998 conference on Advances in neural information processing systems II, p.543-549, July 1999

64 Rajeev Motwani , Prabhakar Raghavan, Randomized algorithms, Cambridge University Press, New York, NY, 1995

65 S. Mukherjee, E. D. Feigelson, G. J. Babu, F. Murtagh, C. Fraley, and A. Raftery. Three types of gamma-ray bursts. The Astrophysical Journal, 508:314-327, 1998.

66 F. Murtagh. A very fast, exact nearest neighbor algorithm for use in information retrieval. Information Technology, 1:275-283, 1982.

67 F. Murtagh. Expected time complexity results for hierarchic clustering algorithms which use cluster centers. Information Processing Letters, 16:237-241, 1983.

68 F. Murtagh. Complexities of hierarchic clustering algorithms: state of the art. Computational Statistics Quarterly, 1:101-113, 1984.

69 F. Murtagh. Multidimensional Clustering Algorithms. Würzburg: Physica-Verlag, 1985.

70 Fionn Murtagh, Comments on 'Parallel Algorithms for Hierarchical Clustering and Cluster Validity', IEEE Transactions on Pattern Analysis and Machine Intelligence, v.14 n.10, p.1056-1057, October 1992

71 F. Murtagh. Multivariate methods for data analysis. In A. Sandqvist and T.P. Ray, editors, Central Activity in Galaxies: From Observational Data to Astrophysical Diagnostics, pages 209-235, Berlin, 1993a. Springer-Verlag.

72 Fionn Murtagh, Search algorithms for numeric and quantitative data, Intelligent information retrieval: The case of astronomy and related space sciences, Kluwer Academic Publishers, Norwell, MA, 1993

73 F. Murtagh. Foreword to the special issue on clustering and classification. The Computer Journal, 41:517, 1998.

74 F. Murtagh and A. Heck. Multivariate Data Analysis. Dordrecht: Kluwer Academic, 1987.

75 F. Murtagh and M. H. Pajares. The Kohonen self-organizing map method: an assessment. Journal of Classification, 12:165-190, 1995.

76 F. Murtagh and A.E. Raftery. Fitting straight lines to point patterns. Pattern Recognition, 17:479-483, 1984.

77 F. Murtagh and J. L. Starck. Pattern clustering based on noise modeling in wavelet space. Pattern Recognition, 31:847-855, 1998.

78 F. Murtagh, J. L. Starck, and M. Berry. Overcoming the curse of dimensionality in clustering by means of the wavelet transform. The Computer Journal, 43:107-120, 2000.

79 Radford M. Neal , Geoffrey E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, Learning in graphical models, MIT Press, Cambridge, MA, 1999

80 H. Niemann, An efficient branch-and-bound nearest neighbour classifier, Pattern Recognition Letters, v.7 n.2, p.67-72, February 1988

81 B. K. Parsi and L. N. Kanal. An improved branch and bound algorithm for computing k-nearest neighbors. Pattern Recognition Letters, 3: 7-12, 1985.

82 Dan Pelleg , Andrew Moore, Accelerating exact k-means algorithms with geometric reasoning, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.277-281, August 15-18, 1999, San Diego, California, United States

83 S. A. Perry and P. Willett. A review of the use of inverted files for best match searching in information retrieval systems. Journal of Information Science, 6:59-66, 1983.

84 P. Poinçot, S. Lesteven, and F. Murtagh. A spatial user interface to the astronomical literature. Astronomy and Astrophysics Supplement, 130:183-191, 1998.

85 P. Poinçot, S. Lesteven, and F. Murtagh. Maps of information spaces: assessments from astronomy. Journal of the American Society for Information Science, 1999. submitted.

86 V. Ramasubramanian , K. K. Paliwal, An efficient approximation-elimination algorithm for fast nearest-neighbor search based on a spherical distance coordinate formulation, Pattern Recognition Letters, v.13 n.7, p.471-480, July 1992

87 M. Richetin, G. Rives, and M. Naranjo. Algorithme rapide pour la détérmination des k plus proches voisins. RAIRO Informatique/Computer Science, 14:369-378, 1980.

88 F. J. Rohlf. Algorithm 76: Hierarchical clustering using the minimum spanning tree. The Computer Journal, 16:93-95, 1973.

89 F. J. Rohlf. A probabilistic minimum spanning tree algorithm. Information Processing Letters, 7:44-48, 1978.

90 F. J. Rohlf. Single link clustering algorithms. In P. R. Krishnaiah and L.N. Kanal, editors, Handbook of Statistics, volume 2, pages 267-284, Amsterdam, 1982. North-Holland.

91 E V Ruiz, An algorithm for finding nearest neighbours in (approximately) constant average time, Pattern Recognition Letters, v.4 n.3, p.145-157, July 1986

92 Gerard Salton , Michael J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, 1986

93 Masa-aki Sato , Shin Ishii, Reinforcement learning based on on-line EM algorithm, Proceedings of the 1998 conference on Advances in neural information processing systems II, p.1052-1058, July 1999

94 T. Schreiber. Efficient search for nearest neighbors. In A.S. Weigend and N.A Gershenfeld, editors, Predicting the Future and Understanding the Past: A Comparison of Approaches, New York, 1993. Addison-Wesley.

95 G. Schwarz. Estimating the dimension of a model. The Annals of Statistics , 6:461-464, 1978.

96 SDSS. Sloan digital sky survey, 1999.

97 Marvin Shapiro, The choice of reference points in best-match file searching, Communications of the ACM, v.20 n.5, p.339-343, May 1977

98 R. Sibson. Slink: an optimally efficient algorithm for the single link cluster method. The Computer Journal, 16:30-34, 1973.

99 A. F. Smeaton and C. J. van Rijsbergen. The nearest neighbor problem in information retrieval: an algorithm using upperbounds. ACM SIGIR Forum, 16:83-87, 1981.

100 P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. San Francisco: W. H. Freeman, 1973.

101 Helmuth Spath, The Cluster Dissection and Analysis Theory FORTRAN Programs Examples, Prentice-Hall, Inc., Upper Saddle River, NJ, 1985

102 J.-L. Starck , F. Murtagh , A. Bijaoui, Image processing and data analysis: the multiscale approach, Cambridge University Press, New York, NY, 1998

103 R. E. Tarjan. An improved algorithm for hierarchical clustering using strong components. Information Processing Letters, 17:37-41, 1983.

104 B. Thiesson, C. Meek, and D. Heckerman. Accelerating EM for large databases. Technical report, Microsoft, 1999. Microsoft Research Technical Report MST-TR-99-31.

105 Stephen F. Weiss, A probabilistic algorithm for nearest neighbour searching, Proceedings of the 3rd annual ACM conference on Research and development in information retrieval, p.325-333, June 23-27, 1980, Cambridge, England

106 H. D. White and K. W. McCain. Visualization of literatures. Annual Review of Information Science and Technology (ARIST), 32:99-168, 1997.

107 E. M. Rasmussen , P. Willett, Efficiency of hierarchic agglomerative clustering using the ICL distributed array processor, Journal of Documentation, v.45 n.1, p.1-24, March 1989

108 A. C. Yao. An o(|e| log log |υ|) algorithm for finding minimum spanning trees. Information Processing Letters, 4:21-23, 1975.

109 T. P. Yunck. A technique to identify nearest neighbors. IEEE Transactions on Systems, Man, and Cybernetics, SMC-6:678-683, 1976.

110 C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20:68-86, 1971.

111 G. Zheng, J. L. Starck, J.G. Campbell, and F. Murtagh. Multiscale transforms for filtering financial data streams. Journal of Computational Intelligence in Finance, 7, 1999.


Details