Accelerated Spectral Clustering Based on Improved Landmark Selection

doi:10.16180/j.cnki.issn1007-7820.2021.05.009

Abstract

Traditional landmark-based accelerated spectral clustering algorithms for large-scale data are susceptible to unevenly distributed landmark points and to outlier landmark points, and sampling methods such as K-means consume considerable time and memory on large-scale data. To address these problems, this study proposes an accelerated spectral clustering algorithm based on improved landmark selection. The algorithm uses the standard deviation of the pairwise similarity matrix between landmark points to measure how uniformly the landmark points are distributed. It selects the most uniformly distributed set from several randomly generated landmark sets and then removes outlier landmark points with low local density. A sparse similarity matrix is constructed between the resulting landmark set and the original data set, and K-means clustering is performed on the matrix formed by the first k right singular vectors obtained from the singular value decomposition of that matrix, which yields the final clustering result. The study analyzes the time complexity and space complexity of the algorithm theoretically and verifies the algorithm experimentally. The results show that, on some data sets, the accuracy of the algorithm is 3%~10% higher than that of the random sampling method, and its running time is 50%~60% lower than that of the K-means sampling method.

Keywords: spectral clustering, large data sets, landmark sampling, outlier point, standard deviation, sparse similarity matrix, local density, singular value decomposition
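The steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the similarity kernel, the local density definition, or the outlier threshold, so the Gaussian kernel, the mean-similarity density, and the parameters `sigma`, `keep_ratio`, `c`, and `r` below are all assumptions, as is the reading that a smaller standard deviation means a more uniform landmark set. The degree normalization before the SVD follows the usual landmark-based spectral clustering formulation, and dense arrays stand in for sparse storage to keep the example short.

```python
import numpy as np

def gaussian_sim(a, b, sigma):
    """Gaussian similarity between every row of a and every row of b (assumed kernel)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def select_landmarks(X, p, c, sigma, rng):
    """Draw c random candidate sets of p points and keep the set whose
    pairwise similarities have the smallest standard deviation."""
    best_idx, best_std = None, np.inf
    for _ in range(c):
        idx = rng.choice(len(X), size=p, replace=False)
        S = gaussian_sim(X[idx], X[idx], sigma)
        std = S[np.triu_indices(p, k=1)].std()  # distinct pairs only
        if std < best_std:
            best_idx, best_std = idx, std
    return X[best_idx]

def drop_outliers(L, sigma, keep_ratio):
    """Assumed local density: mean similarity to the other landmarks.
    Keep the keep_ratio fraction of landmarks with the highest density."""
    S = gaussian_sim(L, L, sigma)
    np.fill_diagonal(S, 0.0)
    density = S.sum(axis=1) / (len(L) - 1)
    d = max(1, int(round(keep_ratio * len(L))))
    keep = np.argsort(density)[::-1][:d]
    return L[keep]

def landmark_embedding(X, L, k, r, sigma):
    """Build the landmark-to-data similarity matrix (d x n), keep the r largest
    entries per data point, normalize, and return the first k right singular vectors."""
    W = gaussian_sim(L, X, sigma)
    smallest = np.argsort(W, axis=0)[:-r, :]      # all but the r largest per column
    np.put_along_axis(W, smallest, 0.0, axis=0)   # sparsify (stored densely here)
    Z = W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)       # columns sum to 1
    Z_hat = Z / np.sqrt(np.maximum(Z.sum(axis=1, keepdims=True), 1e-12))
    _, _, Vt = np.linalg.svd(Z_hat, full_matrices=False)
    return Vt[:k].T                               # one k-dimensional row per data point

# Toy data: two well-separated 2-D blobs (hypothetical example data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (200, 2)), rng.normal(5.0, 0.5, (200, 2))])
L = select_landmarks(X, p=40, c=5, sigma=1.0, rng=rng)
L = drop_outliers(L, sigma=1.0, keep_ratio=0.9)
E = landmark_embedding(X, L, k=2, r=3, sigma=1.0)
print(E.shape)  # (400, 2)
```

The rows of `E` would then be passed to any K-means implementation to obtain the final cluster labels.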

CLC number: TP301.6

XU Hangfan, LIU Cong, TANG Jiangang, PENG Dunlu. Accelerated Spectral Clustering Based on Improved Landmark Selection[J]. Electronic Science and Technology, 2021, 34(5): 47-53.

Algorithm	Landmark selection	Similarity matrix	Matrix decomposition
LSC-R	O(1)	O(pn)	O(p³+p²n)
LSC-K	O(tpn)	O(pn)	O(p³+p²n)
Nyström	O(1)	O(pn)	O(p³+pn)
LSC-ILS	O(cp²)	O(pn)	O(d³+d²n)
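To get a rough feel for how the terms in the table compare, the sketch below plugs hypothetical sizes into them. It assumes that n is the number of data points, p the number of sampled landmarks, t the number of K-means iterations, c the number of random candidate landmark sets, and d the number of landmarks left after outlier removal; this section does not define these symbols, and the values are made up. Big-O terms ignore constant factors, so the numbers indicate orders of magnitude only, not running times.

```python
def dominant_terms(n, p, t, c, d):
    """Evaluate the table's big-O terms for given sizes (constants ignored)."""
    return {
        "LSC-R":   {"select": 1,         "similarity": p * n, "decompose": p**3 + p**2 * n},
        "LSC-K":   {"select": t * p * n, "similarity": p * n, "decompose": p**3 + p**2 * n},
        "Nyström": {"select": 1,         "similarity": p * n, "decompose": p**3 + p * n},
        "LSC-ILS": {"select": c * p**2,  "similarity": p * n, "decompose": d**3 + d**2 * n},
    }

# Hypothetical problem sizes, chosen only for illustration.
sizes = dict(n=100_000, p=500, t=10, c=10, d=450)

totals = {name: sum(parts.values()) for name, parts in dominant_terms(**sizes).items()}
for name, total in totals.items():
    print(f"{name:8s} total ~ {total:.2e}")
```

With these particular sizes, the decomposition term dominates for the LSC variants, so keeping fewer landmarks (d < p) is what reduces the LSC-ILS total relative to LSC-R.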