电子科技 ›› 2021, Vol. 34 ›› Issue (5): 47-53.doi: 10.16180/j.cnki.issn1007-7820.2021.05.009

• • 上一篇    下一篇

改进地标点采样的加速谱聚类算法

徐航帆,刘丛,唐坚刚,彭敦陆   

  1. 上海理工大学 光电信息与计算机工程学院,上海 200082
  • 收稿日期:2020-02-10 出版日期:2021-05-15 发布日期:2021-05-24
  • 作者简介:徐航帆(1995-),男,硕士研究生。研究方向:数据挖掘、聚类分析等。|刘丛(1983-),男,博士,副教授。研究方向:图像处理、智能计算、数据挖掘。
  • 基金资助:
    国家自然科学基金(61703278);国家自然科学基金(61772342)

Accelerated Spectral Clustering Based on Improved Landmark Selection

XU Hangfan,LIU Cong,TANG Jiangang,PENG Dunlu   

  1. School of Optical-Electrical and Computer Engineering,University of Shanghai for Science and Technology,Shanghai 200093,China
  • Received:2020-02-10 Online:2021-05-15 Published:2021-05-24
  • Supported by:
    National Nature Science of Foundation of China(61703278);National Nature Science of Foundation of China(61772342)

摘要:

传统的基于地标点的大规模加速谱聚类算法易受分布不均匀地标点和离群地标点影响。K-means等采样方法在面对大规模数据时,时间空间消耗较大。针对以上问题,文中提出了一种改进地标点采样的加速谱聚类算法。该算法通过地标点间成对相似度矩阵的标准差来衡量地标点的分布均匀程度,选取随机的多组地标点集中分布最均匀的一组,去除局部密度较低的离群地标点;利用获得的地标点集与原始数据集构造稀疏相似度矩阵,并对该矩阵奇异值分解得到的前k个右奇异特征向量矩阵进行K-means聚类,得到最终聚类结果。文中从理论上分析了该算法时间复杂度和空间复杂度。验证结果表明该算法在一些数据集上比随机采样方法的准确率高3%~10%,和K-means采样方法相比时间消耗少50%~60%。

关键词: 谱聚类, 大数据, 地标点采样, 离群点, 标准差, 稀疏相似度矩阵, 局部密度, 奇异值分解

Abstract:

In order to solve the problems that the traditional landmark-based spectral clustering algorithm is susceptible to unevenly distributed landmark points and outlier landmark points, and its sampling methods such as K-means consume a large time and space when face large-scale data. This study proposes an accelerated spectral clustering based on improved landmark selection. The algorithm uses the standard deviation of the pairwise similarity matrix between landmark points to measure the uniformity of the distribution of landmark points. It selectes the landmark points set uniformly distributed from landmark points sets generated randomly, and then removes outlier landmark points with low local density. The sparse similarity matrix is constructed by the obtained landmark points set and the original data set. K-means clustering is performed on data points generated by the first k right singular feature vectors of the landmark points set to obtain the final clustering result. This study theoretically analyzes the time complexity and space complexity of the algorithm and performed experimental verification. Experimental results show that the algorithm is 3%~10% higher than that of the random sampling method, and the time-consuming is 50%~60% less than that of the K-means sampling method.

Key words: spectral clustering, large data sets, landmark sampling, outlier point, standard deviation, sparse similarity matrix, local density, singular value decomposition

中图分类号: 

  • TP301.6