Electronic Science and Technology, 2019, Vol. 32, Issue (5): 38-44. doi: 10.16180/j.cnki.issn1007-7820.2019.05.008

Improved CFSFDP Algorithm Based on Spark Framework

LI Qi, ZHANG Xin, ZHANG Pingkang, ZHANG Hang

  1. School of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China
  • Received: 2018-04-29 Online: 2019-05-15 Published: 2019-05-06
  • About the authors: LI Qi (1990-), male, master's student. Research interests: cloud computing and big data. | ZHANG Xin (1976-), male, PhD, associate professor. Research interests: next-generation wireless communication and its applications. | ZHANG Pingkang (1993-), male, master's student. Research interests: image processing. | ZHANG Hang (1991-), male, master's student. Research interests: cloud computing and big data.
  • Supported by:
    International Science & Technology Cooperation Program of China (2014DFA00670); Postgraduate Education Reform Project of Guizhou Province (课题黔教研合JG字[2016]15); Key Industry Project of Guizhou Science and Technology Agency (黔科合GY字[2010]3056)

Abstract:

CFSFDP is a recent density-based clustering algorithm. To remove its reliance on manually selecting cluster centers from the decision graph, this paper uses a slope-based criterion to locate the demarcation point between cluster centers and non-center points, which eliminates subjective error and allows the centers to be determined automatically. The algorithm is then parallelized on the Spark framework. Experimental results show that the proposed algorithm improves efficiency while eliminating human error, and that the parallelized version achieves good speedup and scalability, making it suitable for cluster analysis of massive data.
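The abstract does not spell out the slope criterion, so the following is only a minimal single-machine sketch of the general idea, not the authors' method. It assumes the standard CFSFDP quantities (local density ρ with a hard cutoff `d_c`, and δ as the distance to the nearest point of higher density) and assumes that the demarcation point is where the descending-sorted values of γ = ρ·δ drop most steeply. The function names, the cutoff value, and the largest-drop rule are all illustrative assumptions, and the Spark parallelization is not shown.

```python
import numpy as np


def density_peaks(points, d_c):
    """Compute rho, delta and the nearest higher-density neighbour (hard cutoff d_c)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    rho = (dist < d_c).sum(axis=1) - 1          # neighbours within d_c, excluding self
    n = len(points)
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    order = np.argsort(-rho, kind="stable")     # descending density; ties broken by index
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()            # densest point: use its largest distance
        else:
            higher = order[:rank]               # points treated as denser than i
            j = higher[np.argmin(dist[i, higher])]
            delta[i] = dist[i, j]
            nearest_higher[i] = j
    return rho, delta, nearest_higher


def select_centers_by_slope(rho, delta):
    """ASSUMED rule: cut the sorted gamma curve at its steepest drop."""
    gamma = rho * delta
    order = np.argsort(-gamma, kind="stable")
    sorted_gamma = gamma[order]
    drops = sorted_gamma[:-1] - sorted_gamma[1:]  # negated slope at unit spacing
    cut = int(np.argmax(drops))
    return order[:cut + 1]                        # points before the cut are centers


def assign_labels(rho, nearest_higher, centers):
    """Give each remaining point the label of its nearest higher-density neighbour."""
    labels = np.full(len(rho), -1)
    labels[centers] = np.arange(len(centers))
    for i in np.argsort(-rho, kind="stable"):     # denser points are labelled first
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels


# Toy data (hypothetical): two 3x3 unit grids, ten units apart along x.
grid = [(x, y) for x in range(3) for y in range(3)]
points = np.array(grid + [(x + 10, y) for x, y in grid], dtype=float)

rho, delta, nearest_higher = density_peaks(points, d_c=1.5)
centers = select_centers_by_slope(rho, delta)
labels = assign_labels(rho, nearest_higher, centers)
```

On this toy data the two grid midpoints have γ values far above every other point, so the steepest drop falls right after them. On real data the γ curve is rarely this clean, and a largest-drop rule may need smoothing or a different slope threshold.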

Key words: Spark, CFSFDP algorithm, decision graph, density peaks, clustering, parallelization

CLC Number:

  • TP301.6