2014, Vol. 27, Issue (2): 29-.

• Article

Efficient Algorithm of Canopy-Kmeans Based on Hadoop Platform

ZHAO Qing

  1. (School of Electronic Engineering, Xidian University, Xi'an 710071, China)
  • Online: 2014-02-15 Published: 2014-01-12
  • About the author: ZHAO Qing (born 1988), male, master's student. Research interests: cloud computing; big data and large-scale data mining on the Hadoop platform. E-mail: 522698733@qq.com

Abstract:

This paper introduces the MapReduce programming model on the Hadoop platform, analyzes the advantages and disadvantages of the traditional Kmeans and Canopy clustering algorithms, and proposes an improved Kmeans algorithm based on Canopy. To address the randomness of Canopy selection in the Canopy-Kmeans algorithm, the "minimum-maximum principle" is applied, which avoids blind Canopy selection. The algorithm is implemented with the MapReduce parallel programming method and applied to clustering massive volumes of news data. Experiments show that this method achieves higher accuracy and stability than the traditional Kmeans and Canopy algorithms.

Key words: Hadoop; MapReduce; Canopy-Kmeans algorithm; clustering
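
The abstract names the "minimum-maximum principle" without defining it. The sketch below is one plausible reading, not the paper's implementation: it assumes the principle means farthest-first selection, where each new center is the point whose minimum distance to the already chosen centers is largest, and it then uses those centers to seed ordinary Kmeans iterations. The function names (`min_max_seeds`, `assign`, `kmeans`), the choice of the first point as the starting center, and the toy data are all hypothetical. The Canopy distance thresholds and the MapReduce parallelization are omitted, so this runs on a single machine.

```python
import math

def min_max_seeds(points, k):
    """Pick k seed centers by farthest-first selection (assumed reading of
    the "minimum-maximum principle")."""
    seeds = [points[0]]  # assumption: start deterministically from the first point
    # min_d[i] is the distance from points[i] to its nearest chosen seed.
    min_d = [math.dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        # Choose the point whose minimum distance to the seeds is maximal.
        idx = max(range(len(points)), key=min_d.__getitem__)
        seeds.append(points[idx])
        min_d = [min(d, math.dist(p, points[idx])) for p, d in zip(points, min_d)]
    return seeds

def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def kmeans(points, centers, iterations=10):
    """Plain Kmeans (Lloyd) iterations starting from the given centers."""
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the coordinate-wise mean of the cluster members.
                new_centers.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        centers = new_centers
    return centers, assign(points, centers)

# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = min_max_seeds(points, k=2)
centers, labels = kmeans(points, seeds)
print(seeds, centers, labels)
```

Because the second seed is forced far away from the first, the two seeds land in different groups on this toy data, which is the kind of stability the abstract attributes to avoiding random Canopy selection.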

CLC Number:

  • TP301.6