Design of Data Mining System Based on Cloud Computing

doi:10.16180/j.cnki.issn1007-7820.2019.08.015

Abstract

To handle exponentially growing data volumes efficiently and to improve data storage and computing capacity, this paper proposes a design for a data mining system based on cloud computing. The work first analyzes the components and execution mechanism of Spark, a mainstream cloud computing platform, and examines the programming principles of its computing architecture. Spark is then used to parallelize the C4.5 algorithm and the K-medoids clustering algorithm, which improves the algorithms' running speed, convergence speed, and the stability of their results. Tests show that, when analyzing and processing massive data sets, the proposed cloud computing platform improves the computing speed of the overall system and substantially improves classification efficiency, while staying within the classification error.

Key words: cloud computing, data mining, Spark, C4.5 algorithm, K-medoids clustering algorithm
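The abstract does not describe how the K-medoids parallelization is implemented, so the following is only a minimal sketch of the general idea, written in plain Python rather than Spark. It assumes Manhattan distance and a data set that is already split into partitions; the function names (`assign_partition`, `merge_groups`, `update_medoid`, `k_medoids`) are hypothetical and do not come from the paper. The assignment step runs independently on each partition, which is the part a platform such as Spark could distribute across workers, and the per-partition results are then merged before the medoids are updated.

```python
from collections import defaultdict
from functools import reduce


def dist(a, b):
    """Manhattan distance between two equal-length points (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign_partition(partition, medoids):
    """Map step: group the points of one partition by nearest medoid index."""
    groups = defaultdict(list)
    for p in partition:
        nearest = min(range(len(medoids)), key=lambda i: dist(p, medoids[i]))
        groups[nearest].append(p)
    return groups


def merge_groups(left, right):
    """Reduce step: combine two per-partition groupings into one."""
    merged = defaultdict(list, {k: list(v) for k, v in left.items()})
    for k, pts in right.items():
        merged[k].extend(pts)
    return merged


def update_medoid(points):
    """Pick the cluster member with the smallest total distance to the others."""
    return min(points, key=lambda c: sum(dist(c, p) for p in points))


def k_medoids(partitions, medoids, max_iter=20):
    """Alternate assignment and medoid update until the medoids stop changing.

    Returns the final medoids and the clusters from the last assignment step.
    Assumes at least one partition.
    """
    for _ in range(max_iter):
        per_partition = (assign_partition(part, medoids) for part in partitions)
        clusters = reduce(merge_groups, per_partition)
        new_medoids = [
            update_medoid(clusters[i]) if clusters.get(i) else medoids[i]
            for i in range(len(medoids))
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters


# Toy data: two partitions, two well-separated groups of points.
partitions = [
    [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0)],
    [(2.0, 1.0), (8.0, 9.0), (9.0, 8.0)],
]
final_medoids, final_clusters = k_medoids(partitions, [(1.0, 2.0), (8.0, 9.0)])
print(final_medoids)  # prints [(1.0, 1.0), (9.0, 9.0)]
```

In this sketch only the assignment step is partition-local; the medoid update still scans each merged cluster in one place, and a real distributed version would likely need to parallelize that step as well.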

CLC number:

TN99

LAN Jiman. Design of Data Mining System Based on Cloud Computing[J]. Electronic Science and Technology, 2019, 32(8): 70-74.

Figures/Tables: 6 (Figures 1 to 6)

