Electronic Science and Technology, 2019, Vol. 32, Issue (4): 60-64. doi: 10.16180/j.cnki.issn1007-7820.2019.04.013

Improved Random Forest Algorithm Based on Spark

SUN Yue, YUAN Jian

  1. School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Received: 2018-03-18  Published: 2019-04-15  Released online: 2019-03-27
  • About the authors: SUN Yue (1993-), male, master's student; research interest: data mining. YUAN Jian (1971-), female, Ph.D., associate professor; research interests: cloud computing security, big data correlation, and intelligent transportation.
  • Supported by:
    National Natural Science Foundation of China (61775139)

Abstract:

The classical single-machine random forest algorithm cannot meet the demands of massive data processing. To address this, an improved random forest algorithm was designed and implemented using Spark distributed in-memory computing technology. First, the importance of each feature is calculated, and the features are divided into common features, unique features, and non-important features. Then, features are randomly selected from each feature subspace in order and in proportion. Finally, experiments on a Spark cluster analyze the classification performance, speedup ratio, and efficiency of the improved random forest algorithm. The results show that the improved algorithm makes random forest construction more efficient, can be applied to massive data mining problems, and scales well.

Key words: random forest, Spark, feature space, ReliefF algorithm, high-dimensional data, classification model

CLC Number:

  • TP311.13
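
The two feature-handling steps named in the abstract (partitioning features by importance, then sampling from each subspace in order and in proportion) can be illustrated with a minimal single-machine sketch. Everything specific here is an assumption made for illustration, not the authors' method: the abstract does not say how the three groups are defined, so the sketch assumes per-class importance weights (for example, ReliefF-style scores) and a hypothetical `threshold`, calling a feature "common" if it is important for every class, "unique" if important for only some classes, and "non-important" otherwise. The function names, the proportions, and the top-up rule are likewise hypothetical, and the Spark distribution is omitted entirely.

```python
import random
from typing import Dict, List, Sequence

ORDER = ("common", "unique", "non_important")  # assumed visiting order


def partition_features(class_weights: Sequence[Sequence[float]],
                       threshold: float) -> Dict[str, List[int]]:
    """Split feature indices into three subspaces.

    class_weights[c][j] is the (assumed) importance of feature j for class c.
    """
    n_classes = len(class_weights)
    n_features = len(class_weights[0])
    groups: Dict[str, List[int]] = {name: [] for name in ORDER}
    for j in range(n_features):
        # Count the classes for which feature j clears the threshold.
        hits = sum(1 for row in class_weights if row[j] >= threshold)
        if hits == n_classes:
            groups["common"].append(j)
        elif hits > 0:
            groups["unique"].append(j)
        else:
            groups["non_important"].append(j)
    return groups


def sample_features(groups: Dict[str, List[int]], m: int,
                    proportions: Dict[str, float],
                    rng: random.Random) -> List[int]:
    """Draw up to m distinct features, subspace by subspace."""
    chosen: List[int] = []
    for name in ORDER:
        # Quota for this subspace, capped by its size and the remaining budget.
        quota = min(round(m * proportions[name]),
                    len(groups[name]), m - len(chosen))
        chosen.extend(rng.sample(groups[name], quota))
    if len(chosen) < m:
        # Top up from whatever is left if rounding or small subspaces fell short.
        leftovers = [j for name in ORDER for j in groups[name]
                     if j not in chosen]
        chosen.extend(rng.sample(leftovers,
                                 min(m - len(chosen), len(leftovers))))
    return chosen


if __name__ == "__main__":
    # Toy weights: 2 classes, 6 features (values are made up).
    weights = [[0.9, 0.8, 0.1, 0.7, 0.0, 0.05],
               [0.8, 0.1, 0.9, 0.6, 0.02, 0.0]]
    groups = partition_features(weights, threshold=0.5)
    shares = {"common": 0.5, "unique": 0.25, "non_important": 0.25}
    print(groups)
    print(sample_features(groups, 4, shares, random.Random(0)))
```

In a random forest, a routine like `sample_features` would stand in for the usual uniform draw of candidate features when growing each tree, so that every tree sees a controlled mix of strongly and weakly informative features.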