Electronic Science and Technology ›› 2019, Vol. 32 ›› Issue (4): 60-64. doi: 10.16180/j.cnki.issn1007-7820.2019.04.013


Improved Random Forest Algorithm Based on Spark

SUN Yue, YUAN Jian

  1. School of Optical Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 210000, China
  • Received:2018-03-18 Online:2019-04-15 Published:2019-03-27
  • Supported by:
    National Natural Science Foundation of China(61775139)

Abstract:

Because the classical single-machine random forest algorithm cannot meet the demands of processing massive data, an improved random forest algorithm was designed and implemented using Spark's distributed in-memory computing technology. First, the importance of each feature was calculated, and the features were divided into public features, unique features, and non-important features. Then, features were randomly selected from each feature subspace in order and in proportion. Finally, experiments were run on Spark clusters to analyze the classification performance, speedup ratio, and efficiency of the improved random forest algorithm. The results demonstrated that the improved algorithm improves the efficiency of random forest construction and, with good scalability, can be applied to massive data mining problems.
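
The two feature-handling steps in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the authors' implementation: the abstract does not say how the three groups are delimited, so the score thresholds `high` and `low`, the per-group proportions, and the function names `partition_features` and `sample_subspace` are all assumptions made for illustration. The importance scores are taken as given (for example, from ReliefF) and the Spark distribution of tree building is left out.

```python
import random
from typing import Dict, List, Sequence

# Fixed order in which the feature groups are visited when sampling.
GROUP_ORDER = ("public", "unique", "non_important")


def partition_features(importance: Sequence[float],
                       high: float, low: float) -> Dict[str, List[int]]:
    """Split feature indices into three groups by importance score.

    Assumption: two thresholds on a single score stand in for the
    paper's grouping rule, which the abstract does not specify.
    """
    groups: Dict[str, List[int]] = {name: [] for name in GROUP_ORDER}
    for idx, score in enumerate(importance):
        if score >= high:
            groups["public"].append(idx)
        elif score >= low:
            groups["unique"].append(idx)
        else:
            groups["non_important"].append(idx)
    return groups


def sample_subspace(groups: Dict[str, List[int]],
                    proportions: Dict[str, float],
                    n_features: int,
                    rng: random.Random) -> List[int]:
    """Draw a per-tree feature subset, group by group and in proportion.

    Each group contributes round(n_features * proportion) features,
    capped at the group size, sampled without replacement.
    """
    chosen: List[int] = []
    for name in GROUP_ORDER:
        pool = groups[name]
        k = min(len(pool), round(n_features * proportions[name]))
        chosen.extend(rng.sample(pool, k))
    return chosen


# Hypothetical importance scores for six features.
scores = [0.9, 0.8, 0.5, 0.4, 0.1, 0.05]
groups = partition_features(scores, high=0.7, low=0.3)
subset = sample_subspace(
    groups,
    proportions={"public": 0.5, "unique": 0.25, "non_important": 0.25},
    n_features=4,
    rng=random.Random(0),
)
print(groups)
print(subset)
```

Sampling each tree's candidate features this way keeps the more important groups represented in every subspace while still leaving room for randomness, which is the stated motivation for dividing the feature space before selection.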

Key words: random forest, Spark, feature space, ReliefF algorithm, high-dimensional data, classification model

CLC Number: 

  • TP311.13