J4 ›› 2014, Vol. 41 ›› Issue (2): 191-196. doi: 10.3969/j.issn.1001-2400.2014.02.031

• Research Paper •

A data prefetching method for Hadoop MapReduce environments

ZHANG Xiaohong1,2; LUO Fen2; JIA Zongpu2; SHEN Jiquan3
  

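  (1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong  518055;
    2. School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan  454003;
    3. Center of Modern Education, Henan Polytechnic University, Jiaozuo, Henan  454003)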
  • Received: 2013-01-13 Online: 2014-04-20 Published: 2014-05-30
  • Corresponding author: ZHANG Xiaohong
  • About the author: ZHANG Xiaohong (born 1981), female, lecturer, Ph.D. E-mail: xh.zhang@hpu.edu.cn
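  • Supported by:

    National Natural Science Foundation of China (51274088); Department of Education of Henan Province (ITE12103); Doctoral Fund of Henan Polytechnic University (B2012-099); Provincial Key Laboratory of Mine Informatization, Henan Polytechnic University (KY2012-05)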

Prefetching method for Hadoop MapReduce environments

ZHANG Xiaohong1,2; LUO Fen2; JIA Zongpu2; SHEN Jiquan3

  (1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen  518055, China;
    2. School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo  454000, China;
    3. Center of Modern Education, Henan Polytechnic University, Jiaozuo  454000, China)
  • Received: 2013-01-13 Online: 2014-04-20 Published: 2014-05-30
  • Corresponding author: ZHANG Xiaohong

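Abstract:

To address the system performance problems caused by the remote data access delay and resource contention that Reduce tasks introduce, a data pre-fetching method based on pre-scheduling is proposed. The method hides the remote data access delay caused by Reduce tasks by pre-fetching data, and reduces the resource contention they cause by controlling the allocation of resources related to Reduce tasks. The method has been implemented in Hadoop-0.20.2. Experimental results show that, compared with default Hadoop MapReduce and Hadoop Online Prototype, the method improves system performance by more than 10%.

Key words: MapReduce, distributed computing, pre-fetching, scheduling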

Abstract:

Because of data dependencies and the task execution model in MapReduce environments, reduce tasks incur substantial remote data access delay and unnecessary resource contention, which degrades system performance. To address this problem, we propose a pre-fetching method based on pre-scheduling. The method hides the remote data access delay by pre-fetching data, and limits resource contention by adjusting the resource allocation of reduce tasks. The method is implemented in Hadoop-0.20.2. Experimental results show that it improves system performance by more than 10% compared with default Hadoop MapReduce and Hadoop Online Prototype.

Key words: MapReduce, distributed computing, pre-fetching, scheduling

CLC Number:

  • TP316.4