Journal of Xidian University, 2024, Vol. 51, Issue (2): 76-83. doi: 10.19665/j.issn1001-2400.20230504

• Information and Communication Engineering

Research on the large-scale parallel method of moments for domestic heterogeneous DCU platforms

JIA Ruipeng¹, LIN Zhongchao¹, ZUO Sheng¹, ZHANG Yu¹, YANG Meihong²

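  1. School of Electronic Engineering, Xidian University, Xi’an, Shaanxi 710071
    2. School of Computer Science and Technology, Qilu University of Technology, Ji’nan, Shandong 250000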
  • Received: 2023-03-21; Published: 2024-04-20; Online: 2023-10-13
  • Corresponding author: LIN Zhongchao (b. 1988), male, associate professor, E-mail: zclin@xidian.edu.cn
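  • Author biographies: JIA Ruipeng (b. 1996), male, Ph.D. candidate at Xidian University, E-mail: rpjia@stu.xidian.edu.cn;
    ZUO Sheng (b. 1992), male, associate research fellow, E-mail: zuosheng0503@163.com;
    ZHANG Yu (b. 1978), male, professor, E-mail: yuzhang@mail.xidian.edu.cn;
    YANG Meihong (b. 1966), female, research fellow, E-mail: yangmh@sdas.org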
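  • Funding:
    Key Research and Development Program of Shaanxi Province (2023-ZDLGY-09); Key Research and Development Program of Shaanxi Province (2022ZDLGY02-01); Key Research and Development Program of Shaanxi Province (2021GXLH-02); Fundamental Research Funds for the Central Universities (QTZX23018)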

Study of the parallel MoM on a domestic heterogeneous DCU platform

JIA Ruipeng¹, LIN Zhongchao¹, ZUO Sheng¹, ZHANG Yu¹, YANG Meihong²

  1. School of Electronic Engineering, Xidian University, Xi’an 710071, China
    2. School of Computer Science and Technology, Qilu University of Technology, Ji’nan 250000, China
  • Received: 2023-03-21; Online: 2023-10-13; Published: 2024-04-20

Abstract:

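In response to the trend toward supercomputers built on domestic heterogeneous many-core processors, a large-scale parallel higher-order method of moments (MoM) is implemented on a domestic CPU+DCU heterogeneous parallel system. Building on the load-balancing strategy of the homogeneous parallel MoM, an efficient "MPI+OpenMP+DCU" heterogeneous parallel programming framework is proposed. It resolves the mismatch between computing tasks and computing capability and achieves load balance throughout the heterogeneous parallel MoM computation. A fine-grained task-division strategy and asynchronous communication are used to pipeline the computation on the deep computing units (DCUs), overlapping computation with communication and improving the efficiency of heterogeneous co-computing for the MoM. Comparison with simulation results from the finite element method verifies the accuracy of the CPU+DCU heterogeneous parallel MoM. Scalability tests on the domestic DCU heterogeneous platform show that, compared with CPU-only computation, the CPU+DCU heterogeneous co-computing method achieves speedups of 5.5 to 7.0 times. The program can also run on the full system of the National Supercomputing Center in Xi’an: when the parallel scale is extended from 360 nodes to 3,600 nodes (1,036,800 processor cores in total), the parallel efficiency reaches about 73.5%.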
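The abstract says the framework matches computing tasks to computing capability but does not give the partitioning rule. The following is a minimal sketch of one common approach, not the authors' method. It assumes the work is a set of impedance-matrix rows and that each device's relative throughput has been measured beforehand. Rows are split in proportion to throughput, with largest-remainder rounding so the block sizes sum exactly to the total. The function name `split_rows_by_capability`, the device labels, and the throughput values are all hypothetical.

```python
import math
from typing import Dict


def split_rows_by_capability(n_rows: int, throughput: Dict[str, float]) -> Dict[str, range]:
    """Split n_rows work items into contiguous blocks, one per device,
    sized in proportion to each device's (assumed, pre-measured) throughput.

    Largest-remainder rounding guarantees the block sizes sum to n_rows.
    Hypothetical helper for illustration only.
    """
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if any(v < 0 for v in throughput.values()):
        raise ValueError("throughput values must be non-negative")
    total = sum(throughput.values())
    if total <= 0:
        raise ValueError("total throughput must be positive")

    names = list(throughput)
    ideal = [n_rows * throughput[name] / total for name in names]
    counts = [math.floor(x) for x in ideal]

    # Give the rows lost to flooring to the devices with the largest
    # fractional parts; ties keep dictionary order because sort is stable.
    leftover = n_rows - sum(counts)
    order = sorted(range(len(names)), key=lambda i: ideal[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1

    # Turn the counts into contiguous, non-overlapping row ranges.
    blocks: Dict[str, range] = {}
    start = 0
    for name, count in zip(names, counts):
        blocks[name] = range(start, start + count)
        start += count
    return blocks


if __name__ == "__main__":
    # Hypothetical node: one CPU and two DCUs, each DCU assumed 3x the CPU.
    plan = split_rows_by_capability(1000, {"cpu": 1.0, "dcu0": 3.0, "dcu1": 3.0})
    for device, rows in plan.items():
        print(device, rows.start, rows.stop, len(rows))
```

Running it prints `cpu 0 143 143`, `dcu0 143 572 429`, and `dcu1 572 1000 428`. A production heterogeneous MoM code would also have to weigh device memory limits and communication cost; this proportional rule only captures the idea of sizing each device's share by its computing capability.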

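Keywords: higher-order method of moments, domestic heterogeneous parallel system, deep computing unit (DCU), heterogeneous co-parallel computing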

Abstract:

In view of the development trend of domestic supercomputers toward CPU+DCU heterogeneous architectures, a massively parallel CPU+DCU heterogeneous higher-order method of moments (MoM) is developed. First, the basic strategy for accelerating MoM computation on the DCU is given. Building on the load-balancing strategy of the homogeneous parallel MoM, an efficient "MPI+OpenMP+DCU" heterogeneous parallel programming framework is proposed to resolve the mismatch between computing tasks and computing power. In addition, a fine-grained task-division strategy and asynchronous communication are adopted to pipeline the DCU computation, overlapping computation with communication and improving the acceleration performance of the program. The accuracy of the CPU+DCU heterogeneous parallel MoM is verified by comparing its simulation results with those of the finite element method. Scalability results on the domestic DCU heterogeneous platform show that the CPU+DCU heterogeneous co-computing program achieves speedups of 5.5 to 7.0 times over CPU-only computation at different parallel scales, and that the parallel efficiency reaches 73.5% when the run is scaled from 360 nodes to 3,600 nodes (1,036,800 cores in total).
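The abstract credits fine-grained task division and asynchronous communication with overlapping computation and communication on the DCU. The toy timing model below is not the authors' implementation. It only illustrates why splitting the same work into smaller chunks creates room for overlap. It assumes one transfer channel, one compute unit, and enough buffer memory to prefetch every chunk. The function name `pipeline_makespan` and all timings are hypothetical.

```python
from typing import List, Tuple


def pipeline_makespan(transfer: List[float], compute: List[float]) -> Tuple[float, float]:
    """Toy timing model for chunks that must be transferred, then computed.

    Returns (serial_time, overlapped_time):
      serial_time     - each chunk's transfer and compute run back to back,
                        with no overlap between chunks.
      overlapped_time - transfers are issued asynchronously on one channel,
                        so chunk i+1 moves while chunk i is computed
                        (assumes enough device memory to buffer every chunk).
    """
    if len(transfer) != len(compute):
        raise ValueError("transfer and compute must have the same length")

    serial_time = sum(transfer) + sum(compute)

    transfer_done = 0.0  # time at which the channel finishes the current chunk
    compute_done = 0.0   # time at which the compute unit finishes the current chunk
    for t, c in zip(transfer, compute):
        transfer_done += t                                   # channel never idles
        compute_done = max(compute_done, transfer_done) + c  # wait for data, then compute
    return serial_time, compute_done


if __name__ == "__main__":
    # Same hypothetical total work (4 s transfer, 8 s compute), two granularities.
    print(pipeline_makespan([4.0], [8.0]))          # one coarse task
    print(pipeline_makespan([1.0] * 4, [2.0] * 4))  # four fine-grained tasks
```

With one coarse task nothing can overlap, so both times are `12.0`. Splitting the same work into four chunks hides all but the first transfer, and the overlapped time drops to `9.0` (the first 1.0 s transfer plus 8.0 s of compute). In this model the overlapped time can never fall below the first chunk's transfer plus the total compute time. Finer granularity therefore helps until per-chunk overheads, which are not modeled here, start to dominate.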

Key words: method of moments, domestic heterogeneous platforms, deep computing unit (DCU), parallel algorithm

CLC number: TN820