Journal of Xidian University ›› 2020, Vol. 47 ›› Issue (2): 75-82. doi: 10.19665/j.issn1001-2400.2020.02.011


Reinforcement-learning-based multi-mode on-orbit control of aircraft

ZHANG Ying1,2,3, WEI Minfeng2,3,4, WANG Shihui2,3, TAO Leiyan5, CAO Jian1, ZHANG Xing1

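  1. School of Software and Microelectronics, Peking University, Beijing 100871, China
  2. Beijing Aerospace Automatic Control Institute, Beijing 100854, China
  3. National Key Laboratory of Science and Technology on Aerospace Intelligent Control, Beijing 100854, China
  4. School of Automation, Beijing Institute of Technology, Beijing 100081, China
  5. Beijing Institute of Remote Sensing Equipment, Beijing 100854, China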
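  • Received: 2019-08-30 Online: 2020-04-20 Published: 2020-04-26
  • Corresponding author: CAO Jian
  • About the author: ZHANG Ying (born 1982), female, senior engineer. E-mail: zhangying_@pku.edu.cn
  • Funding:
    National Natural Science Foundation of China (51877008)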

Aircraft reinforcement learning multi-mode control in orbit

ZHANG Ying1,2,3, WEI Minfeng2,3,4, WANG Shihui2,3, TAO Leiyan5, CAO Jian1, ZHANG Xing1

  1. School of Software and Microelectronics, Peking University, Beijing 100871, China
  2. Beijing Aerospace Automatic Control Institute, Beijing 100854, China
  3. National Key Laboratory of Science and Technology on Aerospace Intelligent Control, Beijing 100854, China
  4. School of Automation, Beijing Institute of Technology, Beijing 100081, China
  5. Beijing Institute of Remote Sensing Equipment, Beijing 100854, China
  • Received: 2019-08-30 Online: 2020-04-20 Published: 2020-04-26
  • Contact: CAO Jian

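Abstract (translated from the Chinese):

To improve the reliability of an aircraft control system during long-term on-orbit flight, a multi-mode control system scheme based on reinforcement learning is proposed. The system comprises a sensor module, a control module and an execution module. The sensor module feeds the flight data sensed by the aircraft to the control module in real time; these data consist of multidimensional structured floating-point data with historical correlation, which can be used directly for aircraft control, and physical characterization quantities unique to a particular sensor. The control module uses a real-time parallelized decision mechanism and is divided into an input layer, a feature extraction layer and a fully connected layer. The execution module receives the drive data output by the control module in real time, including the optimal state value used for decision-making and the action output value used for evaluation. The system decides which specific execution modules to use according to the optimal return value used for decision-making, while the output of a selected execution module depends on the action output value used for evaluation. In the multi-mode input/output state, the system gives the aircraft a fast response of 15 ms and a performance per watt of 5.23 GOP/s/W.

Keywords: aircraft, control system, multi-mode, reinforcement learning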

Abstract:

To improve the reliability of the aircraft control system during long-term on-orbit flight, a multi-mode control scheme based on reinforcement learning is proposed. The system comprises a sensor module, a control module and an execution module. The sensor module feeds the flight data sensed by the aircraft to the control module in real time. These data consist of multidimensional structured floating-point data with historical correlation, which can be used directly for aircraft control, and physical quantities unique to a particular sensor. The control module uses a real-time parallelized decision mechanism and is divided into an input layer, a feature extraction layer and a fully connected layer. The execution module receives the driving data output by the control module in real time, which include the optimal state value used for decision-making and the action output value used for evaluation. The system decides which specific execution modules to use based on the optimal return value used for decision-making, and the output of each selected execution module depends on the action output value used for evaluation. The system enables the aircraft to carry out long-term on-orbit operation in the multi-mode input and output state with a fast response of 15 ms and a performance per watt of 5.23 GOP/s/W.
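
The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `MultiModeController`, the layer sizes, the `tanh` squashing, the top-`n_active` selection rule and the random, untrained weights are all assumptions made for illustration. The sketch only shows the shape of the pipeline: sensor data enter an input layer, pass through a feature extraction layer, and two fully connected heads produce one score per execution module (used to decide which modules to drive) and one command per execution module (the value sent to each selected module).

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: y = W x + b, on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def init_layer(rng, n_out, n_in):
    """Random, untrained weights (assumption: stands in for a learned policy)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiModeController:
    """Hypothetical controller: input layer -> feature layer -> two FC heads."""

    def __init__(self, n_history, n_sensor_specific, n_actuators,
                 n_features=8, seed=0):
        rng = random.Random(seed)
        self.n_in = n_history + n_sensor_specific
        self.feature = init_layer(rng, n_features, self.n_in)
        # Head 1: one score per execution module, used to select modules.
        self.select_head = init_layer(rng, n_actuators, n_features)
        # Head 2: one command per execution module, used as its output value.
        self.action_head = init_layer(rng, n_actuators, n_features)

    def step(self, history, sensor_specific, n_active=1):
        # Input layer: concatenate structured history data and
        # sensor-specific physical quantities into one vector.
        x = list(history) + list(sensor_specific)
        if len(x) != self.n_in:
            raise ValueError("unexpected input length")
        h = relu(dense(x, *self.feature))        # feature extraction layer
        scores = dense(h, *self.select_head)     # fully connected head 1
        commands = [math.tanh(c)                 # fully connected head 2,
                    for c in dense(h, *self.action_head)]  # squashed to [-1, 1]
        # Drive only the n_active modules with the highest scores.
        ranked = sorted(range(len(scores)),
                        key=lambda i: scores[i], reverse=True)
        return {i: commands[i] for i in sorted(ranked[:n_active])}


controller = MultiModeController(n_history=4, n_sensor_specific=2, n_actuators=3)
drive = controller.step([0.1, 0.2, 0.3, 0.4], [1.0, -1.0], n_active=2)
print(drive)  # maps each selected execution-module index to its command
```

In a trained system the two heads would be learned with a reinforcement learning algorithm; the sketch leaves training out entirely and only mirrors the selection-then-output split described above.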

Key words: aircraft, control system, multi-mode, reinforcement learning

Chinese Library Classification (CLC) number:

  • TN911.22