Electronic Science and Technology, 2025, Vol. 38, Issue (8): 11-18. doi: 10.16180/j.cnki.issn1007-7820.2025.08.002

Smooth Exploration-Based Control Method for Inverted Pendulum Virtual-Reality Migration Learning

HUANGFU Jiaqi1,2, XUE Jie2, MOU Haiming2, LI Qingdu2

  1. School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  2. Institute of Machine Intelligence, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Received: 2024-01-12 Revised: 2024-02-08 Online: 2025-08-15 Published: 2025-07-10
  • Corresponding author: LI Qingdu (born 1980), male, Ph.D., professor. E-mail: liqd@usst.edu.cn. Research interests: bionic robot theory and technology, complex mechanical systems.
  • About the authors: HUANGFU Jiaqi (born 2000), female, master's student. Research interests: reinforcement learning, inverted pendulum control, among others.
    XUE Jie (born 1997), male, Ph.D. candidate. Research interests: robot motion control, reinforcement learning, among others.
  • Supported by:
    National Natural Science Foundation of China (92048205)

Abstract:

The nonlinear, underactuated dynamics of the inverted pendulum make it a standard benchmark for reinforcement learning (RL) algorithms. When an RL policy trained in simulation is deployed on a physical platform, its control signal exhibits abrupt changes and oscillations. These cause the deployment to fail and lead to high power consumption, excessive system wear, and hardware damage. To address this problem, this paper proposes regularization terms for smooth exploration of RL policies. To prevent abrupt policy changes at the physical deployment stage, a mutation regularization term is designed to constrain abrupt policy changes during the exploration stage. An oscillation regularization term is designed to suppress small-amplitude oscillations of the policy, and the value function is constrained over similar states. The smooth exploration regularization terms are applied to the proximal policy optimization (PPO) algorithm in virtual-to-real (sim-to-real) transfer experiments on an inverted pendulum. The experimental results show that the training speed of PPO with smooth exploration increases by 40% in simulation, and that the policy transfers successfully to the physical system with strong smoothness and robustness.

Key words: inverted pendulum, reinforcement learning, smooth exploration, mutation regularization term, oscillation regularization term, proximal policy optimization algorithm, PPO algorithm, virtual-to-real (sim-to-real) transfer
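
The abstract names two regularization terms but does not give their formulas. As a rough illustration only, the sketch below shows one plausible way such terms could be added to a PPO loss. Everything in it is an assumption rather than the authors' formulation: the function names, the squared-difference form of both penalties, the Gaussian perturbation used to generate "similar states", and the weights `lambda_mut`, `lambda_osc`, and `sigma` are all hypothetical.

```python
import numpy as np


def mutation_penalty(actions, next_actions):
    """Hypothetical mutation term: mean squared change between the actions
    taken at consecutive time steps (large jumps are penalized)."""
    diff = np.asarray(next_actions, dtype=float) - np.asarray(actions, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


def oscillation_penalty(value_fn, states, sigma=0.01, seed=0):
    """Hypothetical oscillation term: mean squared gap between the value
    estimate of each state and that of a nearby, slightly perturbed state."""
    rng = np.random.default_rng(seed)
    states = np.asarray(states, dtype=float)
    # Assumption: "similar states" are modeled as small Gaussian perturbations.
    nearby = states + rng.normal(0.0, sigma, size=states.shape)
    gap = np.asarray(value_fn(states)) - np.asarray(value_fn(nearby))
    return float(np.mean(gap ** 2))


def smoothed_loss(ppo_loss, mut, osc, lambda_mut=0.1, lambda_osc=0.1):
    """Add both weighted penalties to an already computed PPO loss value."""
    return ppo_loss + lambda_mut * mut + lambda_osc * osc


if __name__ == "__main__":
    # Toy batch: two 1-D actions at step t and at step t+1.
    mut = mutation_penalty([[0.0], [1.0]], [[0.0], [3.0]])
    # Toy value function: sum of the state components.
    osc = oscillation_penalty(lambda s: s.sum(axis=-1), np.ones((4, 2)))
    print(smoothed_loss(1.0, mut, osc))
```

In a real training loop these penalties would be computed on differentiable tensors inside the PPO update so that gradients flow through the policy and value networks. NumPy is used here only to keep the arithmetic easy to follow.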

CLC Number:

  • TP18