
CN115179280A - Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning - Google Patents


Info

Publication number
CN115179280A
Authority
CN
China
Prior art keywords
magnetic field
mechanical arm
target object
function
reward
Prior art date
Legal status
Granted
Application number
CN202210705509.0A
Other languages
Chinese (zh)
Other versions
CN115179280B (en)
Inventor
王志
丁泓宇
王博
陈春林
辛博
朱张青
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202210705509.0A
Publication of CN115179280A
Application granted
Publication of CN115179280B
Legal status: Active

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop: learning, adaptive, model based, rule based expert control
    • B25J9/1674 Programme controls characterised by safety, monitoring, diagnostic
    • B25J9/1676 Avoiding collision or forbidden zones

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a magnetic field-based reward shaping method for reinforcement-learning mechanical arm control, comprising the following steps: S1, designing a task environment, setting the relevant parameters of the mechanical arm, the target object and the obstacle, and setting the hyper-parameters of the reinforcement learning algorithm; S2, regarding the target object and the obstacle as permanent magnets with the same shapes as the target object and the obstacle, and determining how the magnetic field intensity distribution in three-dimensional space is calculated; S3, letting the mechanical arm interact with the environment, collecting training data, and calculating the magnetic field intensity at the end coordinates of the mechanical arm in the target and obstacle magnetic fields to obtain a magnetic field reward function; S4, converting the magnetic field reward function into a potential-based shaping reward function using the DPBA algorithm, and storing the shaping reward function and the training data in an experience replay pool; and S5, sampling a batch of data from the experience replay pool and training the mechanical arm with a reinforcement learning algorithm to complete the specified task. The invention provides the mechanical arm with richer orientation information about the target object and the obstacle, thereby improving the learning efficiency of the reinforcement learning algorithm.

Description

A Magnetic Field-Based Reward Shaping Method for Reinforcement-Learning Robotic Arm Control

Technical Field

The invention belongs to the field of robot control and specifically relates to a magnetic field-based reward shaping method for reinforcement-learning robotic arm control.

Background Art

Traditional manipulator control methods usually model the manipulator with kinematic and dynamic equations and solve for the end-effector pose and the angle of each joint. As industrial application scenarios become more complex and dynamic, the computational complexity of traditional model-based manipulator control keeps growing; such methods cannot adapt to changes in the external environment in time and lack the ability to learn from and generalize to the environment autonomously.

In recent years, reinforcement learning has been widely applied to robotic arm control tasks thanks to its unique advantages in sequential decision-making. By mapping the environmental state information acquired by sensors directly to the actions executed by the manipulator, it achieves end-to-end control and offers a new way to solve control problems for complex, continuous, high-dimensional systems. The optimization objective of reinforcement learning is to find the optimal policy that maximizes the cumulative reward in a Markov decision process (MDP), so designing a well-founded reward function is particularly important. Existing reward-function designs for manipulator motion-control tasks are relatively simple, including orientation-based and heuristic reward functions; they cannot provide rich reward signals for the manipulator in complex dynamic environments and therefore do not effectively improve learning efficiency.

Patent publication CN113894787A discloses a design method for a heuristic reward function in reinforcement-learning motion planning for robotic arms, including: establishing a heuristic function for the motion-planning problem; constructing a heuristic reward function from the heuristic function; determining the parameter values of the heuristic reward function; and using the constructed reward function to train a neural-network motion planner. That invention sets the heuristic reward based on the straight-line distance from the end of the arm to the target position, so it cannot provide higher-order reward signals and cannot guarantee the optimality of the learned policy.

Summary of the Invention

The purpose of the present invention is to address the problem that the reward functions used in existing reinforcement-learning manipulator control methods provide limited information in complex dynamic environments. It proposes a magnetic field-based reward shaping method for reinforcement-learning robotic arm control that, while keeping the optimal policy unchanged, provides the manipulator with richer orientation information about the target object and the obstacles, thereby improving the learning efficiency and convergence speed of the reinforcement learning algorithm.

The technical solution of the present invention is a magnetic field-based reward shaping method for reinforcement-learning robotic arm control, characterized in that it comprises the following steps:

S1. Design the task environment, set the relevant parameters of the manipulator, the target object and the obstacles, and set the hyper-parameters of the reinforcement learning algorithm;

S2. Treat the target object as a square permanent magnet of the same shape, and determine its magnetization direction and how the magnetic field strength distribution in three-dimensional space is computed; do the same for each obstacle;

S3. Let the manipulator interact with the environment, collect training data, compute the magnetic field strength at the end-effector coordinates in the target and obstacle magnetic fields according to the next state, and obtain the magnetic field reward function after standardization and normalization;

S4. Use the DPBA algorithm to convert the magnetic field reward function into a potential-based shaping reward function, and store it in the experience replay pool together with the training data;

S5. Sample a batch of data from the experience replay pool and use the reinforcement learning algorithm to train the optimal policy with which the manipulator avoids the obstacles and reaches the target object in a dynamic environment.

Preferably, step S1 comprises the following steps:

Step 1.1: design the state observation of the task environment and the action of the manipulator, specifically including:

a. The environment state observation contains the rotation angles of the three joints of the manipulator, the coordinates of the end of the manipulator, and the coordinates of the center points of the target object and the obstacle;

b. The action of the manipulator is the angular velocity of the three joint motors, i.e. the angle through which the three joints rotate in one time step.

Step 1.2: establish the connection with the manipulator and set the speed and acceleration ranges of the three joint rotations; specify how the target object and the obstacles are randomly generated, ensuring that the target object is within reach of the end of the manipulator and that the target object and the obstacles do not intersect.

Step 1.3: set the basic hyper-parameters of the reinforcement learning algorithm, including at least: the exploration noise; the size of the experience replay pool; the number of updates K per training round and the batch size N used for each update; the number of layers of the neural networks, the number of nodes per layer and the activation function; the discount factor γ; the optimizer and learning rates used to update the parameters of the policy network μθ(s) and the value-function network Qφ(s, a); and the soft-update step size τ of the target networks μθ′(s) and Qφ′(s, a).

In step S2, the analytical calculation of the magnetic field strength distribution of a square permanent magnet in three-dimensional space is as follows:

Assume the magnetization direction is the positive z-axis and the magnetization intensity is Mc. For a square permanent magnet with lengths l, w and h along the x-, y- and z-axes respectively, the magnetic field strength components at an arbitrary point P(x, y, z) in three-dimensional space along the x-, y- and z-axes can be expressed analytically in terms of two auxiliary functions, Γ(γ1, γ2, γ3) and a companion function, where ∈ is a very small constant.

[The component expressions Hx, Hy, Hz and the two auxiliary functions are given as equation images in the original document.]

The magnetic field strength of the square permanent magnet at any point in three-dimensional space is then obtained as the magnitude H = (Hx² + Hy² + Hz²)^(1/2).

Preferably, step S3 comprises the following steps:

Step 3.1: initialize the rotation angles of the three joints of the manipulator to zero and read the coordinates of the end of the manipulator; randomly set the positions of the target object and the obstacles, and read the coordinates of their center points in the world coordinate system. This gives the initial value of the state observation.

Step 3.2: according to the current state observation s and the policy, the manipulator outputs an action and noise is applied to it to obtain a; after interacting with the environment, the next state s′ and the original reward value r are obtained. Provided that the rotation angles of the three joints in the next state lie within their corresponding working ranges, the manipulator is controlled to move to the next state.

Step 3.3: transform the end-effector coordinates in the next state from the world coordinate system into the magnetic-field coordinate systems of the target magnet and the obstacle magnets.

Suppose that in the next state the coordinates of the end of the manipulator in the world coordinate system are pW, the translation of the origin of the target's magnetic-field coordinate system relative to the origin of the world coordinate system is (Tx, Ty, Tz), and the rotation angles of the target's magnetic-field coordinate system relative to the world coordinate system about the x-, y- and z-axes are θx, θy, θz, with positive directions following the right-hand rule. The coordinates pT of the end of the manipulator in the target's magnetic-field coordinate system are then obtained by applying this translation together with the rotation transformation matrices of the coordinate system about the x-, y- and z-axes.

[The coordinate-transformation formula and the explicit rotation matrices are given as equation images in the original document.]
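
A minimal NumPy sketch of the world-to-magnet-frame transform of step 3.3 is given below. The patent's exact composition order and sign convention appear only in the original equation images, so the Z·Y·X ordering used here is an assumption for illustration.

```python
# Hedged sketch of step 3.3: transform the end-effector position from the world
# frame into a magnet's field frame. The rotation composition order is an
# assumption; the patent's exact formula appears only as an equation image.
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def world_to_magnet(p_world, translation, angles_xyz):
    """Return the end-effector position expressed in the magnet's field frame."""
    tx, ty, tz = angles_xyz
    # Rotation of the magnet frame relative to the world frame (assumed Z*Y*X order).
    r_world_from_magnet = rot_z(tz) @ rot_y(ty) @ rot_x(tx)
    # Express the world-frame point in the magnet frame: un-translate, then un-rotate.
    return r_world_from_magnet.T @ (np.asarray(p_world) - np.asarray(translation))

# Example: target magnet sitting at (0.2, 0.0, 0.05) m, rotated 90 degrees about z.
p_T = world_to_magnet([0.25, 0.05, 0.10], [0.2, 0.0, 0.05], [0.0, 0.0, np.pi / 2])
print(p_T)
```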

Step 3.4: compute the magnetic field strength at the end-effector coordinates of the next state in the target's magnetic field and in the obstacles' magnetic fields, and standardize the values.

Assume the environment contains one target object and n obstacles, each with its own magnetic field strength function, and let the coordinates of the end of the manipulator in the corresponding magnetic-field coordinate systems be those obtained in step 3.3. The magnetic field strengths of the end-effector coordinates in the target field and in each obstacle field can then be computed.

These values are stored in a magnetic-field-strength replay pool, and each computed value is mapped onto a standard Gaussian distribution using the current mean μ and standard deviation σ of the field strengths held in the pool; the standardized magnetic field strength is

H̄ = (H − μ) / σ.
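
A small sketch of the standardization in step 3.4, assuming one running pool per magnet: each raw field strength is appended to the pool and mapped to (H − μ)/σ with the pool's current statistics. The class name, the deque capacity and the small constant added to σ are illustrative choices, not taken from the patent.

```python
# Hedged sketch of step 3.4: keep a replay pool of raw field strengths and
# standardize each new value with the pool's current mean and standard deviation.
from collections import deque
import numpy as np

class FieldStrengthPool:
    def __init__(self, capacity=int(1e6)):
        self.values = deque(maxlen=capacity)   # magnetic-field-strength replay pool

    def standardize(self, h):
        """Store h, then map it onto a standard Gaussian using pool statistics."""
        self.values.append(float(h))
        mu = np.mean(self.values)
        sigma = np.std(self.values) + 1e-8     # small constant avoids division by zero
        return (h - mu) / sigma

target_pool = FieldStrengthPool()
print(target_pool.standardize(3.2), target_pool.standardize(2.7))
```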

Step 3.5: compute the joint magnetic field strength of the target and obstacle magnets and normalize it to obtain the magnetic field reward function.

The joint magnetic field strength of the target and obstacle magnets is defined so that the standardized target field enters with a positive sign ("attraction") and the standardized obstacle fields enter with a negative sign ("repulsion"); the exact expression is given as an equation image in the original document.

The joint magnetic field strength is normalized with the Softsign function, Softsign(x) = x / (1 + |x|), and the output is defined as the magnetic field reward function rM.
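
A sketch of step 3.5 follows, under the assumption that the joint field strength is the standardized target field minus the mean of the standardized obstacle fields; the patent gives the exact combination only as an equation image, so this weighting is an assumption. The Softsign squashing itself is the standard x/(1 + |x|).

```python
# Hedged sketch of step 3.5: combine standardized field strengths into a joint
# field and squash it with Softsign to obtain the magnetic field reward r_M.
# The equal weighting of target vs. averaged obstacles is an assumption.
import numpy as np

def magnetic_field_reward(h_target_std, h_obstacles_std):
    h_obstacles_std = np.asarray(h_obstacles_std, dtype=float)
    # Target "attracts" (positive sign); obstacles jointly "repel" (negative sign).
    h_joint = h_target_std - h_obstacles_std.mean() if h_obstacles_std.size else h_target_std
    return float(h_joint / (1.0 + abs(h_joint)))   # Softsign keeps r_M in (-1, 1)

print(magnetic_field_reward(1.8, [0.4]))           # one obstacle, as in the embodiment
```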

Preferably, step S4 comprises the following steps:

Step 4.1: define the potential-function neural network in the DPBA algorithm as Φψ(s, a); its input is the state observation and the action of the manipulator, its output is the potential value of the current state-action pair, and ψ denotes the network parameters. The loss function used to update the potential-function network is the squared error between Φψ(s, a) and a gradient-free "label value" y,

L(ψ) = ½ (y − Φψ(s, a))²,

where y is expressed as

y = −rM + γΦψ(s′, a′),

with rM the magnetic field reward function obtained in step 3.5 and γ the discount factor. The parameters of the potential-function network are updated by gradient descent,

ψ′ = ψ − η ∇ψ L(ψ),

where η is the learning rate of the potential-function network.

Step 4.2: from the parameters Ψ before the update and Ψ′ after the update, the potential-based shaping reward is computed as

fM = γΦψ′(s′, a′) − Φψ(s, a).

When the potential function Φψ(s, a) is initialized to zero and updated in this way until convergence, the magnetic field reward function is fully converted into a potential-based shaping reward function, i.e. fM = rM.

The shaping reward fM is combined with the original reward r obtained in step 3.2, and (s, a, r + fM, s′) is stored in the experience replay pool as one training sample for the subsequent training of the reinforcement learning algorithm. By the optimal-policy invariance theorem, the optimal policy learned from the reward function r + fM is the same as the optimal policy learned from the original reward function r. Steps 3.2 to 4.2 are repeated until the end of the manipulator reaches the target object, the manipulator hits an obstacle or the ground, or the set maximum number of time steps is reached.
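
A PyTorch sketch of steps 4.1 and 4.2: a two-hidden-layer potential network Φψ(s, a) (sizes and optimizer as in the embodiment) is regressed onto the gradient-free label y = −rM + γΦψ(s′, a′), and the shaping reward fM = γΦψ′(s′, a′) − Φψ(s, a) is formed from the values before and after the update. The batch shapes and the plain squared-error loss are assumptions consistent with the text, not the patent's exact implementation.

```python
# Hedged sketch of the DPBA shaping step (4.1-4.2): train a potential network
# Phi(s, a) toward y = -r_M + gamma * Phi(s', a'), then form the shaping reward
# f_M = gamma * Phi_new(s', a') - Phi_old(s, a).
import torch
import torch.nn as nn

class PotentialNet(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def dpba_shaping(phi, opt, s, a, s_next, a_next, r_m, gamma=0.99):
    """One DPBA update; returns the potential-based shaping reward f_M."""
    with torch.no_grad():
        phi_old = phi(s, a)                              # Phi_psi(s, a) before the update
        y = -r_m + gamma * phi(s_next, a_next)           # gradient-free "label value"
    loss = 0.5 * ((y - phi(s, a)) ** 2).mean()           # squared-error regression
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        f_m = gamma * phi(s_next, a_next) - phi_old      # shaping reward after the update
    return f_m

obs_dim, act_dim = 12, 3
phi = PotentialNet(obs_dim, act_dim)
opt = torch.optim.Adam(phi.parameters(), lr=1e-4)        # learning rate as in the embodiment
s, a = torch.randn(8, obs_dim), torch.randn(8, act_dim)
s2, a2, r_m = torch.randn(8, obs_dim), torch.randn(8, act_dim), torch.randn(8)
print(dpba_shaping(phi, opt, s, a, s2, a2, r_m).shape)   # torch.Size([8])
```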

Preferably, step S5 comprises the following steps:

Step 5.1: randomly sample a batch of data (S, A, R + FM, S′) from the experience replay pool, where (si, ai, ri + fiM, si+1) denotes a single training sample.

Step 5.2: compute the loss function used to update the parameters of the value-function network,

L(φ) = (1/N) Σi (yi − Qφ(si, ai))²,

where yi is the gradient-free "label value" of the value function,

yi = ri + fiM + γ Qφ′(si+1, μθ′(si+1)).

The parameters of the value-function network are updated by gradient descent,

φ ← φ − β ∇φ L(φ),

where β is the learning rate of the value-function network.

Step 5.3: compute the loss function used to update the parameters of the policy network,

L(θ) = −(1/N) Σi Qφ(si, μθ(si)).

The parameters of the policy network are updated by gradient descent,

θ ← θ − α ∇θ L(θ),

where α is the learning rate of the policy network.

Step 5.4: soft-update the parameters of the target networks,

φ′ ← τφ + (1 − τ)φ′,  θ′ ← τθ + (1 − τ)θ′.

Step 5.5: repeat steps 5.1 to 5.4 a total of K times to end the episode. Repeat steps S3 to S5 until the algorithm has fully converged, yielding the optimal policy network μθ*(s) with which the manipulator avoids the obstacles and reaches the target object in a dynamic environment.
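
A condensed PyTorch sketch of one step-5 update with the shaped reward r + fM, using the standard DDPG losses named above: critic regression toward r + fM + γQφ′(s′, μθ′(s′)), actor maximizing Qφ(s, μθ(s)), and soft target updates with step τ. The plain MLP networks and the unbounded actor output are simplifications, not the patent's implementation; hyper-parameter values follow the embodiment.

```python
# Hedged sketch of one step-5 (DDPG) update using the shaped reward r + f_M.
# Plain MLP actor/critic stand-ins; hyper-parameters follow the embodiment.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

obs_dim, act_dim, gamma, tau = 12, 3, 0.99, 1e-3
actor, critic = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
actor_t, critic_t = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
actor_t.load_state_dict(actor.state_dict()); critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, shaped_r, s_next):
    # Critic: regress Q(s, a) toward y = (r + f_M) + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        y = shaped_r + gamma * critic_t(torch.cat([s_next, actor_t(s_next)], -1)).squeeze(-1)
    critic_loss = ((y - critic(torch.cat([s, a], -1)).squeeze(-1)) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s, mu(s)) by minimizing its negative mean.
    actor_loss = -critic(torch.cat([s, actor(s)], -1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target networks with step size tau.
    for net, tgt in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

batch = (torch.randn(128, obs_dim), torch.randn(128, act_dim),
         torch.randn(128), torch.randn(128, obs_dim))
ddpg_update(*batch)
```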

The beneficial effects of the present invention are as follows: compared with the prior art, the magnetic field-based reward shaping method for reinforcement-learning robotic arm control proposed here provides the manipulator with richer orientation information about the target object and the obstacles while keeping the optimal policy unchanged, thereby effectively improving the learning efficiency and convergence speed of the reinforcement learning algorithm in complex dynamic environments.

Brief Description of the Drawings

Figure 1 is a diagram of the task scene of an embodiment of the present invention in the simulation environment, in which A is the end of the robotic arm, B is the obstacle and C is the target object;

Figure 2 is the overall framework of the algorithm of the present invention;

Figure 3 is a schematic diagram of the magnetic-field coordinate system of the square permanent magnet in an embodiment of the present invention;

Figure 4 compares the experimental results of the magnetic field-based reward shaping method of the present invention with those of similar algorithms in the simulation environment.

Detailed Description of Embodiments

The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Embodiment: as shown in Figure 1, this embodiment takes the Dobot Magician robotic arm with three degrees of freedom as an example. The designed task is to use a reinforcement learning algorithm to control the arm, in a dynamic environment, so that it moves its end to the target object without touching the obstacle. The target object and the obstacle are cuboids of different sizes, and their positions change randomly in every episode. The magnetic-field reward-function design framework for reinforcement-learning manipulator control described in this embodiment is shown in Figure 2 and comprises at least the following steps:

Step S1: design the task environment, set the relevant parameters of the manipulator, the target object and the obstacle, and set the hyper-parameters of the reinforcement learning algorithm, specifically including the following steps:

Step 1.1: design the state observation of the task environment and the action of the manipulator, specifically including:

a. The environment state observation contains the rotation angles of the three joints of the manipulator, the coordinates of the end of the manipulator, and the coordinates of the center points of the target object and the obstacle;

b. The action of the manipulator is the angular velocity of the three joint motors, i.e. the angle through which the three joints rotate in one time step, limited to the range [-1°, 1°].
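
As a rough illustration of the observation and action layout just described, the sketch below builds a flat 12-dimensional observation (three joint angles, end-effector position, target center, obstacle center, for a single obstacle) and clips a 3-dimensional action to [-1°, 1°] per time step. The ordering, the units and the helper names are assumptions for illustration, not taken from the patent.

```python
# Hedged sketch of the step-1.1 observation/action layout for one obstacle.
# Ordering and units (radians for joints, metres for coordinates) are assumptions.
import numpy as np

def build_observation(joint_angles, ee_xyz, target_xyz, obstacle_xyz):
    """Concatenate the pieces listed in step 1.1 into a flat 12-d state vector."""
    return np.concatenate([joint_angles, ee_xyz, target_xyz, obstacle_xyz]).astype(np.float32)

def clip_action(delta_deg):
    """Per-step joint rotation, limited to [-1 deg, 1 deg] as in the embodiment."""
    return np.clip(np.asarray(delta_deg, dtype=np.float32), -1.0, 1.0)

obs = build_observation([0.0, 0.3, -0.1], [0.18, 0.02, 0.12],
                        [0.25, 0.05, 0.02], [0.20, -0.04, 0.06])
print(obs.shape, clip_action([0.7, -1.5, 0.2]))   # (12,) [ 0.7 -1.   0.2]
```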

Step 1.2: establish the connection with the manipulator and set the speed and acceleration ranges of the three joint rotations; specify how the target object and the obstacle are randomly generated, ensuring that the target object is within reach of the end of the manipulator and that the target object and the obstacle do not intersect.

Step 1.3: set the basic hyper-parameters according to the reinforcement learning algorithm used. This embodiment adopts the DDPG algorithm for continuous state-action spaces proposed in "Lillicrap, Timothy P., et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)". The hyper-parameters to be set include: the exploration noise; the size of the experience replay pool, 10^6; the number of updates per training round K = 20 and the batch size N = 128; two hidden layers with 256 nodes each and ReLU activation for the neural networks, whose parameters are randomly initialized; the discount factor γ = 0.99; the Adam optimizer for updating the parameters of the policy network μθ(s) and the value-function network Qφ(s, a), with learning rates 3×10⁻⁴ and 10⁻³ respectively; and the soft-update step size τ = 10⁻³ of the target networks μθ′(s) and Qφ′(s, a).
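
For convenience, the hyper-parameters listed in step 1.3 can be collected in one configuration object. The dictionary below is only a restatement of the values above; the key names are illustrative.

```python
# The DDPG hyper-parameters listed in step 1.3 of the embodiment, gathered in
# one place. Key names are illustrative; values are taken from the text above.
DDPG_HPARAMS = {
    "replay_buffer_size": int(1e6),
    "updates_per_round_K": 20,
    "batch_size_N": 128,
    "hidden_layers": 2,
    "hidden_units": 256,
    "activation": "ReLU",
    "discount_gamma": 0.99,
    "optimizer": "Adam",
    "actor_lr": 3e-4,        # policy network mu_theta(s)
    "critic_lr": 1e-3,       # value-function network Q_phi(s, a)
    "target_soft_update_tau": 1e-3,
    "exploration_noise": "added to the policy output (distribution not specified)",
}
```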

Step S2: treat the target object as a square permanent magnet of the same shape, and determine its magnetization direction and how the magnetic field strength distribution in three-dimensional space is computed; do the same for the obstacle.

This embodiment takes a square permanent magnet as an example and computes its magnetic field strength distribution analytically; for permanent magnets of other shapes, the field strength distribution can be obtained by similar analytical methods or by physics-based simulation. The magnetic-field coordinate system of the square permanent magnet is shown in Figure 3. Assume the magnetization direction is the positive z-axis and the magnetization intensity is Mc. For a square permanent magnet with lengths l, w and h along the x-, y- and z-axes respectively, the magnetic field strength components at an arbitrary point P(x, y, z) along the x-, y- and z-axes can be expressed analytically in terms of the two auxiliary functions Γ(γ1, γ2, γ3) and its companion function, where ∈ is a very small constant, set to ∈ = 10⁻⁷ in this embodiment.

[The component expressions Hx, Hy, Hz and the two auxiliary functions are given as equation images in the original document.]

The magnetic field strength of the square permanent magnet at any point in three-dimensional space is then obtained as the magnitude H = (Hx² + Hy² + Hz²)^(1/2).

In this embodiment, the size of the target object is set to l = 0.03 m, w = 0.045 m, h = 0.02 m, and the size of the obstacle is set to l = 0.038 m, w = 0.047 m, h = 0.12 m.
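
The patent's analytic expressions for the cuboid magnet field survive only as equation images, so the sketch below uses a point-dipole approximation as an explicitly labelled stand-in. It reproduces the qualitative behaviour the reward relies on (field strength rising sharply near the magnet and decaying with distance) but is not the formula of step S2.

```python
# Stand-in for step S2: point-dipole approximation of a magnet's field strength.
# This is NOT the patent's analytic cuboid formula (available only as images);
# it is a simplified model with the same qualitative near/far behaviour.
import numpy as np

def dipole_field_strength(point, moment=(0.0, 0.0, 1.0), eps=1e-7):
    """|H| of a point dipole at the origin, evaluated at `point` (magnet frame)."""
    r = np.asarray(point, dtype=float)
    m = np.asarray(moment, dtype=float)
    r_norm = np.linalg.norm(r) + eps                   # eps plays the role of the small constant
    r_hat = r / r_norm
    h = (3.0 * np.dot(m, r_hat) * r_hat - m) / (4.0 * np.pi * r_norm ** 3)
    return float(np.linalg.norm(h))

# Field strength 5 cm above a z-magnetized "target" vs. 20 cm away to the side.
print(dipole_field_strength([0.0, 0.0, 0.05]), dipole_field_strength([0.2, 0.0, 0.0]))
```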

Step S3: the manipulator interacts with the environment, training data are collected, and the magnetic field strength at the end-effector coordinates in the target and obstacle magnetic fields is computed from the next state; after standardization and normalization the magnetic field reward function is obtained. This specifically includes the following steps:

Step 3.1: initialize the rotation angles of the three joints of the manipulator to zero and read the coordinates of the end of the manipulator; randomly set the positions of the target object and the obstacle, and read the coordinates of their center points in the world coordinate system. This gives the initial value of the state observation.

Step 3.2: according to the current state observation s and the policy network μθ(s), the manipulator outputs an action and noise is applied to it to obtain a; after interacting with the environment, the next state s′ and the original reward value r are obtained. Provided that the rotation angles of the three joints in the next state lie within their corresponding working ranges, the manipulator is controlled to move to the next state. In this embodiment the working ranges of the three joints are [-90°, 90°], [0°, 85°] and [-10°, 90°] respectively, and the original reward function is set as given in the equation image of the original document.

Step 3.3: transform the end-effector coordinates in the next state from the world coordinate system into the magnetic-field coordinate systems of the target magnet and the obstacle magnet. This embodiment takes the target magnet as an example; the coordinate transformation for the obstacle magnet is obtained in the same way.

Suppose that in the next state the coordinates of the end of the manipulator in the world coordinate system are pW, the translation of the origin of the target's magnetic-field coordinate system relative to the origin of the world coordinate system is (Tx, Ty, Tz), and the rotation angles of the target's magnetic-field coordinate system relative to the world coordinate system about the x-, y- and z-axes are θx, θy, θz, with positive directions following the right-hand rule. The coordinates pT of the end of the manipulator in the target's magnetic-field coordinate system are then obtained by applying this translation together with the rotation transformation matrices of the coordinate system about the x-, y- and z-axes.

[The coordinate-transformation formula and the explicit rotation matrices are given as equation images in the original document.]

Step 3.4: compute the magnetic field strength at the end-effector coordinates of the next state in the target's magnetic field and in the obstacle's magnetic field, and standardize the values.

Assume the environment contains one target object and n obstacles (n = 1 in this embodiment), each with its own magnetic field strength function, and let the coordinates of the end of the manipulator in the corresponding magnetic-field coordinate systems be those obtained in step 3.3. The magnetic field strengths of the end-effector coordinates in the target field and in the obstacle field can then be computed.

Because different calculation methods generally yield field strengths in different ranges, the computed values need to be standardized. The present invention introduces a magnetic-field-strength replay pool to store the computed values and maps each computed value onto a standard Gaussian distribution using the current mean μ and standard deviation σ of the field strengths held in the pool:

H̄ = (H − μ) / σ,

where H̄ is the standardized magnetic field strength. In this embodiment the size of the magnetic-field-strength replay pool is 10⁶.

Step 3.5: compute the joint magnetic field strength of the target and obstacle magnets and normalize it to obtain the magnetic field reward function.

In the present invention the task of the manipulator is to move its end to the target object while avoiding the obstacles, so the joint magnetic field strength of the target and obstacle magnets is defined so that the "attraction" exerted by the target magnet on the end of the manipulator is balanced against the "repulsion" exerted by all n obstacles (the exact expression is given as an equation image in the original document). According to the characteristics of the magnetic field strength distribution, the field strength near the target tends to positive infinity and the field strength near an obstacle tends to negative infinity. To keep the joint field strength within a reasonable range while preserving its distribution, the present invention normalizes it with the Softsign function, Softsign(x) = x / (1 + |x|), and defines the output as the magnetic field reward function rM.

Step S4: use the DPBA algorithm to convert the magnetic field reward function into a potential-based shaping reward function and store it in the experience replay pool together with the training data. The DPBA algorithm was proposed in "Harutyunyan A, Devlin S, Vrancx P, et al. Expressing arbitrary reward functions as potential-based advice [C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2015, 29(1)." and can convert a reward function given by an arbitrary expert into the form of potential-based reward shaping, so that the optimal-policy invariance theorem is satisfied. This specifically includes the following steps:

Step 4.1: define the potential-function neural network in the DPBA algorithm as Φψ(s, a); its input is the state observation and the action of the manipulator, its output is the potential value of the current state-action pair, and ψ denotes the network parameters. In this embodiment the potential-function network has two hidden layers with 256 nodes each and ReLU activation; its parameters are updated with the Adam optimizer at a learning rate of 10⁻⁴. The loss function used to update the potential-function network is the squared error between Φψ(s, a) and a gradient-free "label value" y,

L(ψ) = ½ (y − Φψ(s, a))²,

where

y = −rM + γΦψ(s′, a′),

with rM the magnetic field reward function obtained in step 3.5 and γ the discount factor. The parameters of the potential-function network are updated by gradient descent,

ψ′ = ψ − η ∇ψ L(ψ),

where η is the learning rate of the potential-function network.

Step 4.2: from the parameters Ψ before the update and Ψ′ after the update, the potential-based shaping reward is computed as

fM = γΦψ′(s′, a′) − Φψ(s, a).

When the potential function Φψ(s, a) is initialized to zero and updated in this way until convergence, the magnetic field reward function is fully converted into a potential-based shaping reward function, i.e. fM = rM.

The shaping reward fM is combined with the original reward r obtained in step 3.2, and (s, a, r + fM, s′) is stored in the experience replay pool as one training sample for the subsequent training of the reinforcement learning algorithm. By the optimal-policy invariance theorem, the optimal policy learned from the reward function r + fM is the same as the optimal policy learned from the original reward function r. Steps 3.2 to 4.2 are repeated until the end of the manipulator reaches the target object, the manipulator hits the obstacle or the ground, or the maximum number of time steps (set to 200 in this embodiment) is reached.
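
The episode loop implied by steps 3.2 to 4.2 can be sketched as follows: interact, shape the reward, store (s, a, r + fM, s′), and stop when the end reaches the target, the arm hits the obstacle or the ground, or 200 time steps elapse. The environment, policy and shaping callables are passed in; the dummy stand-ins at the bottom exist only so the sketch runs on its own and are not the patent's implementation.

```python
# Hedged sketch of one data-collection episode (steps 3.2-4.2): interact, shape
# the reward with DPBA, store (s, a, r + f_M, s'), stop on success/collision/200 steps.
import random

def run_episode(env, policy, shaping, replay, max_steps=200):
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)                                   # step 3.2: action plus exploration noise
        s_next, r, reached, collided = env.step(a)
        f_m = shaping(s, a, s_next)                     # steps 3.3-4.2: magnetic shaping reward
        replay.append((s, a, r + f_m, s_next))          # shaped transition into the replay pool
        if reached or collided:                         # success, or obstacle/ground contact
            break
        s = s_next
    return replay

class _DummyEnv:                                        # stand-in so the sketch is runnable
    def reset(self): return [0.0] * 12
    def step(self, a): return [0.0] * 12, -0.01, random.random() < 0.01, False

buffer = run_episode(_DummyEnv(), policy=lambda s: [0.0, 0.0, 0.0],
                     shaping=lambda s, a, s2: 0.0, replay=[])
print(len(buffer))
```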

Step S5: sample a batch of data from the experience replay pool and use the reinforcement learning algorithm to train the optimal policy with which the manipulator avoids the obstacle and reaches the target object in a dynamic environment, specifically including the following steps:

Step 5.1: randomly sample a batch of N samples (S, A, R + FM, S′) from the experience replay pool, where (si, ai, ri + fiM, si+1) denotes a single training sample.

Step 5.2: compute the loss function used to update the parameters of the value-function network,

L(φ) = (1/N) Σi (yi − Qφ(si, ai))²,

where yi is the gradient-free "label value" of the value function,

yi = ri + fiM + γ Qφ′(si+1, μθ′(si+1)).

The parameters of the value-function network are updated by gradient descent,

φ ← φ − β ∇φ L(φ),

where β is the learning rate of the value-function network.

Step 5.3: compute the loss function used to update the parameters of the policy network,

L(θ) = −(1/N) Σi Qφ(si, μθ(si)).

The parameters of the policy network are updated by gradient descent,

θ ← θ − α ∇θ L(θ),

where α is the learning rate of the policy network.

Step 5.4: soft-update the parameters of the target networks,

φ′ ← τφ + (1 − τ)φ′,  θ′ ← τθ + (1 − τ)θ′.

Step 5.5: repeat steps 5.1 to 5.4 a total of K times to end the episode. Repeat steps S3 to S5 until the algorithm has fully converged, yielding the optimal policy network μθ*(s) with which the manipulator avoids the obstacle and reaches the target object in a dynamic environment.

This embodiment compares the method disclosed in the present invention with similar algorithms trained in the simulation environment of Figure 1; the results are compared in Figure 4. As can be seen from the figure, the per-episode success rate of the magnetic field-based reward shaping method during learning is clearly higher than that of the original reward and of distance-based reward shaping. The method disclosed in the present invention can therefore effectively improve the learning efficiency of reinforcement learning algorithms in manipulator motion-control tasks.

The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several modifications and improvements without departing from the inventive concept of the present invention, and these all fall within the scope of protection of the present invention.

Claims (6)

1. A magnetic field-based reward shaping method for reinforcement-learning mechanical arm control, characterized by comprising the following steps:
s1, designing a task environment, setting relevant parameters of a mechanical arm, a target object and an obstacle, and setting the hyper-parameters of a reinforcement learning algorithm;
s2, regarding the target object as a square permanent magnet with the same shape, and determining the magnetization direction and the calculation mode of the three-dimensional magnetic field intensity distribution, the same applying to the obstacle;
s3, the mechanical arm interacting with the environment, collecting training data, calculating the magnetic field intensity at the tail-end coordinates of the mechanical arm in the magnetic fields of the target object and the obstacle according to the next state, and obtaining a magnetic field reward function after standardization and normalization;
s4, converting the magnetic field reward function into a potential-based shaping reward function by using the DPBA algorithm, and storing it together with the training data in an experience replay pool;
s5, sampling a batch of data from the experience replay pool, and using the reinforcement learning algorithm to train the optimal policy with which the mechanical arm avoids the obstacle and reaches the target object in a dynamic environment.
2. The method for magnetic field-based reward shaping in reinforcement learning robot arm control according to claim 1, wherein the step S1 comprises the steps of:
step 1.1, designing a state observation value of a task environment and an action value of a mechanical arm, and specifically comprising the following steps:
a. the environment state observation value comprises the rotation angles of the three joints of the mechanical arm, the coordinates of the tail end of the mechanical arm, and the coordinates of the center points of the target object and the obstacle;
b. the action value of the mechanical arm is the angular velocity of the three joint motors, namely the angle through which the three joints rotate in a unit time step;
step 1.2, establishing a connection with the mechanical arm, and setting the rotating speed and acceleration ranges of the three joints; specifying a random generation mode of the target object and the obstacle, ensuring that the target object is within the range reachable by the tail end of the mechanical arm and that the target object and the obstacle do not intersect;
step 1.3, setting the basic hyper-parameters of the reinforcement learning algorithm, at least comprising: the exploration noise; the size of the experience replay pool; the number of updates K of each training round and the size N of the data batch used for each update; the number of layers of the neural networks, the number of nodes of each layer and the activation function; the discount factor γ; the optimizer and learning rates for updating the parameters of the policy network μθ(s) and the value-function network Qφ(s, a); and the soft-update step size τ of the target networks μθ′(s) and Qφ′(s, a).
3. The method as claimed in claim 1, wherein in step S2 the magnetic field intensity distribution of the square permanent magnet in three-dimensional space is calculated analytically as follows:
assuming that the magnetization direction is the positive z-axis direction and the magnetization intensity is Mc, for a square permanent magnet with lengths l, w, h along the x-axis, y-axis and z-axis respectively, the magnetic field strength components at any point P(x, y, z) in three-dimensional space in the directions of the x-axis, y-axis and z-axis are expressed analytically in terms of two auxiliary functions, Γ(γ1, γ2, γ3) and a companion function, wherein ∈ is a very small value (the component expressions and the two auxiliary functions are given as equation images in the original document); the magnetic field strength of the square permanent magnet at any point in three-dimensional space is then obtained as the magnitude of the vector (Hx, Hy, Hz).
4. The method for magnetic field-based reward shaping in reinforcement-learning mechanical arm control of claim 1, wherein the step S3 comprises the steps of:
step 3.1, initializing the rotation angles of the three joints of the mechanical arm to zero, and reading the coordinates of the tail end of the mechanical arm; randomly setting the positions of the target object and the obstacle, reading the coordinates of the center points of the target object and the obstacle in a world coordinate system, and obtaining an initial value of the state observation value;
step 3.2, the mechanical arm outputting an action according to the current state observation value s and the policy and applying noise to the action to obtain a, and obtaining a next state s′ and an original reward value r after interaction with the environment; under the condition that the rotation angles of the three joints of the mechanical arm in the next state are within the corresponding working ranges, controlling the mechanical arm to move to the next state;
step 3.3, converting the tail-end coordinates of the mechanical arm in the next state from the world coordinate system into the magnetic field coordinate systems of the target object magnet and the obstacle magnet;
letting the coordinates of the tail end of the arm in the world coordinate system in the next state be pW, the translation amount of the origin of the target object's magnetic field coordinate system relative to the origin of the world coordinate system be (Tx, Ty, Tz), and the rotation angles of the target object's magnetic field coordinate system around the x-axis, the y-axis and the z-axis relative to the world coordinate system be θx, θy, θz, their positive directions following the right-hand rule, the coordinates pT of the tail end of the mechanical arm in the magnetic field coordinate system of the target object are obtained by applying the translation and the rotation transformation matrices of the coordinate system around the x-axis, the y-axis and the z-axis (the transformation formula and the explicit rotation matrices are given as equation images in the original document);
step 3.4, calculating the magnetic field strength of the tail-end coordinates of the mechanical arm in the target object magnetic field and the obstacle magnetic field in the next state, and standardizing it:
assuming that 1 target object and n obstacles exist in the environment, with respective magnetic field strength calculation functions for the target object magnet and the obstacle magnets, and with the coordinates of the tail end of the mechanical arm in the corresponding magnetic field coordinate systems given by step 3.3, the magnetic field strengths HT and HO1, …, HOn of the tail-end coordinates in the target and obstacle magnetic fields are calculated; these values are stored in a magnetic-field-intensity replay pool, and each calculated value is mapped onto a standard Gaussian distribution according to the current mean μ and standard deviation σ of the field strengths in the pool, the standardized magnetic field strength being H̄ = (H − μ)/σ;
step 3.5, calculating the joint magnetic field strength of the target object and obstacle magnets, and normalizing it to obtain the magnetic field reward function:
the joint magnetic field strength of the target and obstacle magnets is defined such that the standardized target field enters with a positive sign and the standardized obstacle fields enter with a negative sign (the exact expression is given as an equation image in the original document); the joint magnetic field strength is normalized by the Softsign function, and the output result is defined as the magnetic field reward function rM.
5. The method for magnetic field-based reward shaping in reinforcement-learning mechanical arm control of claim 1, wherein the step S4 comprises the steps of:
step 4.1, defining the potential-function neural network in the DPBA algorithm as Φψ(s, a), whose input is the state observation value and the action value of the mechanical arm and whose output is the potential energy value of the current state-action pair, ψ being a parameter of the neural network; the loss function used to update the potential-function neural network is the squared error between Φψ(s, a) and a gradient-free "label value" y, L(ψ) = ½(y − Φψ(s, a))², wherein y is expressed as
y = −rM + γΦψ(s′, a′),
wherein rM is the magnetic field reward function obtained in step 3.5 and γ is the discount factor; the parameters of the potential-function neural network are updated by gradient descent, ψ′ = ψ − η∇ψL(ψ), wherein η is the learning rate of the potential-function neural network;
step 4.2, according to the parameter Ψ before updating and the parameter Ψ′ after updating, the shaping reward function based on potential energy is calculated as
fM = γΦψ′(s′, a′) − Φψ(s, a);
when the potential function Φψ(s, a) is initialized to zero and updated to final convergence in the above manner, a complete conversion of the magnetic field reward function into a potential-based shaping reward function is achieved, namely fM = rM;
the shaping reward fM is combined with the original reward r obtained in step 3.2, and (s, a, r + fM, s′) is stored as a set of training data in the experience replay pool for subsequent training of the reinforcement learning algorithm; according to the optimal-policy invariance theorem, the optimal policy learned by the algorithm from the reward function r + fM is consistent with the optimal policy learned from the original reward function r; steps 3.2 to 4.2 are repeated until the tail end of the mechanical arm reaches the target object, or the mechanical arm touches an obstacle or the ground, or a set maximum time step is reached.
6. The method for magnetic field-based reward shaping in reinforcement learning robot arm control of claim 1, wherein the step S5 comprises the steps of:
step 5.1, randomly sampling a batch of data (S, A, R + F) in an empirical playback pool M S'), wherein (S) is i ,a i ,r i +f i M ,s i+1 ) Representing a single training data;
step 5.2, calculating the loss function used to update the value function network parameters, as given in formula image FDA0003706057020000051 (not reproduced);
wherein y_i is the gradient-free label value of the value function, expressed by formula image FDA0003706057020000052 (not reproduced);
the parameters of the value function network are updated by gradient descent as given in formula image FDA0003706057020000053 (not reproduced);
wherein β is the learning rate for the value function network update;
step 5.3, calculating the loss function used to update the policy network parameters, as given in formula image FDA0003706057020000054 (not reproduced);
the parameters of the policy network are updated by gradient descent as given in formula image FDA0003706057020000055 (not reproduced);
wherein α is the learning rate for the policy network update;
step 5.4, soft-updating the parameters of the target networks, as given in formula image FDA0003706057020000056 (not reproduced);
step 5.5, repeating steps 5.1 to 5.4 K times to complete the current training round; repeating steps S3 to S5 until the algorithm fully converges, yielding the optimal policy network (formula image FDA0003706057020000057, not reproduced) with which the mechanical arm avoids obstacles and reaches the target object in a dynamic environment.
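For illustration, a PyTorch sketch of one cycle of steps 5.1 to 5.4 written as a DDPG-style actor-critic update on the shaped reward r + f_M; the value and policy losses and the soft-update rule appear only as formula images, so the exact algorithm (for example whether an entropy term as in SAC is used) is an assumption, and the names train_batch, actor, critic and tau are hypothetical:

import torch
import torch.nn.functional as F

def train_batch(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    # batch: tensors (s, a, r_shaped, s_next) sampled from the replay pool,
    # where r_shaped = r + f_M already contains the shaping reward (step 5.1).
    s, a, r_shaped, s_next = batch

    # Step 5.2: update the value function network toward the label y_i.
    with torch.no_grad():
        y = r_shaped + gamma * critic_targ(s_next, actor_targ(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)      # learning rate beta set in critic_opt
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 5.3: update the policy network by maximizing the critic's value estimate.
    actor_loss = -critic(s, actor(s)).mean()       # learning rate alpha set in actor_opt
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 5.4: soft-update the target networks.
    with torch.no_grad():
        for targ, src in ((actor_targ, actor), (critic_targ, critic)):
            for p_t, p in zip(targ.parameters(), src.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)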
CN202210705509.0A 2022-06-21 2022-06-21 Magnetic field-based reward shaping method for reinforcement learning mechanical arm control Active CN115179280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210705509.0A CN115179280B (en) 2022-06-21 2022-06-21 Magnetic field-based reward shaping method for reinforcement learning mechanical arm control


Publications (2)

Publication Number Publication Date
CN115179280A true CN115179280A (en) 2022-10-14
CN115179280B CN115179280B (en) 2025-07-22

Family

ID=83514905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210705509.0A Active CN115179280B (en) 2022-06-21 2022-06-21 Magnetic field-based reward shaping method for reinforcement learning mechanical arm control

Country Status (1)

Country Link
CN (1) CN115179280B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104914874A (en) * 2015-06-09 2015-09-16 长安大学 Unmanned aerial vehicle attitude control system and method based on self-adaption complementation fusion
US20190104994A1 (en) * 2017-10-09 2019-04-11 Vanderbilt University Robotic capsule system with magnetic actuation and localization
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN111515961A (en) * 2020-06-02 2020-08-11 南京大学 Reinforcement learning reward method suitable for mobile mechanical arm
CN113889737A (en) * 2021-09-30 2022-01-04 西华大学 Substrate integrated waveguide parameter optimization method and structure based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN LIQUN: "The Higher One Looks, the Loftier It Seems; the Deeper One Drills, the Harder It Gets — A Review of 'Spacecraft Attitude Dynamics'", Journal of Shangqiu Teachers College, no. 4, 30 December 1997 (1997-12-30) *
WEI ZHIHUA: "Research on Mobile Robot Navigation and Environment State Detection Based on Reinforcement Learning", Master's thesis, 15 January 2007 (2007-01-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117140527A (en) * 2023-09-27 2023-12-01 中山大学·深圳 A robotic arm control method and system based on deep reinforcement learning algorithm
CN117140527B (en) * 2023-09-27 2024-04-26 中山大学·深圳 Mechanical arm control method and system based on deep reinforcement learning algorithm
CN118123816A (en) * 2024-02-18 2024-06-04 东莞理工学院 Deep reinforcement learning robot arm motion planning method, system and storage medium based on constraint model
CN120326639A (en) * 2025-06-19 2025-07-18 北京中联国成科技有限公司 A robot high-precision anti-collision system and method

Also Published As

Publication number Publication date
CN115179280B (en) 2025-07-22

Similar Documents

Publication Publication Date Title
CN115179280A (en) Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning
CN110750096B (en) Collision avoidance planning method for mobile robots based on deep reinforcement learning in static environment
Juang et al. Wall-following control of a hexapod robot using a data-driven fuzzy controller learned through differential evolution
CN106096729B (en) A kind of depth-size strategy learning method towards complex task in extensive environment
Bai et al. Path planning of autonomous mobile robot in comprehensive unknown environment using deep reinforcement learning
CN116533234A (en) Multi-axis hole assembly method and system based on layered reinforcement learning and distributed learning
CN110362081B (en) Mobile robot path planning method
CN114888801A (en) Mechanical arm control method and system based on offline strategy reinforcement learning
Toan et al. Mapless navigation with deep reinforcement learning based on the convolutional proximal policy optimization network
CN116922391B (en) Autonomous learning and optimizing method for spatial mechanical arm skills
Luo et al. Calibration-free monocular vision-based robot manipulations with occlusion awareness
CN111309035A (en) Multi-robot cooperative movement and dynamic obstacle avoidance method, device, equipment and medium
Yan et al. Path planning for mobile robot's continuous action space based on deep reinforcement learning
CN119200410A (en) Obstacle avoidance trajectory planning and adaptive tracking control method for tower cranes for intelligent construction
CN116852347A (en) A state estimation and decision control method for autonomous grasping of non-cooperative targets
CN119388427B (en) Method and system for controlling the stable motion trajectory of a robotic arm when grasping a large thin plate
Luo et al. Balance between efficient and effective learning: Dense2sparse reward shaping for robot manipulation with environment uncertainty
Ding et al. Magnetic field-based reward shaping for goal-conditioned reinforcement learning
Kumar et al. Kinematic control of a redundant manipulator using an inverse-forward adaptive scheme with a KSOM based hint generator
Fang et al. Quadrotor navigation in dynamic environments with deep reinforcement learning
CN118536684A (en) Multi-agent path planning method based on deep reinforcement learning
Duan et al. Learning from demonstrations: An intuitive VR environment for imitation learning of construction robots
Sang et al. Motion planning of space robot obstacle avoidance based on DDPG algorithm
Toan et al. Environment exploration for mapless navigation based on deep reinforcement learning
Shen et al. Energy-Efficient Motion Planning and Control for Robotic Arms via Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant