
CN115179280A - Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning - Google Patents


Info

Publication number
CN115179280A
Authority
CN
China
Prior art keywords
magnetic field
mechanical arm
target object
function
reward
Prior art date
Legal status
Granted
Application number
CN202210705509.0A
Other languages
Chinese (zh)
Other versions
CN115179280B (en)
Inventor
王志
丁泓宇
王博
陈春林
辛博
朱张青
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202210705509.0A
Publication of CN115179280A
Application granted
Publication of CN115179280B
Legal status: Active

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop: learning, adaptive, model based, rule based expert control
    • B25J9/1674 Programme controls characterised by safety, monitoring, diagnostic
    • B25J9/1676 Avoiding collision or forbidden zones

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a magnetic field-based reward shaping method for reinforcement-learning mechanical arm control, comprising the following steps: S1, designing a task environment, setting the relevant parameters of the mechanical arm, the target object and the obstacle, and setting the hyper-parameters of the reinforcement learning algorithm; S2, regarding the target object and the obstacle as permanent magnets with the same shapes as the target object and the obstacle, and determining how the magnetic field intensity distribution in three-dimensional space is calculated; S3, letting the mechanical arm interact with the environment, collecting training data, and calculating the magnetic field intensity at the end coordinates of the mechanical arm in the target and obstacle magnetic fields to obtain a magnetic field reward function; S4, converting the magnetic field reward function into a potential-based shaping reward function using the DPBA algorithm, and storing the shaping reward function and the training data in an experience replay pool; and S5, sampling a batch of data from the experience replay pool and training the mechanical arm with a reinforcement learning algorithm to complete the specified task. The invention provides the mechanical arm with richer orientation information about the target object and the obstacle, thereby improving the learning efficiency of the reinforcement learning algorithm.

Description

A Magnetic Field-Based Reward Shaping Method for Reinforcement-Learning Robotic Arm Control

Technical Field

The invention belongs to the field of robot control and specifically relates to a magnetic field-based reward shaping method for reinforcement-learning robotic arm control.

Background Art

Traditional manipulator control methods usually model the manipulator with kinematic and dynamic equations and solve for the end-effector pose and the angle of each joint. As industrial application scenarios become more complex and dynamic, the computational complexity of traditional model-based manipulator control keeps growing; such methods cannot adapt to changes in the external environment in time and lack the ability to learn from and generalize to the environment autonomously.

In recent years, reinforcement learning has been widely applied to robotic arm control tasks thanks to its unique advantages in sequential decision-making. By mapping the environmental state information acquired by sensors directly to the actions executed by the manipulator, it achieves end-to-end control and offers a new way to solve control problems for complex, continuous, high-dimensional systems. The optimization objective of reinforcement learning is to find the optimal policy that maximizes the cumulative reward in a Markov decision process (MDP), so designing a well-founded reward function is particularly important. Existing reward-function designs for manipulator motion-control tasks are relatively simple, including orientation-based and heuristic reward functions; they cannot provide rich reward signals for the manipulator in complex dynamic environments and therefore do not effectively improve learning efficiency.

Patent publication CN113894787A discloses a design method for a heuristic reward function in reinforcement-learning motion planning for robotic arms, including: establishing a heuristic function for the motion-planning problem; constructing a heuristic reward function from the heuristic function; determining the parameter values of the heuristic reward function; and using the constructed reward function to train a neural-network motion planner. That invention sets the heuristic reward based on the straight-line distance from the end of the arm to the target position, so it cannot provide higher-order reward signals and cannot guarantee the optimality of the learned policy.

Summary of the Invention

The purpose of the present invention is to address the problem that the reward functions used in existing reinforcement-learning manipulator control methods provide limited information in complex dynamic environments. It proposes a magnetic field-based reward shaping method for reinforcement-learning robotic arm control that, while keeping the optimal policy unchanged, provides the manipulator with richer orientation information about the target object and the obstacles, thereby improving the learning efficiency and convergence speed of the reinforcement learning algorithm.

The technical solution of the present invention is a magnetic field-based reward shaping method for reinforcement-learning robotic arm control, characterized in that it comprises the following steps:

S1. Design the task environment, set the relevant parameters of the manipulator, the target object and the obstacles, and set the hyper-parameters of the reinforcement learning algorithm;

S2. Treat the target object as a square permanent magnet of the same shape, and determine its magnetization direction and how the magnetic field strength distribution in three-dimensional space is computed; do the same for each obstacle;

S3. Let the manipulator interact with the environment, collect training data, compute the magnetic field strength at the end-effector coordinates in the target and obstacle magnetic fields according to the next state, and obtain the magnetic field reward function after standardization and normalization;

S4. Use the DPBA algorithm to convert the magnetic field reward function into a potential-based shaping reward function, and store it in the experience replay pool together with the training data;

S5. Sample a batch of data from the experience replay pool and use the reinforcement learning algorithm to train the optimal policy with which the manipulator avoids the obstacles and reaches the target object in a dynamic environment.

Preferably, step S1 comprises the following steps:

Step 1.1: design the state observation of the task environment and the action of the manipulator, specifically including:

a. The environment state observation contains the rotation angles of the three joints of the manipulator, the coordinates of the end of the manipulator, and the coordinates of the center points of the target object and the obstacle;

b. The action of the manipulator is the angular velocity of the three joint motors, i.e. the angle through which the three joints rotate in one time step.

Step 1.2: establish the connection with the manipulator and set the speed and acceleration ranges of the three joint rotations; specify how the target object and the obstacles are randomly generated, ensuring that the target object is within reach of the end of the manipulator and that the target object and the obstacles do not intersect.

Step 1.3: set the basic hyper-parameters of the reinforcement learning algorithm, including at least: the exploration noise; the size of the experience replay pool; the number of updates K per training round and the batch size N used for each update; the number of layers of the neural networks, the number of nodes per layer and the activation function; the discount factor γ; the optimizer and learning rates used to update the parameters of the policy network μθ(s) and the value-function network Qφ(s, a); and the soft-update step size τ of the target networks μθ′(s) and Qφ′(s, a).

In step S2, the analytical calculation of the magnetic field strength distribution of a square permanent magnet in three-dimensional space is as follows:

Assume the magnetization direction is the positive z-axis and the magnetization intensity is Mc. For a square permanent magnet with lengths l, w and h along the x-, y- and z-axes respectively, the magnetic field strength components at an arbitrary point P(x, y, z) in three-dimensional space along the x-, y- and z-axes can be expressed analytically in terms of two auxiliary functions, Γ(γ1, γ2, γ3) and a companion function, where ∈ is a very small constant.

[The component expressions Hx, Hy, Hz and the two auxiliary functions are given as equation images in the original document.]

The magnetic field strength of the square permanent magnet at any point in three-dimensional space is then obtained as the magnitude H = (Hx² + Hy² + Hz²)^(1/2).

Preferably, step S3 comprises the following steps:

Step 3.1: initialize the rotation angles of the three joints of the manipulator to zero and read the coordinates of the end of the manipulator; randomly set the positions of the target object and the obstacles, and read the coordinates of their center points in the world coordinate system. This gives the initial value of the state observation.

Step 3.2: according to the current state observation s and the policy, the manipulator outputs an action and noise is applied to it to obtain a; after interacting with the environment, the next state s′ and the original reward value r are obtained. Provided that the rotation angles of the three joints in the next state lie within their corresponding working ranges, the manipulator is controlled to move to the next state.

Step 3.3: transform the end-effector coordinates in the next state from the world coordinate system into the magnetic-field coordinate systems of the target magnet and the obstacle magnets.

Suppose that in the next state the coordinates of the end of the manipulator in the world coordinate system are pW, the translation of the origin of the target's magnetic-field coordinate system relative to the origin of the world coordinate system is (Tx, Ty, Tz), and the rotation angles of the target's magnetic-field coordinate system relative to the world coordinate system about the x-, y- and z-axes are θx, θy, θz, with positive directions following the right-hand rule. The coordinates pT of the end of the manipulator in the target's magnetic-field coordinate system are then obtained by applying this translation together with the rotation transformation matrices of the coordinate system about the x-, y- and z-axes.

[The coordinate-transformation formula and the explicit rotation matrices are given as equation images in the original document.]
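
A minimal NumPy sketch of the world-to-magnet-frame transform of step 3.3 is given below. The patent's exact composition order and sign convention appear only in the original equation images, so the Z·Y·X ordering used here is an assumption for illustration.

```python
# Hedged sketch of step 3.3: transform the end-effector position from the world
# frame into a magnet's field frame. The rotation composition order is an
# assumption; the patent's exact formula appears only as an equation image.
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def world_to_magnet(p_world, translation, angles_xyz):
    """Return the end-effector position expressed in the magnet's field frame."""
    tx, ty, tz = angles_xyz
    # Rotation of the magnet frame relative to the world frame (assumed Z*Y*X order).
    r_world_from_magnet = rot_z(tz) @ rot_y(ty) @ rot_x(tx)
    # Express the world-frame point in the magnet frame: un-translate, then un-rotate.
    return r_world_from_magnet.T @ (np.asarray(p_world) - np.asarray(translation))

# Example: target magnet sitting at (0.2, 0.0, 0.05) m, rotated 90 degrees about z.
p_T = world_to_magnet([0.25, 0.05, 0.10], [0.2, 0.0, 0.05], [0.0, 0.0, np.pi / 2])
print(p_T)
```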

Step 3.4: compute the magnetic field strength at the end-effector coordinates of the next state in the target's magnetic field and in the obstacles' magnetic fields, and standardize the values.

Assume the environment contains one target object and n obstacles, each with its own magnetic field strength function, and let the coordinates of the end of the manipulator in the corresponding magnetic-field coordinate systems be those obtained in step 3.3. The magnetic field strengths of the end-effector coordinates in the target field and in each obstacle field can then be computed.

These values are stored in a magnetic-field-strength replay pool, and each computed value is mapped onto a standard Gaussian distribution using the current mean μ and standard deviation σ of the field strengths held in the pool; the standardized magnetic field strength is

H̄ = (H − μ) / σ.
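
A small sketch of the standardization in step 3.4, assuming one running pool per magnet: each raw field strength is appended to the pool and mapped to (H − μ)/σ with the pool's current statistics. The class name, the deque capacity and the small constant added to σ are illustrative choices, not taken from the patent.

```python
# Hedged sketch of step 3.4: keep a replay pool of raw field strengths and
# standardize each new value with the pool's current mean and standard deviation.
from collections import deque
import numpy as np

class FieldStrengthPool:
    def __init__(self, capacity=int(1e6)):
        self.values = deque(maxlen=capacity)   # magnetic-field-strength replay pool

    def standardize(self, h):
        """Store h, then map it onto a standard Gaussian using pool statistics."""
        self.values.append(float(h))
        mu = np.mean(self.values)
        sigma = np.std(self.values) + 1e-8     # small constant avoids division by zero
        return (h - mu) / sigma

target_pool = FieldStrengthPool()
print(target_pool.standardize(3.2), target_pool.standardize(2.7))
```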

Step 3.5: compute the joint magnetic field strength of the target and obstacle magnets and normalize it to obtain the magnetic field reward function.

The joint magnetic field strength of the target and obstacle magnets is defined so that the standardized target field enters with a positive sign ("attraction") and the standardized obstacle fields enter with a negative sign ("repulsion"); the exact expression is given as an equation image in the original document.

The joint magnetic field strength is normalized with the Softsign function, Softsign(x) = x / (1 + |x|), and the output is defined as the magnetic field reward function rM.
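
A sketch of step 3.5 follows, under the assumption that the joint field strength is the standardized target field minus the mean of the standardized obstacle fields; the patent gives the exact combination only as an equation image, so this weighting is an assumption. The Softsign squashing itself is the standard x/(1 + |x|).

```python
# Hedged sketch of step 3.5: combine standardized field strengths into a joint
# field and squash it with Softsign to obtain the magnetic field reward r_M.
# The equal weighting of target vs. averaged obstacles is an assumption.
import numpy as np

def magnetic_field_reward(h_target_std, h_obstacles_std):
    h_obstacles_std = np.asarray(h_obstacles_std, dtype=float)
    # Target "attracts" (positive sign); obstacles jointly "repel" (negative sign).
    h_joint = h_target_std - h_obstacles_std.mean() if h_obstacles_std.size else h_target_std
    return float(h_joint / (1.0 + abs(h_joint)))   # Softsign keeps r_M in (-1, 1)

print(magnetic_field_reward(1.8, [0.4]))           # one obstacle, as in the embodiment
```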

Preferably, step S4 comprises the following steps:

Step 4.1: define the potential-function neural network in the DPBA algorithm as Φψ(s, a); its input is the state observation and the action of the manipulator, its output is the potential value of the current state-action pair, and ψ denotes the network parameters. The loss function used to update the potential-function network is the squared error between Φψ(s, a) and a gradient-free "label value" y,

L(ψ) = ½ (y − Φψ(s, a))²,

where y is expressed as

y = −rM + γΦψ(s′, a′),

with rM the magnetic field reward function obtained in step 3.5 and γ the discount factor. The parameters of the potential-function network are updated by gradient descent,

ψ′ = ψ − η ∇ψ L(ψ),

where η is the learning rate of the potential-function network.

Step 4.2: from the parameters Ψ before the update and Ψ′ after the update, the potential-based shaping reward is computed as

fM = γΦψ′(s′, a′) − Φψ(s, a).

When the potential function Φψ(s, a) is initialized to zero and updated in this way until convergence, the magnetic field reward function is fully converted into a potential-based shaping reward function, i.e. fM = rM.

The shaping reward fM is combined with the original reward r obtained in step 3.2, and (s, a, r + fM, s′) is stored in the experience replay pool as one training sample for the subsequent training of the reinforcement learning algorithm. By the optimal-policy invariance theorem, the optimal policy learned from the reward function r + fM is the same as the optimal policy learned from the original reward function r. Steps 3.2 to 4.2 are repeated until the end of the manipulator reaches the target object, the manipulator hits an obstacle or the ground, or the set maximum number of time steps is reached.
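
A PyTorch sketch of steps 4.1 and 4.2: a two-hidden-layer potential network Φψ(s, a) (sizes and optimizer as in the embodiment) is regressed onto the gradient-free label y = −rM + γΦψ(s′, a′), and the shaping reward fM = γΦψ′(s′, a′) − Φψ(s, a) is formed from the values before and after the update. The batch shapes and the plain squared-error loss are assumptions consistent with the text, not the patent's exact implementation.

```python
# Hedged sketch of the DPBA shaping step (4.1-4.2): train a potential network
# Phi(s, a) toward y = -r_M + gamma * Phi(s', a'), then form the shaping reward
# f_M = gamma * Phi_new(s', a') - Phi_old(s, a).
import torch
import torch.nn as nn

class PotentialNet(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def dpba_shaping(phi, opt, s, a, s_next, a_next, r_m, gamma=0.99):
    """One DPBA update; returns the potential-based shaping reward f_M."""
    with torch.no_grad():
        phi_old = phi(s, a)                              # Phi_psi(s, a) before the update
        y = -r_m + gamma * phi(s_next, a_next)           # gradient-free "label value"
    loss = 0.5 * ((y - phi(s, a)) ** 2).mean()           # squared-error regression
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        f_m = gamma * phi(s_next, a_next) - phi_old      # shaping reward after the update
    return f_m

obs_dim, act_dim = 12, 3
phi = PotentialNet(obs_dim, act_dim)
opt = torch.optim.Adam(phi.parameters(), lr=1e-4)        # learning rate as in the embodiment
s, a = torch.randn(8, obs_dim), torch.randn(8, act_dim)
s2, a2, r_m = torch.randn(8, obs_dim), torch.randn(8, act_dim), torch.randn(8)
print(dpba_shaping(phi, opt, s, a, s2, a2, r_m).shape)   # torch.Size([8])
```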

Preferably, step S5 comprises the following steps:

Step 5.1: randomly sample a batch of data (S, A, R + FM, S′) from the experience replay pool, where (si, ai, ri + fiM, si+1) denotes a single training sample.

Step 5.2: compute the loss function used to update the parameters of the value-function network,

L(φ) = (1/N) Σi (yi − Qφ(si, ai))²,

where yi is the gradient-free "label value" of the value function,

yi = ri + fiM + γ Qφ′(si+1, μθ′(si+1)).

The parameters of the value-function network are updated by gradient descent,

φ ← φ − β ∇φ L(φ),

where β is the learning rate of the value-function network.

Step 5.3: compute the loss function used to update the parameters of the policy network,

L(θ) = −(1/N) Σi Qφ(si, μθ(si)).

The parameters of the policy network are updated by gradient descent,

θ ← θ − α ∇θ L(θ),

where α is the learning rate of the policy network.

Step 5.4: soft-update the parameters of the target networks,

φ′ ← τφ + (1 − τ)φ′,  θ′ ← τθ + (1 − τ)θ′.

Step 5.5: repeat steps 5.1 to 5.4 a total of K times to end the episode. Repeat steps S3 to S5 until the algorithm has fully converged, yielding the optimal policy network μθ*(s) with which the manipulator avoids the obstacles and reaches the target object in a dynamic environment.
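
A condensed PyTorch sketch of one step-5 update with the shaped reward r + fM, using the standard DDPG losses named above: critic regression toward r + fM + γQφ′(s′, μθ′(s′)), actor maximizing Qφ(s, μθ(s)), and soft target updates with step τ. The plain MLP networks and the unbounded actor output are simplifications, not the patent's implementation; hyper-parameter values follow the embodiment.

```python
# Hedged sketch of one step-5 (DDPG) update using the shaped reward r + f_M.
# Plain MLP actor/critic stand-ins; hyper-parameters follow the embodiment.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

obs_dim, act_dim, gamma, tau = 12, 3, 0.99, 1e-3
actor, critic = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
actor_t, critic_t = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
actor_t.load_state_dict(actor.state_dict()); critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, shaped_r, s_next):
    # Critic: regress Q(s, a) toward y = (r + f_M) + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        y = shaped_r + gamma * critic_t(torch.cat([s_next, actor_t(s_next)], -1)).squeeze(-1)
    critic_loss = ((y - critic(torch.cat([s, a], -1)).squeeze(-1)) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s, mu(s)) by minimizing its negative mean.
    actor_loss = -critic(torch.cat([s, actor(s)], -1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target networks with step size tau.
    for net, tgt in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

batch = (torch.randn(128, obs_dim), torch.randn(128, act_dim),
         torch.randn(128), torch.randn(128, obs_dim))
ddpg_update(*batch)
```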

The beneficial effects of the present invention are as follows: compared with the prior art, the magnetic field-based reward shaping method for reinforcement-learning robotic arm control proposed here provides the manipulator with richer orientation information about the target object and the obstacles while keeping the optimal policy unchanged, thereby effectively improving the learning efficiency and convergence speed of the reinforcement learning algorithm in complex dynamic environments.

Brief Description of the Drawings

Figure 1 is a diagram of the task scene of an embodiment of the present invention in the simulation environment, in which A is the end of the robotic arm, B is the obstacle and C is the target object;

Figure 2 is the overall framework of the algorithm of the present invention;

Figure 3 is a schematic diagram of the magnetic-field coordinate system of the square permanent magnet in an embodiment of the present invention;

Figure 4 compares the experimental results of the magnetic field-based reward shaping method of the present invention with those of similar algorithms in the simulation environment.

Detailed Description of Embodiments

The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Embodiment: as shown in Figure 1, this embodiment takes the Dobot Magician robotic arm with three degrees of freedom as an example. The designed task is to use a reinforcement learning algorithm to control the arm, in a dynamic environment, so that it moves its end to the target object without touching the obstacle. The target object and the obstacle are cuboids of different sizes, and their positions change randomly in every episode. The magnetic-field reward-function design framework for reinforcement-learning manipulator control described in this embodiment is shown in Figure 2 and comprises at least the following steps:

Step S1: design the task environment, set the relevant parameters of the manipulator, the target object and the obstacle, and set the hyper-parameters of the reinforcement learning algorithm, specifically including the following steps:

Step 1.1: design the state observation of the task environment and the action of the manipulator, specifically including:

a. The environment state observation contains the rotation angles of the three joints of the manipulator, the coordinates of the end of the manipulator, and the coordinates of the center points of the target object and the obstacle;

b. The action of the manipulator is the angular velocity of the three joint motors, i.e. the angle through which the three joints rotate in one time step, limited to the range [-1°, 1°].
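
As a rough illustration of the observation and action layout just described, the sketch below builds a flat 12-dimensional observation (three joint angles, end-effector position, target center, obstacle center, for a single obstacle) and clips a 3-dimensional action to [-1°, 1°] per time step. The ordering, the units and the helper names are assumptions for illustration, not taken from the patent.

```python
# Hedged sketch of the step-1.1 observation/action layout for one obstacle.
# Ordering and units (radians for joints, metres for coordinates) are assumptions.
import numpy as np

def build_observation(joint_angles, ee_xyz, target_xyz, obstacle_xyz):
    """Concatenate the pieces listed in step 1.1 into a flat 12-d state vector."""
    return np.concatenate([joint_angles, ee_xyz, target_xyz, obstacle_xyz]).astype(np.float32)

def clip_action(delta_deg):
    """Per-step joint rotation, limited to [-1 deg, 1 deg] as in the embodiment."""
    return np.clip(np.asarray(delta_deg, dtype=np.float32), -1.0, 1.0)

obs = build_observation([0.0, 0.3, -0.1], [0.18, 0.02, 0.12],
                        [0.25, 0.05, 0.02], [0.20, -0.04, 0.06])
print(obs.shape, clip_action([0.7, -1.5, 0.2]))   # (12,) [ 0.7 -1.   0.2]
```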

Step 1.2: establish the connection with the manipulator and set the speed and acceleration ranges of the three joint rotations; specify how the target object and the obstacle are randomly generated, ensuring that the target object is within reach of the end of the manipulator and that the target object and the obstacle do not intersect.

Step 1.3: set the basic hyper-parameters according to the reinforcement learning algorithm used. This embodiment adopts the DDPG algorithm for continuous state-action spaces proposed in "Lillicrap, Timothy P., et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)". The hyper-parameters to be set include: the exploration noise; the size of the experience replay pool, 10^6; the number of updates per training round K = 20 and the batch size N = 128; two hidden layers with 256 nodes each and ReLU activation for the neural networks, whose parameters are randomly initialized; the discount factor γ = 0.99; the Adam optimizer for updating the parameters of the policy network μθ(s) and the value-function network Qφ(s, a), with learning rates 3×10⁻⁴ and 10⁻³ respectively; and the soft-update step size τ = 10⁻³ of the target networks μθ′(s) and Qφ′(s, a).
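
For convenience, the hyper-parameters listed in step 1.3 can be collected in one configuration object. The dictionary below is only a restatement of the values above; the key names are illustrative.

```python
# The DDPG hyper-parameters listed in step 1.3 of the embodiment, gathered in
# one place. Key names are illustrative; values are taken from the text above.
DDPG_HPARAMS = {
    "replay_buffer_size": int(1e6),
    "updates_per_round_K": 20,
    "batch_size_N": 128,
    "hidden_layers": 2,
    "hidden_units": 256,
    "activation": "ReLU",
    "discount_gamma": 0.99,
    "optimizer": "Adam",
    "actor_lr": 3e-4,        # policy network mu_theta(s)
    "critic_lr": 1e-3,       # value-function network Q_phi(s, a)
    "target_soft_update_tau": 1e-3,
    "exploration_noise": "added to the policy output (distribution not specified)",
}
```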

Step S2: treat the target object as a square permanent magnet of the same shape, and determine its magnetization direction and how the magnetic field strength distribution in three-dimensional space is computed; do the same for the obstacle.

This embodiment takes a square permanent magnet as an example and computes its magnetic field strength distribution analytically; for permanent magnets of other shapes, the field strength distribution can be obtained by similar analytical methods or by physics-based simulation. The magnetic-field coordinate system of the square permanent magnet is shown in Figure 3. Assume the magnetization direction is the positive z-axis and the magnetization intensity is Mc. For a square permanent magnet with lengths l, w and h along the x-, y- and z-axes respectively, the magnetic field strength components at an arbitrary point P(x, y, z) along the x-, y- and z-axes can be expressed analytically in terms of the two auxiliary functions Γ(γ1, γ2, γ3) and its companion function, where ∈ is a very small constant, set to ∈ = 10⁻⁷ in this embodiment.

[The component expressions Hx, Hy, Hz and the two auxiliary functions are given as equation images in the original document.]

The magnetic field strength of the square permanent magnet at any point in three-dimensional space is then obtained as the magnitude H = (Hx² + Hy² + Hz²)^(1/2).

In this embodiment, the size of the target object is set to l = 0.03 m, w = 0.045 m, h = 0.02 m, and the size of the obstacle is set to l = 0.038 m, w = 0.047 m, h = 0.12 m.
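
The patent's analytic expressions for the cuboid magnet field survive only as equation images, so the sketch below uses a point-dipole approximation as an explicitly labelled stand-in. It reproduces the qualitative behaviour the reward relies on (field strength rising sharply near the magnet and decaying with distance) but is not the formula of step S2.

```python
# Stand-in for step S2: point-dipole approximation of a magnet's field strength.
# This is NOT the patent's analytic cuboid formula (available only as images);
# it is a simplified model with the same qualitative near/far behaviour.
import numpy as np

def dipole_field_strength(point, moment=(0.0, 0.0, 1.0), eps=1e-7):
    """|H| of a point dipole at the origin, evaluated at `point` (magnet frame)."""
    r = np.asarray(point, dtype=float)
    m = np.asarray(moment, dtype=float)
    r_norm = np.linalg.norm(r) + eps                   # eps plays the role of the small constant
    r_hat = r / r_norm
    h = (3.0 * np.dot(m, r_hat) * r_hat - m) / (4.0 * np.pi * r_norm ** 3)
    return float(np.linalg.norm(h))

# Field strength 5 cm above a z-magnetized "target" vs. 20 cm away to the side.
print(dipole_field_strength([0.0, 0.0, 0.05]), dipole_field_strength([0.2, 0.0, 0.0]))
```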

Step S3: the manipulator interacts with the environment, training data are collected, and the magnetic field strength at the end-effector coordinates in the target and obstacle magnetic fields is computed from the next state; after standardization and normalization the magnetic field reward function is obtained. This specifically includes the following steps:

Step 3.1: initialize the rotation angles of the three joints of the manipulator to zero and read the coordinates of the end of the manipulator; randomly set the positions of the target object and the obstacle, and read the coordinates of their center points in the world coordinate system. This gives the initial value of the state observation.

Step 3.2: according to the current state observation s and the policy network μθ(s), the manipulator outputs an action and noise is applied to it to obtain a; after interacting with the environment, the next state s′ and the original reward value r are obtained. Provided that the rotation angles of the three joints in the next state lie within their corresponding working ranges, the manipulator is controlled to move to the next state. In this embodiment the working ranges of the three joints are [-90°, 90°], [0°, 85°] and [-10°, 90°] respectively, and the original reward function is set as given in the equation image of the original document.

Step 3.3: transform the end-effector coordinates in the next state from the world coordinate system into the magnetic-field coordinate systems of the target magnet and the obstacle magnet. This embodiment takes the target magnet as an example; the coordinate transformation for the obstacle magnet is obtained in the same way.

Suppose that in the next state the coordinates of the end of the manipulator in the world coordinate system are pW, the translation of the origin of the target's magnetic-field coordinate system relative to the origin of the world coordinate system is (Tx, Ty, Tz), and the rotation angles of the target's magnetic-field coordinate system relative to the world coordinate system about the x-, y- and z-axes are θx, θy, θz, with positive directions following the right-hand rule. The coordinates pT of the end of the manipulator in the target's magnetic-field coordinate system are then obtained by applying this translation together with the rotation transformation matrices of the coordinate system about the x-, y- and z-axes.

[The coordinate-transformation formula and the explicit rotation matrices are given as equation images in the original document.]

Step 3.4: compute the magnetic field strength at the end-effector coordinates of the next state in the target's magnetic field and in the obstacle's magnetic field, and standardize the values.

Assume the environment contains one target object and n obstacles (n = 1 in this embodiment), each with its own magnetic field strength function, and let the coordinates of the end of the manipulator in the corresponding magnetic-field coordinate systems be those obtained in step 3.3. The magnetic field strengths of the end-effector coordinates in the target field and in the obstacle field can then be computed.

Because different calculation methods generally yield field strengths in different ranges, the computed values need to be standardized. The present invention introduces a magnetic-field-strength replay pool to store the computed values and maps each computed value onto a standard Gaussian distribution using the current mean μ and standard deviation σ of the field strengths held in the pool:

H̄ = (H − μ) / σ,

where H̄ is the standardized magnetic field strength. In this embodiment the size of the magnetic-field-strength replay pool is 10⁶.

Step 3.5: compute the joint magnetic field strength of the target and obstacle magnets and normalize it to obtain the magnetic field reward function.

In the present invention the task of the manipulator is to move its end to the target object while avoiding the obstacles, so the joint magnetic field strength of the target and obstacle magnets is defined so that the "attraction" exerted by the target magnet on the end of the manipulator is balanced against the "repulsion" exerted by all n obstacles (the exact expression is given as an equation image in the original document). According to the characteristics of the magnetic field strength distribution, the field strength near the target tends to positive infinity and the field strength near an obstacle tends to negative infinity. To keep the joint field strength within a reasonable range while preserving its distribution, the present invention normalizes it with the Softsign function, Softsign(x) = x / (1 + |x|), and defines the output as the magnetic field reward function rM.

Step S4: use the DPBA algorithm to convert the magnetic field reward function into a potential-based shaping reward function and store it in the experience replay pool together with the training data. The DPBA algorithm was proposed in "Harutyunyan A, Devlin S, Vrancx P, et al. Expressing arbitrary reward functions as potential-based advice [C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2015, 29(1)." and can convert a reward function given by an arbitrary expert into the form of potential-based reward shaping, so that the optimal-policy invariance theorem is satisfied. This specifically includes the following steps:

Step 4.1: define the potential-function neural network in the DPBA algorithm as Φψ(s, a); its input is the state observation and the action of the manipulator, its output is the potential value of the current state-action pair, and ψ denotes the network parameters. In this embodiment the potential-function network has two hidden layers with 256 nodes each and ReLU activation; its parameters are updated with the Adam optimizer at a learning rate of 10⁻⁴. The loss function used to update the potential-function network is the squared error between Φψ(s, a) and a gradient-free "label value" y,

L(ψ) = ½ (y − Φψ(s, a))²,

where

y = −rM + γΦψ(s′, a′),

with rM the magnetic field reward function obtained in step 3.5 and γ the discount factor. The parameters of the potential-function network are updated by gradient descent,

ψ′ = ψ − η ∇ψ L(ψ),

where η is the learning rate of the potential-function network.

Step 4.2: from the parameters Ψ before the update and Ψ′ after the update, the potential-based shaping reward is computed as

fM = γΦψ′(s′, a′) − Φψ(s, a).

When the potential function Φψ(s, a) is initialized to zero and updated in this way until convergence, the magnetic field reward function is fully converted into a potential-based shaping reward function, i.e. fM = rM.

The shaping reward fM is combined with the original reward r obtained in step 3.2, and (s, a, r + fM, s′) is stored in the experience replay pool as one training sample for the subsequent training of the reinforcement learning algorithm. By the optimal-policy invariance theorem, the optimal policy learned from the reward function r + fM is the same as the optimal policy learned from the original reward function r. Steps 3.2 to 4.2 are repeated until the end of the manipulator reaches the target object, the manipulator hits the obstacle or the ground, or the maximum number of time steps (set to 200 in this embodiment) is reached.
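
The episode loop implied by steps 3.2 to 4.2 can be sketched as follows: interact, shape the reward, store (s, a, r + fM, s′), and stop when the end reaches the target, the arm hits the obstacle or the ground, or 200 time steps elapse. The environment, policy and shaping callables are passed in; the dummy stand-ins at the bottom exist only so the sketch runs on its own and are not the patent's implementation.

```python
# Hedged sketch of one data-collection episode (steps 3.2-4.2): interact, shape
# the reward with DPBA, store (s, a, r + f_M, s'), stop on success/collision/200 steps.
import random

def run_episode(env, policy, shaping, replay, max_steps=200):
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)                                   # step 3.2: action plus exploration noise
        s_next, r, reached, collided = env.step(a)
        f_m = shaping(s, a, s_next)                     # steps 3.3-4.2: magnetic shaping reward
        replay.append((s, a, r + f_m, s_next))          # shaped transition into the replay pool
        if reached or collided:                         # success, or obstacle/ground contact
            break
        s = s_next
    return replay

class _DummyEnv:                                        # stand-in so the sketch is runnable
    def reset(self): return [0.0] * 12
    def step(self, a): return [0.0] * 12, -0.01, random.random() < 0.01, False

buffer = run_episode(_DummyEnv(), policy=lambda s: [0.0, 0.0, 0.0],
                     shaping=lambda s, a, s2: 0.0, replay=[])
print(len(buffer))
```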

Step S5: sample a batch of data from the experience replay pool and use the reinforcement learning algorithm to train the optimal policy with which the manipulator avoids the obstacle and reaches the target object in a dynamic environment, specifically including the following steps:

Step 5.1: randomly sample a batch of N samples (S, A, R + FM, S′) from the experience replay pool, where (si, ai, ri + fiM, si+1) denotes a single training sample.

Step 5.2: compute the loss function used to update the parameters of the value-function network,

L(φ) = (1/N) Σi (yi − Qφ(si, ai))²,

where yi is the gradient-free "label value" of the value function,

yi = ri + fiM + γ Qφ′(si+1, μθ′(si+1)).

The parameters of the value-function network are updated by gradient descent,

φ ← φ − β ∇φ L(φ),

where β is the learning rate of the value-function network.

Step 5.3: compute the loss function used to update the parameters of the policy network,

L(θ) = −(1/N) Σi Qφ(si, μθ(si)).

The parameters of the policy network are updated by gradient descent,

θ ← θ − α ∇θ L(θ),

where α is the learning rate of the policy network.

Step 5.4: soft-update the parameters of the target networks,

φ′ ← τφ + (1 − τ)φ′,  θ′ ← τθ + (1 − τ)θ′.

Step 5.5: repeat steps 5.1 to 5.4 a total of K times to end the episode. Repeat steps S3 to S5 until the algorithm has fully converged, yielding the optimal policy network μθ*(s) with which the manipulator avoids the obstacle and reaches the target object in a dynamic environment.

This embodiment compares the method disclosed in the present invention with similar algorithms trained in the simulation environment of Figure 1; the results are compared in Figure 4. As can be seen from the figure, the per-episode success rate of the magnetic field-based reward shaping method during learning is clearly higher than that of the original reward and of distance-based reward shaping. The method disclosed in the present invention can therefore effectively improve the learning efficiency of reinforcement learning algorithms in manipulator motion-control tasks.

The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several modifications and improvements without departing from the inventive concept of the present invention, and these all fall within the scope of protection of the present invention.

Claims (6)

1. A magnetic field-based reward shaping method for reinforcement-learning mechanical arm control, characterized by comprising the following steps:
s1, designing a task environment, setting relevant parameters of a mechanical arm, a target object and an obstacle, and setting the hyper-parameters of a reinforcement learning algorithm;
s2, regarding the target object as a square permanent magnet with the same shape, and determining the magnetization direction and the calculation mode of the three-dimensional magnetic field intensity distribution, the same applying to the obstacle;
s3, the mechanical arm interacting with the environment, collecting training data, calculating the magnetic field intensity at the tail-end coordinates of the mechanical arm in the magnetic fields of the target object and the obstacle according to the next state, and obtaining a magnetic field reward function after standardization and normalization;
s4, converting the magnetic field reward function into a potential-based shaping reward function by using the DPBA algorithm, and storing it together with the training data in an experience replay pool;
s5, sampling a batch of data from the experience replay pool, and using the reinforcement learning algorithm to train the optimal policy with which the mechanical arm avoids the obstacle and reaches the target object in a dynamic environment.
2. The method for magnetic field-based reward shaping in reinforcement learning robot arm control according to claim 1, wherein the step S1 comprises the steps of:
step 1.1, designing a state observation value of a task environment and an action value of a mechanical arm, and specifically comprising the following steps:
a. the environment state observation value comprises the rotation angles of the three joints of the mechanical arm, the coordinates of the tail end of the mechanical arm, and the coordinates of the center points of the target object and the obstacle;
b. the action value of the mechanical arm is the angular velocity of the three joint motors, namely the angle through which the three joints rotate in a unit time step;
step 1.2, establishing a connection with the mechanical arm, and setting the rotating speed and acceleration ranges of the three joints; specifying a random generation mode of the target object and the obstacle, ensuring that the target object is within the range reachable by the tail end of the mechanical arm and that the target object and the obstacle do not intersect;
step 1.3, setting the basic hyper-parameters of the reinforcement learning algorithm, at least comprising: the exploration noise; the size of the experience replay pool; the number of updates K of each training round and the size N of the data batch used for each update; the number of layers of the neural networks, the number of nodes of each layer and the activation function; the discount factor γ; the optimizer and learning rates for updating the parameters of the policy network μθ(s) and the value-function network Qφ(s, a); and the soft-update step size τ of the target networks μθ′(s) and Qφ′(s, a).
3. The method as claimed in claim 1, wherein in step S2 the magnetic field intensity distribution of the square permanent magnet in three-dimensional space is calculated analytically as follows:
assuming that the magnetization direction is the positive z-axis direction and the magnetization intensity is Mc, for a square permanent magnet with lengths l, w, h along the x-axis, y-axis and z-axis respectively, the magnetic field strength components at any point P(x, y, z) in three-dimensional space in the directions of the x-axis, y-axis and z-axis are expressed analytically in terms of two auxiliary functions, Γ(γ1, γ2, γ3) and a companion function, wherein ∈ is a very small value (the component expressions and the two auxiliary functions are given as equation images in the original document); the magnetic field strength of the square permanent magnet at any point in three-dimensional space is then obtained as the magnitude of the vector (Hx, Hy, Hz).
4. The method for magnetic field-based reward shaping in reinforcement-learning mechanical arm control of claim 1, wherein the step S3 comprises the steps of:
step 3.1, initializing the rotation angles of the three joints of the mechanical arm to zero, and reading the coordinates of the tail end of the mechanical arm; randomly setting the positions of the target object and the obstacle, reading the coordinates of the center points of the target object and the obstacle in a world coordinate system, and obtaining an initial value of the state observation value;
step 3.2, the mechanical arm outputting an action according to the current state observation value s and the policy and applying noise to the action to obtain a, and obtaining a next state s′ and an original reward value r after interaction with the environment; under the condition that the rotation angles of the three joints of the mechanical arm in the next state are within the corresponding working ranges, controlling the mechanical arm to move to the next state;
step 3.3, converting the tail-end coordinates of the mechanical arm in the next state from the world coordinate system into the magnetic field coordinate systems of the target object magnet and the obstacle magnet;
letting the coordinates of the tail end of the arm in the world coordinate system in the next state be pW, the translation amount of the origin of the target object's magnetic field coordinate system relative to the origin of the world coordinate system be (Tx, Ty, Tz), and the rotation angles of the target object's magnetic field coordinate system around the x-axis, the y-axis and the z-axis relative to the world coordinate system be θx, θy, θz, their positive directions following the right-hand rule, the coordinates pT of the tail end of the mechanical arm in the magnetic field coordinate system of the target object are obtained by applying the translation and the rotation transformation matrices of the coordinate system around the x-axis, the y-axis and the z-axis (the transformation formula and the explicit rotation matrices are given as equation images in the original document);
step 3.4, calculating the magnetic field strength of the tail-end coordinates of the mechanical arm in the target object magnetic field and the obstacle magnetic field in the next state, and standardizing it:
assuming that 1 target object and n obstacles exist in the environment, with respective magnetic field strength calculation functions for the target object magnet and the obstacle magnets, and with the coordinates of the tail end of the mechanical arm in the corresponding magnetic field coordinate systems given by step 3.3, the magnetic field strengths HT and HO1, …, HOn of the tail-end coordinates in the target and obstacle magnetic fields are calculated; these values are stored in a magnetic-field-intensity replay pool, and each calculated value is mapped onto a standard Gaussian distribution according to the current mean μ and standard deviation σ of the field strengths in the pool, the standardized magnetic field strength being H̄ = (H − μ)/σ;
step 3.5, calculating the joint magnetic field strength of the target object and obstacle magnets, and normalizing it to obtain the magnetic field reward function:
the joint magnetic field strength of the target and obstacle magnets is defined such that the standardized target field enters with a positive sign and the standardized obstacle fields enter with a negative sign (the exact expression is given as an equation image in the original document); the joint magnetic field strength is normalized by the Softsign function, and the output result is defined as the magnetic field reward function rM.
5. The method for magnetic field-based reward shaping in reinforcement-learning mechanical arm control of claim 1, wherein the step S4 comprises the steps of:
step 4.1, defining the potential-function neural network in the DPBA algorithm as Φψ(s, a), whose input is the state observation value and the action value of the mechanical arm and whose output is the potential energy value of the current state-action pair, ψ being a parameter of the neural network; the loss function used to update the potential-function neural network is the squared error between Φψ(s, a) and a gradient-free "label value" y, L(ψ) = ½(y − Φψ(s, a))², wherein y is expressed as
y = −rM + γΦψ(s′, a′),
wherein rM is the magnetic field reward function obtained in step 3.5 and γ is the discount factor; the parameters of the potential-function neural network are updated by gradient descent, ψ′ = ψ − η∇ψL(ψ), wherein η is the learning rate of the potential-function neural network;
step 4.2, according to the parameter Ψ before updating and the parameter Ψ′ after updating, the shaping reward function based on potential energy is calculated as
fM = γΦψ′(s′, a′) − Φψ(s, a);
when the potential function Φψ(s, a) is initialized to zero and updated to final convergence in the above manner, a complete conversion of the magnetic field reward function into a potential-based shaping reward function is achieved, namely fM = rM;
the shaping reward fM is combined with the original reward r obtained in step 3.2, and (s, a, r + fM, s′) is stored as a set of training data in the experience replay pool for subsequent training of the reinforcement learning algorithm; according to the optimal-policy invariance theorem, the optimal policy learned by the algorithm from the reward function r + fM is consistent with the optimal policy learned from the original reward function r; steps 3.2 to 4.2 are repeated until the tail end of the mechanical arm reaches the target object, or the mechanical arm touches an obstacle or the ground, or a set maximum time step is reached.
6. The method for magnetic field-based reward shaping in reinforcement learning robot arm control of claim 1, wherein the step S5 comprises the steps of:
step 5.1, randomly sampling a batch of data (S, A, R + F) in an empirical playback pool M S'), wherein (S) is i ,a i ,r i +f i M ,s i+1 ) Representing a single training data;
step 5.2, calculating the loss function used to update the value function network parameters, as given in formula image FDA0003706057020000051 (not reproduced);
wherein y_i is the gradient-free label value of the value function, expressed by formula image FDA0003706057020000052 (not reproduced);
the parameters of the value function network are updated by gradient descent as given in formula image FDA0003706057020000053 (not reproduced);
wherein β is the learning rate for the value function network update;
step 5.3, calculating the loss function used to update the policy network parameters, as given in formula image FDA0003706057020000054 (not reproduced);
the parameters of the policy network are updated by gradient descent as given in formula image FDA0003706057020000055 (not reproduced);
wherein α is the learning rate for the policy network update;
step 5.4, soft-updating the parameters of the target networks, as given in formula image FDA0003706057020000056 (not reproduced);
step 5.5, repeating steps 5.1 to 5.4 K times to complete the current training round; repeating steps S3 to S5 until the algorithm fully converges, yielding the optimal policy network (formula image FDA0003706057020000057, not reproduced) with which the mechanical arm avoids obstacles and reaches the target object in a dynamic environment.
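For illustration, a PyTorch sketch of one cycle of steps 5.1 to 5.4 written as a DDPG-style actor-critic update on the shaped reward r + f_M; the value and policy losses and the soft-update rule appear only as formula images, so the exact algorithm (for example whether an entropy term as in SAC is used) is an assumption, and the names train_batch, actor, critic and tau are hypothetical:

import torch
import torch.nn.functional as F

def train_batch(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    # batch: tensors (s, a, r_shaped, s_next) sampled from the replay pool,
    # where r_shaped = r + f_M already contains the shaping reward (step 5.1).
    s, a, r_shaped, s_next = batch

    # Step 5.2: update the value function network toward the label y_i.
    with torch.no_grad():
        y = r_shaped + gamma * critic_targ(s_next, actor_targ(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)      # learning rate beta set in critic_opt
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 5.3: update the policy network by maximizing the critic's value estimate.
    actor_loss = -critic(s, actor(s)).mean()       # learning rate alpha set in actor_opt
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 5.4: soft-update the target networks.
    with torch.no_grad():
        for targ, src in ((actor_targ, actor), (critic_targ, critic)):
            for p_t, p in zip(targ.parameters(), src.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)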
CN202210705509.0A 2022-06-21 2022-06-21 Magnetic field-based reward shaping method for reinforcement learning mechanical arm control Active CN115179280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210705509.0A CN115179280B (en) 2022-06-21 2022-06-21 Magnetic field-based reward shaping method for reinforcement learning mechanical arm control


Publications (2)

Publication Number Publication Date
CN115179280A true CN115179280A (en) 2022-10-14
CN115179280B CN115179280B (en) 2025-07-22

Family

ID=83514905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210705509.0A Active CN115179280B (en) 2022-06-21 2022-06-21 Magnetic field-based reward shaping method for reinforcement learning mechanical arm control

Country Status (1)

Country Link
CN (1) CN115179280B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104914874A (en) * 2015-06-09 2015-09-16 长安大学 Unmanned aerial vehicle attitude control system and method based on self-adaption complementation fusion
US20190104994A1 (en) * 2017-10-09 2019-04-11 Vanderbilt University Robotic capsule system with magnetic actuation and localization
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN111515961A (en) * 2020-06-02 2020-08-11 南京大学 Reinforcement learning reward method suitable for mobile mechanical arm
CN113889737A (en) * 2021-09-30 2022-01-04 西华大学 Substrate integrated waveguide parameter optimization method and structure based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN LIQUN: "The Higher One Looks, the Loftier It Seems; the Deeper One Drills, the Harder It Gets — A Review of 'Spacecraft Attitude Dynamics'", Journal of Shangqiu Teachers College, no. 4, 30 December 1997 (1997-12-30) *
WEI ZHIHUA: "Research on Mobile Robot Navigation and Environment State Detection Based on Reinforcement Learning", Master's thesis, 15 January 2007 (2007-01-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117140527A (en) * 2023-09-27 2023-12-01 中山大学·深圳 A robotic arm control method and system based on deep reinforcement learning algorithm
CN117140527B (en) * 2023-09-27 2024-04-26 中山大学·深圳 Mechanical arm control method and system based on deep reinforcement learning algorithm
CN118123816A (en) * 2024-02-18 2024-06-04 东莞理工学院 Deep reinforcement learning robot arm motion planning method, system and storage medium based on constraint model
CN120326639A (en) * 2025-06-19 2025-07-18 北京中联国成科技有限公司 A robot high-precision anti-collision system and method

Also Published As

Publication number Publication date
CN115179280B (en) 2025-07-22

Similar Documents

Publication Publication Date Title
CN115179280A (en) Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning
CN110750096B (en) Collision avoidance planning method for mobile robots based on deep reinforcement learning in static environment
Juang et al. Wall-following control of a hexapod robot using a data-driven fuzzy controller learned through differential evolution
CN106096729B (en) A kind of depth-size strategy learning method towards complex task in extensive environment
Bai et al. Path planning of autonomous mobile robot in comprehensive unknown environment using deep reinforcement learning
CN116533234A (en) Multi-axis hole assembly method and system based on layered reinforcement learning and distributed learning
CN110362081B (en) Mobile robot path planning method
CN114888801A (en) Mechanical arm control method and system based on offline strategy reinforcement learning
Toan et al. Mapless navigation with deep reinforcement learning based on the convolutional proximal policy optimization network
CN116922391B (en) Autonomous learning and optimizing method for spatial mechanical arm skills
Luo et al. Calibration-free monocular vision-based robot manipulations with occlusion awareness
CN111309035A (en) Multi-robot cooperative movement and dynamic obstacle avoidance method, device, equipment and medium
Yan et al. Path planning for mobile robot's continuous action space based on deep reinforcement learning
CN119200410A (en) Obstacle avoidance trajectory planning and adaptive tracking control method for tower cranes for intelligent construction
CN116852347A (en) A state estimation and decision control method for autonomous grasping of non-cooperative targets
CN119388427B (en) Method and system for controlling the stable motion trajectory of a robotic arm when grasping a large thin plate
Luo et al. Balance between efficient and effective learning: Dense2sparse reward shaping for robot manipulation with environment uncertainty
Ding et al. Magnetic field-based reward shaping for goal-conditioned reinforcement learning
Kumar et al. Kinematic control of a redundant manipulator using an inverse-forward adaptive scheme with a KSOM based hint generator
Fang et al. Quadrotor navigation in dynamic environments with deep reinforcement learning
CN118536684A (en) Multi-agent path planning method based on deep reinforcement learning
Duan et al. Learning from demonstrations: An intuitive VR environment for imitation learning of construction robots
Sang et al. Motion planning of space robot obstacle avoidance based on DDPG algorithm
Toan et al. Environment exploration for mapless navigation based on deep reinforcement learning
Shen et al. Energy-Efficient Motion Planning and Control for Robotic Arms via Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant