CN115179280A - Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning - Google Patents
Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning
- Publication number
- CN115179280A (application CN202210705509.0A)
- Authority
- CN
- China
- Prior art keywords
- magnetic field
- mechanical arm
- target object
- function
- reward
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
- B25J9/163—Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1674—Programme controls characterised by safety, monitoring, diagnostic
- B25J9/1676—Avoiding collision or forbidden zones
Landscapes
- Engineering & Computer Science (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- Manipulator (AREA)
- Feedback Control In General (AREA)
Abstract
Description
Technical Field
The invention belongs to the field of robot control and in particular relates to a magnetic field-based reward shaping method for reinforcement learning control of a robotic arm.
Background Art
Traditional robotic arm control methods usually model the arm with kinematic and dynamic equations and solve for the end-effector pose and the angle of each joint. As industrial application scenarios become more complex and dynamic, the computational cost of these model-based control methods keeps growing; they cannot adapt to changes in the external environment in time and lack the ability to learn from and generalize to the environment autonomously.
In recent years, reinforcement learning has been widely applied to robotic arm control tasks because of its unique advantages in sequential decision making. By mapping the environment state obtained from sensors directly to the actions executed by the arm, it achieves end-to-end control and offers a new approach to controlling complex, continuous, high-dimensional systems. The optimization objective of reinforcement learning is to find, within a Markov decision process (MDP), the optimal policy that maximizes the cumulative reward, so designing a well-founded reward function is particularly important. Existing reward functions for robotic arm motion control are relatively simple, including orientation-based and heuristic designs; they cannot provide rich reward signals in complex dynamic environments and therefore do not effectively improve learning efficiency.
Patent document CN113894787A discloses a method for designing a heuristic reward function for reinforcement learning motion planning of a robotic arm, comprising: establishing a heuristic function for the motion planning problem; constructing a heuristic reward function from it; determining the values of its parameters; and training a neural network motion planner with the constructed reward function. That invention builds the heuristic reward on the straight-line distance from the end-effector to the target position, so it cannot provide higher-order reward signals and cannot guarantee the optimality of the learned policy.
Summary of the Invention
The purpose of the present invention is to address the limited information provided by the reward functions of existing reinforcement learning robotic arm control methods in complex dynamic environments. A magnetic field-based reward shaping method for reinforcement learning robotic arm control is proposed that, while keeping the optimal policy unchanged, provides the arm with richer directional information about the target object and the obstacles, thereby improving the learning efficiency and convergence speed of the reinforcement learning algorithm.
The technical scheme of the present invention is a magnetic field-based reward shaping method for reinforcement learning robotic arm control, characterized by comprising the following steps:
S1. Design the task environment, set the parameters of the robotic arm, the target object and the obstacles, and set the hyperparameters of the reinforcement learning algorithm.
S2. Treat the target object as a rectangular permanent magnet of the same shape and determine its magnetization direction and the way its magnetic field strength distribution in three-dimensional space is computed; do the same for each obstacle.
S3. Let the robotic arm interact with the environment, collect training data, compute from the next state the magnetic field strength at the end-effector coordinates in the target and obstacle fields, and obtain the magnetic field reward function after standardization and normalization.
S4. Use the DPBA algorithm to convert the magnetic field reward function into a potential-based shaping reward function and store it in the experience replay pool together with the training data.
S5. Sample a batch of data from the experience replay pool and use the reinforcement learning algorithm to train the optimal policy with which the arm avoids the obstacles and reaches the target in a dynamic environment. (A high-level sketch of this training loop is given below.)
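Although the patent contains no source code, the relationship between steps S1-S5 can be summarized in the following Python skeleton. Every name in it (env, agent, dpba, magnetic_reward_fn and the methods called on them) is an illustrative placeholder rather than an identifier from the patent; it is a structural sketch under those assumptions, not a definitive implementation.

```python
def train_loop(env, agent, dpba, magnetic_reward_fn, episodes, max_steps, updates_k, batch_n):
    """Structural skeleton of steps S1-S5; all arguments are illustrative placeholders."""
    for _ in range(episodes):                              # S3-S5 repeated until convergence
        s = env.reset()                                    # S1: arm, target and obstacles re-sampled
        for _ in range(max_steps):
            a = agent.act(s, explore=True)                 # policy output plus exploration noise
            s_next, r, done = env.step(a)                  # original reward r
            r_m = magnetic_reward_fn(s_next)               # S2-S3: magnetic field reward
            f_m = dpba.shaping_reward(s, a, s_next, r_m)   # S4: DPBA potential-based shaping
            agent.replay_pool.add(s, a, r + f_m, s_next)   # shaped transition stored
            s = s_next
            if done:
                break
        for _ in range(updates_k):                         # S5: K gradient updates per round
            agent.update(agent.replay_pool.sample(batch_n))
```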
Preferably, step S1 comprises the following steps:
Step 1.1: design the state observation of the task environment and the action of the robotic arm, specifically including:
a. the environment state observation contains the rotation angles of the three joints of the arm, the coordinates of the end-effector, and the coordinates of the center points of the target object and the obstacles;
b. the action of the arm consists of the angular velocities of the three joint motors, i.e., the angles through which the three joints rotate in one time step.
Step 1.2: establish the connection to the robotic arm and set the velocity and acceleration ranges of the three joints; specify how the target object and the obstacles are randomly generated, ensuring that the target lies within the reach of the end-effector and that the target and the obstacles do not intersect. (A sketch of such a placement routine follows.)
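As an illustration of the random scene generation in step 1.2, the sketch below samples a target and an obstacle as axis-aligned boxes, keeps the target inside an assumed reachable hemisphere around the arm base, and rejects overlapping placements. The reach radius, minimum gap and ground height are assumptions; the box dimensions are taken from the embodiment described later.

```python
import numpy as np

def sample_scene(reach=0.3, ground=0.0, min_gap=0.02, rng=None):
    """Randomly place one target and one obstacle (axis-aligned boxes).

    `reach`, `min_gap` and `ground` are illustrative values; the arm base is
    assumed to sit at the world origin. Returns the target and obstacle centers.
    """
    rng = rng or np.random.default_rng()
    target_size = np.array([0.03, 0.045, 0.02])      # l, w, h of the target (embodiment values)
    obstacle_size = np.array([0.038, 0.047, 0.12])   # l, w, h of the obstacle (embodiment values)
    while True:
        t = rng.uniform([-reach, -reach, ground], [reach, reach, reach])
        if np.linalg.norm(t) > reach:                # target must be reachable
            continue
        o = rng.uniform([-reach, -reach, ground], [reach, reach, reach])
        # axis-aligned boxes do not intersect if they are separated along any axis
        half_sum = (target_size + obstacle_size) / 2 + min_gap
        if np.any(np.abs(t - o) > half_sum):
            return t, o

print(sample_scene())
```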
Step 1.3: set the basic hyperparameters of the reinforcement learning algorithm, at least including: the exploration noise; the size of the experience replay pool; the number of updates K per training round and the batch size N of each update; the number of layers, nodes per layer and activation function of the neural networks; the discount factor γ; the optimizers and learning rates for updating the parameters of the policy network μθ(s) and the value function network Qφ(s,a); and the soft-update step size τ of the target networks μθ′(s) and Qφ′(s,a).
In step S2, the analytical calculation of the magnetic field strength distribution of a rectangular permanent magnet in three-dimensional space is as follows:
Assume the magnetization direction is the positive z-axis and the magnetization is Mc. For a rectangular permanent magnet with side lengths l, w and h along the x-, y- and z-axes, the magnetic field strength components at an arbitrary point P(x, y, z) along the x-, y- and z-axes can be written in closed form in terms of two auxiliary functions, Γ(γ1, γ2, γ3) and a second auxiliary function, whose definitions involve a small positive constant ∈ introduced to avoid singularities. The overall field strength of the rectangular magnet at any point in space is then the magnitude of this vector, H = sqrt(Hx² + Hy² + Hz²).
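Because the closed-form component expressions and auxiliary functions are only referenced here, the following sketch computes the field strength of a uniformly z-magnetized cuboid numerically with the equivalent surface-charge model (charge density +Mc on the top face, -Mc on the bottom face) as a stand-in for the patent's analytical formulas. The magnet dimensions, magnetization and grid resolution are example values, and the magnet is assumed to be centered at the origin of its own field coordinate system.

```python
import numpy as np

def cuboid_field_strength(p, l=0.03, w=0.045, h=0.02, Mc=1.0, n=20):
    """Magnitude of the H-field of a uniformly z-magnetized cuboid at point p.

    Numerical surface-charge approximation, not the patent's closed-form result;
    l, w, h, Mc and the grid resolution n are illustrative values.
    """
    p = np.asarray(p, dtype=float)
    xs = (np.arange(n) + 0.5) / n * l - l / 2           # surface grid along x
    ys = (np.arange(n) + 0.5) / n * w - w / 2           # surface grid along y
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    dA = (l / n) * (w / n)                              # area of each surface element
    H = np.zeros(3)
    for z_face, sigma in ((h / 2, +Mc), (-h / 2, -Mc)):
        R = p - np.stack([X, Y, np.full_like(X, z_face)], axis=-1)  # vectors to P
        dist = np.linalg.norm(R, axis=-1, keepdims=True) + 1e-7     # small guard value
        H += sigma * dA / (4 * np.pi) * (R / dist**3).sum(axis=(0, 1))
    return float(np.linalg.norm(H))

print(cuboid_field_strength([0.0, 0.0, 0.05]))
```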
Preferably, step S3 comprises the following steps:
Step 3.1: initialize the rotation angles of the three joints of the arm to zero and read the coordinates of the end-effector; randomly set the positions of the target object and the obstacles and read the coordinates of their center points in the world coordinate system. This yields the initial state observation.
Step 3.2: based on the current state observation s and the policy, the arm outputs an action, noise is added to obtain a, and after interacting with the environment the next state s′ and the original reward r are obtained. After verifying that the three joint angles of the next state lie within their corresponding working ranges, the arm is moved to the next state.
Step 3.3: transform the end-effector coordinates of the next state from the world coordinate system into the magnetic field coordinate systems of the target magnet and the obstacle magnets.
Assume that in the next state the end-effector has known coordinates in the world coordinate system, that the translation of the origin of the target magnetic field coordinate system relative to the origin of the world coordinate system is (Tx, Ty, Tz), and that the rotation angles of the target magnetic field coordinate system relative to the world coordinate system about the x-, y- and z-axes are θx, θy and θz, with positive directions following the right-hand rule. The coordinates of the end-effector in the target magnetic field coordinate system are then obtained by applying this translation and the rotation transformation matrices of the coordinate system about the x-, y- and z-axes to its world coordinates.
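A minimal sketch of the world-to-magnet-frame transformation of step 3.3 follows. The rotation order (x, then y, then z) and the passive-rotation convention are assumptions made for illustration; the text available here does not specify the composition order.

```python
import numpy as np

def world_to_magnet_frame(p_world, translation, angles):
    """Express a world-frame point in a magnet's field coordinate system.

    Assumes the magnet frame is obtained from the world frame by rotating about
    x, then y, then z by (theta_x, theta_y, theta_z) and translating by
    (Tx, Ty, Tz); this order and convention are illustrative assumptions.
    """
    tx, ty, tz = angles
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(tx), -np.sin(tx)],
                   [0, np.sin(tx),  np.cos(tx)]])
    Ry = np.array([[ np.cos(ty), 0, np.sin(ty)],
                   [0, 1, 0],
                   [-np.sin(ty), 0, np.cos(ty)]])
    Rz = np.array([[np.cos(tz), -np.sin(tz), 0],
                   [np.sin(tz),  np.cos(tz), 0],
                   [0, 0, 1]])
    R = Rz @ Ry @ Rx                          # magnet frame -> world frame
    return R.T @ (np.asarray(p_world) - np.asarray(translation))

# Example: target frame translated by (0.1, 0.0, 0.02) and rotated 90 degrees about z
print(world_to_magnet_frame([0.12, 0.05, 0.1], [0.1, 0.0, 0.02], [0, 0, np.pi / 2]))
```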
Step 3.4: compute the magnetic field strengths at the end-effector coordinates of the next state in the target field and in the obstacle fields, and standardize them.
Assume that the environment contains one target object and n obstacles, with a magnetic field strength calculation function for the target magnet and for each obstacle magnet, and with the end-effector coordinates expressed in the field coordinate system of each magnet. The magnetic field strengths at the end-effector position in the target field and in each obstacle field can then be computed from these functions and coordinates.
The computed field strengths are stored in a magnetic field strength replay pool, and the mean and standard deviation of the field strengths currently held in the pool are used to map each newly computed value onto a standard Gaussian distribution; the standardized field strength is obtained by subtracting this mean and dividing by this standard deviation.
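The magnetic field strength replay pool and the standardization of step 3.4 can be sketched as a running z-score, as below; the pool capacity and the epsilon guard against a zero standard deviation are illustrative choices.

```python
from collections import deque
import numpy as np

class FieldStandardizer:
    """Running z-score standardization of magnetic field strengths.

    Minimal sketch of the 'magnetic field strength replay pool': values are kept
    in a bounded buffer and each new value is mapped to a standard Gaussian using
    the current buffer mean and standard deviation.
    """
    def __init__(self, capacity=10**6, eps=1e-8):
        self.pool = deque(maxlen=capacity)
        self.eps = eps

    def __call__(self, h):
        self.pool.append(float(h))
        vals = np.fromiter(self.pool, dtype=float)
        return (h - vals.mean()) / (vals.std() + self.eps)

standardize = FieldStandardizer()
print([round(standardize(v), 3) for v in (0.2, 0.5, 0.9, 0.1)])
```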
Step 3.5: compute the joint magnetic field strength of the target and obstacle magnets and normalize it to obtain the magnetic field reward function.
The joint magnetic field strength of the target and obstacle magnets is defined so that the "attraction" of the target magnet on the end-effector and the combined "repulsion" of all n obstacle magnets enter with equal weight.
The joint magnetic field strength is normalized with the Softsign function, Softsign(x) = x / (1 + |x|), and the output is defined as the magnetic field reward function rM.
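A sketch of step 3.5, assuming the joint field strength is the standardized target field minus the equally weighted average of the standardized obstacle fields (the exact combination formula is not reproduced in the text available here), followed by Softsign normalization:

```python
import numpy as np

def magnetic_reward(h_target_std, h_obstacles_std):
    """Magnetic field reward r_M from standardized field strengths.

    The simple difference between the target term and the averaged obstacle term
    is an assumption consistent with the equal-weight description; Softsign keeps
    the result in (-1, 1).
    """
    h_obstacles_std = np.atleast_1d(h_obstacles_std)
    joint = h_target_std - np.mean(h_obstacles_std)   # attraction minus repulsion
    return joint / (1.0 + abs(joint))                 # Softsign normalization

print(magnetic_reward(1.8, [0.4, -0.2]))
```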
Preferably, step S4 comprises the following steps:
Step 4.1: define the potential function neural network of the DPBA algorithm as Φψ(s, a); its input is the state observation and the action of the arm, its output is the potential value of the current state-action pair, and ψ denotes the network parameters. The loss function used to update the potential function network is defined as the squared error between the network output Φψ(s, a) and a gradient-free "label value" y,
where y is expressed as:
y = -rM + γΦψ(s′, a′)
where rM is the magnetic field reward function obtained in step 3.5 and γ is the discount factor. The parameters of the potential function network are updated by gradient descent on this loss, i.e., ψ′ = ψ - η∇ψL(ψ),
where η is the learning rate of the potential function network update.
Step 4.2: from the parameters ψ before the update and the parameters ψ′ after the update, the potential-based shaping reward function is computed as:
fM = γΦψ′(s′, a′) - Φψ(s, a)
When the potential function Φψ(s, a) is initialized to zero and updated in this way until convergence, the magnetic field reward function is fully converted into a potential-based shaping reward function, i.e., fM = rM.
The shaping reward fM is combined with the original reward r obtained in step 3.2, and (s, a, r + fM, s′) is stored in the experience replay pool as one training sample for the subsequent training of the reinforcement learning algorithm. By the optimal policy invariance theorem, the optimal policy learned by the algorithm from the reward function r + fM is identical to the optimal policy learned from the original reward function r. Steps 3.2 to 4.2 are repeated until the end-effector reaches the target, the arm hits an obstacle or the ground, or the set maximum number of time steps is reached.
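One way to realize the DPBA conversion of step S4 in PyTorch is sketched below. The two-hidden-layer, 256-unit ReLU architecture follows the embodiment; passing the current policy's next action a′ and using a plain squared-error loss are assumptions consistent with the description rather than a verbatim implementation.

```python
import torch
import torch.nn as nn

class PotentialNet(nn.Module):
    """Potential function Phi_psi(s, a): two hidden layers of 256 ReLU units."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def dpba_shaping(phi, optimizer, s, a, s_next, a_next, r_m, gamma=0.99):
    """One DPBA step on a single transition: update Phi, return the shaping reward f_M."""
    with torch.no_grad():
        y = -r_m + gamma * phi(s_next, a_next)         # gradient-free label value
        phi_old = phi(s, a)                            # Phi_psi(s, a) before the update
    loss = ((phi(s, a) - y) ** 2).mean()               # squared-error loss on Phi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                              # Phi_psi'(s', a') after the update
        f_m = gamma * phi(s_next, a_next) - phi_old
    return f_m
```

An Adam optimizer with learning rate 10^-4, as in the embodiment, could be constructed with torch.optim.Adam(phi.parameters(), lr=1e-4).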
Preferably, step S5 comprises the following steps:
Step 5.1: randomly sample a batch of data (S, A, R + FM, S′) from the experience replay pool, with (si, ai, ri + fiM, si+1) denoting a single training sample.
Step 5.2: compute the loss function used to update the parameters of the value function network as the mean squared error, over the batch, between Qφ(si, ai) and a gradient-free "label value" yi,
where yi is expressed as yi = ri + fiM + γQφ′(si+1, μθ′(si+1)), with Qφ′ and μθ′ denoting the target networks.
The parameters of the value function network are updated by gradient descent on this loss,
where β is the learning rate of the value function network update.
Step 5.3: compute the loss function used to update the parameters of the policy network as the negative of the batch-average value Qφ(si, μθ(si)).
The parameters of the policy network are updated by gradient descent on this loss,
where α is the learning rate of the policy network update.
Step 5.4: softly update the parameters of the target networks, φ′ ← τφ + (1 - τ)φ′ and θ′ ← τθ + (1 - τ)θ′.
Step 5.5: repeat steps 5.1 to 5.4 a total of K times to finish the current round. Repeat steps S3 to S5 until the algorithm has fully converged, yielding the optimal policy network with which the robotic arm avoids the obstacles and reaches the target in a dynamic environment.
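Steps 5.2 to 5.4 correspond to one standard DDPG update on the shaped transitions (DDPG is the algorithm the embodiment adopts). The sketch below assumes the usual DDPG losses and Polyak target update; the actor/critic modules, the shape conventions of the batch tensors and the default hyperparameters are illustrative.

```python
import torch

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99, tau=1e-3):
    """One DDPG update on a batch (s, a, r + f_M, s_next) of shaped transitions.

    Assumes critic(s, a) returns shape (N, 1) and r_shaped has shape (N,).
    """
    s, a, r_shaped, s_next = batch

    # Step 5.2: critic update against the gradient-free target y_i
    with torch.no_grad():
        y = r_shaped.view(-1, 1) + gamma * critic_targ(s_next, actor_targ(s_next))
    critic_loss = ((critic(s, a) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 5.3: actor update, maximizing Q_phi(s, mu_theta(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Step 5.4: soft (Polyak) update of the target networks with step size tau
    with torch.no_grad():
        for targ, net in ((critic_targ, critic), (actor_targ, actor)):
            for p_t, p in zip(targ.parameters(), net.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)
```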
The beneficial effects of the present invention are as follows: compared with the prior art, the magnetic field-based reward shaping method for reinforcement learning robotic arm control proposed by the present invention provides the arm with richer directional information about the target object and the obstacles while keeping the optimal policy unchanged, thereby effectively improving the learning efficiency and convergence speed of the reinforcement learning algorithm in complex dynamic environments.
Description of the Drawings
Fig. 1 is a diagram of the task scene of an embodiment of the present invention in the simulation environment, in which A is the end-effector of the robotic arm, B is an obstacle and C is the target object;
Fig. 2 is an overall framework diagram of the algorithm of the present invention;
Fig. 3 is a schematic diagram of the magnetic field coordinate system of the rectangular permanent magnet in an embodiment of the present invention;
Fig. 4 compares the experimental results of the magnetic field-based reward shaping method of the present invention with those of similar algorithms in the simulation environment.
Detailed Description
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment: as shown in Fig. 1, this embodiment takes a Dobot Magician robotic arm with three degrees of freedom as an example. The designed task is to use a reinforcement learning algorithm, in a dynamic environment, to control the arm so that it moves its end-effector to the target object without touching the obstacle. The target object and the obstacle are cuboids of different sizes whose positions change randomly in every episode. The magnetic field reward function design framework for reinforcement learning robotic arm control described in this embodiment is shown in Fig. 2 and comprises at least the following steps:
步骤S1、设计任务环境,设定机械臂、目标物和障碍物的相关参数,设置强化学习算法的各项超参数,具体包括以下步骤:Step S1, design the task environment, set the relevant parameters of the robotic arm, the target object and the obstacle, and set various hyperparameters of the reinforcement learning algorithm, which specifically includes the following steps:
步骤1.1,设计任务环境的状态观测值和机械臂的动作值,具体包括:Step 1.1, design the state observation value of the task environment and the action value of the manipulator, including:
a、环境状态观测值包含机械臂三个关节的转角、机械臂末端的坐标,以及目标物和障碍物中心点的坐标;a. The observation value of the environmental state includes the rotation angles of the three joints of the manipulator, the coordinates of the end of the manipulator, and the coordinates of the center point of the target and obstacles;
b、机械臂的动作值为三个关节电机的转角速度,即在单位时间步长里三个关节旋转的角度,其范围限制在[-1°,1°]。b. The action value of the robotic arm is the angular velocity of the three joint motors, that is, the rotation angle of the three joints in a unit time step, and its range is limited to [-1°, 1°].
步骤1.2,建立与机械臂的连接,设置三个关节转动的速度和加速度范围;规定目标物和障碍物随机生成的方式,确保目标物在机械臂末端可达到的范围之内,并且目标物和障碍物不相交。Step 1.2, establish the connection with the manipulator, set the speed and acceleration range of the rotation of the three joints; specify the random generation method of the target and obstacles, ensure that the target is within the reach of the end of the manipulator, and the target and Obstacles do not intersect.
Step 1.3: set the basic hyperparameters according to the reinforcement learning algorithm used. This embodiment adopts the DDPG algorithm for continuous state and action spaces proposed in "Lillicrap, Timothy P., et al. 'Continuous control with deep reinforcement learning.' arXiv preprint arXiv:1509.02971 (2015)". The hyperparameters to be set include: the exploration noise; an experience replay pool of size 10^6; K = 20 updates per training round with batch size N = 128; neural networks with 2 hidden layers of 256 nodes each, ReLU activation and randomly initialized parameters; discount factor γ = 0.99; the Adam optimizer for updating the parameters of the policy network μθ(s) and the value function network Qφ(s,a), with learning rates of 3×10^-4 and 10^-3 respectively; and a soft-update step size τ = 10^-3 for the target networks μθ′(s) and Qφ′(s,a).
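Gathered into a single configuration object, the embodiment's settings might look as follows; the exploration-noise entry is left unset because its value is not given in the text reproduced here, and the key names are illustrative.

```python
ddpg_config = {
    "replay_pool_size": 10**6,
    "updates_per_round_K": 20,
    "batch_size_N": 128,
    "hidden_layers": (256, 256),       # two hidden layers, ReLU activation
    "activation": "relu",
    "discount_gamma": 0.99,
    "actor_lr": 3e-4,                  # Adam, policy network mu_theta(s)
    "critic_lr": 1e-3,                 # Adam, value network Q_phi(s, a)
    "target_soft_update_tau": 1e-3,
    "max_episode_steps": 200,
    "exploration_noise": None,         # value not specified in the available text
}
```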
Step S2: treat the target object as a rectangular permanent magnet of the same shape and determine its magnetization direction and the way its magnetic field strength distribution in three-dimensional space is computed; the same applies to the obstacle.
This embodiment takes a rectangular permanent magnet as an example and computes its field strength distribution analytically; for permanent magnets of other shapes, a similar analytical method or a physics-based simulation can be used to obtain the field distribution. The magnetic field coordinate system of the rectangular permanent magnet is shown in Fig. 3. Assume the magnetization direction is the positive z-axis and the magnetization is Mc. For a rectangular permanent magnet with side lengths l, w and h along the x-, y- and z-axes, the magnetic field strength components at an arbitrary point P(x, y, z) along the x-, y- and z-axes can be written in closed form in terms of two auxiliary functions, Γ(γ1, γ2, γ3) and a second auxiliary function,
whose definitions involve a small value ∈, taken as ∈ = 10^-7 in this embodiment. The overall field strength of the rectangular magnet at any point in space is then the magnitude of this vector.
In this embodiment, the target object dimensions are set to l = 0.03 m, w = 0.045 m, h = 0.02 m, and the obstacle dimensions to l = 0.038 m, w = 0.047 m, h = 0.12 m.
Step S3: the robotic arm interacts with the environment, collects training data, and computes from the next state the magnetic field strength at the end-effector coordinates in the target and obstacle fields; after standardization and normalization the magnetic field reward function is obtained, specifically comprising the following steps:
Step 3.1: initialize the rotation angles of the three joints of the arm to zero and read the coordinates of the end-effector; randomly set the positions of the target object and the obstacle and read the coordinates of their center points in the world coordinate system. This yields the initial state observation.
Step 3.2: based on the current state observation s and the policy network μθ(s), the arm outputs an action, noise is added to obtain a, and after interacting with the environment the next state s′ and the original reward r are obtained. After verifying that the three joint angles of the next state lie within their corresponding working ranges, the arm is moved to the next state. In this embodiment the working ranges of the three joints are [-90°, 90°], [0°, 85°] and [-10°, 90°] respectively, and the original reward function of the environment is set accordingly (a sketch of one plausible form is given below).
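The numerical definition of the original reward is not reproduced in the text available here; the sketch below shows one plausible sparse form consistent with the termination conditions of step S4 (reaching the target, or colliding with the obstacle or the ground). All reward values are assumptions.

```python
def original_reward(reached_target, collided, step_penalty=-0.01,
                    success_bonus=1.0, collision_penalty=-1.0):
    """Hypothetical sparse original reward r; the numerical values are illustrative only."""
    if reached_target:
        return success_bonus
    if collided:
        return collision_penalty
    return step_penalty

print(original_reward(False, False), original_reward(True, False))
```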
Step 3.3: transform the end-effector coordinates of the next state from the world coordinate system into the magnetic field coordinate systems of the target magnet and the obstacle magnet. This embodiment takes the target magnet as an example; the coordinate transformation for the obstacle magnet is obtained in the same way.
Assume that in the next state the end-effector has known coordinates in the world coordinate system, that the translation of the origin of the target magnetic field coordinate system relative to the origin of the world coordinate system is (Tx, Ty, Tz), and that the rotation angles of the target magnetic field coordinate system relative to the world coordinate system about the x-, y- and z-axes are θx, θy and θz, with positive directions following the right-hand rule. The coordinates of the end-effector in the target magnetic field coordinate system are then obtained by applying this translation and the rotation transformation matrices of the coordinate system about the x-, y- and z-axes to its world coordinates.
Step 3.4: compute the magnetic field strengths at the end-effector coordinates of the next state in the target field and the obstacle field, and standardize them.
Assume that the environment contains one target object and n obstacles (n = 1 in this embodiment), with a magnetic field strength calculation function for the target magnet and for each obstacle magnet, and with the end-effector coordinates expressed in the field coordinate system of each magnet; the magnetic field strengths at the end-effector position in the target field and in each obstacle field can then be computed.
Because the range of the field strengths obtained with different calculation methods generally differs, the computed values need to be standardized. The present invention introduces a magnetic field strength replay pool in which the values are stored and uses the mean and standard deviation of the field strengths currently in the pool to map each computed value onto a standard Gaussian distribution, i.e., by subtracting the mean and dividing by the standard deviation,
where the result is the standardized field strength. In this embodiment, the size of the magnetic field strength replay pool is 10^6.
Step 3.5: compute the joint magnetic field strength of the target and obstacle magnets and normalize it to obtain the magnetic field reward function.
In the present invention the task of the robotic arm is to move its end-effector to the target object while avoiding the obstacle, so the joint magnetic field strength of the target and obstacle magnets is defined such that
the "attraction" of the target magnet on the end-effector and the combined "repulsion" of all n obstacle magnets carry equal weight. According to the characteristics of the field strength distribution, the field strength near the target tends to positive infinity and the field strength near an obstacle tends to negative infinity. To keep the joint field strength within a reasonable range while preserving its distribution, the present invention normalizes it with the Softsign function, Softsign(x) = x / (1 + |x|), and defines the output as the magnetic field reward function rM.
Step S4: use the DPBA algorithm to convert the magnetic field reward function into a potential-based shaping reward function and store it in the experience replay pool together with the training data. The DPBA algorithm, proposed in "Harutyunyan A, Devlin S, Vrancx P, et al. Expressing arbitrary reward functions as potential-based advice [C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2015, 29(1).", converts a reward function given by an arbitrary expert into the form required by potential-based reward shaping, so that the optimal policy invariance theorem is satisfied. This step specifically comprises:
Step 4.1: define the potential function neural network of the DPBA algorithm as Φψ(s, a); its input is the state observation and the action of the arm, its output is the potential value of the current state-action pair, and ψ denotes the network parameters. In this embodiment the potential function network has two hidden layers with 256 nodes each and ReLU activations; its parameters are updated with the Adam optimizer at a learning rate of 10^-4. The loss function used to update the potential function network is defined as the squared error between the network output Φψ(s, a) and a gradient-free "label value" y,
where y is expressed as:
y = -rM + γΦψ(s′, a′)
where rM is the magnetic field reward function obtained in step 3.5 and γ is the discount factor. The parameters of the potential function network are updated by gradient descent on this loss,
where η is the learning rate of the potential function network update.
Step 4.2: from the parameters ψ before the update and the parameters ψ′ after the update, the potential-based shaping reward function is computed as:
fM = γΦψ′(s′, a′) - Φψ(s, a)
When the potential function Φψ(s, a) is initialized to zero and updated in this way until convergence, the magnetic field reward function is fully converted into a potential-based shaping reward function, i.e., fM = rM.
The shaping reward fM is combined with the original reward r obtained in step 3.2, and (s, a, r + fM, s′) is stored in the experience replay pool as one training sample for the subsequent training of the reinforcement learning algorithm. By the optimal policy invariance theorem, the optimal policy learned by the algorithm from the reward function r + fM is identical to the optimal policy learned from the original reward function r. Steps 3.2 to 4.2 are repeated until the end-effector reaches the target, the arm hits the obstacle or the ground, or the set maximum number of time steps (200 in this embodiment) is reached.
Step S5: sample a batch of data from the experience replay pool and use the reinforcement learning algorithm to train the optimal policy with which the arm avoids the obstacle and reaches the target in a dynamic environment, specifically comprising the following steps:
Step 5.1: randomly sample a batch of N samples (S, A, R + FM, S′) from the experience replay pool, with (si, ai, ri + fiM, si+1) denoting a single training sample.
Step 5.2: compute the loss function used to update the parameters of the value function network as the mean squared error, over the batch, between Qφ(si, ai) and a gradient-free "label value" yi,
where yi is expressed as yi = ri + fiM + γQφ′(si+1, μθ′(si+1)), with Qφ′ and μθ′ denoting the target networks.
The parameters of the value function network are updated by gradient descent on this loss,
where β is the learning rate of the value function network update.
Step 5.3: compute the loss function used to update the parameters of the policy network as the negative of the batch-average value Qφ(si, μθ(si)).
The parameters of the policy network are updated by gradient descent on this loss,
where α is the learning rate of the policy network update.
Step 5.4: softly update the parameters of the target networks, φ′ ← τφ + (1 - τ)φ′ and θ′ ← τθ + (1 - τ)θ′.
Step 5.5: repeat steps 5.1 to 5.4 a total of K times to finish the current round. Repeat steps S3 to S5 until the algorithm has fully converged, yielding the optimal policy network with which the robotic arm avoids the obstacle and reaches the target in a dynamic environment.
In this embodiment, the training performance of the method disclosed in the present invention is compared with that of similar algorithms in the simulation environment of Fig. 1; the results are shown in Fig. 4. The figure shows that, during learning, the per-episode success rate of the magnetic field-based reward shaping method is clearly higher than that of the original reward and of distance-based reward shaping, which indicates that the disclosed method effectively improves the learning efficiency of the reinforcement learning algorithm in robotic arm motion control tasks.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the inventive concept of the present invention, all of which fall within the scope of protection of the present invention.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210705509.0A CN115179280B (en) | 2022-06-21 | 2022-06-21 | Magnetic field-based reward shaping method for reinforcement learning mechanical arm control |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210705509.0A CN115179280B (en) | 2022-06-21 | 2022-06-21 | Magnetic field-based reward shaping method for reinforcement learning mechanical arm control |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115179280A true CN115179280A (en) | 2022-10-14 |
| CN115179280B CN115179280B (en) | 2025-07-22 |
Family
ID=83514905
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210705509.0A Active CN115179280B (en) | 2022-06-21 | 2022-06-21 | Magnetic field-based reward shaping method for reinforcement learning mechanical arm control |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115179280B (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117140527A (en) * | 2023-09-27 | 2023-12-01 | 中山大学·深圳 | A robotic arm control method and system based on deep reinforcement learning algorithm |
| CN118123816A (en) * | 2024-02-18 | 2024-06-04 | 东莞理工学院 | Deep reinforcement learning robot arm motion planning method, system and storage medium based on constraint model |
| CN120326639A (en) * | 2025-06-19 | 2025-07-18 | 北京中联国成科技有限公司 | A robot high-precision anti-collision system and method |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104914874A (en) * | 2015-06-09 | 2015-09-16 | 长安大学 | Unmanned aerial vehicle attitude control system and method based on self-adaption complementation fusion |
| US20190104994A1 (en) * | 2017-10-09 | 2019-04-11 | Vanderbilt University | Robotic capsule system with magnetic actuation and localization |
| CN110764416A (en) * | 2019-11-11 | 2020-02-07 | 河海大学 | Humanoid robot gait optimization control method based on deep Q network |
| CN111515961A (en) * | 2020-06-02 | 2020-08-11 | 南京大学 | Reinforcement learning reward method suitable for mobile mechanical arm |
| CN113889737A (en) * | 2021-09-30 | 2022-01-04 | 西华大学 | Substrate integrated waveguide parameter optimization method and structure based on reinforcement learning |
- 2022-06-21: Application CN202210705509.0A filed; granted as CN115179280B (status: Active)
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104914874A (en) * | 2015-06-09 | 2015-09-16 | 长安大学 | Unmanned aerial vehicle attitude control system and method based on self-adaption complementation fusion |
| US20190104994A1 (en) * | 2017-10-09 | 2019-04-11 | Vanderbilt University | Robotic capsule system with magnetic actuation and localization |
| CN110764416A (en) * | 2019-11-11 | 2020-02-07 | 河海大学 | Humanoid robot gait optimization control method based on deep Q network |
| CN111515961A (en) * | 2020-06-02 | 2020-08-11 | 南京大学 | Reinforcement learning reward method suitable for mobile mechanical arm |
| CN113889737A (en) * | 2021-09-30 | 2022-01-04 | 西华大学 | Substrate integrated waveguide parameter optimization method and structure based on reinforcement learning |
Non-Patent Citations (2)
| Title |
|---|
| Chen Liqun: "The Higher One Looks Up, the Deeper One Drills: A Review of Spacecraft Attitude Dynamics", Journal of Shangqiu Normal University, no. 4, 30 December 1997 (1997-12-30) * |
| Wei Zhihua: "Research on Mobile Robot Navigation and Environment State Detection Based on Reinforcement Learning", Master's thesis, 15 January 2007 (2007-01-15) * |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117140527A (en) * | 2023-09-27 | 2023-12-01 | 中山大学·深圳 | A robotic arm control method and system based on deep reinforcement learning algorithm |
| CN117140527B (en) * | 2023-09-27 | 2024-04-26 | 中山大学·深圳 | Mechanical arm control method and system based on deep reinforcement learning algorithm |
| CN118123816A (en) * | 2024-02-18 | 2024-06-04 | 东莞理工学院 | Deep reinforcement learning robot arm motion planning method, system and storage medium based on constraint model |
| CN120326639A (en) * | 2025-06-19 | 2025-07-18 | 北京中联国成科技有限公司 | A robot high-precision anti-collision system and method |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115179280B (en) | 2025-07-22 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN115179280A (en) | Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning | |
| CN110750096B (en) | Collision avoidance planning method for mobile robots based on deep reinforcement learning in static environment | |
| Juang et al. | Wall-following control of a hexapod robot using a data-driven fuzzy controller learned through differential evolution | |
| CN106096729B (en) | A kind of depth-size strategy learning method towards complex task in extensive environment | |
| Bai et al. | Path planning of autonomous mobile robot in comprehensive unknown environment using deep reinforcement learning | |
| CN116533234A (en) | Multi-axis hole assembly method and system based on layered reinforcement learning and distributed learning | |
| CN110362081B (en) | Mobile robot path planning method | |
| CN114888801A (en) | Mechanical arm control method and system based on offline strategy reinforcement learning | |
| Toan et al. | Mapless navigation with deep reinforcement learning based on the convolutional proximal policy optimization network | |
| CN116922391B (en) | Autonomous learning and optimizing method for spatial mechanical arm skills | |
| Luo et al. | Calibration-free monocular vision-based robot manipulations with occlusion awareness | |
| CN111309035A (en) | Multi-robot cooperative movement and dynamic obstacle avoidance method, device, equipment and medium | |
| Yan et al. | Path planning for mobile robot's continuous action space based on deep reinforcement learning | |
| CN119200410A (en) | Obstacle avoidance trajectory planning and adaptive tracking control method for tower cranes for intelligent construction | |
| CN116852347A (en) | A state estimation and decision control method for autonomous grasping of non-cooperative targets | |
| CN119388427B (en) | Method and system for controlling the stable motion trajectory of a robotic arm when grasping a large thin plate | |
| Luo et al. | Balance between efficient and effective learning: Dense2sparse reward shaping for robot manipulation with environment uncertainty | |
| Ding et al. | Magnetic field-based reward shaping for goal-conditioned reinforcement learning | |
| Kumar et al. | Kinematic control of a redundant manipulator using an inverse-forward adaptive scheme with a KSOM based hint generator | |
| Fang et al. | Quadrotor navigation in dynamic environments with deep reinforcement learning | |
| CN118536684A (en) | Multi-agent path planning method based on deep reinforcement learning | |
| Duan et al. | Learning from demonstrations: An intuitive VR environment for imitation learning of construction robots | |
| Sang et al. | Motion planning of space robot obstacle avoidance based on DDPG algorithm | |
| Toan et al. | Environment exploration for mapless navigation based on deep reinforcement learning | |
| Shen et al. | Energy-Efficient Motion Planning and Control for Robotic Arms via Deep Reinforcement Learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |