
CN115202359A - Unmanned ship path planning method based on reinforcement learning and rapid expansion of random tree - Google Patents


Info

Publication number
CN115202359A
Authority
CN
China
Prior art keywords
unmanned ship
path planning
state
reward
random
Prior art date
Legal status
Pending
Application number
CN202210898057.2A
Other languages
Chinese (zh)
Inventor
肖玲君
姜璐璐
白巧
朱曼
文元桥
王理征
Current Assignee
Wuhan Yuchi Detection Technology Co ltd
Wuhan University of Technology WUT
Original Assignee
Wuhan Yuchi Detection Technology Co ltd
Wuhan University of Technology WUT
Priority date
Filing date
Publication date
Application filed by Wuhan Yuchi Detection Technology Co ltd, Wuhan University of Technology WUT filed Critical Wuhan Yuchi Detection Technology Co ltd
Priority to CN202210898057.2A priority Critical patent/CN115202359A/en
Publication of CN115202359A publication Critical patent/CN115202359A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/0206Control of position or course in two dimensions specially adapted to water vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a path planning method for unmanned surface vehicles based on reinforcement learning and a rapidly-exploring random tree (RRT). The method first constructs a decision model for unmanned-ship path planning based on a Markov decision process; it then models the navigation environment of the vessel and determines an initial position and a target position. Taking the initial position as the root node, an initial random tree is constructed; based on the constructed decision model, a Q-learning algorithm selects the expansion points of the random tree, and the tree is updated with the selected expansion points until a path from the initial position to the target position is found. During expansion-point selection, random sampling points are screened according to the state obtained from interaction with the environment model, and the value function is updated according to the state and reward of the sampling points. The method improves adaptability to the environment and the accuracy of path planning.

Description

A Path Planning Method for Unmanned Surface Vehicles Based on Reinforcement Learning and Rapidly-Exploring Random Trees

Technical Field

The present invention relates to the technical field of ship path planning, and in particular to a path planning method for unmanned surface vehicles (USVs) based on reinforcement learning and rapidly-exploring random trees.

Background

A surface unmanned vehicle is a small unmanned vessel capable of autonomous navigation and intelligent operation. Compared with conventional manned ships, USVs have the advantages of smaller hull size, more flexible maneuvering, and lower cost. Path planning has been a core topic of USV research since such vehicles first appeared: the USV interacts with its surrounding environment, collects information about the current water area in advance, performs preliminary path planning based on the collected information, and then uses on-board sensors to gather dynamic environmental information during navigation for local path planning and collision avoidance. Path planning for a USV thus means finding, from the collected environmental information, an optimal path from the start point to the target point and reaching the target safely.

Therefore, when facing complex environments, optimizing existing path planning algorithms is crucial. At present such optimization relies mainly on constraint conditions and the design of the objective function. However, existing planning methods adapt poorly to the environment, so the resulting paths are unsatisfactory.

Summary of the Invention

The present invention provides a path planning method for unmanned surface vehicles based on reinforcement learning and rapidly-exploring random trees, to solve, or at least partially solve, the technical problem in the prior art that poor adaptability to the environment leads to poor path planning results.

To solve the above technical problem, the present invention discloses a path planning method for unmanned surface vehicles based on reinforcement learning and rapidly-exploring random trees, comprising:

S1: constructing a decision model for USV path planning based on a Markov decision process, the decision model comprising states and rewards;

S2: modeling the navigation environment of the USV, and determining an initial position and a target position;

S3: taking the initial position as the root node, constructing an initial random tree; based on the constructed decision model for USV path planning, selecting expansion points of the random tree with the Q-learning algorithm and updating the tree with the selected expansion points until a path from the initial position to the target position is found; during expansion-point selection, screening random sampling points according to the state obtained from interaction with the environment model, and updating the value function according to the state and reward of the random sampling points.

In one embodiment, the decision model of step S1 is:

(S, A, O, T, r, P_sa)

where S = {s_1, s_2, …, s_t} denotes the state space and s_t the state at time t; A = {a_1, a_2, …, a_t} denotes the action space and a_t ∈ A the action executed at time t; O denotes the observation space, T the observation transition function, r the reward function, and P_sa the state transition function.

In one embodiment, step S2 comprises:

S21: uniformly dividing the water area in which the USV navigates into grid cells of equal size;

S22: judging whether each cell is occupied by a navigational obstacle, and dividing the cells into obstacle-free (free) cells and obstacle cells; the environment model consists of the free cells and the obstacle cells;

S23: normalizing the water-depth values of the free cells;

S24: determining the initial position and the target position of the USV.

In one embodiment, step S3 comprises:

S31: initializing the state s of the USV and the value function Q(s, a) for each of its actions a;

S32: screening random sampling points according to the state obtained from interaction with the environment model, and searching nearby nodes to obtain a new state and reward;

S33: updating the value function with Q-learning according to:

Q(s_{t+1}, a_{t+1}) ← Q(s_{t-1}, a_{t-1}) + α·[r + γ·max_{a_t} Q(s_t, a_t) − Q(s_{t-1}, a_{t-1})]

where s_t is the state at time t, a_t the action selected at time t, r the reward fed back in the current state, α the learning rate, γ the discount factor, and max the maximum operator; Q(s_{t-1}, a_{t-1}) denotes the value function at time t−1, Q(s_t, a_t) the value function at time t, ← denotes the update, and Q(s_{t+1}, a_{t+1}) the updated value function, i.e., the value function at time t+1;

S34: judging whether the constraint conditions are satisfied; if so, proceeding to S35, otherwise returning to S32;

S35: expanding the node selected in S34 that satisfies the constraints into a new node and updating the random tree.

In one embodiment, after step S35 the method further comprises:

judging whether the target point has been reached; if so, backtracking through the nodes to complete the path planning; if not, returning to S32.

In one embodiment, the method further comprises setting an induction reward and a path turning-angle reward, where the induction reward is calculated as:

r_1 = k·r_o + (1 − k)·r_g

where r_1 is the induction reward, k the induction coefficient, r_o the obstacle-avoidance action reward, and r_g the goal-directed action reward;

the path turning-angle reward is calculated as:

[formula for the turning-angle reward r_2 in terms of the coefficient e and the angle θ, given only as an image in the original document]

where r_2 is the turning-angle reward, e the turning-angle coefficient, and θ the turning angle.

Compared with the prior art, the advantages and beneficial technical effects of the present invention are as follows:

In the path planning method for unmanned surface vehicles based on reinforcement learning and rapidly-exploring random trees provided by the present invention, the Q-learning selection mechanism of reinforcement learning is used to choose the expansion points of the random tree. During expansion-point selection, random sampling points are screened according to the state obtained from interaction with the environment model, and the value function is updated according to the state and reward of the sampling points, so that qualified expansion points are filtered out. Because the interaction with the environment model is taken into account, the adaptability of the path to the environment is improved; repeating this process yields an optimal path from the initial position to the target position and improves the path planning result.

Brief Description of the Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the path planning method for unmanned surface vehicles based on reinforcement learning and rapidly-exploring random trees in an embodiment of the present invention.

FIG. 2 is a schematic diagram of node expansion with the RRT algorithm in an embodiment of the present invention.

FIG. 3 is a schematic diagram of the turning-angle constraint in an embodiment of the present invention.

FIG. 4 is a schematic diagram of a result of the path planning method adopted in an embodiment of the present invention.

Detailed Description of the Embodiments

By providing a Q-learning-based RRT path search method for surface unmanned vehicles, the present invention solves the technical problems of poor environmental adaptability and poor path planning results in the prior art.

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention provides a path planning method for unmanned surface vehicles based on reinforcement learning and rapidly-exploring random trees, comprising:

S1: constructing a decision model for USV path planning based on a Markov decision process, the decision model comprising states and rewards;

S2: modeling the navigation environment of the USV, and determining an initial position and a target position;

S3: taking the initial position as the root node, constructing an initial random tree; based on the constructed decision model for USV path planning, selecting expansion points of the random tree with the Q-learning algorithm and updating the tree with the selected expansion points until a path from the initial position to the target position is found; during expansion-point selection, screening random sampling points according to the state obtained from interaction with the environment model, and updating the value function according to the state and reward of the random sampling points.

In a specific implementation, the rapidly-exploring random tree corresponds to the RRT algorithm. Random sampling points are first screened according to the state obtained from interaction with the environment model, and the value function is updated according to the state and reward of the sampling points; an expansion point is then obtained, i.e., a sub-path from the current node to the expansion point is found. This process is repeated until a path from the initial node to the target node is found.

In one embodiment, the decision model of step S1 is:

(S, A, O, T, r, P_sa)

where S = {s_1, s_2, …, s_t} denotes the state space and s_t the state at time t; A = {a_1, a_2, …, a_t} denotes the action space and a_t ∈ A the action executed at time t; O denotes the observation space, T the observation transition function, r the reward function, and P_sa the state transition function.

In a specific implementation, a general Markov decision process consists of (S, A, r, P_sa). To handle the uncertainty caused by partially unknown environments and by re-planning, the present invention formulates the USV path planning problem as a partially observable Markov decision process, defined by the six-tuple (S, A, O, T, r, P_sa).

The return G_t in reinforcement learning is the discounted sum of the rewards received from time t onward:

G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … = Σ_{k=0}^{∞} γ^k·r_{t+k+1}

where the discount factor γ ∈ (0, 1).
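
As a purely illustrative aid (not part of the patent), a minimal Python sketch of this discounted return, truncated to a finite reward sequence, could look as follows:

    def discounted_return(rewards, gamma=0.9):
        """Approximate G_t = sum_k gamma^k * r_{t+k+1} for a finite reward list.

        Truncating the infinite sum at the end of the list is an assumption
        made only for this sketch."""
        g = 0.0
        for k, r in enumerate(rewards):
            g += (gamma ** k) * r
        return g

    # Example: rewards observed over three steps with discount factor 0.9
    print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62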

In one embodiment, step S2 comprises:

S21: uniformly dividing the water area in which the USV navigates into grid cells of equal size;

S22: judging whether each cell is occupied by a navigational obstacle, and dividing the cells into obstacle-free (free) cells and obstacle cells; the environment model consists of the free cells and the obstacle cells;

S23: normalizing the water-depth values of the free cells;

S24: determining the initial position and the target position of the USV.

By modeling the navigation environment of the USV in this way, the environment model is constructed and the initial and target positions are determined.
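
A minimal sketch of the grid environment of steps S21-S24 is given below (Python). The NumPy representation, the example shoal, and the chosen start and goal cells are illustrative assumptions rather than values given in the patent:

    import numpy as np

    def build_grid_environment(obstacle_mask, depth_map):
        """Grid environment model of steps S21-S24.

        obstacle_mask : 2-D bool array, True where a cell holds a navigational obstacle.
        depth_map     : 2-D float array of water depths for the same cells.
        Returns an occupancy grid (1 = obstacle cell, 0 = free cell) and the
        water depth of the free cells normalized to [0, 1] (step S23)."""
        occupancy = obstacle_mask.astype(np.int8)            # S22: free vs obstacle cells
        free = ~obstacle_mask
        depth = depth_map.astype(float)
        d_min, d_max = depth[free].min(), depth[free].max()
        depth_norm = np.zeros_like(depth)
        if d_max > d_min:
            depth_norm[free] = (depth[free] - d_min) / (d_max - d_min)
        return occupancy, depth_norm

    # Example: a 10 x 10 water area divided into equal cells (S21) with a
    # rectangular shoal; start and goal are chosen on free cells (S24).
    obstacles = np.zeros((10, 10), dtype=bool)
    obstacles[4:6, 3:7] = True
    depths = np.random.uniform(2.0, 15.0, size=(10, 10))
    grid, depth_n = build_grid_environment(obstacles, depths)
    start, goal = (0, 0), (9, 9)
    assert grid[start] == 0 and grid[goal] == 0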

In one embodiment, step S3 comprises:

S31: initializing the state s of the USV and the value function Q(s, a) for each of its actions a;

S32: screening random sampling points according to the state obtained from interaction with the environment model, and searching nearby nodes to obtain a new state s_t and reward r;

S33: updating the value function with Q-learning according to:

Q(s_{t+1}, a_{t+1}) ← Q(s_{t-1}, a_{t-1}) + α·[r + γ·max_{a_t} Q(s_t, a_t) − Q(s_{t-1}, a_{t-1})]

where s_t is the state at time t, a_t the action selected at time t, r the reward fed back in the current state, α the learning rate, γ the discount factor, and max the maximum operator; Q(s_{t-1}, a_{t-1}) denotes the value function at time t−1, Q(s_t, a_t) the value function at time t, ← denotes the update, and Q(s_{t+1}, a_{t+1}) the updated value function, i.e., the value function at time t+1 (a tabular sketch of this update is given after these steps);

S34: judging whether the constraint conditions are satisfied; if so, proceeding to S35, otherwise returning to S32;

S35: expanding the node selected in S34 that satisfies the constraints into a new node and updating the random tree.
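
The value-function update of S33 can be sketched as a tabular Q-learning step, written here in the standard form in which the entry for the previous state-action pair is updated; the dictionary-based Q table and the 8-connected action set are assumptions made only for illustration:

    from collections import defaultdict

    Q = defaultdict(float)   # Q[(state, action)] -> value, zero-initialized as in S31

    def q_update(Q, s_prev, a_prev, reward, s_new, actions, alpha=0.1, gamma=0.9):
        """One Q-learning step:
        Q(s_prev, a_prev) += alpha * (reward + gamma * max_a Q(s_new, a) - Q(s_prev, a_prev))."""
        best_next = max(Q[(s_new, a)] for a in actions)
        td_error = reward + gamma * best_next - Q[(s_prev, a_prev)]
        Q[(s_prev, a_prev)] += alpha * td_error
        return Q[(s_prev, a_prev)]

    # Example: 8-connected grid moves as the action space (an assumption)
    ACTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    q_update(Q, s_prev=(2, 3), a_prev=(0, 1), reward=1.0, s_new=(2, 4), actions=ACTIONS)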

Referring to FIG. 2, which is a schematic diagram of node expansion with the RRT algorithm in an embodiment of the present invention: x_rand is the random sampling point, x_init the initial node, x_near the nearest neighboring node, and x_goal the target node.
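
To make the expansion step of FIG. 2 concrete, a minimal sketch of finding x_near and steering toward x_rand by a fixed step length is given below; the Euclidean metric and the step size of 1.0 are assumptions for illustration:

    import math

    def nearest_node(tree, x_rand):
        """Return the tree node x_near closest to the sample x_rand (Euclidean distance)."""
        return min(tree, key=lambda n: math.dist(n, x_rand))

    def steer(x_near, x_rand, step=1.0):
        """Move from x_near toward x_rand by at most `step`, producing the new node."""
        d = math.dist(x_near, x_rand)
        if d <= step:
            return x_rand
        t = step / d
        return (x_near[0] + t * (x_rand[0] - x_near[0]),
                x_near[1] + t * (x_rand[1] - x_near[1]))

    # The tree is stored as {node: parent}; x_init is the root (parent None).
    x_init, x_goal = (0.0, 0.0), (9.0, 9.0)
    tree = {x_init: None}
    x_rand = (3.0, 4.0)
    x_near = nearest_node(tree, x_rand)
    x_new = steer(x_near, x_rand, step=1.0)
    tree[x_new] = x_near   # expand the random tree with the new node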

Referring to FIG. 4, which is a schematic diagram of a result of the path planning method adopted in an embodiment of the present invention.

In one embodiment, after step S35 the method further comprises:

judging whether the target point has been reached; if so, backtracking through the nodes to complete the path planning; if not, returning to S32.

Referring to FIG. 1, which is a flowchart of the path planning method for unmanned surface vehicles based on reinforcement learning and rapidly-exploring random trees in an embodiment of the present invention. Initializing the expansion tree (the random search tree) means taking the initial position as its root node; updating the Q table means updating the value function.
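
Putting the pieces together, the overall flow of FIG. 1 (sample and screen a point, expand toward it, update the Q table, check the constraints, and stop when the goal is reached) might be sketched roughly as follows. The goal-bias probability, the step length, the collision test is_free, and the distance-based reward are all illustrative assumptions and not the patent's actual parameters or reward design:

    import math
    import random
    from collections import defaultdict

    ALPHA, GAMMA, STEP, GOAL_BIAS = 0.1, 0.9, 1.0, 0.2   # illustrative constants

    def plan_path(x_init, x_goal, is_free, bounds, max_iters=5000):
        """Rough sketch of a Q-learning guided RRT loop in the spirit of FIG. 1."""
        tree = {x_init: None}                 # S3: initial position as the root node
        Q = defaultdict(float)
        for _ in range(max_iters):
            # sample, with a goal bias playing the role of the induction factor
            if random.random() < GOAL_BIAS:
                x_rand = x_goal
            else:
                x_rand = (random.uniform(0, bounds[0]), random.uniform(0, bounds[1]))
            if not is_free(x_rand):           # screen the sample against the environment
                continue
            x_near = min(tree, key=lambda n: math.dist(n, x_rand))
            d = math.dist(x_near, x_rand)
            if d == 0.0:
                continue
            t = min(1.0, STEP / d)
            x_new = (x_near[0] + t * (x_rand[0] - x_near[0]),
                     x_near[1] + t * (x_rand[1] - x_near[1]))
            r = -math.dist(x_new, x_goal)     # assumed reward: closer to the goal is better
            # update the Q table (S33), treating the chosen new node as the action
            Q[(x_near, x_new)] += ALPHA * (r + GAMMA * Q[(x_new, x_new)] - Q[(x_near, x_new)])
            if not is_free(x_new):            # S34: constraint check
                continue
            tree[x_new] = x_near              # S35: expand and update the random tree
            if math.dist(x_new, x_goal) <= STEP:
                path, node = [x_goal], x_new  # goal reached: backtrack through the nodes
                while node is not None:
                    path.append(node)
                    node = tree[node]
                return list(reversed(path))
        return None                           # no path found within the iteration budget

    # Example usage in an obstacle-free square area
    path = plan_path((0.0, 0.0), (9.0, 9.0), is_free=lambda p: True, bounds=(10.0, 10.0))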

In one embodiment, the method further comprises setting an induction reward and a path turning-angle reward, where the induction reward is calculated as:

r_1 = k·r_o + (1 − k)·r_g

where r_1 is the induction reward, k the induction coefficient, r_o the obstacle-avoidance action reward, and r_g the goal-directed action reward;

the path turning-angle reward is calculated as:

[formula for the turning-angle reward r_2 in terms of the coefficient e and the angle θ, given only as an image in the original document]

where r_2 is the turning-angle reward, e the turning-angle coefficient, and θ the turning angle.

Specifically, regarding the reward mechanism of the decision model: to reduce the randomness of the RRT algorithm and strengthen its tendency toward the goal, the present invention introduces an induction factor to improve the random nodes (sampling points). The core idea of the induction factor is that, during the expansion of the random tree, x_rand = x_goal with a certain probability (the random node is set equal to the goal node), so that the tree is more inclined to search in the direction of the goal.
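
The induction factor can be realized as a simple goal-biased sampler, sketched below; the bias probability of 0.2 is an illustrative assumption, not a value taken from the patent:

    import random

    def sample_with_induction(x_goal, bounds, p_goal=0.2):
        """With probability p_goal return the goal itself (x_rand = x_goal);
        otherwise draw a uniform random point from the planning area."""
        if random.random() < p_goal:
            return x_goal
        return (random.uniform(0, bounds[0]), random.uniform(0, bounds[1]))

    x_rand = sample_with_induction(x_goal=(9.0, 9.0), bounds=(10.0, 10.0))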

Further, this embodiment introduces a path turning-angle reward. Adjusting the vessel's course by a small angle improves its movement efficiency and saves resources, whereas a large angle reduces reliability; therefore a large turning angle yields a low reward and a small turning angle a high reward. FIG. 3 is a schematic diagram of the turning-angle constraint in an embodiment of the present invention.

The rewards obtained from the induction reward and the turning-angle reward are used to optimize the selection of expansion points, which further improves the accuracy of path planning.
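
One possible implementation of the two reward terms is sketched below. The induction reward follows the formula r_1 = k·r_o + (1 − k)·r_g given above; the exponential decay used for the turning-angle reward is only an assumed form consistent with "small angle, high reward; large angle, low reward", since the patent's own r_2 formula is provided only as an image:

    import math

    def induction_reward(r_obstacle, r_goal, k=0.5):
        """r_1 = k * r_o + (1 - k) * r_g, as defined in the patent."""
        return k * r_obstacle + (1 - k) * r_goal

    def turning_angle_reward(theta, e=1.0):
        """Assumed form: the reward decays exponentially with the turning angle
        theta (radians), so small course changes earn more than large ones."""
        return math.exp(-e * abs(theta))

    r1 = induction_reward(r_obstacle=0.3, r_goal=0.8, k=0.4)
    r2 = turning_angle_reward(theta=math.pi / 6, e=1.0)
    total_reward = r1 + r2   # combined reward fed to the Q-table update (assumed)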

Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications falling within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the embodiments of the present invention without departing from their spirit and scope. If such modifications and variations fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include them as well.

Claims (6)

1. An unmanned ship path planning method based on reinforcement learning and a rapidly-exploring random tree, characterized by comprising the following steps:
S1: constructing a decision model for unmanned ship path planning based on a Markov decision process, wherein the decision model comprises states and rewards;
S2: modeling the navigation environment of the unmanned ship, and determining an initial position and a target position;
S3: taking the initial position as a root node and constructing an initial random tree; selecting expansion points of the initial random tree with a Q-learning algorithm based on the constructed decision model for unmanned ship path planning, and updating the random tree according to the selected expansion points until a path from the initial position to the target position is found; during the selection of expansion points, screening random sampling points according to a state of interaction with an environment model, and updating a value function according to the state and reward of the random sampling points.
2. The unmanned ship path planning method based on reinforcement learning and a rapidly-exploring random tree according to claim 1, wherein the decision model of step S1 is:
(S, A, O, T, r, P_sa)
wherein S = {s_1, s_2, …, s_t} denotes the state space and s_t the state at time t; A = {a_1, a_2, …, a_t} denotes the action space and a_t ∈ A the action performed at time t; O denotes the observation space, T the observation transition function, r the reward function, and P_sa the state transition function.
3. The unmanned ship path planning method based on reinforcement learning and a rapidly-exploring random tree according to claim 1, wherein step S2 comprises:
S21: uniformly dividing the water area in which the unmanned ship sails into grid cells of the same size;
S22: judging whether each cell is occupied by an obstacle, and dividing the cells into obstacle-free (free) cells and obstacle cells, the environment model consisting of the free cells and the obstacle cells;
S23: normalizing the water-depth values of the free cells;
S24: determining the initial position and the target position of the unmanned ship.
4. The unmanned ship path planning method based on reinforcement learning and a rapidly-exploring random tree according to claim 1, wherein step S3 comprises:
S31: initializing the state s of the unmanned ship and the value function Q(s, a) corresponding to any action a of the unmanned ship;
S32: screening random sampling points according to the state of interaction with the environment model, and searching nearby nodes to obtain a new state and reward;
S33: updating the value function with Q-learning according to:
Q(s_{t+1}, a_{t+1}) ← Q(s_{t-1}, a_{t-1}) + α·[r + γ·max_{a_t} Q(s_t, a_t) − Q(s_{t-1}, a_{t-1})]
wherein s_t is the state at time t, a_t the action selected at time t, r the reward fed back in the current state, α the learning rate, γ the discount factor, and max the maximum operator; Q(s_{t-1}, a_{t-1}) denotes the value function at time t−1, Q(s_t, a_t) the value function at time t, ← denotes the update, and Q(s_{t+1}, a_{t+1}) the updated value function, i.e., the value function at time t+1;
S34: judging whether the constraint conditions are met; if so, executing the next step S35, and if not, re-executing S32;
S35: expanding the node selected in S34 that meets the constraints into a new node, and updating the random tree.
5. The unmanned ship path planning method based on reinforcement learning and a rapidly-exploring random tree according to claim 4, wherein after step S35 the method further comprises:
judging whether the target point has been reached; if so, backtracking through the nodes to complete the path planning; if not, re-executing S32.
6. The unmanned ship path planning method based on reinforcement learning and a rapidly-exploring random tree according to claim 1, further comprising setting an induction reward and a path turning-angle reward, wherein the induction reward is calculated by the formula:
r_1 = k·r_o + (1 − k)·r_g
wherein r_1 is the induction reward, k the induction coefficient, r_o the obstacle-avoidance action reward, and r_g the goal-directed action reward;
the path turning-angle reward is calculated by the formula:
[formula for the turning-angle reward r_2 in terms of the coefficient e and the angle θ, given only as an image in the original document]
wherein r_2 is the turning-angle reward, e the turning-angle coefficient, and θ the turning angle.
CN202210898057.2A 2022-07-28 2022-07-28 Unmanned ship path planning method based on reinforcement learning and rapid expansion of random tree Pending CN115202359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210898057.2A CN115202359A (en) 2022-07-28 2022-07-28 Unmanned ship path planning method based on reinforcement learning and rapid expansion of random tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210898057.2A CN115202359A (en) 2022-07-28 2022-07-28 Unmanned ship path planning method based on reinforcement learning and rapid expansion of random tree

Publications (1)

Publication Number Publication Date
CN115202359A true CN115202359A (en) 2022-10-18

Family

ID=83583439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210898057.2A Pending CN115202359A (en) 2022-07-28 2022-07-28 Unmanned ship path planning method based on reinforcement learning and rapid expansion of random tree

Country Status (1)

Country Link
CN (1) CN115202359A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117634548A (en) * 2024-01-26 2024-03-01 西南科技大学 Unmanned aerial vehicle behavior tree adjustment and optimization method and system
CN118689220A (en) * 2024-06-20 2024-09-24 广东海洋大学 Reinforcement learning path planning method and device for unmanned surface vehicle based on RRT algorithm
CN119270868A (en) * 2024-10-17 2025-01-07 广东海洋大学 Path planning method and device based on RRT path intelligent guided unmanned ship reinforcement learning


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108762281A (en) * 2018-06-08 2018-11-06 哈尔滨工程大学 It is a kind of that intelligent robot decision-making technique under the embedded Real-time Water of intensified learning is associated with based on memory
KR20220065226A (en) * 2020-11-13 2022-05-20 주식회사 플라잎 Apparatus and method for determining optimal path of end effector of robot
CN113074738A (en) * 2021-04-06 2021-07-06 武汉理工大学 Hybrid intelligent path planning method and device based on Dyna framework
CN113848911A (en) * 2021-09-28 2021-12-28 华东理工大学 Mobile robot global path planning method based on Q-learning and RRT
CN114564039A (en) * 2022-01-25 2022-05-31 北京航空航天大学 A Track Planning Method Based on Deep Q-Network and Fast Search Random Tree Algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李跃芳;邵光明;武智强;苑茹滨;: "复杂水域环境中无人艇航行规划方法研究", 中国造船, no. 1, 30 August 2020 (2020-08-30), pages 70 - 78 *
邹启杰 等: "基于强化学习的快速探索随机树特殊环境中路径重规划算法", 控制理论与应用, vol. 37, no. 8, 31 August 2020 (2020-08-31), pages 1737 - 1748 *


Similar Documents

Publication Publication Date Title
CN115202359A (en) Unmanned ship path planning method based on reinforcement learning and rapid expansion of random tree
CN110244759B (en) A time optimal path planning method for underwater robot based on interval optimization
Cao et al. Target search control of AUV in underwater environment with deep reinforcement learning
EP3593288B1 (en) Training action selection neural networks using look-ahead search
WO2017161632A1 (en) Cleaning robot optimal target path planning method based on model learning
CN112033410A (en) Mobile robot environment map construction method, system and storage medium
CN115056222B (en) A robot path planning method based on improved RRT algorithm
Yu et al. Learning and sampling-based informative path planning for AUVs in ocean current fields
CN111105442B (en) Switching type target tracking method
CN115293623A (en) Training method and device for production scheduling model, electronic equipment and medium
CN113687657A (en) Method and storage medium for dynamic path planning of multi-agent formations
CN118244765A (en) A method, system, computer device and storage medium for unmanned ship path planning based on reinforcement learning
CN115686031B (en) An informative path planning method for AUV based on learning and sampling
CN119148745B (en) Low-altitude aircraft conflict resolution method based on deep reinforcement learning and time constraints
CN119958571A (en) Intelligent path planning method for unmanned ships based on the fusion of deep learning and A-star algorithm
CN118083808B (en) Dynamic path planning method and device for crown block system
CN112987742B (en) Robot path planning method and planning system
CN119440002A (en) A robot dynamic obstacle avoidance method and system based on sequence prediction and dynamic mask
CN116523154B (en) Model training method, route planning method and related devices
CN113064422A (en) Path Planning Method of Autonomous Underwater Vehicle Based on Reinforcement Learning of Dual Neural Networks
CN110779526B (en) Path planning method, device and storage medium
CN118778675A (en) Unmanned aerial vehicle obstacle avoidance method, device and electronic equipment
CN116520851B (en) Object round-up methods and devices
CN117804455A (en) Unmanned surface vessel path planning method based on improved A-Star algorithm
CN116661456A (en) AGV anti-collision path planning method based on A3C

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination