CN114676909A - A charging path planning method for unmanned vehicles based on deep reinforcement learning - Google Patents
A charging path planning method for unmanned vehicles based on deep reinforcement learning
- Publication number
- CN114676909A (application CN202210302308.6A)
- Authority
- CN
- China
- Prior art keywords
- network
- node
- model
- gapn
- charging
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06Q10/047—Optimisation of routes or paths, e.g. travelling salesman problem
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Human Resources & Organizations (AREA)
- Health & Medical Sciences (AREA)
- Economics (AREA)
- Data Mining & Analysis (AREA)
- Strategic Management (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Business, Economics & Management (AREA)
- General Engineering & Computer Science (AREA)
- Tourism & Hospitality (AREA)
- Artificial Intelligence (AREA)
- Marketing (AREA)
- Mathematical Physics (AREA)
- Molecular Biology (AREA)
- Primary Health Care (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Operations Research (AREA)
- Water Supply & Treatment (AREA)
- Public Health (AREA)
- Quality & Reliability (AREA)
- Entrepreneurship & Innovation (AREA)
- Game Theory and Decision Science (AREA)
- Development Economics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
Description
Technical Field
The invention belongs to the field of machine-learning applications and specifically relates to model building and training, with the quality of the model reflected in how well it solves the practical problem; in particular, it relates to a charging path planning method for unmanned charging vehicles based on deep reinforcement learning.
Background Art
In most wireless sensor networks, each sensor node is deployed at a fixed location and is powered by a battery or by ambient energy sources such as wind or solar power. Such energy supplies are uncontrollable and unstable, so the working efficiency and stability of the wireless sensor network suffer greatly and the lifetime of the network is sharply reduced.
To solve the energy-supply problem of traditional wireless sensor networks, schemes that use mobile devices to replenish the power of sensor nodes have become a research hotspot. This charging approach is based on the idea of mobile nodes: introducing mobile nodes improves the flexibility of the network and offers new approaches to many tasks in sensor networks, such as node power replenishment, data collection, target tracking, and node localization, which greatly improves the fault tolerance of the network and prolongs its lifetime. For a charging vehicle to select the sensor nodes that need energy sensibly, balance the energy consumption across nodes, carry out charging tasks efficiently, and prevent nodes in the network from "dying" of power exhaustion, the charging vehicle must be scheduled effectively and its route planned reasonably.
Path planning is one of the classic combinatorial optimization problems. In recent years, research has shown that deep learning and reinforcement learning have unique advantages in solving such problems. Models based on deep reinforcement learning typically map the problem into a high-dimensional space, where they discover and learn the relationships among the variables, and, combined with a graph neural network's handling of graph structure, obtain high-quality solutions in ways that traditional operations-research methods cannot. How to design the network architecture and loss function and how to optimize the training process are therefore the main technical difficulties in applying deep reinforcement learning to combinatorial optimization.
Summary of the Invention
The technical problem to be solved by the present invention is to design a reasonable node-charging route plan for the charging vehicle under constraints such as node positions, power-consumption rates, charging rate, and node battery capacity, so as to prevent the "death" of nodes in the wireless sensor network caused by power exhaustion while minimizing the path cost incurred by the unmanned vehicle, thereby extending the lifetime of the wireless sensor network and improving its availability.
The technical scheme of the present invention is as follows:
A charging path planning method for unmanned charging vehicles based on deep reinforcement learning comprises the following steps:
Step (1): collect the information of the nodes S = {s_0, s_1, s_2, ..., s_n} in the wireless sensor network, including each node's position and urgency;
Step (2): build the deep-reinforcement-learning model GAPN, comprising a point encoder, a graph encoder, a decoder, and an attention mechanism module;
Step (3): based on the collected node information, generate a training dataset and a test dataset for training the GAPN model;
Step (4): set the training objectives to minimizing the node death rate and minimizing the charging vehicle's path cost, define the loss function, and train GAPN with the reinforcement-learning actor-critic algorithm to obtain the final model;
Step (5): input the node information S = {s_0, s_1, s_2, ..., s_n} of the wireless sensor network into the GAPN model; the model returns a plan π = {s_0, s'_1, s'_2, ..., s'_n} in which the charging vehicle visits all nodes, where π is a reordering of S;
Step (6): combine a heuristic clustering method with GAPN to solve path planning in scenarios with multiple charging vehicles, i.e., a scheme for dispatching the charging vehicles and planning their routes when the wireless sensor network has several of them;
Step (7): modify the graph-topology embedding module of the model and exploit the model's generalization ability to solve large-scale traveling salesman problems, i.e., train the model on small-scale traveling-salesman datasets and then apply it to large-scale instances.
As a further preference of the present invention, the actor-critic algorithm used in step (4) is specifically:
Step 4.1: build an actor network and a critic network on the basis of the GAPN architecture, where θ denotes the actor network's parameters and θ_v the critic network's. The two networks differ in decoding: the actor samples from the output probability distribution, while the critic decodes greedily;
Step 4.2: given a training batch X of B instances, apply the current actor network to each instance x_i to obtain the final path cost L(π_i | x_i), and apply the critic network to obtain its baseline estimate b_{θ_v}(x_i);
Step 4.3: use the values obtained in step 4.2 to compute the batch policy gradients of the two networks. The actor's policy gradient is $g_\theta = \frac{1}{B}\sum_{i=1}^{B}\big(L(\pi_i|x_i)-b_{\theta_v}(x_i)\big)\nabla_\theta \log p_\theta(\pi_i|x_i)$, and the critic's gradient is $g_{\theta_v} = \frac{1}{B}\sum_{i=1}^{B}\nabla_{\theta_v}\big(b_{\theta_v}(x_i)-L(\pi_i|x_i)\big)^2$;
Step 4.4: update the parameters of the two networks by gradient descent, with update rules θ ← θ − lr·g_θ and θ_v ← θ_v − lr·g_{θ_v}, where lr is the learning rate;
Step 4.5: repeat steps 4.2 to 4.4 for T iterations, where T is the number of training epochs set before training (a code sketch follows this list).
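A minimal PyTorch sketch of steps 4.1 to 4.5 is given below. It assumes actor(x) returns the cost L(π|x) and the summed log-probability of a sampled tour, and critic(x) returns the baseline estimate b_{θ_v}(x); these signatures, and the use of plain SGD for the gradient-descent updates, are illustrative assumptions rather than details taken from the patent.

```python
import torch

def train(actor, critic, dataset, T, lr=1e-4):
    opt_a = torch.optim.SGD(actor.parameters(), lr=lr)   # theta updates
    opt_c = torch.optim.SGD(critic.parameters(), lr=lr)  # theta_v updates
    for epoch in range(T):                               # step 4.5: T epochs
        for x in dataset:                                # a batch of B instances
            cost, log_p = actor(x)                       # L(pi_i|x_i), sampled decoding
            baseline = critic(x)                         # b_{theta_v}(x_i), greedy
            advantage = (cost - baseline).detach()       # step 4.2
            actor_loss = (advantage * log_p).mean()      # step 4.3: g_theta
            critic_loss = (baseline - cost.detach()).pow(2).mean()  # g_theta_v
            opt_a.zero_grad(); actor_loss.backward(); opt_a.step()  # step 4.4
            opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
```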
In addition, the present invention proposes a scheme that exploits model generalization to solve large-scale traveling salesman problems. For large-scale instances, say 500, 1000, or more nodes, training a model on datasets of the same size is very time-consuming. The invention therefore exploits the generalization ability of graph neural networks and modifies the way the whole point set is encoded: the adjacency matrix of the point set becomes the input of the graph neural network, so that the model can be trained on datasets of small-scale instances and then applied to large-scale scenarios.
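As an illustration of this modified input, the sketch below builds a row-normalised adjacency matrix from node coordinates; the k-nearest-neighbour sparsification is an assumption, since the text only specifies that the adjacency matrix of the point set is fed to the graph neural network.

```python
import torch

def knn_adjacency(coords: torch.Tensor, k: int = 10) -> torch.Tensor:
    """coords: (n, 2) node positions -> (n, n) row-normalised adjacency."""
    dist = torch.cdist(coords, coords)                      # pairwise distances
    idx = dist.topk(k + 1, largest=False).indices[:, 1:]    # k nearest, skip self
    adj = torch.zeros_like(dist)
    adj.scatter_(1, idx, 1.0)                               # mark k-NN edges
    return adj / k                                          # row-normalise

```

Because such a layer operates on an adjacency matrix of whatever size it is given, a model trained on, say, 50-node instances can be evaluated directly on 500- or 1000-node instances.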
As a further preference, the present invention proposes a path-planning solution for scenarios with multiple charging vehicles, as shown in Fig. 2, i.e., a scheme for dispatching the charging vehicles and planning their routes when the wireless sensor network has several of them. Suppose m charging vehicles are available to handle the charging of the node set S; the steps are as follows.
The heuristic clustering algorithm used in step (6) is specifically: (1) use the heuristic clustering algorithm to divide the node set S into m subsets, each assigned to one of the m charging vehicles for charging.
(2) Process each of the m task subsets with the trained GAPN model to obtain the charging route of each of the m vehicles.
Step 6.1: define m subsets R = {S_1, S_2, ..., S_m} to hold the nodes each charging vehicle will visit, and define e(S_j) as the sum of the urgency of all nodes in S_j;
Step 6.2: sort the input node set S by the nodes' initial urgency e_i^0 in descending order;
Step 6.3: place the first m nodes of S into the m subsets S_1, S_2, ..., S_m, one per subset;
Step 6.4: sort the collection R in increasing order of e(S_j) and traverse the remaining nodes of S. For the currently visited node s_i, if S_j contains no node s_k whose opening-time difference R_{i,k} with s_i is greater than the travel time Γ_{i,k}, or whose deadline difference D_{i,k} is smaller than the travel time Γ_{i,k}, place s_i into S_j; otherwise try another subset until the condition is satisfied;
Step 6.5: obtain the final m subsets R = {S_1, S_2, ..., S_m}, to be executed by the m charging vehicles (see the sketch after this list).
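A minimal sketch of steps 6.1 to 6.5 follows. It assumes each node object carries an initial urgency e0 and window bounds r and d, and that travel_time(a, b) returns Γ between two nodes; these names, and the fallback used when no subset qualifies, are illustrative assumptions.

```python
def partition(nodes, m, travel_time):
    nodes = sorted(nodes, key=lambda s: s.e0, reverse=True)      # step 6.2
    subsets = [[s] for s in nodes[:m]]                           # step 6.3
    for s in nodes[m:]:                                          # step 6.4
        # try subsets in increasing order of total urgency e(S_j)
        ordered = sorted(subsets, key=lambda S: sum(n.e0 for n in S))
        for sub in ordered:
            # accept s only if no member k violates the window/travel test
            if all(abs(s.r - k.r) <= travel_time(s, k) and
                   abs(s.d - k.d) >= travel_time(s, k) for k in sub):
                sub.append(s)
                break
        else:
            ordered[0].append(s)     # no subset qualified (assumed fallback)
    return subsets                                               # step 6.5
```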
The beneficial effects of the present invention are:
In the scenario of charging the nodes of a wireless sensor network, a good charging schedule is crucial for the mobile charging vehicle that replenishes node power. Taking into account constraints such as the nodes' power-consumption rates, the vehicle's charging rate, and the path cost, the present invention provides a decision model based on deep reinforcement learning that produces better charging routes for mobile charging vehicles, thereby extending the lifetime of the wireless sensor network.
Description of the Drawings
Fig. 1 is a schematic diagram of the proposed deep learning model solving the path planning problem.
Fig. 2 is a schematic diagram of the proposed solution to the multi-vehicle path planning problem.
Fig. 3 is a schematic diagram of the framework when the invention is applied to a real system.
Fig. 4 shows the mobile-agent charging vehicle to which the invention is applied in practice.
Fig. 5 is a line drawing of the mobile-agent charging vehicle designed in Fig. 4.
Fig. 6 is a flow chart of the steps of the invention.
Detailed Description
The present invention is further clarified below with reference to the accompanying drawings and specific embodiments. It should be understood that the following embodiments only illustrate the invention and are not intended to limit its scope. The words "front", "rear", "left", "right", "upper" and "lower" used in the following description refer to directions in the drawings, and the words "inner" and "outer" refer to directions toward and away from the geometric center of a particular part, respectively.
As shown in Figs. 1 and 6, this embodiment applies the deep learning model to the path planning problem: the input contains the data of all nodes, and the model's encoding and decoding process returns the final route plan. The embodiment specifically comprises the following steps:
(1) Define the node set of the wireless sensor network as S = {s_0, s_1, s_2, ..., s_n}, where s_0 is the base station, i.e., the point from which the charging vehicle departs and to which it finally returns. Every node s_i other than the base station is represented by a four-dimensional vector s_i = {c_i, [r_i, d_i], T_i(t), e_i(t)}, where c_i = [x_i, y_i] is the two-dimensional coordinate of s_i; [r_i, d_i] is the time window of s_i, with r_i the opening time and d_i the deadline, and s_i can only be charged within this window; e_i(t) and T_i(t) denote the urgency of s_i at time t and the charging time it requires, respectively; the larger e_i(t), the lower the battery of s_i and the longer it takes to charge.
(2) The charging vehicle moves through the network at a constant speed v, and the travel time from node s_i to s_j is Γ_{i,j} = L_{i,j} / v, where L_{i,j} is the distance between s_i and s_j.
(3) The urgency e_i(t) of node s_i varies with time t as e_i(t) = e_i^0 + f(t), where e_i^0 is the initial urgency of s_i and f(t) is an increasing term; e_i(t) has an upper bound e_max, and e_i(t) ≤ e_max always holds. The time window [r_i, d_i] of node s_i depends on the node's initial urgency and power-consumption rate; the deadline d_i of s_i is then the time at which e_i(t) would reach e_max (see the sketch below).
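A sketch of the node model in (1) to (3) is given below. The linear growth f(t) = α·t, the value of e_max, and the deadline formula are assumptions; the text only requires f(t) to be increasing and e_i(t) to stay below e_max.

```python
import math
from dataclasses import dataclass

E_MAX = 10.0                      # urgency cap e_max (illustrative value)

@dataclass
class Node:
    x: float                      # coordinate c_i = [x_i, y_i]
    y: float
    r: float                      # time-window opening r_i
    d: float                      # time-window deadline d_i
    e0: float                     # initial urgency e_i^0
    alpha: float                  # growth rate of f(t) (power-consumption rate)

def urgency(node: Node, t: float) -> float:
    """e_i(t) = e_i^0 + f(t), capped at e_max."""
    return min(node.e0 + node.alpha * t, E_MAX)

def travel_time(a: Node, b: Node, v: float = 1.0) -> float:
    """Gamma_{i,j} = L_{i,j} / v for a vehicle moving at constant speed v."""
    return math.hypot(a.x - b.x, a.y - b.y) / v

def deadline(node: Node) -> float:
    """d_i: the time at which e_i(t) would reach e_max (assumed form)."""
    return (E_MAX - node.e0) / node.alpha
```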
(4) The charging vehicle incurs a penalty if it visits node s_i outside its time window. The penalty is a function of the arrival time τ_i, the time at which the charging vehicle reaches s_i, scaled by a constant σ that depends on the scale of the problem.
(5) A route plan π produced by the model is evaluated by a function that trades the total path cost off against the time-window penalties, where β is a negative constant whose magnitude depends on the relative sizes of the path cost and the node-death cost. One plausible form of both the penalty and the score is sketched below.
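The sketch below gives one plausible form of the window penalty in (4) and the plan score in (5), reusing the Node helpers above. The linear penalty shape and the way β combines the two cost terms are assumptions; only τ_i, σ, and β (a negative constant) come from the text.

```python
def window_penalty(node: Node, tau: float, sigma: float) -> float:
    """Penalty for reaching node s_i at time tau outside [r_i, d_i]."""
    return sigma * max(node.r - tau, 0.0, tau - node.d)

def plan_score(route: list, beta: float, sigma: float, v: float = 1.0) -> float:
    """Score of a plan pi: beta-weighted path cost minus window penalties."""
    t, dist, pen = 0.0, 0.0, 0.0
    for a, b in zip(route, route[1:]):
        leg = travel_time(a, b, v)
        t += leg                            # arrival time at b
        dist += leg * v                     # distance travelled on this leg
        pen += window_penalty(b, t, sigma)
    return beta * dist - pen                # beta < 0: longer paths score lower
```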
(6) Build the deep reinforcement learning model GAPN. GAPN uses an encoder-decoder framework. The encoder encodes every node s_i, splitting s_i = {c_i, [r_i, d_i], T_i(t), e_i(t)} into a static part g_i = {c_i, [r_i, d_i]} and a dynamic part δ_i(t) = {T_i(t), e_i(t)}, each of which is encoded separately to obtain the encoded vectors.
(7) The two parts are embedded by relu-activated linear transformations, where relu is a nonlinear activation function and W_1, W_2 and b are trainable parameters that are learned during training.
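A PyTorch sketch of the point encoder in (6) and (7): the static and dynamic parts are embedded separately by relu-activated linear layers. The hidden size and the exact arrangement of W_1, W_2 and b are assumptions.

```python
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.static = nn.Linear(4, hidden)    # W_1: g_i = (x_i, y_i, r_i, d_i)
        self.dynamic = nn.Linear(2, hidden)   # W_2: delta_i(t) = (T_i(t), e_i(t))

    def forward(self, g: torch.Tensor, delta: torch.Tensor):
        # returns the encoded vectors of the static and dynamic features
        return torch.relu(self.static(g)), torch.relu(self.dynamic(delta))
```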
(8) The graph neural network in GAPN encodes the whole node set S of the wireless sensor network. The encoding network has N layers, each of which updates every node's embedding by combining it with an aggregate of its neighbours' embeddings (one such layer is sketched below).
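The layer equation itself did not survive in the text, so the sketch below shows one plausible form of such a layer, in the spirit of graph pointer networks: each layer mixes a node's own transformed embedding with a relu-activated aggregate of its neighbours. The mixing weight γ, the aggregation, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GraphLayer(nn.Module):
    def __init__(self, hidden: int = 128, gamma: float = 0.5):
        super().__init__()
        self.theta = nn.Linear(hidden, hidden)   # transform of the node itself
        self.phi = nn.Linear(hidden, hidden)     # transform of the aggregate
        self.gamma = gamma

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (n, hidden) node embeddings; adj: (n, n) row-normalised adjacency
        neighbour = torch.relu(self.phi(adj @ h))           # aggregate neighbours
        return self.gamma * self.theta(h) + (1 - self.gamma) * neighbour
```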
(9) Based on the attention mechanism, the decoder computes a weight for each hidden-layer vector produced by the encoder, and the node with the highest weight is the next point to be visited. In the weight computation, o_i denotes the context encoding on which the decoding is conditioned, which contains the information encoded by the graph neural network, and q_i denotes the encoding vector of node s_i (a sketch follows).
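Since the weight formula was lost, the sketch below uses the standard additive (tanh) scoring of pointer networks, which matches the described inputs q_i and o_i; the exact scoring function in the patent is an assumption.

```python
import torch
import torch.nn as nn

class Pointer(nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.wq = nn.Linear(hidden, hidden, bias=False)   # projects node encodings
        self.wo = nn.Linear(hidden, hidden, bias=False)   # projects the context
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, q: torch.Tensor, o: torch.Tensor, visited: torch.Tensor):
        # q: (n, hidden) node encodings q_i; o: (hidden,) context o_i;
        # visited: (n,) bool mask of already-visited nodes
        scores = self.v(torch.tanh(self.wq(q) + self.wo(o))).squeeze(-1)
        scores = scores.masked_fill(visited, float('-inf'))
        return torch.softmax(scores, dim=-1)              # visiting probabilities
```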
(10) After the complete neural network model has been obtained, train it with the actor-critic method from reinforcement learning to obtain the final GAPN model.
(11) Input the node set of the problem to be solved into the GAPN model in the form of four-dimensional features; the model returns the order in which all nodes are finally visited, i.e., the route plan of the charging vehicle.
Fig. 3 shows the system structure when the model system is deployed to a cluster of agents (mobile unmanned vehicles). The central base station processes the collected node information through the GAPN module, returns the corresponding task instructions, and sends them to the mobile agents; each mobile agent uses the control instructions to drive its motion controller and execute the corresponding task. Specifically, each agent, shown in Figs. 4 and 5, runs an agent program. The agent program is responsible for communicating with the agent's motion controller, sending it control instructions and collecting and pre-processing the data sensed by the agent; it also makes task decisions and performs the corresponding operations according to the task instructions sent from the base-station center.
The technical means disclosed by the present invention are not limited to those disclosed in the above embodiments, but also include technical solutions formed by any combination of the above technical features.
Claims (3)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210302308.6A CN114676909B (en) | 2022-03-25 | 2022-03-25 | Unmanned vehicle charging path planning method based on deep reinforcement learning |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210302308.6A CN114676909B (en) | 2022-03-25 | 2022-03-25 | Unmanned vehicle charging path planning method based on deep reinforcement learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114676909A true CN114676909A (en) | 2022-06-28 |
| CN114676909B CN114676909B (en) | 2024-04-09 |
Family
ID=82076143
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210302308.6A Active CN114676909B (en) | 2022-03-25 | 2022-03-25 | Unmanned vehicle charging path planning method based on deep reinforcement learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114676909B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115759915A (en) * | 2022-10-21 | 2023-03-07 | 东南大学 | Multi-constraint Vehicle Path Planning Method Based on Attention Mechanism and Deep Reinforcement Learning |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021007812A1 (en) * | 2019-07-17 | 2021-01-21 | 深圳大学 | Deep neural network hyperparameter optimization method, electronic device and storage medium |
| CN112788560A (en) * | 2020-12-18 | 2021-05-11 | 昆明理工大学 | Space-time charging scheduling method based on deep reinforcement learning |
| CN113159432A (en) * | 2021-04-28 | 2021-07-23 | 杭州电子科技大学 | Multi-agent path planning method based on deep reinforcement learning |
| WO2021248607A1 (en) * | 2020-06-10 | 2021-12-16 | 深圳大学 | Deep reinforcement learning-based taxi dispatching method and system |
- 2022-03-25: application CN202210302308.6A filed in China; granted as CN114676909B (status: Active)
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021007812A1 (en) * | 2019-07-17 | 2021-01-21 | 深圳大学 | Deep neural network hyperparameter optimization method, electronic device and storage medium |
| WO2021248607A1 (en) * | 2020-06-10 | 2021-12-16 | 深圳大学 | Deep reinforcement learning-based taxi dispatching method and system |
| CN112788560A (en) * | 2020-12-18 | 2021-05-11 | 昆明理工大学 | Space-time charging scheduling method based on deep reinforcement learning |
| CN113159432A (en) * | 2021-04-28 | 2021-07-23 | 杭州电子科技大学 | Multi-agent path planning method based on deep reinforcement learning |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115759915A (en) * | 2022-10-21 | 2023-03-07 | 东南大学 | Multi-constraint Vehicle Path Planning Method Based on Attention Mechanism and Deep Reinforcement Learning |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114676909B (en) | 2024-04-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Mao et al. | A comparison of deep reinforcement learning models for isolated traffic signal control | |
| CN112902969B (en) | A path planning method for unmanned aerial vehicles in the process of data collection | |
| CN108594858B (en) | Unmanned aerial vehicle searching method and device for Markov moving target | |
| CN113223305B (en) | Multi-intersection traffic light control method, system and storage medium based on reinforcement learning | |
| CN110341690A (en) | A PHEV Energy Management Method Based on Deterministic Policy Gradient Learning | |
| CN118759846A (en) | Multi-agent deep reinforcement learning path planning method based on improved A* heuristic | |
| CN111596668B (en) | Mobile robot anthropomorphic path planning method based on reverse reinforcement learning | |
| CN110488842A (en) | A kind of track of vehicle prediction technique based on two-way kernel ridge regression | |
| CN111752304B (en) | UAV data collection method and related equipment | |
| CN117744994B (en) | Patrol unmanned aerial vehicle-aircraft nest distribution scheduling method based on goblet sea squirt algorithm | |
| CN109862532A (en) | Layout optimization method and system of multi-sensor nodes for rail transit condition monitoring | |
| CN115187056A (en) | A Multi-Agent Cooperative Resource Allocation Method Considering the Principle of Fairness | |
| Dong et al. | Double ant colony algorithm based on dynamic feedback for energy-saving route planning for ships | |
| Wang et al. | Cooperative motion planning for persistent 3d visual coverage with multiple quadrotor uavs | |
| CN119047673A (en) | Heterogeneous system cooperative delivery path optimization method under constraint of road network and energy consumption | |
| CN115759199A (en) | Multi-robot environment exploration method and system based on hierarchical graph neural network | |
| CN115907248A (en) | Multi-robot unknown environment path planning method based on geometric neural network | |
| Sun et al. | Leveraging digital twin and DRL for collaborative context offloading in C-V2X autonomous driving | |
| CN117610763A (en) | Electric automobile charging planning method, device, computer equipment and storage medium | |
| Li et al. | Adaptive multi-agent deep mixed reinforcement learning for traffic light control | |
| CN118444646A (en) | AGV scheduling method on topological graph based on self-attention mechanism reinforcement learning | |
| CN116739466A (en) | Vehicle path planning method for distribution center based on multi-agent deep reinforcement learning | |
| Han et al. | Dynamic collaborative charging algorithm for mobile and static nodes in Industrial Internet of Things | |
| Zhu et al. | Adaptive Broad Deep Reinforcement Learning for Intelligent Traffic Light Control | |
| CN114676909B (en) | Unmanned vehicle charging path planning method based on deep reinforcement learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |