CN116628448B

CN116628448B - Sensor management method based on deep reinforcement learning in extended goals

Info

Publication number: CN116628448B
Application number: CN202310609986.1A
Authority: CN
Inventors: 陈辉; 张虹芸; 张文旭; 张新迪; 田博; 罗欣; 缪嘉伟
Original assignee: Lanzhou University of Technology
Current assignee: Lanzhou University of Technology
Priority date: 2023-05-26
Filing date: 2023-05-26
Publication date: 2023-11-28
Anticipated expiration: 2043-05-26
Also published as: CN116628448A

Abstract

The application discloses a sensor management method based on deep reinforcement learning in an extended target, which comprises the following steps: modeling is conducted on an ellipse expansion target, and a virtual interaction environment with deep reinforcement learning is constructed according to an expansion target filtering algorithm; establishing a TD3 algorithm intelligent agent; the virtual interaction environment is interacted with the TD3 algorithm intelligent agent to obtain sensor control data, and the sensor control data is stored as a sample in an experience playback pool; based on the experience playback pool, extracting samples, training an intelligent agent of a TD3 algorithm, and deciding a sensor path planning optimal action through the trained intelligent agent; and (3) acting the optimal action on the sensor, and obtaining the sensor position after the sensor is subjected to state transfer, so as to obtain the sensor measurement value of the expansion target at the current moment, and carrying out filtering prediction and updating to carry out tracking estimation of the expansion target. The application optimizes the tracking effect of the ellipse expanding target as a whole.

Description

Sensor management method based on deep reinforcement learning in extended goals

技术领域Technical field

本发明涉及传感器智能管理技术领域，尤其涉及扩展目标中基于深度强化学习的传感器管理方法。The present invention relates to the technical field of sensor intelligent management, and in particular to a sensor management method based on deep reinforcement learning in extended targets.

背景技术Background technique

传感器管理是指通过控制传感器系统中的自由度以满足某种约束条件并优化某种性能度量，最终实现目标的过程。随着现代高分辨率传感器的出现，目标可以由多个量测源在某一时刻产生多个量测，因此可以估计目标更多的状态特征，如目标的形状轮廓，此类问题被称为扩展目标跟踪问题。通常传感器管理以优化目标跟踪性能为目的来获取最佳量测信息的方法有两大类，一类是基于任务论的传感器管理方法，此类方法通过具体的任务需求，例如状态的方差或关于目标状态分布的测度来制定相应的传感器控制策略。但此类方法难以满足多个任务需求同时存在的系统的要求。另外一类是基于信息论的传感器管理方法，通过两个概率密度函数之间的信息增益的某种度量来建立评价函数(如Kullback-Leibler散度、Rényi散度等)，进而在信息增益最大化的准则下求解传感器控制策略。基于信息论的传感器控制方法可以使包含多个任务的系统整体信息增益最大化。Sensor management refers to the process of ultimately achieving goals by controlling the degrees of freedom in a sensor system to meet certain constraints and optimize certain performance measures. With the emergence of modern high-resolution sensors, the target can produce multiple measurements from multiple measurement sources at a certain time, so more state characteristics of the target can be estimated, such as the shape profile of the target. This type of problem is called Extended target tracking issues. Generally, there are two main categories of sensor management methods to obtain the best measurement information for the purpose of optimizing target tracking performance. One is the sensor management method based on task theory. This method uses specific task requirements, such as the variance of the state or the related information. The measurement of the target state distribution is used to formulate the corresponding sensor control strategy. However, such methods are difficult to meet the requirements of systems where multiple task requirements exist simultaneously. The other type is a sensor management method based on information theory, which establishes an evaluation function (such as Kullback-Leibler divergence, Rényi divergence, etc.) through some measure of the information gain between two probability density functions, and then maximizes the information gain. Solve the sensor control strategy under the criterion. Sensor control methods based on information theory can maximize the overall information gain of a system containing multiple tasks.

传统目标跟踪中的传感器管理通常是在部分可观测马尔可夫决策过程的理论框架下进行研究的，传感器管理通常是在离散动作空间中进行的，因为在每次决策时都需要在建立的相应评价准则下，评价可实现的传感器控制方案中的所有动作。所以传统的方法无法处理动作空间急剧增加时造成的维度爆炸和存在计算复杂性的问题。Sensor management in traditional target tracking is usually studied under the theoretical framework of a partially observable Markov decision process. Sensor management is usually performed in a discrete action space, because each decision needs to be based on the established response. Under the evaluation criteria, evaluate all actions in the achievable sensor control scheme. Therefore, traditional methods cannot handle the dimension explosion and computational complexity problems caused by the sharp increase in the action space.

而近几年，深度强化学习是人工智能领域新的研究热点。深度强化学习与扩展目标跟踪问题的交叉融合，为传感器控制智能决策的实现提供了新的途径。深度Q网络(DeepQ-Network，DQN)算法，是深度强化学习领域的开创性工作。在DQN的基础上，又先后提出了Double DQN(DDQN)、Dueling DQN和Double Dueling DQN(D3QN)等一系列改进算法。尽管DQN及其改进算法都有不错的应用效果，但仍无法处理连续动作空间的问题。所以在深度强化学习领域出现了经典的连续控制算法，深度确定性策略梯度(DDPG)算法，是应用于复杂、连续控制的重要算法。但是DDPG算法存在Critic网络对Q值过估计的问题。In recent years, deep reinforcement learning has become a new research hotspot in the field of artificial intelligence. The cross-fusion of deep reinforcement learning and extended target tracking problems provides a new way to realize intelligent decision-making in sensor control. The Deep Q-Network (DQN) algorithm is a pioneering work in the field of deep reinforcement learning. On the basis of DQN, a series of improved algorithms such as Double DQN (DDQN), Dueling DQN and Double Dueling DQN (D3QN) have been proposed. Although DQN and its improved algorithms have good application effects, they still cannot handle the problem of continuous action space. Therefore, a classic continuous control algorithm has emerged in the field of deep reinforcement learning, the deep deterministic policy gradient (DDPG) algorithm, which is an important algorithm applied to complex and continuous control. However, the DDPG algorithm has the problem that the Critic network overestimates the Q value.

现有的目标跟踪领域的传感器管理方法中，选择传感器管理的任务为控制传感器平台的位置以优化目标跟踪的性能时，由于基于任务论和信息论这两类传感器管理方法，都需要在建立的特定任务优化或者一定的优化准则下进行研究，所以主要是基于离散的传感器动作空间进行控制决策，并且在每一次决策时都需要遍历整个动作空间，当需要考虑自由度空间中的所有待决策动作时，传统的传感器控制方法会面临维度爆炸而引起的效率急剧下降的问题，并且当需要决策的自由度维度更高时，传统的传感器控制方法将会束手无策。Among the existing sensor management methods in the field of target tracking, when the task of sensor management is to control the position of the sensor platform to optimize the performance of target tracking, due to the two types of sensor management methods based on task theory and information theory, both need to establish specific Research is conducted under task optimization or certain optimization criteria, so control decisions are mainly based on discrete sensor action spaces, and the entire action space needs to be traversed in each decision-making process. When all actions to be decided in the degree of freedom space need to be considered, , traditional sensor control methods will face the problem of sharp decline in efficiency caused by dimensionality explosion, and when higher degrees of freedom in decision-making are required, traditional sensor control methods will be helpless.

发明内容Contents of the invention

为解决上述技术问题，本发明提出了扩展目标中基于深度强化学习的传感器管理方法，将传统传感器管理决策空间由离散动作空间拓展到连续动作空间上，依据椭圆扩展目标跟踪估计效果建立联合优化扩展目标运动学状态和扩展状态(轮廓信息)为目标的科学的奖励回报机制，基于深度强化学习算法建立智能体学习最优控制策略，以人工智能的方式实现传感器控制的智能决策。In order to solve the above technical problems, the present invention proposes a sensor management method based on deep reinforcement learning in extended targets, which expands the traditional sensor management decision space from discrete action space to continuous action space, and establishes a joint optimization extension based on the elliptical extended target tracking estimation effect. The target kinematic state and extended state (contour information) are the scientific reward and return mechanism of the target. Based on the deep reinforcement learning algorithm, the intelligent agent learning optimal control strategy is established to realize intelligent decision-making of sensor control through artificial intelligence.

为实现上述目的，本发明提供了扩展目标中基于深度强化学习的传感器管理方法，包括：In order to achieve the above objectives, the present invention provides a sensor management method based on deep reinforcement learning among extended goals, including:

针对椭圆扩展目标进行建模，并根据扩展目标滤波算法构建与深度强化学习的虚拟交互环境；Model the elliptical extended target and build a virtual interactive environment with deep reinforcement learning based on the extended target filtering algorithm;

建立TD3算法智能体；Establish TD3 algorithm intelligent agent;

将所述虚拟交互环境与所述TD3算法智能体进行交互，获取传感器控制数据，并将所述传感器控制数据作为样本存放至经验回放池；基于所述经验回放池抽取样本，训练所述TD3算法智能体，通过训练后的智能体决策出传感器路径规划最优动作；The virtual interactive environment interacts with the TD3 algorithm agent to obtain sensor control data, and the sensor control data is stored as a sample in an experience playback pool; samples are extracted based on the experience playback pool to train the TD3 algorithm The agent uses the trained agent to decide the optimal action for sensor path planning;

将所述最优动作作用于传感器，所述传感器发生状态转移后获得传感器位置，由此获取当前时刻扩展目标传感器量测值，并进行滤波的预测和更新，进行扩展目标的跟踪估计。The optimal action is applied to the sensor, and the sensor position is obtained after state transition of the sensor, thereby obtaining the extended target sensor measurement value at the current moment, and filtering prediction and update are performed to perform tracking and estimation of the extended target.

优选地，针对所述椭圆扩展目标进行建模，包括：Preferably, modeling for the elliptical expansion target includes:

设定k时刻扩展目标跟踪的状态为：ζ_k＝(x_k,X_k)，其中，x_k表示目标的运动学状态，X_k表示目标的扩展状态；Set the state of extended target tracking at time k as: ζ _k = (x _k ,X _k ), where x _k represents the kinematic state of the target, and X _k represents the expanded state of the target;

进行建模的方法为：The approach to modeling is:

其中，w_k为零均值高斯过程噪声，v_k为零均值高斯量测噪声，x_s,k(π)为当前时刻传感器位置，为系统状态演化映射，/>为量测映射，x_k+1表示k+1时刻目标运动学状态，/>表示k时刻的多个量测值。Among them, w _k is the zero-mean Gaussian process noise, v _k is the zero-mean Gaussian measurement noise, x _s,k (π) is the sensor position at the current moment, Mapping system state evolution,/> For measurement mapping, x _k+1 represents the kinematic state of the target at k+1 moment,/> Represents multiple measurement values at time k.

优选地，所述k时刻扩展目标的扩展状态被建模为椭圆形状，用正定对称矩阵X_k描述为：Preferably, the expansion state of the expansion target at time k is modeled as an elliptical shape, which is described by a positive definite symmetric matrix X _k as:

其中，θ_k为椭圆形状方向角，σ_k,1和σ_k,2分别为椭圆形状的长轴和短轴。Among them, θ _k is the direction angle of the elliptical shape, σ _k,1 and σ _k,2 are the major axis and minor axis of the elliptical shape respectively.

优选地，根据所述扩展目标滤波算法构建与深度强化学习的虚拟交互环境，包括：Preferably, a virtual interactive environment with deep reinforcement learning is constructed according to the extended target filtering algorithm, including:

基于神经网络拟合所述深度强化学习中的价值函数和策略函数，采用深度强化学习算法通过探索与利用机制进行传感器控制，建立智能传感器控制系统，通过所述智能传感器控制系统构建所述虚拟交互环境；所述扩展目标滤波算法包括预测过程和更新过程。Based on the neural network to fit the value function and policy function in the deep reinforcement learning, the deep reinforcement learning algorithm is used to control the sensor through the exploration and utilization mechanism, an intelligent sensor control system is established, and the virtual interaction is constructed through the intelligent sensor control system Environment; the extended target filtering algorithm includes a prediction process and an update process.

优选地，所述预测过程为：Preferably, the prediction process is:

优选地，所述TD3算法智能体，包括：Preferably, the TD3 algorithm agent includes:

Actor网络：用于根据状态选择动作；Actor network: used to select actions based on status;

目标Actor网络：用于根据所述Actor网络获取的结果再次根据状态选择动作；Target Actor network: used to select actions based on the status again based on the results obtained by the Actor network;

Critic网络：用于对所述Actor网络选择的动作进行评价；Critic network: used to evaluate the actions selected by the Actor network;

目标Critic网络：用于根据所述Critic网络获取的结果再次对所述Actor网络选择的动作进行评价。Target Critic network: used to evaluate the action selected by the Actor network again based on the results obtained by the Critic network.

优选地，获取所述传感器控制数据，包括：Preferably, obtaining the sensor control data includes:

所述智能体在任一时刻，采取动作并作用于所述虚拟交互环境后，所述传感器从k时刻状态x_s,k转移到k+1时刻状态x_s,k+1，通过奖励函数进行评价获得奖励值R_k+1，然后所述智能体根据所述奖励值不断改进策略，最终学习到最优策略决策传感器每一时刻的动作。After the agent takes action and acts on the virtual interactive environment at any time, the sensor moves from the state x _s, k at time k to the state x _{s,k+1 at time k+1} , and is evaluated through the reward function The reward value R _k+1 is obtained, and then the agent continuously improves the strategy based on the reward value, and finally learns the optimal strategy to determine the actions of the sensor at each moment.

优选地，构建所述奖励函数的方法包括：Preferably, the method of constructing the reward function includes:

定义扩展目标k时刻先验概率分布和后验概率分布均服从多元高斯分布，获取先验概率分布和后验概率分布之间的高斯瓦瑟斯坦距离，基于所述高斯瓦瑟斯坦距离构建所述奖励函数。Define the extended target k time prior probability distribution and posterior probability distribution to obey the multivariate Gaussian distribution, obtain the Gauss Wasserstein distance between the prior probability distribution and the posterior probability distribution, and construct the above based on the Gauss Wasserstein distance reward function.

优选地，所述奖励函数为：Preferably, the reward function is:

其中，a_k,0表示传感器在当前时刻处于静止状态。Among them, a _k,0 indicates that the sensor is in a stationary state at the current moment.

与现有技术相比，本发明具有如下优点和技术效果：Compared with the existing technology, the present invention has the following advantages and technical effects:

本发明方法运用随机矩阵对椭圆扩展目标的扩展状态建模，可以对目标运动状态和扩展状态进行有效估计，然后采用类似于信息论的传感器管理方法中评价函数的设定构建应用于深度强化学习TD3算法中的奖励函数，此奖励函数综合考虑了对目标运动状态和轮廓信息(扩展状态)的联合优化，使用TD3算法对传感器在连续动作空间下进行有效控制后，与无传感器控制相比不仅可以对目标质心位置估计上更加准确，同时对目标轮廓信息的估计上也更加准确，所以在整体上优化了椭圆扩展目标的跟踪效果。The method of the present invention uses a random matrix to model the expanded state of the elliptical expanded target, which can effectively estimate the target motion state and expanded state, and then uses the setting of the evaluation function in a sensor management method similar to information theory to construct and apply it to deep reinforcement learning TD3 The reward function in the algorithm, this reward function comprehensively considers the joint optimization of the target motion state and contour information (extended state). After using the TD3 algorithm to effectively control the sensor in the continuous action space, compared with sensorless control, it can not only It is more accurate in estimating the position of the target center of mass, and at the same time, it is also more accurate in estimating the target contour information, so the tracking effect of the elliptical extended target is optimized as a whole.

附图说明Description of the drawings

构成本申请的一部分的附图用来提供对本申请的进一步理解，本申请的示意性实施例及其说明用于解释本申请，并不构成对本申请的不当限定。在附图中：The drawings that form a part of this application are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an improper limitation of this application. In the attached picture:

图1为本发明实施例中传感器控制轨迹示意图；Figure 1 is a schematic diagram of the sensor control trajectory in the embodiment of the present invention;

图2为本发明实施例的椭圆扩展目标半长轴及半短轴误差曲线图；Figure 2 is an error curve diagram of the semi-major axis and semi-minor axis of the elliptical expansion target according to the embodiment of the present invention;

图3为本发明实施例的质心估计误差曲线图；Figure 3 is a centroid estimation error graph according to the embodiment of the present invention;

图4为本发明实施例的长安器控制扩展目标GW距离示意图；Figure 4 is a schematic diagram of the extended target GW distance controlled by the Changan device according to the embodiment of the present invention;

图5为本发明实施例的扩展目标中基于深度强化学习的传感器管理方法流程图；Figure 5 is a flow chart of a sensor management method based on deep reinforcement learning in the expanded target of the embodiment of the present invention;

图6为本发明实施例的TD3算法智能体中各网络连接关系示意图。Figure 6 is a schematic diagram of the network connection relationships in the TD3 algorithm agent according to the embodiment of the present invention.

具体实施方式Detailed ways

需要说明的是，在不冲突的情况下，本申请中的实施例及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本申请。It should be noted that, as long as there is no conflict, the embodiments and features in the embodiments of this application can be combined with each other. The present application will be described in detail below with reference to the accompanying drawings and embodiments.

需要说明的是，在附图的流程图示出的步骤可以在诸如一组计算机可执行指令的计算机系统中执行，并且，虽然在流程图中示出了逻辑顺序，但是在某些情况下，可以以不同于此处的顺序执行所示出或描述的步骤。It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and, although a logical sequence is shown in the flowchart, in some cases, The steps shown or described may be performed in a different order than here.

本发明提出了扩展目标中基于深度强化学习的传感器管理方法，如图5，包括：The present invention proposes a sensor management method based on deep reinforcement learning in extended goals, as shown in Figure 5, including:

(1)扩展目标跟踪问题描述：(1) Extended target tracking problem description:

对扩展目标进行跟踪时，除开对目标质心运动状态的跟踪还同时跟踪目标的扩展状态即目标的形状随时间的演化.设定k时刻扩展目标跟踪的状态表示为：ζ_k＝(x_k,X_k)，其中，x_k表示目标的运动学状态，服从多元高斯分布。X_k表示目标的扩展状态，服从逆Wishart分布.对系统进行建模为：When tracking an extended target, in addition to tracking the motion state of the target center of mass, it also tracks the extended state of the target, that is, the evolution of the target's shape over time. The state of extended target tracking at time k is expressed as: ζ _k = (x _k , X _k ), where x _k represents the kinematic state of the target and obeys the multivariate Gaussian distribution. X _k represents the extended state of the target and obeys the inverse Wishart distribution. The system is modeled as:

其中，w_k为零均值高斯过程噪声，v_k为零均值高斯量测噪声。x_s,k(π)为当前时刻传感器位置，是系统状态演化映射，/>是量测映射。Among them, w _k is zero-mean Gaussian process noise, and v _k is zero-mean Gaussian measurement noise. x _s,k (π) is the sensor position at the current moment, is the system state evolution map,/> is a measurement mapping.

k时刻扩展目标的扩展状态被建模为椭圆形状，用正定对称矩阵X_k描述为：The extended state of the extended target at time k is modeled as an elliptical shape and is described by a positive definite symmetric matrix X _k as:

(2)扩展目标滤波算法：(2) Extended target filtering algorithm:

扩展目标滤波算法在贝叶斯滤波算法框架下实现，由预测过程和更新过程组成。其中每个过程又分为对运动状态和扩展状态的预测和更新：The extended target filtering algorithm is implemented under the framework of Bayesian filtering algorithm and consists of a prediction process and an update process. Each process is divided into prediction and update of motion state and extended state:

1)预测过程：1) Forecasting process:

对运动状态的一步预测，由于服从多元高斯分布，其均值和协方差矩阵如下：One-step prediction of motion status, since it obeys the multivariate Gaussian distribution, its mean and covariance matrix are as follows:

扩展状态的一步预测：One-step prediction of extended state:

v_k|k-1＝e^-T/τv_k-1|k-1 (6)v _k|k-1 ＝e ^-T/τ v _k-1|k-1 (6)

其中，v_k|k-1和V_k|k-1分别为k时刻根据k-1时刻后验预测得到的逆Wishart分布中的自由度和逆尺度矩阵，T表示采样时间，τ是时间衰减常数，d表示目标扩展状态的维度，v_k-1|k-1和V_k-1|k-1表示k-1时刻的后验即迭代更新后得到的k-1时刻的自由度和逆尺度矩阵，E[X_k|k-1]表示X_k|k-1的数学期望。Among them, v _k|k-1 and V _k|k-1 are respectively the degrees of freedom and the inverse scale matrix in the inverse Wishart distribution obtained at time k based on the posterior prediction at time k-1, T represents the sampling time, and τ is the time attenuation Constant, d represents the dimension of the target extended state, v _k-1|k-1 and V _k-1|k-1 represent the posterior of k-1 moment, that is, the degree of freedom and inverse of k-1 moment obtained after iterative update. Scale matrix, E[X _k|k-1 ] represents the mathematical expectation of X _k|k-1 .

2)更新过程：2)Update process:

运动状态的更新：Movement status updates:

其中，表示质心量测，/>表示对应的散射矩阵。W_k|k-1表示系统增益矩阵，ε_k为系统量测的新息部分，S_k|k-1表示新息部分的协方差矩阵。in, Represents the center of mass measurement,/> represents the corresponding scattering matrix. W _k|k-1 represents the system gain matrix, ε _k is the innovation part of the system measurement, and S _k|k-1 represents the covariance matrix of the innovation part.

扩展状态的更新：Update on extension status:

v_k|k＝v_k|k-1+n_k (16)v _k|k ＝v _k|k-1 +n _k (16)

其中，n_k为量测数。Among them, n _k is the measurement number.

假定扩展目标作匀速直线运动，根据公式(1)建立系统方程，同时由公式(2)-(3)建模扩展目标形状为椭圆。由公式(4)-(19)扩展目标滤波算法设定用于与强化学习智能体交互的环境。对于交互环境，即输入k-1时刻的扩展目标运动状态的后验估计值跟协方差矩阵x_k-1|k-1,P_k-1|k-1和扩展状态的后验估计值v_k-1|k-1,V_k-1|k-1，以及传感器实时位置，通过公式(4)-(19)滤波算法得到k时刻的后验值x_k|k,P_k|k,v_k|k,V_k|k。It is assumed that the extended target moves in a straight line at a uniform speed, and the system equation is established according to formula (1). At the same time, the shape of the extended target is modeled as an ellipse by formulas (2)-(3). The environment for interacting with the reinforcement learning agent is set by the extended objective filtering algorithm of Equations (4)-(19). For the interactive environment, that is, input the posterior estimate of the extended target motion state at time k-1 and the covariance matrix x _k-1|k-1 , P _k-1|k-1 and the posterior estimate v of the extended state _k-1|k-1 ,V _k-1|k-1 , and the real-time position of the sensor, the posterior value x _k|k ,P _k|k at time k is obtained through the filtering algorithm of formula (4)-(19), v _k|k ,V _k|k .

扩展目标跟踪问题已经在雷达、计算机视觉等不同领域进行了研究，其性能依赖于观测者(测量传感器)与运动目标的相对几何形状，所以选择传感器管理的任务为传感器轨迹规划。按照图1所示的框架，首先针对椭圆扩展目标进行建模并根据扩展目标跟踪算法构建与深度强化学习的虚拟交互环境，用神经网络拟合强化学习中的价值函数和策略函数，采用深度强化学习算法通过探索与利用等机制来进行传感器控制，构建智能传感器控制系统。The extended target tracking problem has been studied in different fields such as radar and computer vision. Its performance depends on the relative geometry of the observer (measurement sensor) and the moving target, so the task of sensor management is selected as sensor trajectory planning. According to the framework shown in Figure 1, first the elliptical extended target is modeled and a virtual interactive environment with deep reinforcement learning is constructed based on the extended target tracking algorithm. The neural network is used to fit the value function and strategy function in reinforcement learning, and deep reinforcement is used. The learning algorithm performs sensor control through mechanisms such as exploration and utilization to build an intelligent sensor control system.

根据TD3算法构建其网络结构，其中一共包含6个网络(如图6)：分别为Actor网络，目标Actor网络，两个Critic网络以及目标Critic网络，其网络训练算法由公式(20)-(27)给出。至此已经搭建好两大主体即环境和强化学习智能体，之后是对智能体进行训练，最后得到最优策略然后根据传感器量测进行目标跟踪的过程。目标Actor网络的作用与Actor网络作用相同，同样的，目标Critic网络的作用与Critic网络作用相同。在进行网络参数更新时，会涉及计算下一状态所采取的动作以及下一状态采取动作后的状态动作价值，设置目标网络的目的是为了抑制再次使用原本的网络即“自举”造成的Q值过高的问题。另外设置两个Critic网络和目标Critic网络是因为在更新网络时，取最大化的操作同样会导致Q值过估计的问题，通过两个不同的Critic网络选择较小值来更新可以有效抑制。The network structure is constructed according to the TD3 algorithm, which contains a total of 6 networks (as shown in Figure 6): Actor network, target Actor network, two Critic networks and the target Critic network. The network training algorithm is represented by formulas (20)-(27) ) is given. So far, the two main subjects, namely the environment and the reinforcement learning agent, have been built. The next step is to train the agent, and finally obtain the optimal strategy and then perform target tracking based on sensor measurements. The target Actor network has the same function as the Actor network, and similarly, the target Critic network has the same function as the Critic network. When updating network parameters, it will involve calculating the action taken in the next state and the value of the state action after the action is taken in the next state. The purpose of setting the target network is to suppress the Q caused by reusing the original network, that is, "bootstrapping" The problem of too high value. In addition, two Critic networks and a target Critic network are set up because when updating the network, the operation of maximizing will also lead to the problem of overestimation of the Q value, which can be effectively suppressed by selecting a smaller value for update through two different Critic networks.

DDPG算法是处理复杂、连续控制问题的重要深度强化学习算法，但是DDPG算法中会存在Critic网络中对Q值过估计的问题，所以针对该问题，双延迟深度确定性策略梯度(TD3)算法对DDPG算法通过3个部分进行了优化，有效的抑制了Q值过高的问题。所以为了提升深度强化学习算法对扩展目标跟踪的性能优化程度，本发明基于TD3算法在采用连续任务的环境中进行传感器控制。The DDPG algorithm is an important deep reinforcement learning algorithm for dealing with complex, continuous control problems. However, the DDPG algorithm will have the problem of overestimating the Q value in the Critic network. Therefore, to address this problem, the Double Delay Deep Deterministic Policy Gradient (TD3) algorithm is used. The DDPG algorithm is optimized through three parts, effectively suppressing the problem of excessive Q value. Therefore, in order to improve the performance optimization of the deep reinforcement learning algorithm for extended target tracking, the present invention performs sensor control in an environment using continuous tasks based on the TD3 algorithm.

(3)TD3算法(3)TD3 algorithm

强化学习问题包含两个主体：智能体与环境。在扩展目标基于深度强化学习的传感器控制中，交互的环境为椭圆扩展目标滤波算法，训练强化学习智能体进行传感器智能控制。通常可以使用马尔可夫决策过程(MDP)对强化学习问题进行建模，表示为一个五元组为有限状态集合，即智能体在环境中探索到的所有可能的状态，用s表示当前时刻的状态，s′表示下一时刻状态，则具体状态为坐标系中传感器的位置。/>为有限动作集合，即智能体根据当前状态所能采取的所有可能的动作的集合，用a表示当前采取的动作，此时为传感器路径规划中，固定传感器速度后，传感器可选择移动的方向角。P为状态转移函数，即传感器从当前状态s转移到下一时刻状态s′时的概率。/>为奖励函数，表示传感器根据当前时刻位置状态采取动作后所能获得的期望奖励。γ为折扣因子，表示未来期望奖励在当前时刻的价值比例。Reinforcement learning problems contain two subjects: the agent and the environment. In the extended target sensor control based on deep reinforcement learning, the interactive environment is the elliptical extended target filter algorithm, and the reinforcement learning agent is trained to perform intelligent sensor control. Reinforcement learning problems can often be modeled using a Markov decision process (MDP), represented as a quintuple is a finite set of states, that is, all possible states that the agent explores in the environment. Use s to represent the state at the current moment, s′ to represent the state at the next moment, and the specific state is the position of the sensor in the coordinate system. /> is a limited action set, that is, a set of all possible actions that an agent can take based on the current state. Let a represent the currently taken action. At this time, it is the sensor path planning. After fixing the sensor speed, the sensor can choose the direction angle of movement. . P is the state transition function, which is the probability of the sensor transitioning from the current state s to the next state s′. /> It is a reward function, which represents the expected reward that the sensor can obtain after taking action based on the current position status. γ is the discount factor, which represents the proportion of the value of future expected rewards at the current moment.

智能体与环境交互的过程是智能体在k时刻，采取动作a_k并作用于环境后，传感器从k时刻状态x_s,k转移到k+1时刻状态x_s,k+1，由奖励函数进行评价得到奖励值R_k+1，然后智能体根据奖励值不断改进策略，最终学习到最优策略决策传感器每一时刻的动作a_k。The process of interaction between the agent and the environment is that after the agent takes action a _k and acts on the environment at time k, the sensor moves from the state x _s,k at time k to the state x _{s,k+1 at time k+1} . According to the reward function The reward value R _k+1 is obtained through evaluation, and then the agent continuously improves the strategy based on the reward value, and finally learns the optimal strategy to determine the sensor's action a _k at each moment.

在MDP中，价值函数包含状态价值函数和动作价值函数(Q函数)两种，在TD3算法中使用的是动作价值函数，表示在策略函数π的指导下，智能体根据状态s采取动作a所获得的期望奖励。强化学习算法根据环境模型是否已知可以分为两大类：分别为基于模型(model-based)的方法和无模型(model-free)的方法，由于实际问题的环境大部分是复杂未知的，导致对环境的建模困难，所以model-free的方法应用更加广泛。对于model-free的方法又可以分为基于策略(policy-based)的方法、基于价值(value-based)的方法以及综合两者的Actor-Critic(AC)方法。In MDP, the value function includes two types: state value function and action value function (Q function). In the TD3 algorithm, the action value function is used, which means that under the guidance of the policy function π, the agent takes action a according to the state s. Expected rewards to be obtained. Reinforcement learning algorithms can be divided into two categories according to whether the environment model is known: model-based methods and model-free methods. Since most of the actual problem environments are complex and unknown, This makes it difficult to model the environment, so the model-free method is more widely used. Model-free methods can be divided into policy-based methods, value-based methods, and the Actor-Critic (AC) method that combines the two.

对于深度强化学习的TD3算法是model-free的基于AC的方法，用神经网络拟合价值函数和策略函数。其中，Actor表示基于策略函数的网络，用于根据状态选择动作，Critic表示基于价值函数的网络，用于对Actor网络选择的动作进行评价。假设分别使用参数为θ^μ和的Actor网络和Critic网络，参数为θ^μ′的目标Actor网络以及参数为/>和/>的两个目标Critic网络。对Critic网络采用梯度下降的方式进行更新，更新公式描述为：The TD3 algorithm for deep reinforcement learning is a model-free AC-based method that uses neural networks to fit the value function and policy function. Among them, Actor represents a network based on the policy function, which is used to select actions based on the state, and Critic represents a network based on the value function, which is used to evaluate the actions selected by the Actor network. Assume that the parameters are θ ^μ and Actor network and Critic network, the target Actor network with parameters θ ^μ′ and the parameter /> and/> The two-target Critic network. The Critic network is updated using gradient descent. The update formula is described as:

对Actor网络通过梯度下降的方式更新，可以描述为：Updating the Actor network through gradient descent can be described as:

其中，N为每次学习采样的批量样本数，α和β为学习率。对于目标网络采用软更新的方式，更新公式可以描述为：Among them, N is the number of batch samples sampled for each learning, and α and β are the learning rate. For the target network using soft update, the update formula can be described as:

θ^μ′←τθ^μ+(1-τ)θ^μ′ (27)θ ^μ′ ←τθ ^μ +(1-τ)θ ^μ′ (27)

智能体由当前k时刻传感器位置，根据当前智能体的网络参数输出传感器动作，然后根据此动作得到传感器k+1时刻位置，将k+1时刻位置带入到滤波环境中得到传感器量测值，以此进行扩展目标跟踪伪更新，从而根据设计的奖励函数由公式(29)给出可以得到k时刻采取动作的奖励值，将数据以(x_s,k,a_k,r_k+1,x_s,k+1)的形式进行存储。在目标跟踪中会涉及许多时刻，即k∈{0,1,2…T}，直至跟踪时刻的结束。在这个过程中环境会不断的与强化学习智能体进行交互，会得到许多条(x_s,k,a_k,r_k+1,x_s,k+1)形式的数据，都将其存储在经验回放池中，设置一定的经验回放池容量，当数据超过这个容量时，丢弃之前旧的数据，重新用新的数据来填充经验回放池。The agent outputs the sensor action based on the current sensor position at k time according to the current agent's network parameters, and then obtains the sensor k+1 time position based on this action, and brings the k+1 time position into the filtering environment to obtain the sensor measurement value. This is used to perform extended target tracking pseudo-update, so that the reward value of the action taken at time k can be obtained according to the designed reward function given by formula (29). The data is divided into (x _s,k ,a _k ,r _k+1 ,x stored in the form of _s,k+1 ). Many moments will be involved in target tracking, that is, k∈{0,1,2...T}, until the end of the tracking moment. In this process, the environment will continue to interact with the reinforcement learning agent, and many pieces of data in the form of (x _s,k ,a _k ,r _k+1 ,x _s,k+1 ) will be obtained, which will be stored in In the experience replay pool, set a certain experience replay pool capacity. When the data exceeds this capacity, the old data will be discarded and new data will be used to fill the experience replay pool.

对于深度强化学习中的奖励函数的构建：选取扩展目标先验概率密度与后验概率密度之间的信息增益来设计奖励函数，利用高斯瓦瑟斯坦距离来度量这种信息增益。此时，用先验概率密度和后验概率密度之间的高斯瓦瑟斯坦(GW)距离对扩展目标跟踪估计效果进行综合评价，GW距离越大表明信息增益越大，通过该奖励函数引导强化学习智能体选择最优策略，可以避免采用稀疏奖励使智能体难以收敛的问题。For the construction of the reward function in deep reinforcement learning: select the information gain between the extended target prior probability density and the posterior probability density to design the reward function, and use the Gaussian Wasserstein distance to measure this information gain. At this time, the Gauss-Wasserstein (GW) distance between the prior probability density and the posterior probability density is used to comprehensively evaluate the extended target tracking estimation effect. The larger the GW distance, the greater the information gain. The reward function is used to guide the enhancement. Learning the optimal strategy for the agent can avoid the problem of sparse rewards that make it difficult for the agent to converge.

(4)奖励函数的设计(4)Design of reward function

由于目标跟踪涉及许多时刻，即k∈{0,1,2…T}，在每一个时刻，从经验回放池中抽取小批量的数据，根据TD3算法的网络更新方式即公式(20)-(27)对网络参数进行更新。不断的从经验回放池中抽取数据，设置一定的训练次数，对网络参数进行迭代更新，最终使其收敛到最优，控制智能体根据k时刻传感器位置作出最优策略得到传感器动作。Since target tracking involves many moments, that is, k∈{0,1,2...T}, at each moment, a small batch of data is extracted from the experience replay pool, and the network update method according to the TD3 algorithm is formula (20)-( 27) Update network parameters. Continuously extract data from the experience replay pool, set a certain number of training times, iteratively update the network parameters, and finally converge to the optimum. The control agent makes the optimal strategy based on the sensor position at k time to obtain the sensor action.

在椭圆扩展目标跟踪中，可以将运动状态的位置分量和扩展状态定义为多元高斯分布，用来描述扩展目标跟踪的整体效果，表示为：N_x～N(m_x,s∑_x)，其中，m_x由x_k的位置分量确定，s∑_x表示范围，s为缩放因子，取s＝1，∑_x由随机矩阵X_k确定。则定义扩展目标k时刻先验概率分布和后验概率分布均服从多元高斯分布，描述为：两者之间的高斯瓦瑟斯坦距离为：In elliptical extended target tracking, the position component and extended state of the motion state can be defined as a multivariate Gaussian distribution, which is used to describe the overall effect of extended target tracking, expressed as: N _x ~ N (m _x ,s∑ _x ), where , m _x is determined by the position component of x _k , s∑ _x represents the range, s is the scaling factor, take s=1, and ∑ _x is determined by the random matrix X _k . Then it is defined that the prior probability distribution and posterior probability distribution at moment k of the extended target both obey the multivariate Gaussian distribution, which is described as: The Gaussian Wasserstein distance between the two is:

则基于此，奖励函数为：Based on this, the reward function is:

用每一时刻训练好的强化学习智能体最终根据当前时刻的传感器位置选择其下一时刻的动作以获得最佳量测，新的量测根据公式(9)-(10)得到，最后带入到滤波算法(6)-(19)中得到最终的扩展目标跟踪结果。通过以上步骤的迭代最终得到最优传感器路径以优化椭圆扩展目标跟踪整体性能。The reinforcement learning agent trained at each moment is used to finally select its action at the next moment based on the sensor position at the current moment to obtain the best measurement. The new measurement is obtained according to formulas (9)-(10), and finally brought into Go to the filtering algorithm (6)-(19) to obtain the final extended target tracking result. Through iteration of the above steps, the optimal sensor path is finally obtained to optimize the overall performance of elliptical extended target tracking.

图2为扩展目标跟踪中基于TD3算法的传感器轨迹智能规划后，对目标扩展状态估计中与真实目标的半长轴和半短轴之间的估计误差，可以看出对半长轴的估计误差均在1.0以下，对半短轴的估计误差均在0.5以下，对扩展目标的形状估计比较准确。Figure 2 shows the estimation error between the semi-major axis and the semi-minor axis of the target expanded state estimation and the real target after intelligent planning of the sensor trajectory based on the TD3 algorithm in extended target tracking. It can be seen that the estimation error of the semi-major axis All are below 1.0, the estimation errors of the semi-minor axis are all below 0.5, and the shape estimation of the extended target is relatively accurate.

图3为在无传感器控制方案下与基于TD3控制方案下，扩展目标状态估计与真实目标状态之间的质心估计误差。从图3中可以看出，扩展目标中加入基于深度强化学习的传感器控制方法后，对扩展目标的质心跟踪估计更加精准。Figure 3 shows the centroid estimation error between the extended target state estimate and the true target state under the sensorless control scheme and the TD3-based control scheme. As can be seen from Figure 3, after the sensor control method based on deep reinforcement learning is added to the extended target, the centroid tracking estimate of the extended target is more accurate.

图4为在无传感器控制方案下与基于TD3控制方案下，扩展目标估计与真实扩展目标之间的高斯瓦瑟斯坦距离，通过该指标综合考虑对目标运动状态和扩展状态估计的整体性能。从图4可以看出，扩展目标中加入基于深度强化学习的传感器控制方法后，提升了扩展目标跟踪的整体性能，不仅使质心估计更加准确，同时对目标轮廓信息的跟踪估计也更接近于目标真实形状。Figure 4 shows the Gauss Wasserstein distance between the extended target estimate and the real extended target under the sensorless control scheme and the TD3-based control scheme. This indicator comprehensively considers the overall performance of the target motion state and extended state estimation. As can be seen from Figure 4, adding the sensor control method based on deep reinforcement learning to the extended target improves the overall performance of the extended target tracking. Not only does the center of mass estimation become more accurate, but the tracking estimate of the target contour information is also closer to the target. True shape.

本发明运用随机矩阵对椭圆扩展目标的扩展状态建模，可以对目标运动状态和扩展状态进行有效估计，然后采用类似于信息论的传感器管理方法中评价函数的设定构建应用于深度强化学习TD3算法中的奖励函数，此奖励函数综合考虑了对目标运动状态和轮廓信息(扩展状态)的联合优化，使用TD3算法对传感器在连续动作空间下进行有效控制后，与无传感器控制相比不仅可以对目标质心位置估计上更加准确，同时对目标轮廓信息的估计上也更加准确，所以在整体上优化了椭圆扩展目标的跟踪效果。This invention uses a random matrix to model the expanded state of the elliptical expanded target, which can effectively estimate the target motion state and expanded state, and then uses the setting of the evaluation function in a sensor management method similar to information theory to construct the deep reinforcement learning TD3 algorithm. The reward function in The target center of mass position is more accurately estimated, and the target contour information is also more accurately estimated, so the tracking effect of the elliptical extended target is optimized as a whole.

以上，仅为本申请较佳的具体实施方式，但本申请的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本申请揭露的技术范围内，可轻易想到的变化或替换，都应涵盖在本申请的保护范围之内。因此，本申请的保护范围应该以权利要求的保护范围为准。The above are only preferred specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with the technical field can easily think of changes or substitutions within the technical scope disclosed in the present application. All are covered by the protection scope of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

1. The sensor management method based on deep reinforcement learning in the extended goal is characterized by:

Model the elliptical extended target and build a virtual interactive environment with deep reinforcement learning based on the extended target filtering algorithm;

Establish TD3 algorithm intelligent agent;

The virtual interactive environment interacts with the TD3 algorithm agent to obtain sensor control data, and the sensor control data is stored as a sample in an experience playback pool; samples are extracted based on the experience playback pool to train the TD3 algorithm The agent uses the trained agent to decide the optimal action for sensor path planning;

Wherein, obtaining the sensor control data includes:

After the agent takes action and acts on the virtual interactive environment at any time, the sensor moves from the state x _s, k at time k to the state x _{s,k+1 at time k+1} , and is evaluated through the reward function Obtain the reward value R _k+1 , and then the agent continuously improves the strategy based on the reward value, and finally learns the optimal strategy to determine the actions of the sensor at each moment;

The optimal action is applied to the sensor, and the sensor position is obtained after state transition of the sensor, thereby obtaining the extended target sensor measurement value at the current moment, and filtering prediction and update are performed to perform tracking and estimation of the extended target.

2. The sensor management method based on deep reinforcement learning in the extended target according to claim 1, characterized in that modeling the elliptical extended target includes:

Set the state of extended target tracking at time k as: ζ _k = (x _k ,X _k ), where x _k represents the kinematic state of the target, and X _k represents the expanded state of the target;

The approach to modeling is:

Among them, w _k is zero-mean Gaussian process noise, v _k is zero-mean Gaussian measurement noise, x _{s, k} (π) is the sensor position at the current moment, f _k : is the system state evolution mapping, h _k :/> For measurement mapping, x _k+1 represents the kinematic state of the target at k+1 moment,/> Represents multiple measurement values at time k.

3. The sensor management method based on deep reinforcement learning in the extended target according to claim 2, characterized in that the extended state of the extended target at time k is modeled as an elliptical shape, and is described by a positive definite symmetric matrix X _k as:

Among them, θ _k is the direction angle of the elliptical shape, σ _k,1 and σ _k,2 are the major axis and minor axis of the elliptical shape respectively.

4. The sensor management method based on deep reinforcement learning in the extended target according to claim 1, characterized in that, according to the extended target filtering algorithm, a virtual interactive environment with deep reinforcement learning is constructed, including:

Based on the neural network to fit the value function and policy function in the deep reinforcement learning, the deep reinforcement learning algorithm is used to control the sensor through the exploration and utilization mechanism, an intelligent sensor control system is established, and the virtual interaction is constructed through the intelligent sensor control system Environment; the extended target filtering algorithm includes a prediction process and an update process.

5. The sensor management method based on deep reinforcement learning in the extended target according to claim 4, characterized in that the prediction process is:

Among them, F _k|k-1 is the state transition matrix, I _d is the d-dimensional identity matrix, P _k|k-1 is the prediction covariance matrix, D _k|k-1 is the covariance matrix of zero-mean Gaussian process noise, x _k|k-1 is the one-step prediction value, x _k-1|k-1 is the filter update value at time k-1, and P _k-1|k-1 is the corresponding covariance matrix.

6. The sensor management method based on deep reinforcement learning in the extended target according to claim 1, characterized in that the TD3 algorithm agent includes:

Actor network: used to select actions based on status;

Target Actor network: used to select actions based on the status again based on the results obtained by the Actor network;

Critic network: used to evaluate the actions selected by the Actor network;

Target Critic network: used to evaluate the action selected by the Actor network again based on the results obtained by the Critic network.

7. The sensor management method based on deep reinforcement learning in the extended target according to claim 1, characterized in that the method of constructing the reward function includes:

Define the extended target k time prior probability distribution and posterior probability distribution to obey the multivariate Gaussian distribution, obtain the Gauss Wasserstein distance between the prior probability distribution and the posterior probability distribution, and construct the above based on the Gauss Wasserstein distance reward function.

8. The sensor management method based on deep reinforcement learning in the extended target according to claim 7, characterized in that the reward function is:

Among them, a _k,0 indicates that the sensor is in a stationary state at the current moment.