CN116165886A

CN116165886A - Multi-sensor intelligent cooperative control method, device, equipment and medium

Info

Publication number: CN116165886A
Application number: CN202211631510.XA
Authority: CN
Inventors: 胡超; 黄杰; 麻舜予; 张宇阳; 李贵; 丛迅超; 郑博元
Original assignee: CETC 10 Research Institute
Current assignee: CETC 10 Research Institute
Priority date: 2022-12-19
Filing date: 2022-12-19
Publication date: 2023-05-26

Abstract

The invention discloses a multi-sensor intelligent cooperative control method, a device, equipment and a medium, wherein the method comprises the following steps: establishing a reinforcement learning agent model, wherein a state space of the reinforcement learning agent model comprises a global comprehensive situation expression and a single sensor state embedded representation, and an action space of the reinforcement learning agent model comprises action output values abstracted by different tasks executed by a plurality of sensors; training the reinforcement learning intelligent agent model through sampling, and guiding the reinforcement learning intelligent agent model to learn through rewarding modeling engineering, wherein the training comprises centering training and decentralizing execution; the reinforcement learning agent model obtained through loading training cooperatively controls a plurality of sensors. The invention realizes cross-means cross-region cooperative control of multiple sensors by using reinforcement learning technology.

Description

Multi-sensor intelligent collaborative control method, device, equipment and medium

技术领域Technical Field

本发明属于人工智能技术领域，尤其涉及多传感器智能协同控制方法、装置、设备及介质。The present invention belongs to the field of artificial intelligence technology, and in particular relates to a multi-sensor intelligent collaborative control method, device, equipment and medium.

背景技术Background Art

目前，电磁频谱空间呈现出强对抗的特点，面对存在大量不同特征的复杂辐射源在网电空间中对目标重点信号进行探测活动时，要跨区域协同调度多个同类型或不同类型传感器(如超短波、微波、电子和便携式雷达等)，发挥各自的优势，实现对多个目标的即时发现与定位，并实现长时间持续跟踪的需求。At present, the electromagnetic spectrum space presents the characteristics of strong confrontation. Faced with a large number of complex radiation sources with different characteristics, when conducting detection activities on key target signals in the network and electronic space, it is necessary to coordinate and dispatch multiple sensors of the same or different types (such as ultra-short wave, microwave, electronic and portable radars, etc.) across regions, give full play to their respective advantages, realize the instant discovery and positioning of multiple targets, and meet the needs of long-term continuous tracking.

传统的人工决策控制方法存在高时延、低容错、无法综合考虑高维复杂信号态势以及对多传感器资源的最优协同调度，且多数重点目标信号具有短猝发的特征，从而难以实现对辐射源的有效持续定位跟踪。Traditional manual decision-making and control methods have the problems of high latency, low fault tolerance, and are unable to comprehensively consider the situation of high-dimensional complex signals and the optimal coordinated scheduling of multi-sensor resources. In addition, most key target signals have short burst characteristics, making it difficult to achieve effective and continuous positioning and tracking of radiation sources.

深度强化学习面临奖励信号稀疏和采样率低的问题，进而导致样本利用率低、学习速度缓慢甚至训练难以收敛，这种缺点在强对抗多辐射源复杂电磁频谱环境中尤为突出。且由于部分重点信号具有隐蔽性、突发性、阶段性等特征，不同信号又具有不同特征无法通过单一手段(即单一种类的传感器)进行探测，导致辐射源定位跟踪任务晦涩，难以拆解。Deep reinforcement learning faces the problem of sparse reward signals and low sampling rate, which leads to low sample utilization, slow learning speed and even difficulty in training convergence. This shortcoming is particularly prominent in a complex electromagnetic spectrum environment with strong confrontation and multiple radiation sources. In addition, since some key signals have characteristics such as concealment, suddenness, and phases, and different signals have different characteristics that cannot be detected by a single means (i.e. a single type of sensor), the task of locating and tracking radiation sources is obscure and difficult to disassemble.

发明内容Summary of the invention

本发明的目的在于，为克服现有技术缺陷，提供了多传感器智能协同控制方法、装置、设备及介质,能够在复杂多辐射源环境中，通过人工智能技术对多个传感器进行协同控制，从而实现多目标即时发现与定位，并满足对目标长时间持续跟踪的需求。The purpose of the present invention is to overcome the defects of the prior art and provide a multi-sensor intelligent collaborative control method, device, equipment and medium, which can collaboratively control multiple sensors through artificial intelligence technology in a complex multi-radiation source environment, thereby realizing instant detection and positioning of multiple targets and meeting the needs of long-term continuous tracking of targets.

本发明目的通过下述技术方案来实现：The object of the present invention is achieved through the following technical solutions:

一种多传感器智能协同控制方法，所述方法包括：A multi-sensor intelligent collaborative control method, the method comprising:

建立每个传感器对应的强化学习智能体模型，所述强化学习智能体模型的状态空间包括全局综合态势表达和单个传感器状态嵌入表示，所述强化学习智能体模型的动作空间包括由多传感器执行的不同任务所抽象出的动作输出值；Establishing a reinforcement learning agent model corresponding to each sensor, wherein the state space of the reinforcement learning agent model includes a global comprehensive situation expression and a single sensor state embedding representation, and the action space of the reinforcement learning agent model includes action output values abstracted from different tasks performed by multiple sensors;

通过采样对所述强化学习智能体模型进行训练，并通过奖励塑造工程引导所述强化学习智能体模型学习，所述训练包括中心化训练与去中心化执行；Training the reinforcement learning agent model through sampling, and guiding the reinforcement learning agent model to learn through reward shaping engineering, wherein the training includes centralized training and decentralized execution;

通过训练得到的强化学习智能体模型协同控制多个传感器。The trained reinforcement learning agent model collaboratively controls multiple sensors.

进一步的，所述通过采样对所述强化学习智能体模型进行训练具体包括：Furthermore, the training of the reinforcement learning agent model by sampling specifically includes:

开启多个采样线程分别在多个具备不同配置场景的并行仿真环境或并行真实环境中进行独立采样；Open multiple sampling threads to perform independent sampling in multiple parallel simulation environments or parallel real environments with different configuration scenarios;

将采样数据统一放入采样经验缓存池；Put the sampled data into the sampled experience buffer pool;

当满足训练条件时，从采样经验缓存池中取出采样数据根据对应的强化学习算法集中式训练对智能体模型进行更新，然后将更新后的模型参数放入模型参数缓存池。When the training conditions are met, the sampled data is taken out from the sampled experience cache pool to update the agent model according to the corresponding reinforcement learning algorithm centralized training, and then the updated model parameters are placed in the model parameter cache pool.

进一步的，所述通过奖励塑造工程引导所述强化学习智能体模型学习具体包括：Furthermore, guiding the learning of the reinforcement learning agent model through the reward shaping project specifically includes:

设置人为设计奖励、终局模式奖励和好奇心奖励；Set up human-design rewards, end-game rewards, and curiosity rewards;

所述人为设计奖励包括当完成预设任务时或任务失败时给予对应奖励值；The artificially designed rewards include giving corresponding reward values when a preset task is completed or when a task fails;

所述终局模式奖励包括根据整体的信号探测效能给予奖励；The end-game mode reward includes awarding rewards based on overall signal detection effectiveness;

所述好奇心奖励包括当探索到未知空间时给予奖励。The curiosity reward includes giving a reward when an unknown space is explored.

进一步的，所述方法建立强化学习智能体模型时采用的置信区域策略优化方法具体包括：Furthermore, the confidence region strategy optimization method used in the method to establish the reinforcement learning agent model specifically includes:

使用一阶近似形式简化置信区域策略优化算法，所述置信区域策略优化算法简化后为：The confidence region strategy optimization algorithm is simplified using a first-order approximation form, and the confidence region strategy optimization algorithm is simplified as follows:

对应约束条件为：The corresponding constraints are:

其中为π为新策略；π_old为旧策略；S为状态；a为动作，

为旧策略的优势函数；Where π is the new strategy; π _old is the old strategy; S is the state; a is the action,

is the advantage function of the old strategy;

状态行动值函数：

State-action-value function:

状态值函数：

State value function:

优势函数：

Advantage function:

其中，γ为衰减因子，

为新旧策略的KL散度的平均值；Where γ is the attenuation factor,

is the average value of the KL divergence of the new and old strategies;

对简化后的置信区域策略优化算法使用蒙特卡洛法进行近似，得到The simplified confidence region strategy optimization algorithm is approximated using the Monte Carlo method to obtain

令

用于表示新旧策略的比率，得到make

Used to express the ratio of the new and old strategies, we get

将置信区域策略优化算法的约束条件近似为r_t(θ)∈[1-∈,1+∈]，其中∈为clip系数，则带约束的置信区域策略优化算法目标函数可以表示无约束的目标函数：The constraints of the confidence region policy optimization algorithm are approximated as r _t (θ)∈[1-∈,1+∈], where ∈ is the clip coefficient. Then the objective function of the confidence region policy optimization algorithm with constraints can be expressed as the unconstrained objective function:

将状态值函数的目标函数和策略模型的熵添加至无约束的目标函数，得到完整的目标函数：Adding the objective function of the state value function and the entropy of the policy model to the unconstrained objective function yields the complete objective function:

其中，

C₁、C₂分别为预先设置的对应项的系数。in,

C ₁ and C ₂ are preset coefficients of corresponding items respectively.

进一步的，所述中心化训练与去中心化执行具体包括：Furthermore, the centralized training and decentralized execution specifically include:

由一个中央控制器收集所有智能体的全局状态并做出统一决策；A central controller collects the global states of all agents and makes unified decisions;

各传感器之间异步根据自身当前状态执行各自任务。Each sensor performs its own task asynchronously according to its current state.

进一步的，所述方法还包括基于生成对抗模仿学习的专家策略引导所述强化学习智能体模型的学习。Furthermore, the method also includes guiding the learning of the reinforcement learning agent model based on an expert strategy of generative adversarial imitation learning.

进一步的，所述基于生成对抗模仿学习的专家策略引导所述强化学习智能体模型的学习具体包括：Furthermore, the expert strategy based on generative adversarial imitation learning guides the learning of the reinforcement learning agent model specifically including:

重复以下步骤直至得到最优策略：Repeat the following steps until the optimal strategy is obtained:

利用当前传感器对应的强化学习智能体模型与环境交互得到智能体生成轨迹；Use the reinforcement learning agent model corresponding to the current sensor to interact with the environment to obtain the agent generation trajectory;

智能体生成轨迹和示范轨迹一并输入鉴别器并以监督学习的方式更新鉴别器参数；The agent-generated trajectory and the demonstration trajectory are input into the discriminator together and the discriminator parameters are updated in a supervised learning manner;

更新后的鉴别器输出新的鉴别奖励函数；The updated discriminator outputs a new discrimination reward function;

利用更新后的奖励函数来提供奖励信号进一步更新智能体策略。The updated reward function is used to provide a reward signal to further update the agent strategy.

另一方面，本发明还提供了一种多传感器智能协同控制装置，所述装置包括：On the other hand, the present invention also provides a multi-sensor intelligent collaborative control device, the device comprising:

智能体模型建立模块，所述智能体模型建立模块建立每个传感器对应的强化学习智能体模型，所述强化学习智能体模型的状态空间包括全局综合态势表达和单个传感器状态嵌入表示，所述强化学习智能体模型的动作空间包括由多传感器执行的不同任务所抽象出的动作输出值；An agent model building module, wherein the agent model building module builds a reinforcement learning agent model corresponding to each sensor, wherein the state space of the reinforcement learning agent model includes a global comprehensive situation expression and a single sensor state embedding representation, and the action space of the reinforcement learning agent model includes action output values abstracted from different tasks performed by multiple sensors;

智能体模型训练模块，所述智能体模型训练模块通过采样对所述强化学习智能体模型进行训练，并通过奖励塑造工程引导所述强化学习智能体模型学习，所述训练包括中心化训练与去中心化执行；An agent model training module, wherein the agent model training module trains the reinforcement learning agent model by sampling and guides the reinforcement learning agent model to learn by reward shaping engineering, wherein the training includes centralized training and decentralized execution;

传感器控制模块，所述传感器控制模块通过训练得到的强化学习智能体模型协同控制多个传感器。A sensor control module, wherein the sensor control module collaboratively controls multiple sensors through a trained reinforcement learning agent model.

另一方面，本发明还提供了一种计算机设备，计算机设备包括处理器和存储器，所述存储器中存储有计算机程序，所述计算机程序由所述处理器加载并执行以实现上述的任意一种多传感器智能协同控制方法。On the other hand, the present invention also provides a computer device, which includes a processor and a memory, wherein the memory stores a computer program, and the computer program is loaded and executed by the processor to implement any one of the above-mentioned multi-sensor intelligent collaborative control methods.

另一方面，本发明还提供了一种计算机可读存储介质，所述存储介质中存储有计算机程序，所述计算机程序由处理器加载并执行以实现上述的任意一种多传感器智能协同控制方法。On the other hand, the present invention also provides a computer-readable storage medium, in which a computer program is stored. The computer program is loaded and executed by a processor to implement any of the above-mentioned multi-sensor intelligent collaborative control methods.

本发明的有益效果在于：The beneficial effects of the present invention are:

(1)本发明通过建立强化学习智能体模型来实现多个传感器的协同控制，能够使得多个传感器在复杂工作环境下对有复杂信号态势具有相应的处理能力。(1) The present invention realizes the coordinated control of multiple sensors by establishing a reinforcement learning intelligent agent model, which enables multiple sensors to have corresponding processing capabilities for complex signal situations in complex working environments.

(2)本发明利用强化学习技术以及基于专家知识的生成对抗模仿学习技术，实现对复杂电磁频谱环境中重点信号的持续定位跟踪。(2) The present invention utilizes reinforcement learning technology and generative adversarial imitation learning technology based on expert knowledge to achieve continuous positioning and tracking of key signals in complex electromagnetic spectrum environments.

(3)本发明能够控制跨区域多传感器之间异步执行各自任务，具备对短猝发信号的捕捉与定位能力，并具备一定的持续定位跟踪能力。(3) The present invention can control cross-regional multiple sensors to asynchronously execute their respective tasks, has the ability to capture and locate short burst signals, and has a certain degree of continuous positioning and tracking capability.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1是本发明实施例提供的多传感器智能协同控制方法；FIG1 is a multi-sensor intelligent collaborative control method provided by an embodiment of the present invention;

图2是本发明实施例状态空间示意图；FIG2 is a schematic diagram of a state space of an embodiment of the present invention;

图3是本发明实施例智能体训练模块运行逻辑图；FIG3 is a logic diagram of the operation of the intelligent agent training module according to an embodiment of the present invention;

图4是本发明实施例智能体训练系统架构示意图；FIG4 is a schematic diagram of the architecture of an intelligent agent training system according to an embodiment of the present invention;

图5是本发明实施例分布式架构框架图；FIG5 is a diagram of a distributed architecture framework according to an embodiment of the present invention;

图6是本发明实施例多智能体学习系统示意图；FIG6 is a schematic diagram of a multi-agent learning system according to an embodiment of the present invention;

图7是本发明实施例生成对抗模仿学习的专家策略引导学习方法示意图；7 is a schematic diagram of an expert strategy guided learning method for generative adversarial imitation learning according to an embodiment of the present invention;

图8是本发明实施例提供的多传感器智能协同控制装置结构框图。FIG8 is a structural block diagram of a multi-sensor intelligent collaborative control device provided in an embodiment of the present invention.

具体实施方式DETAILED DESCRIPTION

以下通过特定的具体实例说明本发明的实施方式，本领域技术人员可由本说明书所揭露的内容轻易地了解本发明的其他优点与功效。本发明还可以通过另外不同的具体实施方式加以实施或应用，本说明书中的各项细节也可以基于不同观点与应用，在没有背离本发明的精神下进行各种修饰或改变。需说明的是，在不冲突的情况下，以下实施例及实施例中的特征可以相互组合。The following describes the embodiments of the present invention by specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can also be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the following embodiments and features in the embodiments can be combined with each other without conflict.

基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。Based on the embodiments of the present invention, all other embodiments obtained by ordinary technicians in this field without making any creative work shall fall within the scope of protection of the present invention.

为了解决上述技术问题，提出了本发明多传感器智能协同控制方法、装置、设备及介质的下述各个实施例。In order to solve the above technical problems, the following embodiments of the multi-sensor intelligent collaborative control method, device, equipment and medium of the present invention are proposed.

实施例1Example 1

本实施例针对复杂电磁频谱环境中重点辐射源信号的定位跟踪，利用强化学习技术对多传感器实现跨手段跨区域的协同控制，优化资源调度与决策速度，提升定位精度与准确性，进一步提高对部分短猝发信号辐射源的持续定位能力。This embodiment aims at locating and tracking the signals of key radiation sources in a complex electromagnetic spectrum environment, and uses reinforcement learning technology to achieve cross-means and cross-regional collaborative control of multiple sensors, optimize resource scheduling and decision-making speed, improve positioning precision and accuracy, and further improve the continuous positioning capability of some short burst signal radiation sources.

参照图1，如图1所示是本实施例提供的多传感器智能协同控制方法流程示意图，该方法具体包括以下步骤：Referring to FIG. 1 , FIG. 1 is a flow chart of a multi-sensor intelligent collaborative control method provided by this embodiment, and the method specifically includes the following steps:

步骤一：建立每个传感器对应的强化学习智能体模型。Step 1: Establish a reinforcement learning agent model corresponding to each sensor.

具体地，强化学习智能体模型的状态空间包括全局综合态势表达和单个传感器状态嵌入表示，强化学习智能体模型的动作空间包括由多传感器执行的不同任务所抽象出的动作输出值。Specifically, the state space of the reinforcement learning agent model includes the global comprehensive situation expression and the embedded representation of the state of a single sensor, and the action space of the reinforcement learning agent model includes the action output values abstracted from different tasks performed by multiple sensors.

本实施例强化学习智能体模型的状态空间由传感器、目标、信号的状态向量组成。在本方案中状态空间为N×D维的矩阵。N表示当前环境中要素的数量，包括传感器、目标和信号。当环境要素的数量小于N时，剩余位置进行补0。D表示向量的长度，不同要素间向量会通过补0补齐至同一长度。The state space of the reinforcement learning agent model in this embodiment is composed of the state vectors of sensors, targets, and signals. In this solution, the state space is an N×D dimensional matrix. N represents the number of elements in the current environment, including sensors, targets, and signals. When the number of environmental elements is less than N, the remaining positions are padded with 0. D represents the length of the vector, and the vectors between different elements will be padded to the same length by padded with 0.

本发明接下来的部分将以处于3个不同区域的3个探测设备传感器的协同控制来对具体实施方式进行描述，应当理解，此处传感器以及区域的个数并不用于限定本发明。The following part of the present invention will describe the specific implementation method by means of the coordinated control of three detection device sensors in three different areas. It should be understood that the number of sensors and areas here is not intended to limit the present invention.

参照图2，如图2所示是本实施例状态空间示意图。本实施例状态空间由全局综合态势表达和单个传感器状态嵌入表示两部分组成，其中全局中和态势表达分别包含目标、信号以及全局运维状态。目标综合态势嵌入表示中，0代表当前智能体未对信号成功定位，因此没有经纬度，1则代表定位到目标；以3个不同区域的3个探测传感器为例，由于定位至少需要两个不同区域的传感器，因此支撑定位的传感器ID，0代表数据无效，1代表区域(1,2)之间的传感器进行定位，2代表区域(1,3)之间的传感器进行定位，3代表区域(2,3)之间的传感器进行定位，4代表区域(1,2,3)之间的传感器都在进行定位。信号综合态势嵌入表示由当前探测到的所有信号态势组成，具体则包括频率、带宽、幅度、方位和重点信号标识。全局运维状态离散嵌入表示由当前所有传感器的工作状态组成，具体则包括任务类型、任务状态、当前宽扫频段(没有则填0)、当前控守频点(没有则填0)组成。单个传感器状态嵌入表示则包括当前传感器信号态势数据(频率、带宽、幅度、方位、重点信号标识)、历史定频数据(频率、带宽、标识)、传感器当前工作状态(任务类型、任务状态、当前宽扫频段、当前控守频点)、同类传感器宽扫频点集合(由一个二元数组组成，第一维为频率，第二维为是否为传感器自身扫描到的信号标识)。Referring to Figure 2, as shown in Figure 2, it is a schematic diagram of the state space of this embodiment. The state space of this embodiment consists of two parts: a global comprehensive situation expression and a single sensor state embedding representation, wherein the global neutralization situation expression includes the target, signal, and global operation and maintenance status, respectively. In the target comprehensive situation embedding representation, 0 represents that the current intelligent agent has not successfully located the signal, so there is no longitude and latitude, and 1 represents that the target is located; taking three detection sensors in three different areas as an example, since positioning requires at least two sensors in different areas, the sensor ID supporting positioning, 0 represents invalid data, 1 represents that the sensor between areas (1,2) is positioned, 2 represents that the sensor between areas (1,3) is positioned, 3 represents that the sensor between areas (2,3) is positioned, and 4 represents that the sensors between areas (1,2,3) are all positioned. The signal comprehensive situation embedding representation consists of all the signal situations currently detected, specifically including frequency, bandwidth, amplitude, azimuth, and key signal identification. The discrete embedding representation of the global operation and maintenance status consists of the working status of all current sensors, including task type, task status, current wide-scan frequency band (fill 0 if none), and current control frequency point (fill 0 if none). The embedding representation of a single sensor status includes the current sensor signal situation data (frequency, bandwidth, amplitude, direction, key signal identification), historical fixed-frequency data (frequency, bandwidth, identification), the current working status of the sensor (task type, task status, current wide-scan frequency band, current control frequency point), and the set of wide-scan frequency points of similar sensors (composed of a binary array, the first dimension is the frequency, and the second dimension is whether it is the signal identification scanned by the sensor itself).

本实施例以超短波探测传感器为例，强化学习智能体的动作空间如表1所示。对于其中的连续动作，通过从对应的高斯分布进行采样得到对应的具体动作输出值。This embodiment takes the ultrashort wave detection sensor as an example, and the action space of the reinforcement learning agent is shown in Table 1. For the continuous actions therein, the corresponding specific action output value is obtained by sampling from the corresponding Gaussian distribution.

表1强化学习智能体动作空间表Table 1. Action space of reinforcement learning agent

步骤二：通过采样对强化学习智能体模型进行训练，并通过奖励塑造工程引导强化学习智能体模型学习，训练包括中心化训练与去中心化执行。Step 2: Train the reinforcement learning agent model through sampling, and guide the reinforcement learning agent model to learn through reward shaping engineering. The training includes centralized training and decentralized execution.

面向辐射源定位跟踪的多传感器智能协同控制方法由任务规划模块和智能体训练模块组成。其中任务规划模块通过加载不同的智能体模型给出不同的传感器协同方案，如资源优先、定位优先、跟踪优先等。The multi-sensor intelligent collaborative control method for radiation source positioning and tracking consists of a task planning module and an agent training module. The task planning module gives different sensor collaboration schemes by loading different agent models, such as resource priority, positioning priority, tracking priority, etc.

智能体训练为本实施例的核心模块，主要通过与仿真推演软件或者真实环境交互对相应的智能体进行训练。Agent training is the core module of this embodiment, which mainly trains the corresponding agent by interacting with simulation software or real environment.

参照图3和4，如图3所示是本实施例智能体训练模块运行逻辑图，如图4所示是本实施例智能体训练系统架构示意图。3 and 4 , FIG3 is a running logic diagram of the intelligent agent training module of the present embodiment, and FIG4 is a schematic diagram of the architecture of the intelligent agent training system of the present embodiment.

本实施例智能体训练主要分为采样部分和训练部分，其中蓝方为辐射源方，其对应的蓝方轨迹生成子模块(辐射源以怎样的轨迹进行移动)与蓝方决策子模块(辐射源携带的信号何时开关)均由规则进行生成；红方即为传感器智能体方。红方模型为智能体模型，蓝方模型为规则模型。本实施例针对在复杂电磁频谱环境中部署PPO强化学习智能体式面临的低采样率问题，基于gRPC分布式架构，对智能体训练模块采用分布式采样集中式训练方案。参照图5，如图5所示是本实施例分布式架构框架图。开启多个采样线程分别在多个具备不同配置场景的并行仿真环境中进行独立采样。随后将采样数据统一放入采样经验缓存池。当满足训练条件是时，从采样经验缓存池中取出采样数据根据对应的强化学习算法集中式训练对智能体模型进行更新，然后将更新后的模型参数放入模型参数缓存池。采样模块会定期从模型参数缓存池中取出最新模型参数更新智能体模型，从而更新智能体策略，并利用更新后的策略进行采样。The agent training of this embodiment is mainly divided into a sampling part and a training part, in which the blue side is the radiation source side, and its corresponding blue side trajectory generation submodule (what trajectory the radiation source moves with) and the blue side decision submodule (when the signal carried by the radiation source is switched on and off) are generated by rules; the red side is the sensor agent side. The red side model is an agent model, and the blue side model is a rule model. In this embodiment, based on the gRPC distributed architecture, a distributed sampling centralized training scheme is adopted for the agent training module to solve the low sampling rate problem faced by deploying PPO reinforcement learning agents in a complex electromagnetic spectrum environment. Referring to Figure 5, as shown in Figure 5, it is a distributed architecture framework diagram of this embodiment. Multiple sampling threads are opened to perform independent sampling in multiple parallel simulation environments with different configuration scenarios. The sampled data is then uniformly placed in the sampling experience cache pool. When the training conditions are met, the sampled data is taken out from the sampling experience cache pool to update the agent model according to the centralized training of the corresponding reinforcement learning algorithm, and then the updated model parameters are placed in the model parameter cache pool. The sampling module will periodically take out the latest model parameters from the model parameter cache pool to update the agent model, thereby updating the agent strategy, and use the updated strategy for sampling.

至此，基于分布式采样集中式训练的强化学习训练技术完成了采样、训练和模型迭代的循环，通过最大化利用物理硬件资源来提升采样率，极大的提升智能体学习效率。At this point, the reinforcement learning training technology based on distributed sampling centralized training has completed the cycle of sampling, training and model iteration, and has greatly improved the learning efficiency of intelligent agents by maximizing the use of physical hardware resources to increase the sampling rate.

针对跨区域多传感器协同控制的难点，考虑到跨区域多传感器之间为异步执行各自任务的特点，基于多智能强化学习思想，提出中心化训练，去中心化执行(CTDE)式的半多智能体PPO强化学习算法。Aiming at the difficulty of cross-regional multi-sensor collaborative control and taking into account the characteristics of cross-regional multi-sensors performing their respective tasks asynchronously, a centralized training and decentralized execution (CTDE) semi-multi-agent PPO reinforcement learning algorithm was proposed based on the idea of multi-intelligence reinforcement learning.

强化学习算法普遍具有一定的波动性，这对训练过程和最终训练表现都具有较大影响。为了确保策略模型在优化时单调提升，置信区域策略优化(Trust Region PolicyOptimization,TRPO)算法用KL散度来衡量新旧策略之间差异，构建了类似自然梯度法的目标公式，并以此目标来不断优化策略从而有效防止因策略梯度中的噪声带来的较大波动，且使用了共轭梯度法来减少Fisher信息矩阵的计算量。但作为一个二阶方法，TRPO算法仍需要大量的计算成本。PPO算法在TRPO的基础上进一步化简目标函数，使用一阶近似形式，在保证精度的同时加快了训练速度。Reinforcement learning algorithms generally have a certain degree of volatility, which has a great impact on the training process and the final training performance. In order to ensure that the policy model improves monotonically during optimization, the Trust Region Policy Optimization (TRPO) algorithm uses KL divergence to measure the difference between the new and old policies, constructs a target formula similar to the natural gradient method, and uses this goal to continuously optimize the policy to effectively prevent large fluctuations caused by noise in the policy gradient, and uses the conjugate gradient method to reduce the amount of calculation of the Fisher information matrix. However, as a second-order method, the TRPO algorithm still requires a lot of computational cost. The PPO algorithm further simplifies the objective function based on TRPO, uses a first-order approximation form, and speeds up the training while ensuring accuracy.

首先，TRPO算法的目标函数可以整理为：First, the objective function of the TRPO algorithm can be organized as:

对应的约束条件为：The corresponding constraints are:

其中为π为新策略；π_old为旧策略；s为状态；a为动作，

为旧策略的优势函数，定义如下(γ为衰减因子)：Where π is the new strategy; π _old is the old strategy; s is the state; a is the action,

is the advantage function of the old strategy, defined as follows (γ is the decay factor):

状态行动值函数：

State-action-value function:

状态值函数：

State value function:

优势函数：

Advantage function:

为新旧策略的KL散度的平均值。

is the average value of the KL divergence between the new and old strategies.

虽然TRPO使用共轭梯度法来尽可能降低求解该复杂且带约束的目标函数的计算量，但仍然需要大量的计算成本，算法效率较低。为此PPO算法进一步对目标函数进行了优化。Although TRPO uses the conjugate gradient method to reduce the amount of computation required to solve the complex and constrained objective function, it still requires a lot of computational cost and the algorithm efficiency is low. For this reason, the PPO algorithm further optimizes the objective function.

实际应用中，对期望的计算往往用使用蒙特卡洛法进行近似，因此TRPO算法的目标函数变为：In practical applications, the expected calculation is often approximated using the Monte Carlo method, so the objective function of the TRPO algorithm becomes:

令

用于表示新旧策略的比率，则TRPO算法的目标函数进一步变为：make

Used to represent the ratio of the new and old strategies, the objective function of the TRPO algorithm further becomes:

将TRPO的约束条件近似为r_t(θ)∈[1-ε,1+ε]，其中ε为clip系数，则带约束的TRPO目标函数可以表示无约束的目标函数：The constraints of TRPO are approximated as r _t (θ) ∈ [1-ε, 1+ε], where ε is the clip coefficient. Then the constrained TRPO objective function can be expressed as the unconstrained objective function:

从而用常见的梯度下降法进行求解。在保有TRPO算法能够使策略稳定提升的优势的同时，用一阶方法大大降低了计算量，提升算法效率。进一步将状态值函数(VF)的目标函数和策略模型的熵(S)添加至最终的目标函数，于是PPO完整的目标函数就变为：The common gradient descent method is then used to solve the problem. While maintaining the advantage of the TRPO algorithm in improving the strategy steadily, the first-order method greatly reduces the amount of calculation and improves the efficiency of the algorithm. The objective function of the state value function (VF) and the entropy (S) of the strategy model are further added to the final objective function, so the complete objective function of PPO becomes:

其中，

C₂分别为对应项的系数，在本实施例中默认C₁＝0.001，C₂＝0。in,

C ₂ are coefficients of the corresponding items. In this embodiment, C ₁ =0.001 and C ₂ =0 by default.

在复杂电磁频谱环境中进行目标信号定位跟踪需要跨区域多传感器之间进行协作，常见的多智能体强化学习有3种框架：中心化训练与中心化执行(CTCE),中心化训练与去中心化执行(DTCE),去中心化训练与去中心化执行(DTDE)。对于中心化训练来说，需要一个中央控制器收集所有智能体的全局状态并做出统一决策，因此中心化训练可以保证较好的协作效果。但位于各个不同区域的传感器各自的观测范围不同以及观测到的信号数量、类型等也不同，因此各传感器之间为异步根据自身当前状态执行各自任务。参照图6，如图6所示是本实施例多智能体学习系统示意图。本实施例采用中心化训练与去中心化执行(DTCE)框架来部署多智能体学习系统。Target signal positioning and tracking in a complex electromagnetic spectrum environment requires collaboration between multiple sensors across regions. There are three common frameworks for multi-agent reinforcement learning: centralized training and centralized execution (CTCE), centralized training and decentralized execution (DTCE), and decentralized training and decentralized execution (DTDE). For centralized training, a central controller is required to collect the global state of all agents and make unified decisions, so centralized training can ensure better collaboration. However, the sensors located in different areas have different observation ranges and different numbers and types of observed signals, so each sensor asynchronously performs its own tasks according to its current state. Referring to Figure 6, Figure 6 is a schematic diagram of the multi-agent learning system of this embodiment. This embodiment uses a centralized training and decentralized execution (DTCE) framework to deploy a multi-agent learning system.

为了进一步利用专家先验知识(即专家策略)加速智能体学习，进一步解决强化学习方法样本利用率低的问题，本发明提出了基于生成对抗模仿学习(GenerativeAdversarial Imitation Learning，GAIL)的专家策略引导学习方法。受生成对抗网络以及逆强化学习中非线性损失函数成功应用的启发，生成对抗模仿学习可以直接从专家轨迹中学习策略。具体的实施方式如图7所示。In order to further utilize expert prior knowledge (i.e., expert strategy) to accelerate agent learning and further solve the problem of low sample utilization rate of reinforcement learning methods, the present invention proposes an expert strategy guided learning method based on generative adversarial imitation learning (GAIL). Inspired by the successful application of nonlinear loss functions in generative adversarial networks and inverse reinforcement learning, generative adversarial imitation learning can learn strategies directly from expert trajectories. The specific implementation method is shown in Figure 7.

逆强化学习的目标为从潜在的函数族C:R^S×A＝{c:S×A→R}中拟合一个损失函数c，该损失函数使得期望累积损失在专家示范的轨迹上最小，而对于其他任何策略生成的轨迹则较大。考虑到在复杂多辐射源电磁频谱环境中，由于环境维度较高，存在巨大的损失函数集

在供给有限的示范数据集情况下，逆强化学习很容易过拟合。因此采用一个损失函数正则化器ψ(c)来避免过拟合。如果分配给专家示范中的动作状态对较小的损失值，ψ(c)会对损失函数施加轻微惩罚，反之施加较大惩罚。于是本发明中，逆强化学习的目标函数就可以表示为：The goal of inverse reinforcement learning is to fit a loss function c from the potential function family C: ^RS×A ={c:S×A→R}, which minimizes the expected cumulative loss on the trajectory demonstrated by the expert and is larger for the trajectory generated by any other strategy. Considering that in a complex electromagnetic spectrum environment with multiple radiation sources, there is a huge set of loss functions due to the high dimensionality of the environment.

In the case of a limited supply of demonstration data sets, inverse reinforcement learning is prone to overfitting. Therefore, a loss function regularizer ψ(c) is used to avoid overfitting. If the action state assigned to the expert demonstration has a small loss value, ψ(c) will impose a slight penalty on the loss function, otherwise a larger penalty will be imposed. Therefore, in the present invention, the objective function of inverse reinforcement learning can be expressed as:

进一步将ψ(c)采取对专家数据的期望的形式，可以表示为：Further, ψ(c) takes the form of expectation of expert data, which can be expressed as:

其中in

并定义策略的占用度为：And define the occupancy of the strategy as:

其中γ为马尔科夫决策过程中的衰减因子。占用度可以解释为智能体在策略π下与环境交互时遇到的状态动作对的分布。于是在不同策略下与环境交互产生的轨迹的衰减回报期望就可以表示为：Where γ is the decay factor in the Markov decision process. Occupancy can be interpreted as the distribution of state-action pairs encountered by the agent when interacting with the environment under strategy π. Therefore, the decayed return expectation of the trajectory generated by interacting with the environment under different strategies can be expressed as:

于是可证明先通过逆强化学习恢复一个损失函数，然后再利用强化学习来习得策略可以表示为：It can be proved that firstly recovering a loss function through inverse reinforcement learning and then learning the strategy through reinforcement learning can be expressed as:

其中λ是策略熵H(π)的权重；

是ψ(c)的凸共轭形式；ρ_π和

是策略π和专家策略π_E的占用度。Where λ is the weight of the policy entropy H(π);

is the convex conjugate form of ψ(c); ρ _π and

is the occupancy of strategy π and expert strategy π _E.

在GAIL中，损失函数被设置为：In GAIL, the loss function is set as:

c(s,a)＝log(D(s,a))c(s,a)＝log(D(s,a))

其中D:S×A→(0,1)是一个鉴别器。c(s,a)为更新智能体的策略提供奖励信号。可进一步证明：Where D:S×A→(0,1) is a discriminator. c(s,a) provides a reward signal for updating the agent's strategy. It can be further proved that:

最终，基于生成对抗模仿学习(Generative Adversarial Imitation Learning，GAIL)的专家策略引导学习方法就可以总结为求解上式的鞍点。Finally, the expert strategy-guided learning method based on Generative Adversarial Imitation Learning (GAIL) can be summarized as solving the saddle point of the above equation.

其中，鉴别器D(s,a)通过最小化

的形式被训练用于将专家策略采样轨迹τ_E中的状态动作对(s,a)～τ_E和智能体生成轨迹τ_agent中的状态动作对(s,a)～τ_agent区分开；生成器(即智能体策略π)通过最大化E_π[log(D(s,a))]让鉴别器将智能体状态动作对“误判”为专家采样轨迹状态动作对；

为策略的γ衰减因果熵。The discriminator D(s,a) is constructed by minimizing

The form of is trained to distinguish the state-action pairs (s, a) ~ τ _E in the expert strategy sampling trajectory τ _E and the state-action pairs (s, a) ~ τ _agent in _{the agent-generated trajectory τ agent} ; the generator (i.e., the agent strategy π) maximizes E _π [log(D(s, a))] to make the discriminator "misjudge" the agent state-action pair as the expert sampling trajectory state-action pair;

is the γ-decay causal entropy of the policy.

参照图7，如图7所示是本实施例生成对抗模仿学习的专家策略引导学习方法示意图。在每个新的迭代回合：(1)生成器利用当前传感器智能体策略与环境交互得到智能体生成轨迹；(2)智能体生成轨迹和示范轨迹一并输入鉴别器并以监督学习的方式更新鉴别器参数；(3)更新后的鉴别器输出新的鉴别奖励函数；(4)利用更新后的奖励函数来提供奖励信号进一步更新智能体策略(即生成器)。不断重复上述步骤，生成器和鉴别器通过对抗式训练不断优化各自性能，直到习得理想策略。Referring to FIG. 7 , FIG. 7 is a schematic diagram of the expert strategy guided learning method for generative adversarial imitation learning in this embodiment. In each new iteration round: (1) the generator uses the current sensor agent strategy to interact with the environment to obtain the agent generation trajectory; (2) the agent generation trajectory and the demonstration trajectory are input into the discriminator together and the discriminator parameters are updated in a supervised learning manner; (3) the updated discriminator outputs a new identification reward function; (4) the updated reward function is used to provide a reward signal to further update the agent strategy (i.e., the generator). The above steps are repeated continuously, and the generator and discriminator continuously optimize their respective performance through adversarial training until the ideal strategy is learned.

本实施例对定位跟踪任务难以拆解的问题，基于分层强化学习思想，通过奖励塑造工程引导智能体进行学习。具体奖励方式如表2所示。This embodiment solves the problem that the positioning and tracking task is difficult to disassemble, and guides the intelligent agent to learn through reward shaping engineering based on the idea of hierarchical reinforcement learning. The specific reward method is shown in Table 2.

表2奖励汇总表Table 2 Reward Summary

其中奖励1-7为人为设计的奖励，围绕发现信号、定位目标、控守与探测均衡进行设置，具体的奖励值和计算过程在实际的试验过程中将不断调整。奖励8为终局模式奖励，根据整体的信号探测效能评估的结果进行奖励计算。奖励9较为特殊，为好奇心奖励。从强化学习中智能体对状态空间探索的方向来看，额外给出如下奖惩内容，以便智能体拥有更好的搜索状态空间。传统强化学习算法在反馈稀疏的环境中样本利用率极低，学习速度缓慢难以收敛。在本发明中，大多数状态下不同区域的传感器均处于搜索信号状态下，只有扫描到目标信号，才能获取相应的正奖励。此外，在某些场景中，存在一组状态内部的转移概率较高，但这组状态与其他状态之间的转移率较低。因此设计好奇心奖励，针对探索率较低的状态赋予额外奖励以激励对未知空间的探索，或者避免部分空间未被探索。Among them, rewards 1-7 are artificially designed rewards, which are set around finding signals, locating targets, and controlling and detecting balance. The specific reward values and calculation processes will be continuously adjusted during the actual experiment. Reward 8 is a final mode reward, and the reward calculation is based on the results of the overall signal detection efficiency evaluation. Reward 9 is more special and is a curiosity reward. From the perspective of the direction of the agent's exploration of the state space in reinforcement learning, the following rewards and punishments are additionally given so that the agent has a better search state space. Traditional reinforcement learning algorithms have extremely low sample utilization in an environment with sparse feedback, and the learning speed is slow and difficult to converge. In the present invention, sensors in different areas in most states are in a signal search state, and only when the target signal is scanned can the corresponding positive reward be obtained. In addition, in some scenarios, there is a group of states with a high transition probability, but the transition rate between this group of states and other states is low. Therefore, a curiosity reward is designed to give additional rewards to states with low exploration rates to encourage the exploration of unknown spaces, or to avoid some spaces from being unexplored.

具体的，设计奖惩r(S_known,S_novel)作为好奇心奖励函数，根据当前已经探索状态空间与当前未探索的状态空间对探索未知状态的智能体进行奖励。首先在智能体训练中，设计专用的记忆存储模块记录训练过程中各个状态s_t出现的概率：φ(s_t；m)，其中m表示记忆存储模块参数，同时在训练之后对记忆存储网络参数进行更新。当启用好奇心奖励之后，在每次智能体做出动作并转到新的状态时，给出对应奖惩：Asign(k-φ(s_t,m_-1))，其中A为奖励权重；k为奖励阈值，当对应状态转移概率大于k是，则进行惩罚，反之进行奖励。Specifically, the reward and punishment r(S _known ,S _novel ) is designed as a curiosity reward function, and the agent that explores the unknown state is rewarded according to the currently explored state space and the currently unexplored state space. First, in the agent training, a dedicated memory storage module is designed to record the probability of each state s _t appearing during the training process: φ(s _t ; m), where m represents the memory storage module parameters, and the memory storage network parameters are updated after training. After the curiosity reward is enabled, each time the agent takes an action and moves to a new state, the corresponding reward and punishment are given: Asign(k-φ(s _t ,m _-1 )), where A is the reward weight; k is the reward threshold. When the corresponding state transition probability is greater than k, it is punished, otherwise it is rewarded.

步骤三：通过训练得到的强化学习智能体模型协同控制多个传感器。Step 3: The trained reinforcement learning agent model collaboratively controls multiple sensors.

通过任务规划模块加载不同的智能体模型给出不同的传感器协同方案实现对多个传感器的协同控制，在复杂多辐射源环境中，通过执行不同的工作模式(宽带扫描、定频控守、测向定位)，实现对辐射源目标的即时发现与持续定位跟踪。Different intelligent agent models are loaded through the task planning module to give different sensor coordination schemes to achieve coordinated control of multiple sensors. In a complex multi-radiation source environment, by executing different working modes (wideband scanning, fixed-frequency control, direction finding and positioning), the radiation source target can be instantly discovered and continuously positioned and tracked.

本实施例提供的多传感器智能协同控制方法具有复杂信号态势处理能力。针对复杂电磁频谱环境中对重点辐射源信号的定位跟踪的需求，将深度学习与强化学习相结合，实现对全局综合态势(目标综合态势、信号综合态势、全局运维状态)以及单个传感器状态的综合处理。通过对不同传感器的动作空间以及单个动作的内部表征进行细粒化设置，综合利用强化学习技术以及基于专家知识的生成对抗模仿学习技术，实现对复杂电磁频谱环境中重点信号的持续定位跟踪。具备对短猝发信号的捕捉与定位能力，并具备一定的持续定位跟踪能力。The multi-sensor intelligent collaborative control method provided in this embodiment has the ability to process complex signal situations. In response to the need for positioning and tracking of key radiation source signals in complex electromagnetic spectrum environments, deep learning is combined with reinforcement learning to achieve comprehensive processing of the global comprehensive situation (target comprehensive situation, signal comprehensive situation, global operation and maintenance status) and the status of a single sensor. By fine-grained settings of the action space of different sensors and the internal representation of a single action, the reinforcement learning technology and the generative adversarial imitation learning technology based on expert knowledge are comprehensively utilized to achieve continuous positioning and tracking of key signals in complex electromagnetic spectrum environments. It has the ability to capture and locate short burst signals, and has a certain degree of continuous positioning and tracking capabilities.

实施例2Example 2

参照图8，如图8所示是本实施例提供的多传感器智能协同控制装置结构框图，该装置具体包括：Referring to FIG. 8 , FIG. 8 is a structural block diagram of a multi-sensor intelligent collaborative control device provided in this embodiment, and the device specifically includes:

智能体模型建立模块，智能体模型建立模块建立每个传感器对应的强化学习智能体模型，强化学习智能体模型的状态空间包括全局综合态势表达和单个传感器状态嵌入表示，强化学习智能体模型的动作空间包括由多传感器执行的不同任务所抽象出的动作输出值；An agent model building module, which builds a reinforcement learning agent model corresponding to each sensor. The state space of the reinforcement learning agent model includes a global comprehensive situation expression and a single sensor state embedding expression. The action space of the reinforcement learning agent model includes action output values abstracted from different tasks performed by multiple sensors.

智能体模型训练模块，智能体模型训练模块通过采样对强化学习智能体模型进行训练，并通过奖励塑造工程引导强化学习智能体模型学习，训练包括中心化训练与去中心化执行；Agent model training module: The agent model training module trains the reinforcement learning agent model through sampling and guides the learning of the reinforcement learning agent model through reward shaping engineering. The training includes centralized training and decentralized execution;

传感器控制模块，传感器控制模块通过训练得到的强化学习智能体模型协同控制多个传感器。The sensor control module collaboratively controls multiple sensors through the trained reinforcement learning agent model.

本实施例提供的多传感器智能协同控制装置通过建立强化学习智能体模型来实现多个传感器的协同控制，能够使得多个传感器在复杂工作环境下对有复杂信号态势具有相应的处理能力。利用强化学习技术以及基于专家知识的生成对抗模仿学习技术，实现对复杂电磁频谱环境中重点信号的持续定位跟踪。能够控制跨区域多传感器之间异步执行各自任务，具备对短猝发信号的捕捉与定位能力，并具备一定的持续定位跟踪能力。The multi-sensor intelligent collaborative control device provided in this embodiment realizes the collaborative control of multiple sensors by establishing a reinforcement learning intelligent agent model, which enables multiple sensors to have corresponding processing capabilities for complex signal situations in complex working environments. Utilizing reinforcement learning technology and generative adversarial imitation learning technology based on expert knowledge, continuous positioning and tracking of key signals in complex electromagnetic spectrum environments can be achieved. It can control the asynchronous execution of tasks between multiple sensors across regions, has the ability to capture and locate short burst signals, and has a certain continuous positioning and tracking capability.

实施例3Example 3

本优选实施例提供了一种计算机设备，该计算机设备可以实现本申请实施例所提供的多传感器智能协同控制方法任一实施例中的步骤，因此，可以实现本申请实施例所提供的多传感器智能协同控制方法的有益效果，详见前面的实施例，在此不再赘述。This preferred embodiment provides a computer device, which can implement the steps in any embodiment of the multi-sensor intelligent collaborative control method provided in the embodiments of the present application. Therefore, the beneficial effects of the multi-sensor intelligent collaborative control method provided in the embodiments of the present application can be achieved. Please refer to the previous embodiments for details and will not be repeated here.

实施例4Example 4

本领域普通技术人员可以理解，上述实施例的各种方法中的全部或部分步骤可以通过指令来完成，或通过指令控制相关的硬件来完成，该指令可以存储于一计算机可读存储介质中，并由处理器进行加载和执行。为此，本发明实施例提供一种存储介质，其中存储有多条指令，该指令能够被处理器进行加载，以执行本发明实施例所提供的多传感器智能协同控制方法中任一实施例的步骤。Those skilled in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by instructions, or by controlling related hardware through instructions, and the instructions can be stored in a computer-readable storage medium and loaded and executed by a processor. To this end, an embodiment of the present invention provides a storage medium, which stores multiple instructions, and the instructions can be loaded by a processor to execute the steps of any embodiment of the multi-sensor intelligent collaborative control method provided in the embodiment of the present invention.

其中，该存储介质可以包括：只读存储器(ROM，Read Only Memory)、随机存取记忆体(RAM，Random Access Memory)、磁盘或光盘等。The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.

由于该存储介质中所存储的指令，可以执行本发明实施例所提供的任一多传感器智能协同控制方法实施例中的步骤，因此，可以实现本发明实施例所提供的任一多传感器智能协同控制方法所能实现的有益效果，详见前面的实施例，在此不再赘述。Since the instructions stored in the storage medium can execute the steps in any multi-sensor intelligent collaborative control method embodiment provided in the embodiments of the present invention, the beneficial effects that can be achieved by any multi-sensor intelligent collaborative control method provided in the embodiments of the present invention can be achieved. Please refer to the previous embodiments for details and will not be repeated here.

以上所述仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等，均应包含在本发明的保护范围之内。The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the protection scope of the present invention.

Claims

1. The intelligent cooperative control method for the multiple sensors is characterized by comprising the following steps of:

establishing a reinforcement learning agent model corresponding to each sensor, wherein a state space of the reinforcement learning agent model comprises global comprehensive situation expression and single sensor state embedded representation, and an action space of the reinforcement learning agent model comprises action output values abstracted by different tasks executed by a plurality of sensors;

training the reinforcement learning intelligent agent model through sampling, and guiding the reinforcement learning intelligent agent model to learn through rewarding modeling engineering, wherein the training comprises centering training and decentralizing execution;

the reinforcement learning agent model obtained through training cooperatively controls a plurality of sensors.

2. The multi-sensor intelligent cooperative control method according to claim 1, wherein the training of the reinforcement learning agent model by sampling specifically includes:

starting a plurality of sampling threads to independently sample in a plurality of parallel simulation environments or parallel real environments with different configuration scenes respectively;

uniformly placing the sampling data into a sampling experience buffer pool;

when the training conditions are met, sampling data are taken out from the sampling experience buffer pool, the intelligent body model is updated according to the corresponding reinforcement learning algorithm centralized training, and then the updated model parameters are put into the model parameter buffer pool.

3. The multi-sensor intelligent cooperative control method of claim 1, wherein the guiding the reinforcement learning agent model learning through a bonus modeling project specifically comprises:

setting artificial design rewards, final office mode rewards and curiosity rewards;

the artificial design rewards comprise corresponding rewards when a preset task is completed or the task fails;

the final pattern rewards include awarding rewards based on overall signal detection effectiveness;

the curiosity rewards include awarding rewards when an unknown space is explored.

4. The multi-sensor intelligent cooperative control method according to claim 1, wherein the confidence region strategy optimization method adopted when the method establishes the reinforcement learning agent model specifically comprises the following steps:

simplifying a confidence region strategy optimization algorithm by using a first-order approximation form, wherein the confidence region strategy optimization algorithm is simplified as follows:

the corresponding constraint conditions are:

wherein pi is a new policy; pi _old Is an old strategy; s is the state; a is the action of the device,

is a dominance function of the old strategy;

state action value function:

state value function:

dominance function:

wherein, gamma is an attenuation factor,

the average value of KL divergence of the new strategy and the old strategy;

approximating the simplified confidence region strategy optimization algorithm by using a Monte Carlo method to obtain

Order the

For representing the ratio of new strategy to old strategy, get

Approximating the constraint of the confidence region policy optimization algorithm to r _t (θ)∈[1-∈,1+∈]Where e is a clip coefficient, the constrained confidence region policy optimization algorithm objective function may represent an unconstrained objective function:

adding the entropy of the objective function of the state value function and the entropy of the strategy model to the unconstrained objective function to obtain a complete objective function:

wherein ,

C ₁ 、C ₂ the coefficients of the corresponding items are preset respectively.

5. The multi-sensor intelligent cooperative control method of claim 1, wherein the performing of the centering training and the decentralizing specifically comprises:

collecting global states of all agents and making a unified decision by a central controller;

and each sensor asynchronously executes each task according to the current state of the sensor.

6. The multi-sensor intelligent cooperative control method of claim 1, further comprising guiding learning of the reinforcement learning agent model based on generating expert policies against imitation learning.

7. The multi-sensor intelligent cooperative control method of claim 1, wherein the learning based on the expert strategy generating the anti-imitation learning to guide the reinforcement learning agent model specifically comprises:

repeating the following steps until an optimal strategy is obtained:

interaction between the reinforcement learning agent model corresponding to the current sensor and the environment is utilized to obtain an agent generation track;

the intelligent agent generates a track and an demonstration track to be input into the discriminator together and updates the parameters of the discriminator in a supervised learning mode;

the updated discriminator outputs a new discrimination reward function;

the agent policy is further updated with the updated reward function to provide a reward signal.

8. A multi-sensor intelligent cooperative control device, the device comprising:

the system comprises an intelligent body model building module, a multi-sensor intelligent body model processing module and a multi-sensor intelligent body model processing module, wherein the intelligent body model building module builds a reinforcement learning intelligent body model corresponding to each sensor, a state space of the reinforcement learning intelligent body model comprises global comprehensive situation expression and single sensor state embedded representation, and an action space of the reinforcement learning intelligent body model comprises action output values abstracted by different tasks executed by a plurality of sensors;

the intelligent body model training module is used for training the reinforcement learning intelligent body model through sampling and guiding the reinforcement learning intelligent body model to learn through rewarding modeling engineering, and the training comprises centering training and decentralizing execution;

and the sensor control module cooperatively controls the plurality of sensors through the reinforcement learning intelligent body model obtained through training.

9. A computer device, characterized in that it comprises a processor and a memory in which a computer program is stored, which computer program is loaded and executed by the processor to implement the multi-sensor intelligent cooperative control method according to any of claims 1 to 7.

10. A computer readable storage medium, characterized in that the storage medium has stored therein a computer program that is loaded and executed by a processor to implement the multi-sensor intelligent cooperative control method according to any one of claims 1 to 7.