CN116956011B - Combat entity defending method and device, electronic equipment and storage medium - Google Patents
- Publication number: CN116956011B (application CN202310217085.8A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a defense method and device for a combat entity, an electronic device, and a storage medium. The method includes: acquiring combat data information when an enemy combat entity is detected attacking the own combat entity, the combat data information including first related state information corresponding to the own combat entity, second related state information corresponding to the enemy combat entity, action information corresponding to the own combat entity, and combat loss information; determining to-be-trained data information from the combat data information, the to-be-trained data information including to-be-trained state information, to-be-trained action information, to-be-trained reward information, and to-be-trained next-moment state information; inputting the to-be-trained data information into a defense strategy model to obtain a target defense strategy output by the model; and controlling the own combat entity to defend against the enemy combat entity according to the target defense strategy. The method controls the own combat entity to defend accurately against the enemy combat entity based on the target defense strategy, effectively improving the defense performance of the combat entity.
Description
Technical Field
The present invention relates to the field of intelligent control technologies, and in particular, to a method and apparatus for defending a combat entity, an electronic device, and a storage medium.
Background
Under digitized conditions the battlefield is highly transparent: the participating combat entities are complex, span multidimensional spaces such as sea, land, air and space, and fight in a cooperative manner. The war process therefore becomes more complex, placing higher demands on command decision-makers' global information acquisition, global observation and operation orchestration, and coordination and control of combat entities across multidimensional spaces.
In existing combat processes, the own combat entity usually relies on remote control by a command decision-maker or on decision methods based on expert knowledge or data-driven rules when defending against an enemy combat entity. Because the whole defense process tends to use such simple methods, the own combat entity may be unable to defend against the enemy combat entity accurately.
Disclosure of Invention
The invention provides a defense method and device for a combat entity, an electronic device, and a storage medium which, in a complex and changeable environment, obtain a target defense strategy from the acquired combat data information based on a defense strategy model with good generalization and robustness, and control the own combat entity to defend accurately against the enemy combat entity according to the target defense strategy, thereby effectively improving the defense performance of the own combat entity.
The invention provides a defense method for a combat entity, comprising the following steps:
acquiring combat data information when an enemy combat entity is detected attacking the own combat entity, wherein the combat data information comprises first related state information corresponding to the own combat entity, second related state information corresponding to the enemy combat entity, action information corresponding to the own combat entity, and combat loss information;
determining to-be-trained data information from the combat data information, wherein the to-be-trained data information comprises to-be-trained state information, to-be-trained action information, to-be-trained reward information, and to-be-trained next-moment state information;
inputting the to-be-trained data information into a defense strategy model to obtain a target defense strategy output by the defense strategy model, wherein the defense strategy model is trained based on a historical to-be-trained data information set in an offline reinforcement learning process;
and controlling the own combat entity to defend against the enemy combat entity according to the target defense strategy.
According to the defense method of the combat entity, the to-be-trained state information is obtained from the first related state information and the second related state information, and comprises a first real-time position, first movement speed, first movement direction, first combat state and first damage condition corresponding to the own combat entity, a second real-time position, second movement speed, second movement direction, second combat state and second damage condition corresponding to the enemy combat entity, as well as an ammunition reserve and an energy cruising condition; the to-be-trained action information is obtained from the action information and comprises a position-movement anchor point corresponding to the own combat entity and the interception action of the own combat entity when defending against a missile launched by the enemy combat entity; and the to-be-trained reward information comprises the combat loss information and interception missile information.
The defense method of the combat entity obtains the to-be-trained reward information based on a first formula r = r1 − r2, where r denotes the to-be-trained reward information; r1 = Value_t − Value_{t+1} denotes the combat loss information, namely the change in the own combat entity's life value between time t and time t+1, with Value_t denoting the own combat entity's first life value at time t and Value_{t+1} its second life value at time t+1; r2 = q × n_t denotes the interception missile information, namely the life value expended when the own combat entity successfully intercepts missiles, with q denoting a preset coefficient and n_t denoting the number of missiles the own combat entity successfully intercepts or successfully evades.
The defense method of the combat entity trains the defense strategy model as follows: in the offline reinforcement learning process, a historical to-be-trained data information set, a to-be-trained defense strategy model, an initialized Q-network and a target Q-network are obtained; a first behavior value function corresponding to the initialized Q-network and a second behavior value function corresponding to the target Q-network are determined from the historical to-be-trained data information set; and the to-be-trained defense strategy model is updated according to the first and second behavior value functions to obtain the trained defense strategy model, in which the action information is greater than a preset action threshold.
In the defense method of the combat entity, determining the first behavior value function corresponding to the initialized Q-network and the second behavior value function corresponding to the target Q-network from the historical to-be-trained data information set comprises: determining, from the historical to-be-trained data information set, historical to-be-trained data information samples corresponding to combat entity samples, and determining the first and second behavior value functions from those samples.
According to the defense method of the combat entity, updating the to-be-trained defense strategy model according to the first behavior value function and the second behavior value function to obtain the trained defense strategy model comprises: determining a loss function from the first and second behavior value functions; and updating the network parameters of the to-be-trained defense strategy model according to the loss function to obtain the trained defense strategy model.
The invention also provides a defense device for a combat entity, comprising:
a data collection module, configured to acquire combat data information when an enemy combat entity is detected attacking the own combat entity, wherein the combat data information comprises first related state information corresponding to the own combat entity, second related state information corresponding to the enemy combat entity, action information corresponding to the own combat entity, and combat loss information;
a data preprocessing module, configured to determine to-be-trained data information from the combat data information, wherein the to-be-trained data information comprises to-be-trained state information, to-be-trained action information, to-be-trained reward information, and to-be-trained next-moment state information;
a decision module, configured to input the to-be-trained data information into a defense strategy model to obtain a target defense strategy output by the defense strategy model, wherein the defense strategy model is trained based on a historical to-be-trained data information set in an offline reinforcement learning process, and to control the own combat entity to defend against the enemy combat entity according to the target defense strategy.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the defense method of a combat entity described in any of the above.
The invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the defense method of a combat entity described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the defense method of a combat entity described in any of the above.
With the defense method and device, electronic device and storage medium for a combat entity provided by the invention, combat data information is acquired when an enemy combat entity is detected attacking the own combat entity, the combat data information including first related state information corresponding to the own combat entity, second related state information corresponding to the enemy combat entity, action information corresponding to the own combat entity, and combat loss information; to-be-trained data information is determined from the combat data information, including to-be-trained state information, to-be-trained action information, to-be-trained reward information and to-be-trained next-moment state information; the to-be-trained data information is input into a defense strategy model to obtain a target defense strategy output by the model, the defense strategy model being trained based on a historical to-be-trained data information set in an offline reinforcement learning process; and the own combat entity is controlled to defend against the enemy combat entity according to the target defense strategy. In a complex and changeable environment, the method obtains the target defense strategy from the acquired combat data information based on a defense strategy model with good generalization and robustness, and controls the own combat entity to defend accurately against the enemy combat entity, thereby effectively improving the defense performance of the own combat entity.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow diagram of a method of defending a combat entity provided by the present invention;
FIG. 2 is a schematic diagram of a framework of a decision algorithm corresponding to a defense strategy model provided by the invention;
FIG. 3 is a schematic view of a defense method of a combat entity according to the present invention;
FIG. 4 is a schematic diagram of a data collection module according to the present invention;
FIG. 5 is a schematic diagram of a decision module according to the present invention;
FIG. 6 is a schematic diagram of a sensing module according to the present invention;
FIG. 7 is a schematic structural diagram of a defense device of a combat entity provided by the present invention;
FIG. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The development of unmanned technology not only changes people's daily lives but also profoundly influences the course of future wars. Unmanned weapon equipment carries out a large number of important combat missions, and the status of armed unmanned vessels in naval combat has risen markedly.
Unlike a traditional warship, an unmanned vessel is not controlled in real time by a commander or operator in a cockpit or control cabin during combat; it must be able to make decisions in real time and autonomously so that a given task can be completed reliably and smoothly. Without a good decision-making system, an unmanned vessel cannot play an effective offensive or defensive role no matter how many missiles it carries.
In the prior art, the development from standard reinforcement learning to offline reinforcement learning proceeds as follows:
At present, the decisions of combat entities mostly rely on remote control by a commander or on decision methods based on expert knowledge or data-driven rules. This is feasible in simple scenarios, such as fixed-point cruising or obstacle avoidance of an unmanned aerial vehicle, but the marine environment is complex and the combat situation changes rapidly, so such decision methods are difficult to apply to complex battlefield environments.
However, the standard reinforcement learning described above has the following drawbacks:
Standard reinforcement learning learns how to perform a task through repeated trial and error and balances exploration against exploitation to achieve better performance. However, such algorithms are difficult to apply to complex real-world problems: first, there is a lack of data and of scenarios in which the algorithm can be trained; second, even if missile-equipped unmanned vessels were available, the repeated engagements required for experiments would be very costly; and finally, training the strategy algorithm in a simulation environment requires a high-fidelity simulator, which is difficult to build.
In summary, the existing combat entity may not accurately defend against the opponent combat entity during the course of combat.
It should be noted that, the combat entity according to the embodiment of the present invention refers to an individual having a certain military capability in a battlefield space, and may interact with a battlefield environment or other combat entities.
Alternatively, the combat entity can include, but is not limited to, an aircraft, a ship/unmanned ship, an unmanned plane, and the like.
An enemy combat entity refers to a combat entity that launches an attack (for example, launching a missile), and the own combat entity refers to the combat entity that defends against the attack launched by the enemy combat entity.
The electronic equipment related to the embodiment of the invention refers to equipment consisting of electronic components such as an integrated circuit, a transistor, an electronic tube and the like, and can perform data information interaction with a combat entity.
Optionally, the electronic device can comprise a computer, a mobile terminal, a wearable device and the like.
Optionally, the electronic device and the combat entity may be connected by a wireless communication technology, which may include, but is not limited to, one of the fourth-generation mobile communication technology (4G), the fifth-generation mobile communication technology (5G), wireless fidelity (Wi-Fi), and the like.
The execution subject according to the embodiment of the present invention may be a defending device of a combat entity or may be an electronic device, and the embodiment of the present invention will be further described below by taking the electronic device as an example.
As shown in fig. 1, a flow chart of a defense method of a combat entity provided by the present invention may include:
101. Acquire combat data information when it is detected that an enemy combat entity attacks the own combat entity.
The combat data information can comprise first relevant state information corresponding to the own combat entity, second relevant state information corresponding to the enemy combat entity, action information corresponding to the own combat entity and combat loss information.
Optionally, the combat data information may further include environmental information of the environment in which the combat entity is located, where the environmental information may include, but is not limited to, wind speed information, temperature information, humidity information, and the like.
Optionally, the first relevant state information may include a real-time position, a movement speed, a movement direction, a combat state, a damage condition, an ammunition reserve, an energy cruising condition, and the like, corresponding to the own combat entity.
Alternatively, the combat state may include, but is not limited to, one of a silence state, an attack state, a defending state, a attack and defense preparation state, and the like.
Optionally, the second related state information may include real-time position, movement speed, movement direction, combat state, damage condition, etc. corresponding to the enemy combat entity.
Optionally, the action information may include a position moving anchor point, a moving direction corresponding to the own combat entity, an interception action of the own combat entity when defending against the missile sent by the enemy combat entity, and the like.
The combat loss information refers to the life value consumed by the own combat entity when defending against missiles launched by the enemy combat entity.
When the electronic device detects that an enemy combat entity exists and that it is attacking the own combat entity, the corresponding combat data information is obtained directly so that a corresponding defense strategy can later be provided for the own combat entity; otherwise, no operation is performed.
Optionally, acquiring the combat data information when the enemy combat entity is detected attacking the own combat entity may include: when the electronic device detects that an enemy combat entity exists, acquiring the distance between the enemy combat entity and the own combat entity; when it is determined from this distance that the enemy combat entity lies within the preset defense range of the own combat entity, judging whether the enemy combat entity attacks the own combat entity; and acquiring the combat data information when an attack is determined.
Optionally, the preset defense range is an interval formed by a first distance threshold and a second distance threshold, where the first distance threshold is smaller than the second distance threshold. The preset defense range may be set before the own combat entity leaves the factory or may be user-defined, and is not specifically limited here.
When the electronic device detects that an enemy combat entity exists, it may first acquire the distance between the enemy combat entity and the own combat entity, and then judge whether this distance lies within the preset defense range of the own combat entity. If it does, the enemy combat entity is approaching the own combat entity; the electronic device then judges whether the enemy combat entity attacks the own combat entity, and acquires the combat data information directly once an attack is confirmed.
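As an illustrative sketch only, the range-and-attack check described above might look as follows in Python; the function and parameter names (should_collect_combat_data, own_pos, defense_range, enemy_is_attacking) are assumptions, not terms from the patent:

```python
import math

def should_collect_combat_data(own_pos, enemy_pos, defense_range, enemy_is_attacking):
    """Return True when combat data information should be acquired.

    defense_range is the preset interval (d_min, d_max) described above,
    with d_min smaller than d_max.
    """
    d_min, d_max = defense_range
    distance = math.dist(own_pos, enemy_pos)      # straight-line distance between the two entities
    within_range = d_min <= distance <= d_max     # enemy lies inside the preset defense range
    return within_range and enemy_is_attacking

# Example: enemy 12 km away, preset range 5-30 km, attack detected -> True
print(should_collect_combat_data((0.0, 0.0), (12_000.0, 0.0), (5_000.0, 30_000.0), True))
```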
Optionally, the electronic device acquiring the combat data information may include the electronic device acquiring the combat data information using radar and/or satellite.
Radar refers to a device that finds a combat entity by radio methods and determines its spatial position, and the satellite refers to a device that acquires images of the combat entity to obtain its related state information.
Optionally, after step 101, the method may further include the electronic device storing the acquired plurality of combat data information in a data pool.
Wherein the plurality of combat data information may constitute a data set of a combat process.
Therefore, the electronic equipment can directly call the combat data information in the data pool, and the use power consumption of the electronic equipment is effectively saved.
102. Determine the to-be-trained data information from the combat data information.
The to-be-trained data information may also be called quadruple data and may include to-be-trained state information s, to-be-trained action information a, to-be-trained reward information r and to-be-trained next-moment state information s'. That is, the quadruple data can be represented as (s, a, r, s').
After the electronic equipment acquires the combat data information, the combat data information can be correspondingly processed, and the data information to be trained required by model training can be accurately obtained.
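For illustration, the quadruple and the data pool mentioned in this description could be represented as follows; the Transition class, its field names and the pool capacity are assumptions, not structures defined by the patent:

```python
from collections import deque
from dataclasses import dataclass
from typing import Any

@dataclass
class Transition:
    """One quadruple of to-be-trained data information."""
    state: Any        # s  : to-be-trained state information
    action: Any       # a  : to-be-trained action information
    reward: float     # r  : to-be-trained reward information
    next_state: Any   # s' : to-be-trained next-moment state information

# A bounded data pool holding collected quadruples; the capacity is an arbitrary choice.
data_pool = deque(maxlen=100_000)

data_pool.append(Transition(state=[0.0, 0.0], action=[1.0], reward=0.0, next_state=[0.1, 0.0]))
```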
Optionally, determining the to-be-trained data information from the combat data information may include the electronic device denoising the combat data information to obtain the to-be-trained data information.
The combat data information acquired by the electronic device may contain noise, which is unwanted during model training; if such data were used directly, the finally trained defense strategy model could easily be inaccurate. The electronic device therefore denoises the combat data information after acquiring it, so as to obtain more accurate combat data information, namely the to-be-trained data information, and thus a more accurate defense strategy model later.
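The patent does not specify a particular denoising method; as one hedged example, a simple z-score outlier filter over scalar measurements could look like the sketch below (the function name and threshold are illustrative assumptions):

```python
import statistics

def denoise(samples, z_thresh=3.0):
    """Drop obvious outliers from a sequence of scalar measurements.

    This z-score filter is only one possible denoising choice, shown for
    illustration; the patent does not fix a method.
    """
    if len(samples) < 2:
        return list(samples)
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples) or 1.0   # guard against zero variance
    return [x for x in samples if abs(x - mean) / std <= z_thresh]

print(denoise([10.1, 10.3, 9.9, 10.0, 55.0], z_thresh=1.5))   # the 55.0 spike is dropped
```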
In some embodiments, the to-be-trained state information is obtained from the first related state information and the second related state information, and may include a first real-time position, first movement speed, first movement direction, first combat state and first damage condition corresponding to the own combat entity, a second real-time position, second movement speed, second movement direction, second combat state and second damage condition corresponding to the enemy combat entity, as well as an ammunition reserve, an energy cruising condition and the like.
After the electronic equipment acquires the first relevant state information and the second relevant state information, denoising processing can be carried out on the first relevant state information and the second relevant state information, and accurate state information to be trained is obtained.
In some embodiments, the action information to be trained is obtained according to the action information, and the action information to be trained can comprise a position moving anchor point corresponding to the own combat entity, an interception action of the own combat entity when defending against a missile sent by the enemy combat entity, and the like.
The interception action can comprise a movement speed, a movement direction and the like corresponding to the interception missile sent by the own combat entity.
After the electronic equipment acquires the action information corresponding to the own combat entity, the action information can be subjected to denoising processing, so that more accurate action information to be trained is obtained.
In some embodiments, the reward information to be trained may include combat damage information, intercept missile information, and the like.
The to-be-trained reward information may also cover missiles that are successfully evaded rather than intercepted.
The intercepted missile information may include whether the missile was successfully intercepted, the number of successfully intercepted missiles, the number of unsuccessfully intercepted missiles, and the like.
In some embodiments, the bonus information to be trained may be derived based on a first formula.
The first formula is r = r1 − r2, with t > 0;
where r denotes the to-be-trained reward information; r1 = Value_t − Value_{t+1} denotes the combat loss information, namely the change in the own combat entity's life value between time t and time t+1, with Value_t denoting the own combat entity's first life value at time t and Value_{t+1} its second life value at time t+1; r2 = q × n_t denotes the interception missile information, namely the life value expended when the own combat entity successfully intercepts missiles, with q denoting a preset coefficient and n_t denoting the number of missiles the own combat entity successfully intercepts or successfully evades.
It should be noted that the larger the to-be-trained reward information r, the larger the loss corresponding to the enemy combat entity; based on the first formula, the electronic device therefore needs to make the combat loss information r1 as large as possible and the interception life value r2 as small as possible in order to obtain a larger to-be-trained reward r.
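A minimal sketch of the reward computation from the first formula, with purely illustrative example values:

```python
def compute_reward(value_t, value_t_plus_1, q, n_t):
    """r = r1 - r2, following the first formula above.

    value_t, value_t_plus_1 : own combat entity's life value at times t and t+1
    q                       : preset coefficient
    n_t                     : missiles successfully intercepted or evaded
    """
    r1 = value_t - value_t_plus_1   # combat loss information (life value change)
    r2 = q * n_t                    # life value spent on successful interceptions
    return r1 - r2

# Example values (illustrative only): life value drops from 100 to 85, q = 2, n_t = 3
print(compute_reward(100.0, 85.0, 2.0, 3))   # 15.0 - 6.0 = 9.0
```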
Optionally, when the combat data includes environmental information, the electronic device may preprocess the environmental information to obtain a combat factor: if the environmental information satisfies a preset condition, the combat factor is set to 1, indicating that the own combat entity is able to defend against the enemy combat entity; otherwise it is set to 0, indicating that the own combat entity is not able to defend against the enemy combat entity.
The environmental information satisfies the preset condition when: the wind speed information lies within a preset wind speed range, if the environmental information includes wind speed information; the temperature information lies within a preset temperature range, if it includes temperature information; and the humidity information lies within a preset humidity range, if it includes humidity information.
The preset wind speed range is an interval formed by a first wind speed threshold and a second wind speed threshold, the first being smaller than the second; the preset temperature range is an interval formed by a first temperature threshold and a second temperature threshold, the first being smaller than the second; and the preset humidity range is an interval formed by a first humidity threshold and a second humidity threshold, the first being smaller than the second.
Optionally, the preset wind speed range, the preset temperature range and the preset humidity range may be set before the electronic device leaves the factory, or may be user-defined, which is not specifically limited herein.
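A minimal sketch of deriving the combat factor from the environmental information, assuming the readings arrive as a dictionary; the default ranges are placeholders rather than values taken from the patent:

```python
def combat_factor(env, wind_range=(0.0, 15.0), temp_range=(-10.0, 40.0), humidity_range=(10.0, 90.0)):
    """Return 1 when every available environmental reading lies in its preset range, else 0.

    env is assumed to be a dict that may contain 'wind_speed', 'temperature' and
    'humidity'; missing readings are simply not checked.
    """
    ranges = {"wind_speed": wind_range, "temperature": temp_range, "humidity": humidity_range}
    for key, (low, high) in ranges.items():
        if key in env and not (low <= env[key] <= high):
            return 0
    return 1

print(combat_factor({"wind_speed": 6.0, "temperature": 21.0}))   # 1: both readings in range
```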
In addition, when the combat entity comprises a ship, the electronic device can also sense the enemy combat entity with additional accuracy so as to avoid accidentally hitting civilian ships in the vicinity.
Optionally, after step 102, the method may further include the electronic device storing the acquired plurality of data information to be trained in a data pool.
103. Input the to-be-trained data information into the defense strategy model to obtain the target defense strategy output by the defense strategy model.
The defense strategy model is trained based on a historical to-be-trained data information set in the offline reinforcement learning process; it has good generalization and robustness and can cope with more complex adversarial environments.
Alternatively, the historical data information set to be trained may include a plurality of historical data information to be trained.
The electronic equipment can directly input the acquired data information to be trained into the defense strategy model so as to accurately obtain the target defense strategy output by the defense strategy model.
In some embodiments, the defense strategy model is obtained as follows: in the offline reinforcement learning process, the electronic device obtains the historical to-be-trained data information set, the to-be-trained defense strategy model, an initialized Q-network and a target Q-network; the electronic device determines a first behavior value function corresponding to the initialized Q-network and a second behavior value function corresponding to the target Q-network from the historical to-be-trained data information set; and the electronic device updates the to-be-trained defense strategy model according to the first and second behavior value functions to obtain the trained defense strategy model, in which the action information is greater than a preset action threshold.
Optionally, the preset action threshold may be set before the electronic device leaves the factory, or may be customized by the electronic device, which is not specifically limited herein.
Optionally, the to-be-trained defense strategy model may further include a test defense strategy, where the test defense strategy may be tested based on the fight process data set, or may be used as an auxiliary strategy to perform an effect test when the fight entity performs training.
In the offline reinforcement learning process, the electronic device may determine the first behavior value function corresponding to the initialized Q-network and the second behavior value function corresponding to the target Q-network based on the acquired historical to-be-trained data information set, and then update the acquired to-be-trained defense strategy model according to these two behavior value functions to obtain the trained defense strategy model.
The action information in the defense strategy model is as follows:
a′ = argmax Q(s′, max_a G(a|s′; w); θ);
where a′ denotes the action information at the next moment in the defense strategy model, G(a|s′; w) denotes the defense strategy model at the current moment, s′ denotes the state information at the next moment in the defense strategy model, w denotes the network parameters of the defense strategy model, a denotes the action information in the initialized Q-network, and θ denotes the network parameters of the initialized Q-network.
Alternatively, the preset action threshold may be represented by τQ(s′, a′; θ), where τ denotes a hyperparameter.
Alternatively, the target Q network may be represented by Q (s ', a'; θ).
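In the spirit of the expression above (a batch-constrained style of action selection), a hedged sketch might sample candidate actions from G(·|s′; w) and keep the one with the highest Q-value; q_net, generator and num_candidates are assumptions, not objects named by the patent:

```python
import torch

@torch.no_grad()
def select_action(q_net, generator, next_state, num_candidates=10):
    """Pick a' = argmax_a Q(s', a; theta) over candidate actions drawn from
    the generative policy G(.|s'; w)."""
    states = next_state.unsqueeze(0).repeat(num_candidates, 1)   # (N, state_dim)
    candidates = generator(states)                               # (N, action_dim), a ~ G(.|s'; w)
    q_values = q_net(states, candidates).squeeze(-1)             # (N,)
    return candidates[q_values.argmax()]
```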
In some embodiments, determining the first behavior value function corresponding to the initialized Q-network and the second behavior value function corresponding to the target Q-network from the historical to-be-trained data information set may include: the electronic device determining, from the historical to-be-trained data information set, historical to-be-trained data information samples corresponding to combat entity samples, and determining the first and second behavior value functions from those samples.
Here the first behavior value function can be represented by Q(s, a; w), the second behavior value function by Q(s′, a′; θ⁻), and θ⁻ denotes the network parameters corresponding to the target Q-network.
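A small sketch of drawing historical to-be-trained samples from the data information set; the batch size and the use of uniform random sampling are assumptions for illustration:

```python
import random

def sample_batch(data_pool, batch_size=64):
    """Draw a random minibatch of historical to-be-trained quadruples."""
    pool = list(data_pool)
    return random.sample(pool, k=min(batch_size, len(pool)))
```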
In some embodiments, updating the to-be-trained defense strategy model according to the first and second behavior value functions to obtain the trained defense strategy model includes: the electronic device determining a loss function from the first and second behavior value functions, and updating the network parameters of the to-be-trained defense strategy model according to the loss function to obtain the trained defense strategy model.
Here the loss function is L(θ) = l_k( r + γQ(s′, a′; θ⁻) − Q(s, a; w) );
where L(θ) denotes the loss function, l_k denotes the smooth (Huber) loss, which makes the loss insensitive to outliers, and γ denotes a preset parameter.
The electronic device updates the network parameter w of the to-be-trained defense strategy model according to the loss function, which can be represented as w ← argmin_w −log G(a|s; w).
Optionally, the electronic device may update the network parameter θ⁻ corresponding to the target Q-network using the loss function, which can be represented as θ ← argmin_θ L(θ), θ⁻ ← θ.
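A hedged PyTorch-style sketch of one update step with the Huber loss above, together with the periodic copy into the target Q-network; the network classes, optimizer and γ value are assumptions rather than details fixed by the patent:

```python
import torch
import torch.nn.functional as F

def q_update_step(q_net, target_q_net, optimizer, batch, gamma=0.99):
    """One Q-network update using the smooth (Huber) loss described above.

    batch is assumed to hold tensors (s, a, r, s_next, a_next).
    """
    s, a, r, s_next, a_next = batch
    with torch.no_grad():
        target = r + gamma * target_q_net(s_next, a_next).squeeze(-1)   # r + gamma * Q(s', a'; theta^-)
    current = q_net(s, a).squeeze(-1)                                    # Q(s, a; w)
    loss = F.smooth_l1_loss(current, target)                             # Huber loss l_k, robust to outliers
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(q_net, target_q_net):
    """Copy the updated parameters into the target Q-network (theta^- <- theta)."""
    target_q_net.load_state_dict(q_net.state_dict())
```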
The defense strategy model may be a hybrid of a variational autoencoder (VAE) model and a Nouveau VAE (NVAE) model.
The network parameters w of the defense strategy model are obtained accordingly, so that the generalization of the defense strategy model can be effectively improved, where:
D_KL(·) denotes the divergence information; ω₂ denotes a first preset parameter of the defense strategy model; a ~ N(μ, σ) denotes that a obeys a Gaussian distribution, where μ denotes a first parameter, which may be called the position parameter, and σ denotes a second parameter, which may be called the shape parameter; and a second action evaluated on the basis of the divergence information is also represented.
where μ, σ = E_ω1(s, a), and ω₁ denotes the second preset parameter of the defense strategy model.
The parameters of the perturbation model corresponding to the defense strategy model are updated as follows:
φ denotes a model parameter of the perturbation model; θ₁ denotes a first preset parameter corresponding to the perturbation model; ξ_φ denotes a second preset parameter corresponding to the perturbation model; a defense strategy model with network parameters w₁ and a defense strategy model with network parameters w₂ are also represented.
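Since the original update expressions are only partially reproduced here, the following is a heavily hedged sketch of a generic VAE-style action generator with a reconstruction-plus-KL-divergence loss, which is the general family the VAE/NVAE hybrid above belongs to; all layer sizes and names are assumptions, and the perturbation model ξ_φ is omitted for brevity:

```python
import torch
import torch.nn as nn

class ActionVAE(nn.Module):
    """Minimal VAE-style action generator G(a|s; w)."""
    def __init__(self, state_dim, action_dim, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)         # position parameter mu
        self.log_sigma = nn.Linear(64, latent_dim)  # shape parameter sigma (log scale)
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 64), nn.ReLU(), nn.Linear(64, action_dim)
        )

    def forward(self, s, a):
        h = self.encoder(torch.cat([s, a], dim=-1))
        mu, log_sigma = self.mu(h), self.log_sigma(h)
        z = mu + log_sigma.exp() * torch.randn_like(mu)   # latent sample, z ~ N(mu, sigma)
        a_hat = self.decoder(torch.cat([s, z], dim=-1))
        return a_hat, mu, log_sigma

def vae_loss(a, a_hat, mu, log_sigma):
    """Reconstruction error plus D_KL(N(mu, sigma) || N(0, 1))."""
    recon = ((a - a_hat) ** 2).mean()
    kl = (-0.5 * (1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp())).mean()
    return recon + kl
```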
Optionally, in the process of updating the network parameters θ⁻ corresponding to the target Q-network, the electronic device may replace the strategy of randomly selecting action information in the initialized Q-network (the random strategy for short) with a model pre-trained with the random strategy.
Specifically, based on the state information at the current moment, the electronic device predicts, through an evaluation network (eval net), the Q-values of the different actions corresponding to that state, which may be denoted Q(t); it then determines, by a greedy algorithm, the action corresponding to the largest of these Q-values and performs the corresponding state transition; next, based on the state information at the next moment, it calculates the Q-value at the next moment through the target network (target net), denoted Q(t+1); finally, it determines the loss function from Q(t) and Q(t+1) and updates the network parameters θ⁻ of the target Q-network based on that loss function.
Optionally, the electronic device may periodically update network parameters corresponding to the defensive strategy model to be trained, to obtain a trained defensive strategy model.
Optionally, the training period may be set before the electronic device leaves the factory, or may be user-defined, which is not specifically limited herein.
Exemplarily, FIG. 2 is a schematic diagram of the framework of the decision algorithm corresponding to the defense strategy model provided by the invention. In FIG. 2, the electronic device acquires combat data information from the environment in which the combat entities are located and determines the to-be-trained data information from it; it stores the to-be-trained data information in the data pool; the data processing module then extracts the to-be-trained data information from the data pool, and the offline reinforcement learning strategy module performs model training on it in the offline reinforcement learning process; the electronic device then obtains the strategy algorithm from the model training process and updates that strategy algorithm in the virtual combat system, so that the virtual combat system can perform interactive learning with the online reinforcement learning strategy module based on the strategy algorithm.
104. Control the own combat entity to defend against the enemy combat entity according to the target defense strategy.
After the electronic device obtains the target defense strategy output by the defense strategy model, it can guide the own combat entity according to the target defense strategy so as to effectively control the own combat entity to defend accurately against the enemy combat entity.
Optionally, after step 104, the method may further include the electronic device outputting a defensive result.
The defending result is target rewarding information corresponding to the target defending strategy.
Optionally, the electronic device outputs the defending result, which may include, but is not limited to, at least one implementation of:
the realization mode 1 is that the electronic equipment displays the defending result in a text form.
Optionally, the text form may be set before the electronic device leaves the factory, or may be user-defined, which is not specifically limited herein.
The electronic equipment displays the defending result on the display screen in the form of characters.
And 2, the electronic equipment broadcasts the defending result in a voice mode.
Optionally, the voice form may be set before the electronic device leaves the factory, or may be input by the user, which is not limited herein.
The electronic equipment broadcasts the defending result in a voice mode by using the loudspeaker.
And 3, the electronic equipment sends the defending result to the associated equipment so that the associated equipment outputs the defending result.
Optionally, the associated device can comprise a computer, a mobile terminal, a wearable device and the like.
Alternatively, the electronic device and the associated device may be connected by a wireless communication technology.
It should be noted that, in any implementation manner, the user can timely and intuitively learn the defending result of the own combat entity.
In the embodiment of the invention, under the condition that the attack of the opponent combat entity on the own combat entity is detected, combat data information is acquired, data information to be trained is determined according to the combat data information, the data information to be trained is input into a defense strategy model to obtain a target defense strategy output by the defense strategy model, and the own combat entity is controlled to defend the opponent combat entity according to the target defense strategy. According to the method, under a complex and changeable environment, a target defense strategy is obtained based on a defense strategy model with good generalization and robustness according to the acquired combat data information, and the host combat entity is controlled to accurately defend the enemy combat entity according to the target defense strategy, so that the defense performance of the host combat entity is effectively improved.
Exemplarily, FIG. 3 is a schematic view of the defense method of a combat entity provided by the invention. In FIG. 3, in the process of the electronic device controlling the own combat entity to defend against the enemy combat entity, the data collection module may acquire combat data information when the enemy combat entity is detected attacking the own combat entity, the data preprocessing module may determine the to-be-trained data information from the combat data information, the data pool may store the combat data information or the to-be-trained data information, and the decision module may input the to-be-trained data information into the defense strategy model to obtain the target defense strategy output by the model and control the own combat entity to defend against the enemy combat entity according to that strategy.
Exemplary, as shown in fig. 4, a schematic structural diagram of the data collection module provided by the present invention is shown. In fig. 4, the data collection module may include an environmental information collection unit, a combat entity information collection unit, a situational awareness data collection unit, and a decision instruction collection unit.
The environment information collecting unit is used for collecting environment information.
And the combat entity information collection unit is used for collecting the type of the combat entity, whether the combat entity is an enemy combat entity or a host combat entity.
And the situation awareness data collection unit is used for collecting relevant state information.
And the decision instruction collection unit is used for collecting the defense strategies.
The data preprocessing module is used for processing environment information, types of combat entities, relevant state information and defense strategies.
Exemplary, as shown in fig. 5, a schematic structural diagram of the decision module provided by the present invention is shown. In fig. 5, the decision module may include a decision generation unit and a decision validation unit, and the decision generation unit may include a decision training unit and a decision testing unit.
The decision generation unit is used for training a defense strategy model in the offline reinforcement learning process.
The decision validation unit is used for obtaining a target defense strategy based on the trained defense strategy model.
Exemplary, as shown in fig. 6, a schematic structural diagram of a sensing module provided by the present invention is shown. In FIG. 6, the sensing module may include a radar unit, a camera unit, a weather sensing unit, and other sensing units.
The sensing module may be provided in the data collection module as shown in fig. 4.
And the radar unit is used for acquiring radar data corresponding to the combat entity.
And the camera unit is used for acquiring images corresponding to the combat entity.
And the weather sensing unit is used for collecting wind speed information, temperature information, humidity information and the like in the environment information.
The defending device of the combat entity provided by the invention is described below, and the defending device of the combat entity described below and the defending method of the combat entity described above can be correspondingly referred to each other.
As shown in fig. 7, a schematic structural diagram of a defensive apparatus of a combat entity according to the present invention may include:
the data collection module 701 is configured to acquire combat data information when it is detected that an enemy combat entity attacks the own combat entity, where the combat data information includes first related state information corresponding to the own combat entity, second related state information corresponding to the enemy combat entity, action information corresponding to the own combat entity, and combat loss information;
the data preprocessing module 702 is configured to determine to-be-trained data information according to the combat data information, where the to-be-trained data information includes to-be-trained state information, to-be-trained action information, to-be-trained reward information, and to-be-trained next-time state information;
the decision module 703 is configured to input the data information to be trained into a defense strategy model, obtain a target defense strategy output by the defense strategy model, where the defense strategy model is obtained by training based on a historical data information set to be trained in an offline reinforcement learning process, and control the own combat entity to defend the enemy combat entity according to the target defense strategy.
Optionally, the to-be-trained state information is obtained from the first related state information and the second related state information and includes a first real-time position, first movement speed, first movement direction, first combat state and first damage condition corresponding to the own combat entity, a second real-time position, second movement speed, second movement direction, second combat state and second damage condition corresponding to the enemy combat entity, as well as an ammunition reserve and an energy cruising condition; the to-be-trained action information is obtained from the action information and includes a position-movement anchor point corresponding to the own combat entity and the interception action of the own combat entity when defending against a missile launched by the enemy combat entity; and the to-be-trained reward information includes the combat loss information and interception missile information.
Optionally, the to-be-trained reward information is obtained based on a first formula r = r1 − r2, where r denotes the to-be-trained reward information; r1 = Value_t − Value_{t+1} denotes the combat loss information, namely the change in the own combat entity's life value between time t and time t+1, with Value_t denoting the own combat entity's first life value at time t and Value_{t+1} its second life value at time t+1; r2 = q × n_t denotes the interception missile information, namely the life value expended when the own combat entity successfully intercepts missiles, with q denoting a preset coefficient and n_t denoting the number of missiles the own combat entity successfully intercepts or successfully evades.
Optionally, the decision module 703 is specifically configured to: obtain, in the offline reinforcement learning process, the historical to-be-trained data information set, the to-be-trained defense strategy model, the initialized Q-network and the target Q-network, where the historical to-be-trained data information set includes a plurality of historical to-be-trained data information; determine a first behavior value function corresponding to the initialized Q-network and a second behavior value function corresponding to the target Q-network from the historical to-be-trained data information set; and update the to-be-trained defense strategy model according to the first and second behavior value functions to obtain the trained defense strategy model, in which the action information is greater than a preset action threshold.
Optionally, the decision module 703 is specifically configured to determine a historical to-be-trained data information sample corresponding to the combat entity sample from the historical to-be-trained data information set, and determine a first behavior value function corresponding to the initialized Q network and a second behavior value function corresponding to the target Q network according to the historical to-be-trained data information sample.
Optionally, the decision module 703 is specifically configured to determine a loss function according to the first behavior value function and the second behavior value function, and update network parameters corresponding to the to-be-trained defense strategy model according to the loss function, so as to obtain the trained defense strategy model.
As shown in FIG. 8, the electronic device provided by the invention may include a processor 810, a communications interface 820, a memory 830 and a communication bus 840, where the processor 810, the communications interface 820 and the memory 830 communicate with one another through the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the defense method of a combat entity, the method including: acquiring combat data information when an enemy combat entity is detected attacking the own combat entity, the combat data information including first related state information corresponding to the own combat entity, second related state information corresponding to the enemy combat entity, action information corresponding to the own combat entity, and combat loss information; determining to-be-trained data information from the combat data information; inputting the to-be-trained data information into a defense strategy model to obtain a target defense strategy output by the model, the defense strategy model being trained based on a historical to-be-trained data information set in an offline reinforcement learning process; and controlling the own combat entity to defend against the enemy combat entity according to the target defense strategy.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
In another aspect, the invention further provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can perform the defense method of a combat entity provided by the above methods, the method including: acquiring combat data information when an enemy combat entity is detected attacking the own combat entity, the combat data information including first related state information corresponding to the own combat entity, second related state information corresponding to the enemy combat entity, action information corresponding to the own combat entity, and combat loss information; determining to-be-trained data information from the combat data information, the to-be-trained data information including to-be-trained state information, to-be-trained action information, to-be-trained reward information and to-be-trained next-moment state information; inputting the to-be-trained data information into a defense strategy model to obtain a target defense strategy output by the model, the defense strategy model being trained based on a historical to-be-trained data information set in an offline reinforcement learning process; and controlling the own combat entity to defend against the enemy combat entity according to the target defense strategy.
In still another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. The computer program, when executed by a processor, implements the defending method of a combat entity provided by the above methods, the method including: acquiring combat data information in the case that an enemy combat entity is detected to attack the own combat entity, wherein the combat data information includes first related state information corresponding to the own combat entity, second related state information corresponding to the enemy combat entity, action information corresponding to the own combat entity, and combat loss information; determining data information to be trained, the data information to be trained including state information to be trained, action information to be trained, reward information to be trained and state information to be trained at the next moment; inputting the data information to be trained into a defense strategy model to obtain a target defense strategy output by the defense strategy model, wherein the defense strategy model is obtained by training based on a historical data information set to be trained in an offline reinforcement learning process; and controlling the own combat entity to defend against the enemy combat entity according to the target defense strategy.
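As a reading aid (not part of the claimed invention), the following minimal Python sketch shows one way the inference-time data flow described in the above paragraphs could be organized: assemble the two sets of related state information into an observation, query the trained defense strategy model, and execute the returned defensive action. All names here (build_state_vector, select_action, controller.execute and the field names) are hypothetical illustrations, not APIs disclosed in the patent.

```python
import numpy as np


def build_state_vector(own_state: dict, enemy_state: dict) -> np.ndarray:
    """Concatenate the first and second related state information into one
    observation vector (all field names are illustrative)."""
    return np.array(
        [*own_state["position"], own_state["speed"], own_state["heading"],
         own_state["hp"], own_state["ammo"], own_state["energy"],
         *enemy_state["position"], enemy_state["speed"], enemy_state["heading"],
         enemy_state["hp"]],
        dtype=np.float32,
    )


def defend_step(policy_model, own_state: dict, enemy_state: dict, controller):
    """One decision step: observe, query the trained defense strategy model,
    and issue the chosen defensive action (move to an anchor point or intercept)."""
    observation = build_state_vector(own_state, enemy_state)
    action = policy_model.select_action(observation)  # hypothetical inference call
    controller.execute(action)                        # hypothetical actuation call
    return action
```

In such a sketch the heavy lifting happens offline: the policy model is trained beforehand from logged historical data, and the online loop only performs inference and actuation.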
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the above-mentioned embodiments, those skilled in the art should understand that the technical solutions described in the above-mentioned embodiments may still be modified, or some technical features thereof may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (8)
1. A method of defending a combat entity, comprising:
under the condition that an enemy combat entity is detected to attack the own combat entity, acquiring combat data information, wherein the combat data information comprises first related state information corresponding to the own combat entity, second related state information corresponding to the enemy combat entity, action information corresponding to the own combat entity and combat loss information;
determining data information to be trained according to the combat data information, wherein the data information to be trained comprises state information to be trained, action information to be trained, reward information to be trained and state information to be trained at the next moment;
Inputting the data information to be trained into a defense strategy model to obtain a target defense strategy output by the defense strategy model, wherein the defense strategy model is obtained by training based on a historical data information set to be trained in an offline reinforcement learning process;
controlling the own combat entity to defend against the enemy combat entity according to the target defense strategy;
the state information to be trained is obtained according to the first related state information and the second related state information, and comprises a first real-time position, a first movement speed, a first movement direction, a first combat state and a first damage condition, a second real-time position, a second movement speed, a second movement direction, a second combat state and a second damage condition, as well as an ammunition reserve and an energy endurance condition, corresponding to the own combat entity and the enemy combat entity;
the action information to be trained is obtained according to the action information, and comprises a position-moving anchor point corresponding to the own combat entity and an interception action of the own combat entity when defending against a missile launched by the enemy combat entity;
the reward information to be trained comprises the combat loss information and the intercepted missile information;
The reward information to be trained is obtained based on a first formula, wherein the first formula is r = r₁ - r₂;
r represents the reward information to be trained; r₁ = Valueₜ - Valueₜ₊₁ represents the combat loss information, the combat loss information being the variation of the life value of the own combat entity from time t to time t+1, where Valueₜ represents the first life value of the own combat entity at time t and Valueₜ₊₁ represents the second life value of the own combat entity at time t+1; r₂ = q × nₜ represents the intercepted missile information, the intercepted missile information being the life value consumed when the own combat entity successfully intercepts missiles, where q represents a preset coefficient and nₜ represents the number of missiles successfully intercepted or successfully avoided by the own combat entity.
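To make the first formula concrete, the short sketch below computes the reward for one transition exactly as r = r₁ - r₂ with r₁ = Valueₜ - Valueₜ₊₁ and r₂ = q × nₜ, and packs it into a (state, action, reward, next-state) tuple of the kind used as data information to be trained. The tuple layout and the example numbers are assumptions for illustration only, not values disclosed in the patent.

```python
from typing import NamedTuple

import numpy as np


class Transition(NamedTuple):
    state: np.ndarray       # state information to be trained (time t)
    action: np.ndarray      # action information to be trained (time t)
    reward: float           # reward information to be trained
    next_state: np.ndarray  # state information to be trained at the next moment (t+1)


def compute_reward(value_t: float, value_t_plus_1: float, q: float, n_t: int) -> float:
    """First formula: r = r1 - r2, where r1 = Value_t - Value_{t+1} is the life-value
    change of the own combat entity and r2 = q * n_t is the life value spent on the
    n_t missiles successfully intercepted or avoided."""
    r1 = value_t - value_t_plus_1
    r2 = q * n_t
    return r1 - r2


# Example with made-up numbers: life value 120 -> 95, preset coefficient 2.0, 3 missiles intercepted.
reward = compute_reward(value_t=120.0, value_t_plus_1=95.0, q=2.0, n_t=3)  # (120 - 95) - 2.0 * 3 = 19.0
```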
2. The method of claim 1, wherein the defense strategy model is obtained by training based on the following steps:
In the process of offline reinforcement learning, acquiring the historical data information set to be trained, a defense strategy model to be trained, an initialized Q network and a target Q network, wherein the historical data information set to be trained comprises a plurality of pieces of historical data information to be trained;
according to the historical data information set to be trained, determining a first behavior value function corresponding to the initialized Q network and a second behavior value function corresponding to the target Q network;
Updating the to-be-trained defense strategy model according to the first behavior value function and the second behavior value function to obtain a trained defense strategy model, wherein action information in the defense strategy model is larger than a preset action threshold.
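A minimal PyTorch-style sketch of the step recited in this claim is given below: an initialized Q network and a frozen target Q network evaluate, respectively, a first behavior value function on sampled historical data and a second, bootstrapped behavior value function. The network architecture, discount factor and tensor shapes are assumptions, not details disclosed in the patent.

```python
import copy

import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """State-action value network; the layer sizes are illustrative only."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


# Initialized Q network and the target Q network (a frozen copy).
q_net = QNetwork(state_dim=12, action_dim=4)
target_q_net = copy.deepcopy(q_net)
for p in target_q_net.parameters():
    p.requires_grad_(False)


def behavior_values(batch: dict, policy, gamma: float = 0.99):
    """First behavior value function from the initialized Q network and second
    (bootstrapped) behavior value function from the target Q network."""
    q_first = q_net(batch["state"], batch["action"])
    with torch.no_grad():
        next_action = policy(batch["next_state"])  # action proposed by the defense strategy model
        q_second = batch["reward"] + gamma * target_q_net(batch["next_state"], next_action)
    return q_first, q_second
```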
3. The method of claim 2, wherein determining the first behavior value function corresponding to the initialized Q network and the second behavior value function corresponding to the target Q network according to the historical set of data to be trained comprises:
Determining a historical data information sample to be trained corresponding to the combat entity sample from the historical data information set to be trained;
And determining a first behavior value function corresponding to the initialized Q network and a second behavior value function corresponding to the target Q network according to the historical data information sample to be trained.
4. A method according to claim 2 or 3, wherein updating the defense strategy model to be trained according to the first behavior value function and the second behavior value function to obtain a trained defense strategy model comprises:
determining a loss function according to the first behavior value function and the second behavior value function;
And updating network parameters corresponding to the defense strategy model to be trained according to the loss function to obtain the trained defense strategy model.
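Continuing the sketch given after claim 2, one plausible realization of claims 3-4 is a temporal-difference loss between the first and second behavior value functions followed by a gradient update of the network parameters, with a slow soft update of the target Q network. The optimizer, learning rate and update rate below are assumptions, not details taken from the patent.

```python
import torch
import torch.nn.functional as F

# Reuses q_net, target_q_net and behavior_values from the sketch after claim 2.
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)  # assumed learning rate


def update_step(batch: dict, policy, tau: float = 0.005) -> float:
    """One training update: loss from the two behavior value functions,
    a gradient step on the network parameters, then a soft target update."""
    q_first, q_second = behavior_values(batch, policy)
    loss = F.mse_loss(q_first, q_second)  # temporal-difference loss between the two value functions

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Slowly move the target Q network toward the updated Q network (assumed scheme).
    with torch.no_grad():
        for p, tp in zip(q_net.parameters(), target_q_net.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
    return loss.item()
```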
5. A defensive apparatus of a combat entity, comprising:
The data collection module is used for acquiring combat data information under the condition that an enemy combat entity is detected to attack the own combat entity, wherein the combat data information comprises first related state information corresponding to the own combat entity, second related state information corresponding to the enemy combat entity, action information corresponding to the own combat entity and combat loss information;
The data preprocessing module is used for determining data information to be trained according to the combat data information, wherein the data information to be trained comprises state information to be trained, action information to be trained, reward information to be trained and state information to be trained at the next moment;
The decision module is used for inputting the data information to be trained into a defense strategy model to obtain a target defense strategy output by the defense strategy model, wherein the defense strategy model is obtained by training based on a historical data information set to be trained in the process of offline reinforcement learning;
the state information to be trained is obtained according to the first related state information and the second related state information, and comprises a first real-time position, a first movement speed, a first movement direction, a first combat state and a first damage condition, a second real-time position, a second movement speed, a second movement direction, a second combat state and a second damage condition, as well as an ammunition reserve and an energy endurance condition, corresponding to the own combat entity and the enemy combat entity;
the action information to be trained is obtained according to the action information, and comprises a position-moving anchor point corresponding to the own combat entity and an interception action of the own combat entity when defending against a missile launched by the enemy combat entity;
the reward information to be trained comprises the combat loss information and the intercepted missile information;
The reward information to be trained is obtained based on a first formula, wherein the first formula is r = r₁ - r₂;
r represents the reward information to be trained; r₁ = Valueₜ - Valueₜ₊₁ represents the combat loss information, the combat loss information being the variation of the life value of the own combat entity from time t to time t+1, where Valueₜ represents the first life value of the own combat entity at time t and Valueₜ₊₁ represents the second life value of the own combat entity at time t+1; r₂ = q × nₜ represents the intercepted missile information, the intercepted missile information being the life value consumed when the own combat entity successfully intercepts missiles, where q represents a preset coefficient and nₜ represents the number of missiles successfully intercepted or successfully avoided by the own combat entity.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method of defending a combat entity according to any one of claims 1 to 4.
7. A non-transitory computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the method of defending a combat entity according to any one of claims 1 to 4.
8. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of defending a combat entity according to any one of claims 1 to 4.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310217085.8A CN116956011B (en) | 2023-03-07 | 2023-03-07 | Combat entity defending method and device, electronic equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116956011A (en) | 2023-10-27 |
| CN116956011B (en) | 2025-08-15 |
Family
ID=88459180
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310217085.8A Active CN116956011B (en) | 2023-03-07 | 2023-03-07 | Combat entity defending method and device, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116956011B (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105999729A (en) * | 2016-05-06 | 2016-10-12 | 腾讯科技(深圳)有限公司 | An apparatus control system, method and device |
| CN108646589A (en) * | 2018-07-11 | 2018-10-12 | 北京晶品镜像科技有限公司 | A kind of battle simulation training system and method for the formation of attack unmanned plane |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111694365B (en) * | 2020-07-01 | 2021-04-20 | 武汉理工大学 | A Deep Reinforcement Learning Based Path Tracking Method for Unmanned Vessel Formation |
| CN112433856B (en) * | 2020-12-04 | 2025-03-11 | 中国科学技术大学 | Decentralized autonomous decision-making method for drone swarm networks |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116956011A (en) | 2023-10-27 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |