CN119009988A

CN119009988A - Power grid optimal scheduling method and related device

Info

Publication number: CN119009988A
Application number: CN202411108347.8A
Authority: CN
Inventors: 杨楠; 李昕航; 李立新; 黄宇鹏; 刘金波; 董亮; 刘幸蔚; 张喆钧; 宋旭日; 李蕾; 於益军; 万雄; 穆永铮; 李泽科; 王彦沣; 宋磊; 马晓忱; 范海威; 齐晓琳; 叶希
Original assignee: China Electric Power Research Institute Co Ltd CEPRI; State Grid Fujian Electric Power Co Ltd; State Grid Jibei Electric Power Co Ltd; State Grid Sichuan Electric Power Co Ltd; State Grid Corp of China SGCC
Current assignee: China Electric Power Research Institute Co Ltd CEPRI; State Grid Fujian Electric Power Co Ltd; State Grid Jibei Electric Power Co Ltd; State Grid Sichuan Electric Power Co Ltd; State Grid Corp of China SGCC
Priority date: 2024-08-13
Filing date: 2024-08-13
Publication date: 2024-11-22

Abstract

The present invention belongs to the field of power grid dispatching, and discloses a power grid optimization dispatching method and related devices, including: obtaining the current operating state of the power grid; inputting the current operating state of the power grid into a pre-trained power grid optimization dispatching model based on multi-adversarial reinforcement learning to obtain the current operating state adjustment action of the power grid; wherein the power grid optimization dispatching model based on multi-adversarial reinforcement learning is obtained by modifying the target Q value of the critic network in the deep deterministic action gradient model during training to the sum of the immediate reward and the predicted value of the future potential reward of the power grid; wherein the predicted value of the future potential reward of the power grid is predicted by using a pre-trained generative multi-adversarial network model. The generative multi-adversarial network model is introduced to replace the target network to pre-perceive the operating state in all future decision moments, making up for the shortcoming that the evaluation network in traditional reinforcement learning only fits the evaluation of the next target benefit, and better adapting to the situation of power grid fluctuations.

Description

Power grid optimal scheduling method and related device

Technical Field

The invention belongs to the field of power grid dispatching, and relates to a power grid optimization dispatching method and a related device.

Background

As artificial intelligence continues to evolve, it has become a versatile technology leading the fourth industrial revolution, fully penetrating into various industries and continuously fusing. In the field of regulation and control, a great deal of researches and practices are developed by a plurality of scholars in aspects of scheduling operation auxiliary decision making, new energy prediction, safety and stability analysis and the like by applying artificial intelligence technology. Especially, an autonomous learning model represented by deep reinforcement learning has become a mainstream method for solving various optimization decision class problems. The scholars propose a distributed on-site voltage control method based on deep reinforcement learning, so that on-site distributed reactive voltage control of multiple intelligent agents independent of communication under the condition of insufficient measurement data is realized, voltage fluctuation is effectively restrained, and node voltage out-of-limit is prevented. The learner proposes a double-agent static safety prevention control action automatic generation method based on deep reinforcement learning. The structure of the PQ decoupling double intelligent agents for centralized training is designed, and the two intelligent agents respectively bear the tasks of active power adjustment and voltage adjustment of the generator, so that static safety prevention and control of the power system are realized. The method establishes a node multi-objective model of optimal dispatching action, solves the problem by adopting a flexible action evaluation algorithm, and realizes rapid calculation of the large-scale electric vehicle charging dispatching action.

In continuous time scale optimization scheduling, the deep reinforcement learning algorithm is widely applied to solve the complex scene scheduling problem compared with the model driving algorithm because of the characteristic of no need of accurate modeling and iterative solution. The scholars put forward a comprehensive energy system dynamic energy scheduling method based on a constraint reinforcement learning algorithm, so that the unit output is effectively constrained, and the system operation safety scheduling is realized. Aiming at the characteristic that the data scale of the power system is growing day by day, many researches improve the reinforcement learning algorithm in the aspects of training efficiency, stability and the like. The learner puts forward a multi-task deep reinforcement learning optimization scheduling method for fusing the power grid operation scene clusters, and the model is further called to quickly solve the real-time scheduling task by identifying the scene category of the limited operation data, so that the multi-task optimization scheduling under the random scene is realized. The learner adopts the distributed optimization scheduling idea based on multi-agent deep reinforcement learning to convert the centralized scheduling model into the distributed optimization problem among the multi-agents for solving, thereby improving the training speed and the application range.

However, the above-described existing reinforcement learning-based power grid dispatching method has the following disadvantages. First, there is a lack of fit estimation of the total revenue over the future scheduling period, and there is an insufficient ability to predict future operational trends. Especially, under the condition of high-proportion access of new energy, the uncertainty of power grid operation is increased, so that the dynamic change of the power grid is difficult to accurately capture by the traditional method. In optimizing scheduling in the day, scheduling actions are often formulated to take into account changes over a period of time in the future. However, the existing method often lacks in-depth consideration of this point, so that the scheduling action may not be optimal, the power grid resources cannot be fully utilized, and the economic benefit is maximized. In addition, the adaptability to complex scenes is not strong, various complex scenes such as new energy fluctuation, equipment faults, extreme weather and the like can be faced in the running process of the intelligent power grid, and when the complex scenes are processed by the existing scheduling method, effective coping actions are difficult to find, so that the running safety and stability of the power grid are threatened.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a power grid optimal scheduling method and a related device.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme:

The first aspect of the invention provides a power grid optimization scheduling method, which comprises the following steps: acquiring the current running state of the power grid; inputting the current running state of the power grid into a pre-trained power grid optimization scheduling model based on multi-countermeasure reinforcement learning to obtain the current running state adjustment action of the power grid; the grid optimization scheduling model based on multi-countermeasure reinforcement learning is obtained by modifying a target Q value of a criticism network in a depth deterministic action gradient model during training into a sum of instant rewards and a predicted value of potential rewards of a grid future; the future potential rewards predicted value of the power grid is obtained by predicting a pre-trained generating type multi-countermeasure network model based on the operation state of the power grid during training and the daily operation state adjustment action.

Optionally, the pre-trained generated multi-countermeasure network model is used for pre-training: the optimization objective Y ^* of the generator that generates the multi-countermeasure network model is:

Y^*＝arg min_Ymax_DV(Y)

wherein Y is a generator of the generated multi-countermeasure network model, D is a discriminator of the generated multi-countermeasure network model, and V (Y) is the divergence calculated by taking the generator as a variable.

Optionally, the V (Y):

Wherein λ is a parameter for adjusting the degree of fusion, ω _i is a degree of fusion weight of the I-th discriminator, I is the total number of discriminators, V' _i(D_i, Y) is a divergence calculated by the I-th discriminator and the generator;

Where R _t is the jackpot score from start to time t, z is a predefined simple distribution, which is z-P _z,Y(z,[s_t,a_t) is the divergence calculated using z and [ s _t,a_t ] as variables, s _t is the agent state space, a _t is the agent action space, E [. Cndot ] is desired, and R-P _R is the bonus score probability distribution.

Optionally, the pre-trained generated multi-countermeasure network model is used for pre-training: model parameters of the generated multi-countermeasure network model are updated in a random gradient descent mode.

Optionally, the pretraining grid optimization scheduling model based on multi-countermeasure reinforcement learning is that when pretraining: based on the preset historical experience extraction quantity, extracting good historical experiences from a good historical experience sub-pool of a historical experience pool in combination with a good historical experience extraction proportion, and extracting bad historic experiences from a bad historical experience sub-pool of the historical experience pool in combination with a bad historical experience extraction proportion; and pre-training a power grid optimization scheduling model of multi-countermeasure reinforcement learning according to the extracted good history experience and bad history experience.

Optionally, the method further comprises: based on a preset instant rewarding threshold value, dividing the historical experience with instant rewards larger than the instant rewarding threshold value in the historical experience pool into good historical experience, and dividing the historical experience with instant rewards not larger than the instant rewarding threshold value in the historical experience pool into bad historical experience, so as to obtain a good historical experience sub-pool and a bad historical experience sub-pool.

In a second aspect of the present invention, there is provided a power grid optimization scheduling system, including: the data acquisition module is used for acquiring the current running state of the power grid; the optimization scheduling module is used for inputting the current running state of the power grid into a pre-trained power grid optimization scheduling model based on multi-countermeasure reinforcement learning to obtain the current running state adjustment action of the power grid; the grid optimization scheduling model based on multi-countermeasure reinforcement learning is obtained by modifying a target Q value of a criticism network in a depth deterministic action gradient model during training into a sum of instant rewards and a predicted value of potential rewards of a grid future; the future potential rewards predicted value of the power grid is obtained by predicting a pre-trained generating type multi-countermeasure network model based on the operation state of the power grid during training and the daily operation state adjustment action.

In a third aspect of the present invention, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the grid optimization scheduling method described above when executing the computer program.

In a fourth aspect of the present invention, a computer readable storage medium is provided, the computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the grid optimization scheduling method described above.

Compared with the prior art, the invention has the following beneficial effects:

According to the power grid optimal scheduling method, based on the current running state of the power grid, the current running state adjusting action of the power grid is obtained through the pretrained power grid optimal scheduling model based on multi-countermeasure reinforcement learning, and the power grid optimal scheduling is achieved. The grid optimization scheduling model based on multi-countermeasure reinforcement learning is obtained by modifying a target Q value of a criticizing network in a depth deterministic action gradient model during training into a sum of instant rewards and grid future potential rewards predicted values, the grid future potential rewards predicted values are obtained by adopting a pre-trained generation type multi-countermeasure network model for prediction based on the running state and daily running state of the grid during training, the running state of all future decision moments is pre-perceived and used as a part of the target Q value of the criticizing network during training by introducing the generation type multi-countermeasure network model, and further the defect that the criticizing network only carries out evaluation fitting on the next target benefits in the traditional reinforcement learning is effectively overcome, the future running trend predicting capability of the grid optimization scheduling model based on multi-countermeasure reinforcement learning is improved, the full-time comprehensive consideration of grid scheduling is realized, and the grid fluctuation situation and complex scene can be well adapted.

Drawings

Fig. 1 is a flowchart of a power grid optimization scheduling method according to an embodiment of the present invention.

Fig. 2 is a grid optimization scheduling model architecture diagram based on multi-countermeasure reinforcement learning according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a hybrid empirical cross-drive mechanism according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of the new energy consumption rate of each step in the first scenario under the comparative experiment of the embodiment of the present invention.

FIG. 5 is a schematic diagram of the running cost of each step of the unit in the first scenario under the comparative experiment of the embodiment of the present invention.

FIG. 6 is a graph showing the cumulative prize value per step for scenario one in a comparative experiment according to an embodiment of the present invention.

Fig. 7 is a block diagram of a power grid optimization scheduling system according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The invention is described in further detail below with reference to the attached drawing figures:

Referring to fig. 1, in an embodiment of the present invention, a power grid optimization scheduling method is provided, which can effectively solve the problem of insufficient fitting estimation of total gains in a future scheduling time period in the existing power grid scheduling method, and the technical problem of weak ability of predicting future operation trends caused by the insufficient fitting estimation.

Specifically, the power grid optimization scheduling method comprises the following steps:

s1: the current running state of the power grid is obtained.

S2: and inputting the current running state of the power grid into a pre-trained power grid optimization scheduling model based on multi-countermeasure reinforcement learning to obtain the current running state adjustment action of the power grid.

The grid optimization scheduling model based on multi-countermeasure reinforcement learning is obtained by modifying a target Q value of a criticism network in a depth deterministic action gradient model during training into a sum of instant rewards and a predicted value of potential rewards of a grid future; the future potential rewards predicted value of the power grid is obtained by predicting a pre-trained generating type multi-countermeasure network model based on the operation state of the power grid during training and the daily operation state adjustment action.

Specifically, the grid optimization scheduling model based on multi-countermeasure reinforcement learning is based on a depth deterministic action gradient model DDPG and a generated multi-countermeasure network model GMAN, and an intelligent body is built in combination, so that the problem of intra-day optimization scheduling is solved. By adopting GMAN as a target network of the intelligent agent, the potential relation between the state-action of the power grid and the future potential rewards of the power grid is learned, and further the future running situation of the power grid is predicted.

Referring to fig. 2, the grid optimization scheduling model based on multi-countermeasure reinforcement learning mainly includes the following core components: simulation environment, agent, actor activator network, critic network, generator and multiple discriminants. The simulation environment represents the current running state of the power grid and comprises running state information of a generator set, a line, a load and the like; the intelligent agent makes decision action, namely daily operation state adjustment action, based on the current operation state of the power grid, wherein the decision action is generated and evaluated by an Actor network and a Critic network; the generator learns the distribution of potential rewards in the future of the power grid according to the historical data, and further generates the expectation of accumulated benefits in the future according to the current running state of the current power grid; and establishing a plurality of discriminators for game playing in GMAN, wherein the discriminators assist the generator in learning and converging, so that the generated prediction data are more and more lifelike. In the training process, DDPG updates the parameters of the Actor network and the Critic network on the basis of considering the future potential rewards predicted value of the power grid generated by GMAN, thereby enhancing the prospective of reinforcement learning decision and improving the convergence speed and stability of the network.

DDPG is a high-efficiency reinforcement learning method suitable for continuous motion space, which can significantly improve the decision quality in an offline learning environment by combining the advantages of motion gradient and value function. DDPG employ neural networks to approximate the representation of motion functions and cost functions, and training these neural networks using deep learning techniques to find optimal motion in a complex continuous motion space.

DDPG the training process follows the following steps: first, a main network including an Actor network and a Critic network, and respective Target networks Target Actor and TARGET CRITIC are initialized. The design of the target network aims at stabilizing the learning process and reducing the instability caused by the rapid change of the main network. Next, a simulation environment is initialized, resulting in an environmental state, i.e., the running state of the grid s _t, which enables DDPG to learn and test under near-real conditions. DDPG then performs an action based on the current action a _t＝μ(s_t|θ^μ) and obtains feedback from the environment, including the prize r _t, the new state s _t+1, and the end of round flag d. Thereafter, the data is stored as historical experience in a historical experience pool for use in subsequent learning and the environmental state is updated to s _t+1.

DDPG the network update procedure comprises the following key steps:

1) Historical experience pool sampling: m historical experiences M x (s _t,a_t,r_t,s_t+1, d) are extracted from the historical experience pool, and these include historical information and feedback of grid operation.

2) Prediction of the Actor network: s _t+1 is input into the Actor network to yield a' _t+1＝μ'(s_t+1|θ^μ').

3) Evaluation of Critic network: s _t+1 and a '_t+1 are entered into the Critic network to obtain the expected prize value q' _t+1＝Q'(s_t+1,a'_t+1|θ^Q').

4) Calculating a target Q value of the Critic network: target_q=r _t+γ(1-d)q'_t+1.

5) Updating Critic network: inputting s _t,a_t into Critic network, calculating q=q (s _t,a_t|θ^Q); and calculates a loss function (exemplified by Mean Square Error (MSE)) for the Critic network: gradient backhaul is performed by using loss _c to update Critic network.

6) Updating the Actor network: using the Actor network to make new action decisions a' _t＝μ(s_t|θ^μ), calculate an Actor network loss function: loss _a＝-Q(s_t,a'_t |θq); gradient backhaul is performed by using loss _a to update the Actor network.

In one possible implementation, the pre-trained generated multi-countermeasure network model is pre-trained:

The optimization objective Y ^* of the generator that generates the multi-countermeasure network model is:

Y^*＝argmin_Ymax_DV(Y)

Specifically, GMAN is employed to estimate future potential rewards and to guide the training process of DDPG. Meanwhile, GMAN plays back the training generator based on the historical data, so that learning of future potential rewards of the power grid in the decision period is achieved. Specifically, assuming the agent's state at time step t is s _t and the action taken is a _t, the cumulative prize earned from time t to the end of the round isCan be expressed as:

wherein, gamma is a discount factor, Is at the time ofThe instant rewards obtained. GMAN generator (noted Y, parameter θ _Y) aims to learn the jackpotAnd cause it to output a jackpot forecast Y (z, [ s _t,a_t ]); where z-P _z are predefined simple distributions from which samples can be easily drawn. Wherein Y (z, [ s _t,a_t ]) is used to enhance the training process of DDPG and makes the strategy of DDPG prospective. At the same time, the generated jackpot also effectively promotes convergence of reinforcement learning. Specifically, in the present embodiment, Y (z, [ s _t,a_t ]) is simultaneously set as a part of the target Q value of the Critic network of DDPG at the time of training, that is:

target_q＝r_t+Y(z,[s_t,a_t])

Thus, the optimization objective Y ^* of the GMAN generator can be expressed as:

Y^*＝argmin_YDIV(P_Y,P_R)

Wherein Y ^* is a generator optimization target; DIV (P _Y,P_R) is the divergence of (P _Y,P_R), Y-P _Y,R～P_R.

In this embodiment, the authenticity of the jackpot is assessed by a set of discriminators, each of which cooperates with the parameter update of the generator. In GMAN, the generator updates the parameters by performing a max-min game with I homogeneous discriminators. For discriminator D _i (parameter θ _Di), its optimization objective is to distinguish the data produced by generator Y from the original data:

The optimization objective Y ^* of the GMAN generator by the Jensen-Shannon divergence can be expressed as:

Y^*＝argmin_Ymax_DV(Y)

in one possible embodiment, the V (Y):

Wherein V (Y) is the divergence calculated by taking the generator as a variable; λ is a parameter that adjusts the degree of fusion, λ=0 corresponds to direct averaging, ω _i is the degree of fusion weight of the I-th discriminator, I is the total number of discriminators, V' _i(D_i, Y) is the divergence calculated by the I-th discriminator and the generator.

In particular, considering that instability of a single discriminator may hinder learning of a generator, the present embodiment adopts the classical pythagoras method as the fusion function F _soft to fuse the outputs of a plurality of discriminators for convergence stability:

In one possible implementation, the pre-trained generated multi-countermeasure network model is pre-trained: model parameters of the generated multi-countermeasure network model are updated in a random gradient descent mode.

Specifically, the gradient of discriminator D _i is calculatedAnd gradient J _Y of generator Y:

And update their parameters by random gradient descent. Therefore, through training by historical experience playback, the generator can generate a cumulative prize predicted value according to the current state and action of the intelligent agent, and the cumulative prize predicted value is a future potential prize predicted value of the power grid when the cumulative prize predicted value is used for power grid optimal scheduling.

In one possible implementation, the pre-trained grid optimization scheduling model based on multi-countermeasure reinforcement learning is when pre-trained: based on the preset historical experience extraction quantity, extracting good historical experiences from a good historical experience sub-pool of a historical experience pool in combination with a good historical experience extraction proportion, and extracting bad historic experiences from a bad historical experience sub-pool of the historical experience pool in combination with a bad historical experience extraction proportion; and pre-training a power grid optimization scheduling model of multi-countermeasure reinforcement learning according to the extracted good history experience and bad history experience.

Specifically, in order to improve the learning efficiency and adaptability of the power grid optimization scheduling model based on multi-countermeasure reinforcement learning, a hybrid experience cross driving mechanism is introduced in the embodiment. The hybrid experience cross driving mechanism improves the performance of the power grid optimal scheduling model based on multi-countermeasure reinforcement learning by carefully classifying and managing the historical experiences of the intelligent agent and the power grid environment so as to more effectively utilize the historical experiences.

Optionally, the power grid optimization scheduling method further includes: based on a preset instant rewarding threshold value, dividing the historical experience with instant rewards larger than the instant rewarding threshold value in the historical experience pool into good historical experience, and dividing the historical experience with instant rewards not larger than the instant rewarding threshold value in the historical experience pool into bad historical experience, so as to obtain a good historical experience sub-pool and a bad historical experience sub-pool.

Specifically, the historical experience generated by an agent based on a grid optimization scheduling model for multi-contrast reinforcement learning is finely categorized in each interaction with the grid environment. By introducing an instant rewards threshold, historical experience is divided into two categories: good history experience and bad history experience.

Where good historical experience refers to historical experience with an instant prize greater than an instant prize threshold, representing a decision to benefit the system. The good history experience is stored in a good history experience sub-pool, forming a sample set of history experiences that are beneficial to the task. Bad history experience: meaning that the instant prize is not greater than the instant prize threshold, dividing the historical experience into bad historical experiences may require avoidance of similar decisions. The bad history experience is stored in a pool of bad history experiences, forming a sample set of history experiences that are detrimental to the task.

Referring to fig. 3, the key to the hybrid empirical cross-drive mechanism is the empirical extraction. In this embodiment, the good history experience extraction proportion and the bad history experience extraction proportion are combined, the good history experience is extracted from the good history experience sub-pool, and the bad history experience is extracted from the bad history experience sub-pool, so that the extraction mechanism is helpful to balance the learning of the good history experience and the bad history experience, and the power grid optimization scheduling model based on multi-countermeasure reinforcement learning is more robust. The extracted historical experience is used for training a power grid optimization scheduling model based on multi-countermeasure reinforcement learning, and the flexible experience extraction mechanism effectively combines different types of historical experience, so that the power grid optimization scheduling model based on multi-countermeasure reinforcement learning is more adaptive. The hybrid experience cross-drive mechanism introduces a finer experience management strategy for the deep reinforcement learning power grid optimization scheduling algorithm. By distinguishing good historical experience from bad historical experience, the multi-countermeasure reinforcement learning-based power grid optimization scheduling model can more pertinently utilize the historical experience, so that the multi-countermeasure reinforcement learning-based power grid optimization scheduling model is more suitable for complex power grid scenes, an innovative historical experience playback strategy is provided for deep reinforcement learning power grid optimization scheduling, and deep theoretical support is provided for the research of reinforcement learning power grid optimization scheduling algorithm.

In one possible implementation, a grid optimization scheduling model training and decision flow based on multi-contrast reinforcement learning is specified.

Training process of power grid optimization scheduling model based on multi-countermeasure reinforcement learning:

1) Network initialization: (1) initializing an Actor network and a Critic network. And (2) initializing a generating type multi-countermeasure network model. (3) The simulation environment is initialized and an initial state s _t is acquired.

2) Action decision and execution: (1) Action a _t is determined based on current state s _t using an Actor network. (2) Performing action a _t in the simulation environment gathers new state s _t+1, instant prize r _t, and completion flag d.

3) Experience playback: (1) Will history experienceAnd storing in a history experience pool. (2) If the historical experience of the historical experience pool is sufficient, a batch of historical experience pools is randomly extracted from the historical experience pools for training.

4) And (3) updating a network: (1) A generated multi-countermeasure network model is used to estimate a grid future potential rewards prediction. (2) And calculating a target Q value according to the instant rewards r _t and the future potential rewards predicted value of the electric network after discount. (3) The error between the current q=q (s _t,a_t|θ^Q) and the target Q value target_q is calculated and the Critic network is updated to reduce this error. (4) The Actor network is updated with the goal of maximizing the expected rewards for Critic network evaluation.

5) Generating network training: (1) And calculating a cumulative prize R _t corresponding to each historical experience in the round. (2) State s _t and action a _t in the current historical experience are selected. (3) Random noise z is sampled from a certain distribution P _z. (4) Generating a future potential rewards forecast value of the power grid using the generator Y of the generated multi-countermeasure network model, the random noise z and the input s _t,a_t. (5) The generated future potential rewards forecast values of the power grid are evaluated using a plurality of discriminators D _i and the generator is updated to increase the authenticity of its generation. (6) The outputs of all discriminators are integrated to optimize the generator. (7) The discriminator is optimized to improve its ability to discriminate between the generated data and the original data.

6) Repeating the steps until reaching the preset termination condition.

Decision flow of power grid optimization scheduling model based on multi-countermeasure reinforcement learning:

1) And (3) state observation: and (3) observing the current running state of the power grid, namely acquiring the current output of each generator set in the power grid, the load demand and other environmental states.

2) And (3) action generation: based on the current environmental state, an action, i.e., an intra-day operating state adjustment action, is generated by an Actor network of a grid optimization scheduling model based on multi-countermeasure reinforcement learning, which involves adjusting the output of the generator set to meet load demands while maintaining grid stability.

3) And (3) action adjustment: the generated intra-day operating state adjustment actions may need to be adjusted according to the actual operating limits of the grid to ensure its actual feasibility.

In one possible implementation manner, in order to verify the effectiveness of the grid optimization scheduling model based on multi-countermeasure reinforcement learning, taking an SG-126 node network model as an example, an ablation experiment, a comparison experiment and a new energy uncertainty analysis experiment are performed. In this embodiment, the grid optimization scheduling model based on multi-countermeasure reinforcement learning is constructed by a neural network, and includes 1 generator, 2 discriminators, 1 Actor network and 1 Critic network. The input dimension of the generator is 1735, and the dimension is reduced to 1 through a plurality of full-connection layers and an activation function; the 2 discriminators are the same, the input dimension is 1, and the dimension is reduced to 1 after the dimension is increased to 128 through the multi-layer full-connection layer; the input dimension of the Actor network is 1735, and the dimension is reduced to 54 through a plurality of full connection layers, data normalization and activation functions; the input dimension of the Critic network is 1789, and the dimension is reduced to 1 through a plurality of full connection layers, data normalization and activation functions; to avoid gradient extinction and gradient explosion, leakyReLU activation functions are used in the generator and discriminator, and ELU activation functions are used in the Actor network and Critic network. The super-parameter settings of the grid optimization scheduling model based on multi-countermeasure reinforcement learning are shown in table 1.

TABLE 1

Super parameter	Value taking	Super parameter	Value taking
				Generator learning rate	1e-11	Batch size	64
Discriminator learning rate	1e-11	Prize value discount factor	0.2
				Actor network learning rate	5e-17	Historical experience pool capacity	1000000
Critic network learning rate	5e-15	Historical experience pool demarcation value	0.7

Regarding ablation experiments: aiming at the problem of optimizing and dispatching a high-proportion new energy power grid, two dispatching scenes are selected in an online decision and the effect of the power grid optimizing and dispatching model based on multi-countermeasure reinforcement learning is evaluated, each scene needs to continuously optimize and regulate the unit output of each generator in the power grid for 4 hours (5 min intervals) in the future according to the running state of the power grid, and the simulation result of the scene one under an ablation experiment is shown in table 2.

TABLE 2

As shown in table 2, where a represents DDPG, B represents GMAN, and C represents a hybrid empirical cross-drive mechanism. It can be seen that the model without the hybrid experience cross driving mechanism is finished 19 steps earlier than the power grid optimization scheduling model based on multi-countermeasure reinforcement learning, the safety score is reduced by 18.601%, and the rewarding value is also reduced by 19.272%. The result shows that the hybrid experience cross drive mechanism plays a key role in ensuring the stable operation of the power grid system. Compared with the power grid optimization scheduling model based on multi-countermeasure reinforcement learning, the model without GMAN is finished 34 steps earlier, the safety score is obviously reduced, and the amplitude reduction is up to 39.413%. In addition, the running cost of the unit is increased by 1.914%, and the rewarding value is reduced by 36.8573%. These results indicate that GMAN plays an important role in assisting grid dispatching and mitigating the negative impact of renewable energy fluctuations on the grid. When this mechanism is removed, both the stability and the economy of the grid system are significantly affected.

The simulation results of the second scene under the ablation experiment are shown in table 3.

TABLE 3 Table 3

As shown in Table 3, the model without the mixed experience cross driving mechanism is finished 5 steps earlier than the grid optimization scheduling model based on multi-countermeasure reinforcement learning, the safety score is reduced by 14.633%, and the rewarding value is reduced by 8.248%. This suggests that the hybrid empirical cross drive mechanism plays a key role in ensuring that the grid system operates stably and maintains high prize values. The model without GMAN is finished 12 steps earlier than the power grid optimization scheduling model based on multi-countermeasure reinforcement learning, the safety score is reduced by 12.860%, and the rewarding value is reduced by 12.003%. This means GMAN that safe operation of the grid system can be ensured, and the negative effects of renewable energy fluctuations are mitigated.

The performance between the model without GMAN and the model without the mixed experience cross driving mechanism and the power grid optimization scheduling model based on multi-countermeasure reinforcement learning provided by the invention is compared through an ablation experiment. Experimental results show that whether GMAN or a hybrid experience cross drive mechanism plays an indispensable role in improving the robustness of a power grid dispatching strategy, guaranteeing the safe operation of a power grid system and reducing the negative influence caused by renewable energy fluctuation.

In addition, the power grid optimization scheduling model based on multi-countermeasure reinforcement learning provided by the invention also realizes the remarkable improvement of the new energy consumption rate, achieves a high level of more than 98%, and realizes the efficient consumption of new energy. Experimental results show that the power grid optimization scheduling model based on multi-countermeasure reinforcement learning provided by the invention has obvious advantages in the aspect of optimization scheduling strategies. Compared with a model without GMAN, the power grid optimization scheduling model based on multi-countermeasure reinforcement learning provided by the invention maintains high safety score, reduces unit operation cost and maintains higher rewarding value. This is because GMAN effectively mitigates the negative impact of renewable energy fluctuations on the grid, improving the stability and economy of the scheduling strategy. Higher security scores and prize values are achieved while maintaining a high number of steps than models that eliminate the hybrid empirical cross-drive mechanism. This demonstrates that the hybrid empirical cross-drive mechanism plays a key role in improving policy scheduling efficiency and robustness. This shows that GMAN and the hybrid experience cross-drive mechanism play an important role in the model, and jointly improve the performance of the power grid optimization scheduling model based on multi-countermeasure reinforcement learning.

Regarding the comparative experiment: comparing the grid optimization scheduling model based on multi-countermeasure reinforcement learning with other grid scheduling methods in two scenes to further verify the scheduling effect, wherein other scheduling algorithms comprise: DDPG, DCR-TD3 (a distributed dual delay depth deterministic strategy gradient method) and PPO (a near-end strategy optimization method with novel objective function proposed by OpenAI).

The simulation results of the first scene under the comparison experiment are shown in table 4.

TABLE 4 Table 4

As shown in table 4, the grid optimization scheduling model based on multi-challenge reinforcement learning in scenario one combines experimental data superior to a single model in several respects. The power grid optimization scheduling model based on multi-countermeasure reinforcement learning, DDPG and the DCR-TD3 algorithm are operated for 96 steps, and 40 steps more than PPO. The safety component ratio DDPG and the DCR-TD3 of the power grid optimization scheduling model based on multi-countermeasure reinforcement learning are 1.058% and 0.184%, and compared with PPO, the safety component ratio is 89.671%. On the aspect of new energy consumption rate, the power grid optimization scheduling model based on multi-countermeasure reinforcement learning is 79.524% higher than DDPG, and 22.342% higher than DCR-TD 3. Compared with the DCR-TD3 with optimal performance, the power grid optimal scheduling model based on multi-countermeasure reinforcement learning provided by the invention has the advantages that the performance is improved by 31.250%, compared with DDPG and PPO, the power grid optimal scheduling model based on multi-countermeasure reinforcement learning is improved by 123.486% and 83.852%, and the superiority of the power grid optimal scheduling model based on multi-countermeasure reinforcement learning in the scheduling method is verified.

Referring to fig. 4-6, the renewable energy utilization, cost, and jackpot for each step in scenario one of the scheduling methods described above are more intuitively depicted. As shown in fig. 4, the PPO and the grid optimization scheduling model based on multi-countermeasure reinforcement learning achieve relatively high renewable energy utilization rate at the beginning, but the scheduling of the PPO is very unstable, and the grid optimization scheduling model based on multi-countermeasure reinforcement learning is ended at step 56, in contrast to the fact that the grid optimization scheduling model based on multi-countermeasure reinforcement learning shows excellent stability and efficiency in terms of renewable energy utilization rate, which is the best method in terms of renewable energy utilization rate, and the grid optimization scheduling model based on multi-countermeasure reinforcement learning provided by the invention is also at a low-medium level in terms of cost, as shown in fig. 5. While figure 6 shows the overall scheduling performance. The reward initially obtained by PPO increases rapidly but ends rapidly, indicating its instability. DDPG and DCR-TD3 run more steps, but the rewards during the run-time are significantly lower than the grid optimization scheduling model based on multi-challenge reinforcement learning proposed by the present invention. In general, the grid optimization scheduling model based on multi-countermeasure reinforcement learning realizes the highest accumulated rewards, the highest stability and efficiency in the aspect of renewable energy utilization rate and the centered scheduling cost, and verifies the superiority of the grid optimization scheduling model in a scheduling method.

The simulation results of the second scene under the comparison experiment are shown in table 5.

TABLE 5

As shown in table 5, the grid optimization scheduling model based on multi-challenge reinforcement learning in scenario two combines experimental data superior to a single algorithm in several respects. Compared with DDPG and DCR-TD3, the power grid optimization scheduling model based on multi-countermeasure reinforcement learning provided by the invention is improved by 24 steps and 15 steps in terms of running steps, and the worse PPO is operated for 72 steps more. The safety component ratio DDPG and the DCR-TD3 of the power grid optimization scheduling model based on multi-countermeasure reinforcement learning are 48.814% and 22.688%, and compared with PPO, the safety component ratio is 327.184%. Compared with DDPG, DCR-TD3 and PPO, the power grid optimization scheduling model based on multi-countermeasure reinforcement learning provided by the invention has the advantages that the unit operation cost is reduced by 2.317%, 0.782% and 0.360%. In terms of new energy consumption rate, the power grid optimization scheduling model based on multi-countermeasure reinforcement learning provided by the invention is improved by 4.822% compared with the optimal PPO, and is improved by 71.491% and 36.293% compared with DDPG and DCR-TD 3. The power grid optimization scheduling model based on multi-countermeasure reinforcement learning has the advantages that the power grid optimization scheduling model is improved by 80.726% compared with the DCR-TD3 with optimal performance on the total rewarding value, is improved by 202.034% and 335.649% compared with DDPG and PPO, and is verified to be superior in the most advanced scheduling method.

New energy uncertainty analysis experiments: and (3) performing experimental comparison on a power grid optimization scheduling model based on multi-countermeasure reinforcement learning and the situation that the DDPG, the DCR-TD3 and the PPO have the energy forecast deviation rate of Jing Yixin%. The simulation results of the scene one under the new energy uncertainty analysis experiment are shown in table 6.

TABLE 6

As shown in table 6, the grid optimization scheduling model based on multi-challenge reinforcement learning combines experimental data superior to a single algorithm in several aspects. In terms of running steps, the power grid optimization scheduling model based on multi-countermeasure reinforcement learning is an optimal 96 steps, runs 5 steps more than the suboptimal DDPG steps, and runs 7 steps and 62 steps more than the DCR-TD3 and PPO respectively. The power grid optimization scheduling model based on multi-countermeasure reinforcement learning provided by the invention is 6.095% and 3.814% higher than DDPG and DCR-TD3 in safety score, and 173.726% higher than PPO. In the aspect of unit operation cost, the power grid optimization scheduling model based on multi-countermeasure reinforcement learning provided by the invention has the lowest operation cost, saves 0.165% and 0.835% compared with DDPG and DCR-TD3, and saves 4.409% compared with PPO. On the aspect of new energy consumption rate, the power grid optimization scheduling model based on multi-countermeasure reinforcement learning is 83.111% higher than DDPG, and 21.598% higher than DCR-TD 3. The power grid optimization scheduling model based on multi-countermeasure reinforcement learning has the advantages that the power grid optimization scheduling model is improved by 36.143% compared with the DCR-TD3 with optimal performance on the total rewarding value, is improved by 155.375% and 172.505% compared with DDPG and PPO, and is verified to be superior in the most advanced scheduling method.

The following are device embodiments of the present invention that may be used to perform method embodiments of the present invention. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the present invention.

Referring to fig. 7, in still another embodiment of the present invention, a power grid optimization scheduling system is provided, which can be used to implement the power grid optimization scheduling method described above, and specifically, the power grid optimization scheduling system includes a data acquisition module and an optimization scheduling module.

The data acquisition module is used for acquiring the current running state of the power grid; the optimization scheduling module is used for inputting the current running state of the power grid into a pre-trained power grid optimization scheduling model based on multi-countermeasure reinforcement learning to obtain the current running state adjustment action of the power grid. The grid optimization scheduling model based on multi-countermeasure reinforcement learning is obtained by modifying a target Q value of a criticism network in a depth deterministic action gradient model during training into a sum of instant rewards and a predicted value of potential rewards of a grid future; the future potential rewards predicted value of the power grid is obtained by predicting a pre-trained generating type multi-countermeasure network model based on the operation state of the power grid during training and the daily operation state adjustment action.

All relevant contents of each step involved in the foregoing embodiment of the power grid optimization scheduling method may be cited to the functional description of the functional module corresponding to the power grid optimization scheduling system in the embodiment of the present invention, which is not described herein.

The division of the modules in the embodiments of the present invention is schematically only one logic function division, and there may be another division manner in actual implementation, and in addition, each functional module in each embodiment of the present invention may be integrated in one processor, or may exist separately and physically, or two or more modules may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules.

In yet another embodiment of the present invention, a computer device is provided that includes a processor and a memory for storing a computer program including program instructions, the processor for executing the program instructions stored by the computer storage medium. The Processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose Processor, digital signal Processor (DIGITAL SIGNAL Processor, DSP), application Specific Integrated Circuit (ASIC), off-the-shelf Programmable gate array (Field-Programmable GATEARRAY, FPGA) or other Programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., which are the computational core and control core of the terminal, adapted to implement one or more instructions, in particular to load and execute one or more instructions in a computer storage medium to implement a corresponding method flow or a corresponding function; the processor provided by the embodiment of the invention can be used for operating the power grid optimal scheduling method.

In yet another embodiment of the present invention, a storage medium, specifically a computer readable storage medium (Memory), is a Memory device in a computer device, for storing a program and data. It is understood that the computer readable storage medium herein may include both built-in storage media in a computer device and extended storage media supported by the computer device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the corresponding steps of the method for power grid optimized scheduling in the above-described embodiments.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims

1. The power grid optimal scheduling method is characterized by comprising the following steps of:

Acquiring the current running state of the power grid;

Inputting the current running state of the power grid into a pre-trained power grid optimization scheduling model based on multi-countermeasure reinforcement learning to obtain the current running state adjustment action of the power grid;

2. The grid-optimized scheduling method of claim 1, wherein the pre-trained generated multi-countermeasure network model is pre-trained:

Y^*＝argmin_Ymax_DV(Y)

3. The grid-optimized dispatching method of claim 2, wherein the V (Y):

4. The grid-optimized scheduling method of claim 2, wherein the pre-trained generated multi-countermeasure network model is pre-trained:

model parameters of the generated multi-countermeasure network model are updated in a random gradient descent mode.

5. The grid optimization scheduling method according to claim 1, wherein the pretrained grid optimization scheduling model based on multi-countermeasure reinforcement learning is characterized in that when pretrained:

based on the preset historical experience extraction quantity, extracting good historical experiences from a good historical experience sub-pool of a historical experience pool in combination with a good historical experience extraction proportion, and extracting bad historic experiences from a bad historical experience sub-pool of the historical experience pool in combination with a bad historical experience extraction proportion; and pre-training a power grid optimization scheduling model of multi-countermeasure reinforcement learning according to the extracted good history experience and bad history experience.

6. The grid-optimized dispatching method of claim 5, further comprising:

Based on a preset instant rewarding threshold value, dividing the historical experience with instant rewards larger than the instant rewarding threshold value in the historical experience pool into good historical experience, and dividing the historical experience with instant rewards not larger than the instant rewarding threshold value in the historical experience pool into bad historical experience, so as to obtain a good historical experience sub-pool and a bad historical experience sub-pool.

7. A grid optimized dispatching system, comprising:

The data acquisition module is used for acquiring the current running state of the power grid;

The optimization scheduling module is used for inputting the current running state of the power grid into a pre-trained power grid optimization scheduling model based on multi-countermeasure reinforcement learning to obtain the current running state adjustment action of the power grid;

8. The grid optimization scheduling system of claim 7, wherein the pre-trained multi-countermeasure reinforcement learning-based grid optimization scheduling model is when pre-trained:

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the grid optimized scheduling method according to any one of claims 1 to 6 when the computer program is executed.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the grid optimized scheduling method of any one of claims 1 to 6.