Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The invention is described in further detail below with reference to the attached drawing figures:
The invention provides a reinforcement-learning-based dispatching optimization method for a power grid simulation environment. The method builds a reinforcement learning algorithm framework by constructing a state space and an action space, constructing reward functions and training a neural network, and uses this framework to carry out dispatching optimization of the power grid simulation environment. As shown in figure 1, the method comprises the following steps:
S1, constructing a state space of a power grid simulation environment and an action space of the power grid simulation environment;
the construction of the state space of the power grid simulation environment and the action space of the power grid simulation environment comprises the following concrete steps:
Acquiring numerical characteristics and non-numerical characteristics, and constructing a state space of a power grid simulation environment based on the numerical characteristics and the non-numerical characteristics, wherein the numerical characteristics comprise generator active power, generator reactive power, load active power, load reactive power, predicted load, line power and constraint power;
in the construction of the action space of the power grid simulation environment, a passive discriminator is arranged to run in the environment. The specific process of the passive discriminator is as follows: after the agent takes an operation control action, all power flow constraints of the system are calculated at the same time as the reward is calculated; if the action taken by the agent causes a constraint violation, the action probabilities are sorted in descending order and the operation control action with the next highest probability is selected, and so on until the agent finds an action that satisfies the constraint conditions; if no action can satisfy the constraint conditions, the agent executes the action that maximizes the current reward function.
S2, constructing a target-based reward function and a constraint-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment;
Constructing the target-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment specifically comprises the following steps:
the target-based reward function comprises a first power grid dispatching reward, a second power grid dispatching reward and a third power grid dispatching reward of the novel energy system, wherein the first power grid dispatching reward comprises a power flow divergence reward, a power control reward and a voltage balance control reward;
The constraint conditions of the novel energy system are as follows:
Wherein C1 to C5 represent the constraint conditions; P_Gn^min and P_Gn^max represent the lower and upper limits of the active power output of each generator, and Q_Gn^min and Q_Gn^max represent the lower and upper limits of the reactive power output of each generator; V_Gi^min and V_Gi^max represent the lower and upper voltage limits of each generator node, and V_Li^min and V_Li^max represent the voltage limits of each load node; ΔP_i is the active power imbalance at node i, ΔQ_i is the reactive power imbalance at node i, P_i is the active power injected at node i, Q_i is the reactive power injected at node i, V_i and V_j are the voltage magnitudes at nodes i and j, G_ij and B_ij are the conductance and susceptance of the branch, and θ_ij is the voltage phase angle difference; P_Gn and Q_Gn represent the active and reactive power of a generator, and P_Lm and Q_Lm represent the active and reactive power of a load node; δ > 0 denotes a small positive value close to 0; N is the number of generators and n indexes a generator; M is the number of loads and m indexes a load.
The power flow divergence reward is shown in the following formula:
The power control reward is shown in the following formula:
The voltage balance control reward is shown in the following formula:
Wherein R1 represents the power flow divergence reward function, R2 represents the power control reward function, R3 represents the voltage balance reward function, e() denotes exponential normalization, R_min represents the minimum reward during system operation, R_g represents the penalty imposed on the agent when the power flow diverges, ρ_w represents the current transmission power of transmission line w, X_w is the penalty coefficient for exceeding the power flow limit of transmission line w, R_0 is a default reward function reflecting the operating consistency of the power grid, W is the number of transmission lines and w indexes a line, P_G represents the active power of the generators, and Q_G represents the reactive power of the generators.
The wind and solar curtailment cost reward is shown in the following formula:
The unit operation cost reward is shown in the following formula:
Wherein R4 represents the power generation cost reward function, R5 represents the wind and solar curtailment cost reward function, a_n represents the cost coefficient of the quadratic term, b_n represents the cost coefficient of the linear term, and c_n represents the cost coefficient of the constant term; C_q represents the unit wind and solar curtailment cost coefficient; P_new^max represents the maximum output of a new energy generator, P_new is the actual output of a new energy generator, and N_new is the number of new energy generators, indexed by n_new.
The renewable energy output reward is shown in the following formula:
The carbon emission reward is shown in the following formula:
wherein R6 represents the renewable energy output reward function, R7 represents the carbon emission reward function, c_p represents the unit cost coefficient of conventional units, P_con represents the actual output of a conventional generator, and N_con is the number of conventional generators, indexed by n_con.
Constructing the constraint-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment specifically comprises the following steps:
wherein λ is a fixed reward value given by the constraint-based reward function, and C1 to C5 represent the constraint conditions of the novel energy system.
S3, obtaining a weighted reward according to the target-based reward function and the constraint-based reward function, and carrying out dispatching optimization of the power grid simulation environment by adopting an Actor-Critic network structure in combination with the weighted reward.
Obtaining the weighted reward according to the target-based reward function and the constraint-based reward function specifically comprises the following steps:
Where r is the weighted reward value, α_i() denotes the normalization applied to each reward value, and the weight coefficient of each reward term is typically in the range of 0 to 1.
The method is described in detail below with reference to the accompanying drawings:
1) State space construction
The input received by the agent determines, to a large extent, how well it performs. As shown in fig. 2, the state space required by the invention includes both numerical features and non-numerical features as input of the agent, so that the agent can access the whole state of the power network at each time step. The numerical features include generator active and reactive power, load active and reactive power, predicted load, line power and constraint power. The non-numerical features include the line on/off state, which is incorporated directly into the topological feature as an additional vector.
Wherein S is the state space, P_L and Q_L represent the load active and reactive power, P_G and Q_G represent the generator active and reactive power, P_Lk^(t+1) represents the predicted active demand of the k-th load node at the next time step, P_ρ represents the line power, Q_C represents the constraint power, and PR is an additional vector of line on/off states and similar quantities that represents the topology of the power system.
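The explicit expression of the state space is not reproduced above; based on the symbols just defined, a plausible form (an assumption rather than the source's exact expression) is:

```latex
S_t = \left[\, P_L,\; Q_L,\; P_G,\; Q_G,\; P_{L,k}^{\,t+1},\; P_{\rho},\; Q_C,\; PR \,\right]
```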
2) Action space construction
In the construction of the action space, as shown in fig. 3, a passive discriminator is set to run in the environment; feasible actions are screened only when the system is out of limit, and the simulation then continues until the iteration ends. The passive discriminator acts here as the first layer of insurance for the safety of the power system.
As shown in FIG. 4, the specific flow of the decision made by the passive discriminator is to calculate all power flow constraints of the system at the same time as the reward is calculated, after the agent takes an operation control action. If the action taken by the agent would cause a constraint violation, the action probabilities are sorted in descending order and the operation control action with the next highest probability is selected, and so on until the agent finds an action that satisfies the constraints. If no action can satisfy the constraint conditions, the agent executes the action that maximizes the current reward function. A minimal sketch of this screening logic is given below.
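The following Python sketch illustrates the passive-discriminator screening described above. It is a simplified illustration only; the callables `violates_constraints` and `reward` stand in for the actual power flow check and reward evaluation of the simulation environment, which are not specified in the source.

```python
import numpy as np

def passive_discriminator(action_probs, actions, state,
                          violates_constraints, reward):
    """Screen actions as described for the passive discriminator (fig. 4).

    action_probs         : policy probabilities for each candidate action
    actions              : the candidate operation-control actions
    violates_constraints : callable(state, action) -> bool, True if any
                           power flow constraint would be violated
    reward               : callable(state, action) -> float, current reward
    """
    # Try actions in descending order of policy probability.
    for idx in np.argsort(action_probs)[::-1]:
        if not violates_constraints(state, actions[idx]):
            return actions[idx]          # first action meeting all constraints
    # No action satisfies the constraints: fall back to the action
    # that maximizes the current reward function.
    rewards = [reward(state, a) for a in actions]
    return actions[int(np.argmax(rewards))]
```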
In the action screening process, the topology actions and the generation output adjustment actions that can restore the system from a power flow out-of-limit state to a safe operating state together form the action space for agent training:
Ω_t = Ω_{t-1} ∪ (Ω_{A,t-1} ∩ Ω_{TP,t-1}) ∪ (Ω_{A,t-1} ∩ Ω_{G,t})
Wherein, omega t and omega t+1 are respectively the movable sets in the t-step simulation and the t+1-step simulation, omega A,t is the movable set which can meet all constraint conditions in the t-step simulation and is better than the movable set which does not execute the action in the current rewarding, omega TP,t is all the feasible topological movable sets in the t-step simulation, and omega G,t is all the feasible power generation force adjustment movable sets in the t-step simulation.
3) Reward function architecture construction
3.1 Target-based reward function
(1) Security dispatch rewards
The security dispatch reward mainly involves three aspects of the novel power system: power flow divergence, active and reactive power control, and voltage balance control. According to the constraint conditions related to each aspect, the corresponding evaluation indexes are normalized and then converted into the corresponding reward functions, which together form the security dispatch reward of the novel power system.
The constraint conditions of the novel energy system are as follows:
Wherein C1 to C5 represent the constraint conditions; P_Gn^min and P_Gn^max represent the lower and upper limits of the active power output of each generator, and Q_Gn^min and Q_Gn^max represent the lower and upper limits of the reactive power output of each generator; V_Gi^min and V_Gi^max represent the lower and upper voltage limits of each generator node, and V_Li^min and V_Li^max represent the voltage limits of each load node; ΔP_i is the active power imbalance at node i, ΔQ_i is the reactive power imbalance at node i, P_i is the active power injected at node i, Q_i is the reactive power injected at node i, V_i and V_j are the voltage magnitudes at nodes i and j, G_ij and B_ij are the conductance and susceptance of the branch, and θ_ij is the voltage phase angle difference; P_Gn and Q_Gn represent the active and reactive power of a generator, and P_Lm and Q_Lm represent the active and reactive power of a load node; δ > 0 denotes a small positive value close to 0; N is the number of generators and n indexes a generator; M is the number of loads and m indexes a load.
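The explicit constraint expressions are not reproduced above. Based on the symbol definitions, one plausible reconstruction of C1 to C5 (stated here as an assumption consistent with standard AC power flow and operating-limit constraints, not as the exact expressions of the source) is:

```latex
\begin{aligned}
C_1:&\quad |\Delta P_i| \le \delta,\qquad
\Delta P_i = P_i - V_i \sum_{j} V_j\left(G_{ij}\cos\theta_{ij} + B_{ij}\sin\theta_{ij}\right)\\
C_2:&\quad |\Delta Q_i| \le \delta,\qquad
\Delta Q_i = Q_i - V_i \sum_{j} V_j\left(G_{ij}\sin\theta_{ij} - B_{ij}\cos\theta_{ij}\right)\\
C_3:&\quad P_{G,n}^{\min} \le P_{G,n} \le P_{G,n}^{\max},\qquad n = 1,\dots,N\\
C_4:&\quad Q_{G,n}^{\min} \le Q_{G,n} \le Q_{G,n}^{\max},\qquad n = 1,\dots,N\\
C_5:&\quad V_i^{\min} \le V_i \le V_i^{\max}\quad \text{for all generator and load nodes}
\end{aligned}
```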
The power flow divergence reward of the novel energy system is shown in the following formula:
The power control reward is shown in the following formula:
The voltage balance control reward is shown in the following formula:
Wherein R1, R2 and R3 take normalized negative values; the closer a value is to 0, the better the current operating state of the system and the closer it is to the optimization target. R1 represents the power flow divergence reward function, R2 represents the power control reward function, R3 represents the voltage balance reward function, e() denotes exponential normalization, R_min represents the minimum reward during system operation, R_g represents the penalty imposed on the agent when the power flow diverges, ρ_w represents the current transmission power of transmission line w, X_w is the penalty coefficient for exceeding the power flow limit of transmission line w, R_0 is a default reward function reflecting the operating consistency of the power grid, W is the number of transmission lines and w indexes a line, P_G represents the active power of the generators, and Q_G represents the reactive power of the generators.
(2) Economic dispatch rewards
The economic dispatch reward mainly involves two aspects, unit operation cost and wind and solar curtailment cost; the corresponding evaluation indexes are normalized and then converted into the corresponding reward functions.
The wind and solar curtailment cost reward is shown in the following formula:
The unit operation cost reward is shown in the following formula:
Wherein the values are normalized negative values; the closer to 0, the better the current operating state of the system and the closer it is to the optimization target. R4 is the power generation cost reward function, R5 is the wind and solar curtailment cost reward function, a_n is the cost coefficient of the quadratic term, b_n is the cost coefficient of the linear term, and c_n is the cost coefficient of the constant term; C_q represents the unit wind and solar curtailment cost coefficient; P_new^max represents the maximum output of a new energy generator, P_new is the actual output of a new energy generator, and N_new is the number of new energy generators, indexed by n_new.
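The exact formulas for R4 and R5 are not reproduced above. A plausible form consistent with the listed coefficients, before the exponential normalization e() is applied, is sketched below as an assumption:

```latex
\begin{aligned}
R_4 &\propto -\sum_{n=1}^{N}\left(a_n P_{G,n}^{2} + b_n P_{G,n} + c_n\right)
&&\text{(quadratic unit operation cost)}\\
R_5 &\propto -\,C_q \sum_{n_{\mathrm{new}}=1}^{N_{\mathrm{new}}}\left(P_{\mathrm{new}}^{\max} - P_{\mathrm{new}}\right)
&&\text{(wind and solar curtailment cost)}
\end{aligned}
```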
(3) Low carbon dispatch rewards
The low-carbon dispatch reward mainly involves two aspects, the output of renewable energy sources and carbon emission; the corresponding evaluation indexes are normalized and then converted into the corresponding reward functions.
The renewable energy output reward is shown in the following formula:
The carbon emission reward is shown in the following formula:
Wherein R6 takes a value between 0 and 1, and the closer it is to 1, the higher the renewable energy utilization rate of the current system; R7 takes a value between -1 and 0, and the closer it is to 0, the lower the carbon emission of the current system. R6 is the renewable energy output reward function, R7 is the carbon emission reward function, c_p is the unit cost coefficient of conventional units, P_con is the actual output of a conventional generator, and N_con is the number of conventional generators, indexed by n_con.
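The exact formulas for R6 and R7 are likewise omitted above. Given that R6 lies in [0, 1] and R7 in [-1, 0], one plausible reconstruction (an assumption, not the source's exact expressions) is:

```latex
\begin{aligned}
R_6 &= \frac{\sum_{n_{\mathrm{new}}=1}^{N_{\mathrm{new}}} P_{\mathrm{new}}}
            {\sum_{n_{\mathrm{new}}=1}^{N_{\mathrm{new}}} P_{\mathrm{new}}^{\max}}
&&\in [0,\,1]\\
R_7 &\propto -\,c_p \sum_{n_{\mathrm{con}}=1}^{N_{\mathrm{con}}} P_{\mathrm{con}},
\qquad \text{normalized so that } R_7 \in [-1,\,0]
\end{aligned}
```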
3.2 Constraint-based rewards
Since most of the above target-based dispatch rewards are negative, in order to enable the agent to fully explore the simulation environment, a positive reward is given when all target constraints are satisfied, encouraging the agent to explore within the region that meets the constraint conditions. This constraint-based reward plays the role of a second layer of insurance for grid safety in the method: out-of-limit actions are prevented at the action selection stage, while exploration is appropriately encouraged at the reward stage, ensuring a balance between exploration and exploitation:
wherein λ is a fixed reward value given by the constraint-based reward function, and C1 to C5 represent the constraint conditions of the novel energy system.
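The expression of the constraint-based reward is not reproduced above; based on the description (a fixed positive reward λ when all constraints hold), a plausible form, written here as an assumption with R_c denoting the constraint-based reward, is:

```latex
R_c =
\begin{cases}
\lambda, & \text{if } C_1, C_2, C_3, C_4, C_5 \text{ are all satisfied}\\[2pt]
0, & \text{otherwise}
\end{cases}
```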
Combining the above 8 reward functions, the final weighted reward value applied to the reinforcement learning agent is constructed as follows:
Where r is the weighted reward value, α_i() denotes the normalization applied to each reward value, and the weight coefficient of each reward term is typically in the range of 0 to 1.
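The weighted combination can be illustrated with the short Python sketch below. The weight values, the min-max normalization used for α_i, and the treatment of R_c as the eighth term are assumptions for illustration; the source only states that each weight lies between 0 and 1.

```python
def normalize(value, v_min, v_max):
    """alpha_i: map a raw reward value into [0, 1] (assumed min-max normalization)."""
    return (value - v_min) / (v_max - v_min) if v_max > v_min else 0.0

def weighted_reward(rewards, bounds, weights):
    """r = sum_i w_i * alpha_i(R_i) over the 8 reward terms R_1..R_7 and R_c."""
    return sum(w * normalize(r, lo, hi)
               for r, (lo, hi), w in zip(rewards, bounds, weights))

# Example with placeholder values for the 8 reward terms:
rewards = [-0.8, -0.3, -0.1, -0.5, -0.2, 0.7, -0.4, 1.0]   # R1..R7, R_c
bounds  = [(-1, 0)] * 5 + [(0, 1), (-1, 0), (0, 1)]        # assumed value ranges
weights = [0.2, 0.1, 0.1, 0.15, 0.1, 0.15, 0.1, 0.1]       # each in [0, 1]
print(weighted_reward(rewards, bounds, weights))
```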
4) Neural network training and algorithm flow based on A3C
The A3C algorithm used by the invention adopts an Actor-Critic network structure consisting of an Actor network and a Critic network, i.e., an actor network and a critic network. The goal of the Actor network is to learn a policy function, i.e., a probability distribution over actions in a given state. The goal of the Critic network is to learn a state value function or a state-action value function for evaluating the value of different states or state-action pairs. The training framework of the A3C algorithm used in the invention is shown in fig. 5.
The A3C algorithm adopted by the invention uses an asynchronous training mode, in which each thread interacts with the environment independently and gradient updates are realized through parameter sharing. After the neural network parameters are initialized, multiple parallel training threads are created; each thread independently runs an agent that interacts with the environment and uses the Actor and Critic networks to approximate the policy and value functions. Each thread interacts with the power grid environment independently, selects scheduling actions after action screening according to the current policy network, observes the new state and reward fed back by the grid environment, and stores this information in a local experience replay buffer. When a thread reaches a predetermined number of time steps or the end of a trajectory, it samples data from the experience replay buffer and performs gradient updates by calculating the advantage function. After a thread has executed several gradient updates, the updated parameters are transferred to the main thread for a global parameter update. The above process is repeated until a predetermined number of training rounds is reached or a termination condition is met. This training mode improves the efficiency and stability of training and enables better policies and value functions to be learned.
The specific algorithm flow in the invention is as follows:
The algorithm inputs are: the parameter θ of the policy network π and the parameter ω of the value network in the global A3C neural network; the corresponding parameters θ' and ω' in the A3C neural network of each branch thread; the global shared iteration counter T, the global maximum number of iterations T_max, the maximum length T_t of a time series within a thread, the action space, and the discount factor γ;
The algorithm outputs are the power grid dispatching policy π and the value function V.
1. Set the time series t = 1.
2. Reset the gradient update quantities of the global Actor network and the global Critic network: dθ ← 0 and dω ← 0. The purpose of this step is to clear the previously accumulated gradient information at the beginning of each new iteration, ensuring that the new calculation is not disturbed by old gradients.
3. Synchronize the parameters from the global Actor network and the global Critic network to the neural network of the branch thread: θ' = θ, ω' = ω. This step ensures that the neural network of the branch thread has the same initial parameters as the global network.
4. Initialize t_start = t and obtain the state s_t of time series t.
5. Select action Ω_t based on the policy π(Ω_t | s_t; θ).
6. Screen the action space and detect whether the action is out of limit.
7. Execute action Ω_t to obtain the multi-objective reward r_t and the next state s_t+1 given by the simulation environment.
8. Update the time steps: t ← t + 1, T ← T + 1.
9. If s_t is a terminal state, or the time series t reaches its maximum length, go to step 10; otherwise, return to step 5.
10. Calculate R(s, t) for the last state s_t of the time series, where:
11. Going backwards, for each time series index i ∈ (t-1, t-2, ..., t_start), perform the following operations:
12. Calculate R(s, i) = r_i + γR(s, i+1) for each instant.
13. Accumulate the gradient update of the Actor:
14. Accumulate the gradient update of the Critic:
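The expressions attached to steps 10, 13 and 14 are not reproduced above. In the standard A3C formulation, which this flow appears to follow, they would read as below; this is an assumption based on the standard algorithm rather than the source's exact equations:

```latex
\begin{aligned}
\text{Step 10:}\quad & R(s,t) =
\begin{cases}
0, & \text{if } s_t \text{ is terminal}\\
V(s_t;\,\omega'), & \text{otherwise}
\end{cases}\\[4pt]
\text{Step 13:}\quad & d\theta \leftarrow d\theta
  + \nabla_{\theta'} \log \pi(\Omega_i \mid s_i;\,\theta')\,
    \bigl(R(s,i) - V(s_i;\,\omega')\bigr)\\[2pt]
\text{Step 14:}\quad & d\omega \leftarrow d\omega
  + \frac{\partial\,\bigl(R(s,i) - V(s_i;\,\omega')\bigr)^{2}}{\partial \omega'}
\end{aligned}
```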
15. If i = t_start, go to step 16; otherwise, go to step 11.
16. Update the model parameters of the global neural network, i.e., asynchronously update θ using dθ and asynchronously update ω using dω.
17. If T > T_max, end the loop and output the parameters θ and ω of the global A3C neural network; otherwise, go to step 3.
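To make steps 1-17 concrete, the PyTorch sketch below implements one n-step update of a branch thread: it computes R(s, i) backwards, forms the actor and critic losses from the advantage, pushes the accumulated gradients onto the global network, and re-synchronizes the local parameters. It is a minimal illustration under common A3C conventions (shared optimizer over the global network, discrete action space), not the exact implementation of the invention; network sizes and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Shared-body Actor-Critic network: a policy head and a value head."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.pi = nn.Linear(hidden, n_actions)   # Actor head (policy logits)
        self.v = nn.Linear(hidden, 1)            # Critic head (state value)

    def forward(self, s):
        h = self.body(s)
        return F.softmax(self.pi(h), dim=-1), self.v(h).squeeze(-1)

def worker_update(local_net, global_net, optimizer, trajectory, gamma=0.99):
    """One n-step update of a branch thread (steps 10-16): compute R(s, i)
    backwards, accumulate actor/critic gradients, apply them to the global
    network, then re-synchronize the local network."""
    states, actions, rewards, last_state, done = trajectory

    # Step 10: bootstrap value of the last state (0 if terminal).
    R = 0.0 if done else local_net(last_state.unsqueeze(0))[1].item()

    # Steps 11-12: n-step returns, computed from back to front.
    returns = []
    for r in reversed(rewards):              # R(s, i) = r_i + gamma * R(s, i+1)
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns, dtype=torch.float32)

    probs, values = local_net(torch.stack(states))
    dist = torch.distributions.Categorical(probs)
    log_probs = dist.log_prob(torch.tensor(actions))
    advantage = returns - values

    # Steps 13-14: accumulated actor and critic gradients via the loss.
    actor_loss = -(log_probs * advantage.detach()).sum()
    critic_loss = advantage.pow(2).sum()
    loss = actor_loss + 0.5 * critic_loss

    local_net.zero_grad()
    loss.backward()
    # Step 16: push local gradients onto the global network and update it
    # (the optimizer is assumed to be built over global_net.parameters()).
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp._grad = lp.grad
    optimizer.step()
    # Step 3 of the next round: synchronize the local network with the global one.
    local_net.load_state_dict(global_net.state_dict())
```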
The A3C algorithm is compared with the Random algorithm, i.e., a random power grid dispatching algorithm. The Random algorithm randomly chooses a scheduling action when the power flow is out of limit, without considering any policy.
(1) Average cumulative return comparison
Simulation experiments were performed herein on the IEEE-5 and IEEE-36 data sets using the A3C algorithm, and the average cumulative return of each episode was calculated. The results are shown in the figure:
The multi-objective reward function value is a large negative number at the beginning and, after multiple oscillations during optimization, gradually converges to a positive value, demonstrating the validity of the algorithm.
(2) Comparison of outage time
Because the IEEE-5 data set lacks renewable energy generating units, the IEEE-36 data set is selected for the sustainability comparison of the novel energy system discussed herein, and 6 groups of time-series segment data are randomly extracted as the test set.
It can be seen that the Random algorithm suffers power outages in 5 of the data segments, with the outage time exceeding 50%. The weight-adjusted A3C algorithm has no outage at all and can effectively realize safe operation of the system, which shows the good adaptability of the algorithm combined with weight adjustment and ensures safe operation of the novel energy system to the greatest extent.
The IEEE-5 node data set comprises related data of 5 substations, 8 power lines, 2 units and 3 loads.
The IEEE-36 node data comprises relevant data of 36 substations, 59 power lines, 22 generators (including 12 renewable energy generators such as wind generators, light generators and the like) and 37 loads.
For the actual case analysis and model effect verification, a standard example of a certain provincial grid rack, SG-126, and its operating conditions are taken as the experimental object. The figure shows the grid topology of the SG-126 standard example, which comprises 54 generator sets (including 18 new energy units), 91 loads (including ordinary loads and adjustable loads), 5 energy storage devices, 126 buses and 185 branch lines, with 145 nodes in total. The installed proportion of new energy exceeds 30%, and the adjustable units of this example include energy storage equipment and adjustable loads in addition to various types of generators, which provides a larger decision space. In addition, based on the fluctuation of new energy output, load curve characteristics, and energy storage and flexible load control strategies of a real power grid, the SG-126 example simulates open characteristics of the real grid environment such as line congestion, random line faults and weather changes, and covers typical grid operation scenarios such as tie-line congestion, tie-line N-1 faults, severe source-load fluctuation and new energy curtailment. It contains 100,000 continuous convergent AC power flow sections and one year of prediction data, which further increases the decision difficulty.
In order to verify the effectiveness of the technical scheme, actual case analysis and model effect verification are carried out on this example. The experiment is based on actual grid operation data, the improved A3C algorithm is adopted for training and testing, and the evaluation index is the average cumulative return. The experimental process includes data preprocessing, model training and model testing. In the training stage, constraint violations are prevented by the double insurance mechanism that combines action screening with the constraint-based reward; in the testing stage, the performance of the model in different scenarios is evaluated. The results show that the average cumulative return of the model gradually increases with training and finally reaches a high level in the testing stage, which proves the effectiveness and robustness of the model. The experimental results show that the A3C algorithm of the technical scheme performs well on the SG-126 grid rack, can effectively optimize action space screening and prevent constraint violations, and further verifies the feasibility and effectiveness of the technical scheme in practical application, in particular its innovation in data analysis, action space screening and constraint processing.
Example 2
The invention provides a power grid simulation environment scheduling optimization system based on reinforcement learning, as shown in fig. 6, comprising:
the power grid parameter acquisition module is used for constructing a state space of a power grid simulation environment and an action space of the power grid simulation environment;
the construction of the state space of the power grid simulation environment and the action space of the power grid simulation environment comprises the following concrete steps:
Acquiring numerical characteristics and non-numerical characteristics, and constructing a state space of a power grid simulation environment based on the numerical characteristics and the non-numerical characteristics, wherein the numerical characteristics comprise generator active power, generator reactive power, load active power, load reactive power, predicted load, line power and constraint power;
in the construction of the action space of the power grid simulation environment, a passive discriminator is arranged to run in the environment. The specific process of the passive discriminator is as follows: after the agent takes an operation control action, all power flow constraints of the system are calculated at the same time as the reward is calculated; if the action taken by the agent causes a constraint violation, the action probabilities are sorted in descending order and the operation control action with the next highest probability is selected, and so on until the agent finds an action that satisfies the constraint conditions; if no action can satisfy the constraint conditions, the agent executes the action that maximizes the current reward function.
The reward function construction module is used for constructing a target-based reward function and a constraint-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment;
Constructing the target-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment specifically comprises the following steps:
the target-based reward function comprises a first power grid dispatching reward, a second power grid dispatching reward and a third power grid dispatching reward of the novel energy system, wherein the first power grid dispatching reward comprises a power flow divergence reward, a power control reward and a voltage balance control reward;
The constraint conditions of the novel energy system are as follows:
Wherein C1 to C5 represent the constraint conditions; P_Gn^min and P_Gn^max represent the lower and upper limits of the active power output of each generator, and Q_Gn^min and Q_Gn^max represent the lower and upper limits of the reactive power output of each generator; V_Gi^min and V_Gi^max represent the lower and upper voltage limits of each generator node, and V_Li^min and V_Li^max represent the voltage limits of each load node; ΔP_i is the active power imbalance at node i, ΔQ_i is the reactive power imbalance at node i, P_i is the active power injected at node i, Q_i is the reactive power injected at node i, V_i and V_j are the voltage magnitudes at nodes i and j, G_ij and B_ij are the conductance and susceptance of the branch, and θ_ij is the voltage phase angle difference; P_Gn and Q_Gn represent the active and reactive power of a generator, and P_Lm and Q_Lm represent the active and reactive power of a load node; δ > 0 denotes a small positive value close to 0; N is the number of generators and n indexes a generator; M is the number of loads and m indexes a load.
The power flow divergence reward is shown in the following formula:
The power control reward is shown in the following formula:
The voltage balance control reward is shown in the following formula:
Wherein R1 represents the power flow divergence reward function, R2 represents the power control reward function, R3 represents the voltage balance reward function, e() denotes exponential normalization, R_min represents the minimum reward during system operation, R_g represents the penalty imposed on the agent when the power flow diverges, ρ_w represents the current transmission power of transmission line w, X_w is the penalty coefficient for exceeding the power flow limit of transmission line w, R_0 is a default reward function reflecting the operating consistency of the power grid, W is the number of transmission lines and w indexes a line, P_G represents the active power of the generators, and Q_G represents the reactive power of the generators.
The wind and solar curtailment cost reward is shown in the following formula:
The unit operation cost reward is shown in the following formula:
Wherein R4 represents the power generation cost reward function, R5 represents the wind and solar curtailment cost reward function, a_n represents the cost coefficient of the quadratic term, b_n represents the cost coefficient of the linear term, and c_n represents the cost coefficient of the constant term; C_q represents the unit wind and solar curtailment cost coefficient; P_new^max represents the maximum output of a new energy generator, P_new is the actual output of a new energy generator, and N_new is the number of new energy generators, indexed by n_new.
The renewable energy output reward is shown in the following formula:
The carbon emission reward is shown in the following formula:
wherein R6 represents the renewable energy output reward function, R7 represents the carbon emission reward function, c_p represents the unit cost coefficient of conventional units, P_con represents the actual output of a conventional generator, and N_con is the number of conventional generators, indexed by n_con.
Constructing the constraint-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment specifically comprises the following steps:
wherein λ is a fixed reward value given by the constraint-based reward function, and C1 to C5 represent the constraint conditions of the novel energy system.
And the dispatching optimization module is used for acquiring weighted rewards according to the rewards function based on the target and the rewards function based on the constraint, and carrying out dispatching optimization on the power grid simulation environment by adopting an Actor-Critic network structure and combining the weighted rewards.
Obtaining the weighted reward according to the target-based reward function and the constraint-based reward function specifically comprises the following steps:
Where r is the weighted reward value, α_i() denotes the normalization applied to each reward value, and each reward term has a corresponding weight coefficient.
Example 3
Referring to fig. 7, the present invention further provides an electronic device 100 for the reinforcement-learning-based power grid simulation environment scheduling optimization method. The electronic device 100 includes a memory 101, at least one processor 102, a computer program 103 stored in the memory 101 and executable on the at least one processor 102, and at least one communication bus 104.
The memory 101 may be used to store the computer program 103, and the processor 102 implements the steps of the reinforcement-learning-based power grid simulation environment scheduling optimization method described in Embodiment 1 by running or executing the computer program stored in the memory 101 and invoking the data stored in the memory 101. The memory 101 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data (such as audio data) created according to the use of the electronic device 100. In addition, the memory 101 may include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one disk storage device, a flash memory device, or other non-volatile solid-state storage device.
The at least one processor 102 may be a Central Processing Unit (CPU), but may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 102 may be a microprocessor or any conventional processor; the processor 102 is the control center of the electronic device 100 and uses various interfaces and lines to connect the various parts of the entire electronic device 100.
The memory 101 in the electronic device 100 stores a plurality of instructions to implement a reinforcement learning based grid simulation environment scheduling optimization method, and the processor 102 is configured to execute the plurality of instructions to implement:
constructing a state space of a power grid simulation environment and an action space of the power grid simulation environment;
constructing a target-based rewarding function and a constraint-based rewarding function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment;
And acquiring weighted rewards according to the rewards function based on the target and the rewards function based on the constraint, and carrying out dispatching optimization on the power grid simulation environment by adopting an Actor-Critic network structure and combining the weighted rewards.
Example 4
The modules/units integrated in the electronic device 100 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as a stand alone product. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, and a Read-Only Memory (ROM).
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the specific embodiments of the present invention without departing from the spirit and scope of the present invention, and any modifications and equivalents are intended to be included in the scope of the claims of the present invention.