Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The invention is described in further detail below with reference to the attached drawing figures:
The invention provides a reinforcement-learning-based dispatching optimization method for a power grid simulation environment. The method builds a reinforcement learning algorithm framework by constructing a state space and an action space, constructing reward functions and training a neural network, and uses this framework to carry out dispatching optimization of the power grid simulation environment. As shown in figure 1, the method comprises the following steps:
S1, constructing a state space of a power grid simulation environment and an action space of the power grid simulation environment;
the construction of the state space of the power grid simulation environment and the action space of the power grid simulation environment comprises the following concrete steps:
Acquiring numerical characteristics and non-numerical characteristics, and constructing a state space of a power grid simulation environment based on the numerical characteristics and the non-numerical characteristics, wherein the numerical characteristics comprise generator active power, generator reactive power, load active power, load reactive power, predicted load, line power and constraint power;
in the construction of the action space of the power grid simulation environment, a passive discriminator is arranged to run in the environment. The specific process of the passive discriminator is as follows: after the agent takes an operation control action, all power flow constraints of the system are calculated at the same time as the reward is calculated; if the action taken by the agent causes a constraint violation, the action probabilities are sorted in descending order and the operation control action with the next highest probability is selected, and so on until the agent finds an action that satisfies the constraint conditions; if no action can satisfy the constraint conditions, the agent executes the action that maximizes the current reward function.
S2, constructing a target-based reward function and a constraint-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment;
Constructing the target-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment specifically comprises the following steps:
the target-based reward function comprises a first power grid dispatching reward, a second power grid dispatching reward and a third power grid dispatching reward of the novel energy system, wherein the first power grid dispatching reward comprises a power flow divergence reward, a power control reward and a voltage balance control reward;
The constraint conditions of the novel energy system are as follows:
Wherein C1 to C5 represent the constraint conditions; P_Gn^min and P_Gn^max represent the lower and upper limits of the active power output of each generator, and Q_Gn^min and Q_Gn^max represent the lower and upper limits of the reactive power output of each generator; V_Gi^min and V_Gi^max represent the lower and upper voltage limits of each generator node, and V_Li^min and V_Li^max represent the voltage limits of each load node; ΔP_i is the active power imbalance at node i, ΔQ_i is the reactive power imbalance at node i, P_i is the active power injected at node i, Q_i is the reactive power injected at node i, V_i and V_j are the voltage magnitudes at nodes i and j, G_ij and B_ij are the conductance and susceptance of the branch, and θ_ij is the voltage phase angle difference; P_Gn and Q_Gn represent the active and reactive power of a generator, and P_Lm and Q_Lm represent the active and reactive power of a load node; δ > 0 denotes a small positive value close to 0; N is the number of generators and n indexes a generator; M is the number of loads and m indexes a load.
The power flow divergence reward is shown in the following formula:
The power control reward is shown in the following formula:
The voltage balance control reward is shown in the following formula:
Wherein R1 represents the power flow divergence reward function, R2 represents the power control reward function, R3 represents the voltage balance reward function, e() denotes exponential normalization, R_min represents the minimum reward during system operation, R_g represents the penalty imposed on the agent when the power flow diverges, ρ_w represents the current transmission power of transmission line w, X_w is the penalty coefficient for exceeding the power flow limit of transmission line w, R_0 is a default reward function reflecting the operating consistency of the power grid, W is the number of transmission lines and w indexes a line, P_G represents the active power of the generators, and Q_G represents the reactive power of the generators.
The wind and solar curtailment cost reward is shown in the following formula:
The unit operation cost reward is shown in the following formula:
Wherein R4 represents the power generation cost reward function, R5 represents the wind and solar curtailment cost reward function, a_n represents the cost coefficient of the quadratic term, b_n represents the cost coefficient of the linear term, and c_n represents the cost coefficient of the constant term; C_q represents the unit wind and solar curtailment cost coefficient; P_new^max represents the maximum output of a new energy generator, P_new is the actual output of a new energy generator, and N_new is the number of new energy generators, indexed by n_new.
The renewable energy output reward is shown in the following formula:
The carbon emission reward is shown in the following formula:
wherein R6 represents the renewable energy output reward function, R7 represents the carbon emission reward function, c_p represents the unit cost coefficient of conventional units, P_con represents the actual output of a conventional generator, and N_con is the number of conventional generators, indexed by n_con.
Constructing the constraint-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment specifically comprises the following steps:
wherein λ is a fixed reward value given by the constraint-based reward function, and C1 to C5 represent the constraint conditions of the novel energy system.
S3, obtaining a weighted reward according to the target-based reward function and the constraint-based reward function, and carrying out dispatching optimization of the power grid simulation environment by adopting an Actor-Critic network structure in combination with the weighted reward.
Obtaining the weighted reward according to the target-based reward function and the constraint-based reward function specifically comprises the following steps:
Where r is the weighted reward value, α_i() denotes the normalization applied to each reward value, and the weight coefficient of each reward term is typically in the range of 0 to 1.
The method is described in detail below with reference to the accompanying drawings:
1) State space construction
The input received by the agent determines, to a large extent, how well it performs. As shown in fig. 2, the state space required by the invention includes both numerical features and non-numerical features as input of the agent, so that the agent can access the whole state of the power network at each time step. The numerical features include generator active and reactive power, load active and reactive power, predicted load, line power and constraint power. The non-numerical features include the line on/off state, which is incorporated directly into the topological feature as an additional vector.
Wherein S is the state space, P_L and Q_L represent the load active and reactive power, P_G and Q_G represent the generator active and reactive power, P_Lk^(t+1) represents the predicted active demand of the k-th load node at the next time step, P_ρ represents the line power, Q_C represents the constraint power, and PR is an additional vector of line on/off states and similar quantities that represents the topology of the power system.
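The explicit expression of the state space is not reproduced above; based on the symbols just defined, a plausible form (an assumption rather than the source's exact expression) is:

```latex
S_t = \left[\, P_L,\; Q_L,\; P_G,\; Q_G,\; P_{L,k}^{\,t+1},\; P_{\rho},\; Q_C,\; PR \,\right]
```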
2) Action space construction
In the construction of the action space, as shown in fig. 3, a passive discriminator is set to run in the environment; feasible actions are screened only when the system is out of limit, and the simulation then continues until the iteration ends. The passive discriminator acts here as the first layer of insurance for the safety of the power system.
As shown in FIG. 4, the specific flow of the decision made by the passive discriminator is to calculate all power flow constraints of the system at the same time as the reward is calculated, after the agent takes an operation control action. If the action taken by the agent would cause a constraint violation, the action probabilities are sorted in descending order and the operation control action with the next highest probability is selected, and so on until the agent finds an action that satisfies the constraints. If no action can satisfy the constraint conditions, the agent executes the action that maximizes the current reward function. A minimal sketch of this screening logic is given below.
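The following Python sketch illustrates the passive-discriminator screening described above. It is a simplified illustration only; the callables `violates_constraints` and `reward` stand in for the actual power flow check and reward evaluation of the simulation environment, which are not specified in the source.

```python
import numpy as np

def passive_discriminator(action_probs, actions, state,
                          violates_constraints, reward):
    """Screen actions as described for the passive discriminator (fig. 4).

    action_probs         : policy probabilities for each candidate action
    actions              : the candidate operation-control actions
    violates_constraints : callable(state, action) -> bool, True if any
                           power flow constraint would be violated
    reward               : callable(state, action) -> float, current reward
    """
    # Try actions in descending order of policy probability.
    for idx in np.argsort(action_probs)[::-1]:
        if not violates_constraints(state, actions[idx]):
            return actions[idx]          # first action meeting all constraints
    # No action satisfies the constraints: fall back to the action
    # that maximizes the current reward function.
    rewards = [reward(state, a) for a in actions]
    return actions[int(np.argmax(rewards))]
```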
In the action screening process, the topology actions and the generation output adjustment actions that can restore the system from a power flow out-of-limit state to a safe operating state together form the action space for agent training:
Ω_t = Ω_{t-1} ∪ (Ω_{A,t-1} ∩ Ω_{TP,t-1}) ∪ (Ω_{A,t-1} ∩ Ω_{G,t})
Wherein, omega t and omega t+1 are respectively the movable sets in the t-step simulation and the t+1-step simulation, omega A,t is the movable set which can meet all constraint conditions in the t-step simulation and is better than the movable set which does not execute the action in the current rewarding, omega TP,t is all the feasible topological movable sets in the t-step simulation, and omega G,t is all the feasible power generation force adjustment movable sets in the t-step simulation.
3) Reward function architecture construction
3.1 Target-based reward function
(1) Security dispatch rewards
The security dispatch reward mainly involves three aspects of the novel power system: power flow divergence, active and reactive power control, and voltage balance control. According to the constraint conditions related to each aspect, the corresponding evaluation indexes are normalized and then converted into the corresponding reward functions, which together form the security dispatch reward of the novel power system.
The constraint conditions of the novel energy system are as follows:
Wherein C1 to C5 represent the constraint conditions; P_Gn^min and P_Gn^max represent the lower and upper limits of the active power output of each generator, and Q_Gn^min and Q_Gn^max represent the lower and upper limits of the reactive power output of each generator; V_Gi^min and V_Gi^max represent the lower and upper voltage limits of each generator node, and V_Li^min and V_Li^max represent the voltage limits of each load node; ΔP_i is the active power imbalance at node i, ΔQ_i is the reactive power imbalance at node i, P_i is the active power injected at node i, Q_i is the reactive power injected at node i, V_i and V_j are the voltage magnitudes at nodes i and j, G_ij and B_ij are the conductance and susceptance of the branch, and θ_ij is the voltage phase angle difference; P_Gn and Q_Gn represent the active and reactive power of a generator, and P_Lm and Q_Lm represent the active and reactive power of a load node; δ > 0 denotes a small positive value close to 0; N is the number of generators and n indexes a generator; M is the number of loads and m indexes a load.
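The explicit constraint expressions are not reproduced above. Based on the symbol definitions, one plausible reconstruction of C1 to C5 (stated here as an assumption consistent with standard AC power flow and operating-limit constraints, not as the exact expressions of the source) is:

```latex
\begin{aligned}
C_1:&\quad |\Delta P_i| \le \delta,\qquad
\Delta P_i = P_i - V_i \sum_{j} V_j\left(G_{ij}\cos\theta_{ij} + B_{ij}\sin\theta_{ij}\right)\\
C_2:&\quad |\Delta Q_i| \le \delta,\qquad
\Delta Q_i = Q_i - V_i \sum_{j} V_j\left(G_{ij}\sin\theta_{ij} - B_{ij}\cos\theta_{ij}\right)\\
C_3:&\quad P_{G,n}^{\min} \le P_{G,n} \le P_{G,n}^{\max},\qquad n = 1,\dots,N\\
C_4:&\quad Q_{G,n}^{\min} \le Q_{G,n} \le Q_{G,n}^{\max},\qquad n = 1,\dots,N\\
C_5:&\quad V_i^{\min} \le V_i \le V_i^{\max}\quad \text{for all generator and load nodes}
\end{aligned}
```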
The power flow divergence reward of the novel energy system is shown in the following formula:
The power control reward is shown in the following formula:
The voltage balance control reward is shown in the following formula:
Wherein R1, R2 and R3 take normalized negative values; the closer a value is to 0, the better the current operating state of the system and the closer it is to the optimization target. R1 represents the power flow divergence reward function, R2 represents the power control reward function, R3 represents the voltage balance reward function, e() denotes exponential normalization, R_min represents the minimum reward during system operation, R_g represents the penalty imposed on the agent when the power flow diverges, ρ_w represents the current transmission power of transmission line w, X_w is the penalty coefficient for exceeding the power flow limit of transmission line w, R_0 is a default reward function reflecting the operating consistency of the power grid, W is the number of transmission lines and w indexes a line, P_G represents the active power of the generators, and Q_G represents the reactive power of the generators.
(2) Economic dispatch rewards
The economic dispatch reward mainly involves two aspects, unit operation cost and wind and solar curtailment cost; the corresponding evaluation indexes are normalized and then converted into the corresponding reward functions.
The wind and solar curtailment cost reward is shown in the following formula:
The unit operation cost reward is shown in the following formula:
Wherein the values are normalized negative values; the closer to 0, the better the current operating state of the system and the closer it is to the optimization target. R4 is the power generation cost reward function, R5 is the wind and solar curtailment cost reward function, a_n is the cost coefficient of the quadratic term, b_n is the cost coefficient of the linear term, and c_n is the cost coefficient of the constant term; C_q represents the unit wind and solar curtailment cost coefficient; P_new^max represents the maximum output of a new energy generator, P_new is the actual output of a new energy generator, and N_new is the number of new energy generators, indexed by n_new.
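The exact formulas for R4 and R5 are not reproduced above. A plausible form consistent with the listed coefficients, before the exponential normalization e() is applied, is sketched below as an assumption:

```latex
\begin{aligned}
R_4 &\propto -\sum_{n=1}^{N}\left(a_n P_{G,n}^{2} + b_n P_{G,n} + c_n\right)
&&\text{(quadratic unit operation cost)}\\
R_5 &\propto -\,C_q \sum_{n_{\mathrm{new}}=1}^{N_{\mathrm{new}}}\left(P_{\mathrm{new}}^{\max} - P_{\mathrm{new}}\right)
&&\text{(wind and solar curtailment cost)}
\end{aligned}
```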
(3) Low carbon dispatch rewards
The low-carbon dispatch reward mainly involves two aspects, the output of renewable energy sources and carbon emission; the corresponding evaluation indexes are normalized and then converted into the corresponding reward functions.
The renewable energy output reward is shown in the following formula:
The carbon emission reward is shown in the following formula:
Wherein R6 takes a value between 0 and 1, and the closer it is to 1, the higher the renewable energy utilization rate of the current system; R7 takes a value between -1 and 0, and the closer it is to 0, the lower the carbon emission of the current system. R6 is the renewable energy output reward function, R7 is the carbon emission reward function, c_p is the unit cost coefficient of conventional units, P_con is the actual output of a conventional generator, and N_con is the number of conventional generators, indexed by n_con.
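The exact formulas for R6 and R7 are likewise omitted above. Given that R6 lies in [0, 1] and R7 in [-1, 0], one plausible reconstruction (an assumption, not the source's exact expressions) is:

```latex
\begin{aligned}
R_6 &= \frac{\sum_{n_{\mathrm{new}}=1}^{N_{\mathrm{new}}} P_{\mathrm{new}}}
            {\sum_{n_{\mathrm{new}}=1}^{N_{\mathrm{new}}} P_{\mathrm{new}}^{\max}}
&&\in [0,\,1]\\
R_7 &\propto -\,c_p \sum_{n_{\mathrm{con}}=1}^{N_{\mathrm{con}}} P_{\mathrm{con}},
\qquad \text{normalized so that } R_7 \in [-1,\,0]
\end{aligned}
```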
3.2 Constraint-based rewards
Since most of the above target-based dispatch rewards are negative, in order to enable the agent to fully explore the simulation environment, a positive reward is given when all target constraints are satisfied, encouraging the agent to explore within the region that meets the constraint conditions. This constraint-based reward plays the role of a second layer of insurance for grid safety in the method: out-of-limit actions are prevented at the action selection stage, while exploration is appropriately encouraged at the reward stage, ensuring a balance between exploration and exploitation:
wherein λ is a fixed reward value given by the constraint-based reward function, and C1 to C5 represent the constraint conditions of the novel energy system.
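The expression of the constraint-based reward is not reproduced above; based on the description (a fixed positive reward λ when all constraints hold), a plausible form, written here as an assumption with R_c denoting the constraint-based reward, is:

```latex
R_c =
\begin{cases}
\lambda, & \text{if } C_1, C_2, C_3, C_4, C_5 \text{ are all satisfied}\\[2pt]
0, & \text{otherwise}
\end{cases}
```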
Combining the above 8 reward functions, the final weighted reward value applied to the reinforcement learning agent is constructed as follows:
Where r is the weighted reward value, α_i() denotes the normalization applied to each reward value, and the weight coefficient of each reward term is typically in the range of 0 to 1.
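The weighted combination can be illustrated with the short Python sketch below. The weight values, the min-max normalization used for α_i, and the treatment of R_c as the eighth term are assumptions for illustration; the source only states that each weight lies between 0 and 1.

```python
def normalize(value, v_min, v_max):
    """alpha_i: map a raw reward value into [0, 1] (assumed min-max normalization)."""
    return (value - v_min) / (v_max - v_min) if v_max > v_min else 0.0

def weighted_reward(rewards, bounds, weights):
    """r = sum_i w_i * alpha_i(R_i) over the 8 reward terms R_1..R_7 and R_c."""
    return sum(w * normalize(r, lo, hi)
               for r, (lo, hi), w in zip(rewards, bounds, weights))

# Example with placeholder values for the 8 reward terms:
rewards = [-0.8, -0.3, -0.1, -0.5, -0.2, 0.7, -0.4, 1.0]   # R1..R7, R_c
bounds  = [(-1, 0)] * 5 + [(0, 1), (-1, 0), (0, 1)]        # assumed value ranges
weights = [0.2, 0.1, 0.1, 0.15, 0.1, 0.15, 0.1, 0.1]       # each in [0, 1]
print(weighted_reward(rewards, bounds, weights))
```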
4) Neural network training and algorithm flow based on A3C
The A3C algorithm used by the invention adopts an Actor-Critic network structure consisting of an Actor network and a Critic network, i.e., an actor network and a critic network. The goal of the Actor network is to learn a policy function, i.e., a probability distribution over actions in a given state. The goal of the Critic network is to learn a state value function or a state-action value function for evaluating the value of different states or state-action pairs. The training framework of the A3C algorithm used in the invention is shown in fig. 5.
The A3C algorithm adopted by the invention uses an asynchronous training mode, in which each thread interacts with the environment independently and gradient updates are realized through parameter sharing. After the neural network parameters are initialized, multiple parallel training threads are created; each thread independently runs an agent that interacts with the environment and uses the Actor and Critic networks to approximate the policy and value functions. Each thread interacts with the power grid environment independently, selects scheduling actions after action screening according to the current policy network, observes the new state and reward fed back by the grid environment, and stores this information in a local experience replay buffer. When a thread reaches a predetermined number of time steps or the end of a trajectory, it samples data from the experience replay buffer and performs gradient updates by calculating the advantage function. After a thread has executed several gradient updates, the updated parameters are transferred to the main thread for a global parameter update. The above process is repeated until a predetermined number of training rounds is reached or a termination condition is met. This training mode improves the efficiency and stability of training and enables better policies and value functions to be learned.
The specific algorithm flow in the invention is as follows:
The algorithm inputs are: the parameter θ of the policy network π and the parameter ω of the value network in the global A3C neural network; the corresponding parameters θ' and ω' in the A3C neural network of each branch thread; the global shared iteration counter T, the global maximum number of iterations T_max, the maximum length T_t of a time series within a thread, the action space, and the discount factor γ;
The algorithm outputs are the power grid dispatching policy π and the value function V.
1. Set the time series t = 1.
2. Reset the gradient update quantities of the global Actor network and the global Critic network: dθ ← 0 and dω ← 0. The purpose of this step is to clear the previously accumulated gradient information at the beginning of each new iteration, ensuring that the new calculation is not disturbed by old gradients.
3. Synchronize the parameters from the global Actor network and the global Critic network to the neural network of the branch thread: θ' = θ, ω' = ω. This step ensures that the neural network of the branch thread has the same initial parameters as the global network.
4. Initialize t_start = t and obtain the state s_t of time series t.
5. Select action Ω_t based on the policy π(Ω_t | s_t; θ).
6. Screen the action space and detect whether the action is out of limit.
7. Execute action Ω_t to obtain the multi-objective reward r_t and the next state s_t+1 given by the simulation environment.
8. Update the time steps: t ← t + 1, T ← T + 1.
9. If s_t is a terminal state, or the time series t reaches its maximum length, go to step 10; otherwise, return to step 5.
10. Calculate R(s, t) for the last state s_t of the time series, where:
11. Going backwards, for each time series index i ∈ (t-1, t-2, ..., t_start), perform the following operations:
12. Calculate R(s, i) = r_i + γR(s, i+1) for each instant.
13. Accumulate the gradient update of the Actor:
14. Accumulate the gradient update of the Critic:
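The expressions attached to steps 10, 13 and 14 are not reproduced above. In the standard A3C formulation, which this flow appears to follow, they would read as below; this is an assumption based on the standard algorithm rather than the source's exact equations:

```latex
\begin{aligned}
\text{Step 10:}\quad & R(s,t) =
\begin{cases}
0, & \text{if } s_t \text{ is terminal}\\
V(s_t;\,\omega'), & \text{otherwise}
\end{cases}\\[4pt]
\text{Step 13:}\quad & d\theta \leftarrow d\theta
  + \nabla_{\theta'} \log \pi(\Omega_i \mid s_i;\,\theta')\,
    \bigl(R(s,i) - V(s_i;\,\omega')\bigr)\\[2pt]
\text{Step 14:}\quad & d\omega \leftarrow d\omega
  + \frac{\partial\,\bigl(R(s,i) - V(s_i;\,\omega')\bigr)^{2}}{\partial \omega'}
\end{aligned}
```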
15. If i = t_start, go to step 16; otherwise, go to step 11.
16. Update the model parameters of the global neural network, i.e., asynchronously update θ using dθ and asynchronously update ω using dω.
17. If T > T_max, end the loop and output the parameters θ and ω of the global A3C neural network; otherwise, go to step 3.
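To make steps 1-17 concrete, the PyTorch sketch below implements one n-step update of a branch thread: it computes R(s, i) backwards, forms the actor and critic losses from the advantage, pushes the accumulated gradients onto the global network, and re-synchronizes the local parameters. It is a minimal illustration under common A3C conventions (shared optimizer over the global network, discrete action space), not the exact implementation of the invention; network sizes and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Shared-body Actor-Critic network: a policy head and a value head."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.pi = nn.Linear(hidden, n_actions)   # Actor head (policy logits)
        self.v = nn.Linear(hidden, 1)            # Critic head (state value)

    def forward(self, s):
        h = self.body(s)
        return F.softmax(self.pi(h), dim=-1), self.v(h).squeeze(-1)

def worker_update(local_net, global_net, optimizer, trajectory, gamma=0.99):
    """One n-step update of a branch thread (steps 10-16): compute R(s, i)
    backwards, accumulate actor/critic gradients, apply them to the global
    network, then re-synchronize the local network."""
    states, actions, rewards, last_state, done = trajectory

    # Step 10: bootstrap value of the last state (0 if terminal).
    R = 0.0 if done else local_net(last_state.unsqueeze(0))[1].item()

    # Steps 11-12: n-step returns, computed from back to front.
    returns = []
    for r in reversed(rewards):              # R(s, i) = r_i + gamma * R(s, i+1)
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns, dtype=torch.float32)

    probs, values = local_net(torch.stack(states))
    dist = torch.distributions.Categorical(probs)
    log_probs = dist.log_prob(torch.tensor(actions))
    advantage = returns - values

    # Steps 13-14: accumulated actor and critic gradients via the loss.
    actor_loss = -(log_probs * advantage.detach()).sum()
    critic_loss = advantage.pow(2).sum()
    loss = actor_loss + 0.5 * critic_loss

    local_net.zero_grad()
    loss.backward()
    # Step 16: push local gradients onto the global network and update it
    # (the optimizer is assumed to be built over global_net.parameters()).
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp._grad = lp.grad
    optimizer.step()
    # Step 3 of the next round: synchronize the local network with the global one.
    local_net.load_state_dict(global_net.state_dict())
```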
The A3C algorithm is compared with the Random algorithm, i.e., a random power grid dispatching algorithm. The Random algorithm randomly chooses a scheduling action when the power flow is out of limit, without considering any policy.
(1) Average cumulative return comparison
Simulation experiments were performed herein on the IEEE-5 and IEEE-36 data sets using the A3C algorithm, and the average cumulative return of each episode was calculated. The results are shown in the figure:
The multi-objective reward function value is a large negative number at the beginning and, after multiple oscillations during optimization, gradually converges to a positive value, demonstrating the validity of the algorithm.
(2) Comparison of outage time
Because the IEEE-5 data set lacks renewable energy generating units, the IEEE-36 data set is selected for the sustainability comparison of the novel energy system discussed herein, and 6 groups of time-series segment data are randomly extracted as the test set.
It can be seen that the Random algorithm suffers power outages in 5 of the data segments, with the outage time exceeding 50%. The weight-adjusted A3C algorithm has no outage at all and can effectively realize safe operation of the system, which shows the good adaptability of the algorithm combined with weight adjustment and ensures safe operation of the novel energy system to the greatest extent.
The IEEE-5 node data set comprises related data of 5 substations, 8 power lines, 2 units and 3 loads.
The IEEE-36 node data comprises relevant data of 36 substations, 59 power lines, 22 generators (including 12 renewable energy generators such as wind generators, light generators and the like) and 37 loads.
For the actual case analysis and model effect verification, a standard example of a certain provincial grid rack, SG-126, and its operating conditions are taken as the experimental object. The figure shows the grid topology of the SG-126 standard example, which comprises 54 generator sets (including 18 new energy units), 91 loads (including ordinary loads and adjustable loads), 5 energy storage devices, 126 buses and 185 branch lines, with 145 nodes in total. The installed proportion of new energy exceeds 30%, and the adjustable units of this example include energy storage equipment and adjustable loads in addition to various types of generators, which provides a larger decision space. In addition, based on the fluctuation of new energy output, load curve characteristics, and energy storage and flexible load control strategies of a real power grid, the SG-126 example simulates open characteristics of the real grid environment such as line congestion, random line faults and weather changes, and covers typical grid operation scenarios such as tie-line congestion, tie-line N-1 faults, severe source-load fluctuation and new energy curtailment. It contains 100,000 continuous convergent AC power flow sections and one year of prediction data, which further increases the decision difficulty.
In order to verify the effectiveness of the technical scheme, actual case analysis and model effect verification are carried out on this example. The experiment is based on actual grid operation data, the improved A3C algorithm is adopted for training and testing, and the evaluation index is the average cumulative return. The experimental process includes data preprocessing, model training and model testing. In the training stage, constraint violations are prevented by the double insurance mechanism that combines action screening with the constraint-based reward; in the testing stage, the performance of the model in different scenarios is evaluated. The results show that the average cumulative return of the model gradually increases with training and finally reaches a high level in the testing stage, which proves the effectiveness and robustness of the model. The experimental results show that the A3C algorithm of the technical scheme performs well on the SG-126 grid rack, can effectively optimize action space screening and prevent constraint violations, and further verifies the feasibility and effectiveness of the technical scheme in practical application, in particular its innovation in data analysis, action space screening and constraint processing.
Example 2
The invention provides a power grid simulation environment scheduling optimization system based on reinforcement learning, as shown in fig. 6, comprising:
the power grid parameter acquisition module is used for constructing a state space of a power grid simulation environment and an action space of the power grid simulation environment;
the construction of the state space of the power grid simulation environment and the action space of the power grid simulation environment comprises the following concrete steps:
Acquiring numerical characteristics and non-numerical characteristics, and constructing a state space of a power grid simulation environment based on the numerical characteristics and the non-numerical characteristics, wherein the numerical characteristics comprise generator active power, generator reactive power, load active power, load reactive power, predicted load, line power and constraint power;
in the construction of the action space of the power grid simulation environment, a passive discriminator is arranged to run in the environment. The specific process of the passive discriminator is as follows: after the agent takes an operation control action, all power flow constraints of the system are calculated at the same time as the reward is calculated; if the action taken by the agent causes a constraint violation, the action probabilities are sorted in descending order and the operation control action with the next highest probability is selected, and so on until the agent finds an action that satisfies the constraint conditions; if no action can satisfy the constraint conditions, the agent executes the action that maximizes the current reward function.
The reward function construction module is used for constructing a target-based reward function and a constraint-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment;
Constructing the target-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment specifically comprises the following steps:
the target-based reward function comprises a first power grid dispatching reward, a second power grid dispatching reward and a third power grid dispatching reward of the novel energy system, wherein the first power grid dispatching reward comprises a power flow divergence reward, a power control reward and a voltage balance control reward;
The constraint conditions of the novel energy system are as follows:
Wherein C1 to C5 represent the constraint conditions; P_Gn^min and P_Gn^max represent the lower and upper limits of the active power output of each generator, and Q_Gn^min and Q_Gn^max represent the lower and upper limits of the reactive power output of each generator; V_Gi^min and V_Gi^max represent the lower and upper voltage limits of each generator node, and V_Li^min and V_Li^max represent the voltage limits of each load node; ΔP_i is the active power imbalance at node i, ΔQ_i is the reactive power imbalance at node i, P_i is the active power injected at node i, Q_i is the reactive power injected at node i, V_i and V_j are the voltage magnitudes at nodes i and j, G_ij and B_ij are the conductance and susceptance of the branch, and θ_ij is the voltage phase angle difference; P_Gn and Q_Gn represent the active and reactive power of a generator, and P_Lm and Q_Lm represent the active and reactive power of a load node; δ > 0 denotes a small positive value close to 0; N is the number of generators and n indexes a generator; M is the number of loads and m indexes a load.
The power flow divergence reward is shown in the following formula:
The power control reward is shown in the following formula:
The voltage balance control reward is shown in the following formula:
Wherein R1 represents the power flow divergence reward function, R2 represents the power control reward function, R3 represents the voltage balance reward function, e() denotes exponential normalization, R_min represents the minimum reward during system operation, R_g represents the penalty imposed on the agent when the power flow diverges, ρ_w represents the current transmission power of transmission line w, X_w is the penalty coefficient for exceeding the power flow limit of transmission line w, R_0 is a default reward function reflecting the operating consistency of the power grid, W is the number of transmission lines and w indexes a line, P_G represents the active power of the generators, and Q_G represents the reactive power of the generators.
The wind and solar curtailment cost reward is shown in the following formula:
The unit operation cost reward is shown in the following formula:
Wherein R4 represents the power generation cost reward function, R5 represents the wind and solar curtailment cost reward function, a_n represents the cost coefficient of the quadratic term, b_n represents the cost coefficient of the linear term, and c_n represents the cost coefficient of the constant term; C_q represents the unit wind and solar curtailment cost coefficient; P_new^max represents the maximum output of a new energy generator, P_new is the actual output of a new energy generator, and N_new is the number of new energy generators, indexed by n_new.
The renewable energy output reward is shown in the following formula:
The carbon emission reward is shown in the following formula:
wherein R6 represents the renewable energy output reward function, R7 represents the carbon emission reward function, c_p represents the unit cost coefficient of conventional units, P_con represents the actual output of a conventional generator, and N_con is the number of conventional generators, indexed by n_con.
Constructing the constraint-based reward function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment specifically comprises the following steps:
wherein λ is a fixed reward value given by the constraint-based reward function, and C1 to C5 represent the constraint conditions of the novel energy system.
And the dispatching optimization module is used for acquiring weighted rewards according to the rewards function based on the target and the rewards function based on the constraint, and carrying out dispatching optimization on the power grid simulation environment by adopting an Actor-Critic network structure and combining the weighted rewards.
Obtaining the weighted reward according to the target-based reward function and the constraint-based reward function specifically comprises the following steps:
Where r is the weighted reward value, α_i() denotes the normalization applied to each reward value, and each reward term has a corresponding weight coefficient.
Example 3
Referring to fig. 7, the present invention further provides an electronic device 100 for the reinforcement-learning-based power grid simulation environment scheduling optimization method. The electronic device 100 includes a memory 101, at least one processor 102, a computer program 103 stored in the memory 101 and executable on the at least one processor 102, and at least one communication bus 104.
The memory 101 may be used to store the computer program 103, and the processor 102 implements the steps of the reinforcement-learning-based power grid simulation environment scheduling optimization method described in Embodiment 1 by running or executing the computer program stored in the memory 101 and invoking the data stored in the memory 101. The memory 101 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data (such as audio data) created according to the use of the electronic device 100. In addition, the memory 101 may include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one disk storage device, a flash memory device, or other non-volatile solid-state storage device.
The at least one processor 102 may be a Central Processing Unit (CPU), but may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 102 may be a microprocessor or any conventional processor; the processor 102 is the control center of the electronic device 100 and uses various interfaces and lines to connect the various parts of the entire electronic device 100.
The memory 101 in the electronic device 100 stores a plurality of instructions to implement a reinforcement learning based grid simulation environment scheduling optimization method, and the processor 102 is configured to execute the plurality of instructions to implement:
constructing a state space of a power grid simulation environment and an action space of the power grid simulation environment;
constructing a target-based rewarding function and a constraint-based rewarding function according to the action space of the power grid simulation environment and the state space of the power grid simulation environment;
And acquiring weighted rewards according to the rewards function based on the target and the rewards function based on the constraint, and carrying out dispatching optimization on the power grid simulation environment by adopting an Actor-Critic network structure and combining the weighted rewards.
Example 4
The modules/units integrated in the electronic device 100 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as a stand alone product. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, and a Read-Only Memory (ROM).
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the specific embodiments of the present invention without departing from the spirit and scope of the present invention, and any modifications and equivalents are intended to be included in the scope of the claims of the present invention.