CN117117989A - Deep reinforcement learning solving method for unit combination
- Publication number: CN117117989A
- Application number: CN202311096902.5A (CN202311096902A)
- Authority: CN (China)
- Prior art keywords: unit, wind power, state, value, reinforcement learning
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H02J3/466—Scheduling the operation of the generators, e.g. connecting or disconnecting generators to meet a given demand
- G06F17/10—Complex mathematical operations
- G06N3/092—Reinforcement learning
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
- H02J3/008—Circuit arrangements for AC mains or AC distribution networks involving trading of energy or energy transmission rights
- H02J2203/10—Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
- H02J2203/20—Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
- H02J2300/20—The dispersed energy generation being of renewable origin
- H02J2300/28—The renewable source being wind energy
Abstract
The invention provides a unit combination deep reinforcement learning solving method, which comprises the following steps: S1: receiving unit parameters and the network topology, and establishing a unit combination model based on conventional optimization; S2: modeling a Markov decision process based on the conventionally optimized unit combination model; S3: improving the unit combination Markov decision process to account for wind power uncertainty; S4: building a deep reinforcement learning model for solving the Markov decision process; S5: receiving historical wind power and load data to train the parameters of the deep reinforcement learning model; S6: using the trained deep reinforcement learning model to solve the unit combination problem considering wind power uncertainty, thereby obtaining a unit combination scheduling scheme. By training the deep reinforcement learning model parameters on historical wind power data, the method adaptively learns the influence of wind power uncertainty on the scheduling scheme, so that an optimal unit combination decision scheme can be made.
Description
Technical Field
The invention relates to the technical field of power grid unit combination, in particular to a unit combination deep reinforcement learning solving method.
Background
Unit combination (unit commitment) is a classical problem in power system operation. It requires determining the start-stop schedule of the generating units so as to meet the power demand at minimum cost. Because the start-up or shut-down of a thermal unit can take several hours, the unit start-stop schedule usually has to be determined hours or days in advance to keep the system running stably. In recent years, as the penetration of renewable energy in traditional power grids has grown, the high utilization of renewable energy must additionally be considered when preparing unit combination schemes. Moreover, the uncertainty caused by renewable power forecast errors creates new challenges for system operators: a reliable and economical unit scheduling scheme must be formulated to cope with the uncertainty of the system.
The conventional deterministic unit combination problem is usually formulated as a mixed-integer optimization problem, which can be solved numerically but whose computation time grows exponentially with problem size. As a branch of machine learning, reinforcement learning aims to derive an approximation of the optimal decision strategy so as to maximize system performance. Reinforcement learning algorithms are powerful tools for finding optimal solutions to complex problems and have shown impressive results in recent years in the game domain, where an agent can surpass human expert performance on its own without prior knowledge. Reinforcement learning methods are currently used to solve the unit combination problem and can greatly shorten the solution time. The conventional approach first constructs a unit combination mathematical model with minimum generation cost as the objective function and power balance, unit output limits and minimum start-stop time constraints as the constraint conditions, then describes the unit combination problem under a reinforcement learning framework by defining the state space, action space and reward function, and finally trains and solves the unit combination problem model with a deep Q-network algorithm.
The above work is limited to solving the deterministic unit combination problem and does not consider power system uncertainties such as the wind power output forecast error.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a unit combination deep reinforcement learning solving method that trains the deep reinforcement learning model parameters on historical wind power data to adaptively learn the influence of wind power uncertainty on the scheduling scheme, thereby making an optimal unit combination decision scheme.
In order to achieve the above purpose, the present invention adopts the following scheme:
the invention provides a unit combination deep reinforcement learning solving method, which comprises the following steps:
S1: receiving unit parameters and the network topology, and establishing a unit combination model based on conventional optimization;
S2: modeling a Markov decision process based on the conventionally optimized unit combination model;
S3: improving the unit combination Markov decision process to account for wind power uncertainty;
S4: building a deep reinforcement learning model for solving the Markov decision process;
S5: receiving historical wind power and load data to train the parameters of the deep reinforcement learning model;
S6: using the trained deep reinforcement learning model to solve the unit combination problem considering wind power uncertainty, thereby obtaining a unit combination scheduling scheme.
In some embodiments, the present invention further includes the following technical features:
in step S1, the unit combination model based on the conventional optimization includes an objective function, a unit operation constraint and a power system constraint;
the objective function expression is:

\[ \min \sum_{t=1}^{T}\sum_{i=1}^{N}\left[ u_{i,t}\, f_i(P_{i,t}) + S_{i,t} \right] \]

wherein T is the scheduling horizon, N is the number of units, u_{i,t} is the start-stop state of the i-th unit in period t, P_{i,t} is the output of the i-th unit in period t, f_i(\cdot) is the unit fuel cost function, and S_{i,t} is the unit start-stop cost;

the unit fuel cost function expression is:

\[ f_i(P_{i,t}) = \alpha_i P_{i,t}^{2} + \beta_i P_{i,t} + \gamma_i \]

wherein \alpha_i, \beta_i and \gamma_i are the parameters of the unit cost quadratic curve and P_{i,t} is the unit output;

the unit operation constraints comprise the unit capacity constraint, the unit minimum start-stop time constraint and the ramp rate constraint;

the power system constraints comprise the system power balance constraint and the spinning reserve constraint.
The unit capacity constraint represents the upper and lower limits of unit output in each period:

\[ u_{i,t}\,\underline{P}_i \le P_{i,t} \le u_{i,t}\,\overline{P}_i \]

wherein \underline{P}_i is the lower output limit of the unit, \overline{P}_i is the upper output limit of the unit, and u_{i,t} is the start-stop state of the unit;

the unit minimum start-stop time constraint represents the minimum duration a unit must remain in the on or off state once started up or shut down (in general, the larger the unit capacity, the longer this duration):

\[ (\tau^{on}_{i,t-1} - TO_i)(u_{i,t-1} - u_{i,t}) \ge 0, \qquad (\tau^{off}_{i,t-1} - TS_i)(u_{i,t} - u_{i,t-1}) \ge 0 \]

wherein TS_i is the minimum continuous shutdown time of the unit, TO_i is the minimum continuous start-up time of the unit, and \tau^{on}_{i,t-1}/\tau^{off}_{i,t-1} are the numbers of periods the unit has been continuously on/off;

the ramp rate constraint represents that the amount by which the power output of a generator may increase or decrease within a period is limited:

\[ P_{i,t} - P_{i,t-1} \le P_{up,i}, \qquad P_{i,t-1} - P_{i,t} \le u_{i,t}\,P_{down,i} + (1 - u_{i,t})\,P_{shut,i} \]

wherein P_{up,i} is the upward ramp rate of the unit, P_{down,i} is the downward ramp rate of the unit, and P_{shut,i} is the maximum power output allowed when the unit shuts down;

the power balance constraint expression is:

\[ \sum_{i=1}^{N} u_{i,t}\,P_{i,t} + P_{w,t} = P_{D,t} \]

wherein P_{D,t} is the aggregate load of the system during period t and P_{w,t} is the wind power output during period t;

to ensure stable operation of the system, the spinning reserve is typically set to 10% of the system load; the spinning reserve constraint expression is:

\[ \sum_{i=1}^{N} u_{i,t}\,\overline{P}_i \ge P_{D,t} + R_t, \qquad R_t = 0.1\,P_{D,t} \]

wherein R_t is the spinning reserve capacity of the system during period t.
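For readers who want to check a candidate schedule against the model above, the following is a minimal illustrative sketch (not part of the patent text; the array shapes, names and vectorized checks are assumptions based on the definitions given here):

```python
import numpy as np

def evaluate_schedule(u, P, alpha, beta, gamma, start_cost,
                      P_min, P_max, load, wind, reserve_frac=0.1):
    """Total cost and rough feasibility of a candidate unit-combination schedule.

    u, P        : (T, N) commitment (0/1) and dispatch (MW) matrices
    alpha..gamma, start_cost, P_min, P_max : per-unit parameters, length N
    load, wind  : per-period system load and wind output, length T
    """
    fuel = np.sum(u * (alpha * P ** 2 + beta * P + gamma))
    # a start-up cost is incurred whenever a unit switches from off to on
    switched_on = np.clip(np.diff(u, axis=0, prepend=u[:1]), 0, 1)
    startup = np.sum(switched_on * start_cost)

    capacity_ok = np.all((P >= u * P_min) & (P <= u * P_max))
    balance_ok = np.allclose(P.sum(axis=1) + wind, load)
    reserve_ok = np.all((u * P_max).sum(axis=1) >= (1 + reserve_frac) * load)
    return fuel + startup, capacity_ok and balance_ok and reserve_ok
```

The minimum start-stop time and ramp rate constraints are omitted from this sketch for brevity.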
In step S2, the Markov decision process comprises a quintuple of state space, action space, reward function, environment transition probability and discount factor. Constructing this quintuple to convert the unit combination model into a Markov decision process specifically comprises the following steps:

S201: constructing the action space:

\[ A_t = [a_{1,t}, \ldots, a_{N,t}], \qquad a_{n,t} \in \{0,1\} \]

wherein A_t is a binary vector whose length is the number of units N; a_{n,t} = 1 means that the n-th unit is turned on at time t+1 and remains on for that period, and a_{n,t} = 0 means that the n-th unit is turned off at time t+1 and remains off for that period; the action space dimension of the whole Markov decision process is 2^N;

S202: constructing the state space:

\[ s_t = (b_i, d_{t+1}, w_{t+1}, \tau_{i,t}, t) \]

wherein t is the current scheduling period; the vector b_i contains the basic information of unit i, such as the minimum start-up time, the start-up cost and the parameters of the quadratic fuel curve; d_{t+1} is the load forecast for period t+1; w_{t+1} is the wind power generation forecast for period t+1; \tau_{i,t} is the number of periods unit i has been on/off at the current time t;

S203: constructing the reward function, which is a scalar returned by the environment to the agent; it reflects the quality of the state-action mapping currently adopted by the agent and guides the agent to adjust its decision strategy according to the magnitude of the reward. The reward function expression is:

\[ r_t = -\left( C^{fuel}_t + C^{start}_t + C^{shed}_t \right) \]

wherein C^{fuel}_t is the unit fuel cost, C^{start}_t is the unit start-up cost and C^{shed}_t is the load shedding penalty; the load shedding penalty expression is:

\[ C^{shed}_t = c_{voll}\, P^{shed}_t \]

wherein c_{voll} is the cost per megawatt-hour of curtailed load and P^{shed}_t is the curtailed load, counted at the load shedding resolution \zeta, generally taken as 0.1% of the load; the load shedding penalty penalizes the agent so that undesirable actions causing load shedding are avoided, which helps the agent explore feasible solutions;

S204: computing the environment transition probability:

\[ p(s_{t+1}, r \mid s_t, a_t) \]

where the environment transition probability is determined by an environment transition function that moves the current state of the system from s_t to s_{t+1} and returns a reward value r; in the unit combination problem, the environment transition function should strictly exclude illegal actions, such as start-stop actions that violate the unit minimum start-stop time constraint;

the environment transition comprises the unit running state transition, the wind power forecast transition and the load forecast transition; the wind power forecast transition and the load forecast transition simply update the forecasts to the next period's data;

the unit running state transfer function expression is:

\[ \tau_{i,t+1} = \begin{cases} \max(\tau_{i,t}, 0) + 1, & a_{i,t} = 1 \\ \min(\tau_{i,t}, 0) - 1, & a_{i,t} = 0 \end{cases} \]

wherein \tau_{i,t} is the number of periods the unit has currently been running/stopped; given the action taken by the agent, this state transition is deterministic;

S205: determining the discount factor: the discount factor represents how much the agent values long-term rewards relative to short-term rewards, and its value lies between 0 and 1.
In some embodiments, the discount factor takes 0.9.
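To make the quintuple described in S201-S205 concrete, the following is a minimal, gym-style environment skeleton; the class name, method names and the placeholder dispatch-cost routine are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

class UnitCombinationEnv:
    """Illustrative sketch of the unit-combination Markov decision process."""

    def __init__(self, unit_info, load_forecast, wind_forecast, gamma=0.9):
        self.units = unit_info            # basic unit information b_i
        self.load = load_forecast         # load forecasts d_{t+1}
        self.wind = wind_forecast         # wind forecasts w_{t+1}
        self.gamma = gamma                # discount factor
        self.N, self.T = len(unit_info), len(load_forecast)

    def reset(self):
        self.t = 0
        # signed on/off counter tau_i: +k = on for k periods, -k = off for k
        self.tau = np.array([u["initial_status"] for u in self.units])
        return self._state()

    def _state(self):
        return dict(t=self.t, tau=self.tau.copy(),
                    load=self.load[self.t], wind=self.wind[self.t])

    def _period_cost(self, action):
        """Placeholder: a full implementation would solve the economic
        dispatch and add fuel, start-up and load-shedding costs."""
        return float(np.sum(action))      # stub so the sketch runs end to end

    def step(self, action):               # action: length-N 0/1 vector
        on = action.astype(bool)          # deterministic counter transition
        self.tau = np.where(on, np.maximum(self.tau, 0) + 1,
                                np.minimum(self.tau, 0) - 1)
        reward = -self._period_cost(action)   # reward = negative cost
        self.t += 1
        done = self.t >= self.T
        return (self._state() if not done else None), reward, done, {}
```

An agent would call reset() once per scheduling day and step() once per period, matching the period-by-period interaction described in step S5.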
The step S3 specifically comprises the following steps:

S301: performing strategy learning with real wind power data;

to exploit the data-driven nature of the model-free reinforcement learning method, the complexity of explicitly modeling the randomness of the power system is avoided and strategy learning uses real historical data, i.e., the agent's strategy is learned from wind power data collected in the real world; during training, the relationship between the wind power forecast and the actual measurement is learned, and an optimal scheduling scheme that can adapt to the actual test environment is provided on the basis of grasping the distribution characteristics of the forecast and measured wind power;

S302: adopting a Markov decision process based on state perturbation;

in the state-perturbed Markov decision process, the state transition between actual wind power and predicted wind power is corrected: the agent's perturbed state observation is defined as the predicted wind power, so the agent takes the corresponding action according to its stochastic policy; the environment still transitions from the real state to the next state, and the agent's reward is calculated from the actual wind power value; the action the agent executes based on the predicted state under the originally learned strategy may be suboptimal, so the reward decreases accordingly, and actions that earn lower rewards under the predicted state are adopted less often during exploration;

S303: adopting an improved state space; the state space should take into account as many factors affecting the decision as possible; when defining the state space, the time-series correlation problem is considered, and the change of the wind power forecast between adjacent periods is added to the state space, represented by the first-order difference of the wind power over two adjacent periods;
the change of the wind power forecast between adjacent periods is expressed as:

\[ \Delta P_{w,t} = P_{w,t} - P_{w,t-1} \]

wherein P_{w,t-1} is the wind power forecast in period t-1 and P_{w,t} is the wind power forecast in period t;

the deviation between the forecast and the measured wind power is expressed as:

\[ \Delta P'_{w,t-1} = P_{w',t-1} - P_{w,t-1} \]

wherein P_{w,t-1} is the wind power forecast in period t-1 and P_{w',t-1} is the measured wind power output in period t-1;

further, the improved state space expression is:

\[ s_t = \left(b_i, d_{t+1}, w_{t+1}, \tau_{i,t}, t, \Delta P_{w,t}, \Delta P'_{w,t-1}\right) \]
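A small sketch of assembling the improved state vector above; the flattening order and argument names are assumptions consistent with the definitions just given.

```python
import numpy as np

def build_state(unit_info, load_next, wind_next, tau, t,
                wind_pred_prev, wind_pred_curr, wind_meas_prev):
    """Concatenate unit information, forecasts, on/off counters, the period
    index, the first-order wind-forecast difference and the previous-period
    forecast deviation into one flat observation vector."""
    d_wind = wind_pred_curr - wind_pred_prev      # adjacent-period change
    dev_wind = wind_meas_prev - wind_pred_prev    # forecast vs. measurement
    return np.concatenate([np.ravel(unit_info),
                           [load_next, wind_next],
                           np.ravel(tau),
                           [t, d_wind, dev_wind]]).astype(np.float32)
```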
in step S5, the training comprises the following steps:

S501: receiving historical wind power and load data for training, wherein the historical wind power data must include both wind power forecast data and measured wind power data;

S502: the action network interacts with the environment and stores the obtained key information in an experience replay pool; the output of the action network is the unit combination scheduling scheme for the next period, and the output of the evaluation network is the state value; during training, the agent interacts with the unit combination reinforcement learning environment period by period: the agent obtains the environment state defined in step S302 and outputs an action, and the unit combination reinforcement learning environment returns the scalar reward defined in step S203 to the agent; the state, action and reward of every period of the whole scheduling cycle are stored in the experience pool;

S503: the evaluation network calculates the state value and the discounted reward, the advantage function is then computed from these results, and the network weights of the evaluation network are updated;

S504: the importance sampling ratio is calculated through the sampling network so as to update the weights of the action network; the sampling network and the action network share the same weights, but the update of the sampling network's weight parameters lags behind that of the action network.
In step S4, the deep reinforcement learning model is a policy-based proximal policy optimization model;

the proximal policy optimization model is based on an actor-critic framework with an action network and an evaluation network, and is a policy-based deep reinforcement learning algorithm. Its characteristics are as follows: the policy function is approximated and optimized by maximizing a policy objective function, and the finally optimized policy represents a probability distribution over the agent's actions; in a policy-based method the optimal policy generates actions with a certain probability, so even when the observations are identical the results may differ. The policy objective function evaluates the quality of the current policy, and its expression is:

\[ J(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(s, a)\, R_{s,a} \]

wherein d^{\pi_\theta}(s) is the probability that the system environment is in state s under the current policy \pi_\theta, \pi_\theta(s, a) is the probability that the agent takes action a after observing state s according to the current policy \pi_\theta, and R_{s,a} is the reward value;
further, the proximal policy optimization model expression is:

\[ J_{PPO}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t) \right] - \beta\, KL(\theta, \theta') \]

wherein KL(\theta, \theta') is the KL divergence describing the distance between probability distribution \theta and probability distribution \theta', and A^{\theta'}(s_t, a_t) is the advantage function under policy \theta'; the proximal policy optimization model places the KL divergence directly into the objective function to be optimized, so that a gradient-ascent algorithm can be used to maximize the objective.

The proximal policy optimization model must be applicable to a discrete action space; discretizing it involves the following: in the original evaluation network, the Q function takes the state and the action as input and outputs a Q value, because there are infinitely many possible actions; in the unit combination Markov decision process the action space is limited to 2^N actions, so the network can instead output a vector containing the Q value of every action. The action network no longer needs to output the mean and covariance of a continuous action distribution; instead it directly outputs a categorical distribution over all actions in the action space. In this patent, a softmax function is applied to the last network layer to ensure a valid probability distribution between 0 and 1.
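An illustrative actor/critic pair consistent with the discretization just described (layer widths and names are arbitrary assumptions): the actor ends in a softmax over the 2^N commitment actions, and the critic outputs a scalar state value.

```python
import torch.nn as nn

class DiscreteActor(nn.Module):
    """Maps a state vector to a categorical distribution over 2**N actions."""
    def __init__(self, state_dim, n_units, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 ** n_units),
            nn.Softmax(dim=-1),           # valid probabilities in (0, 1)
        )

    def forward(self, state):
        return self.net(state)            # action probabilities

class Critic(nn.Module):
    """Maps a state vector to a scalar state-value estimate."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)
```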
The beneficial effects of the invention are as follows:
1. The method provided by the invention can adaptively account for wind power uncertainty. By reflecting the influence of the wind power output forecast error on the scheduling scheme in the reward function, the uncertainty can be learned from historical data alone, which avoids the precise description of random variable distributions and the manual choice of an uncertainty budget required by traditional robust or stochastic optimization algorithms, and thus avoids the influence of subjective factors.
2. In the method provided by the invention, a large amount of computation is moved offline into the training stage, and scheduling is completed with the trained model parameters, so the solution time can be greatly shortened while the solution remains an accurate approximation of the optimum. Compared with traditional optimization algorithms, the method outputs the decision scheme directly through the trained deep reinforcement learning agent without relying on a solver.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a schematic diagram of the unit combination Markov decision process established in the present invention;
FIG. 3 is a schematic diagram of the state transition process taking wind power uncertainty into account established in the present invention;
FIG. 4 shows the training process of the action network and evaluation network of the proximal policy optimization algorithm of the present invention;
FIG. 5 is a schematic diagram of scheduling results for a 5-unit system under wind power uncertainty in an embodiment;
FIG. 6 is a graph comparing the results of the calculation of the solution by the deep reinforcement learning model and the two-stage robust optimization in the embodiment.
Detailed Description
In order to make the technical solution and advantages of the present invention more clear, the technical solution of the embodiments of the present invention will be fully described below with reference to the accompanying drawings in the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
The unit combination deep reinforcement learning solving method disclosed in the embodiment is specifically described below:
1. example test conditions
A traditional 5-unit system is taken as an example for analysis. The system comprises 5 thermal units, wind power generation with uncertainty is connected to the test system, and the unit operating parameters are shown in Table 1, where parameter 1, parameter 2 and parameter 3 are the coefficients of each unit's quadratic fuel cost function.
TABLE 1 Example unit parameters
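Since Table 1 is reproduced only as an image in the original, the sketch below merely lists the fields such a parameter record needs to drive the model of step S1; the field names are illustrative and no values from the patent are reproduced.

```python
from dataclasses import dataclass

@dataclass
class UnitParameters:
    """Per-unit operating parameters (illustrative field names only)."""
    alpha: float         # parameter 1: quadratic fuel-cost coefficient
    beta: float          # parameter 2: linear fuel-cost coefficient
    gamma: float         # parameter 3: constant fuel-cost coefficient
    p_min: float         # lower output limit (MW)
    p_max: float         # upper output limit (MW)
    min_on: int          # minimum continuous start-up time TO_i (periods)
    min_off: int         # minimum continuous shutdown time TS_i (periods)
    ramp_up: float       # upward ramp rate P_up,i (MW per period)
    ramp_down: float     # downward ramp rate P_down,i (MW per period)
    startup_cost: float  # start-up cost used in the start-stop cost term
```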
2. Model building, transformation and solving
Referring to FIG. 1, the present invention specifically comprises the following steps:
S1: receiving unit parameters and the network topology, and establishing a unit combination model based on conventional optimization;
S2: modeling a Markov decision process based on the conventionally optimized unit combination model;
S3: improving the unit combination Markov decision process to account for wind power uncertainty;
S4: building a deep reinforcement learning model for solving the Markov decision process;
S5: receiving historical wind power and load data to train the parameters of the deep reinforcement learning model;
S6: using the trained deep reinforcement learning model to solve the unit combination problem considering wind power uncertainty, thereby obtaining a unit combination scheduling scheme.
The following is a specific description:
S1: the traditional unit combination problem is established as a mixed-integer quadratic optimization problem; the established model is as follows:
the unit combination model comprises an objective function, unit operation constraint and power system constraint;
the objective function expression is:

\[ \min \sum_{t=1}^{T}\sum_{i=1}^{N}\left[ u_{i,t}\, f_i(P_{i,t}) + S_{i,t} \right] \]

wherein T is the scheduling horizon, N is the number of units, u_{i,t} is the start-stop state of the unit, P_{i,t} is the unit output, f_i(\cdot) is the unit fuel cost function, and S_{i,t} is the unit start-stop cost;

the unit fuel cost function expression is:

\[ f_i(P_{i,t}) = \alpha_i P_{i,t}^{2} + \beta_i P_{i,t} + \gamma_i \]

wherein \alpha_i, \beta_i and \gamma_i are the parameters of the unit cost quadratic curve and P_{i,t} is the unit output;
the unit operation constraints comprise the unit capacity constraint, the unit minimum start-stop time constraint and the ramp rate constraint. The unit capacity constraint represents the upper and lower limits of unit output in each period:

\[ u_{i,t}\,\underline{P}_i \le P_{i,t} \le u_{i,t}\,\overline{P}_i \]

wherein \underline{P}_i is the lower output limit of the unit, \overline{P}_i is the upper output limit of the unit, and u_{i,t} is the start-stop state of the unit.

The unit minimum start-stop time constraint represents the minimum duration a unit must remain in the on or off state once started up or shut down (in general, the larger the unit capacity, the longer this duration):

\[ (\tau^{on}_{i,t-1} - TO_i)(u_{i,t-1} - u_{i,t}) \ge 0, \qquad (\tau^{off}_{i,t-1} - TS_i)(u_{i,t} - u_{i,t-1}) \ge 0 \]

wherein TS_i is the minimum continuous shutdown time of the unit and TO_i is the minimum continuous start-up time of the unit.

The ramp rate constraint represents that the amount by which the generator's power output may increase or decrease within a period is limited. Because it couples the start-stop states of the generating unit across periods, it is a time-coupled dynamic constraint and has long been one of the constraints that are difficult to handle in large-scale unit combination problems (it affects the solving efficiency of the unit combination model). The ramp rate constraint expression is:

\[ P_{i,t} - P_{i,t-1} \le P_{up,i}, \qquad P_{i,t-1} - P_{i,t} \le u_{i,t}\,P_{down,i} + (1 - u_{i,t})\,P_{shut,i} \]

wherein P_{up,i} is the upward ramp rate of the unit, P_{down,i} is the downward ramp rate of the unit, and P_{shut,i} is the maximum power output allowed when the unit shuts down.

The power system constraints comprise the system power balance constraint and the spinning reserve constraint. The power balance constraint expression is:

\[ \sum_{i=1}^{N} u_{i,t}\,P_{i,t} + P_{w,t} = P_{D,t} \]

wherein P_{D,t} is the aggregate load of the system during period t and P_{w,t} is the wind power output during period t.

To ensure stable operation of the system, the spinning reserve is typically set to 10% of the system load; the spinning reserve constraint expression is:

\[ \sum_{i=1}^{N} u_{i,t}\,\overline{P}_i \ge P_{D,t} + R_t, \qquad R_t = 0.1\,P_{D,t} \]
wherein R_t is the spinning reserve capacity of the system during period t.

S2: the unit combination model is converted into a Markov decision process, which comprises a quintuple of states, actions, environment transition probabilities, rewards and a discount factor; constructing the quintuple comprises the following steps:

S201: the action space:

\[ A_t = [a_{1,t}, \ldots, a_{N,t}], \qquad a_{n,t} \in \{0,1\} \]

wherein A_t is a binary vector whose length is the number of units N; a_{n,t} = 1 means that the n-th unit is turned on at time t+1 and remains on for that period, and a_{n,t} = 0 means that the n-th unit is turned off at time t+1 and remains off for that period; the action space dimension of the whole Markov decision process is 2^N;

S202: the state space:

\[ s_t = (b_i, d_{t+1}, w_{t+1}, \tau_{i,t}, t) \]

wherein t is the current scheduling period; the vector b_i contains the basic information of unit i, such as the minimum start-up time, the start-up cost and the parameters of the quadratic fuel curve; d_{t+1} is the load forecast for period t+1; w_{t+1} is the wind power generation forecast for period t+1; \tau_{i,t} is the number of periods unit i has been on/off at the current time t;

S203: the reward function is a scalar returned by the environment to the agent; it reflects the quality of the state-action mapping currently adopted by the agent and guides the agent to adjust its decision strategy according to the magnitude of the reward. The reward function expression is:

\[ r_t = -\left( C^{fuel}_t + C^{start}_t + C^{shed}_t \right) \]

wherein C^{fuel}_t is the unit fuel cost, C^{start}_t is the unit start-up cost and C^{shed}_t is the load shedding penalty. The load shedding penalty expression is:

\[ C^{shed}_t = c_{voll}\, P^{shed}_t \]

wherein c_{voll} is the cost per megawatt-hour of curtailed load and P^{shed}_t is the curtailed load, counted at the load shedding resolution \zeta, generally taken as 0.1% of the load; the load shedding penalty penalizes the agent so that bad actions causing load shedding are avoided, which helps the agent explore feasible solutions.

S204: the environment transition probability:

\[ p(s_{t+1}, r \mid s_t, a_t) \]

where the environment transition probability is determined by an environment transition function that moves the current state of the system from s_t to s_{t+1} and returns a reward value r; in the unit combination problem, the environment transition function should strictly exclude illegal actions, such as start-stop actions that violate the unit minimum start-stop time constraint;

the environment transition comprises the unit running state transition, the wind power forecast transition and the load forecast transition: the wind power forecast transition and the load forecast transition simply update the forecasts to the next period's data;

the unit running state transfer function expression is:

\[ \tau_{i,t+1} = \begin{cases} \max(\tau_{i,t}, 0) + 1, & a_{i,t} = 1 \\ \min(\tau_{i,t}, 0) - 1, & a_{i,t} = 0 \end{cases} \]

wherein \tau_{i,t} is the number of periods the unit has currently been running/stopped; given the action taken by the agent, this state transition is deterministic.
As shown in FIG. 3, during the state transition the agent's perturbed state observation is defined as the predicted wind power, so that the agent performs the corresponding action according to its stochastic policy. Meanwhile, the environment transitions from the current true state to the next state, and the agent's reward is calculated from the actual wind power. It should be noted that because the agent learns its strategy based on the predicted state, the action it performs may be suboptimal, which correspondingly reduces the reward obtained. During exploration, actions that obtain lower rewards under the predicted wind power are selected less often, so wind power uncertainty is taken into account.
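A minimal illustration of this state-perturbed transition (all names are assumptions): the agent is shown the forecast, while the environment advances on, and settles the reward with, the measured wind power.

```python
def perturbed_step(true_state, action, wind_forecast, wind_actual,
                   transition, reward_fn):
    """State-perturbed MDP step: decide on the forecast, settle on the actual.

    transition(state, action, wind) -> next true environment state
    reward_fn(state, action, wind)  -> scalar reward for this period
    """
    observation = {**true_state, "wind": wind_forecast}       # what the agent sees
    next_state = transition(true_state, action, wind_actual)  # true dynamics
    reward = reward_fn(true_state, action, wind_actual)       # settled on actuals
    return observation, next_state, reward
```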
S205: determining a discount factor: the discount factor represents the degree of importance of the agent to the long-term rewards and the short-term rewards, and the value range is between 0 and 1, and the discount factor takes 0.9.
S3, improving a unit combination Markov decision process to consider wind power uncertainty;
s301: strategy learning is carried out by adopting real wind power data;
In existing methods in which reinforcement learning handles uncertainty, wind power uncertainty is typically modeled artificially, for example by assuming that it follows a certain probability distribution, and the strategy trained and learned in the artificially designed uncertainty environment is then transferred to the test environment; even if the wind power output is randomly generated within a predefined range, this is essentially uniform sampling. However, an environment built from artificially assumed wind power probability distributions is not exactly the same as the real wind output scenario, and if the learned strategy is not robust to such modeling errors, the gap typically leads to an unsuccessful transfer. To exploit the data-driven nature of the model-free reinforcement learning method, the complexity of explicitly modeling the randomness of the power system is avoided and strategy learning uses real historical data, i.e., the agent's strategy is learned from wind power data collected in the real world. During training, the relationship between the wind power forecast and the actual measurement is learned, and an optimal scheduling scheme that can adapt to the actual test environment is provided on the basis of grasping the distribution characteristics of the forecast and measured wind power.
S302: adopting a Markov decision process based on state disturbance;
The state-perturbed Markov decision process is similar to a state-adversarial Markov decision process in that it regards uncertainty as arising from unavoidable sensor errors or the inherently inaccurate characteristics of the equipment. With the state-perturbed Markov decision process, the corrected state transition between actual and predicted wind power can be illustrated with FIG. 2: the agent's perturbed state observation is defined as the predicted wind power, so the agent takes the corresponding action according to its stochastic policy; the environment still transitions from the real state to the next state, and the agent's reward is calculated from the actual wind power value. Because the actual wind power output can differ from the predicted output, the action the agent executes based on the predicted state under the originally learned strategy may be suboptimal, so the reward decreases accordingly (i.e., the unit combination cost caused by the scheduling scheme is higher); thus, actions with lower rewards in the predicted state are adopted less often during exploration.

S303: adopting an improved state space; the state space should take into account as many factors affecting the decision as possible. Because wind power is temporally correlated, the change in the wind power forecast is closely related to the time step t of each scheduling period; this means that a single-point wind power forecast cannot be used to determine the strategy for the entire scheduling horizon. The time-series correlation must therefore be considered when defining the state space, otherwise the convergence of the algorithm is affected. The change of the wind power forecast between adjacent periods is added to the state space, represented by the first-order difference of the wind power over two adjacent periods;

the change of the wind power forecast between adjacent periods is expressed as:

\[ \Delta P_{w,t} = P_{w,t} - P_{w,t-1} \]

wherein P_{w,t-1} is the wind power forecast in period t-1 and P_{w,t} is the wind power forecast in period t;

the deviation between the forecast and the measured wind power is expressed as:

\[ \Delta P'_{w,t-1} = P_{w',t-1} - P_{w,t-1} \]

wherein P_{w,t-1} is the wind power forecast in period t-1 and P_{w',t-1} is the measured wind power output in period t-1;

further, the improved state space expression is:

\[ s_t = \left(b_i, d_{t+1}, w_{t+1}, \tau_{i,t}, t, \Delta P_{w,t}, \Delta P'_{w,t-1}\right) \]

S4: a deep reinforcement learning algorithm is selected to solve the unit combination Markov decision process. The algorithm must be able both to handle a discrete action space and to converge to a stochastic optimal policy; the policy-based proximal policy optimization algorithm satisfies both requirements well.
Specifically, the unit combination Markov decision process is an environment consisting of a continuous state space (wind power variables, load variables, etc.) and a discrete action space (unit start-stop variables). Although deep Q-learning fits the state-space and action-space setting of the unit combination problem, its inherent over-estimation of the value function leads to suboptimal strategies. Furthermore, value-based methods eventually converge to deterministic actions and thereby learn deterministic strategies. Such methods are therefore typically used for the deterministic unit combination problem, where the policy (the unit commitment) is deterministic because the forecasts are fixed. When wind power uncertainty is considered, however, value-based deep reinforcement learning methods are clearly not applicable: with only the day-ahead wind power forecast available, the wind power forecast error is not observable before the operating day, so no optimal unit scheduling scheme based on the actual wind power output can be determined in advance. A deterministic policy yields the same action under the same state observation, but in our problem setting, even if the same wind power forecast is observed, the same scheduling scheme cannot be guaranteed because of wind power uncertainty. Therefore a policy-based method is adopted to solve the unit combination problem, with the aim of improving the economy of the scheduling scheme while ensuring the security of the unit combination.
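To illustrate the distinction drawn above, a small sketch (assumed names): a value-based agent takes the argmax action and therefore always returns the same schedule for the same observation, whereas a policy-based agent samples from the categorical output, so repeated calls under an identical wind forecast can yield different commitment decisions.

```python
import torch

def deterministic_action(q_values):
    """Value-based choice: identical observations always give the same action."""
    return int(torch.argmax(q_values))

def stochastic_action(action_probs):
    """Policy-based choice: sample the 2**N-way categorical distribution."""
    return int(torch.distributions.Categorical(probs=action_probs).sample())
```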
The proximal policy optimization algorithm adopted by the invention is based on an actor-critic framework with an action network and an evaluation network, and is a policy-based deep reinforcement learning algorithm. Its characteristics are as follows: the policy function is approximated and optimized by maximizing a policy objective function rather than by optimizing a value function, and the finally optimized policy represents a probability distribution over the agent's actions. The optimal policy in a policy-based method generates actions with a certain probability, so even when the observations are identical the results may differ. The policy objective function evaluates the quality of the current policy, and its expression is:

\[ J(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(s, a)\, R_{s,a} \]

wherein d^{\pi_\theta}(s) is the probability that the system environment is in state s under the current policy \pi_\theta, \pi_\theta(s, a) is the probability that the agent takes action a after observing state s according to the current policy \pi_\theta, and R_{s,a} is the reward value.

Further, the expression of the proximal policy optimization algorithm is:

\[ J_{PPO}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t) \right] - \beta\, KL(\theta, \theta') \]

wherein KL(\theta, \theta') is the KL divergence describing the distance between probability distribution \theta and probability distribution \theta', and A^{\theta'}(s_t, a_t) is the advantage function under policy \theta'. The proximal policy optimization algorithm places the KL divergence directly into the objective function to be optimized, so that a gradient-ascent algorithm can be used to maximize the objective.
Furthermore, the proximal policy optimization algorithm must be applicable to a discrete action space; the steps of discretizing the proximal policy optimization model are as follows: in the original evaluation network, the Q function takes the state and the action as input and outputs a Q value, because there are infinitely many possible actions. In the unit combination Markov decision process the action space is limited to 2^N actions, so the network can instead output a vector containing the Q value of every action. For the action network, the mean and covariance of a continuous action distribution no longer need to be output; instead, a categorical distribution over all actions in the action space is output directly. In this patent, a softmax function is applied to the last network layer to ensure a valid probability distribution between 0 and 1.
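Since the categorical output enumerates all 2^N start-stop combinations, each sampled action index has to be mapped back to a length-N commitment vector; one possible encoding (an assumption, the patent does not specify one) is the binary expansion of the index:

```python
import numpy as np

def index_to_commitment(action_index, n_units):
    """Decode a categorical action index into a 0/1 commitment vector."""
    return np.array([(action_index >> k) & 1 for k in range(n_units)],
                    dtype=np.int64)

def commitment_to_index(commitment):
    """Inverse mapping, e.g. when storing transitions in the experience pool."""
    return int(sum(int(b) << k for k, b in enumerate(commitment)))
```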
S5: the network parameters of the proximal policy optimization algorithm are trained; the training comprises the following steps:

S501: selecting historical wind power and load data for training, wherein, to implement the feature described in step S301, the historical wind power data must include both wind power forecast data and measured wind power data;

S502: the action network interacts with the environment and stores the obtained key information in the experience replay pool. The output of the action network is the unit combination scheduling scheme for the next period, and the output of the evaluation network is the state value. During training, the agent interacts with the unit combination reinforcement learning environment period by period: the agent obtains the environment state defined in step S302 and outputs an action, and the unit combination reinforcement learning environment returns the scalar reward defined in step S203 to the agent. The state, action and reward of every period of the whole scheduling cycle are stored in the experience pool.

S503: the evaluation network calculates the state value and the discounted reward, the advantage function (the advantage of an action relative to the average) is then computed from these results, and the network weights of the evaluation network are updated;

S504: the importance sampling ratio is calculated through the sampling network so as to update the weights of the action network; the sampling network and the action network share the same weights, but the update of the sampling network's weight parameters lags behind that of the action network.
As shown in FIG. 4, during training the action network has several trainable hidden layers responsible for generating the probability distribution over the actions the agent may take in a given state; the evaluation network accepts the state (and the corresponding action) as input and outputs a state-action value used to estimate the agent's expected long-term return. The two networks are trained interactively under a unified framework: the actions output by the action network are executed in the environment to generate new states and rewards, which are then used to update the evaluation network and, through the loss function, to further guide the policy optimization of the action network.
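The loop below condenses steps S501-S504 and the interaction shown in FIG. 4 into an illustrative sketch; the environment is assumed to return tensor observations, the simple Monte-Carlo return and all names are assumptions, and the KL penalty term of the objective is omitted for brevity.

```python
import torch

def train_ppo(env, actor, critic, old_actor, episodes, gamma=0.9, lr=3e-4):
    """Illustrative actor-critic training loop for the unit-combination MDP."""
    opt_a = torch.optim.Adam(actor.parameters(), lr=lr)
    opt_c = torch.optim.Adam(critic.parameters(), lr=lr)
    for _ in range(episodes):
        # S502: roll out one scheduling cycle and collect (state, action, reward)
        states, actions, rewards = [], [], []
        state, done = env.reset(), False
        while not done:
            probs = actor(state)
            action = torch.distributions.Categorical(probs=probs).sample()
            state_next, reward, done, _ = env.step(action)
            states.append(state); actions.append(action); rewards.append(reward)
            state = state_next
        # S503: discounted returns, state values, advantages, critic update
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns, dtype=torch.float32)
        values = torch.stack([critic(s).squeeze() for s in states])
        advantages = (returns - values).detach()
        opt_c.zero_grad(); ((returns - values) ** 2).mean().backward(); opt_c.step()
        # S504: importance sampling ratio against the lagged sampling network
        ratios = torch.stack([
            torch.exp(torch.distributions.Categorical(probs=actor(s)).log_prob(a)
                      - torch.distributions.Categorical(probs=old_actor(s)).log_prob(a).detach())
            for s, a in zip(states, actions)])
        opt_a.zero_grad(); (-(ratios * advantages).mean()).backward(); opt_a.step()
        old_actor.load_state_dict(actor.state_dict())   # lagged weight update
```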
And S6, solving a unit combination problem considering wind power uncertainty by adopting a trained deep reinforcement learning model, and further obtaining a unit combination scheduling scheme.
The output of each generating unit after solving with the deep reinforcement learning model is shown in FIG. 5. Over the 24-hour scheduling horizon of one day, unit 1 is the main power supply unit, and units 2 and 4 are started and stopped as peak-shaving units in some periods to accommodate intraday wind power fluctuation. As can be seen from FIG. 5:
1) In different periods of the scheduling day, the wind power output forecast deviates from the actual value to different degrees; in the face of wind power forecast errors, the deep reinforcement learning agent adaptively gives a scheduling scheme that avoids wind curtailment and load shedding losses.
Table 2 summarizes the computation time (in seconds) of the deep reinforcement learning model and of the two-stage robust optimization algorithm over the 7 scheduling days of the experimental test. The time of the proximal policy optimization algorithm was stable at 0.01 seconds, while the computation time of the two-stage robust optimization algorithm varied across test days and required up to 8 seconds.
TABLE 2 Computation time of the deep reinforcement learning model and of the two-stage robust optimization algorithm in the embodiment
As can be seen from table 2:
1) Because a large amount of computation is moved offline into the training stage, the deep reinforcement learning algorithm can give a unit decision scheme in a very short time; the solution time of the traditional two-stage robust optimization algorithm fluctuates widely across different wind power scenarios, and its computation time far exceeds that of the deep reinforcement learning algorithm.
The comparison of the results of the deep reinforcement learning model and of the two-stage robust optimization solution is shown in FIG. 6, which covers a test horizon of seven scheduling days. The actual cost of the unit combination scheduling scheme obtained by the deep reinforcement learning algorithm is essentially at the optimal level over the seven days, while the actual cost of the scheduling scheme given by the traditional two-stage robust optimization algorithm is consistently higher. As can be seen from FIG. 6:
1) The economy of the unit combination scheduling scheme obtained by the deep reinforcement learning algorithm is clearly better than that of the traditional two-stage robust optimization algorithm; in actual scheduling, the deep reinforcement learning algorithm adapts better to the influence of wind power forecast errors and therefore gives a more economical scheduling scheme.
In the description of the present specification, reference to the terms "one embodiment", "example" and the like means that a particular feature, structure or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of these terms do not necessarily refer to the same embodiment or example, and the particular features, structures or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
It must be pointed out that the above description of the embodiments is not intended to be limiting but to assist in understanding the core idea of the invention; any modification of the invention and any equivalent alternative to the present product that does not depart from the principle of the invention is intended to fall within the scope of the claims of the invention.
Claims (9)
1. A unit combination deep reinforcement learning solving method, characterized by comprising the following steps:
S1: receiving unit parameters and the network topology, and establishing a unit combination model based on conventional optimization;
S2: modeling a Markov decision process based on the conventionally optimized unit combination model;
S3: improving the unit combination Markov decision process to account for wind power uncertainty;
S4: building a deep reinforcement learning model for solving the Markov decision process;
S5: receiving historical wind power and load data to train the parameters of the deep reinforcement learning model;
S6: using the trained deep reinforcement learning model to solve the unit combination problem considering wind power uncertainty, thereby obtaining a unit combination scheduling scheme.
2. The unit combination deep reinforcement learning solving method according to claim 1, wherein in step S1, the unit combination model based on conventional optimization comprises an objective function, unit operation constraints and power system constraints;

the objective function expression is:

\[ \min \sum_{t=1}^{T}\sum_{i=1}^{N}\left[ u_{i,t}\, f_i(P_{i,t}) + S_{i,t} \right] \]

wherein T is the scheduling horizon, N is the number of units, u_{i,t} is the start-stop state of the i-th unit in period t, P_{i,t} is the output of the i-th unit in period t, f_i(\cdot) is the unit fuel cost function, and S_{i,t} is the unit start-stop cost;

the unit fuel cost function expression is:

\[ f_i(P_{i,t}) = \alpha_i P_{i,t}^{2} + \beta_i P_{i,t} + \gamma_i \]

wherein \alpha_i, \beta_i and \gamma_i are the parameters of the unit cost quadratic curve and P_{i,t} is the unit output;

the unit operation constraints comprise the unit capacity constraint, the unit minimum start-stop time constraint and the ramp rate constraint;

the power system constraints comprise the system power balance constraint and the spinning reserve constraint.
3. The unit combination deep reinforcement learning solving method according to claim 2, wherein

the unit capacity constraint represents the upper and lower limits of unit output in each period:

\[ u_{i,t}\,\underline{P}_i \le P_{i,t} \le u_{i,t}\,\overline{P}_i \]

wherein \underline{P}_i is the lower output limit of the unit, \overline{P}_i is the upper output limit of the unit, and u_{i,t} is the start-stop state of the unit;

the unit minimum start-stop time constraint represents the minimum duration a unit must remain in the on or off state once started up or shut down (in general, the larger the unit capacity, the longer this duration):

\[ (\tau^{on}_{i,t-1} - TO_i)(u_{i,t-1} - u_{i,t}) \ge 0, \qquad (\tau^{off}_{i,t-1} - TS_i)(u_{i,t} - u_{i,t-1}) \ge 0 \]

wherein TS_i is the minimum continuous shutdown time of the unit and TO_i is the minimum continuous start-up time of the unit;

the ramp rate constraint represents that the amount by which the power output of a generator may increase or decrease within a period is limited; the ramp rate constraint expression is:

\[ P_{i,t} - P_{i,t-1} \le P_{up,i}, \qquad P_{i,t-1} - P_{i,t} \le u_{i,t}\,P_{down,i} + (1 - u_{i,t})\,P_{shut,i} \]

wherein P_{up,i} is the upward ramp rate of the unit, P_{down,i} is the downward ramp rate of the unit, and P_{shut,i} is the maximum power output allowed when the unit shuts down;

the power balance constraint expression is:

\[ \sum_{i=1}^{N} u_{i,t}\,P_{i,t} + P_{w,t} = P_{D,t} \]

wherein P_{D,t} is the aggregate load of the system during period t and P_{w,t} is the wind power output during period t;

to ensure stable operation of the system, the spinning reserve is typically set to 10% of the system load; the spinning reserve constraint expression is:

\[ \sum_{i=1}^{N} u_{i,t}\,\overline{P}_i \ge P_{D,t} + R_t, \qquad R_t = 0.1\,P_{D,t} \]

wherein R_t is the spinning reserve capacity of the system during period t.
4. The unit combination deep reinforcement learning solving method according to claim 1, wherein in step S2, the Markov decision process comprises a quintuple of state space, action space, reward function, environment transition probability and discount factor, and constructing the quintuple to convert the unit combination model into the Markov decision process specifically comprises the following steps:

S201: constructing the action space:

\[ A_t = [a_{1,t}, \ldots, a_{N,t}], \qquad a_{n,t} \in \{0,1\} \]

wherein A_t is a binary vector whose length is the number of units N; a_{n,t} = 1 means that the n-th unit is turned on at time t+1 and remains on for that period, and a_{n,t} = 0 means that the n-th unit is turned off at time t+1 and remains off for that period; the action space dimension of the whole Markov decision process is 2^N;

S202: constructing the state space:

\[ s_t = (b_i, d_{t+1}, w_{t+1}, \tau_{i,t}, t) \]

wherein t is the current scheduling period; the vector b_i contains the basic information of unit i, such as the minimum start-up time, the start-up cost and the parameters of the quadratic fuel curve; d_{t+1} is the load forecast for period t+1; w_{t+1} is the wind power generation forecast for period t+1; \tau_{i,t} is the number of periods unit i has been on/off at the current time t;

S203: constructing the reward function, which is a scalar returned by the environment to the agent; it reflects the quality of the state-action mapping currently adopted by the agent and guides the agent to adjust its decision strategy according to the magnitude of the reward; the reward function expression is:

\[ r_t = -\left( C^{fuel}_t + C^{start}_t + C^{shed}_t \right) \]

wherein C^{fuel}_t is the unit fuel cost, C^{start}_t is the unit start-up cost and C^{shed}_t is the load shedding penalty; the load shedding penalty expression is:

\[ C^{shed}_t = c_{voll}\, P^{shed}_t \]

wherein c_{voll} is the cost per megawatt-hour of curtailed load and P^{shed}_t is the curtailed load, counted at the load shedding resolution \zeta, generally taken as 0.1% of the load; the load shedding penalty penalizes the agent so that undesirable actions causing load shedding are avoided, which helps the agent explore feasible solutions;

S204: computing the environment transition probability:

\[ p(s_{t+1}, r \mid s_t, a_t) \]

where the environment transition probability is determined by an environment transition function that moves the current state of the system from s_t to s_{t+1} and returns a reward value r; in the unit combination problem, the environment transition function should strictly exclude illegal actions, such as start-stop actions that violate the unit minimum start-stop time constraint;

the environment transition comprises the unit running state transition, the wind power forecast transition and the load forecast transition, wherein the wind power forecast transition and the load forecast transition simply update the forecasts to the next period's data;

the unit running state transfer function expression is:

\[ \tau_{i,t+1} = \begin{cases} \max(\tau_{i,t}, 0) + 1, & a_{i,t} = 1 \\ \min(\tau_{i,t}, 0) - 1, & a_{i,t} = 0 \end{cases} \]

wherein \tau_{i,t} is the number of periods the unit has currently been running/stopped; given the action taken by the agent, this state transition is deterministic;

S205: determining the discount factor: the discount factor represents how much the agent values long-term rewards relative to short-term rewards, and its value lies between 0 and 1.
5. The method of claim 4, wherein the discount factor is 0.9.
6. The method for solving the deep reinforcement learning of the unit combination according to claim 4, wherein the step S3 specifically comprises the following steps:
s301: strategy learning is carried out by adopting real wind power data;
for utilizing the data driving characteristic of the model-free reinforcement learning method, the complex randomness of modeling of the power system is avoided, and strategy learning of real historical data is utilized, namely strategy of an intelligent agent is learned by utilizing wind power data collected in the real world; in the training process, the change rule between the wind energy predicted value and the actual measured value is learned, and an optimal scheduling scheme which can adapt to the actual testing environment is provided on the basis of grasping the distribution characteristic of the wind energy predicted value and the actual measured value;
S302: adopting a Markov decision process based on state disturbance;
in a Markov decision process based on state disturbance, in a correction state conversion process between actual wind power and predicted wind power, disturbance state observation of an agent is defined as predicted wind power, so that the agent can take corresponding actions according to a random strategy; the environment is still transited from the real state to the next state, and rewards obtained by the intelligent agent are calculated according to the actual wind power value; the actions executed by the intelligent agent under the original learning strategy according to the prediction state can be suboptimal, so that rewards are correspondingly reduced, and actions with lower rewards under the prediction state are less adopted in the exploration process;
S303: adopting an improved state space; the state space should account for as many factors affecting the decision as possible; when defining the state space, temporal correlation is considered by adding the change of the wind power forecast between adjacent periods, represented as the first-order difference of the wind power forecast over two adjacent periods;
the change of the wind power forecast between adjacent periods is expressed as:
ΔP_w,t = P_w,t - P_w,t-1
where P_w,t-1 is the wind power forecast in period t-1 and P_w,t is the wind power forecast in period t;
the deviation between the wind power forecast and the measured wind power in the previous period is expressed as:
ΔP'_w,t-1 = P_w,t-1 - P'_w,t-1
where P_w,t-1 is the wind power forecast in period t-1 and P'_w,t-1 is the measured wind power output in period t-1;
further, the improved state space is obtained by appending the two quantities defined above to the original state, as sketched below.
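A minimal sketch of this augmented state construction follows; the composition of the base state (unit running/stopping counters, current wind and load forecasts, and the period index) is assumed for illustration, since it is defined in an earlier step not reproduced in this excerpt.

```python
import numpy as np

def build_state(tau, wind_forecast, load_forecast, t, wind_measured_prev):
    """Assemble the augmented observation for period t (illustrative sketch).

    Assumed base state: unit running/stopping time counters tau, the current
    wind and load forecasts, and the period index. The two augmentation terms
    follow S303: the adjacent-period change of the wind forecast and the
    previous-period forecast error."""
    delta_forecast = wind_forecast[t] - wind_forecast[t - 1]         # first-order difference
    forecast_error_prev = wind_forecast[t - 1] - wind_measured_prev  # deviation in period t-1
    return np.concatenate([
        tau,                                   # unit running/stopping time counters
        [wind_forecast[t], load_forecast[t]],  # current forecasts
        [delta_forecast, forecast_error_prev], # augmentation terms
        [t],                                   # period index
    ]).astype(np.float32)
```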
7. The method of claim 5, wherein in step S5 the training comprises the following steps:
S501: receiving historical wind power and load data for training, wherein the historical wind power data must contain both wind power forecast data and measured wind power data;
S502: the action network interacts with the environment and stores the resulting key information in an experience replay pool; the output of the action network is the unit combination schedule for the next period, and the output of the evaluation network is the state value; during training, the agent interacts with the unit combination reinforcement learning environment in time-period order: the agent obtains the environment state defined in step S302 and outputs an action, and the unit combination reinforcement learning environment returns to the agent the scalar reward value defined in step S203; the state, action and reward value of every period of the whole dispatch cycle are stored in the experience pool;
S503: the evaluation network computes the state values and the discounted returns, the advantage function is then obtained from these results, and the network weights of the evaluation network are updated;
S504: computing the importance-sampling ratio through a sampling network so as to update the weights of the action network; the sampling network and the action network share the same weights, but the update of the sampling network's weight parameters lags behind that of the action network.
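As a sketch of the quantities handled in S503 and S504, the following computes discounted returns, baseline advantages and the importance-sampling ratio from one stored dispatch cycle; refinements such as generalized advantage estimation are omitted, and the discount factor of 0.9 follows claim 5.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.9):
    """Discounted return G_t for every period of one dispatch cycle."""
    g, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

def advantages_and_ratio(rewards, state_values, logp_new, logp_old, gamma=0.9):
    """Advantage estimate (return minus critic baseline, S503) and the
    importance-sampling ratio between the action network (new) and the
    lagged sampling network (old), S504."""
    returns = discounted_returns(rewards, gamma)
    adv = returns - state_values              # baseline subtraction
    ratio = np.exp(logp_new - logp_old)       # pi_theta(a|s) / pi_theta'(a|s)
    return returns, adv, ratio
```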
8. The deep reinforcement learning solving method for unit combination according to claim 1, wherein in step S4 the deep reinforcement learning model is a policy-based proximal policy optimization model;
the proximal policy optimization model is based on the actor-critic framework, is equipped with an action network and an evaluation network, and is a policy-based deep reinforcement learning algorithm; it has the following characteristics: the policy function is approximated and optimized by maximizing a policy objective function, and the finally optimized policy represents a probability distribution over the agent's actions; in a policy-based method the optimal policy generates actions with a certain probability, so even for identical observations the results may differ; the policy objective function evaluates the quality of the current policy and is expressed as follows:
J(θ) = Σ_s d^πθ(s) Σ_a π_θ(s, a) R_s,a
where d^πθ(s) is the probability of the system environment being in state s under the current policy π_θ, π_θ(s, a) is the probability that the agent takes action a after observing state s according to the current policy π_θ, and R_s,a is the reward value;
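For concreteness, the policy objective can be evaluated directly in a small tabular case; the numbers below are made up purely to show the summation structure.

```python
import numpy as np

# Tiny tabular illustration of J(theta): d[s] is the state distribution under
# the current policy, pi[s, a] the action probabilities, R[s, a] the reward.
# All values are illustrative placeholders.
d = np.array([0.6, 0.4])
pi = np.array([[0.7, 0.3],
               [0.2, 0.8]])
R = np.array([[1.0, -0.5],
              [0.3,  2.0]])
J = np.sum(d[:, None] * pi * R)   # J(theta) = sum_s d(s) sum_a pi(s,a) R(s,a)
print(J)
```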
further, the proximal policy optimization model is expressed as follows:
J_PPO^θ'(θ) = J^θ'(θ) - β·KL(θ, θ'),  with  J^θ'(θ) = E_{(s_t,a_t)~π_θ'}[ (π_θ(a_t | s_t) / π_θ'(a_t | s_t)) · A^θ'(s_t, a_t) ]
where KL(θ, θ') is the KL divergence between the probability distributions parameterized by θ and θ', β is the penalty coefficient weighting the KL term, and A^θ'(s_t, a_t) is the advantage function under policy θ'; the proximal policy optimization model places the KL divergence directly into the objective function to be optimized, so that a gradient-ascent algorithm can be used to maximize the objective function.
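A minimal sketch of this penalized objective for a discrete action space is given below; the penalty coefficient beta and the epsilon used for numerical stability are assumptions not specified in this excerpt.

```python
import numpy as np

def kl_categorical(p_old, p_new, eps=1e-8):
    """KL(old || new) between per-state categorical action distributions."""
    return np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps)), axis=-1)

def ppo_penalty_objective(logp_new, logp_old, advantages, p_new, p_old, beta=0.01):
    """Surrogate objective with a KL penalty, to be maximized by gradient ascent.
    beta is an assumed penalty coefficient."""
    ratio = np.exp(logp_new - logp_old)            # importance-sampling ratio
    surrogate = np.mean(ratio * advantages)        # J^{theta'}(theta)
    penalty = beta * np.mean(kl_categorical(p_old, p_new))
    return surrogate - penalty
```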
9. The method of claim 7, wherein the proximal policy optimization model is required to be applicable to a discrete action space, and the discretization of the proximal policy optimization model proceeds as follows: in the original evaluation network, the Q-value function takes a state and an action as input and outputs a single Q value, because a continuous action space contains infinitely many possible actions; in the unit combination Markov decision process, however, the action space is limited to 2^N actions, so the actions can be mapped to a vector containing the Q value of every action; the action network no longer needs to output the mean and covariance of a continuous action distribution; instead, it directly outputs the categorical distribution over all actions in the action space; a softmax function is applied to the last network layer to ensure a valid probability distribution between 0 and 1.
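The discretized networks can be sketched as follows (a PyTorch illustration; layer sizes and hidden dimensions are assumptions): the action network ends in a softmax over all 2^N on/off combinations, and the evaluation network outputs a scalar state value. Sampling from the resulting categorical distribution gives the index of the next-period unit combination, which can be decoded from its binary representation.

```python
import torch
import torch.nn as nn

class DiscreteActorCritic(nn.Module):
    """Sketch of the discretized networks: the action network outputs a softmax
    (categorical) distribution over all 2**n_units on/off combinations, and the
    evaluation network outputs a scalar state value. Sizes are illustrative."""

    def __init__(self, state_dim: int, n_units: int, hidden: int = 128):
        super().__init__()
        n_actions = 2 ** n_units                      # finite action space of the unit combination MDP
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),                       # valid probabilities between 0 and 1
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                     # scalar state value
        )

    def forward(self, state: torch.Tensor):
        probs = self.actor(state)
        dist = torch.distributions.Categorical(probs=probs)
        return dist, self.critic(state).squeeze(-1)
```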
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311096902.5A CN117117989A (en) | 2023-08-29 | 2023-08-29 | Deep reinforcement learning solving method for unit combination |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117117989A true CN117117989A (en) | 2023-11-24 |
Family
ID=88801718
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311096902.5A Pending CN117117989A (en) | 2023-08-29 | 2023-08-29 | Deep reinforcement learning solving method for unit combination |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117117989A (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118246351A (en) * | 2024-05-28 | 2024-06-25 | 长春工业大学 | A deep learning method for solving unit commitment problems considering unit confidence |
| CN118735309A (en) * | 2024-09-04 | 2024-10-01 | 长春工业大学 | A deep learning method for solving unit commitment problems considering confidence intervals |
| CN118886689A (en) * | 2024-09-29 | 2024-11-01 | 国网四川省电力公司宜宾供电公司 | A safe power dispatching method based on deep reinforcement learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |