CN118886689A

CN118886689A - A safe power dispatching method based on deep reinforcement learning

Info

Publication number: CN118886689A
Application number: CN202411372731.9A
Authority: CN
Inventors: 胡晓松; 钟泽波; 高啸; 谢光彬; 覃露熹; 李光彬; 郭逢权; 王秋平; 唐小梅; 李思岑; 江漫; 施寻; 张鹏
Original assignee: Yibin Power Supply Co Of Sichuan Electric Power Corp
Current assignee: Yibin Power Supply Co Of Sichuan Electric Power Corp
Priority date: 2024-09-29
Filing date: 2024-09-29
Publication date: 2024-11-01
Anticipated expiration: 2044-09-29
Also published as: CN118886689B

Abstract

The invention discloses a safe power dispatching method based on deep reinforcement learning, which belongs to the field of power dispatching and comprises the following steps: s1, training a scheduling model according to countermeasure learning by adopting a Markov decision process; s2, training a main expert model based on an expert rule base interactive tuning mode; and S3, outputting a final power dispatching scheme according to the main expert model optimization dispatching model. Based on the Rainbow DQN algorithm, two novel training modes are provided, namely, safety constraint guarantee learning is carried out by training an countermeasure model and a scheduling model in a manner based on countermeasure learning, and a main expert model is obtained by training in a manner based on expert rule base interactive optimization. Through a large number of experiments performed on the virtual simulation power grid environment, the effectiveness of the safety power dispatching algorithm based on the deep reinforcement learning is verified, a certain degree of disturbance and attack can be resisted, and the high-efficiency and robust power dispatching capability of the power system is improved.

Description

A safe power dispatching method based on deep reinforcement learning

技术领域Technical Field

本发明属于电力调度领域，具体涉及基于深度强化学习的安全电力调度方法。The present invention belongs to the field of electric power dispatching, and specifically relates to a safe electric power dispatching method based on deep reinforcement learning.

背景技术Background Art

电力调度是一种有效的管理手段，旨在确保电网的安全稳定运行、可靠供电以及各类电力生产工作的有序进行。电力调度的具体工作内容包括依据各类信息采集设备反馈回来的数据信息，或监控人员提供的信息，结合电网实际运行参数（如电压、电流、频率、负荷等），综合考虑各项生产工作的开展情况，对电网的安全和经济运行状态进行判断。调度中心通过电话或自动系统发布操作指令，指挥现场操作人员或自动控制系统进行调整，例如调整发电机出力、负荷分布以及投切电容器、电抗器等，从而确保电网持续安全稳定运行。Power dispatching is an effective management method that aims to ensure the safe and stable operation of the power grid, reliable power supply, and the orderly progress of various types of power production work. The specific work content of power dispatching includes judging the safety and economic operation status of the power grid based on the data information fed back by various information collection devices or the information provided by monitoring personnel, combined with the actual operation parameters of the power grid (such as voltage, current, frequency, load, etc.), comprehensively considering the development of various production work. The dispatching center issues operating instructions through telephone or automatic system, directing on-site operators or automatic control systems to make adjustments, such as adjusting generator output, load distribution, and switching capacitors and reactors, so as to ensure the continuous safe and stable operation of the power grid.

新能源的快速发展和峰值负荷的持续增长，给发电侧调度能力带来了新的挑战。新能源指的是风电和光伏等可再生能源，可以通过自然的能力获得，具有间歇性和随机性的特点。不是每一个区域都具备建风电厂或者光伏厂的能力，所以需要做电网的快速大范围的互联。随着各种各样的柔性设备或者电子设备的增加，整个电力系统的调控规模变得更大，并且不确定性和波动性也加剧了。电网调度中需要满足每一个时刻的瞬时安全约束。一旦某一个约束不满足，可能整个电网就会崩溃，造成大范围的断电，带来巨大的经济损失。因此整个电力系统需要具有高效和鲁棒的调度能力。The rapid development of new energy and the continuous growth of peak load have brought new challenges to the dispatching capacity of the power generation side. New energy refers to renewable energy such as wind power and photovoltaic power, which can be obtained through natural capabilities and has the characteristics of intermittent and randomness. Not every region has the ability to build a wind power plant or a photovoltaic plant, so it is necessary to interconnect the power grid quickly and on a large scale. With the increase of various flexible devices or electronic devices, the regulation scale of the entire power system has become larger, and the uncertainty and volatility have also increased. The instantaneous safety constraints at every moment need to be met in the grid dispatching. Once a constraint is not met, the entire power grid may collapse, causing large-scale power outages and huge economic losses. Therefore, the entire power system needs to have efficient and robust dispatching capabilities.

近年来，深度强化学习（Deep Reinforcement Learning）算法已被广泛用于电力调度领域，通过深度强化学习来进行电力调度是一项具有极大应用价值和前景的研究方向。强化学习已应用于涉及虚拟电厂的经济调度问题，便于对虚拟电厂的调度指令进行快速分解。该技术也可应用于风电场和储能系统的自调度，实现更快的调度决策。In recent years, deep reinforcement learning algorithms have been widely used in the field of power dispatching. Power dispatching through deep reinforcement learning is a research direction with great application value and prospects. Reinforcement learning has been applied to economic dispatch problems involving virtual power plants, facilitating the rapid decomposition of dispatch instructions for virtual power plants. This technology can also be applied to the self-dispatching of wind farms and energy storage systems to achieve faster dispatching decisions.

发明内容Summary of the invention

针对现有技术中的上述不足，本发明提供的基于深度强化学习的安全电力调度方法解决了现有电力调度方法的效率性和鲁棒性较差的问题。In view of the above-mentioned deficiencies in the prior art, the safe power dispatching method based on deep reinforcement learning provided by the present invention solves the problems of poor efficiency and robustness of the existing power dispatching methods.

为了达到上述发明目的，本发明采用的技术方案为：基于深度强化学习的安全电力调度方法，包括以下步骤：In order to achieve the above-mentioned invention object, the technical solution adopted by the present invention is: a safe power dispatching method based on deep reinforcement learning, comprising the following steps:

S1、采用马尔可夫决策过程根据对抗学习训练调度模型；S1, train the scheduling model based on adversarial learning using Markov decision process;

S2、基于专家规则库交互式调优的方式训练主专家模型；S2, training the main expert model based on interactive tuning of the expert rule base;

S3、根据主专家模型优化调度模型输出最终的电力调度方案。S3. Output the final power dispatching plan based on the dispatching model optimized by the main expert model.

进一步地：所述S1具体为：Further: S1 specifically includes:

通过马尔可夫决策过程定义动作空间、状态空间和奖励函数，模拟调度模型和对抗模型在虚拟仿真电网环境中的行为，通过Rainbow DQN 算法更新调度模型和对抗模型的网络参数，对调度模型进行对抗学习训练；The action space, state space and reward function are defined through the Markov decision process, the behavior of the scheduling model and the adversarial model in the virtual simulation power grid environment is simulated, the network parameters of the scheduling model and the adversarial model are updated through the Rainbow DQN algorithm, and the scheduling model is trained through adversarial learning;

其中，马尔可夫决策过程定义的策略具体为：基于在虚拟仿真电网环境中的观测数据，通过调度模型或对抗模型生成电力调度方案；The strategy defined by the Markov decision process is as follows: based on the observed data in the virtual simulation power grid environment, a power dispatching plan is generated through a dispatching model or an adversarial model;

马尔可夫决策过程定义的观察空间包括节点的电压值、线路的电流值、发电机的输出功率和负荷的电力需求的电网信息，状态空间的表达式具体为：The observation space defined by the Markov decision process includes the grid information of node voltage value, line current value, generator output power and load power demand, and the state space The specific expression is:

式中，表示所有节点的电压值，表示所有线路的电流值，表示所有发电机的输出功率，表示所有负荷的电力需求；In the formula, represents the voltage value of all nodes, Indicates the current value of all lines, represents the output power of all generators, Indicates the power demand of all loads;

马尔可夫决策过程定义的动作空间包括所有电力调度动作的排列组合，动作空间的表达式具体为：The action space defined by the Markov decision process includes all permutations and combinations of power dispatch actions. The specific expression is:

式中，表示发电机输出功率的调整值，表示开关操作状态，表示负荷调整；In the formula, Indicates the adjustment value of the generator output power, Indicates the switch operation status, Indicates load adjustment;

马尔可夫决策过程定义的奖励函数具体为基于电网运行的目标，其包括供电可靠性、经济效益和鲁棒性，奖励函数R_t的表达式具体为：The reward function defined by the Markov decision process is specifically based on the goal of power grid operation, which includes power supply reliability, economic benefits and robustness. The expression of the reward function R _t is specifically:

式中，表示供电可靠性，用于衡量电压稳定性、频率偏差，表示第i个节点的电压，表示电网中各节点的额定电压，N表示电网中节点的总数，表示第j条线路的电流，表示第j条线路的最大允许电流，M表示电网中线路的总数，表示经济效益，用于衡量发电成本，为第 k台发电机的成本系数，G表示发电机总数，表示第k台发电机输出功率的调整值，表示环境效益，用于衡量污染物排放，为第k台发电机的排放系数，表示鲁棒性，用于衡量对故障的抵抗能力，为第l个故障的影响评估值，F表示故障的总数，表示第一指标权重，表示第二指标权重，表示第三指标权重，表示第四指标权重。In the formula, Indicates power supply reliability, used to measure voltage stability, frequency deviation, represents the voltage of the ith node, represents the rated voltage of each node in the power grid, N represents the total number of nodes in the power grid, represents the current of the jth line, represents the maximum allowable current of the jth line, M represents the total number of lines in the power grid, Represents economic benefits, used to measure the cost of power generation, is the cost coefficient of the kth generator, G represents the total number of generators, represents the adjustment value of the output power of the kth generator, Represents environmental benefits and is used to measure pollutant emissions. is the emission coefficient of the kth generator, Represents robustness, which is used to measure the ability to resist failure. is the impact assessment value of the lth fault, F represents the total number of faults, represents the weight of the first indicator, represents the weight of the second indicator, represents the weight of the third indicator, Indicates the weight of the fourth indicator.

进一步地：通过Rainbow DQN 算法更新调度模型和对抗模型的网络参数的表达式具体为：Further: The expression for updating the network parameters of the scheduling model and the adversarial model through the Rainbow DQN algorithm is specifically:

式中，为在具有原子的向量上的投影，表示离散支持上的原子数量，为目标分布，为KL散度，为时间差误差的概率，为调节对的影响程度的控制参数。In the formula, For having Atomic vector The projection on represents the number of atoms on the discrete support, is the target distribution, is the KL divergence, is the probability of time difference error, To adjust right The control parameters of the degree of influence.

进一步地：所述S1中，马尔可夫决策过程的安全成本根据状态设置；Further: In S1, the security cost of the Markov decision process according to Status settings;

安全成本的表达式具体为：Security costs The specific expression is:

式中，表示第一权重系数，表示第二权重系数，表示第三权重系数，表示电压安全成本，用于反映电压偏离额定值的程度，表示允许的电压偏差范围，表示电流安全成本，用于反映线路电流是否超过安全运行的限制，表示频率安全成本，用于衡量电网的频率偏差情况，表示当前电网频率，表示电网的额定频率，表示允许的频率偏差范围，表示最大值函数。In the formula, represents the first weight coefficient, represents the second weight coefficient, represents the third weight coefficient, Indicates the voltage safety cost, which is used to reflect the degree to which the voltage deviates from the rated value. Indicates the allowable voltage deviation range, Indicates the current safety cost, which is used to reflect whether the line current exceeds the limit of safe operation. Represents the frequency security cost, which is used to measure the frequency deviation of the power grid. Indicates the current grid frequency. Indicates the rated frequency of the power grid, Indicates the allowable frequency deviation range, Represents the maximum value function.

上述进一步方案的有益效果为：通过Rainbow DQN 算法根据电力系统的特性，采用对抗学习和专家规则库交互式调优的方式训练调度模型，进一步提升电力调度方案的优化效果。The beneficial effect of the above further scheme is: according to the characteristics of the power system, the scheduling model is trained by the Rainbow DQN algorithm using adversarial learning and expert rule base interactive tuning to further improve the optimization effect of the power scheduling scheme.

进一步地：对调度模型进行对抗学习训练的目标函数具体为下式：Furthermore, the objective function of adversarial learning training for the scheduling model is specifically as follows:

式中，表示当前状态，表示调度模型的动作，表示扰动，表示执行调度和扰动后，系统转移到的状态，表示调度模型的策略，表示对抗模型的策略，和均通过状态空间获取，通过动作空间获取；In the formula, Indicates the current state. represents the actions of the scheduling model, represents disturbance, It indicates the state to which the system transfers after executing the scheduling and disturbance. represents the strategy of the scheduling model, represents the strategy of adversarial model, and All are obtained through the state space, Acquired through action space;

表示对抗模型尝试寻找最不利的扰动，表示调度模型学习如何在对抗模型生成的最不利情景下调整调度策略，表示所有可能状态转移的平均情况。 Indicates that the adversarial model tries to find the most unfavorable perturbation , It means that the scheduling model learns how to adjust the scheduling strategy under the most unfavorable scenario generated by the adversarial model. Represents the average of all possible state transitions.

进一步地：所述S2包括以下分步骤：Further: S2 comprises the following sub-steps:

S21、设置主专家模型和从专家模型，主专家模型包括电力调度的基本知识和初步优化能力，从专家模型根据主专家模型的副本生成；S21, setting a master expert model and a slave expert model, the master expert model includes basic knowledge and preliminary optimization capabilities of power dispatching, and the slave expert model is generated based on a copy of the master expert model;

S22、初始化主专家模型和从专家模型；S22, initializing the master expert model and the slave expert model;

S23、通过专家规则库优化从专家模型的输出，进而训练主专家模型。S23. Optimize the output of the slave expert model through the expert rule base, and then train the master expert model.

上述进一步方案的有益效果为：本发明采用基于电力专家构建的专家规则库交互式调优的方式训练主专家模型，每个从专家模型对应一个电力专家，主专家模型用于统一更新和管理，进一步提升电力调度方案的优化效果。The beneficial effect of the above further scheme is: the present invention adopts an interactive tuning method based on the expert rule base constructed by power experts to train the main expert model, each slave expert model corresponds to a power expert, and the main expert model is used for unified updating and management, further improving the optimization effect of the power dispatching scheme.

进一步地：所述S23包括以下分步骤：Further: S23 includes the following sub-steps:

S231、通过调度模型输出初始电力调度方案，将初始电力调度方案输入从专家模型，得到调整后的电力调度方案；S231, outputting an initial power dispatching plan through the dispatching model, inputting the initial power dispatching plan into the expert model, and obtaining an adjusted power dispatching plan;

S232、通过专家规则库对调整后的电力调度方案进行故障业务分析，生成调整后的电力调度方案的反馈结果；S232, performing fault business analysis on the adjusted power dispatching plan through an expert rule base, and generating feedback results of the adjusted power dispatching plan;

S233、通过虚拟仿真电网环境将反馈结果发送至从专家模型，根据反馈结果更新从专家模型的模型参数；S233, sending the feedback result to the slave expert model through the virtual simulation power grid environment, and updating the model parameters of the slave expert model according to the feedback result;

S234、按照预设周期将从专家模型更新的模型参数反馈至主专家模型，对主专家模型进行训练。S234. Feedback the model parameters updated from the expert model to the main expert model according to a preset period to train the main expert model.

上述进一步方案的有益效果为：使用专家规则库交互式调优调整主专家模型，使其更符合电力专家的专业判断。使用环境反馈调整从专家模型，使其更符合电网运行的实际情况。定期将从专家模型的更新反馈到主专家模型，确保主专家模型能够逐渐吸收各个电力专家的知识和经验。The beneficial effects of the above further scheme are: using the expert rule base to interactively tune and adjust the master expert model to make it more consistent with the professional judgment of power experts. Using environmental feedback to adjust the slave expert model to make it more consistent with the actual situation of power grid operation. Regularly feeding back the updates of the slave expert model to the master expert model to ensure that the master expert model can gradually absorb the knowledge and experience of various power experts.

进一步地：所述S234中，对主专家模型进行训练的损失函数Loss的表达式具体为：Further: In S234, the expression of the loss function Loss for training the main expert model is specifically:

式中，为专家规则库对第i调整后的电力调度方案的专家确信度评分，为从专家模型对第i调整后的电力调度方案的预测评分，N为调度方案的数量。In the formula, is the expert confidence score of the expert rule base for the i-th adjusted power dispatching plan, is the predicted score of the ith adjusted power dispatch plan from the expert model, and N is the number of dispatch plans.

进一步地：所述S3包括以下分步骤：Further: S3 includes the following sub-steps:

S31、将虚拟仿真电网环境中的观测值输入调度模型，生成电力调度方案，其中，观测值包括节点电压、线路电流、发电机输出功率和负荷需求；S31, inputting the observed values in the virtual simulation power grid environment into the dispatching model to generate a power dispatching plan, wherein the observed values include node voltage, line current, generator output power and load demand;

S32、将电力调度方案发送至主专家模型，生成优化后的电力调度方案；S32, sending the power dispatching plan to the main expert model to generate an optimized power dispatching plan;

S33、通过专家规则库对优化后的电力调度方案进行故障业务分析，生成优化后的电力调度方案的反馈结果；S33, performing fault business analysis on the optimized power dispatching plan through an expert rule base, and generating feedback results of the optimized power dispatching plan;

S34、根据优化后的电力调度方案的反馈结果判断优化后的电力调度方案是否合格，若是，则将优化后的电力调度方案作为最终的电力调度方案；若否，则根据优化后的电力调度方案的反馈结果优化主专家模型，返回S32。S34. Determine whether the optimized power dispatching plan is qualified according to the feedback results of the optimized power dispatching plan. If so, use the optimized power dispatching plan as the final power dispatching plan; if not, optimize the main expert model according to the feedback results of the optimized power dispatching plan and return to S32.

本发明的有益效果为：The beneficial effects of the present invention are:

（1）本发明提供了基于深度强化学习的安全电力调度方法，通过对电力调度相关知识的学习，并且结合深度强化学习的特点和缺点，以有效地应对电力调度中的各种安全问题，本发明基于 Rainbow DQN 算法，提出了两种新颖的训练方式，分别是以基于对抗学习的方式训练对抗模型和调度模型来进行安全约束保障学习，以基于专家规则库交互式调优的方式训练得到主专家模型。本发明提出了一种基于深度强化学习的安全电力调度算法，马尔可夫决策过程的安全成本将电压安全成本、电流安全成本、频率安全成本考虑在内，提升了电网运行的安全性，理论分析表明该方法在安全性和调度性能方面具有潜在优势。(1) The present invention provides a safe power dispatching method based on deep reinforcement learning. By learning the knowledge related to power dispatching and combining the characteristics and disadvantages of deep reinforcement learning, the present invention can effectively deal with various safety issues in power dispatching. Based on the Rainbow DQN algorithm, the present invention proposes two novel training methods, namely, training the adversarial model and the dispatching model based on adversarial learning to perform safety constraint assurance learning, and training the main expert model based on the interactive tuning of the expert rule base. The present invention proposes a safe power dispatching algorithm based on deep reinforcement learning. The safety cost of the Markov decision process Taking voltage safety cost, current safety cost and frequency safety cost into consideration improves the safety of power grid operation. Theoretical analysis shows that this method has potential advantages in terms of safety and dispatch performance.

（2）本发明的目的在于基于电力调度中的各种安全标准与规则，结合电网系统的特点，将问题拓展为电力安全调度，包括负荷管理、故障检测与隔离、应急预案等，最后建立一个基于深度强化学习的安全电力调度算法。用于在用电高峰期通过需求响应、负荷转移和削峰填谷等措施，平衡供需，减少高峰负荷对系统的压力；还用于快速检测并隔离故障区域，防止故障扩展，减少对电力系统的影响；还用于制定应急预案，确保在突发事件发生时能够快速响应和恢复供电。(2) The purpose of this invention is to expand the problem into safe power dispatching, including load management, fault detection and isolation, emergency plans, etc., based on various safety standards and rules in power dispatching and in combination with the characteristics of the power grid system, and finally establish a safe power dispatching algorithm based on deep reinforcement learning. It is used to balance supply and demand and reduce the pressure of peak load on the system through demand response, load transfer and peak-to-valley filling during peak power consumption periods; it is also used to quickly detect and isolate fault areas, prevent fault expansion, and reduce the impact on the power system; it is also used to formulate emergency plans to ensure rapid response and power restoration in the event of an emergency.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为本发明的基于深度强化学习的安全电力调度方法流程图。FIG1 is a flow chart of a safe power dispatching method based on deep reinforcement learning according to the present invention.

图2为基于马尔可夫决策过程的对抗学习训练的示意图。FIG2 is a schematic diagram of adversarial learning training based on a Markov decision process.

图3为主从专家模型训练的示意图。FIG3 is a schematic diagram of the master-slave expert model training.

图4为电力调度方案生成及调整流程图。FIG4 is a flow chart of power dispatching plan generation and adjustment.

具体实施方式DETAILED DESCRIPTION

下面对本发明的具体实施方式进行描述，以便于本技术领域的技术人员理解本发明，但应该清楚，本发明不限于具体实施方式的范围，对本技术领域的普通技术人员来讲，只要各种变化在所附的权利要求限定和确定的本发明的精神和范围内，这些变化是显而易见的，一切利用本发明构思的发明创造均在保护之列。The specific implementation modes of the present invention are described below to facilitate those skilled in the art to understand the present invention. However, it should be clear that the present invention is not limited to the scope of the specific implementation modes. For those of ordinary skill in the art, as long as various changes are within the spirit and scope of the present invention as defined and determined by the attached claims, these changes are obvious, and all inventions and creations utilizing the concept of the present invention are protected.

如图1所示，在本发明的一个实施例中，基于深度强化学习的安全电力调度方法，包括以下步骤：As shown in FIG1 , in one embodiment of the present invention, a safe power dispatching method based on deep reinforcement learning includes the following steps:

所述S1具体为：The S1 is specifically:

式中，表示发电机输出功率的调整值，表示开关操作状态，如线路开/关，表示负荷调整；In the formula, Indicates the adjustment value of the generator output power, Indicates switch operation status, such as line on/off, Indicates load adjustment;

式中，表示供电可靠性，用于衡量电压稳定性、频率偏差，表示第i个节点的电压，表示电网中各节点的额定电压，其是一个固定值，代表电网的正常运行电压，N表示电网中节点的总数，表示第j条线路的电流，表示第j条线路的最大允许电流，M表示电网中线路的总数，表示经济效益，用于衡量发电成本，为第 k台发电机的成本系数，G表示发电机总数，表示第k台发电机输出功率的调整值，表示环境效益，用于衡量污染物排放，为第k台发电机的排放系数，表示鲁棒性，用于衡量对故障的抵抗能力，为第l个故障的影响评估值，F表示故障的总数，表示第一指标权重，表示第二指标权重，表示第三指标权重，表示第四指标权重，指标权重可根据实际需求调整。In the formula, Indicates power supply reliability, used to measure voltage stability, frequency deviation, represents the voltage of the ith node, It represents the rated voltage of each node in the power grid. It is a fixed value and represents the normal operating voltage of the power grid. N represents the total number of nodes in the power grid. represents the current of the jth line, represents the maximum allowable current of the jth line, M represents the total number of lines in the power grid, Represents economic benefits, used to measure the cost of power generation, is the cost coefficient of the kth generator, G represents the total number of generators, represents the adjustment value of the output power of the kth generator, Represents environmental benefits and is used to measure pollutant emissions. is the emission coefficient of the kth generator, Represents robustness, which is used to measure the ability to resist failure. is the impact assessment value of the lth fault, F represents the total number of faults, represents the weight of the first indicator, represents the weight of the second indicator, represents the weight of the third indicator, Indicates the weight of the fourth indicator. The indicator weight can be adjusted according to actual needs.

如图2所示，在本实施例中，将训练调度模型和对抗模型的任务建模为一个强化学习问题，即马尔可夫决策过程，马尔可夫决策过程定义的策略、动作空间、观察空间和奖励函数等强化学习关键组成部分如下：策略是指基于模型在虚拟仿真电网环境中的观测，由调度模型或者对抗模型生成电力调度方案；其中为对抗策略，为调度策略，为对抗策略输出的动作，为调度策略输出的动作，为对抗策略与环境交互的当前状态、动作、奖励、下一步状态，,,,)为调度策略与环境交互的当前状态、动作、奖励、下一步状态；动作空间包含了所有可能的电力调度动作的排列组合；观察空间则包括节点电压、线路电流、发电机输出功率、负荷需求等电网信息；奖励函数基于电网运行的目标，如供电可靠性：确保电网在供电过程中不发生故障，如过载、欠压等；经济效益：尽量减少发电成本，优化发电机组的输出；环境效益：尽量减少污染排放；鲁棒性：能够抵御一定程度的扰动和攻击，保持正常运行。As shown in FIG2 , in this embodiment, the task of training the scheduling model and the adversarial model is modeled as a reinforcement learning problem, namely, a Markov decision process. The key components of reinforcement learning, such as the strategy, action space, observation space, and reward function defined by the Markov decision process, are as follows: The strategy refers to the power scheduling plan generated by the scheduling model or the adversarial model based on the observation of the model in the virtual simulated power grid environment; To counter the strategy, is the scheduling strategy, is the action output by the adversarial strategy, The action output by the scheduling strategy, The current state, action, reward, and next state of the interaction between the adversarial strategy and the environment. , , , ) is the current state, action, reward, and next state of the interaction between the dispatching strategy and the environment; the action space contains all possible combinations of power dispatching actions; the observation space includes grid information such as node voltage, line current, generator output power, load demand, etc.; the reward function is based on the goals of grid operation, such as power supply reliability: ensuring that the grid does not fail during the power supply process, such as overload, undervoltage, etc.; economic benefits: minimizing power generation costs and optimizing the output of generator sets; environmental benefits: minimizing pollution emissions; robustness: being able to withstand a certain degree of disturbance and attack and maintain normal operation.

通过Rainbow DQN 算法更新调度模型和对抗模型的网络参数的表达式具体为：The expression for updating the network parameters of the scheduling model and the adversarial model through the Rainbow DQN algorithm is as follows:

式中，为在具有原子的向量上的投影，表示离散支持上的原子数量，为目标分布，为KL散度，为时间差误差的概率，为调节对的影响程度的控制参数。在本实施例中，原子是用于表示回报分布的离散点，通过这些离散点，算法可以近似回报的分布，而不仅仅是期望回报。In the formula, For having Atomic vector The projection on represents the number of atoms on the discrete support, is the target distribution, is the KL divergence, is the probability of time difference error, To adjust right In this embodiment, atoms are discrete points used to represent the distribution of rewards, through which the algorithm can approximate the distribution of rewards, rather than just the expected rewards.

在本实施例中，通过Rainbow DQN 算法根据电力系统的特性，采用对抗学习和专家规则库交互式调优的方式训练调度模型，进一步提升电力调度方案的优化效果，其中Rainbow DQN 算法将强化学习社区对 DQN（Deep Q Learning）算法的六种独立的改进结合到了一起，并取得了不错的效果。In this embodiment, the Rainbow DQN algorithm is used to train the scheduling model according to the characteristics of the power system by using adversarial learning and interactive tuning of the expert rule base to further improve the optimization effect of the power scheduling scheme. The Rainbow DQN algorithm combines six independent improvements to the DQN (Deep Q Learning) algorithm by the reinforcement learning community and achieves good results.

所述S1中，所述S1中，马尔可夫决策过程的安全成本根据状态设置；In S1, the security cost of the Markov decision process in S1 according to Status settings;

安全成本的表达式具体为：Security costs The specific expression is:

式中，表示第一权重系数，表示第二权重系数，表示第三权重系数，反映各个安全指标对总安全成本的贡献程度，表示电压安全成本，用于反映电压偏离额定值的程度，对电网运行稳定性有直接影响，表示允许的电压偏差范围，表示电流安全成本，用于反映线路电流是否超过安全运行的限制，过大电流可能导致线路损坏，表示频率安全成本，用于衡量电网的频率偏差情况，频率失稳会影响设备的正常运行，表示当前电网频率，表示电网的额定频率，表示允许的频率偏差范围，表示最大值函数。In the formula, represents the first weight coefficient, represents the second weight coefficient, It represents the third weight coefficient, which reflects the contribution of each safety indicator to the total safety cost. It represents the voltage safety cost, which is used to reflect the degree to which the voltage deviates from the rated value and has a direct impact on the stability of power grid operation. Indicates the allowable voltage deviation range, Indicates the current safety cost, which is used to reflect whether the line current exceeds the limit of safe operation. Excessive current may cause line damage. Indicates the frequency safety cost, which is used to measure the frequency deviation of the power grid. Frequency instability will affect the normal operation of the equipment. Indicates the current grid frequency. Indicates the rated frequency of the power grid, Indicates the allowable frequency deviation range, Represents the maximum value function.

在本实施例中，安全成本与 St+1 状态有关，比如某个输电线路在 t+1 时刻突然中断，cost 的成本就会非常大。因此可以提前去做 Worst-case 的训练。通常采用对抗学习的方式来对 N-1 线路中断 Worst-case 进行安全约束保障学习。通过这种对抗训练，调度模型可以在各种 Worst-case 场景中学习到如何优化调度策略，从而提前掌握应对 N-1 线路中断等极端情况下的安全保障措施。最终目标是使电力调度在面临最不利的情况下，仍然能够保持安全运行，满足电网的安全性要求。对调度模型和对抗模型进行对抗学习训练的目标函数可以用如下的极小极大（Minimax）优化问题表示：In this embodiment, the security cost It is related to the St+1 state. For example, if a transmission line is suddenly interrupted at time t+1, the cost will be very high. Therefore, the worst-case training can be performed in advance. Adversarial learning is usually used to learn the safety constraints for the worst-case of N-1 line interruption. Through this adversarial training, the dispatching model can learn how to optimize the dispatching strategy in various worst-case scenarios, so as to master the safety measures for extreme situations such as N-1 line interruption in advance. The ultimate goal is to enable power dispatching to maintain safe operation and meet the safety requirements of the power grid in the most unfavorable conditions. The objective function of adversarial learning training for the dispatching model and the adversarial model can be expressed as the following Minimax optimization problem:

式中，为当前状态；为调度模型的动作；对抗模型的动作，即生成的扰动，如线路中断；为执行调度和扰动后，系统转移到的状态；为调度模型的策略；为对抗模型的策略，和均通过状态空间获取，通过动作空间获取。In the formula, is the current state; To schedule the actions of the model; Actions against the model, i.e. generated perturbations, such as line interruptions; The state that the system transfers to after executing the scheduling and disturbance; The strategy for scheduling the model; To combat the model, and All are obtained through the state space, Obtained through action space.

最大化部分表示对抗模型尝试寻找最不利的扰动，例如模拟 N-1 中断场景，寻找对电网安全性影响最大的线路中断组合。这部分是为了提前预见最坏的情况。Maximize the section Indicates that the adversarial model tries to find the most unfavorable perturbation , for example, simulating N-1 outage scenarios to find the line outage combination that has the greatest impact on grid security. This is partly to foresee the worst-case scenario in advance.

最小化部分表示调度模型学习如何在对抗模型生成的最不利情景下调整调度策略，使得系统的安全成本最小化，即在最不利的情况下仍能保证电网安全运行。Minimize the part It means that the dispatch model learns how to adjust the dispatch strategy under the most unfavorable scenario generated by the adversarial model to minimize the safety cost of the system, that is, to ensure the safe operation of the power grid under the most unfavorable circumstances.

期望表示所有可能状态转移的平均情况，综合考虑不同的扰动和调度组合，评估整体性能。expect It represents the average of all possible state transitions, comprehensively considering different perturbations and scheduling combinations to evaluate the overall performance.

所述S2包括以下分步骤：The S2 comprises the following sub-steps:

S22、初始化主专家模型和从专家模型，本实施例采用调度模型来初始化专家模型，确保专家模型和调度模型具有相同的基础知识。这减少了知识不匹配可能导致的问题，如幻觉和不一致性。S22, initializing the master expert model and the slave expert model. In this embodiment, the scheduling model is used to initialize the expert model to ensure that the expert model and the scheduling model have the same basic knowledge, which reduces problems that may be caused by knowledge mismatch, such as hallucinations and inconsistencies.

如图3所示，在本实施例中，本发明采用基于电力专家构建的专家规则库交互式调优的方式训练主专家模型，每个从专家模型对应一个电力专家，主专家模型用于统一更新和管理，进一步提升电力调度方案的优化效果。As shown in FIG3 , in this embodiment, the present invention trains the master expert model by interactive tuning based on the expert rule base constructed by power experts. Each slave expert model corresponds to a power expert. The master expert model is used for unified updating and management to further improve the optimization effect of the power dispatching scheme.

所述S23包括以下分步骤：The S23 comprises the following sub-steps:

其中，专家规则库由电力专家设置，其包括电网各类特征信息，如：电网故障发生时电网实际运行参数、运行环境、警示信息和故障设备信息，专家规则库用于对电力调度方案进行故障业务分析，判断故障发生时的异常电网运行参数、异常环境、异常警示信息以及异常设备，在满足任意异常信息造成电网设备故障的情况下，预测电力调度方案是否会形成电网故障。Among them, the expert rule base is set up by power experts, which includes various characteristic information of the power grid, such as: the actual operating parameters, operating environment, warning information and faulty equipment information of the power grid when a power grid fault occurs. The expert rule base is used to perform fault business analysis on the power dispatching plan, and judge the abnormal power grid operating parameters, abnormal environment, abnormal warning information and abnormal equipment when the fault occurs. When any abnormal information causes a power grid equipment failure, it is predicted whether the power dispatching plan will cause a power grid failure.

在本实施例中，使用专家规则库交互式调优调整主专家模型，使其更符合电力专家的专业判断。使用环境反馈调整从专家模型，使其更符合电网运行的实际情况。定期将从专家模型的更新反馈到主专家模型，确保主专家模型能够逐渐吸收各个电力专家的知识和经验。In this embodiment, the master expert model is interactively tuned using the expert rule base to make it more consistent with the professional judgment of power experts. The slave expert model is adjusted using environmental feedback to make it more consistent with the actual situation of power grid operation. The updates of the slave expert model are regularly fed back to the master expert model to ensure that the master expert model can gradually absorb the knowledge and experience of various power experts.

所述S234中，对主专家模型进行训练的损失函数Loss的表达式具体为：In S234, the expression of the loss function Loss for training the main expert model is specifically:

本实施例中，专家模型的训练目标是最大专家规则库所选择的确信度与专家模型的预测之间的一致性，通过最小化相关损失函数来实现。In this embodiment, the training goal of the expert model is to maximize the consistency between the confidence selected by the expert rule base and the prediction of the expert model, which is achieved by minimizing the relevant loss function.

所述S3包括以下分步骤：The S3 comprises the following sub-steps:

在本实施例中，采用基于专家模型优化调度模型输出的电力调度方案，将优化后的电力调度方案通过专家规则库进行故障业务分析。如果合格，则该方案成为最终的电力调度方案。如果不合格，则根据反馈结果进一步训练和优化专家模型，以提高其在未来任务中的表现。In this embodiment, the power dispatching scheme outputted by the expert model optimization dispatching model is adopted, and the optimized power dispatching scheme is subjected to fault business analysis through the expert rule base. If qualified, the scheme becomes the final power dispatching scheme. If unqualified, the expert model is further trained and optimized according to the feedback results to improve its performance in future tasks.

本发明的有益效果为：本发明提供了基于深度强化学习的安全电力调度方法，通过对电力调度相关知识的学习，并且结合深度强化学习的特点和缺点，以有效地应对电力调度中的各种安全问题，本发明基于 Rainbow DQN 算法，提出了两种新颖的训练方式，分别是以基于对抗学习的方式训练对抗模型和调度模型来进行安全约束保障学习，以基于专家规则库交互式调优的方式训练得到主专家模型。本发明提出了一种基于深度强化学习的安全电力调度算法，马尔可夫决策过程的安全成本将电压安全成本、电流安全成本、频率安全成本考虑在内，提升了电网运行的安全性，理论分析表明该方法在安全性和调度性能方面具有潜在优势。The beneficial effects of the present invention are as follows: the present invention provides a safe power dispatching method based on deep reinforcement learning, which effectively addresses various safety issues in power dispatching by learning relevant knowledge about power dispatching and combining the characteristics and disadvantages of deep reinforcement learning. The present invention is based on the Rainbow DQN algorithm and proposes two novel training methods, namely, training the adversarial model and the dispatching model based on adversarial learning to perform safety constraint assurance learning, and training the main expert model based on the interactive tuning of the expert rule base. The present invention proposes a safe power dispatching algorithm based on deep reinforcement learning, and the safety cost of the Markov decision process Taking voltage safety cost, current safety cost and frequency safety cost into consideration improves the safety of power grid operation. Theoretical analysis shows that this method has potential advantages in terms of safety and dispatch performance.

本发明的目的在于基于电力调度中的各种安全标准与规则，结合电网系统的特点，将问题拓展为电力安全调度，包括负荷管理、故障检测与隔离、应急预案等，最后建立一个基于深度强化学习的安全电力调度算法。用于在用电高峰期通过需求响应、负荷转移和削峰填谷等措施，平衡供需，减少高峰负荷对系统的压力；还用于快速检测并隔离故障区域，防止故障扩展，减少对电力系统的影响；还用于制定应急预案，确保在突发事件发生时能够快速响应和恢复供电。The purpose of this invention is to expand the problem into safe power dispatching, including load management, fault detection and isolation, emergency plans, etc., based on various safety standards and rules in power dispatching, combined with the characteristics of the power grid system, and finally establish a safe power dispatching algorithm based on deep reinforcement learning. It is used to balance supply and demand and reduce the pressure of peak load on the system through demand response, load transfer and peak-to-valley filling during peak power consumption; it is also used to quickly detect and isolate fault areas, prevent fault expansion, and reduce the impact on the power system; it is also used to formulate emergency plans to ensure rapid response and power restoration in the event of an emergency.

在本发明的描述中，需要理解的是，术语“中心”、“厚度”、“上”、“下”、“水平”、“顶”、“底”、“内”、“外”、“径向”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的设备或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本发明的限制。此外，术语“第一”、“第二”、“第三”仅用于描述目的，而不能理解为指示或暗示相对重要性或隐含指明的技术特征的数量。因此，限定由“第一”、“第二”、“第三”的特征可以明示或隐含地包括一个或者更多个该特征。In the description of the present invention, it is necessary to understand that the orientation or positional relationship indicated by the terms "center", "thickness", "upper", "lower", "horizontal", "top", "bottom", "inner", "outer", "radial", etc. is based on the orientation or positional relationship shown in the drawings, and is only for the convenience of describing the present invention and simplifying the description, rather than indicating or implying that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore cannot be understood as a limitation on the present invention. In addition, the terms "first", "second", and "third" are used only for descriptive purposes, and cannot be understood as indicating or implying the relative importance or the number of implicitly specified technical features. Therefore, the features defined by "first", "second", and "third" may explicitly or implicitly include one or more of the features.

Claims

1. The safe power scheduling method based on deep reinforcement learning is characterized by comprising the following steps of:

S1, training a scheduling model according to countermeasure learning by adopting a Markov decision process;

s2, training a main expert model based on an expert rule base interactive tuning mode;

And S3, outputting a final power dispatching scheme according to the main expert model optimization dispatching model.

2. The safe power scheduling method based on deep reinforcement learning according to claim 1, wherein S1 specifically is:

Defining action space, state space and rewarding function through Markov decision process, simulating the behavior of the scheduling model and the countermeasure model in virtual simulation power grid environment, updating network parameters of the scheduling model and the countermeasure model through Rainbow DQN algorithm, and performing countermeasure learning training on the scheduling model;

The strategy defined by the Markov decision process is specifically: generating a power dispatching scheme through a dispatching model or an countermeasure model based on observed data in the virtual simulation power grid environment;

The observation space defined by the Markov decision process comprises the voltage value of the node, the current value of the line, the output power of the generator and the grid information of the power demand of the load, and the state space The expression of (2) is specifically:

in the formula, Representing the voltage values of all the nodes,Representing the current values of all the lines,Representing the output power of all of the generators,Representing the power requirements of all loads;

the action space defined by the Markov decision process comprises a permutation and combination of all power dispatching actions, and the action space The expression of (2) is specifically:

in the formula, Representing an adjustment value of the output power of the generator,Indicating the state of operation of the switch,Indicating load adjustment;

The markov decision process defines a reward function, specifically a grid-based goal, which includes power reliability, economic benefit, and robustness, and the expression of the reward function R _t is specifically:

in the formula, Represents the power supply reliability, is used for measuring voltage stability and frequency deviation,Representing the voltage at the i-th node,Representing the rated voltage of each node in the power grid, N representing the total number of nodes in the power grid,Representing the current of the j-th line,Represents the maximum allowable current of the jth line, M represents the total number of lines in the grid,Represents economic benefit, is used for measuring the power generation cost,And G represents the total number of generators,Represents an adjustment value of the output power of the kth generator,Represents environmental benefit, is used for measuring pollutant emission,Is the emission coefficient of the kth generator,Robustness is represented, which is used to measure the resistance to faults,For the impact evaluation value of the first failure, F represents the total number of failures,A first index weight is represented as a first index weight,A second index weight is represented as a function of the second index weight,A third index weight is represented as a function of the first index weight,Representing a fourth index weight.

3. The safe power scheduling method based on deep reinforcement learning according to claim 2, wherein the expression for updating the network parameters of the scheduling model and the countermeasure model by the Rainbow DQN algorithm is specifically:

in the formula, To have the following characteristicsVector of atomsThe projection onto the surface of the lens,Representing the number of atoms on the discrete support,In order to achieve a distribution of the objects,For the KL divergence, the average value of the power supply is calculated,As a probability of a time difference error,To adjustFor a pair ofA control parameter of the degree of influence of (a).

4. A method of secure power scheduling based on deep reinforcement learning according to claim 3, wherein in S1 the security cost of the markov decision processAccording toSetting a state;

safety cost The expression of (2) is specifically:

in the formula, A first weight coefficient is represented and a second weight coefficient is represented,A second weight coefficient is represented and is used to represent,A third weight coefficient is represented and is used to represent,Representing a voltage safety cost, for reflecting the extent to which the voltage deviates from the nominal value,Indicating the range of voltage deviations allowed,Represents a current safety cost, for reflecting whether the line current exceeds the limit of safe operation,Represents the frequency safety cost, is used for measuring the frequency deviation condition of the power grid,Representing the current grid frequency,Representing the nominal frequency of the electrical network,Indicating the range of allowed frequency deviations,Representing the maximum function.

5. The deep reinforcement learning-based safe power scheduling method according to claim 4, wherein the objective function for performing the countermeasure learning training on the scheduling model is specifically the following formula:

in the formula, The current state is indicated and the current state is indicated,The actions of the scheduling model are represented,The disturbance is indicated as such,Indicating the state to which the system transitions after performing scheduling and perturbation,The policy of the scheduling model is represented,The strategy of the challenge model is represented,AndAre all acquired through the state space,Acquiring through an action space;

6. The safe power scheduling method based on deep reinforcement learning according to claim 1, wherein S2 comprises the following sub-steps:

S21, setting a master expert model and a slave expert model, wherein the master expert model comprises basic knowledge and preliminary optimization capability of power dispatching, and the slave expert model is generated according to a copy of the master expert model;

s22, initializing a master expert model and a slave expert model;

S23, optimizing output of the slave expert model through an expert rule base, and further training the master expert model.

7. The deep reinforcement learning-based safe power scheduling method according to claim 6, wherein S23 comprises the following sub-steps:

S231, outputting an initial power dispatching scheme through a dispatching model, and inputting the initial power dispatching scheme into a slave expert model to obtain an adjusted power dispatching scheme;

S232, performing fault service analysis on the adjusted power dispatching scheme through an expert rule base, and generating a feedback result of the adjusted power dispatching scheme;

S233, sending a feedback result to the slave expert model through the virtual simulation power grid environment, and updating model parameters of the slave expert model according to the feedback result;

s234, feeding back model parameters updated from the expert model to the main expert model according to a preset period, and training the main expert model.

8. The safe power scheduling method based on deep reinforcement learning according to claim 7, wherein in S234, the expression of the Loss function Loss for training the main expert model is specifically:

in the formula, Scoring expert certainty of the ith adjusted power scheduling scheme for an expert rule base,For the predictive scoring of the ith adjusted power schedule from the expert model, N is the number of schedule.

9. The deep reinforcement learning-based safe power scheduling method according to claim 7, wherein S3 comprises the following sub-steps:

s31, inputting an observed value in a virtual simulation power grid environment into a scheduling model to generate a power scheduling scheme, wherein the observed value comprises node voltage, line current, generator output power and load demand;

S32, sending the power dispatching scheme to a main expert model, and generating an optimized power dispatching scheme;

s33, performing fault service analysis on the optimized power dispatching scheme through an expert rule base, and generating a feedback result of the optimized power dispatching scheme;

S34, judging whether the optimized power dispatching scheme is qualified according to a feedback result of the optimized power dispatching scheme, and if so, taking the optimized power dispatching scheme as a final power dispatching scheme; if not, the main expert model is optimized according to the feedback result of the optimized power dispatching scheme, and the S32 is returned.