
CN118554555A - MASAC algorithm-based distributed photovoltaic voltage reactive power control method for power distribution network - Google Patents


Info

Publication number: CN118554555A
Application number: CN202410467035.XA
Authority: CN (China)
Prior art keywords: voltage, agent, distribution network, strategy
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 吴浩, 邹斌, 杨金明, 陶金, 戴亮, 董庆森, 韩禹, 李季, 鞠秋萍
Original and current assignees: Taizhou Kaitai Electric Power Design Co ltd; Jiangsu Xiangtai Electric Power Industry Co ltd; Taizhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Application filed 2024-04-18 by Taizhou Kaitai Electric Power Design Co ltd, Jiangsu Xiangtai Electric Power Industry Co ltd and Taizhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority to CN202410467035.XA (priority date 2024-04-18)
Publication of CN118554555A: 2024-08-27, pending legal status

Links

Classifications

    • H02J3/50: Controlling the sharing of the out-of-phase component (under H02J3/00, circuit arrangements for AC mains or AC distribution networks; H02J3/38, arrangements for feeding a single network in parallel by two or more generators, converters or transformers; H02J3/46, controlling the sharing of output between the generators, converters, or transformers)
    • H02J13/00004: Circuit arrangements for providing remote indication of network conditions or remote control of switching means in a power distribution network, characterised by the power network being locally controlled
    • H02J2203/20: Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
    • H02J2300/24: Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation, the renewable source being solar energy of photovoltaic origin
    • Y02E40/30: Reactive power compensation (technologies for efficient electrical power generation, transmission or distribution)


Abstract

The invention discloses a distributed photovoltaic voltage reactive power control method for a power distribution network based on the MASAC algorithm, belonging to the fields of power system automation technology and artificial-intelligence reinforcement learning. First, a decentralized voltage/reactive power control framework for the distribution network that accounts for distributed PV is constructed, and the decentralized Volt/VAR control problem of the distribution network is formulated as a Markov game model. A MASAC algorithm is then constructed to solve the Markov game model: Actor and Critic neural networks are built for each agent and trained in a centralized manner, yielding a distributed photovoltaic voltage/reactive power control model for the distribution network. The trained model is then used for online control of the distribution network voltage, realizing decentralized regulation of the photovoltaic inverters. The invention reduces communication requirements and computational burden, improves grid voltage stability, and is widely applicable and flexible.

Description

A distributed photovoltaic voltage and reactive power control method for distribution networks based on the MASAC algorithm

Technical Field

The present invention belongs to the fields of power system automation technology and artificial-intelligence reinforcement learning, and specifically relates to a distributed photovoltaic voltage/reactive power control method for a distribution network based on the MASAC algorithm.

Background Art

As the penetration of distributed energy resources, especially PV (photovoltaic) generation, continues to increase, these resources are becoming ever more important for meeting growing electricity demand and protecting the environment. However, their unpredictability and volatility pose many technical challenges for system operators. Of particular concern is the overvoltage caused by reverse current under light-load conditions when PV penetration is excessive. Volt/VAR control (VVC) is an effective tool for improving the voltage profile of the distribution network: it adjusts the reactive power set-points of conventional reactive power sources or newly added inverters to regulate the distribution network voltage, exploiting the reactive power absorption/injection capability of capacitor banks and smart inverters. From the perspective of control strategy, voltage regulation methods fall into four categories: centralized control, local control, distributed control, and decentralized control. Centralized control requires fast communication channels and is therefore costly; local and distributed control are constrained by the need for coordination among agents, which may be infeasible in many applications. Decentralized control combines the advantages of the distributed and centralized approaches by using zonal control together with inter-zone coordination. Traditional decentralized control models, however, require the system topology and parameters, which are difficult to obtain in real distribution systems, especially where large numbers of in-home PV units are installed. Developing a decentralized control method that does not rely on accurate distribution network parameters would therefore overcome these challenges.

Deep reinforcement learning (DRL), one of the most widely used machine-learning-based approaches, has attracted broad attention because it can learn optimal control policies. In DRL, the relationship between actions and states is learned through continuous interaction with the environment, which reduces the dependence on complete knowledge of the system parameters. A well-trained DRL agent can provide adaptive actions for new dynamics encountered in the real world. Existing studies have applied DRL-based Volt/VAR control frameworks to voltage regulation in distribution networks. However, these methods require communication links and centralized processing to make decisions based on the system state, so they may not be suitable for large-scale grids with thousands of PV units. A multi-agent deep reinforcement learning (MADRL) method using the multi-agent deep deterministic policy gradient (MADDPG) algorithm has also been proposed for voltage regulation; centralized processing is confined to the training phase, and the decentralized execution phase removes the need for real-time communication among the numerous distributed resources. However, such methods explore inefficiently and are not well suited to multi-agent decision-making in large-scale power systems.

Summary of the Invention

In view of the problems in the prior art, the present invention provides a distributed photovoltaic voltage/reactive power control method for distribution networks based on the MASAC algorithm, which is better suited to photovoltaic regulation in large-scale power systems.

To solve the above technical problems, the present invention provides the following technical solution: a distributed photovoltaic voltage/reactive power control method for distribution networks based on the MASAC algorithm, comprising the following steps:

S1. Construct a decentralized voltage/reactive power control framework for the distribution network that accounts for distributed photovoltaics, and formulate the decentralized Volt/VAR control problem of the distribution network as a Markov game model;

The decentralized voltage/reactive power control framework of the distribution network comprises: minimizing the active power loss of the distribution network over a period of time as the objective, the distributed photovoltaic inverters as the decision variables, and a preset voltage range as the constraint;

The Markov game model includes: a state space, the set of local observations formed by the net active/reactive power injections and the voltage magnitudes at the photovoltaic inverters belonging to each agent; an action space, the set of actions formed by the reactive power outputs of the photovoltaic inverters controlled by all agents; a reward function, constructed from the active power loss cost and a voltage-violation penalty; and a state transition process, in which the states of the agents obey the power flow constraints of the distribution network and are updated according to the state transition probability distribution;

S2. Construct the MASAC algorithm to solve the Markov game model, and build Actor and Critic neural networks for each agent; the Actor network determines the agent's policy, and the Critic network evaluates the value of the policy;

The neural networks are trained in a centralized manner to obtain a distributed photovoltaic voltage/reactive power control model for the distribution network; the trained model is then used for online control of the distribution network voltage, realizing decentralized regulation of the photovoltaic inverters.

Furthermore, in the aforementioned step S1, the objective of minimizing the active power loss of the distribution network over a period of time is expressed by constructing the objective function as follows:
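A form consistent with the symbol definitions below is given here as a reconstruction (the decision variables being the inverter reactive outputs Q_PV, as stated above; the original equation figure may differ in presentation):

    \min_{Q_{PV}} \; \sum_{t=1}^{T} P_{loss}(t)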

where T denotes the optimization horizon and P_loss(t) denotes the active network loss at time t.

Furthermore, in the aforementioned step S1, the constraint condition is:
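Consistent with the definitions below (reconstruction), the node-voltage constraint reads:

    V_{min} \le V_k(t) \le V_{max}, \quad \forall k,\ \forall t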

where V_k(t) is the voltage of photovoltaic inverter k at time t, and V_min and V_max are the preset lower and upper voltage limits, respectively.

Furthermore, in the aforementioned step S1, the state space S is the set of local observations s_i,t of all photovoltaic inverters at time t, where s_i,t is the local observation of agent i at time t, s_i,t = (p_i, q_i, v_i), and p_i, q_i and v_i denote, respectively, the net active/reactive power injections and the node voltage magnitude at the photovoltaic inverter where agent i is located;

The action space A is the set of actions a_i,t formed by the reactive power outputs of the photovoltaic inverters controlled by all agents at time t, where Q_PV,i,t is the reactive output of the PV inverter controlled by agent i at time t.

Furthermore, in the aforementioned step S1, the reward function is as follows:
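A form consistent with the definitions below is given here as a reconstruction; the sign convention (costs entering with negative sign so that the agents maximize reward by reducing loss and violations) is an assumption:

    R(t) = -\sigma_1 P_{loss}(t) - \sigma_2 \sum_{k} f\left(V_k(t)\right), \qquad
    f\left(V_k(t)\right) = \begin{cases} 0, & V_{min} \le V_k(t) \le V_{max} \\ 1, & \text{otherwise} \end{cases}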

where R(t) is the reward at time t, P_loss(t) is the active power loss at time t, the function f is a 0-1 indicator that equals 0 when the voltage V_k(t) of node k lies within the limits V_min and V_max and 1 otherwise, σ_1 is the unit active-power-loss cost, and σ_2 is the voltage-violation penalty factor.

Furthermore, in the aforementioned step S1, the state transition process uses the PYPOWER power flow calculation tool to build the distribution network environment, and the runpf function is used to perform the power flow calculation; the power flow constraints include the power balance constraints and the branch flow constraints. The state transition probability distribution of the agents is P(s'|s,a), i.e., the probability that, after the agents take action a_t in the current state s_t, the environment transitions from s_t to s'_t under the effect of a_t.

Furthermore, the aforementioned step S2 comprises the following sub-steps:

S201. Construct the actor network of each agent based on the Actor network; the policy of each agent's actor network is as follows:
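In the notation defined below, the decentralized per-agent policy can be written (reconstruction) as:

    a_{i,t} \sim \pi_{\phi_i}\left(\,\cdot \mid s_{i,t}\right)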

where a_i,t is the action taken by each agent at time t, determined by its Actor network; i is the index of the agent; the state vector of agent i at time t is denoted s_i,t; and the policy of each agent, denoted π_φi, is a policy based on a squashed (compressed) Gaussian distribution;

S202. Each agent iteratively updates its policy based on maximizing the trade-off between the expected return and the policy entropy; the entropy H(π) of the joint policy π(a_t|s_t) is as follows:
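A form consistent with the definitions below, treating the joint entropy as the sum of the local policy entropies (reconstruction), is:

    H(\pi) = \sum_{i=1}^{N} H(\pi_i), \qquad H(\pi_i) = \mathbb{E}_{a_{i,t} \sim \pi_i}\left[-\log \pi_i(a_{i,t} \mid s_{i,t})\right]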

where H(π_i) is the entropy of each local policy, representing the randomness of the policy and quantifying the uncertainty in the system, and N is the number of agents;

S203. In the policy evaluation stage, the Critic network parameters θ are trained to reduce the Bellman residual:
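The standard soft actor-critic form of this objective, consistent with the symbols defined below, is (reconstruction; the exact expression in the original figure may differ):

    J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[\tfrac{1}{2}\left(Q_\theta(s_t, a_t) - \left(r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[V_\theta(s_{t+1})\right]\right)\right)^{2}\right]

where the soft value of the next state is V_\theta(s_{t+1}) = \mathbb{E}_{a_{t+1} \sim \pi}\left[Q_\theta(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\right].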

where J_Q(θ) is the objective function of the Critic network parameters θ, and the network parameters are trained by minimizing this function; E denotes the expectation over the state-action pairs produced by the current policy, computed under the distribution of the current state s_t and action a_t; D is the experience replay buffer, which stores previous experience for training; Q(s_t,a_t) is the action-value function; γ is the discount factor used to compute the present value of future rewards, with a value between 0 and 1; r(s_t,a_t) is the immediate reward obtained by taking action a_t in state s_t; V_θ is the value estimate of the next state s_{t+1} given by the value-function network parameterized by θ; and α is the temperature parameter, an entropy regularization coefficient that trades off reward against entropy to encourage exploration.

The parameters of the Critic network are optimized using the stochastic policy gradient, where r is the immediate reward value and φ_i is the policy parameter of each agent;

S204. In the policy-making stage, the Actor network objective function is as follows:
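The standard maximum-entropy policy improvement objective, consistent with the symbols defined below, is (reconstruction):

    \pi^{*} = \arg\max_{\pi'}\ \mathbb{E}_{s_t \sim D,\, a_t \sim \pi'}\left[Q(s_t, a_t) - \alpha \log \pi'(a_t \mid s_t)\right]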

where π* denotes the optimal joint policy, Q(s_t,a_t) denotes the action-value function, α denotes the temperature parameter, and π' is the target policy;

S205. The policy of each agent is trained by minimizing the expected entropy of the actions produced by its actor network;

The policy parameters φ_i of each agent are updated by stochastic gradient descent, and α is updated according to:
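The standard temperature objective, consistent with the target entropy H' defined below, is (reconstruction; minimizing J(α) adapts the temperature toward the target entropy):

    J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[-\alpha \log \pi_t(a_t \mid s_t) - \alpha H'\right]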

where H' is the target entropy, an equivalent vector composed of hyperparameters;

Compared with the prior art, the beneficial technical effects of the above technical solution of the present invention are as follows:

1. Reduced communication requirements and computational burden: the multi-agent deep reinforcement learning method of the present invention can be executed in a decentralized manner, significantly reducing the communication requirements among the agents. In complex power systems containing a large number of distributed energy resources in particular, this advantage relieves the computational burden of centralized approaches and thus improves the overall efficiency and reliability of the system.

2. Improved grid voltage stability: by coordinating the reactive power set-points of the photovoltaic inverters, the present invention effectively improves the voltage profile of the distribution network and enhances its voltage stability. This is particularly important for coping with the intermittency and volatility of solar generation.

3. Wide applicability and flexibility: the present invention does not rely on a system model, so it can be applied flexibly to a variety of distribution network configurations without detailed knowledge of the system topology or parameters. This broadens the scope of application, especially for distribution networks where accurate system data are difficult to obtain.

4. An optimized reinforcement learning algorithm: the developed MASAC algorithm has strong exploration capability and can effectively find the best actions for the agents. Compared with conventional maximum-entropy soft Q-learning methods, the present invention avoids potential complexity and instability problems and enhances the stability and reliability of the algorithm.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a schematic diagram of the training results of the MASAC algorithm provided by an embodiment.

FIG. 3 is a schematic diagram of the test results of the MASAC algorithm provided by an embodiment.

FIG. 4 is a schematic diagram of the robustness test results of the reactive voltage control provided by an embodiment.

FIG. 5 is a schematic diagram of the voltages at all nodes under the reactive voltage control provided by an embodiment of the present invention.

DETAILED DESCRIPTION

In order to better understand the technical content of the present invention, specific embodiments are described below in conjunction with the accompanying drawings.

Aspects of the present invention are described herein with reference to the accompanying drawings, in which a number of illustrative embodiments are shown. The embodiments of the present invention are not limited to those shown in the drawings. It should be understood that the present invention can be realized by any of the various concepts and embodiments introduced above and described in detail below, since the disclosed concepts and embodiments are not limited to any particular implementation. In addition, some aspects disclosed herein may be used alone or in any appropriate combination with other disclosed aspects.

As shown in FIG. 1, the present invention provides a distributed photovoltaic voltage/reactive power control method for a distribution network based on the MASAC algorithm, comprising the following steps:

S1. Construct a decentralized voltage/reactive power control framework for the distribution network that accounts for distributed photovoltaics, and formulate the decentralized Volt/VAR control problem of the distribution network as a Markov game model;

The decentralized voltage/reactive power control framework of the distribution network comprises: minimizing the active power loss of the distribution network over a period of time as the objective, the distributed photovoltaic inverters as the decision variables, and a preset voltage range as the constraint;

The Markov game model includes: a state space, the set of local observations formed by the net active/reactive power injections and the voltage magnitudes at the photovoltaic inverters belonging to each agent; an action space, the set of actions formed by the reactive power outputs of the photovoltaic inverters controlled by all agents; a reward function, constructed from the active power loss cost and a voltage-violation penalty; and a state transition process, in which the states of the agents obey the power flow constraints of the distribution network and are updated according to the state transition probability distribution.

(1) Objective function: the goal of Volt/VAR control is to minimize the active power loss of the distribution network over a period of time while ensuring that node voltages do not exceed their limits; the objective function is the one given in step S1 above.

Here, T denotes the optimization horizon and P_loss(t) denotes the active network loss at time t.

(2) Decision variables: the objects of decentralized Volt/VAR control in the distribution network are the distributed PV inverters; decentralized control of the distribution network is achieved by adjusting the reactive power output Q_PV of the distributed PV inverters.

(3) Constraints: the most important constraint in decentralized Volt/VAR control of the distribution network is the node voltage constraint: during control, the node voltages must be kept within the specified limits V_min and V_max.

(4) State space: the state space S represents the states of all agents at time t, and s_i,t represents the local observation of agent i at time t, s_i,t = (p_i, q_i, v_i), where p_i, q_i and v_i denote, respectively, the net active/reactive power injections and the voltage magnitude at the local node where agent i is located; S is the set of all node states.

(5) Action space: the action space A represents the set of actions of all agents at time t, and a_i,t represents the reactive power output Q_PV,i,t of the PV inverter controlled by agent i at time t.

(6) Reward function: the reward function measures the quality of the actions taken by the agents, and all agents share the same reward function; it consists of two parts, the active power loss cost and the voltage-violation penalty, as given in step S1 above.

In that expression, R(t) is the reward at time t, P_loss(t) is the active power loss at time t, and the function f is a 0-1 indicator that equals 0 when the voltage V_k(t) of node k satisfies the limits V_min and V_max and 1 otherwise; the upper and lower voltage magnitude limits are set to 1.05 and 0.95 (p.u.), σ_1 is the unit active-power-loss cost, and σ_2 is the voltage-violation penalty factor.

(7) State transition process: after the agents' actions are applied to the environment, the state transition process strictly follows the power flow constraints of the distribution network, including the power balance constraints and branch flow constraints. The present invention uses the PYPOWER power flow tool to build the distribution network environment and the runpf function to perform the power flow calculation; this function automatically satisfies the power balance and flow constraints when solving the power flow. The state transition probability distribution of the agents is denoted P(s'|s,a), the probability that, after the agents take action a_t in the current state s_t, the environment transitions from s_t to s'_t under the effect of a_t.
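A minimal sketch of such an environment step using PYPOWER is given below. It models each PV inverter as a negative load at its bus, which is one common modelling choice rather than the patent's exact implementation; the case14 network, the bus numbers and the σ_1/σ_2 weights are placeholders only (the embodiment uses a modified IEEE 34-bus feeder).

    import numpy as np
    from pypower.api import case14, ppoption, runpf
    from pypower.idx_bus import BUS_I, PD, QD, VM
    from pypower.idx_gen import PG

    V_MIN, V_MAX = 0.95, 1.05            # preset voltage limits (p.u.)
    SIGMA1, SIGMA2 = 1.0, 100.0          # sigma_1, sigma_2: assumed weights

    def step(base_ppc, pv_buses, pv_p_mw, pv_q_mvar):
        """Apply the agents' PV injections, run a power flow with runpf,
        and return the local observations and the shared reward."""
        ppc = {k: (v.copy() if isinstance(v, np.ndarray) else v) for k, v in base_ppc.items()}
        row = {int(n): i for i, n in enumerate(ppc['bus'][:, BUS_I])}
        for b, p, q in zip(pv_buses, pv_p_mw, pv_q_mvar):
            ppc['bus'][row[b], PD] -= p   # PV active output modelled as negative load
            ppc['bus'][row[b], QD] -= q   # agent action a_i,t: inverter reactive set-point
        result, ok = runpf(ppc, ppoption(VERBOSE=0, OUT_ALL=0))
        bus = result['bus']
        p_loss = result['gen'][:, PG].sum() - bus[:, PD].sum()   # total active loss (MW)
        v = bus[:, VM]
        violations = np.sum((v < V_MIN) | (v > V_MAX))           # nodes outside [V_MIN, V_MAX]
        reward = -SIGMA1 * p_loss - SIGMA2 * violations
        # local observation s_i,t = (p_i, q_i, v_i), read back from the solved case
        obs = [np.array([-bus[row[b], PD], -bus[row[b], QD], bus[row[b], VM]]) for b in pv_buses]
        return obs, float(reward), bool(ok)

    obs, r, ok = step(case14(), pv_buses=[10, 14], pv_p_mw=[0.5, 0.3], pv_q_mvar=[0.1, -0.05])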

S2. Construct the MASAC algorithm to solve the Markov game model, and build Actor and Critic neural networks for each agent; the Actor network determines the agent's policy, and the Critic network evaluates the value of the policy;

The neural networks are trained in a centralized manner to obtain a distributed photovoltaic voltage/reactive power control model for the distribution network; the trained model is then used for online control of the distribution network voltage, realizing decentralized regulation of the photovoltaic inverters.

In the present invention, the main innovation of the multi-agent MASAC lies in its centralized-training, decentralized-execution procedure. During training, the Critic networks are trained centrally using global information, whereas during execution each agent uses only its local observation as input and formulates its own control policy in a decentralized manner, generating continuous actions with a squashed Gaussian distribution function. In the proposed method, the policies are trained to maximize the trade-off between entropy and expected return, which helps avoid premature convergence, a requirement for reaching the global optimum. The policy of each agent's actor network in the MASAC framework can be expressed as in step S201 above:

where a_i,t is the action taken by each agent at time t, determined by its Actor network; i is the index of the agent; the state vector of agent i at time t is denoted s_i,t; and the policy of each agent, denoted π_φi, is a policy based on a squashed Gaussian distribution.
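A minimal PyTorch sketch of such a squashed-Gaussian (tanh) actor is shown below; the layer sizes and the action scaling are assumptions, not values taken from the patent.

    import torch
    import torch.nn as nn

    class SquashedGaussianActor(nn.Module):
        """Maps a local observation s_i = (p_i, q_i, v_i) to a tanh-squashed
        Gaussian action (the inverter reactive set-point) and its log-probability."""
        def __init__(self, obs_dim=3, act_dim=1, hidden=64, act_limit=1.0):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, act_dim)
            self.log_std = nn.Linear(hidden, act_dim)
            self.act_limit = act_limit    # scale of |Q_PV|; an assumed normalization

        def forward(self, obs):
            h = self.body(obs)
            mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
            dist = torch.distributions.Normal(mu, log_std.exp())
            u = dist.rsample()                     # reparameterized sample
            a = torch.tanh(u)                      # squash into (-1, 1)
            # change-of-variables correction for the tanh squashing
            logp = dist.log_prob(u).sum(-1) - torch.log(1.0 - a.pow(2) + 1e-6).sum(-1)
            return self.act_limit * a, logp

    actor = SquashedGaussianActor()
    action, logp = actor(torch.tensor([[0.4, 0.1, 1.02]]))   # one local observation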

Each agent has its own independent policy, which is updated in each iteration to maximize the trade-off between the expected return and the policy entropy. The policy entropy represents the randomness of the policy and quantifies the uncertainty in the system. The entropy H(π) of the joint policy π(a_t|s_t) is the sum of the local policy entropies, as given above (step S202).

Here, H(π_i) is the entropy of each local policy and N is the number of agents;

In the policy evaluation stage, the Critic network parameters θ are trained to reduce the Bellman residual J_Q(θ) given above (step S203).

Here, J_Q(θ) is the objective function of the Critic network parameters θ, which are trained by minimizing this function; E denotes the expectation over the state-action pairs produced by the current policy, computed under the distribution of the current state s_t and action a_t; D is the experience replay buffer, which stores previous experience (states, actions, rewards, etc.) for training; Q(s_t,a_t) is the action-value function; γ is the discount factor, used to compute the present value of future rewards, with a value between 0 and 1; r(s_t,a_t) is the immediate reward obtained by taking action a_t in state s_t; V_θ is the value estimate of the next state s_{t+1} given by the value-function network parameterized by θ; and α is the temperature parameter, an entropy regularization coefficient that trades off reward against entropy to encourage exploration.

During optimization, the parameters of the Critic network are optimized using the stochastic policy gradient, where r is the immediate reward value and φ_i is the policy parameter of each agent.

In the policy-making stage, the Actor network objective of the MASAC algorithm is as given above (step S204).

Here, π' is the target policy, π* denotes the optimal joint policy, Q(s_t,a_t) denotes the action-value function, and α denotes the temperature parameter; the policy of each agent is parameterized by φ_i and is trained to reduce the expected entropy.

Specifically, the policy of each agent is trained with the objective of minimizing the expected entropy of the actions produced by its actor network.

The policy parameters φ_i of each agent are updated by stochastic gradient descent; finally, α is updated using the temperature objective given above (step S205).

Here, H' is the target entropy, an equivalent vector composed of hyperparameters. Actor and Critic neural networks are trained for all agents, and the minimum of the Q functions is taken in the objective function to reduce overestimation of the state values.
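The following PyTorch sketch illustrates one centralized-training update consistent with this description (twin centralized critics, the minimum of the two Q estimates, an entropy-regularized actor loss and an adaptive temperature). The network class, the batch layout, the hyperparameters and the optimizer/actor objects passed in (e.g. instances of the squashed-Gaussian actor sketched above) are assumptions, not the patent's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CentralCritic(nn.Module):
        """Centralized Q(s_t, a_t): takes the joint observation and joint action of all agents."""
        def __init__(self, joint_obs_dim, joint_act_dim, hidden=128):
            super().__init__()
            self.q = nn.Sequential(nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        def forward(self, s, a):
            return self.q(torch.cat([s, a], dim=-1)).squeeze(-1)

    def masac_update(actors, critics, target_critics, actor_opts, critic_opts,
                     log_alpha, alpha_opt, batch, gamma=0.99, tau=0.005, target_entropy=-1.0):
        """One update: critic step on the soft Bellman residual, actor step, temperature step."""
        s, a, r, s2 = batch                  # s, s2: [B, 3N]; a: [B, N]; r: [B]
        n, alpha = len(actors), log_alpha.exp()
        # Critic step: y = r + gamma * (min_j Q_targ_j(s', a') - alpha * log pi(a'|s'))
        with torch.no_grad():
            outs = [actors[i](s2[:, 3 * i:3 * i + 3]) for i in range(n)]
            a2 = torch.cat([o[0] for o in outs], dim=-1)
            logp2 = torch.stack([o[1] for o in outs], dim=-1).sum(-1)
            q_t = torch.min(target_critics[0](s2, a2), target_critics[1](s2, a2))
            y = r + gamma * (q_t - alpha * logp2)
        for q, opt in zip(critics, critic_opts):
            loss_q = F.mse_loss(q(s, a), y)
            opt.zero_grad(); loss_q.backward(); opt.step()
        # Actor step: maximize min_j Q_j(s, a_new) - alpha * log pi(a_new|s)
        outs = [actors[i](s[:, 3 * i:3 * i + 3]) for i in range(n)]
        a_new = torch.cat([o[0] for o in outs], dim=-1)
        logp = torch.stack([o[1] for o in outs], dim=-1).sum(-1)
        q_pi = torch.min(critics[0](s, a_new), critics[1](s, a_new))
        loss_pi = (alpha.detach() * logp - q_pi).mean()
        for opt in actor_opts: opt.zero_grad()
        loss_pi.backward()
        for opt in actor_opts: opt.step()
        # Temperature step toward the target entropy H'
        loss_alpha = -(log_alpha * (logp.detach() + target_entropy)).mean()
        alpha_opt.zero_grad(); loss_alpha.backward(); alpha_opt.step()
        # Polyak averaging of the target critics
        with torch.no_grad():
            for q, q_targ in zip(critics, target_critics):
                for p, p_t in zip(q.parameters(), q_targ.parameters()):
                    p_t.mul_(1.0 - tau).add_(tau * p)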

The MASAC method proposed in the present invention optimizes the reactive power output of the agents to regulate the node voltages of the distribution network, with each PV unit treated as one agent. During the training phase, the action of each agent is provided to the centralized Critic network to compute the reward, which is sent back to the agents for policy updating; the offline training procedure concludes with the following deployment step:

Step 4: deploy the reinforcement learning agents trained in step 3 and complete the online control of the distribution network voltage in a distributed execution manner, realizing decentralized regulation of each PV unit.

Once the agents are properly trained, they take only their local states as input and act without communicating with a centralized controller; the decentralized, distributed online execution procedure is as follows:
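A minimal sketch of this decentralized execution loop is given below (it stands in for the patent's own table); get_local_obs and apply_setpoint are assumed helper functions for reading the local measurement and sending the set-point to inverter i.

    import torch

    def online_control_step(actors, get_local_obs, apply_setpoint):
        """Each trained actor maps only its local observation to a reactive set-point."""
        for i, actor in enumerate(actors):
            obs_i = torch.as_tensor(get_local_obs(i), dtype=torch.float32)   # s_i,t = (p_i, q_i, v_i)
            with torch.no_grad():
                a_i, _ = actor(obs_i)          # sampled action; the policy mean could also be used
            apply_setpoint(i, float(a_i))      # send Q_PV,i,t to inverter i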

To evaluate the performance of the proposed voltage control framework, simulation experiments were carried out on a modified IEEE 34-bus test system. Twelve aggregated photovoltaic inverters were added at different nodes, with a total generation capacity of 1576 kW. Since the maximum load demand is 1756 kW, the maximum solar PV generation is about 90% of the total peak load. The performance of the decentralized agents was tested under load and PV profiles different from those of the training data set. In addition, the reactive power of the inverters is controlled so that their operating power factor is not lower than the manufacturer-recommended 0.9. PYPOWER, interfaced with Python, was used as the power flow solver and as the learning and testing environment.

Each agent was trained for 500 episodes to learn the optimal control policy, i.e., the optimal behaviour for handling voltage-violation scenarios. Both the Actor and Critic networks are fully connected neural networks consisting of an input layer, an output layer and hidden layers, whose parameters are shown in the table below.

In the initial stage of training, the agents randomly explore the decision space of the environment and, as shown in FIG. 2, eventually converge and provide the optimal actions. After the training stage, each trained agent needs only its local state to provide the optimal action for the voltage regulation problem. These simulation experiments verify the effectiveness and adaptability of the decentralized agents under different load and PV profiles.

FIG. 3 shows the voltage fluctuation at a node of the test system under the trained model and under the base case. It can be observed that in the base-case scenario there are voltage violations with respect to the standard voltage limits, whereas under the control of the proposed MASAC algorithm there are none. In addition, the voltage varies more in the base case than under the proposed trained model.

FIG. 4 shows the voltage fluctuation of the tested node under the trained model and under the base scenario; the results show that the proposed method performs well. Finally, FIG. 5 shows the voltages of all 34 nodes at the 20th minute. The results show that the proposed trained model yields a better voltage profile, in terms of voltage deviation and violations, than the base case.

Although the present invention has been described above with reference to preferred embodiments, they are not intended to limit the invention. A person skilled in the art may make various modifications and improvements without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be as defined by the claims.

Claims (7)

1. A distributed photovoltaic voltage reactive power control method for a power distribution network based on the MASAC algorithm, characterized by comprising the following steps:
S1, constructing a decentralized voltage reactive power control framework of the power distribution network that accounts for distributed photovoltaics, and converting the decentralized voltage reactive power control problem of the power distribution network into a Markov game model;
the decentralized voltage reactive power control framework of the distribution network comprises: taking minimization of the active power loss of the power distribution network over a period of time as the objective, taking the distributed photovoltaic inverters as the decision variables, and taking a preset voltage range as the constraint condition;
the Markov game model comprises: a state space, namely a set of local observations formed by the net active/reactive power injections and voltage magnitudes of the photovoltaic inverters comprised in each agent; an action space, namely a set of actions formed by the reactive power outputs of the photovoltaic inverters controlled by all agents; a reward function, constructed from the active power loss cost and a voltage out-of-limit penalty; and a state transition process, in which the states of the agents follow the power flow calculation constraints of the power distribution network and are updated according to the state transition probability distribution;
S2, constructing the MASAC algorithm to solve the Markov game model, and constructing an Actor and a Critic neural network for each agent, wherein the Actor neural network determines the policy of the agent and the Critic neural network is used for judging the value of the policy;
training the neural networks in a centralized training manner to obtain a distributed photovoltaic voltage reactive power control model of the power distribution network, then using the model to realize on-line control of the power distribution network voltage, and realizing decentralized regulation and control of the photovoltaic inverters.
2. The MASAC-algorithm-based distributed photovoltaic voltage reactive power control method for a power distribution network according to claim 1, wherein in step S1, minimizing the active power loss of the power distribution network over a period of time consists in constructing the objective function as follows:
where T represents the optimization time period and P_loss(t) represents the active network loss at time t.
3. The MASAC-algorithm-based distributed photovoltaic voltage reactive power control method for a power distribution network according to claim 1, wherein in step S1, the constraint condition is:
where V_k(t) is the voltage of photovoltaic inverter k at time t, and V_min and V_max are respectively the preset lower and upper voltage limits.
4. The MASAC-algorithm-based distributed photovoltaic voltage reactive power control method for a power distribution network according to claim 1, wherein in step S1, the state space S is a set of local observations s_i,t of all photovoltaic inverters at time t, s_i,t being the local observation of agent i at time t, s_i,t = (p_i, q_i, v_i), where p_i, q_i and v_i respectively represent the net active/reactive power injection and the node voltage magnitude of the photovoltaic inverter where agent i is located;
the action space A is a set of actions a_i,t formed by the reactive power outputs of the photovoltaic inverters controlled by all agents at time t, and Q_PV,i,t is the reactive output of the PV inverter controlled by agent i at time t.
5. The MASAC-algorithm-based distributed photovoltaic voltage reactive power control method for a power distribution network according to claim 1, wherein in step S1, the reward function is as follows:
where R(t) is the reward value at time t, P_loss(t) is the active power loss at time t, the function f is a 0-1 discriminant function which is 0 when the voltage V_k(t) of node k lies within the lower limit V_min and the upper limit V_max and 1 otherwise, σ_1 is the unit active power loss cost, and σ_2 is the voltage out-of-limit penalty factor.
6. The MASAC-algorithm-based distributed photovoltaic voltage reactive power control method for a power distribution network according to claim 1, wherein in step S1, the state transition process uses the PYPOWER power flow calculation tool to build the environment of the power distribution network, and the runpf function is used to perform the power flow calculation, the power flow calculation constraints comprising power balance constraints and power flow constraints; the state transition probability distribution of the agents is P(s'|s,a), which indicates the probability that, after an agent takes action a_t according to the current state s_t, the environment transitions from s_t to s'_t under the action a_t.
7. The MASAC-algorithm-based distributed photovoltaic voltage reactive power control method for a power distribution network according to claim 1, wherein step S2 comprises the following sub-steps:
S201, constructing an actor network for each agent based on the Actor network, wherein the policy of the actor network of each agent is as follows:
where a_i,t is the action taken by each agent at a particular time t and is determined by the Actor network; i represents the index of the agent; the state vector of agent i at time t is denoted s_i,t; and the policy of each agent, denoted π_φi, is a policy based on a squashed Gaussian distribution;
S202, each agent iteratively updating its policy based on maximizing the expected return and the entropy of the policy, the entropy H(π) of the joint policy π(a_t|s_t) being as follows:
where H(π_i) is the entropy of each local policy, representing the randomness of the policy and quantifying the uncertainty in the system, and N is the number of agents;
S203, in the policy evaluation stage, training the Critic network parameters θ to reduce the Bellman residual:
where J_Q(θ) is the objective function of the Critic network parameters θ, the network parameters being trained by minimizing this function; E denotes the expectation over the state-action pairs generated by the current policy, calculated under the distribution of the current state s_t and action a_t; D is the experience replay buffer, which stores previous experience for training; Q(s_t,a_t) represents the action value function; γ is the discount factor used to calculate the present value of future rewards, with a value between 0 and 1; r(s_t,a_t) is the immediate reward obtained by taking action a_t in state s_t; V_θ is the value estimate of the next state s_{t+1} given by the value function network parameterized by θ; and α is the temperature parameter, an entropy regularization coefficient that balances reward and entropy to encourage exploration;
the parameters of the Critic network are optimized by using the stochastic policy gradient, as follows:
where:
r is the immediate reward value and φ_i is the policy parameter of each agent;
S204, in the policy-making stage, the Actor network objective function being expressed as follows:
where π* represents the optimal joint policy, Q(s_t,a_t) represents the action value function, α represents the temperature parameter, and π' is the target policy;
S205, the policy of each agent being trained by minimizing the expected entropy of the actions generated by its actor network, as shown in the following formula:
the policy parameters φ_i of each agent being updated by a stochastic gradient descent method, and α being updated as follows:
where H' is the target entropy, the target entropy being an equivalent vector consisting of hyperparameters;
S206, training the Actor and Critic neural networks for all agents, and taking the minimum value of the Q functions in the objective function.
CN202410467035.XA 2024-04-18 2024-04-18 MASAC algorithm-based distributed photovoltaic voltage reactive power control method for power distribution network Pending CN118554555A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410467035.XA CN118554555A (en) 2024-04-18 2024-04-18 MASAC algorithm-based distributed photovoltaic voltage reactive power control method for power distribution network


Publications (1)

Publication Number Publication Date
CN118554555A true CN118554555A (en) 2024-08-27

Family

ID=92448800


Country Status (1)

Country Link
CN (1) CN118554555A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113872213A (en) * 2021-09-09 2021-12-31 国电南瑞南京控制系统有限公司 Power distribution network voltage autonomous optimization control method and device
CN115483703A (en) * 2022-09-14 2022-12-16 云南电网有限责任公司昆明供电局 Multi-region collaborative reactive power optimization method for distribution network based on multi-agent reinforcement learning
CN117200213A (en) * 2023-09-13 2023-12-08 浙江工业大学 Distribution system voltage control method based on deep reinforcement learning of self-organizing map neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAOTIAN LIU et al.: "Online Multi-Agent Reinforcement Learning for Decentralized Inverter-Based Volt-VAR Control", IEEE Transactions on Smart Grid, vol. 12, no. 04, 31 July 2021 (2021-07-31), pages 2980-2990 *
JU YUNTAO et al.: "Coordinated Active and Reactive Power Optimal Dispatch of Microgrid Clusters Based on Distributed Deep Reinforcement Learning" (基于分布式深度强化学习的微网群有功无功协调优化调度), Automation of Electric Power Systems (电力系统自动化), vol. 47, no. 01, 10 January 2023 (2023-01-10), pages 115-125 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120582137A (en) * 2025-08-01 2025-09-02 南京辉强新能源科技有限公司 A voltage control method for distribution networks based on multi-agent deep reinforcement learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination