
CN117318169A - Active distribution network dispatching method based on deep reinforcement learning and considering new energy consumption - Google Patents

Active distribution network dispatching method based on deep reinforcement learning and considering new energy consumption

Info

Publication number
CN117318169A
Authority
CN
China
Prior art keywords
data
power
energy storage
distribution network
new energy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311228253.XA
Other languages
Chinese (zh)
Inventor
闫凯文
李武
杨慧
高奇
谢筱囡
李家辉
张大龙
田瑛
刘喆男
马振华
惠得材
马越
刘海南
刘江
段逸斐
赵彧
韩彦军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Ningxia Electric Power Co Ltd
Shizuishan Power Supply Co of State Grid Ningxia Electric Power Co Ltd
Original Assignee
State Grid Ningxia Electric Power Co Ltd
Shizuishan Power Supply Co of State Grid Ningxia Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Ningxia Electric Power Co Ltd, Shizuishan Power Supply Co of State Grid Ningxia Electric Power Co Ltd filed Critical State Grid Ningxia Electric Power Co Ltd
Priority to CN202311228253.XA priority Critical patent/CN117318169A/en
Publication of CN117318169A publication Critical patent/CN117318169A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for AC mains or AC distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/46Controlling of the sharing of output between the generators, converters, or transformers
    • H02J3/466Scheduling the operation of the generators, e.g. connecting or disconnecting generators to meet a given demand
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for AC mains or AC distribution networks
    • H02J3/003Load forecast, e.g. methods or systems for forecasting future load demand
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for AC mains or AC distribution networks
    • H02J3/004Generation forecast, e.g. methods or systems for forecasting future energy generation
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for AC mains or AC distribution networks
    • H02J3/28Arrangements for balancing of the load in a network by storage of energy
    • H02J3/32Arrangements for balancing of the load in a network by storage of energy using batteries with converting means
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for AC mains or AC distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/381Dispersed generators
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2113/00Details relating to the application field
    • G06F2113/04Power grid distribution networks
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20Simulating, e g planning, reliability check, modelling or computer assisted design [CAD]
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2300/00Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
    • H02J2300/20The dispersed energy generation being of renewable origin
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2300/00Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
    • H02J2300/40Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation wherein a plurality of decentralised, dispersed or local energy generation technologies are operated simultaneously


Abstract

The invention provides an active distribution network dispatching method based on deep reinforcement learning that takes new energy consumption into account, and belongs to the technical field of power system energy dispatching. The method includes: establishing a simulation model according to the regional distribution network structure; obtaining historical photovoltaic output, wind power output, village electricity load and factory electricity load data of the regional distribution network, and applying the VMD data preprocessing technique, the PCC correlation analysis algorithm and the BiLSTM model to predict the next 24 hours of data as a training data set; defining the simulation model required for the Markov decision process in reinforcement learning; training the simulation model repeatedly with the 24-hour real-time prediction data; and applying the trained agent to distribution network dispatching, where at each dispatch moment the corresponding state space S(t) is fed into the Q neural network of the Rainbow algorithm to obtain the Q function of every executable action in the current state space, which controls the action strategy of the energy storage equipment.

Description

Active distribution network dispatching method based on deep reinforcement learning and considering new energy consumption

Technical Field

The invention relates to the technical field of power system energy dispatching, and in particular to an active distribution network dispatching method based on deep reinforcement learning that takes new energy consumption into account.

Background

With rapid socio-economic development, the demand for energy is increasing day by day. The heavy use of traditional energy sources such as coal not only depletes their reserves but also causes serious environmental pollution. Distributed generation from new energy sources can effectively reduce dependence on traditional energy and carry part of the power-supply burden. The main new energy sources are wind and photovoltaic power; while they offer environmental benefits, their generation output is unstable owing to weather changes and other factors. A regional distribution network integrates distributed energy systems, energy storage systems and loads, and is connected to the active distribution network to balance supply and demand. Commonly used dispatching schemes for regional distribution networks include traditional methods, heuristic methods and reinforcement-learning-based methods. Traditional methods include nonlinear programming, quadratic programming and Newton's method; they are computationally simple but struggle with complex problems. Heuristic methods include genetic algorithms, simulated annealing and particle swarm optimization; they suffer from algorithmic complexity, model dependence and a tendency to fall into local optima. Dispatching strategies based on reinforcement learning imitate the human learning process and search for the optimal policy through continuous interaction with the environment, and currently achieve good application results. However, the above methods are strongly affected by the dynamics of the power system and the intermittency of new energy, so the dispatch results deviate from actual conditions. At the same time, because new energy output is unstable, wind and solar curtailment is becoming increasingly serious; therefore, when dispatching a regional distribution network, not only the cost-effectiveness of electricity prices but also the consumption of new energy must be considered.

Summary of the Invention

In view of this, the invention provides an active distribution network dispatching method based on deep reinforcement learning that takes new energy consumption into account. A learning model is constructed to implement the optimal dispatch of energy storage equipment, reduce the impact of the intermittency and uncertainty of new energy, and achieve full and reasonable consumption of the electricity produced by new energy.

The technical solution adopted by the embodiments of the invention to solve the technical problem is as follows:

An active distribution network dispatching method based on deep reinforcement learning that takes new energy consumption into account, including:

Step S1: establish a simulation model according to the regional distribution network structure; obtain historical photovoltaic output data, wind power output data, village electricity load data and factory electricity load data of the regional distribution network; and apply the VMD data preprocessing technique, the PCC correlation analysis algorithm and the BiLSTM model to predict the photovoltaic output, wind power output, village electricity load and factory electricity load for the next 24 hours as the training data set of the reinforcement learning dispatch model.

Step S2: define the simulation model required for the Markov decision process in reinforcement learning.

Step S3: use the Rainbow algorithm to train the simulation model defined in step S2, divide the 24 hours into 24 dispatch moments, and train repeatedly with the 24-hour real-time prediction data until the final reward function converges.

Step S4: apply the agent trained in step S3 to regional distribution network dispatch; at each dispatch moment, feed the corresponding state space S(t) into the Q neural network of the Rainbow algorithm to obtain the Q function of every executable action in the current state space, compare the Q functions of the actions to select the optimal action, and thereby control the action strategy of the energy storage equipment.

Preferably, step S1 includes:

Step S11: obtain the four data sequences $X(t)_l$ and perform data-cleaning preprocessing, where $l \in \{1,2,3,4\}$ indicates the type of data sequence; the types are photovoltaic output data, wind power output data, village electricity load data and factory electricity load data.

Step S12: apply the data preprocessing technique to perform VMD decomposition on each cleaned data sequence $X(t)_l$, obtaining $K$ intrinsic mode components $\mathrm{IMF}_k^{t,l}$, where $k$ denotes the $k$-th mode component obtained by the VMD decomposition.

Step S13: perform PCC correlation analysis on the mode components, compute the correlation coefficients between the feature components and the original sequence, screen out the IMF components with low correlation, and extract the $K'$ IMF components with high correlation.

Step S14: apply the BiLSTM model to the $K'$ highly correlated IMF components, extract features and output the predicted components $\widehat{\mathrm{IMF}}_k^{t,l}$.

Step S15: superimpose the predicted components to obtain the prediction data $\hat{X}(t)_l = \sum_{k} \widehat{\mathrm{IMF}}_k^{t,l}$.

Preferably, step S12 includes:

Step S121: for the preprocessed data sequence $X(t)_l$, assume that each mode has a finite bandwidth around a centre frequency, and search for $K$ modes such that the sum of the estimated bandwidths of the modes is minimal. The model is:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\}, \quad \text{s.t.} \ \sum_{k=1}^{K} u_k(t) = X(t)_l$$

where $K$ is the number of mode components to be decomposed, $\{u_k\}$ and $\{\omega_k\}$ are the $k$-th decomposed mode component and its centre frequency, $\delta(t)$ is the Dirac function, $*$ is the convolution operation, $\partial_t$ denotes differentiation with respect to time $t$, $j$ is the imaginary unit, $\|\cdot\|_2$ is the 2-norm, and s.t. denotes the constraint.

Step S122: to solve the model of step S121, introduce a penalty factor $\alpha$ and a Lagrange multiplier $\lambda$ to convert the constrained problem into an unconstrained one, giving the augmented Lagrangian:

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| X(t)_l - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, X(t)_l - \sum_{k=1}^{K} u_k(t) \right\rangle$$

The parameters $u_k$, $\omega_k$ and $\lambda$ are updated iteratively by the alternating direction method of multipliers:

$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{X}_l(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}(\omega)/2}{1 + 2\alpha(\omega - \omega_k)^2}, \qquad \omega_k^{n+1} = \frac{\int_0^{\infty} \omega\, |\hat{u}_k^{n+1}(\omega)|^2 \, d\omega}{\int_0^{\infty} |\hat{u}_k^{n+1}(\omega)|^2 \, d\omega}, \qquad \hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega) + \gamma\left[\hat{X}_l(\omega) - \sum_{k} \hat{u}_k^{n+1}(\omega)\right]$$

where $\hat{X}_l(\omega)$, $\hat{u}_i(\omega)$, $\hat{\lambda}(\omega)$ and $\hat{u}_k^{n+1}(\omega)$ are the Fourier transforms of $X(t)_l$, $u_i(t)$, $\lambda(t)$ and $u_k^{n+1}(t)$, $n$ is the iteration index, and $\gamma$ is the noise tolerance.

Step S123: for a given accuracy $e > 0$, stop the iteration if $\sum_{k}\dfrac{\|\hat{u}_k^{n+1}-\hat{u}_k^{n}\|_2^2}{\|\hat{u}_k^{n}\|_2^2} < e$; otherwise return to step S122.

Step S124: when the iteration ends, $K$ components $u_k$ are obtained, namely the IMF components, denoted $\mathrm{IMF}_k^{t,l}$, $k = 1, \dots, K$.

Preferably, step S13 includes:

Step S131: use the Pearson correlation coefficient (PCC) to perform correlation analysis on each mode component $\mathrm{IMF}_i^{t,l}$, taking the data sequence $X(t)_l$ as the target value, and compute the correlation coefficient $r_i$:

$$r_i = \frac{\mathrm{Cov}\left(\mathrm{IMF}_i^{t,l},\, X(t)_l\right)}{\sqrt{\mathrm{Var}\left[\mathrm{IMF}_i^{t,l}\right]\,\mathrm{Var}\left[X(t)_l\right]}}$$

where $\mathrm{IMF}_i^{t,l}$ is the $i$-th component, $i = 1, 2, \dots, K$; $\mathrm{Cov}(\mathrm{IMF}_i^{t,l}, X(t)_l)$ is the covariance of $\mathrm{IMF}_i^{t,l}$ and $X(t)_l$; and $\mathrm{Var}[\mathrm{IMF}_i^{t,l}]$ and $\mathrm{Var}[X(t)_l]$ are their variances.

Step S132: according to the correlation analysis results $r_i$ of step S131, select $K'$ highly correlated IMF components, with $K' \le K$.

Preferably, the selection in step S132 of the $K'$ highly correlated IMF components according to the correlation analysis results $r_i$ is performed as follows:

select the IMF components corresponding to the first $K'$ values of $r_i$, ordered from high to low.

Preferably, in step S14 the BiLSTM model processes the $K'$ highly correlated IMF components, extracts features and obtains the predicted components; the BiLSTM cell at time $t$ is:

Input gate: $i_t = \mathrm{sigmoid}\left(W_{xi}\, \mathrm{IMF}_k^{t,l} + W_{hi}\, h_{t-1} + b_i\right)$

Forget gate: $f_t = \mathrm{sigmoid}\left(W_{xf}\, \mathrm{IMF}_k^{t,l} + W_{hf}\, h_{t-1} + b_f\right)$

Memory cell: $c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc}\, \mathrm{IMF}_k^{t,l} + W_{hc}\, h_{t-1} + b_c\right)$

Output gate: $o_t = \mathrm{sigmoid}\left(W_{xo}\, \mathrm{IMF}_k^{t,l} + W_{ho}\, h_{t-1} + b_o\right)$

Hidden state: $h_t = o_t \odot \tanh(c_t)$

where $i_t$ is the input gate, $f_t$ the forget gate, $c_t$ the memory cell, $o_t$ the output gate and $h_t$ the hidden state; tanh and sigmoid are activation functions; $W_{hi}$, $W_{hf}$, $W_{hc}$, $W_{ho}$ are the weight coefficients of $h_{t-1}$ in the feature extraction of the input gate, forget gate, memory cell and output gate respectively; $W_{xi}$, $W_{xf}$, $W_{xc}$, $W_{xo}$ are the weight coefficients of the input $\mathrm{IMF}_k^{t,l}$ in the feature extraction of the input gate, forget gate, memory cell and output gate respectively; $b_i$, $b_f$, $b_c$, $b_o$ are the corresponding bias values; $h_{t-1}$ is the hidden state at the previous moment; and $c_{t-1}$ is the memory cell before updating.

Preferably, step S2 includes:

Step S21: define the environment state space S(t), which consists of five parts: the 24-hour village electricity load data, factory electricity load data, wind power output data, photovoltaic output data, and the real-time state of charge of the energy storage equipment.

Step S22: define the action space of the agent, which contains three actions of the energy storage equipment: charging, discharging and idling.

Step S23: define the reward function used to control the actions of the energy storage equipment.

Step S24: the agent interacts with the main grid; the regional distribution network is connected to the main grid through a point of common coupling. When the total new energy output in the regional distribution network can meet the entire load demand and the energy storage equipment is full, the surplus new energy is fed back to the main grid; when the new energy output and the energy storage equipment cannot meet the entire load demand, electricity is purchased from the main grid.

Preferably, in step S21, at a future dispatch moment $t$ the agent obtains from the environment the photovoltaic output prediction $\hat{P}^{PV}_t$, the wind power output prediction $\hat{P}^{W}_t$, the village load prediction $\hat{P}^{V}_t$, the factory load prediction $\hat{P}^{F}_t$ and the state of charge $E_t$ of the energy storage equipment at time $t$; these five pieces of state information form the environment state space $S(t) = \{\hat{P}^{PV}_t, \hat{P}^{W}_t, \hat{P}^{V}_t, \hat{P}^{F}_t, E_t\}$, where $t$ is a dispatch moment within the next 24 hours.

In step S22, the action strategy set $A$ of the energy storage equipment is:

$$A = \{a_I, a_O, a_N\}$$

where $a_I$ is the charging action strategy, namely charging the energy storage equipment from the photovoltaic output, the wind power output or the main grid; $a_O$ is the discharging action strategy, namely discharging the energy storage equipment to the village electrical equipment, the factory electrical equipment or the main grid; and $a_N$ is the idle action strategy.

Further, under the physical constraints, the energy storage equipment is represented by a dynamic model:

$$E_{t+1} = \begin{cases} E_t + \zeta P_t \Delta t, & P_t > 0 \ (\text{charging}) \\ E_t + P_t \Delta t / \eta, & P_t < 0 \ (\text{discharging}) \end{cases}$$

where $E_t$ is the stored energy of the energy storage equipment at time $t$ and satisfies $E_{min} < E_t < E_{max}$, with $E_{max}$ and $E_{min}$ the maximum and minimum capacities of the energy storage equipment; $P_t$ is the charge/discharge power, with $P_t < 0$ meaning the equipment is discharging and $P_t > 0$ meaning it is charging; and $\zeta$ and $\eta$ are the charging efficiency and discharging efficiency respectively.

In step S23, under the physical constraints of the energy storage equipment, the reward function is built from a discharge reward factor $k_O$, a charge reward factor $k_I$ and a penalty factor $n$: actions that charge or discharge the equipment appropriately are rewarded through $k_I$ and $k_O$, and actions that violate the constraints are penalised through $n$.

In step S24 a grid power balance limit is set; the power balance relationship is:

$$P_{balance}(t) = P_{renew}(t) - P_{load}(t)$$

$$P_{grid}(t) = P_{balance}(t) + P_E(t)$$

where $P_{renew}(t)$ is the total new energy generation power in the regional distribution network at time $t$, and $P_{load}(t)$ is the total power consumed by the loads in the regional distribution network at time $t$; $P_{balance}(t)$ is the supply-demand difference between the total new energy generation and the total load consumption, with $P_{balance}(t) > 0$ indicating a surplus of new energy generation in the regional distribution network and $P_{balance}(t) < 0$ indicating a shortage; when $P_E(t) > 0$, $P_E(t)$ is the discharge power of the energy storage equipment, and when $P_E(t) < 0$, $P_E(t)$ is its charging power; $P_{grid}(t)$ is the power exchanged between the regional distribution network and the main grid, with a positive $P_{grid}(t)$ meaning the regional distribution network exports to the main grid and a negative $P_{grid}(t)$ meaning the main grid supplies the regional distribution network.

Preferably, step S3 includes:

Step S31: build a neural network with one hidden layer and two fully connected layers, and perturb all fully connected layer parameters with a Gaussian noise term, replacing the ε-greedy exploration of DQN. Adding noise to the fully connected layer parameters effectively strengthens the exploration capability of the algorithm, and the original forward computation $y = wx + b$ becomes:

$$y = (\mu_w + \sigma_w \odot N_w)\, x + \mu_b + \sigma_b \odot N_b$$

In this variant, the weight $w$ and bias $b$ of $y = wx + b$ are replaced by parameters drawn from normal distributions with means $\mu$ and standard deviations $\sigma$, perturbed by Gaussian noise that is resampled as a constant in every training round; $\mu_w$, $\sigma_w$, $N_w$, $\mu_b$, $\sigma_b$ and $N_b$ are all parameters of the layer.

Step S32: add a dueling network before the output-layer Q network, decomposing the Q function of the output layer into the sum of a value function V and an advantage function H, i.e. Q = V + H, where V is the reward value induced by the state and H is the reward value obtained after the energy storage equipment performs the charging, discharging or idle action. Because the Q network is constrained by the state, the V value is updated first and the H value is then adjusted. The Q function is:

$$Q(s_t, a_t; \theta, \omega, \upsilon) = V(s_t; \theta, \omega) + \left( H(s_t, a_t; \theta, \upsilon) - \frac{1}{|A|} \sum_{a'_t} H(s_t, a'_t; \theta, \upsilon) \right)$$

where $a_t$ is the action strategy, $\theta$ are the network-layer parameters of the Q function, $\omega$ are the value-function network-layer parameters, $\upsilon$ are the advantage-function network-layer parameters, the subtracted term is the mean of the advantage function, and $a'_t$ ranges over all possible actions in state $s_t$.

Step S33: build two Q networks as the output layer of the neural network, decoupling the action selection $a_t$ from the evaluation of the selected action; the first Q network selects the best action in the current state, and the second Q network evaluates the charge/discharge action.

Step S34: use a multi-step learning strategy and obtain immediate rewards by interacting with the environment; the multi-step return is:

$$R_t^{(n)} = \sum_{i=0}^{n-1} d^{\,i} r_{t+i}, \qquad y_t = R_t^{(n)} + d^{\,n} \max_{a'} Q(s_{t+n}, a'; \theta)$$

where $n$ is the step length, $\theta$ are the neural network parameters, $d$ is the discount rate and $R$ is the return value.

Step S35: use a prioritized experience replay pool (PER); define an experience pool of fixed capacity, put each tuple $(s_t, a_t, r_t, s_{t+1})$ produced by the agent's training into the pool, compute the training error $\delta_i$ of each tuple, and assign different error priorities when the data are fed back into the neural network for training. The sampling priority formulas are:

$$\rho_i = |\delta_i| + \varepsilon, \qquad P_i = \frac{\rho_i^{\beta}}{\sum_j \rho_j^{\beta}}, \qquad \delta_i = |Q(s_t, a_t) - Q'(s_t, a_t)|$$

where $P_i$ is the sampling priority, $\varepsilon$ is a noise factor that prevents the priority from being 0, $\beta$ is the annealing factor used to adjust the priority, and $\delta_i$ is the error produced by a set of experiences during training.

It can be seen from the above technical solution that the active distribution network dispatching method based on deep reinforcement learning and considering new energy consumption provided by the embodiments of the invention first establishes a simulation model according to the regional distribution network structure, obtains historical photovoltaic output, wind power output, village electricity load and factory electricity load data of the regional distribution network, and applies the VMD data preprocessing technique, the PCC correlation analysis algorithm and the BiLSTM model to predict the photovoltaic output, wind power output, village electricity load and factory electricity load of the next 24 hours as the training data set of the reinforcement learning dispatch model; it then defines the simulation model required for the Markov decision process in reinforcement learning, trains the simulation model defined in step S2 with the Rainbow algorithm, and applies the trained agent to distribution network dispatching, where at each dispatch moment the corresponding state space S(t) is fed into the Q neural network of the Rainbow algorithm to obtain the Q function of every executable action in the current state space, and the optimal action is selected by comparing the Q functions of the actions so as to control the action strategy of the energy storage equipment. The invention constructs a learning model to implement the optimal dispatch of energy storage equipment, reduces the impact of the intermittency and uncertainty of new energy, and achieves full and reasonable consumption of the electricity produced by new energy.

Brief Description of the Drawings

Figure 1 is a flow chart of the active distribution network dispatching method based on deep reinforcement learning and considering new energy consumption according to the invention.

Figure 2 is a structural diagram of the regional distribution network in the invention.

Figure 3 is a diagram of the deep reinforcement learning energy dispatch model in the invention.

Figure 4 is a flow chart of the prediction method based on VMD and BiLSTM in the invention.

Figure 5 is a schematic diagram of the predicted power of the new energy sources and the loads in the invention.

Figure 6 is a schematic diagram of the overall flow of the active distribution network dispatching method based on deep reinforcement learning and considering new energy consumption according to the invention.

Figure 7 is a schematic diagram of the change in the reward obtained during training in the invention.

Figure 8 is a schematic diagram of the change in the state of charge of the energy storage equipment in the invention.

Detailed Description

The technical solutions of the embodiments of the invention are explained and described below, but the following embodiments are only preferred embodiments of the invention and not all of them. Based on the embodiments of this description, other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the invention.

In view of the problems of regional distribution networks described in the background, the invention integrates a prediction algorithm into dispatching to form a "prediction + dispatching" framework, which can effectively improve the accuracy and stability of dispatching at future moments.

The invention provides an energy storage dispatching strategy based on deep reinforcement learning that takes new energy consumption into account, learns an optimal policy, ensures the normal operation of every participating part of the grid, and reasonably consumes the electricity generated by new energy. As shown in Figure 1, the implementation steps of the active distribution network dispatching method provided by the invention include:

Step S1: establish a simulation model according to the regional distribution network structure (as shown in Figure 2), obtain the historical photovoltaic output, wind power output, village electricity load and factory electricity load data of the regional distribution network, and apply the VMD data preprocessing technique, the PCC correlation analysis algorithm and the BiLSTM model to predict the photovoltaic output, wind power output, village electricity load and factory electricity load for the next 24 hours as the training data set of the reinforcement learning dispatch model (see Figure 5).

Step S2: define the simulation model required for the Markov decision process in reinforcement learning.

Step S3: use the Rainbow algorithm to train the simulation model defined in step S2, divide the 24 hours into 24 dispatch moments with one dispatch per hour, and train repeatedly with the 24-hour real-time prediction data until the final reward function converges.

Step S4: apply the agent trained in step S3 to regional distribution network dispatch; at each dispatch moment, feed the corresponding state space S(t) into the Q neural network of the Rainbow algorithm to obtain the Q function of every executable action in the current state space, and select the optimal action by comparing the Q functions of the actions, thereby controlling the action strategy of the energy storage equipment.
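
The following minimal sketch (Python/PyTorch) illustrates the greedy dispatch loop of step S4 under stated assumptions: a placeholder Q-network, a five-dimensional state vector and the three storage actions. It is not the patent's trained agent, only an illustration of how the argmax over Q-values selects the hourly action.

```python
import torch

ACTIONS = ["charge", "discharge", "idle"]  # a_I, a_O, a_N

@torch.no_grad()
def dispatch_day(q_net: torch.nn.Module, states_24h: torch.Tensor) -> list[str]:
    """Greedy dispatch over 24 hourly steps.

    states_24h: tensor of shape (24, 5) holding, per hour, the predicted PV
    output, wind output, village load, factory load and the storage level E_t.
    """
    schedule = []
    for t in range(states_24h.shape[0]):
        q_values = q_net(states_24h[t].unsqueeze(0))   # shape (1, 3)
        best = int(q_values.argmax(dim=1).item())      # index of max Q(s_t, a)
        schedule.append(ACTIONS[best])
    return schedule

# Usage with a placeholder network (the real agent would be the trained Rainbow Q-net):
q_net = torch.nn.Sequential(torch.nn.Linear(5, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
print(dispatch_day(q_net, torch.rand(24, 5)))
```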

With reference to Figure 4, the steps of building the training data set in step S1 are as follows.

Step S11: obtain the four data sequences $X(t)_l$ and perform data-cleaning preprocessing (the data sequence $X(t)_l$ is cleaned by replacing missing data, duplicated data and jump data with the average of several neighbouring data points), where $l \in \{1,2,3,4\}$ indicates the type of data sequence; the types are photovoltaic output data, wind power output data, village electricity load data and factory electricity load data.
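
A possible cleaning routine, assuming hourly data in a pandas Series and illustrative thresholds (jump_sigma, window) chosen here rather than taken from the patent, might look like this:

```python
import numpy as np
import pandas as pd

def clean_series(x: pd.Series, jump_sigma: float = 4.0, window: int = 3) -> pd.Series:
    """Replace missing, duplicated and jump samples by the mean of neighbouring points.

    jump_sigma and window are illustrative choices, not values from the patent.
    """
    x = x.copy().astype(float)
    # flag missing values, exact consecutive duplicates and large jumps
    bad = x.isna()
    bad |= x.diff() == 0
    jumps = (x - x.rolling(window, center=True, min_periods=1).median()).abs()
    bad |= jumps > jump_sigma * x.std()
    # replace flagged samples with the local average of the remaining neighbours
    smoothed = x.mask(bad).rolling(window, center=True, min_periods=1).mean()
    return x.where(~bad, smoothed).interpolate(limit_direction="both")

cleaned = clean_series(pd.Series(np.random.rand(96)))
```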

Step S12: apply the VMD model to perform VMD decomposition on each preprocessed data sequence $X(t)_l$, obtaining $K$ intrinsic mode components $\mathrm{IMF}_k^{t,l}$, where $k$ denotes the $k$-th mode component obtained by the VMD decomposition.

Step S121: for the preprocessed data sequence $X(t)_l$, assume that each mode has a finite bandwidth around a centre frequency, and search for $K$ modes such that the sum of the estimated bandwidths of the modes is minimal. The model is:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\}, \quad \text{s.t.} \ \sum_{k=1}^{K} u_k(t) = X(t)_l$$

where $K$ is the number of mode components to be decomposed (a positive integer), $\{u_k\}$ and $\{\omega_k\}$ are the $k$-th decomposed mode component and its centre frequency, $\delta(t)$ is the Dirac function, $*$ is the convolution operation, $\partial_t$ denotes differentiation with respect to time $t$, $j$ is the imaginary unit, $\|\cdot\|_2$ is the 2-norm, and s.t. denotes the constraint.

Step S122: to solve the model of step S121, introduce a penalty factor $\alpha$ (used to reduce the influence of Gaussian noise) and a Lagrange multiplier $\lambda$ to convert the constrained problem into an unconstrained one, giving the augmented Lagrangian:

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| X(t)_l - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, X(t)_l - \sum_{k=1}^{K} u_k(t) \right\rangle$$

The parameters $u_k$, $\omega_k$ and $\lambda$ are updated iteratively by the alternating direction method of multipliers:

$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{X}_l(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}(\omega)/2}{1 + 2\alpha(\omega - \omega_k)^2}, \qquad \omega_k^{n+1} = \frac{\int_0^{\infty} \omega\, |\hat{u}_k^{n+1}(\omega)|^2 \, d\omega}{\int_0^{\infty} |\hat{u}_k^{n+1}(\omega)|^2 \, d\omega}, \qquad \hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega) + \gamma\left[\hat{X}_l(\omega) - \sum_{k} \hat{u}_k^{n+1}(\omega)\right]$$

where $\hat{X}_l(\omega)$, $\hat{u}_i(\omega)$, $\hat{\lambda}(\omega)$ and $\hat{u}_k^{n+1}(\omega)$ are the Fourier transforms of $X(t)_l$, $u_i(t)$, $\lambda(t)$ and $u_k^{n+1}(t)$, $n$ is the iteration index, and $\gamma$ is the noise tolerance, which meets the fidelity requirement of the signal decomposition.

Step S123: for a given accuracy $e > 0$, stop the iteration if $\sum_{k}\dfrac{\|\hat{u}_k^{n+1}-\hat{u}_k^{n}\|_2^2}{\|\hat{u}_k^{n}\|_2^2} < e$; otherwise return to step S122.

Step S124: when the iteration ends, $K$ components $u_k$ are obtained for each sequence ($K \times l$ in total), namely the IMF components, denoted $\mathrm{IMF}_k^{t,l}$, $k = 1, \dots, K$.
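
As an illustration of the VMD step, the sketch below relies on the third-party vmdpy package (an assumption; any VMD implementation exposing the parameters alpha, tau, K, DC, init, tol would do) and uses common default parameter values rather than values stated in the patent:

```python
import numpy as np
from vmdpy import VMD   # third-party package, assumed available: pip install vmdpy

def decompose(x: np.ndarray, K: int = 6, alpha: float = 2000.0, tol: float = 1e-7):
    """Decompose one cleaned series X(t)_l into K band-limited modes (IMFs).

    alpha is the bandwidth penalty factor, tau=0 gives a noise-tolerant update,
    and tol corresponds to the convergence accuracy e.  The values used here
    are common defaults, not values stated in the patent.
    """
    u, u_hat, omega = VMD(x, alpha, 0.0, K, 0, 1, tol)
    return u            # (K, T) array of modes IMF_1 .. IMF_K

modes = decompose(np.sin(np.linspace(0, 20, 1024)) + 0.1 * np.random.randn(1024))
```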

Step S13: perform PCC correlation analysis on the mode components $\mathrm{IMF}_k^{t,l}$, compute the correlation coefficients between the feature components and the original sequence, screen out the IMF components with low correlation, and extract the $K'$ IMF components with high correlation. Specifically:

Step S131: use the Pearson correlation coefficient (PCC) to perform correlation analysis on each mode component, taking the original data sequence $X(t)_l$ as the target value, and compute the correlation coefficient $r_i$:

$$r_i = \frac{\mathrm{Cov}\left(\mathrm{IMF}_i^{t,l},\, X(t)_l\right)}{\sqrt{\mathrm{Var}\left[\mathrm{IMF}_i^{t,l}\right]\,\mathrm{Var}\left[X(t)_l\right]}}$$

where $\mathrm{IMF}_i^{t,l}$ is the $i$-th component, $i = 1, 2, \dots, K$; $\mathrm{Cov}(\mathrm{IMF}_i^{t,l}, X(t)_l)$ is the covariance of $\mathrm{IMF}_i^{t,l}$ and $X(t)_l$; and $\mathrm{Var}[\mathrm{IMF}_i^{t,l}]$ and $\mathrm{Var}[X(t)_l]$ are their variances.

Step S132: according to the correlation analysis results $r_i$ of step S131, select $K'$ highly correlated IMF components ($K' \le K$) and feed them into the BiLSTM prediction model. The selection can be performed by taking the IMF components corresponding to the first $K'$ values of $r_i$, ordered from high to low.
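
A minimal sketch of the PCC screening of step S13, using NumPy and ranking the components by $r_i$ from high to low as in step S132 (the array shapes and the value of k_keep are illustrative):

```python
import numpy as np

def select_modes(modes: np.ndarray, target: np.ndarray, k_keep: int):
    """Keep the k_keep modes with the highest Pearson correlation to the original series.

    modes: (K, T) IMF components; target: (T,) original series X(t)_l.
    """
    r = np.array([np.corrcoef(m, target)[0, 1] for m in modes])   # r_i for each IMF
    order = np.argsort(-r)                                        # ranked high to low
    keep = order[:k_keep]
    return modes[keep], r[keep]

kept, r_kept = select_modes(np.random.rand(6, 1024), np.random.rand(1024), k_keep=4)
```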

Step S14: apply the BiLSTM model to the $K'$ highly correlated IMF components, extract features and output the predicted components $\widehat{\mathrm{IMF}}_k^{t,l}$; the BiLSTM cell at time $t$ is:

Input gate: $i_t = \mathrm{sigmoid}\left(W_{xi}\, \mathrm{IMF}_k^{t,l} + W_{hi}\, h_{t-1} + b_i\right)$

Forget gate: $f_t = \mathrm{sigmoid}\left(W_{xf}\, \mathrm{IMF}_k^{t,l} + W_{hf}\, h_{t-1} + b_f\right)$

Memory cell: $c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc}\, \mathrm{IMF}_k^{t,l} + W_{hc}\, h_{t-1} + b_c\right)$

Output gate: $o_t = \mathrm{sigmoid}\left(W_{xo}\, \mathrm{IMF}_k^{t,l} + W_{ho}\, h_{t-1} + b_o\right)$

Hidden state: $h_t = o_t \odot \tanh(c_t)$

where $i_t$ is the input gate, $f_t$ the forget gate, $c_t$ the memory cell, $o_t$ the output gate and $h_t$ the hidden state; tanh and sigmoid are activation functions; $W_{hi}$, $W_{hf}$, $W_{hc}$, $W_{ho}$ are the weight coefficients of $h_{t-1}$ in the feature extraction of the input gate, forget gate, memory cell and output gate respectively; $W_{xi}$, $W_{xf}$, $W_{xc}$, $W_{xo}$ are the weight coefficients of $\mathrm{IMF}_k^{t,l}$ in the feature extraction of the input gate, forget gate, memory cell and output gate respectively; $b_i$, $b_f$, $b_c$, $b_o$ are the corresponding bias values; $h_{t-1}$ is the hidden state at the previous moment; and $c_{t-1}$ is the memory cell before updating.

Step S15: superimpose the predicted components to obtain the prediction data $\hat{X}(t)_l = \sum_{k} \widehat{\mathrm{IMF}}_k^{t,l}$.
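
A prediction sketch for steps S14 and S15, assuming one PyTorch BiLSTM per retained IMF component and summing the per-component forecasts to reconstruct the predicted series; the hidden size and window length are illustrative choices, not the patent's settings:

```python
import torch
import torch.nn as nn

class BiLSTMForecaster(nn.Module):
    """Predicts the next value of one IMF component from a window of past values."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)     # forward + backward hidden states

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)                    # (batch, window, 2*hidden)
        return self.head(out[:, -1, :])          # one-step-ahead prediction

def predict_series(models, imf_windows):
    """Sum the per-IMF predictions to reconstruct the forecast series (step S15)."""
    with torch.no_grad():
        return sum(m(w).squeeze(-1) for m, w in zip(models, imf_windows))

# Usage: one model per retained IMF, each fed a (batch, window, 1) tensor.
models = [BiLSTMForecaster() for _ in range(4)]
windows = [torch.rand(1, 24, 1) for _ in range(4)]
x_hat = predict_series(models, windows)
```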

With reference to Figure 3, the definition of the simulation model in step S2 is implemented as follows.

Step S21: define the environment state space S(t), which consists of five parts: the 24-hour village electricity load data, factory electricity load data, wind power output data, photovoltaic output data, and the real-time state of charge of the energy storage equipment.

In step S21, at a future dispatch moment $t$ the agent obtains from the environment the photovoltaic output prediction $\hat{P}^{PV}_t$, the wind power output prediction $\hat{P}^{W}_t$, the village load prediction $\hat{P}^{V}_t$, the factory load prediction $\hat{P}^{F}_t$ and the state of charge $E_t$ of the energy storage equipment at time $t$; these five pieces of state information form the environment state space $S(t) = \{\hat{P}^{PV}_t, \hat{P}^{W}_t, \hat{P}^{V}_t, \hat{P}^{F}_t, E_t\}$, where $t$ is a dispatch moment within the next 24 hours.

In step S22, the action space at each moment comprises the three states of the energy storage equipment (charging, discharging and idling); the action strategy set $A$ of the energy storage equipment is:

$$A = \{a_I, a_O, a_N\}$$

where $a_I$ is the charging action strategy, namely charging the energy storage equipment from the photovoltaic output (branch 1), the wind power output (branch 2) or the main grid; $a_O$ is the discharging action strategy, namely discharging the energy storage equipment to the village electrical equipment (branch 1), the factory electrical equipment (branch 2) or the main grid; and $a_N$ is the idle action strategy.

Further, under the physical constraints, the action strategy of the energy storage equipment must optimise the charging/discharging time and the amount of energy charged or discharged; the energy storage equipment is represented by a dynamic model:

$$E_{t+1} = \begin{cases} E_t + \zeta P_t \Delta t, & P_t > 0 \ (\text{charging}) \\ E_t + P_t \Delta t / \eta, & P_t < 0 \ (\text{discharging}) \end{cases}$$

where $E_t$ is the stored energy of the energy storage equipment at time $t$ and satisfies $E_{min} < E_t < E_{max}$, with $E_{max}$ and $E_{min}$ the maximum and minimum capacities of the energy storage equipment; $P_t$ is the charge/discharge power, with $P_t < 0$ meaning the equipment is discharging and $P_t > 0$ meaning it is charging; and $\zeta$ and $\eta$ are the charging efficiency and discharging efficiency respectively.

Step S23: define the reward function used to control the actions of the energy storage equipment. The agent selects the action of the energy storage equipment through the reward function, which mainly considers the action reward: under the physical constraints of the energy storage equipment, the reward function is built from a discharge reward factor $k_O$, a charge reward factor $k_I$ and a penalty factor $n$, rewarding appropriate charging and discharging actions and penalising actions that violate the constraints.

Step S24: the agent interacts with the main grid; the regional distribution network is connected to the main grid through a point of common coupling. When the total output of all new energy (wind and photovoltaic) in the regional distribution network can meet the entire load demand and the energy storage equipment is full, the surplus new energy is fed back to the main grid; when the new energy output and the energy storage equipment cannot meet the entire load demand, electricity is purchased from the main grid. In step S24 a grid power balance limit is set; the power balance relationship is:

$$P_{balance}(t) = P_{renew}(t) - P_{load}(t) \tag{10}$$

$$P_{grid}(t) = P_{balance}(t) + P_E(t) \tag{11}$$

where $P_{renew}(t)$ is the total new energy generation power in the regional distribution network at time $t$, and $P_{load}(t)$ is the total power consumed by the loads in the regional distribution network at time $t$; $P_{balance}(t)$ is the supply-demand difference between the total new energy generation and the total load consumption, with $P_{balance}(t) > 0$ indicating a surplus of new energy generation in the regional distribution network and $P_{balance}(t) < 0$ indicating a shortage; when $P_E(t) > 0$, $P_E(t)$ is the discharge power of the energy storage equipment, and when $P_E(t) < 0$, $P_E(t)$ is its charging power; $P_{grid}(t)$ is the power exchanged between the regional distribution network and the main grid, with a positive $P_{grid}(t)$ meaning the regional distribution network exports to the main grid and a negative $P_{grid}(t)$ meaning the main grid supplies the regional distribution network.
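
The environment sketch below combines the storage dynamics, the power balance of equations (10) and (11) and a reward of the general shape described in step S23; the concrete reward values and the class and field names (Storage, zeta, eta, penalty) are assumptions for illustration, not the patent's exact formulas:

```python
from dataclasses import dataclass

CHARGE, DISCHARGE, IDLE = 0, 1, 2   # a_I, a_O, a_N

@dataclass
class Storage:
    e: float          # current energy E_t (kWh)
    e_min: float
    e_max: float
    p: float          # charge/discharge power magnitude per step (kW)
    zeta: float       # charging efficiency
    eta: float        # discharging efficiency

def step(storage: Storage, action: int, p_renew: float, p_load: float,
         k_i: float = 1.0, k_o: float = 1.0, penalty: float = 5.0):
    """One dispatch step: update E_t, compute a reward and the grid exchange P_grid(t)."""
    p_balance = p_renew - p_load                     # supply-demand difference
    p_e = 0.0                                        # storage power seen by the grid
    reward = 0.0
    if action == CHARGE:
        if storage.e + storage.zeta * storage.p <= storage.e_max:
            storage.e += storage.zeta * storage.p
            p_e = -storage.p
            reward = k_i if p_balance > 0 else -penalty
        else:
            reward = -penalty                        # would overflow the battery
    elif action == DISCHARGE:
        if storage.e - storage.p / storage.eta >= storage.e_min:
            storage.e -= storage.p / storage.eta
            p_e = storage.p
            reward = k_o if p_balance < 0 else -penalty
        else:
            reward = -penalty                        # would over-discharge
    p_grid = p_balance + p_e                         # >0: export to main grid, <0: import
    return reward, p_grid

# Usage: a surplus hour where charging is the rewarded action.
s = Storage(e=50.0, e_min=10.0, e_max=100.0, p=10.0, zeta=0.95, eta=0.95)
print(step(s, CHARGE, p_renew=80.0, p_load=60.0))
```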

Step S3 is implemented as follows.

Step S31: build a neural network with one hidden layer and two fully connected layers, and perturb all fully connected layer parameters with a Gaussian noise term to replace the ε-greedy exploration of DQN. Adding noise to the fully connected layer parameters effectively strengthens the exploration capability of the algorithm, and the original forward computation $y = wx + b$ becomes:

$$y = (\mu_w + \sigma_w \odot N_w)\, x + \mu_b + \sigma_b \odot N_b \tag{12}$$

In this variant, the weight $w$ and bias $b$ of $y = wx + b$ are replaced by parameters drawn from normal distributions with means $\mu$ and standard deviations $\sigma$, perturbed by Gaussian noise that is resampled as a constant in every training round; $\mu_w$, $\sigma_w$, $N_w$, $\mu_b$, $\sigma_b$ and $N_b$ are all parameters of the layer.
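
A noisy fully connected layer of the form $y = (\mu_w + \sigma_w \odot N_w)x + \mu_b + \sigma_b \odot N_b$ can be sketched as follows (independent Gaussian noise per weight, a simplification of the factorised scheme used in the published Rainbow algorithm; sigma_0 = 0.5 is a common default, not a value from the patent):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Fully connected layer y = (mu_w + sigma_w * N_w) x + mu_b + sigma_b * N_b."""
    def __init__(self, in_features: int, out_features: int, sigma_0: float = 0.5):
        super().__init__()
        bound = 1 / math.sqrt(in_features)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features),
                                               sigma_0 / math.sqrt(in_features)))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,),
                                               sigma_0 / math.sqrt(in_features)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n_w = torch.randn_like(self.sigma_w)     # N_w, resampled every forward pass
        n_b = torch.randn_like(self.sigma_b)     # N_b
        return F.linear(x, self.mu_w + self.sigma_w * n_w,
                        self.mu_b + self.sigma_b * n_b)

layer = NoisyLinear(5, 3)
print(layer(torch.rand(2, 5)).shape)   # torch.Size([2, 3])
```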

Step S32: add a dueling network before the output-layer Q network, decomposing the Q function of the output layer into the sum of a value function V and an advantage function H, i.e. Q = V + H, where V is the reward value induced by the state and H is the reward value obtained after the energy storage equipment performs the charging, discharging or idle action. Because the Q network is constrained by the state, the V value is updated first and the H value is then adjusted; in practice, to keep H and V within a reasonable range, the mean of H is generally constrained to 0. The Q function is:

$$Q(s_t, a_t; \theta, \omega, \upsilon) = V(s_t; \theta, \omega) + \left( H(s_t, a_t; \theta, \upsilon) - \frac{1}{|A|} \sum_{a'_t} H(s_t, a'_t; \theta, \upsilon) \right)$$

where $a_t$ is the action strategy, $\theta$ are the network-layer parameters of the Q function, $\omega$ are the value-function network-layer parameters, $\upsilon$ are the advantage-function network-layer parameters, the subtracted term is the mean of the advantage function, and $a'_t$ ranges over all possible actions in state $s_t$.
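
A minimal dueling output head implementing Q = V + (H − mean H), with the mean of the advantage subtracted so that H is centred at 0 as described above (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s,a) = V(s) + H(s,a) - mean_a' H(s,a'): the value/advantage decomposition of step S32."""
    def __init__(self, features: int, n_actions: int = 3):
        super().__init__()
        self.value = nn.Linear(features, 1)               # V, parameters omega
        self.advantage = nn.Linear(features, n_actions)   # H, parameters upsilon

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        v = self.value(feats)                             # (batch, 1)
        h = self.advantage(feats)                         # (batch, n_actions)
        return v + h - h.mean(dim=1, keepdim=True)        # centre H so its mean is 0

q = DuelingHead(64)(torch.rand(8, 64))   # (8, 3) Q-values for charge/discharge/idle
```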

Step S33: build two Q networks as the output layer of the neural network, decoupling the action selection $a_t$ from the evaluation of the selected action; the first Q network selects the best action in the current state, and the second Q network evaluates the charge/discharge action. Using the outputs of the two networks effectively alleviates the overestimation error.

Step S34: use a multi-step learning strategy and obtain immediate rewards by interacting with the environment, so that the Q values can be estimated more accurately in the early stage of training, which speeds up training. The multi-step return is:

$$R_t^{(n)} = \sum_{i=0}^{n-1} d^{\,i} r_{t+i}, \qquad y_t = R_t^{(n)} + d^{\,n} \max_{a'} Q(s_{t+n}, a'; \theta)$$

where $n$ is the step length, $\theta$ are the neural network parameters, $d$ is the discount rate and $R$ is the return value; the reward obtained during training is shown in Figure 7.
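
The sketch below combines the two-network decoupling of step S33 with the n-step return of step S34: the online network selects the action at $s_{t+n}$ and the target network evaluates it (the discount d = 0.99 and the placeholder networks are assumptions):

```python
import torch

def n_step_target(rewards, s_next, done, online_q, target_q, d: float = 0.99):
    """n-step double-DQN target (steps S33-S34).

    rewards: tensor (batch, n) of the n immediate rewards r_t .. r_{t+n-1};
    s_next: state s_{t+n}; the online network picks the action, the target
    network evaluates it, which is the decoupling described in step S33.
    """
    n = rewards.shape[1]
    discounts = d ** torch.arange(n, dtype=rewards.dtype)
    g = (rewards * discounts).sum(dim=1)                          # sum_i d^i * r_{t+i}
    with torch.no_grad():
        a_star = online_q(s_next).argmax(dim=1, keepdim=True)     # selection
        q_eval = target_q(s_next).gather(1, a_star).squeeze(1)    # evaluation
    return g + (1.0 - done) * (d ** n) * q_eval

# Usage with placeholder networks:
net = torch.nn.Linear(5, 3)
tgt = torch.nn.Linear(5, 3)
y = n_step_target(torch.rand(4, 3), torch.rand(4, 5), torch.zeros(4), net, tgt)
```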

Step S35: use a prioritized experience replay pool (PER, Prioritized Replay); define an experience pool of fixed capacity, put each tuple $(s_t, a_t, r_t, s_{t+1})$ produced by the agent's training into the pool, compute the training error $\delta_i$ of each tuple, and assign different error priorities when the data are fed back into the neural network for training. (The purpose of assigning error priorities is sorting: priorities are assigned according to the interval in which the error value lies, the data are ordered by priority, and the data with the top priorities are selected as training data, which reduces the amount of training data and improves training efficiency and quality.) The sampling priority formulas are:

$$\rho_i = |\delta_i| + \varepsilon, \qquad P_i = \frac{\rho_i^{\beta}}{\sum_j \rho_j^{\beta}}, \qquad \delta_i = |Q(s_t, a_t) - Q'(s_t, a_t)| \tag{16}$$

where $P_i$ is the sampling priority, $\varepsilon$ is a noise factor that prevents the priority from being 0, $\beta$ is the annealing factor used to adjust the priority, and $\delta_i$ is the error produced by a set of experiences during training.
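
A simplified prioritized replay buffer consistent with $\rho_i = |\delta_i| + \varepsilon$ and a sampling probability $P_i = \rho_i^{\beta} / \sum_j \rho_j^{\beta}$; a production implementation would use a sum-tree and importance-sampling weights, which this list-based sketch omits:

```python
import random
import numpy as np

class PrioritizedReplay:
    """Fixed-capacity buffer where samples are drawn in proportion to (|TD error| + eps)^beta."""
    def __init__(self, capacity: int = 10000, eps: float = 1e-3, beta: float = 0.6):
        self.capacity, self.eps, self.beta = capacity, eps, beta
        self.data, self.prio = [], []

    def push(self, transition, td_error: float):
        rho = abs(td_error) + self.eps                  # rho_i = |delta_i| + eps
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.prio.pop(0)
        self.data.append(transition)
        self.prio.append(rho)

    def sample(self, batch_size: int):
        p = np.asarray(self.prio) ** self.beta
        p /= p.sum()                                    # P_i = rho_i^beta / sum_j rho_j^beta
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx]

# Usage: fill the pool with dummy transitions, then draw a prioritized batch.
buf = PrioritizedReplay()
for t in range(64):
    buf.push((t, "state", "action", "reward"), td_error=random.random())
batch = buf.sample(8)
```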

与现有技术相比,本发明具有以下的优点:Compared with the prior art, the present invention has the following advantages:

(1)将VMD分解和深度学习模型BiLSTM结合,并采用PCC相关性分析,精简IMF分量,有效提升计算效率,并解决光伏和风力发电的不稳定性,提高了预测精度。(1) Combine VMD decomposition with the deep learning model BiLSTM, and use PCC correlation analysis to streamline the IMF components, effectively improve calculation efficiency, and solve the instability of photovoltaic and wind power generation, improving prediction accuracy.

(2)将预测算法融合到调度中,形成“预测+调度”的框架,可有效提高未来时刻调度的准确性和稳定性。(2) Integrate prediction algorithms into scheduling to form a "prediction + scheduling" framework, which can effectively improve the accuracy and stability of scheduling in the future.

(3)将深度强化学习算法Rainbow实施储能设备的优化调度,有效的将新能源出力的电量合理消纳。(3) Implement the deep reinforcement learning algorithm Rainbow to optimize the scheduling of energy storage equipment, effectively absorbing the power generated by new energy sources.

本发明利用深度强化学习的储能调度策略,深度学习模型从高维度、连续的空间中提取高阶数据特征,同时结合VMD数据预处理技术,对含不确定性的风力发电和光伏发电有着更好的表达能力和特征挖掘能力。强化学习解决连续的决策问题,可以在动态、不确定的环境中通过反复试错探索的方式优化调度策略。其中,本发明中针对新能源发电的有效消纳问题,应用VMD分解和深度学习模型BiLSTM(双向长短期记忆网络),对风电、光伏、负荷的预测;将深度强化学习Rainbow算法用于储能系统的优化调度,可以有效解决由于可再生能源的不确定性、能量流动和负荷多样性导致的能源利用效率过低、调度性能下降和运行成本过高等问题。This invention uses the energy storage scheduling strategy of deep reinforcement learning. The deep learning model extracts high-order data features from high-dimensional, continuous space. At the same time, combined with VMD data preprocessing technology, it has a better understanding of wind power generation and photovoltaic power generation containing uncertainty. Good expression ability and feature mining ability. Reinforcement learning solves continuous decision-making problems and can optimize scheduling strategies through repeated trial and error exploration in dynamic and uncertain environments. Among them, in this invention, for the effective consumption of new energy power generation, VMD decomposition and deep learning model BiLSTM (bidirectional long short-term memory network) are applied to predict wind power, photovoltaic, and load; the deep reinforcement learning Rainbow algorithm is used for energy storage The optimal dispatch of the system can effectively solve the problems of low energy utilization efficiency, reduced dispatch performance and high operating costs due to the uncertainty of renewable energy, energy flow and load diversity.

PCC correlation analysis is applied in the prediction process. In machine learning and data mining tasks, PCC can help exclude features that are weakly correlated with the target variable, thereby reducing the data-analysis resources consumed by the feature space and improving the analysis efficiency and accuracy of the model.
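As a small illustration of this PCC-based screening, the sketch below (Python, with assumed function names and an assumed K' cutoff) computes the Pearson correlation between each candidate IMF component and the original sequence and keeps only the most correlated components.

```python
import numpy as np

def pearson_r(x, y):
    # r = Cov(x, y) / sqrt(Var[x] * Var[y])
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.cov(x, y)[0, 1] / np.sqrt(x.var(ddof=1) * y.var(ddof=1))

def screen_imfs(imfs, target, k_keep):
    """Keep the k_keep IMF components most correlated with the target sequence."""
    scores = [abs(pearson_r(imf, target)) for imf in imfs]
    order = np.argsort(scores)[::-1]          # sort correlations from high to low
    keep = sorted(order[:k_keep])             # preserve the original component order
    return [imfs[i] for i in keep], [scores[i] for i in keep]
```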

What is disclosed above is only a preferred embodiment of the present invention and certainly cannot be used to limit the scope of rights of the present invention. Those of ordinary skill in the art can understand all or part of the processes for implementing the above embodiment, and equivalent changes made according to the claims of the present invention still fall within the scope covered by the invention.

Claims (9)

1. An active power distribution network scheduling method based on deep reinforcement learning and new energy consumption is characterized by comprising the following steps:
step S1, a simulation model is established according to a regional power distribution network structure, photovoltaic output historical data, wind output historical data, village power utilization load historical data and factory power utilization load historical data in the regional power distribution network structure are obtained, VMD data preprocessing technology, PCC correlation analysis algorithm and BiLSTM model are applied, photovoltaic output prediction data, wind output prediction data, village power utilization load prediction data and factory power utilization load prediction data in the future 24 hours are predicted, and the photovoltaic output prediction data, the wind output prediction data, the village power utilization load prediction data and the factory power utilization load prediction data are used as training data sets of a reinforcement learning scheduling model;
step S2, defining a simulation model required by Markov decision in reinforcement learning;
step S3, training the simulation model defined in the step S2 by adopting a Rainbow algorithm, dividing 24 hours into 24 scheduling moments, and repeatedly training by applying 24 hours of real-time prediction data until the final reward function is converged;
step S4, applying the agent trained in the step S3 to regional power distribution network scheduling, and inputting a corresponding state space S (t) into a Q neural network in a Rainbow algorithm at each scheduling moment to obtain a Q function of each executable action in the current state space; and selecting an optimal action by comparing the Q functions of each action, thereby controlling the action strategy of the energy storage equipment.
2. The active power distribution network scheduling method based on deep reinforcement learning and new energy consumption according to claim 1, wherein the step S1 comprises:
step S11, obtaining 4 data sequences X(t)_l and performing data cleaning pretreatment, wherein l ∈ {1, 2, 3, 4} and l indicates the type of data sequence, the types of data sequence comprising a photovoltaic data type, a wind power data type, a village power data type and a factory power data type;
step S12, applying the VMD data preprocessing technique to the cleaned data sequence X(t)_l to perform VMD decomposition and obtain K intrinsic mode components IMF_k^{t,l}, wherein k represents the k-th intrinsic mode component obtained after VMD decomposition;
step S13, performing PCC correlation analysis on each intrinsic mode component IMF_k^{t,l}, calculating the correlation between each component and the original sequence, screening out the IMF components with low correlation, and extracting K' IMF components with high correlation;
step S14, processing the K' IMF components with high correlation by using the BiLSTM model, extracting features and obtaining the predicted components;
step S15, superposing the predicted components to derive the prediction data.
3. The active power distribution network scheduling method based on deep reinforcement learning and new energy consumption according to claim 2, wherein the step S12 includes:
step S121, for the preprocessed data sequence X(t)_l, assuming each mode has a finite bandwidth with a center frequency, seeking K modes so as to minimize the sum of the estimated bandwidths of the modes, with the specific model:

min_{ {u_k}, {ω_k} } Σ_k || ∂_t[ (δ(t) + j/(πt)) * u_k(t) ] e^{-jω_k t} ||_2^2    s.t.    Σ_k u_k(t) = X(t)_l
wherein K is the number of intrinsic mode components to be decomposed; {u_k} and {ω_k} correspond respectively to the k-th mode component and its center frequency after decomposition; δ(t) is the Dirac function; * denotes the convolution operation; ∂_t denotes differentiation with respect to time t; j denotes the imaginary unit; || ||_2 denotes the two-norm; and s.t. denotes the constraint condition;
step S122, introducing a penalty factor α and a Lagrange multiplier λ to solve the model proposed in step S121, converting the constrained problem into an unconstrained problem and obtaining the augmented Lagrangian expression:

L({u_k}, {ω_k}, λ) = α Σ_k || ∂_t[ (δ(t) + j/(πt)) * u_k(t) ] e^{-jω_k t} ||_2^2 + || X(t)_l - Σ_k u_k(t) ||_2^2 + ⟨ λ(t), X(t)_l - Σ_k u_k(t) ⟩
iteratively updating the parameters u_k, ω_k and λ by the alternating direction method of multipliers, with the update formulas:

û_k^{n+1}(ω) = ( X̂(ω) - Σ_{i≠k} û_i(ω) + λ̂(ω)/2 ) / ( 1 + 2α(ω - ω_k)^2 )

ω_k^{n+1} = ∫_0^∞ ω |û_k^{n+1}(ω)|^2 dω / ∫_0^∞ |û_k^{n+1}(ω)|^2 dω

λ̂^{n+1}(ω) = λ̂^n(ω) + γ ( X̂(ω) - Σ_k û_k^{n+1}(ω) )
wherein X̂(ω), û_i(ω) and λ̂(ω) represent the Fourier transforms of X(t)_l, u_i(t) and λ(t) respectively, n is the number of iterations, and γ is the noise tolerance;
step S123, for a given judgment precision e > 0, stopping the iteration if the convergence condition Σ_k ( || û_k^{n+1} - û_k^n ||_2^2 / || û_k^n ||_2^2 ) < e is satisfied, otherwise returning to execute step S122;
step S124, after the iteration is completed, obtaining K u components, namely the IMF components, which are recorded as IMF_k^{t,l}, k = 1, 2, …, K.
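A compact numpy sketch of the VMD iteration recited in steps S121–S124 is given below. It applies the frequency-domain mode update, centre-frequency update and multiplier update in a simplified form (no signal mirroring or other refinements of reference implementations); the default values of alpha, tau and the tolerance, the initialisation and the stopping test are illustrative assumptions rather than the patent's exact settings.

```python
import numpy as np

def vmd(signal, K, alpha=2000.0, tau=0.1, tol=1e-6, max_iter=500):
    """Simplified variational mode decomposition via frequency-domain ADMM updates."""
    T = len(signal)
    f_hat = np.fft.fftshift(np.fft.fft(signal))     # centred spectrum of the input
    freqs = np.arange(T) / T - 0.5                  # normalised frequencies in [-0.5, 0.5)

    u_hat = np.zeros((K, T), dtype=complex)         # mode spectra
    omega = np.linspace(0, 0.5, K + 1)[1:]          # initial centre frequencies (assumed)
    lam = np.zeros(T, dtype=complex)                # Lagrange multiplier spectrum

    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            residual = f_hat - u_hat.sum(axis=0) + u_hat[k] + lam / 2
            # Wiener-like mode update; symmetric in +/-omega so the modes stay real-valued
            u_hat[k] = residual / (1 + 2 * alpha * (np.abs(freqs) - omega[k]) ** 2)
            power = np.abs(u_hat[k][T // 2:]) ** 2  # positive-frequency power only
            omega[k] = np.sum(freqs[T // 2:] * power) / (np.sum(power) + 1e-12)
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))          # multiplier update (gamma = tau)
        change = np.sum(np.abs(u_hat - u_prev) ** 2) / (np.sum(np.abs(u_prev) ** 2) + 1e-12)
        if change < tol:                                        # convergence test
            break

    # back to the time domain: each row is one IMF-like mode component
    modes = np.real(np.fft.ifft(np.fft.ifftshift(u_hat, axes=-1), axis=-1))
    return modes, omega

# toy usage: decompose a two-tone signal into K = 2 modes
t = np.linspace(0, 1, 512, endpoint=False)
modes, centres = vmd(np.cos(2 * np.pi * 5 * t) + 0.5 * np.cos(2 * np.pi * 40 * t), K=2)
```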
4. The active power distribution network scheduling method based on deep reinforcement learning and new energy consumption according to claim 3, wherein the step S13 includes:
step S131, for each intrinsic mode component IMF_i^{t,l}, performing a correlation analysis using the Pearson correlation coefficient PCC with the data sequence X(t)_l as the target value, and calculating the correlation coefficient r_i, the calculation formula of PCC being:

r_i = Cov( IMF_i^{t,l}, X(t)_l ) / √( Var[ IMF_i^{t,l} ] · Var[ X(t)_l ] )
wherein IMF_i^{t,l} represents the i-th component, i = 1, 2, …, K; Cov(IMF_i^{t,l}, X(t)_l) is the covariance of IMF_i^{t,l} and X(t)_l; and Var[IMF_i^{t,l}] and Var[X(t)_l] are respectively the variances of IMF_i^{t,l} and X(t)_l;
step S132, screening out K' IMF components with high correlation according to the correlation analysis results r_i obtained in step S131, wherein K' ≤ K.
5. The active power distribution network scheduling method based on deep reinforcement learning and new energy consumption according to claim 4, wherein the method by which step S132 screens out the K' IMF components with high correlation according to the correlation analysis results r_i comprises the following step:
sorting r_i from high to low and selecting the IMF components corresponding to the first K' values of r_i.
6. The active power distribution network scheduling method based on deep reinforcement learning and new energy consumption according to claim 5, wherein when step S14 processes the K' IMF components with high correlation by using the BiLSTM model, extracts features and obtains the predicted components, the BiLSTM model corresponding to time t is as follows:
input gate: i_t = sigmoid( W_hi · h_{t-1} + W_xi · IMF_k^{t,l} + b_i )
forget gate: f_t = sigmoid( W_hf · h_{t-1} + W_xf · IMF_k^{t,l} + b_f )
memory cell: c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh( W_hc · h_{t-1} + W_xc · IMF_k^{t,l} + b_c )
output gate: o_t = sigmoid( W_ho · h_{t-1} + W_xo · IMF_k^{t,l} + b_o )
hidden state: h_t = o_t ⊙ tanh( c_t )
wherein i_t represents the input gate, f_t the forget gate, c_t the memory cell, o_t the output gate and h_t the hidden state; tanh and sigmoid are activation functions; W_hi, W_hf, W_hc and W_ho respectively represent the weight coefficients of h_{t-1} in the feature extraction process of the input gate, forget gate, memory cell and output gate; W_xi, W_xf, W_xc and W_xo respectively represent the weight coefficients of IMF_k^{t,l} in the feature extraction process of the input gate, forget gate, memory cell and output gate; b_i, b_f, b_c and b_o respectively represent the bias values in the feature extraction process of the input gate, forget gate, memory cell and output gate; h_{t-1} represents the hidden state at the previous moment; and c_{t-1} represents the memory cell before updating.
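As an illustration of the BiLSTM predictor used in step S14 and claim 6, the following PyTorch sketch maps a sliding window of one IMF component to its next value; the layer sizes, window length and single-step output head are assumptions made for the example, not the patent's configuration.

```python
import torch
import torch.nn as nn

class BiLSTMForecaster(nn.Module):
    """Bidirectional LSTM that maps a window of one IMF component to its next value(s)."""

    def __init__(self, input_size=1, hidden_size=64, num_layers=2, horizon=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_size, horizon)   # forward + backward hidden states

    def forward(self, x):
        # x: (batch, seq_len, input_size) -- e.g. a sliding window of one IMF component
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])                   # predict from the last time step

# minimal usage sketch
model = BiLSTMForecaster()
window = torch.randn(8, 24, 1)     # batch of 8 windows, 24 past points each
pred = model(window)               # shape (8, 1): next value of the component
```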
7. The active power distribution network scheduling method based on deep reinforcement learning and new energy consumption according to claim 6, wherein the step S2 comprises:
step S21, defining an environment state space S(t), wherein S(t) consists of five parts, namely village power load data, factory power load data, wind power output data, photovoltaic output data and the real-time state of charge of the energy storage equipment over 24 hours;
step S22, defining an action space of the intelligent agent, wherein the action space comprises three actions of charging, discharging and idling of the energy storage equipment;
step S23, defining a reward function for controlling the action of the energy storage device;
step S24, the intelligent agent interacts with a main power grid, the regional power distribution network is connected with the main power grid through a public connection point, and when all new energy output in the regional power distribution network can meet all load demands and the energy storage equipment has full electric quantity, the residual electric quantity of the new energy is fed back to the main power grid; and purchasing electricity from the main power grid when the new energy output and the energy storage equipment cannot meet all load requirements.
8. The active power distribution network scheduling method based on deep reinforcement learning and new energy consumption according to claim 7, wherein the method comprises the following steps:
in step S21, at a future scheduling time t, the agent acquires from the environment the photovoltaic output prediction data at time t, the wind power output prediction data, the village electricity load prediction data, the factory electricity load prediction data and the state of charge E_t of the energy storage device at time t; these five items of state information make up the environmental state space S(t), where t is a scheduling time within the next 24 hours;
in the step S22, the action policy set A of the energy storage device is A = { a_I, a_O, a_N }, wherein a_I represents the charging action strategy of the energy storage device, in particular the energy storage device being charged by the photovoltaic output, the wind output or the main power grid; a_O represents the discharging action strategy of the energy storage device, in particular the energy storage device discharging to the village electric equipment, the factory electric equipment or the main power grid; and a_N represents the idle action strategy of the energy storage device;
further, under the condition that the physical constraints are met, the energy storage equipment is described by a dynamic model, which is specifically expressed as follows:
wherein E_t represents the electric quantity of the energy storage device at time t and satisfies E_min < E_t < E_max, where E_min and E_max represent the minimum capacity and maximum capacity of the energy storage device respectively; P_t represents the charge and discharge power of the energy storage device, P_t < 0 indicating that the energy storage device is in a discharging state and P_t > 0 indicating that it is in a charging state; and ζ and η represent the charging efficiency and discharging efficiency of the energy storage device respectively;
in the step S23, the reward function is set to:
wherein k_O is the discharge reward factor, k_I is the charging reward factor, and n is the penalty factor;
in the step S24, a power balance limit of the power grid is set, and the power balance relationship is as follows:
P_balance(t) = P_renew(t) - P_load(t)
P_grid(t) = P_balance(t) + P_E(t)
wherein P_renew(t) is the total generation power of new energy sources in the regional power distribution network at time t, and P_load(t) is the total power consumption of the load in the regional power distribution network at time t; P_balance(t) is the difference between the total new energy generation power and the total load power consumption, P_balance(t) > 0 indicating that the new energy generation power in the regional power distribution network is in surplus and P_balance(t) < 0 indicating that the new energy generation power in the regional power distribution network is insufficient; when P_E(t) > 0, P_E(t) represents the discharge power of the energy storage device, and when P_E(t) < 0, P_E(t) represents the charging power of the energy storage device; P_grid(t) is the electric power transmitted between the regional distribution network and the main power grid, a positive P_grid(t) indicating that the regional distribution network delivers power to the main grid and a negative P_grid(t) indicating that the main power grid transmits power to the regional distribution network.
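The storage dynamics and power-balance relations of claim 8 can be illustrated with the short Python sketch below. The charge/discharge update form, the efficiency values and the function names are assumptions made for the example (the patent's exact dynamic model and reward function are not reproduced here), while the power-balance and grid-exchange sign conventions follow the claim.

```python
import numpy as np

def storage_step(E_t, P_t, E_min, E_max, xi=0.95, eta=0.95, dt=1.0):
    """Advance the storage state of charge by one scheduling interval (illustrative model).

    P_t > 0 means charging, P_t < 0 means discharging; xi/eta are the charge/discharge
    efficiencies (zeta/eta in the text).
    """
    if P_t >= 0:
        E_next = E_t + xi * P_t * dt        # charging with efficiency losses
    else:
        E_next = E_t + P_t * dt / eta       # discharging draws extra energy from storage
    return float(np.clip(E_next, E_min, E_max))

def grid_exchange(P_renew, P_load, P_E):
    """Power balance of the regional network: positive P_grid exports to the main grid."""
    P_balance = P_renew - P_load            # surplus (>0) or deficit (<0) of new energy
    P_grid = P_balance + P_E                # P_E > 0: storage discharging, < 0: charging
    return P_balance, P_grid

# example: 2 MW surplus of new energy while the storage charges at 1 MW
P_balance, P_grid = grid_exchange(P_renew=5.0, P_load=3.0, P_E=-1.0)   # -> (2.0, 1.0)
```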
9. The active power distribution network scheduling method based on deep reinforcement learning and new energy consumption according to claim 8, wherein the step S3 comprises:
step S31, constructing a neural network with one hidden layer and two fully connected layers, and adding a Gaussian-distributed noise term to all fully connected layer parameters as a perturbation, so as to replace the ε-greedy (random-greedy) exploration mode of DQN; adding noise to the fully connected layer parameters effectively enhances the exploration capability of the algorithm, and the original forward calculation formula y = wx + b is changed into the following formula:
y = ( μ_w + σ_w ⊙ N_w ) x + μ_b + σ_b ⊙ N_b
in the modified formula, the weight w and the bias b in the formula y = wx + b are converted into parameters obeying a normal distribution with mean μ and variance σ, together with a random noise ε obeying a Gaussian distribution, where ε is a constant generated in each training round, and N_b, μ_b, σ_b, σ_w, N_w and μ_w are all parameters;
step S32, adding a dueling (competing) network before the Q network of the output layer, and decomposing the Q function of the output layer into the sum of a value function V and an advantage function H, i.e., Q = V + H, where V represents the reward value brought by a state and H represents the reward value obtained after the energy storage device performs the charging, discharging and idle actions; because the Q network is constrained by the state, the V value is updated preferentially and the H value is then readjusted, and the Q function formula is:

Q(s_t, a_t; θ, ω, v) = V(s_t; θ, ω) + H(s_t, a_t; θ, v) - (1/|A|) Σ_{a'_t} H(s_t, a'_t; θ, v)
wherein a_t is an action strategy, θ is a shared network layer parameter of the Q function, ω is a network layer parameter of the value function, v is a network layer parameter of the advantage function, (1/|A|) Σ_{a'_t} H(s_t, a'_t; θ, v) is the mean value of the advantage function, and a'_t denotes all possible actions that can be generated in state s_t;
step S33, building two Q networks as output layers of the neural network so as to decouple the selection of the action a_t from the evaluation of the V value of the selected action, wherein the first Q network is used to select the optimal action in the current state and the second Q network is used to evaluate the charge and discharge actions;
step S34, using a multi-step learning strategy and obtaining instant rewards through interaction with the environment, wherein the return formula is as follows:

R_t^{(n)} = Σ_{k=0}^{n-1} d^k r_{t+k+1} + d^n max_{a'} Q( s_{t+n}, a'; θ )
wherein n is the step length, θ is the neural network parameter, d is the discount rate, and R is the return value;
step S35, using the prioritized experience replay pool PR, defining a custom fixed-capacity experience pool, putting each group of data (s_t, a_t, r_t, s_{t+1}) obtained after training of the agent into the experience pool, calculating the training data error δ_i, assigning different error priorities, and re-sending the data into the neural network for training, wherein the specific sampling priority is given by the following formulas:
ρ_i = |δ_i| + ε
δ_i = |Q(s_t, a_t) - Q'(s_t, a_t)|
wherein ρ_i is the priority (correlation) value, ε is a noise factor that prevents ρ_i from being 0, β is the annealing factor used to adjust the priority, and δ_i is the error value caused by a group of experience during training.
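The Rainbow components recited in steps S31–S35 (noisy fully connected layers, a dueling value/advantage head, double Q networks and multi-step returns, together with the prioritized replay sketched earlier in this document) can be illustrated by the PyTorch sketch below. The independent-Gaussian noise variant, the layer sizes, the initialisation constants and the helper names are illustrative assumptions rather than the patent's implementation; the 5-dimensional state and 3 actions follow claim 7, and the target computation pairs an n-step return with double-Q action selection and evaluation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Fully connected layer with Gaussian parameter noise:
    y = (mu_w + sigma_w * N_w) x + mu_b + sigma_b * N_b (independent-noise variant, assumed)."""

    def __init__(self, in_features, out_features, sigma0=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma0))

    def forward(self, x):
        if self.training:                               # sample fresh noise every call
            w = self.mu_w + self.sigma_w * torch.randn_like(self.sigma_w)
            b = self.mu_b + self.sigma_b * torch.randn_like(self.sigma_b)
        else:                                           # deterministic at evaluation time
            w, b = self.mu_w, self.mu_b
        return F.linear(x, w, b)

class DuelingQNet(nn.Module):
    """Dueling head: Q(s, a) = V(s) + H(s, a) - mean_a' H(s, a')."""

    def __init__(self, state_dim=5, n_actions=3, hidden=128):
        super().__init__()
        self.body = nn.Sequential(NoisyLinear(state_dim, hidden), nn.ReLU(),
                                  NoisyLinear(hidden, hidden), nn.ReLU())
        self.value = NoisyLinear(hidden, 1)             # V: reward value of the state
        self.adv = NoisyLinear(hidden, n_actions)       # H: advantage of each action

    def forward(self, s):
        z = self.body(s)
        v, h = self.value(z), self.adv(z)
        return v + h - h.mean(dim=-1, keepdim=True)

def n_step_double_q_target(rewards, final_state, done, online_net, target_net, d=0.99):
    """n-step return bootstrapped with double Q-learning:
    the online net selects the action, the target net evaluates it."""
    n = len(rewards)
    ret = sum((d ** k) * rewards[k] for k in range(n))
    if not done:
        with torch.no_grad():
            best_a = online_net(final_state).argmax(dim=-1, keepdim=True)
            ret = ret + (d ** n) * target_net(final_state).gather(-1, best_a).squeeze(-1)
    return ret
```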
CN202311228253.XA 2023-09-21 2023-09-21 Active distribution network dispatching method based on deep reinforcement learning and new energy consumption Pending CN117318169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311228253.XA CN117318169A (en) 2023-09-21 2023-09-21 Active distribution network dispatching method based on deep reinforcement learning and new energy consumption

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311228253.XA CN117318169A (en) 2023-09-21 2023-09-21 Active distribution network dispatching method based on deep reinforcement learning and new energy consumption

Publications (1)

Publication Number Publication Date
CN117318169A true CN117318169A (en) 2023-12-29

Family

ID=89285953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311228253.XA Pending CN117318169A (en) 2023-09-21 2023-09-21 Active distribution network dispatching method based on deep reinforcement learning and new energy consumption

Country Status (1)

Country Link
CN (1) CN117318169A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118300193A (en) * 2024-04-19 2024-07-05 国网浙江省电力有限公司电力科学研究院 New energy power system consumption optimization method and device suitable for high-proportion photovoltaic permeation
CN118297364A (en) * 2024-06-06 2024-07-05 贵州乌江水电开发有限责任公司 Production scheduling system and method for watershed centralized control hydropower station
CN118336781A (en) * 2024-06-13 2024-07-12 西安热工研究院有限公司 Super capacitor energy storage capacity distribution method and system considering input factors
CN118473021A (en) * 2024-07-10 2024-08-09 格瓴新能源科技(杭州)有限公司 Micro-grid optimal scheduling method and system combining CMA-ES algorithm and DDPG algorithm
CN119136391A (en) * 2024-10-30 2024-12-13 江苏亿晖景观照明工程有限公司 A centralized control system for solar street lights
CN119231580A (en) * 2024-08-30 2024-12-31 国网天津市电力公司 Real-time energy management method for new energy microgrid based on proximal strategy optimization and Lightgbm

Similar Documents

Publication Publication Date Title
CN117318169A (en) Active distribution network dispatching method based on deep reinforcement learning and new energy consumption
Alfaverh et al. Optimal vehicle-to-grid control for supplementary frequency regulation using deep reinforcement learning
Wu et al. Deep learning adaptive dynamic programming for real time energy management and control strategy of micro-grid
CN112117760A (en) Micro-grid energy scheduling method based on double-Q-value network deep reinforcement learning
CN113326994A (en) Virtual power plant energy collaborative optimization method considering source load storage interaction
CN110378548B (en) Electric automobile virtual power plant multi-time scale response capability assessment model construction method
CN114331059B (en) Electricity-hydrogen complementary multi-building energy supply system in a park and its coordinated scheduling method
CN112636338A (en) Load partition regulation and control system and method based on edge calculation
CN108551176B (en) Energy storage battery system capacity configuration method combined with energy storage balancing technology
CN119009968A (en) Virtual power plant power generation plan making and economic dispatching method based on deep reinforcement learning
CN109167347A (en) Based on the adaptive population multiple target electric car charge and discharge Optimization Scheduling of cloud
CN110165714A (en) Micro-capacitance sensor integration scheduling and control method, computer readable storage medium based on limit dynamic programming algorithm
Zhang et al. Learning-based real-time aggregate flexibility provision and scheduling of electric vehicles
CN117117989A (en) Deep reinforcement learning solving method for unit combination
CN119382128A (en) A method for allocating power between source, grid, load and storage based on prediction and reinforcement learning multi-objectives
CN119323301A (en) Distributed energy coordination control method based on multi-agent reinforcement learning
Li et al. Optimal real-time Voltage/Var control for distribution network: Droop-control based multi-agent deep reinforcement learning
Nirmala et al. Hybrid technique for rapid charging: advancing solar PV battery integration and grid connectivity in electric vehicle charging stations
Iraklis et al. Flexibility forecast and resource composition methodology for virtual power plants
CN119362543A (en) An optimal dispatching method for electric vehicles participating in the energy storage market based on reinforcement learning
CN118971182A (en) Tracking Planned Output Method for Wind-Storage Combined System
Cai et al. Deep Q-network based battery energy storage system control strategy with charging/discharging times considered
CN117833316A (en) Method for dynamically optimizing operation of energy storage at user side
CN117477624A (en) A new energy power system accommodation method based on DMPC
CN117175591A (en) Wind-solar-storage combined power generation optimization method and system based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination