CN115618497A - An airfoil optimization design method based on deep reinforcement learning - Google Patents
- Publication number
- CN115618497A (application CN202211374735.1A)
- Authority
- CN
- China
- Prior art keywords
- reward
- airfoil
- value
- model
- optimization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F30/15—Vehicle, aircraft or watercraft design (Geometric CAD)
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
- G06F30/28—Design optimisation, verification or simulation using fluid dynamics, e.g. using Navier-Stokes equations or computational fluid dynamics [CFD]
- G06N3/08—Learning methods (computing arrangements based on biological models; neural networks)
- G06F2111/04—Constraint-based CAD
- G06F2111/10—Numerical modelling
- G06F2113/08—Fluids
- G06F2119/14—Force analysis or force optimisation, e.g. static or dynamic forces
- Y02T90/00—Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation
Abstract
The invention proposes an airfoil optimization design method based on deep reinforcement learning. Unlike supervised learning, the method learns its policy autonomously and maximizes long-term reward, making it an optimization approach closer to true intelligence, with transferability and a broadly applicable policy. If the design conditions are changed within a certain range, for example the free-stream Mach number or Reynolds number, the previously optimized policy still provides an initial optimization direction, and the optimization objective improves noticeably within a very small number of steps.
Description
Technical Field
The invention belongs to the field of aircraft design and proposes an airfoil optimization design method based on deep reinforcement learning.
Background
As a main component of an aircraft, the airfoil not only provides the lift required for flight but also ensures the aircraft's stability and controllability. As understanding of airfoil aerodynamic performance has grown, a number of airfoil libraries have been established. No single "super airfoil" satisfies every design condition, so a baseline airfoil that meets the requirements is first selected from an airfoil library according to factors such as the intended use and operating conditions, and is then improved iteratively against the design requirements. The earliest way to improve a baseline airfoil was repeated wind-tunnel testing, which consumes enormous material and human resources.
Since the beginning of the 21st century, advances in science and technology have injected new vitality into computational fluid dynamics (CFD), which quickly became the main tool for analyzing and solving fluid problems. CFD greatly shortens the optimization cycle and saves considerable manpower and material resources, allowing designers to repeat calculations until the desired airfoil is obtained; evolutionary and genetic algorithms, for example, have been widely applied to aerodynamic optimization problems. However, evolutionary and genetic algorithms make poor use of the large amount of computational data generated during optimization.
The application of machine learning in fluid mechanics has developed rapidly in recent years. The most common use of machine learning in design optimization is rapid performance prediction based on response surfaces, which uses existing data to build a mapping from inputs to outputs to accelerate the optimization and partially replace the role of CFD; this is supervised learning. Reinforcement learning, by contrast, builds a mapping from the environment's state parameters to action parameters, and its goal is to iteratively update the policy that interacts with the environment so as to maximize cumulative reward. The learning signal is not the true output required in supervised learning but the reward associated with a particular action. Airfoil design does not pursue the optimum of a single performance metric; it seeks the best overall performance after repeated trade-offs, and as a complex, coupled systems-engineering problem it often requires manual shape modification based on the engineer's own experience and understanding. The training process of reinforcement learning closely resembles the way engineers accumulate experience. Analogous to traditional trial and error, the agent responsible for taking actions continually tries different "actions" during training, observes the return of the design result over some future horizon (or after a sequence of actions), and updates its policy accordingly. As the return increases, the agent can be regarded as acquiring, to some extent, the same design experience as an engineer. In airfoil optimization, Li Runze used reinforcement learning to optimize the airfoil pressure distribution to reduce the drag of a transonic airfoil, and Viquerat attempted optimization with and without constraints. Reinforcement learning for airfoil optimization is still at an early stage, and the agent may fail to find a direction or become trapped in a local optimum. Overall, applying reinforcement learning to airfoil optimization in aircraft aerodynamic design can effectively improve optimization efficiency and has broad application prospects.
Summary of the Invention
Because current CFD-based optimization methods are inefficient, the invention proposes an airfoil optimization design method based on deep reinforcement learning to improve the efficiency of airfoil optimization. Unlike supervised learning, the method learns its policy autonomously and maximizes long-term reward, making it an optimization approach closer to true intelligence, with transferability, i.e. a broadly applicable policy.
The technical scheme of the invention is as follows:
The airfoil optimization design method based on deep reinforcement learning comprises the following steps:
Step 1: Parameterize the airfoil geometry with the free-form deformation (FFD) method: build an FFD control box around the baseline airfoil, establish the mapping between the control box and the airfoil, and obtain a new airfoil by moving the control-box points.
Step 2: Establish the optimization design model. Determine the single design objective and the constraints according to the flight requirements; the design objective and constraints are aerodynamic parameters of the airfoil, such as the lift coefficient, drag coefficient and airfoil thickness, and are expressed mathematically. A general single-objective optimization problem can be written in the following form:
Minimize: f(x)
subject to: g_w(x) ≥ 0, w = 1, 2, ···, W
h_r(x) = 0, r = 1, 2, ···, R
where x is the design variable, f(x) is the objective function, the g_w(x) are the W inequality constraints, and the h_r(x) are the R equality constraints.
Step 3: Construct the reward function from the optimization objective and the constraints. The total reward is the linear sum of the reward terms for the individual aerodynamic parameters: progress toward the objective increases the reward, a satisfied constraint adds nothing, and a violated constraint decreases the reward; the objective and constraint terms are multiplied by different coefficients to balance the difference in magnitude between the objective and the constraints.
Step 4: Build the agent, consisting of a policy model π and a value-function model. The policy model outputs the action policy, and the value-function model outputs the advantage estimate and the value function. Both are artificial neural networks with two hidden layers of 64 nodes each. Initialize the policy model parameters θ_0 and the value-function model parameters φ_0. The aerodynamic parameters of the airfoil related to the design objective and constraints, namely the lift coefficient, drag coefficient, maximum thickness and moment coefficient, are taken as the state; the aerodynamic parameters of the baseline airfoil form the initial state s_0.
Step 5: The policy model of the current agent outputs an action a, i.e. a new set of design variables, according to the state and the reward value.
Step 6: Apply the action to the airfoil to obtain a new airfoil.
Step 7: Generate a structured mesh for the new airfoil and run a numerical simulation of the flow around it with the open-source solver CFL3D; the governing equations are the Reynolds-averaged Navier-Stokes (RANS) equations and the turbulence model is the k-ω SST model. The computed lift coefficient, drag coefficient, maximum thickness and moment coefficient of the new airfoil form the new state.
Step 8: Compute the reward value from the computed aerodynamic parameters using the reward function of Step 3.
Step 9: Starting from the current policy model, repeat Steps 5-8 a total of e−1 times to obtain a trajectory containing the states and actions of each cycle together with the reward values {r_e}; the trajectory is τ = {s_0, a_0, ···, s_(e−1), a_(e−1), s_e}, where s_0 and a_0 are the initial state and action, s_e is the state at step e, s_(e−1) the state at step e−1, and a_(e−1) the action at step e−1.
Step 10: Repeat Steps 5-9 a total of n−1 times with the current policy model to obtain n trajectories and reward values.
Step 11: From the n trajectories and reward values, compute the advantage estimate with the value-function model of the current agent, i.e. the difference between the expected reward of each action a and the mean expected reward of all possible actions in that state.
Step 12: Construct the loss function from the advantage estimates, trajectories and reward values, and optimize the policy model parameters θ and the value-function model parameters φ with stochastic gradient descent, the optimization objective being to minimize the loss function. Update the policy and value-function models with the optimized parameters to obtain new policy and value-function models.
Step 13: Repeat Steps 5-12 until the loss function no longer decreases, at which point training is complete.
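As a sketch of how Steps 5-13 fit together, the outer training loop could be organized as below. The helper callables act, step and update are assumptions standing in for the policy model of Step 5, the airfoil deformation, CFD evaluation and reward of Steps 6-8, and the model update of Steps 11-12; the episode and batch sizes are placeholder values, not figures from the invention.

```python
# Sketch of the outer training loop (Steps 5-13). The callables are assumptions:
#   act(state) -> action, step(action) -> (new_state, reward), update(trajectories) -> loss.
def train(act, step, update, s0, steps_per_episode=20, episodes_per_batch=8,
          max_iterations=100, tol=1e-4):
    """Collect batches of trajectories and update the agent until the loss stops decreasing."""
    prev_loss = float("inf")
    for iteration in range(max_iterations):
        trajectories = []
        for _ in range(episodes_per_batch):          # Step 10: collect n trajectories
            state, traj = s0, []
            for _ in range(steps_per_episode):        # Step 9: e steps per trajectory
                action = act(state)                   # Step 5: policy proposes new design variables
                new_state, reward = step(action)      # Steps 6-8: deform airfoil, run CFD, score
                traj.append((state, action, reward))
                state = new_state
            trajectories.append(traj)
        loss = update(trajectories)                   # Steps 11-12: advantages and gradient update
        if prev_loss - loss < tol:                    # Step 13: stop once the loss stops decreasing
            return iteration
        prev_loss = loss
    return max_iterations
```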
Beneficial Effects
1. The airfoil optimization design method based on deep reinforcement learning proposed by the invention learns its own policy and maximizes long-term reward. Whereas genetic algorithms and similar methods use the large amount of computed data only to evaluate the optimization objectives and constraints, deep reinforcement learning tries to learn from the experience gained during optimization, which improves data utilization, reduces the amount of computation and raises optimization efficiency.
2. In the proposed method, the policy model obtained by deep reinforcement learning optimization has a degree of universality, or transferability: if the design conditions are changed within a certain range, for example the free-stream Mach number or Reynolds number, the previously optimized policy still provides an initial optimization direction, and the optimization objective improves noticeably within a very small number of steps. Traditional optimization methods do not have this property.
Additional aspects and advantages of the invention are set forth in part in the following description; in part they will become apparent from the description or be learned through practice of the invention.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the invention will become apparent and easy to understand from the following description of the embodiments taken in conjunction with the drawings, in which:
Fig. 1 is the overall flow chart of the airfoil optimization method of the invention.
Fig. 2 shows the control box of the free-form deformation method.
Fig. 3 is a schematic of the airfoil design space.
Fig. 4 is a schematic of the fully connected network.
Fig. 5 shows the reinforcement learning optimization workflow.
Fig. 6 shows the overall and close-up views of the airfoil mesh.
Fig. 7 is the drag convergence curve of the airfoil optimization with the proximal policy optimization (PPO) strategy of this method.
Fig. 8 is the drag convergence curve of the airfoil optimization with the comparison method, the non-dominated sorting genetic algorithm (NSGA-II).
Fig. 9 is the drag convergence curve of the airfoil optimization with the pre-trained PPO strategy.
Detailed Description of the Embodiments
Embodiments of the invention are described in detail below. The embodiments are exemplary and intended to explain the invention; they are not to be construed as limiting it.
Step 1: This example uses RAE2822 as the baseline airfoil and parameterizes it with the free-form deformation (FFD) method; the FFD control box is shown in Fig. 2. The design variables consist of 14 points that control the upper and lower surfaces, and the perturbation range of each design point in the y direction is ±0.02. The design space of the airfoil is shown in Fig. 3.
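A minimal sketch of an FFD-style deformation consistent with this setup is shown below, assuming a 7 × 2 lattice of control points (7 per surface, 14 in total), Bernstein-polynomial blending, y-displacements only, and clipping to the ±0.02 design range. The lattice layout and box handling are assumptions rather than the invention's exact implementation.

```python
# A minimal 2-D FFD-style sketch: 14 control points in a 7 x 2 lattice around the airfoil;
# only y-displacements are used, clipped to the +/-0.02 design range stated above.
import numpy as np
from math import comb

def ffd_deform(coords, dy, box_min, box_max, nx=7, ny=2):
    """coords: (N, 2) airfoil points; dy: (nx*ny,) y-displacements of the control points."""
    dy = np.clip(np.asarray(dy, dtype=float), -0.02, 0.02).reshape(nx, ny)
    # map airfoil points into the unit parameter space of the control box
    s = (coords[:, 0] - box_min[0]) / (box_max[0] - box_min[0])
    t = (coords[:, 1] - box_min[1]) / (box_max[1] - box_min[1])
    # Bernstein basis in each direction
    Bs = np.stack([comb(nx - 1, i) * s**i * (1 - s)**(nx - 1 - i) for i in range(nx)], axis=1)
    Bt = np.stack([comb(ny - 1, j) * t**j * (1 - t)**(ny - 1 - j) for j in range(ny)], axis=1)
    # displacement of every airfoil point is the tensor-product blend of control-point moves
    delta_y = (Bs @ dy * Bt).sum(axis=1)
    new_coords = coords.copy()
    new_coords[:, 1] += delta_y
    return new_coords
```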
Step 2: The design objective is to minimize the drag of the airfoil at the design condition while satisfying three constraints: (1) the lift coefficient does not decrease, (2) the absolute value of the moment coefficient does not increase, and (3) the maximum thickness of the airfoil does not decrease. The mathematical statement of the optimization problem is:
Minimize: C_d
Subject to: C_l ≥ C_l0
|C_m| ≤ |C_m0|
t ≥ t_0
where C_d is the drag coefficient of the airfoil, C_l the lift coefficient of the airfoil, C_l0 the lift coefficient of the baseline airfoil, C_m the moment coefficient of the airfoil, C_m0 the moment coefficient of the baseline airfoil, t the thickness of the airfoil, and t_0 the thickness of the baseline airfoil.
Step 3: Construct the reward function from the design objective and constraints of Step 2 as follows:
If C_d − C_d0 < 0, then reward_cd = 100 × (C_d − C_d0).
If C_l − C_l0 < 0, then reward_cl = 10 × (C_l − C_l0); otherwise reward_cl = 0.
If C_m − C_m0 < 0, then reward_cm = 100 × (C_m − C_m0); otherwise reward_cm = 0.
If t − t_0 < 0, then reward_t = 50 × (t − t_0); otherwise reward_t = 0.
The final reward value is reward = reward_cd + reward_cl + reward_cm + 100·reward_t,
where reward_cd is the drag-coefficient reward term, reward_cl the lift-coefficient term, reward_cm the moment-coefficient term, and reward_t the airfoil-thickness term; the factors 10, 50 and 100 balance the difference in magnitude between the objective and the constraints.
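A direct transcription of these reward terms into code might look as follows. The zero branch of the drag term is not stated above and is assumed; note also that, as written, the drag term is negative when drag decreases, whereas Step 3 states that reaching the objective increases the reward, so a sign convention such as 100 × (C_d0 − C_d) may be intended.

```python
# Transcription of the reward terms above; the coefficients 100, 10, 100, 50 and the final
# 100x weighting of the thickness term follow the text as written.
def compute_reward(cd, cl, cm, t, cd0, cl0, cm0, t0):
    reward_cd = 100.0 * (cd - cd0) if (cd - cd0) < 0 else 0.0  # drag objective term (see note above)
    reward_cl = 10.0 * (cl - cl0) if (cl - cl0) < 0 else 0.0   # penalty when lift constraint is violated
    reward_cm = 100.0 * (cm - cm0) if (cm - cm0) < 0 else 0.0  # penalty when moment constraint is violated
    reward_t = 50.0 * (t - t0) if (t - t0) < 0 else 0.0        # penalty when thickness constraint is violated
    return reward_cd + reward_cl + reward_cm + 100.0 * reward_t
```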
Step 4: Build the agent, consisting of a policy model π and a value-function model. The policy model outputs the action policy, and the value-function model outputs the advantage estimate and the value function. Both are artificial neural networks with two hidden layers of 64 nodes each, implemented in PyTorch.
The artificial neural network is a feed-forward network in which neuron nodes are connected layer by layer, the output of each layer serving as the input of the next, forming a hierarchical computation through weighted operations; a schematic of the fully connected network is shown in Fig. 4. Initialize the policy model parameters θ_0 and the value-function model parameters φ_0. The lift coefficient, drag coefficient, maximum thickness and moment coefficient of the airfoil are taken as the state, and those of the baseline airfoil form the initial state s_0. The relationships among the parts of the reinforcement learning optimization workflow are shown in Fig. 5.
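A minimal PyTorch sketch of such an agent is given below, with two hidden layers of 64 nodes in both the policy and the value network. The Tanh activations and the Gaussian action distribution are assumptions; the state dimension (4 aerodynamic parameters) and action dimension (14 FFD design variables) follow this embodiment.

```python
# Policy and value networks with two hidden layers of 64 nodes each (Step 4).
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

class Agent(nn.Module):
    def __init__(self, state_dim=4, action_dim=14):
        super().__init__()
        self.policy = mlp(state_dim, action_dim)        # outputs the mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.value = mlp(state_dim, 1)                  # state-value estimate used for advantages

    def act(self, state):
        mean = self.policy(state)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1), self.value(state).squeeze(-1)
```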
Step 5: The policy model of the current agent outputs an action, i.e. new airfoil design variables, according to the state and the reward value.
Step 6: Apply the action to the airfoil to obtain a new airfoil.
Step 7: Generate a C-type structured mesh for the new two-dimensional airfoil; Fig. 6 shows the overall and close-up views of the mesh. The computation condition is Ma = 0.734, Re = 6.5×10^6, α = 2.79°, where Ma is the free-stream Mach number, Re the free-stream Reynolds number, and α the free-stream angle of attack. Neglecting body forces and external heating, the flow around the airfoil is simulated with the open-source solver CFL3D; the governing equations are the Reynolds-averaged Navier-Stokes (RANS) equations and the turbulence model is the k-ω SST model. The computation gives the drag coefficient c_d, moment coefficient c_m, lift coefficient c_l and airfoil thickness t of the new airfoil, and these aerodynamic parameters form the new state.
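Steps 6-8 can be wrapped into a single environment step, sketched below using the ffd_deform and compute_reward sketches above. The make_cgrid, run_cfl3d and max_thickness callables are hypothetical stand-ins for the C-grid generation, the CFL3D RANS run and the geometric thickness measurement; the invention does not specify these interfaces, and the control-box extents are placeholder values.

```python
# One interaction step (Steps 6-8); assumes ffd_deform and compute_reward from the sketches
# above are in scope. make_cgrid, run_cfl3d and max_thickness are hypothetical stand-ins.
def env_step(action, baseline_coords, baseline_state, make_cgrid, run_cfl3d, max_thickness,
             mach=0.734, reynolds=6.5e6, alpha=2.79):
    """Deform the airfoil (Step 6), evaluate it with CFD (Step 7), and score it (Step 8)."""
    coords = ffd_deform(baseline_coords, action,
                        box_min=(-0.05, -0.10), box_max=(1.05, 0.10))  # placeholder box extents
    cl, cd, cm = run_cfl3d(make_cgrid(coords), mach, reynolds, alpha)   # RANS + k-w SST run
    t = max_thickness(coords)                                           # geometric constraint value
    reward = compute_reward(cd, cl, cm, t,
                            cd0=baseline_state[1], cl0=baseline_state[0],
                            cm0=baseline_state[2], t0=baseline_state[3])
    return (cl, cd, cm, t), reward
```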
Step 8: Compute the reward value from the computed c_d, c_m, c_l and t using the reward function of Step 3.
Step 9: Starting from the current policy model, repeat Steps 5-8 a total of e−1 times to obtain a trajectory containing the states and actions of each cycle together with the reward values {r_e}; the trajectory is τ = {s_0, a_0, ···, s_(e−1), a_(e−1), s_e}, where s_0 and a_0 are the initial state and action, s_e is the state at step e, s_(e−1) the state at step e−1, and a_(e−1) the action at step e−1.
Step 10: Repeat Steps 5-9 with the current policy model to obtain a total of n trajectories and reward values.
Step 11: From the obtained trajectory parameters, compute the advantage estimate with the current value-function model. The advantage estimate is the difference between the expected reward of each action a and the mean expected reward of all actions in that state; if it is greater than 0, the action a is better than average, otherwise it is worse than average.
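One common way to realize this advantage estimate is generalized advantage estimation (GAE), sketched below; whether the invention uses GAE or a simpler return-minus-baseline form is not stated, and the γ and λ values are assumptions.

```python
# Generalized advantage estimation as one possible realization of Step 11.
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: r_1..r_e along one trajectory; values: V(s_0)..V(s_e) from the value network."""
    adv = np.zeros(len(rewards))
    last = 0.0
    for k in reversed(range(len(rewards))):
        delta = rewards[k] + gamma * values[k + 1] - values[k]  # one-step TD error
        last = delta + gamma * lam * last
        adv[k] = last
    return adv  # positive entries mark actions that are better than average
```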
Step 12: Construct the loss function from the advantage estimates, trajectories and reward values, and optimize the policy model parameters θ and the value-function model parameters φ with stochastic gradient descent, the optimization objective being to minimize the loss function. Update the policy and value-function models with the optimized parameters to obtain new policy and value-function models.
Step 13: Repeat Steps 5-12 until the loss function no longer decreases, at which point training is complete.
Proximal policy optimization (PPO) is a deep reinforcement learning algorithm. Fig. 7 shows the drag convergence curve obtained with the proximal policy optimization strategy of this method. The airfoil drag improves markedly, decreasing by about 50 drag counts relative to the initial RAE2822 airfoil, an improvement of roughly 30%. To check the optimization performance of PPO in single-objective constrained aerodynamic design, the non-dominated sorting genetic algorithm (NSGA-II) is used as the comparison algorithm, with a population size of 100 and 100 generations; its convergence curve is shown in Fig. 8. The optimization performance of PPO still falls short of NSGA-II, so pre-training is used to improve it. Pre-training starts from a model already trained on a similar problem, which provides a better initial point for the optimization, helps avoid falling into local optima, and yields better optimization results.
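For reference, the clipped surrogate objective that defines PPO, applied here as the loss of Step 12, could be written as follows; the clip ratio and the value-loss weight are typical defaults and are assumptions, not figures taken from the invention.

```python
# Minimal PPO clipped-surrogate loss in PyTorch (Step 12).
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, clip=0.2, vf_coef=0.5):
    ratio = torch.exp(new_logp - old_logp)               # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()  # clipped surrogate objective
    value_loss = (values - returns).pow(2).mean()        # value-function regression
    return policy_loss + vf_coef * value_loss
```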
The network is pre-trained, and the convergence curve of the pre-trained PPO is shown in Fig. 9; Table 1 compares the optimization results and the required amount of computation for the two methods. The convergence of the optimization is clearly more stable than that of the genetic algorithm, and the two show the same convergence trend; for a similar drag-reduction effect, the pre-trained PPO strategy of this method requires 0.05 times the computation of the genetic algorithm, a significant reduction. As shown in Table 2, when the computation budget is fixed at 500, the drag-reduction effect of both the PPO strategy and the pre-trained PPO strategy is also far better than that of the genetic algorithm. The airfoil optimization design method based on deep reinforcement learning of the invention therefore significantly reduces the amount of computation compared with the genetic algorithm, and after pre-training it is also not prone to falling into local optima.
Table 1
Table 2
Next, the design conditions are changed to show that the policy learned by reinforcement learning has a degree of universality. Keeping the baseline airfoil unchanged, the design condition is changed from Ma = 0.734, Re = 6.5×10^6, α = 2.79° to 1#: Ma = 0.72, Re = 1×10^7, α = 2.79°; 2#: Ma = 0.76, Re = 8×10^6, α = 2.79°; and 3#: Ma = 0.6, Re = 6.5×10^6, α = 2.79°. To verify the effectiveness of the policy, the existing policy is applied directly to the three conditions for 5, 10, 15 and 20 optimization steps and the change in drag is observed. The results are given in Table 3: after 20 steps, the drag decreases by 7.4% for condition 1#, 3.7% for condition 2#, and 6.8% for condition 3#, and the constraints are satisfied in all three cases. In short, reinforcement learning has learned a policy with a degree of universality that can provide a direction and preliminary results for the optimization.
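A sketch of this transfer test is given below: the trained policy is applied, without further training, for a fixed number of steps under a new flow condition and the lowest drag seen is recorded. The Agent and env_step names refer to the sketches earlier in this description; step_fn is assumed to be env_step with the mesh and solver callables and the new Ma, Re and α already bound, and s0 is the baseline state re-evaluated at the new condition.

```python
# Reuse of the trained policy under a new design condition, without further training.
import torch

def transfer_rollout(agent, s0, coords0, step_fn, n_steps=20):
    state, best_cd = s0, s0[1]
    for _ in range(n_steps):                   # e.g. 5, 10, 15 or 20 optimization steps
        with torch.no_grad():
            action, _, _ = agent.act(torch.tensor(state, dtype=torch.float32))
        state, _ = step_fn(action.numpy(), coords0, s0)
        best_cd = min(best_cd, state[1])       # track the lowest drag coefficient seen so far
    return best_cd
```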
Table 3
Although embodiments of the invention have been shown and described above, it should be understood that they are exemplary and are not to be construed as limiting the invention; a person of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the invention without departing from its principles and purpose.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211374735.1A CN115618497B (en) | 2022-11-04 | 2022-11-04 | A method for airfoil optimization design based on deep reinforcement learning |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211374735.1A CN115618497B (en) | 2022-11-04 | 2022-11-04 | A method for airfoil optimization design based on deep reinforcement learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115618497A true CN115618497A (en) | 2023-01-17 |
| CN115618497B CN115618497B (en) | 2025-03-21 |
Family
ID=84876923
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211374735.1A Active CN115618497B (en) | 2022-11-04 | 2022-11-04 | A method for airfoil optimization design based on deep reinforcement learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115618497B (en) |
- 2022-11-04: CN application CN202211374735.1A, patent publication CN115618497B (en), status Active
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109614631A (en) * | 2018-10-18 | 2019-04-12 | 清华大学 | Fully automatic aerodynamic optimization method for aircraft based on reinforcement learning and transfer learning |
| CN110488861A (en) * | 2019-07-30 | 2019-11-22 | 北京邮电大学 | Unmanned plane track optimizing method, device and unmanned plane based on deeply study |
| US20220004191A1 (en) * | 2020-07-01 | 2022-01-06 | Wuhan University Of Technology | Usv formation path-following method based on deep reinforcement learning |
| CN114626277A (en) * | 2022-04-02 | 2022-06-14 | 浙江大学 | Active flow control method based on reinforcement learning |
| CN114861368A (en) * | 2022-06-13 | 2022-08-05 | 中南大学 | A Construction Method of Learning Model for Railway Profile Design Based on Proximal Strategy |
Non-Patent Citations (1)
| Title |
|---|
| Ding Cunwei; Yang Xudong: "A multi-point, multi-constraint aerodynamic optimization design strategy for rotor airfoils", Aeronautical Computing Technique, vol. 43, no. 1, 15 January 2013 (2013-01-15), pages 52-57 * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2025055320A1 (en) * | 2023-09-15 | 2025-03-20 | 上海师范大学 | Airfoil profile optimization design method based on deep learning and reinforcement learning |
| CN117634334A (en) * | 2023-10-27 | 2024-03-01 | 西北工业大学 | Fighter aircraft wing profile optimization design method considering pneumatic/stealth and wide-speed-domain wing profile family |
| WO2025100881A1 (en) * | 2023-11-09 | 2025-05-15 | 포항공과대학교 산학협력단 | Cfd automation method for optimal airfoil flow analysis based on reinforcement training, cfd airfoil flow analysis method, and cfd airfoil flow analysis device |
| CN119249911A (en) * | 2024-12-03 | 2025-01-03 | 西北工业大学 | A design method for efficient flow control based on transfer learning |
| CN119249911B (en) * | 2024-12-03 | 2025-04-04 | 西北工业大学 | Flow active control synergy design method based on transfer learning |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115618497B (en) | 2025-03-21 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN115618497A (en) | An airfoil optimization design method based on deep reinforcement learning | |
| CN108637020B (en) | Self-adaptive variation PSO-BP neural network strip steel convexity prediction method | |
| CN110238839A (en) | A control method for multi-axis hole assembly using environment prediction to optimize non-model robot | |
| CN112084727A (en) | A Transition Prediction Method Based on Neural Network | |
| CN114626277B (en) | An Active Flow Control Method Based on Reinforcement Learning | |
| CN110286586A (en) | A Hybrid Modeling Method for Magnetorheological Damper | |
| Dong et al. | Reinforcement learning-based wind farm control: Toward large farm applications via automatic grouping and transfer learning | |
| CN114676639A (en) | Aircraft aerodynamic shape optimization method, device and medium based on neural network | |
| CN115293052A (en) | On-line optimization control method, storage medium and device for active power flow in power system | |
| CN114818128B (en) | Modeling method and optimizing method for ship body local curved surface optimizing neural network | |
| CN115438584A (en) | A deep learning-based airfoil aerodynamic prediction method | |
| CN108764577A (en) | Online time series prediction technique based on dynamic fuzzy Cognitive Map | |
| CN116702292A (en) | Pneumatic optimization method for wind nozzle of flat steel box girder based on deep reinforcement learning | |
| CN114384931B (en) | Multi-target optimal control method and equipment for unmanned aerial vehicle based on strategy gradient | |
| CN117970782B (en) | Fuzzy PID control method based on fish scale evolution GSOM improvement | |
| CN120354752B (en) | Intelligent design simulation system for fluid mechanical through-flow component | |
| CN113300884B (en) | A step-by-step network traffic prediction method based on GWO-SVR | |
| Zeng et al. | DDPG-based continuous thickness and tension coupling control for the unsteady cold rolling process | |
| CN115167102A (en) | Reinforced learning self-adaptive PID control method based on parallel dominant motion evaluation | |
| CN118519342A (en) | Prediction model construction method and pump storage unit working condition dynamic adjustment method | |
| CN117057225A (en) | Self-adaptive learning gas valve high-speed high-frequency high-precision servo and performance reconstruction method | |
| CN116594288A (en) | Control method and system based on longhorn beetle whisker fuzzy PID | |
| CN116047904A (en) | Human-simulation-reality hybrid training method for learning robot operation skills | |
| CN118657072B (en) | Method, system and device for adaptive voltage regulator parameter optimization | |
| CN119518833A (en) | A method and system for coordinated control of power grid frequency |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |