CN110096335B

CN110096335B - A prediction method of business concurrency for different types of virtual machines

Info

Publication number: CN110096335B
Application number: CN201910355147.5A
Authority: CN
Inventors: 郭军; 王馨悦; 张斌; 刘晨; 侯帅; 侯凯; 李薇; 柳波; 王嘉怡; 刘文凤; 张瀚铎; 张娅杰
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2019-04-29
Filing date: 2019-04-29
Publication date: 2022-06-21
Anticipated expiration: 2039-04-29
Also published as: WO2020220438A1; CN110096335A

Abstract

The invention provides a prediction method for different types of business concurrency of virtual machines, and relates to the technical field of cloud computing. A method for predicting business concurrency for different types of virtual machines. First, the historical business concurrency of virtual machines is collected and preprocessed, and then the virtual machine business concurrency is judged based on the improved 1-nearest neighbor-dynamic time adjustment method 1NN-DTW. Finally, a classification and regression tree is used to fit the business concurrency without periodic changes; Fourier series FS and a classification and regression tree CART are used to fit the business concurrency with periodic changes; It can provide a basis for the increase or decrease of virtual machines in the next step, and at the same time help to accurately estimate the software aging status of virtual machines, so as to improve the working virtual machine. performance and reliability purposes.

Description

A prediction method of business concurrency for different types of virtual machines

技术领域technical field

本发明涉及云计算技术领域，尤其涉及一种针对虚拟机不同类型的业务并发量预测方法。The invention relates to the technical field of cloud computing, and in particular, to a method for predicting business concurrency for different types of virtual machines.

背景技术Background technique

软件老化普遍存在于云服务系统中，在虚拟机处理业务并发请求的过程中，操作系统、应用软件等不断地积累错误，导致工作虚拟机的性能逐渐下降，进而影响到云服务系统的服务质量。云平台的高可伸缩、动态重构特性为确保不同并发条件下的云服务质量提供了技术基础，然而现有的虚拟资源动态调整方法仍存在很多缺陷。Software aging is common in cloud service systems. During the process of virtual machines processing concurrent business requests, operating systems and application software continue to accumulate errors, resulting in the gradual degradation of the performance of working virtual machines, which in turn affects the service quality of cloud service systems. . The highly scalable and dynamic reconfiguration characteristics of the cloud platform provide a technical basis for ensuring the quality of cloud services under different concurrency conditions. However, the existing dynamic adjustment methods for virtual resources still have many defects.

一般来说，虚拟机上部署着各种各样的业务，而且不同时间各业务并发量的变化趋势不同，例如，有的业务并发量在白天某段时间持续增加，在晚上某段时间持续减少，有的业务并发量持续循环波动，而有的业务并发量一直保持平稳。通过对云平台各业务的并发量进行预测，可以为下一步虚拟机的增加或者减少提供依据，同时有助于准确估计虚拟机的软件老化状况，以达到提高工作虚拟机性能和可靠性的目的。Generally speaking, a variety of services are deployed on virtual machines, and the change trend of the concurrency of each service at different times is different. For example, some services continue to increase during a certain period of time during the day, and continue to decrease during a certain period of time at night. , some business concurrency fluctuates continuously, while some business concurrency has remained stable. By predicting the concurrency of each business of the cloud platform, it can provide a basis for the increase or decrease of virtual machines in the next step, and at the same time help to accurately estimate the software aging status of virtual machines, so as to improve the performance and reliability of working virtual machines. .

由于用户操作、虚拟机业务以及其他不确定性因素在时刻改变，所以业务的并发访问量不仅会随着时间平稳变化，往往还具有上升、下降以及循环波动等趋势，传统的负载模型比如指数平滑模型只能大致刻画出业务并发量的变化趋势，不能很好地捕获其中的非线性变化特征。Because user operations, virtual machine services, and other uncertain factors change all the time, the concurrent access volume of services not only changes steadily with time, but also tends to rise, fall, and cyclic fluctuations. Traditional load models such as exponential smoothing The model can only roughly describe the change trend of business concurrency, and cannot capture the nonlinear change characteristics well.

发明内容SUMMARY OF THE INVENTION

本发明要解决的技术问题是针对上述现有技术的不足，提供一种针对虚拟机不同类型的业务并发量预测方法，实现对虚拟机中不同类型的业务并发量进行预测。The technical problem to be solved by the present invention is to provide a method for predicting different types of business concurrency of virtual machines, so as to realize the prediction of different types of business concurrency in virtual machines.

一种针对虚拟机不同类型的业务并发量预测方法，包括以下步骤：A method for predicting business concurrency for different types of virtual machines, comprising the following steps:

步骤1：采集虚拟机的历史业务并发量，并进行预处理，具体方法为：Step 1: Collect historical business concurrency of virtual machines and preprocess them. The specific methods are as follows:

步骤1.1：扫描一段时间内虚拟机的业务并发量，发现业务并发量的缺失点；Step 1.1: Scan the business concurrency of virtual machines for a period of time to find the missing points of business concurrency;

步骤1.2：对扫描到的业务并发量缺失点进行处理；Step 1.2: Process the scanned missing points of business concurrency;

步骤1.2.1：对于个别采样点缺失的情况，采用前一周期和后一周期业务并发量的平均值进行填补，虚拟机第t个时间段的业务并发量con(t)缺失的计算如下公式所示：Step 1.2.1: For the case of missing individual sampling points, use the average value of the business concurrency in the previous cycle and the next cycle to fill in. The calculation formula for the missing business concurrency con(t) in the t-th time period of the virtual machine is as follows: shown:

步骤1.2.2：对于样本缺失达到百分九十以上的情况，舍弃全部样本并且将该段时间内业务并发量的值置为零；Step 1.2.2: For the case where the sample is missing more than 90%, discard all samples and set the value of business concurrency within this period of time to zero;

步骤1.3：对于采集到的业务并发量中存在异常波动的极大极小样本进行异常值调整；Step 1.3: Perform outlier adjustment for the extremely small samples with abnormal fluctuations in the collected business concurrency;

步骤1.3.1：结合四分位数计算t时间内虚拟机业务并发量正常取值的上限H和下限L，如下公式所示：Step 1.3.1: Combine the quartiles to calculate the upper limit H and lower limit L of the normal value of the virtual machine business concurrency in time t, as shown in the following formula:

H＝Q3+k*(Q3-Q1) (2)H=Q3+k*(Q3-Q1) (2)

L＝Q1-k*(Q3-Q1) (3)L=Q1-k*(Q3-Q1) (3)

其中，Q1表示下四分位数，即t时间内业务并发量升序数列的百分之二十五位点，Q3表示上四分位数，即t时间内业务并发量升序数列的百分之七十五位点，k用于描述不合理采样点的异常程度，一般取1.5和3，分别代表中度和极度；Among them, Q1 represents the lower quartile, that is, the 25th percentile of the ascending sequence of business concurrency in time t, and Q3 represents the upper quartile, that is, the percent of the ascending sequence of business concurrency in time t Seventy-five points, k is used to describe the abnormal degree of unreasonable sampling points, generally 1.5 and 3, representing moderate and extreme respectively;

步骤1.3.2：通过图基检验方法判定各采样点数据是否正常，并对异常值进行调整；Step 1.3.2: Determine whether the data of each sampling point is normal through the Tukey test method, and adjust the abnormal values;

如果采样点数据值被判定为错误业务并发量样本，则先将错误值丢弃，再用均值填补法补充；If the data value of the sampling point is determined to be a sample of wrong business concurrency, the wrong value will be discarded first, and then supplemented by the mean value filling method;

如果采样点数据值被判定为正常业务并发量样本，则不做任何调整；If the data value of the sampling point is determined to be a sample of normal business concurrency, no adjustment will be made;

步骤1.4：对从日志数据库或者打点日志中采集到的业务并发量和CPU利用率数据进行数据间隔调整，对采集的数据以秒、分钟或小时为单位进行合并；Step 1.4: Adjust the data interval for the business concurrency and CPU utilization data collected from the log database or management log, and merge the collected data in seconds, minutes or hours;

步骤1.5：采用最大最小值归一法将步骤1.4处理后的数据进行归一化；Step 1.5: Use the maximum and minimum normalization method to normalize the data processed in step 1.4;

步骤2：基于改进的1最近邻-动态时间调整(1-NearestNeighbor-Dynamic TimeWarping，即1NN-DTW)方法判断虚拟机业务并发量的类型，具体方法为：Step 2: Determine the type of virtual machine business concurrency based on the improved 1-Nearest Neighbor-Dynamic TimeWarping (1-NearestNeighbor-Dynamic TimeWarping, ie 1NN-DTW) method. The specific method is as follows:

步骤2.1：对虚拟机的各业务并发量进行分类，分为上升型、下降型、二次型、随机型、周期波动型、周期上升型和周期下降型；Step 2.1: Classify the concurrency of each service of the virtual machine into rising, falling, quadratic, random, cyclically fluctuating, cyclically rising and cyclically falling;

步骤2.2：针对各种类型的业务并发量，提前选取带标签的业务并发量数列作为已知样本；Step 2.2: For various types of business concurrency, select the labeled business concurrency sequence as a known sample in advance;

步骤2.3：对每一个待分类的业务并发量数列，依次扫描所有已知样本并通过临近算法计算出最相近的一条已知样本，则该已知样本的类型即为待分类业务并发量的类型；Step 2.3: For each sequence of business concurrency to be classified, scan all known samples in turn and calculate the closest known sample through the proximity algorithm, then the type of the known sample is the type of business concurrency to be classified ;

步骤2.4：将所有业务并发量归为两大类以简化1最近邻模型；Step 2.4: Classify all business concurrency into two categories to simplify the 1-nearest neighbor model;

将随机型、上升型、下降型和二次型业务并发量归为不具有周期变化类；Classify random, rising, falling, and quadratic business concurrency into categories that do not have periodic changes;

将周期波动型、周期上升型和周期下降型业务并发量归为具有周期变化类；Classify the cyclically fluctuating, cyclically rising, and cyclically falling business concurrency as categories with cyclical changes;

步骤2.5：构造n×m矩阵，使待分类的业务并发量数列{x₁，x₂，...，x_n}和一条已知的业务并发量数列{y₁，y₂，...，y_m}对齐，其中，n为待分类的业务并发量总数量，m为已知的业务并发量总数量；Step 2.5: Construct an n×m matrix so that the sequence of business concurrency to be classified {x ₁ , x ₂ , ..., x _n } and a known sequence of business concurrency {y ₁ , y ₂ , ... , y _m } are aligned, where n is the total number of business concurrency to be classified, and m is the known total number of business concurrency;

步骤2.6：将待分类的第i个业务并发量x_i和已知的第j个业务并发量y_j两点偏差作为矩阵中(i，j)位置的值d_i，j，同时使用欧式距离和两点导数差的平方的方法，计算待分类的业务并发量数列{x₁，x₂，...，x_n}和已知的业务并发量数列{y₁，y₂，...，y_m}对齐后各点的偏差d_i，j，如下公式所示：Step 2.6: Take the two-point deviation of the i-th business concurrency x _i to be classified and the known j-th business concurrency amount y _j as the value d _{i, j} of the (i, j) position in the matrix, and use the Euclidean distance at the same time And the method of the square of the difference of the two-point derivative to calculate the sequence of business concurrency to be classified {x ₁ , x ₂ , ..., x _n } and the known business concurrency sequence {y ₁ , y ₂ , ... , y _m } The deviation d _{i, j} of each point after alignment is shown in the following formula:

d_i，j＝(x_i-y_j)²+(x′_i-y′_j)² (4)d _i,j =(x _i -y _j ) ² +(x' _i -y' _j ) ² (4)

其中，x′_i、y′_j分别为x_i、y_j的导数，业务并发量x_i的导数x′_i的估计如下公式所示：Among them, x′ _i and y′ _j are the derivatives of x _i and y _j respectively, and the estimation of the derivative x′ _i of the service concurrency x _i is shown in the following formula:

步骤2.7：在矩阵中从位置(1，1)开始，根据除边界值外规定每个位置只能到达其上方、右方或者右上方的位置的约束条件迭代寻找出一条累积偏差最小的路径，直到位置(n，m)结束；Step 2.7: Starting from position (1, 1) in the matrix, iteratively find a path with the smallest cumulative deviation according to the constraints that each position can only reach the position above, right or upper right except for the boundary value, until the end of position (n, m);

步骤3：预测虚拟机不同变化类型的业务并发量，具体方法为：Step 3: Predict the business concurrency of different types of changes in the virtual machine. The specific methods are:

步骤3.1：采用分类回归树(Classification And Regression Tree，即CART)拟合不具有周期变化的业务并发量；Step 3.1: Use a Classification and Regression Tree (CART) to fit the business concurrency without periodic changes;

步骤3.1.1：遍历样本业务并发量数列的每个特征F的任意取值f，以(F，f)作为条件分割样本数据，确定平方误差最小的分割位置，从业务并发量数列中选择最好的切割点；Step 3.1.1: Traverse the arbitrary value f of each feature F of the sample business concurrency sequence, divide the sample data with (F, f) as the condition, determine the division position with the smallest square error, and select the highest value from the business concurrency sequence. good cutting point;

所述平方误差error的计算公式如下：The calculation formula of the squared error error is as follows:

其中，

代表样本x中第i’个业务并发量的特征，y_i，代表分割前的第i’个序列样本，

代表分割后的第i’个子序列样本的拟合结果；in,

Represents the feature of the i'th business concurrency in the sample x, y _i, represents the i'th sequence sample before splitting,

represents the fitting result of the i'th subsequence sample after segmentation;

步骤3.1.2：保存作为切割点的业务并发量值，并对业务并发量数列执行切分；Step 3.1.2: Save the business concurrency value as the cutting point, and perform segmentation on the business concurrency sequence;

步骤3.1.3：依次构建特征F大于f的子树和小于f的子树，进一步迭代对当前分割点左边和右边的业务并发量数列分割拟合，直到无法再分记为叶子节点；Step 3.1.3: Construct subtrees with feature F greater than f and subtrees smaller than f in turn, and further iteratively segment and fit the sequence of business concurrency on the left and right of the current split point until it can no longer be divided into leaf nodes;

步骤3.1.4：从下而上重新遍历样本数据，对所有业务并发量数列检查每个分割点，判断分割之前与分割之后并发量数列的拟合误差，Step 3.1.4: Re-traverse the sample data from bottom to top, check each split point for all business concurrency series, and judge the fitting error of the concurrency series before and after split.

若分割之后并发量数列的拟合误差降低，则保留该分割点；If the fitting error of the concurrency sequence decreases after the split, the split point is retained;

若分割之后并发量数列的拟合误差升高，则取消该分割点并合并左右数列；If the fitting error of the concurrent series increases after the split, cancel the split point and merge the left and right series;

步骤3.2：采用傅里叶级数FS和分类回归树CART拟合具有周期变化的业务并发量；Step 3.2: Use Fourier series FS and classification regression tree CART to fit business concurrency with periodic changes;

步骤3.2.1：利用分类回归树CART拟合{t₁，t₂，...，t_n’}时刻的业务并发量得到拟合值{y(0)，...y(n’-1)，y(n’)}，刻画出业务并发量的上升或者下降趋势；Step 3.2.1: Use the classification and regression tree CART to fit the business concurrency at {t ₁ , t ₂ , ..., t _n' } to obtain the fitted values {y(0), ... y(n'- 1), y(n')}, depicts the rising or falling trend of business concurrency;

步骤3.2.2：把步骤3.2.1中所得到的业务并发量与真实业务并发量比较得到残差序列{e(0)，e(1)，...，e(n)}；Step 3.2.2: Compare the business concurrency obtained in step 3.2.1 with the real business concurrency to obtain the residual sequence {e(0), e(1),...,e(n)};

步骤3.2.3：利用分类回归树CART预测{t_n+1，t_n+2...，t_m’}时刻的业务并发量为{y(n+1)，y(n+2)，...，y(m’)；Step 3.2.3: Use the classification and regression tree CART to predict the business concurrency at the moment {t _n+1 , t _n+2 ..., t _m' } is {y(n+1), y(n+2), ..., y(m');

步骤3.2.4：利用傅里叶级数FS拟合残差序列{e(0)，e(1)，...，e(n)}，刻画出业务并发量的周期趋势，求得{t_n’+1，t_n’+2，...，t_m’}时刻业务并发量的残差值{e(n’+1)，e(n’+2)，...，e(m’)}；Step 3.2.4: Use the Fourier series FS to fit the residual sequence {e(0), e(1), ..., e(n)}, characterize the cyclical trend of business concurrency, and obtain { Residual value of business concurrency at time t _n'+1 , t _n'+2 ,..., t _m' } {e(n'+1), e(n'+2),...,e (m')};

步骤3.2.4.1：使用函数w(t)拟合残差序列e(0)，e(1)，...，e(n’)，函数w(t)如下公式所示：Step 3.2.4.1: Use the function w(t) to fit the residual sequence e(0), e(1), ..., e(n'), the function w(t) is as follows:

其中，a₀、a_j’和b_j’均为变量，P＝n’，

表示向下取整，t＝1，2，...n’；Among them, a ₀ , a _j' and b _j' are all variables, P=n',

Indicates rounding down, t=1, 2,...n';

步骤3.2.4.2：通过最小二乘法计算变量a_j’和b_j’的值，如下公式所示：Step 3.2.4.2: Calculate the values of variables a _j' and b _j' by the least squares method, as shown in the following formulas:

其中，w_j’为第j’个用于拟合残差的函数；Among them, w _j' is the j'th function used to fit the residual;

步骤3.2.5：将{t_n’+1，t_n’+2，...，t_m’}时刻的业务并发量与其对应的残差值相加，得到{t_n’+1，t_n’+2，...，t_m’}时刻业务并发量的预测值，即{y(n’+1)+e(n’+1)，y(n’+2)+e(n’+2)，...，y(m’)+e(m’)}。Step 3.2.5: Add the business concurrency at {t _n'+1 , t _n'+2 , ..., t _m' } to its corresponding residual value to obtain {t _n'+1 , t _n'+2 , ..., t _m' } Predicted value of business concurrency, namely {y(n'+1)+e(n'+1), y(n'+2)+e(n '+2), ..., y(m')+e(m')}.

采用上述技术方案所产生的有益效果在于：本发明提供的一种针对虚拟机不同类型的业务并发量预测方法，将业务并发访问量分为周期型、上升型、下降型、二次型和随机型，不同类型的业务并发量所适用的预测方法不同，在预测之前对各业务并发量进行分类，不仅可以有针对性地训练业务并发量模型，而且在对相同类型的业务并发量建模时还可以实现参数的共享。通过本发明方法对虚拟机各业务的并发量进行预测，可以为下一步虚拟机的增加或者减少提供依据，同时有助于准确估计虚拟机的软件老化状况，以达到提高工作虚拟机性能和可靠性的目的。The beneficial effects of adopting the above technical solutions are: the present invention provides a method for predicting the concurrent traffic volume of different types of virtual machines, which divides the traffic concurrent traffic volume into periodic, ascending, descending, quadratic and random Different types of business concurrency have different prediction methods. Classifying each business concurrency before forecasting can not only train the business concurrency model in a targeted manner, but also model the same type of business concurrency. Parameters can also be shared. The method of the present invention predicts the concurrency of each business of the virtual machine, which can provide a basis for the increase or decrease of the virtual machine in the next step, and at the same time help to accurately estimate the software aging status of the virtual machine, so as to improve the performance and reliability of the working virtual machine. sexual purpose.

附图说明Description of drawings

图1为本发明实施例提供的飞机票在线订购系统的实例拓扑图；Fig. 1 is the example topology diagram of the air ticket online ordering system provided by the embodiment of the present invention;

图2为本发明实施例提供的一种针对虚拟机不同类型的业务并发量预测方法的流程图；2 is a flowchart of a method for predicting different types of service concurrency for virtual machines according to an embodiment of the present invention;

图3为本发明实施例提供的二次型业务并发量预测结果的示意图；3 is a schematic diagram of a prediction result of a secondary service concurrency provided by an embodiment of the present invention;

图4为本发明实施例提供的周期上升型业务并发量预测结果的示意图。FIG. 4 is a schematic diagram of a prediction result of the concurrency of a periodically rising type of service provided by an embodiment of the present invention.

图中，1、客户端；2、负载均衡Nginx；3、交换机；4、服务端；5、业务数据库MySQL。In the figure, 1, client; 2, load balancing Nginx; 3, switch; 4, server; 5, business database MySQL.

具体实施方式Detailed ways

下面结合附图和实施例，对本发明的具体实施方式作进一步详细描述。以下实施例用于说明本发明，但不用来限制本发明的范围。The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. The following examples are intended to illustrate the present invention, but not to limit the scope of the present invention.

本实施例以飞机票在线订购系统模拟PC端用户应用，在曙光服务器上搭建该服务系统，通过对飞机票在线订购系统加压模拟真实的业务并发场景，并采集不同的业务并发量数据为例，使用本发明的一种针对虚拟机不同类型的业务并发量预测方法进行业务并发量的预测。实例拓扑图如图1所示，客户端1使用LoadRunner软件产生业务并发访问，它可以模拟大量的用户同时点击飞机票订购系统页面，LoadRunner发送页面请求后，由负载均衡Nginx2实现业务请求的接收和分配，最后服务端4安装Tomcat并部署飞机票在线预订系统，负责读写业务数据库MySQL5，处理LoadRunner发送的请求。In this embodiment, the online air ticket ordering system simulates a PC-side user application, builds the service system on the Dawning server, simulates real business concurrency scenarios by pressurizing the air ticket online ordering system, and collects different business concurrency data as an example , using a method for predicting different types of business concurrency of virtual machines of the present invention to predict the business concurrency. The example topology is shown in Figure 1. Client 1 uses LoadRunner software to generate concurrent business access. It can simulate a large number of users clicking on the air ticket ordering system page at the same time. After LoadRunner sends page requests, load balancing Nginx2 realizes the reception and processing of business requests. Assignment, and finally the server 4 installs Tomcat and deploys the air ticket online booking system, which is responsible for reading and writing the business database MySQL5, and processing the requests sent by LoadRunner.

一种针对虚拟机不同类型的业务并发量预测方法，如图2所示，包括以下步骤：A method for predicting business concurrency for different types of virtual machines, as shown in Figure 2, includes the following steps:

步骤1.2.2：对于样本缺失达到百分九十以上舍弃全部样本并且将该段时间内业务并发量的值置为零；例如在20个连续采样周期中，只有2个周期采集到业务并发量值，甚至全部数据为空，那么可以认为这段时间采集到的业务并发量都是不可信的，不能纳入历史数列进行预测；Step 1.2.2: For samples missing more than 90%, discard all samples and set the value of business concurrency during this period to zero; for example, in 20 consecutive sampling cycles, only 2 cycles collect business concurrency value, or even if all the data is empty, it can be considered that the business concurrency collected during this period is unreliable and cannot be included in the historical sequence for prediction;

H＝Q3+k*(Q3-Q1) (2)H=Q3+k*(Q3-Q1) (2)

L＝Q1-k*(Q3-Q1) (3)L=Q1-k*(Q3-Q1) (3)

在业务并发访问量建模时，以1秒为时间间隔采样的业务并发量波动频繁，趋势变化不明显，无法挖掘变化的特征，而且过密采样使得模型计算量加大，训练更加迟缓；因此，在本实施例中以15秒为间隔取平均值进行整理数据都是，虚拟机的其他数据也是以15秒为间隔；When modeling business concurrent access, the business concurrency sampled at 1-second intervals fluctuates frequently, and the trend does not change significantly, so it is impossible to mine the changing features, and over-sampling makes the model more computationally expensive and slower to train; therefore , in this embodiment, the average value is taken as an interval of 15 seconds to sort out the data, and other data of the virtual machine is also at an interval of 15 seconds;

将周期波动型、周期上升型和周期下降型业务并发量归为具有周期变化类；Classify the cyclically fluctuating, cyclically rising, and cyclically declining business concurrency as categories with cyclical changes;

其中，

代表样本x中第i’个业务并发量的特征，y_i’代表分割前的第i’个序列样本，

代表分割后的第i’个子序列样本的拟合结果；in,

represents the feature of the i'th business concurrency in the sample x, y _i' represents the i'th sequence sample before splitting,

步骤3.1.4：从下而上重新遍历样本数据，对所有业务并发量数列检查每个分割点，判断分割之前与分割之后并发量数列的拟合误差，Step 3.1.4: Retraverse the sample data from bottom to top, check each split point for all business concurrency series, and judge the fitting error of the concurrency series before and after split,

步骤3.2：采用傅里叶级数FS和分类回归树CART拟合具有周期变化的业务并发量；Step 3.2: Use Fourier series FS and classification and regression tree CART to fit business concurrency with periodic changes;

步骤3.2.3：利用分类回归树CART预测{t_n+1，t_n+2，...，t_m’}时刻的业务并发量为{y(n+1)，y(n+2)，...，y(m’)；Step 3.2.3: Use the classification and regression tree CART to predict the business concurrency at {t _n+1 , t _n+2 , ..., t _m' } as {y(n+1), y(n+2) , ..., y(m');

步骤3.2.4：利用傅里叶级数FS拟合残差序列{e(0)，e(1)，...，e(n)}，刻画出业务并发量的周期趋势，求得{t_n’+1，t_n’+2，...，t_m’}时刻业务并发量的残差值{e(n’+1)，e(n′+2)，...，e(m’)}；Step 3.2.4: Use the Fourier series FS to fit the residual sequence {e(0), e(1), ..., e(n)}, characterize the cyclical trend of business concurrency, and obtain { Residual value of business concurrency at time t _n'+1 , t _n'+2 ,..., t _m' } {e(n'+1), e(n'+2),...,e (m')};

其中，a₀、a_j’和b_j’均为变量，P＝n’，

Indicates rounding down, t=1, 2,...n';

其中，w_i'为第j’个用于拟合残差的函数；Among them, wi _' is the j'th function used to fit the residual;

本实施例还提供了使用改进的1NN-DTW算法进行业务并发量的类型判断，并与改进前算法进行对比，验证改进后1NN-DTW的准确性，具体为：This embodiment also provides the use of the improved 1NN-DTW algorithm to determine the type of service concurrency, and compares it with the algorithm before the improvement to verify the accuracy of the improved 1NN-DTW, specifically:

首先使用LoadRunner对服务端应用的浏览、查询、退票等各类业务的访问行为进行记录。然后对服务端虚拟机持续加压一小时并采集业务并发量，按照预处理的方法处理缺失和异常的业务并发量值，并以15秒为间隔调整并发量数据。First, use LoadRunner to record the access behavior of various services such as browsing, querying, and refunding of the server application. Then pressurize the server virtual machine for one hour and collect business concurrency, process missing and abnormal business concurrency values according to the preprocessing method, and adjust the concurrency data at 15-second intervals.

利用改进的1NN-DTW算法判断业务并发访问量类型，并与1NN-DTW、1NN-DDTW对比，采用正确率Accuracy和F值F-measure来衡量各算法的好坏。将第一步得到的并发量分别按每80、120、160、200个采样点截取为一个子序列，并根据表1中列举的七种负载变化趋势打上类型标签作为一个样本序列，最后得到700个样本序列，选取其中420个作为类型判断的已知样本，剩下的280个作为测试样本。The improved 1NN-DTW algorithm is used to judge the type of concurrent traffic of the business, and compared with 1NN-DTW and 1NN-DDTW, the accuracy rate and the F value F-measure are used to measure the quality of each algorithm. The concurrency obtained in the first step is intercepted as a subsequence at every 80, 120, 160, and 200 sampling points, and the type labels are marked as a sample sequence according to the seven load variation trends listed in Table 1, and finally 700 420 sample sequences are selected as known samples for type judgment, and the remaining 280 samples are used as test samples.

表1不同类型的业务并发访问量Table 1 Concurrent access to different types of services

采用本发明的改进的1NN-DTW和现有的1NN-DTW、1NN-DDTW这三种方法对业务并发量类型判断的对比结果如表2所示。从表2可以看出，本发明方法的Accuracy、F-measure明显高于另外两种方法，说明在判断业务并发量类型时，从业务并发量的取值和变化趋势两方面考虑效果要优于只关注其中一个方面。另外，虽然本发明方法同时计算相似点的欧式距离和导数差，但是所用时间并未大幅度增加。Table 2 shows the comparison results of the service concurrency type judgment using the improved 1NN-DTW of the present invention and the existing 1NN-DTW and 1NN-DDTW methods. It can be seen from Table 2 that the Accuracy and F-measure of the method of the present invention are significantly higher than those of the other two methods, indicating that when judging the type of business concurrency, the effect is better than the value and changing trend of the business concurrency. Just focus on one of them. In addition, although the method of the present invention calculates the Euclidean distance and the derivative difference of the similar points at the same time, the time used is not greatly increased.

表2不同方法的业务并发量分类情况Table 2 Classification of business concurrency by different methods

方法method AccuracyAccuracy F-measureF-measure Time(ms)Time(ms) 改进的1NN-DTWImproved 1NN-DTW 0.9420.942 0.8670.867 11201120 1NN-DTW1NN-DTW 0.8730.873 0.7510.751 984984 1NN-DDTW1NN-DDTW 0.9160.916 0.8340.834 10971097

本实施例还提供了使用本发明方法预测业务并发量，并与传统的ARIMA等方法进行对比，具体为：This embodiment also provides using the method of the present invention to predict the service concurrency, and compares it with traditional methods such as ARIMA, specifically:

首先使用LoadRunner对服务端应用的浏览、查询、退票等各类业务的访问行为进行记录。然后对服务端虚拟机持续加压一小时并采集业务并发量，按照预处理叙述的方法处理缺失和异常的业务并发量值，并以15秒为间隔调整并发量数据。First, use LoadRunner to record the access behavior of various services such as browsing, querying, and refunding of the server application. Then pressurize the server virtual machine for one hour and collect the business concurrency, process the missing and abnormal business concurrency values according to the method described in the preprocessing, and adjust the concurrency data at 15-second intervals.

选取二次型和周期上升型两类相对复杂的并发量进行预测。通过分析过去25分钟的业务并发量值，估计未来5分钟的业务并发量，并选取均方误差MSE、绝对误差MAE、用时Time三项评价标准，借助Python工具包将本方法与ARIMA、指数平滑Holt-Winters对比，验证本发明方法的准确性。Two types of relatively complex concurrency, quadratic type and periodic rising type, are selected for prediction. By analyzing the business concurrency value in the past 25 minutes, the business concurrency amount in the next 5 minutes is estimated, and three evaluation criteria of mean square error (MSE), absolute error (MAE), and time duration are selected. With the help of Python toolkit, this method is combined with ARIMA and exponential smoothing. Holt-Winters comparison verifies the accuracy of the method of the present invention.

三种方法的业务并发量预测结果如表3所示，三种方法的业务并发量预测结果与真实并发量之间的对照结果如图3和图4所示。从图中来看，在设定的两种情况下本发明方法与ARIMA、Holt-Winters相比，对真实的业务并发量序列拟合更好，说明本发明方法在对各类并发量预测时较为有效。根据表3中结果进一步分析，本发明方法与ARIMA、Holt-Winters相比，在两种类型的并发量场景下MSE和MAE最低。在二次型并发量场景下三种方法的MSE和MAE较为接近，但是在周期上升型并发量场景下本方法明显更优，ARIMA、Holt-Winters对这类复杂的并发量学习能力较差，这些表明在各种场景下本发明方法都具有可观的准确度。The business concurrency prediction results of the three methods are shown in Table 3, and the comparison results between the business concurrency prediction results of the three methods and the actual concurrency amount are shown in Figures 3 and 4. It can be seen from the figure that, compared with ARIMA and Holt-Winters, the method of the present invention fits the real business concurrency sequence better under the two set conditions, which shows that the method of the present invention can predict the concurrency of various types. more effective. According to further analysis of the results in Table 3, compared with ARIMA and Holt-Winters, the method of the present invention has the lowest MSE and MAE under the two types of concurrency scenarios. In the quadratic concurrency scenario, the MSE and MAE of the three methods are relatively close, but in the cyclically rising concurrency scenario, this method is significantly better. ARIMA and Holt-Winters have poor learning ability for such complex concurrency. These show that the method of the present invention has considerable accuracy in various scenarios.

表3不同方法的业务并发量预测结果Table 3 Business concurrency prediction results of different methods

最后应说明的是：以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述实施例所记载的技术方案进行修改，或者对其中部分或者全部技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明权利要求所限定的范围。Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present invention, but not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand: it can still be Modifications are made to the technical solutions described in the foregoing embodiments, or some or all of the technical features thereof are equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope defined by the claims of the present invention.

Claims

1. a kind of business concurrency prediction method for different types of virtual machines, is characterized in that: comprise the following steps:

Step 1: Collect historical business concurrency of virtual machines and preprocess them. The specific methods are as follows:

Step 1.1: Scan the business concurrency of virtual machines for a period of time to find the missing points of business concurrency;

Step 1.2: Process the scanned missing points of business concurrency;

Step 1.3: Perform outlier adjustment for the extremely small samples with abnormal fluctuations in the collected business concurrency;

Step 1.4: Adjust the data interval for the business concurrency and CPU utilization data collected from the log database or management log, and merge the collected data in seconds, minutes or hours;

Step 1.5: Use the maximum and minimum normalization method to normalize the data processed in step 1.4;

Step 2: Determine the type of virtual machine business concurrency based on the improved 1-nearest neighbor-dynamic time adjustment method 1NN-DTW. The specific method is:

Step 2.1: Classify the concurrency of each service of the virtual machine into rising, falling, quadratic, random, cyclically fluctuating, cyclically rising and cyclically falling;

Step 2.2: For various types of business concurrency, select the labeled business concurrency sequence as a known sample in advance;

Step 2.3: For each sequence of business concurrency to be classified, scan all known samples in turn and calculate the closest known sample through the proximity algorithm, then the type of the known sample is the type of business concurrency to be classified ;

Step 2.4: Classify all business concurrency into two categories to simplify the 1-nearest neighbor model;

Classify random, rising, falling, and quadratic business concurrency into categories that do not have periodic changes;

Classify the cyclically fluctuating, cyclically rising, and cyclically declining business concurrency as categories with cyclical changes;

Step 2.5: Construct an n×m matrix so that the sequence of business concurrency to be classified {x ₁ ,x ₂ ,…,x _n } and a known sequence of business concurrency {y ₁ ,y ₂ ,…,y _m } Alignment, where n is the total number of business concurrency to be classified, and m is the known total number of business concurrency;

Step 2.6: Take the two-point deviation of the i-th business concurrency x _i to be classified and the known j-th business concurrency amount y _j as the value d _{i, j} of the (i, j) position in the matrix, and use the Euclidean distance at the same time And the method of the square of the difference between the two-point derivatives to calculate the sequence of business concurrency to be classified {x ₁ ,x ₂ ,…,x _n } and the known business concurrency sequence {y ₁ ,y ₂ ,…,y _m } The deviation d _i,j of each point after alignment is shown in the following formula:

d _i,j =(x _i -y _j ) ² +(x _i ′-y _j ′) ² (1)

Among them, x _i ′ and y _j ′ are the derivatives of x _i and y _j respectively, and the estimation of the derivative x _i ′ of the service concurrency x _i is shown in the following formula:

Step 2.7: Starting from position (1,1) in the matrix, iteratively find a path with the smallest cumulative deviation according to the constraint that each position can only reach the position above, right or upper right except for the boundary value, until the end of position (n,m);

Step 3: Predict the business concurrency of different types of changes in the virtual machine. The specific methods are:

Step 3.1: Use the classification and regression tree CART to fit the business concurrency without periodic changes;

Step 3.2: Use Fourier series FS and classification and regression tree CART to fit the business concurrency with periodic variation.

2. a kind of business concurrency prediction method for different types of virtual machines according to claim 1, is characterized in that: the concrete method of described step 1.2 is:

Step 1.2.1: For the case of missing individual sampling points, use the average value of the business concurrency of the previous cycle and the next cycle to fill in. The calculation formula for the missing business concurrency con(t) of the virtual machine in the t-th time period is as follows: shown:

Step 1.2.2: For the case where the sample missing reaches more than 90%, discard all samples and set the value of the business concurrency in the scanning period to zero.

3. a kind of business concurrency prediction method for different types of virtual machines according to claim 1, is characterized in that: the concrete method of described step 1.3 is:

Step 1.3.1: Combine the quartiles to calculate the upper limit H and lower limit L of the normal value of the virtual machine business concurrency in time t, as shown in the following formula:

H=Q3+k*(Q3-Q1) (4)

L=Q1-k*(Q3-Q1) (5)

Among them, Q1 represents the lower quartile, that is, the 25th percentile of the ascending sequence of business concurrency in time t, and Q3 represents the upper quartile, that is, the percent of the ascending sequence of business concurrency in time t Seventy-five points, k is used to describe the abnormal degree of unreasonable sampling points, generally 1.5 and 3, representing moderate and extreme respectively;

Step 1.3.2: Determine whether the data of each sampling point is normal through the Tukey test method, and adjust the abnormal values;

If the data value of the sampling point is determined to be a sample of wrong business concurrency, the wrong value will be discarded first, and then supplemented by the mean value filling method;

If the data value of the sampling point is determined to be a sample of normal business concurrency, no adjustment will be made.

4. a kind of business concurrency prediction method for different types of virtual machines according to claim 1, is characterized in that: the concrete method of described step 3.1 is:

Step 3.1.1: Traverse the arbitrary value f of each feature F of the sample business concurrency sequence, divide the sample data with (F, f) as the condition, determine the segmentation position with the smallest square error, and select the highest value from the business concurrency sequence. good cutting point;

The calculation formula of the squared error error is as follows:

in,

Step 3.1.2: Save the business concurrency value as the cutting point, and perform segmentation on the business concurrency sequence;

Step 3.1.3: Construct subtrees with feature F greater than f and subtrees smaller than f in turn, and further iteratively segment and fit the sequence of business concurrency on the left and right of the current split point until it can no longer be divided into leaf nodes;

Step 3.1.4: Retraverse the sample data from bottom to top, check each split point for all business concurrency series, and judge the fitting error of the concurrency series before and after split,

If the fitting error of the concurrency sequence decreases after the split, the split point is retained;

If the fitting error of the concurrent series increases after the split, cancel the split point and merge the left and right series.

5. a kind of business concurrency prediction method for different types of virtual machines according to claim 4, is characterized in that: the concrete method of described step 3.2 is:

Step 3.2.1: Use the classification and regression tree CART to fit the business concurrency at {t ₁ ,t ₂ ,...,t _n' } to obtain the fitted value {y(0),...y(n'-1),y (n')}, depicting the rising or falling trend of business concurrency;

Step 3.2.2: Compare the business concurrency obtained in step 3.2.1 with the real business concurrency to obtain the residual sequence {e(0),e(1),...,e(n)};

Step 3.2.3: Use the classification and regression tree CART to predict the business concurrency at the moment {t _n+1 ,t _n+2 ,...,t _m' } as {y(n+1), y(n+2),... ,y(m');

Step 3.2.4: Use Fourier series FS to fit the residual sequence {e(0),e(1),...,e(n)}, characterize the periodic trend of business concurrency, and obtain {t _{n '+1} ,t _n'+2 ,...,t _m' } Residual value of business concurrency at moment {e(n'+1),e(n'+2),...,e(m')};

Step 3.2.5: Add the business concurrency at {t _n'+1 ,t _n'+2 ,...,t _m' } to its corresponding residual value to obtain {t _n'+1 ,t _{n' +2} ,...,t _m' } Predicted value of business concurrency at the moment, namely {y(n'+1)+e(n'+1), y(n'+2)+e(n'+2) ,…,y(m')+e(m')}.

6. a kind of business concurrency prediction method for different types of virtual machines according to claim 5, is characterized in that: the concrete method of described step 3.2.4 is:

Step 3.2.4.1: Use the function w(t) to fit the residual sequence e(0), e(1),...,e(n'), the function w(t) is as follows:

Among them, a ₀ , a _j' and b _j' are all variables, P=n',

Indicates rounding down, t=1,2,...n';

Step 3.2.4.2: Calculate the values of variables a _j' and b _j' by the least squares method, as shown in the following formulas:

where w _j' is the j'th function used to fit the residuals.