CN112241354B

CN112241354B - Application-oriented transaction load generation system and transaction load generation method

Info

Publication number: CN112241354B
Application number: CN201910800259.7A
Authority: CN
Inventors: 张蓉; 李宇明; 张舒燕; 舒科
Original assignee: Pingkai Star Beijing Technology Co ltd; East China Normal University
Current assignee: Pingkai Star Beijing Technology Co ltd; East China Normal University
Priority date: 2019-08-28
Filing date: 2019-08-28
Publication date: 2022-07-29
Anticipated expiration: 2039-08-28
Also published as: CN112241354A

Abstract

The invention proposes an application-oriented synthetic workload generation system that captures the workload characteristics of real online transaction processing (OLTP) applications and then generates a synthetic workload whose performance metrics are highly similar to those of the actual application workload, while ensuring information concealment, tool scalability and workload scalability. The invention proposes: a new method for describing the workload characteristics of OLTP applications, which characterizes them along two dimensions, transaction logic and data access distribution; a method for extracting transaction logic and data access distribution from real application workload traces while ensuring the concealment of application information; a method that, from the perspective of controlling transaction conflicts and the proportion of distributed transactions, preserves implicit transaction logic by controlling the dependencies among operation parameters; and three aspects for characterizing the data access distribution, namely skewness, dynamics and continuity. It implements the first application-oriented OLTP workload generator, ensuring the authenticity and scalability of the generated workload in performance evaluation.

Description

An application-oriented transaction load generation system and transaction load generation method

Technical Field

The invention relates to the technical field of transaction workload generation, and in particular to an application-oriented transaction load generation system and a transaction load generation method.

Background

Many benchmarks exist for evaluating database performance in different application areas. In the OLAP domain, commonly used benchmarks are TPC-H, TPC-DS and SSB [14], which have predefined standard database schemas and test queries. The TPC-C [1], TPC-E and SmallBank [10] benchmarks evaluate the transaction processing capabilities of database systems. CH-benCHmark [15] and HTAPBench [16] provide a unified evaluation for hybrid transactional/analytical processing (HTAP) systems. In addition, YCSB [11] is commonly used to measure the throughput of cloud serving systems; its workload is simple but requires high scalability. However, the evaluation workloads of these standard benchmarks are abstractions of a class of applications, so they are too general to evaluate database performance for a specific application.

Workload trace replay is one way to obtain a workload similar to that of the target application. Microsoft SQL Server provides two tools, SQL Server Profiler [17] and SQL Server Distributed Replay [2], for reproducing production workloads from SQL traces. Oracle Database Replay [3], [18] enables users to record workload traces on a production system with minimal performance impact and then replay a complete workload with the same concurrency and workload characteristics as the actual workload. Because of data privacy concerns, workload replay is difficult to apply in practical database performance evaluation, since it requires a real database state and the original workload traces [3]. In addition, workload scaling (for example, scaling concurrency) is difficult for current replay techniques.

Workload simulation is therefore both necessary and urgent. Workload-aware data and query generators [4], [5], [19], [20] are used to evaluate database performance for OLAP applications. Their inputs generally include the database schema, basic data characteristics, and size specifications for the intermediate results of query trees. Their output is a synthetic database instance and instantiated test queries that conform to the specified data and workload characteristics. For OLAP applications there is also database scaling work [6], [7], which scales a given database instance up or down to support application-specific database benchmarking. Workload analyzers [8], [9] aim to study and better understand application workloads, but neither can generate synthetic workloads. There are also workload generators [12], [13] designed for database performance benchmarking. Jeong et al. [12] proposed a workload generator that can simulate the resource consumption state of real hardware. NoWog [13] introduced a workload description language for generating synthetic workloads to test NoSQL databases. None of these works can simulate the varied workloads of real OLTP applications to enable application-oriented database performance evaluation.

When the query performance of a database management system (DBMS) is evaluated, a synthetic workload is usually executed on the DBMS and the system's throughput and response time are observed, which makes the synthetic workload critical to DBMS performance evaluation. When DBMS performance is evaluated for a specific application, the similarity between the synthetic workload and the real workload directly determines whether the evaluation result is credible. However, the workloads currently used for performance evaluation rarely share the workload characteristics of the target application, which makes the evaluation results inaccurate. To solve this problem, the present invention designs a synthetic workload generation method for transactional applications, which captures the workload characteristics of real online transaction processing (OLTP) applications and then generates a synthetic workload whose performance metrics are highly similar to those of the actual application workload.

Summary of the Invention

To overcome the shortcomings described in the background, the present invention proposes an application-oriented transaction load generation system comprising a database generation module and a workload generation module, wherein:

the database generation module generates a test database from the database schema and data characteristics it acquires;

the workload generation module generates a synthetic workload similar to the real workload by analyzing the workload traces of the real application; the workload generation module includes:

a transaction logic analyzer, which extracts transaction logic information by analyzing the complete workload trace over a short period of time;

a data access distribution analyzer, which extracts data access distribution information and throughput information by analyzing a partial workload trace over a longer period of time;

a workload generator, which uses the previously extracted transaction logic information, data access distribution information and throughput information to instantiate the parameter values in the transaction templates and generate the synthetic workload.

In the application-oriented transaction load generation system proposed by the present invention, a data characteristics extractor obtains the data characteristics automatically using SQL queries.

In the application-oriented transaction load generation system proposed by the present invention, the workload generator can be configured with the number of test nodes and the number of test threads on each node to simulate concurrency; a separate database connection is established for each test thread.

In the application-oriented transaction load generation system proposed by the present invention, when executing a transaction the workload generator uses the structural information of the transaction logic to determine whether the operations in a branch structure need to be executed and how many times the operations in a loop structure are executed. To execute an SQL operation, it first instantiates all parameters one by one and then sends the operation with concrete parameter values to the test database. After the operation is executed, the result set and the parameters are saved as intermediate state, so that other parameters can be generated in subsequent operations within the same transaction instance.
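The way structural information can drive execution may be illustrated with a minimal Python sketch. It is not the patented implementation: the tuple-based template representation, the function name `plan_operations`, the rounding of the average loop count, and the placeholder SQL strings are all assumptions made for illustration.

```python
import random

# Hypothetical template representation (an assumption, not from the patent):
# each step is ("op", sql), ("branch", probability, [steps])
# or ("loop", average_count, [steps]).
def plan_operations(steps, rng):
    """Expand a transaction template into the flat list of SQL operations to run."""
    planned = []
    for step in steps:
        kind = step[0]
        if kind == "op":
            planned.append(step[1])
        elif kind == "branch":
            _, probability, body = step
            # Run the branch body with the execution probability taken from the trace.
            if rng.random() < probability:
                planned.extend(plan_operations(body, rng))
        elif kind == "loop":
            _, average_count, body = step
            # Assumption: repeat the body round(average_count) times.
            for _ in range(round(average_count)):
                planned.extend(plan_operations(body, rng))
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return planned

template = [
    ("op", "SELECT ... FROM customer WHERE c_id = ?"),
    ("branch", 0.0, [("op", "UPDATE ... SET flag = ?")]),
    ("loop", 3.0, [("op", "INSERT INTO order_line ...")]),
]
ops = plan_operations(template, random.Random(7))
```

In a full generator each planned operation would then have its parameters instantiated and be sent to the test database, with the result set kept as intermediate state for later operations of the same transaction instance.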

In the application-oriented transaction load generation system proposed by the present invention, the workload generator instantiates a parameter as follows. If the parameter has only dependency $1, its value can be computed directly from the increment Δ and the associated smaller parameter. If it has only dependency $2, the generator first tries to instantiate the parameter by randomly selecting a dependency item according to the probabilities of the dependency items; when no dependency item is selected, it instantiates the parameter from the data access distribution. If dependencies $2 and $3 both exist, the corresponding operation must be inside a loop structure: on the first execution of the loop, the parameter is instantiated from dependency $2 and the data access distribution; on subsequent executions, the generator first tries, according to probability, to instantiate the parameter using dependency $3, and if no dependency item is selected, it falls back to dependency $2 and the data access distribution.
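The fallback order among the three dependency types can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary keys `dep1`, `dep2` and `dep3`, the representation of a dependency item as a `(probability, compute)` pair, and the `sample_distribution` callback standing in for the data access distribution are all hypothetical.

```python
import random

def pick_dependency(items, state, rng):
    """Try to select one dependency item according to its probability.

    items: list of (probability, compute) pairs whose probabilities sum to <= 1.
    Returns (True, value) if an item is selected, otherwise (False, None).
    """
    roll = rng.random()
    cumulative = 0.0
    for probability, compute in items:
        cumulative += probability
        if roll < cumulative:
            return True, compute(state)
    return False, None

def instantiate(param, state, rng, sample_distribution, loop_iteration=0):
    """Instantiate one parameter following the fallback order described in the text."""
    if "dep1" in param:
        # Dependency $1: value = associated smaller parameter + increment delta.
        smaller_name, delta = param["dep1"]
        return state[smaller_name] + delta
    if loop_iteration > 0 and "dep3" in param:
        # Dependency $3 is tried only on repeated executions inside a loop.
        selected, value = pick_dependency(param["dep3"], state, rng)
        if selected:
            return value
    selected, value = pick_dependency(param.get("dep2", []), state, rng)
    if selected:
        return value
    # No dependency item selected: fall back to the data access distribution.
    return sample_distribution(rng)

state = {"p1": 10, "prev": 5}
rng = random.Random(0)
upper = {"dep1": ("p1", 4)}
same = {"dep2": [(1.0, lambda s: s["p1"])]}
free = {"dep2": []}
looped = {"dep2": [], "dep3": [(1.0, lambda s: s["prev"] + 1)]}
```

Here `state` plays the role of the intermediate state (earlier parameters and result sets) saved within one transaction instance.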

Based on the above system, the present invention also proposes an application-oriented transaction load generation method, which includes the following steps:

Step A: generate a test database from the acquired database schema and data characteristics;

Step B: generate a synthetic workload similar to the real workload by analyzing the workload traces of the real application, including:

Step B1: extract transaction logic information by analyzing the complete workload trace over a short period of time;

Step B2: extract data access distribution information and throughput information by analyzing a partial workload trace over a longer period of time;

Step B3: use the previously extracted transaction logic information, data access distribution information and throughput information to instantiate the parameter values in the transaction templates and generate the synthetic workload.

In the application-oriented transaction load generation method proposed by the present invention, the test database in step A consists of multiple tables that satisfy the primary key and foreign key constraints and the data characteristics of the non-key attributes; step A specifically includes the following steps:

Step A1: generate primary keys sequentially;

Step A2: generate each foreign key randomly within the value domain of the primary key it references;

Step A3: generate the values of non-key attributes with a random attribute generator, which consists of a random index generator and an index-value converter, so that the expected data characteristics are satisfied.
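Steps A1 and A2 can be sketched in a few lines of Python. The function names and the single-attribute integer keys over [1, s] are assumptions for illustration; composite keys are not covered.

```python
import random

def generate_primary_keys(table_size):
    """Step A1: primary keys are generated sequentially over [1, s]."""
    return list(range(1, table_size + 1))

def generate_foreign_keys(row_count, referenced_table_size, rng):
    """Step A2: each foreign key is drawn at random from the referenced key's domain."""
    return [rng.randint(1, referenced_table_size) for _ in range(row_count)]

rng = random.Random(42)
customers = generate_primary_keys(5)
orders = generate_primary_keys(8)
order_customer_fk = generate_foreign_keys(len(orders), len(customers), rng)
```

Sequential generation makes the primary keys unique by construction, and drawing foreign keys from the referenced domain keeps every reference valid.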

In the application-oriented transaction load generation method proposed by the present invention, the value domains are determined before the key values are generated. In the first step, if a primary key consists of a single attribute, its value domain is [1, s], where s is the table size. In the second step, the value domain of a foreign key attribute is determined by the primary key it references. In the third step, the value domain of the non-foreign-key attribute in a composite primary key is handled: when the composite primary key contains only one non-foreign-key attribute, the value domain of that attribute is given by an expression that is not reproduced in this text, in which d_fk is the value domain of one of the foreign key attributes in the composite primary key. If cascading references are involved, the second and third steps are performed multiple times.

In the application-oriented transaction load generation method proposed by the present invention, the output of the random index generator is an integer from 1 to n, where n is the cardinality of the attribute. Given an index, the index-value converter deterministically maps it to a value in the attribute's value domain. Different converters are used depending on the attribute's data type. For numeric types, a linear function maps the indexes uniformly onto the value domain. For string types, a set of seed strings that satisfy the length requirement is generated randomly in advance; the converter first selects a seed string based on the input index and then concatenates the index and the selected seed string as the output value.
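A minimal Python sketch of the random index generator and the two index-value converters is given below. The function names, the inclusive endpoints of the linear map, the lowercase-letter seed alphabet and the modulo rule for choosing a seed are assumptions; the text only states that the mapping is deterministic, linear for numeric types, and seed-based for strings.

```python
import random
import string

def random_index(cardinality, rng):
    """Random index generator: an integer in 1..n, n being the attribute cardinality."""
    return rng.randint(1, cardinality)

def numeric_converter(index, cardinality, low, high):
    """Linear map from index 1..n onto the value domain [low, high]."""
    if cardinality == 1:
        return low
    return low + (index - 1) * (high - low) / (cardinality - 1)

def make_string_converter(seed_count, length, rng):
    """Build a deterministic index-to-string converter from random seed strings."""
    seeds = [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        for _ in range(seed_count)
    ]

    def convert(index):
        # Pick a seed by index, then concatenate index and seed. Seeds contain
        # letters only, so distinct indexes always yield distinct values.
        return f"{index}{seeds[index % seed_count]}"

    return convert

rng = random.Random(1)
convert = make_string_converter(seed_count=4, length=6, rng=rng)
sample_index = random_index(10, rng)
```

Because the converters are deterministic, the same index always yields the same attribute value, so the number of distinct generated values is bounded by the cardinality n.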

In the application-oriented transaction load generation method proposed by the present invention, the transaction logic extraction algorithm in step B1 includes:

Step B11: traverse the workload trace to count the number of executions of each operation in the transaction template, and from these counts compute the execution probability of each branch and the average number of executions of each loop operation;

Step B12: identify all parameter pairs ⟨p_{i,l}, p_{i,j}⟩ in the transaction template that satisfy BR, then traverse the workload trace to obtain the average increment Δ, and construct dependency $1 for p_{i,j};

Step B13: for each parameter p_{i,j} in a transaction template, traverse every parameter p_{m,n} that precedes it and count the transactions in which the parameter pair satisfies ER; similarly, traverse the preceding result sets r_{x,y} and count the transactions that satisfy ER and IR respectively;

Step B14: randomly select N groups of transaction instances from the K transaction instances, two instances per group, and compute the LR coefficients (a, b) of the parameter pairs for each group;

Step B15: use the statistics obtained in steps B13 and B14 to construct dependency $2 for each parameter p_{i,j};

Step B16: the same operation in a loop structure is executed multiple times, and dependency $3 describes how its parameters change; traverse the workload trace to compute the changes in parameter values across consecutive executions of the loop operation, compute the coefficients (a, b) as in step B14, and then construct dependency $3 from the statistics.
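The coefficient computation of step B14 can be sketched as below, assuming that LR denotes a linear relation y = a·x + b between two integer parameters. That reading, the function names, and the choice to keep the most frequent coefficient pair together with its support are assumptions for illustration, not details confirmed by the text.

```python
import random
from collections import Counter
from fractions import Fraction

def linear_coefficients(first, second):
    """Solve y = a*x + b from two (x, y) observations; None if the x values coincide."""
    (x1, y1), (x2, y2) = first, second
    if x1 == x2:
        return None
    a = Fraction(y2 - y1, x2 - x1)  # exact arithmetic, integer inputs assumed
    b = y1 - a * x1
    return a, b

def estimate_linear_relation(instances, group_count, rng):
    """Sample groups of two transaction instances and tally the implied coefficients.

    instances: list of (x, y) integer values of one parameter pair, one per instance.
    Returns the most frequent (a, b) and the fraction of sampled groups supporting it.
    """
    tally = Counter()
    for _ in range(group_count):
        first, second = rng.sample(instances, 2)
        coefficients = linear_coefficients(first, second)
        if coefficients is not None:
            tally[coefficients] += 1
    if not tally:
        return None, 0.0
    best, hits = tally.most_common(1)[0]
    return best, hits / group_count

instances = [(x, 2 * x + 3) for x in range(1, 21)]
best, support = estimate_linear_relation(instances, 50, random.Random(3))
```

Sampling N groups rather than examining every pair among the K instances keeps the cost linear in N, and the support value could serve as the probability attached to the resulting dependency item.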

In the application-oriented transaction load generation method proposed by the present invention, step B specifically includes:

Step B21: generate all high-frequency items that satisfy the expected repetition rate;

Step B22: traverse all parameters in the previous time window and select repeated parameters for each interval until the parameter repetition rate of the interval is satisfied;

Step B23: derive the index of a parameter from its value to identify the interval it belongs to in the current time window; if the index is not within the index domain of the current time window, ignore the parameter;

Step B24: generate random parameters and add them to each interval until the cardinality requirement is met; based on the candidate parameters, instantiate parameters using the parameter generation mechanism; within a determined interval, simply select one candidate parameter at random as the output.
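Steps B22 to B24 can be approximated by the following simplified Python sketch. It works directly on parameter indexes, omits the high-frequency items of step B21, and uses a hypothetical interval description with keys `low`, `high`, `cardinality` and `repeat_ratio`; none of these names come from the text.

```python
import random

def build_candidates(previous_params, intervals, index_domain, rng):
    """Build candidate parameter indexes per interval for the current time window.

    previous_params: parameter indexes seen in the previous time window.
    intervals: dicts with hypothetical keys "low"/"high" (inclusive index range),
        "cardinality" (distinct parameters wanted) and "repeat_ratio" (share of
        them that should recur from the previous window).
    index_domain: (low, high) index domain of the current time window.
    """
    candidates = [set() for _ in intervals]
    for index in previous_params:
        if not index_domain[0] <= index <= index_domain[1]:
            continue  # B23: outside the current index domain, ignore
        for slot, interval in enumerate(intervals):
            if interval["low"] <= index <= interval["high"]:
                quota = round(interval["cardinality"] * interval["repeat_ratio"])
                if len(candidates[slot]) < quota:
                    candidates[slot].add(index)  # B22: reuse a previous parameter
                break
    seen = set(previous_params)
    for slot, interval in enumerate(intervals):
        fresh = [i for i in range(interval["low"], interval["high"] + 1) if i not in seen]
        rng.shuffle(fresh)
        missing = max(interval["cardinality"] - len(candidates[slot]), 0)
        candidates[slot].update(fresh[:missing])  # B24: pad with new parameters
    return candidates

previous = [1, 2, 3, 4, 11, 12, 50]
intervals = [
    {"low": 1, "high": 10, "cardinality": 4, "repeat_ratio": 0.5},
    {"low": 11, "high": 20, "cardinality": 4, "repeat_ratio": 0.25},
]
rng = random.Random(5)
candidates = build_candidates(previous, intervals, (1, 20), rng)
chosen = rng.choice(sorted(candidates[0]))  # instantiate within a chosen interval
```

Padding only with indexes that did not occur in the previous window keeps the repetition ratio of each interval at its target instead of exceeding it.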

The invention discloses an application-oriented synthetic workload generation system and method that capture the workload characteristics of real online transaction processing (OLTP) applications and then generate a synthetic workload whose performance metrics are highly similar to those of the actual application workload, while ensuring information concealment, tool scalability and workload scalability. The main innovations are:

1. A new method for describing the workload characteristics of online transaction processing (OLTP) applications is proposed: OLTP workload characteristics are characterized along two dimensions, transaction logic and data access distribution.

2. A method is proposed for extracting transaction logic and data access distribution from real application workload traces while ensuring the concealment of application information.

3. From the perspective of controlling transaction conflicts and the proportion of distributed transactions, a method is proposed that preserves implicit transaction logic by controlling the dependencies among operation parameters.

4. Three aspects for characterizing the data access distribution are designed and implemented: skewness, dynamics and continuity.

5. The first application-oriented OLTP workload generator is implemented, ensuring the authenticity and scalability of the generated workload in performance evaluation.

References

[1] TPC-C benchmark, http://www.tpc.org/tpcc/.

[2] SQL Server Distributed Replay, https://docs.microsoft.com/enus/sql/tools/distributed-replay/sql-server-distributed-replay?view=sqlserver-2017.

[3] L. Galanis, S. Buranawatanachoke, R. Colle, B. Dageville, K. Dias, J. Klein, S. Papadomanolakis, L. L. Tan, V. Venkataramani, Y. Wang, et al., “Oracle database replay,” in SIGMOD, 2008, pp. 1159–1170.

[4] E. Lo, N. Cheng, W. W. K. Lin, W. Hon, and B. Choi, “Mybenchmark: generating databases for query workloads,” in VLDBJ, 2014, pp. 895–913.

[5] Y. Li, R. Zhang, X. Yang, Z. Zhang, and A. Zhou, “Touchstone: Generating enormous query-aware test databases,” in USENIX ATC, 2018, pp. 575–586.

[6] Y. Tay, B. T. Dai, D. T. Wang, E. Y. Sun, Y. Lin, and Y. Lin, “Upsizer: Synthetically scaling an empirical relational database,” in Information Systems, 2013, pp. 1168–1183.

[7] J. Zhang and Y. Tay, “Dscaler: Synthetically scaling a given relational database,” in PVLDB, 2016, pp. 1671–1682.

[8] P. S. Yu, M.-S. Chen, H.-U. Heiss, and S. Lee, “On workload characterization of relational database environments,” in IEEE Transactions on Software Engineering, 1992, pp. 347–355.

[9] Q. T. Tran, K. Morfonios, and N. Polyzotis, “Oracle workload intelligence,” in SIGMOD, 2015, pp. 1669–1681.

[10] M. Alomari, M. Cahill, A. Fekete, and U. Rohm, “The cost of serializability on platforms that use snapshot isolation,” in ICDE, 2008, pp. 576–585.

[11] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with ycsb,” in SoCC, 2010, pp. 143–154.

[12] H. J. Jeong and S. H. Lee, “A workload generator for database system benchmarks,” in iiWAS, 2005, pp. 813–822.

[13] P. Ameri, N. Schlitter, J. Meyer, and A. Streit, “Nowog: a workload generator for database performance benchmarking,” in DASC/PiCom/DataCom/CyberSciTech, 2016, pp. 666–673.

[14] P. E. O’Neil, E. J. O’Neil, X. Chen, and S. Revilak, “The star schema benchmark and augmented fact table indexing,” in TPCTC, 2009, pp. 237–252.

[15] R. Cole, F. Funke, L. Giakoumakis, W. Guy, A. Kemper, S. Krompass, H. Kuno, R. Nambiar, T. Neumann, M. Poess, et al., “The mixed workload ch-benchmark,” in DBTest, 2011, p. 8.

[16] F. Coelho, J. Paulo, R. Vilaça, J. Pereira, and R. Oliveira, “Htapbench: Hybrid transactional and analytical processing benchmark,” in ICPE, 2017, pp. 293–304.

[17] SQL Server Profiler, https://docs.microsoft.com/en-us/sql/tools/sqlserver-profiler/sql-server-profiler?view=sql-server-2017.

[18] Y. Wang, S. Buranawatanachoke, R. Colle, K. Dias, L. Galanis, S. Papadomanolakis, and U. Shaft, “Real application testing with database replay,” in DBTest, 2009, p. 8.

[19] C. Binnig, D. Kossmann, E. Lo, and M. T. Özsu, “Qagen: generating query-aware test databases,” in SIGMOD, 2007, pp. 341–352.

[20] A. Arasu, R. Kaushik, and J. Li, “Data generation using declarative constraints,” in SIGMOD, 2011, pp. 685–696.

Description of the Drawings

Fig. 1 is the basic architecture diagram of the present invention.

Fig. 2 is a schematic diagram of determining the value domains of key attributes in the present invention.

Fig. 3 is an example of S-Dist in the present invention.

Fig. 4 is an example of parameter generation in the present invention.

Fig. 5 is an example of C-Dist in the present invention.

Figs. 6a-6d show the deviations of the performance metrics when the present invention simulates the TPC-C workload on a PostgreSQL database.

Fig. 7 examines S-Dist and D-Dist under skewed and dynamic workloads in the present invention.

Fig. 8 examines C-Dist under continuous workloads in the present invention.

Fig. 9 shows the performance of the transaction logic extraction algorithm of the present invention (K = N).

Fig. 10 shows the performance of the C-Dist extraction algorithm of the present invention.

Detailed Description

The invention is described in further detail below with reference to the following specific embodiments and the accompanying drawings. Except for the content specifically mentioned below, the processes, conditions, experimental methods and the like for implementing the present invention are general and common knowledge in the art, and the present invention imposes no particular limitation on them.

Application-oriented database performance evaluation has the following requirements:

Fidelity. The workload used for evaluation should be highly similar to the real application workload. The performance metrics obtained from the evaluation (such as throughput, latency and physical resource utilization) should be consistent with those of the real application. The similarity between the evaluation workload and the real application workload is measured by the deviation of the performance metrics: the smaller the deviation, the higher the similarity.

Concealment. Data privacy is a basic requirement of commercial applications, so the workload of a real application usually cannot be used directly for database performance evaluation.

Tool scalability. The target application may have a very large data scale and very high concurrency or throughput. The workload generation tool must therefore scale across multiple nodes and support concurrent database and workload generation.

Workload scalability. Sometimes the current application workload needs to be scaled to measure DBMS performance at the expected synthetic workload scale. Because the focus is on generating transactional workloads, the key concerns are query concurrency and query throughput.
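The fidelity requirement measures similarity by the deviation of performance metrics. The text does not give the exact formula, so the relative deviation below is an assumed definition, shown as a small Python sketch with made-up throughput numbers.

```python
def relative_deviation(real, synthetic):
    """Relative deviation of a synthetic-workload metric from the real one."""
    if real == 0:
        raise ValueError("real metric must be non-zero")
    return abs(synthetic - real) / abs(real)

# Hypothetical throughput values (transactions per second), for illustration only.
deviation = relative_deviation(real=1000.0, synthetic=950.0)
```

The same function applies to latency or resource utilization; a smaller value indicates a synthetic workload that is closer to the real one.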

基于这些需求，本发明将面向应用的事务负载生成问题形式化：Based on these requirements, the present invention formalizes the application-oriented transaction load generation problem:

面向应用的事务负载生成(Application-oriented synthetic workloadgeneration)：生成与目标应用高度相似的合成负载，同时保证保真性、隐蔽性、工具可扩展性和负载可拓展性。Application-oriented synthetic workload generation: Generate synthetic workloads that are highly similar to the target application while maintaining fidelity, stealth, tool scalability, and workload scalability.

基于上述数据生成问题的定义，本发明设计的基本架构如图1所示。为了解决数据隐私问题，本发明的方法将隔离生产环境和评测环境，从而使得数据的拥有者保护数据隐私。从功能的角度划分，本发明的实现可分为数据库生成模块(Database Generation)和负载生成模块(Workload Generation)。Based on the above definition of the data generation problem, the basic architecture designed by the present invention is shown in FIG. 1 . In order to solve the data privacy problem, the method of the present invention will isolate the production environment and the evaluation environment, so that the data owner can protect the data privacy. From the perspective of functions, the implementation of the present invention can be divided into a database generation module (Database Generation) and a load generation module (Workload Generation).

数据库生成模块(Database Generation)：数据库生成器的输入主要包含两部分：数据库模式和数据特征；输出是一个测试数据库(Test DB)。由于数据特征是单调冗长的并且需要从真实的数据库获取，本发明提供一个数据特征提取器(Data CharacteristicsExtractor)来帮助本发明使用简单的SQL查询自动化地获取这些信息。在设计中，本发明只关注测试数据库的一些基本数据特征(例如属性的值域)，因为合成负载的负载特征才是影响DBMS性能的关键因素。Database generation module (Database Generation): The input of the database generator mainly includes two parts: database schema and data characteristics; the output is a test database (Test DB). Since data characteristics are monotonous and verbose and need to be obtained from a real database, the present invention provides a data characteristic extractor (Data Characteristics Extractor) to help the present invention automatically obtain these information using simple SQL queries. In the design, the present invention only focuses on some basic data characteristics of the test database (such as the value range of attributes), because the load characteristics of the synthetic load are the key factors affecting the performance of the DBMS.

负载生成模块(Workload Generation)：负载生成模块由三个部分组成：事务逻辑分析器(Transaction Logic Analyzer)、数据访问分布分析器(Data AccessDistribution Analyzer)和负载生成器(Workload Generator)。事务逻辑分析器通过分析短时间段内的完全负载轨迹(包含每个SQL操作的所有参数和返回结果集)，提取事务逻辑信息。数据访问分布分析器通过分析较长时间段内的部分负载轨迹(只包含SQL操作的一些关键参数)，提取数据访问分布信息和吞吐信息。负载生成器利用这些信息将事务模板中的参数值实例化，生成合成负载。Load Generation Module (Workload Generation): The load generation module consists of three parts: Transaction Logic Analyzer, Data Access Distribution Analyzer and Load Generator. The transaction logic analyzer extracts transaction logic information by analyzing the full load trace (including all parameters and returned result sets for each SQL operation) over a short period of time. The data access distribution analyzer extracts data access distribution information and throughput information by analyzing partial load trajectories over a long period of time (including only some key parameters of SQL operations). The load generator uses this information to instantiate the parameter values in the transaction template to generate a synthetic load.

Database generation in the present invention:

The data characteristics required for database generation are the same as those used by Touchstone [5], for example table size, attribute value range, and attribute cardinality (the number of distinct values). Generating a test database amounts to generating multiple tables that satisfy the primary/foreign key constraints and the data characteristics of the non-key attributes.

Without loss of generality, the present invention assumes that primary keys and foreign keys are integers. A primary key is the identifier of a record and usually has no physical meaning, so its data characteristics need not be considered. Primary keys are simply generated in sequential order, and foreign keys are generated randomly within the value range of the primary key they reference. This guarantees the uniqueness of primary keys and the referential integrity of foreign keys. Before the key values are generated, three steps determine their value ranges. In the first step, if the primary key consists of a single attribute, its value range is [1, s], where s is the table size (① in Figure 3). In the second step, the value range of a foreign key attribute is determined by the primary key it references (② in Figure 3). In the third step, the present invention handles the value range of the non-foreign-key attributes in a composite primary key (③ in Figure 3). The most common and reasonable case is that the composite primary key contains exactly one non-foreign-key attribute, in which case the value range of this attribute is

[formula image not reproduced]

where d_fk is the value range of one of the foreign key attributes in the composite primary key. Other cases are handled similarly. If cascading references are involved, the second and third steps may need to be performed multiple times.
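The key generation rule described above (sequential primary keys, foreign keys drawn at random from the referenced primary key range) can be illustrated with a minimal Python sketch. The function names, the table names, and the table sizes are assumptions made for illustration only; the composite primary key case is not covered.

```python
import random


def generate_primary_keys(table_size):
    """Sequential primary keys covering the value range [1, s]."""
    return list(range(1, table_size + 1))


def generate_foreign_keys(row_count, referenced_table_size, rng):
    """Foreign keys drawn uniformly from the referenced primary key range."""
    return [rng.randint(1, referenced_table_size) for _ in range(row_count)]


rng = random.Random(7)  # fixed seed so the sketch is repeatable
customer_ids = generate_primary_keys(5)                 # hypothetical parent table, s = 5
order_customer_ids = generate_foreign_keys(12, 5, rng)  # hypothetical child table, 12 rows
```

Because every foreign key falls inside [1, s] of the referenced table, referential integrity holds without consulting the generated parent rows, which is what allows the tables to be generated independently.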

A random attribute generator [5], consisting of a random index generator and an index-to-value converter, generates the values of non-key attributes while satisfying the desired data characteristics, in particular the cardinality. The random index generator outputs an integer between 1 and n, where n is the cardinality of the attribute. Given an index, the index-to-value converter deterministically maps it to a value in the value range of the attribute. The present invention uses different index-to-value converters depending on the data type of the attribute. For numeric types such as integer, it simply uses a linear function that maps the indexes uniformly onto the value range of the attribute. For string types such as varchar, a number of seed strings that satisfy the length requirement are randomly generated in advance. The converter first selects a seed string according to the input index (for example, the (i % k)-th seed string, where i is the index and k is the total number of seed strings) and then concatenates the index with the selected seed string to form the output value.
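A minimal Python sketch of the two index-to-value converters follows. The function names and the seed alphabet are assumptions; the "#" separator between index and seed string follows the string example "296#dgtckuy" given later in this description.

```python
import random
import string


def numeric_converter(index, cardinality, low, high):
    """Linear map from an index in [1, n] onto the value range [low, high]."""
    if cardinality == 1:
        return low
    step = (high - low) / (cardinality - 1)
    return low + (index - 1) * step


def make_seed_strings(k, length, rng):
    """k random lowercase seed strings of the required length (assumed alphabet)."""
    return ["".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            for _ in range(k)]


def string_converter(index, seeds):
    """Pick the (i % k)-th seed string and concatenate it with the index."""
    return f"{index}#{seeds[index % len(seeds)]}"


seeds = make_seed_strings(k=4, length=7, rng=random.Random(0))
value = string_converter(296, seeds)
```

Both converters are deterministic, so the same index always yields the same attribute value; this is what later allows an index to be recovered from a generated value.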

In summary, the tables are generated independently of one another. Within each table, the primary key range is partitioned among the generation threads, so data can be generated in parallel on multiple nodes.

Transaction logic in the present invention:

The transaction logic referred to here represents the underlying business logic of an OLTP application. During database testing, transaction logic significantly affects the likelihood of deadlocks and the proportion of distributed transactions, and therefore the performance of the database under test. This part first introduces the definition of transaction logic and then presents the algorithm for extracting it.

The relationships among SQL parameters in a transaction template, and between SQL parameters and returned items, determine the hidden semantics between SQL operations. After surveying existing OLTP benchmark loads and real application loads, the present invention focuses on four types of relationship. First, the equality relationship (ER) is the most common: with a certain probability, two SQL parameters are equal. Second, the inclusion relationship (IR) is also common: because the result set returned by an SQL operation may be a set of tuples, the value of an SQL parameter may be one of the values in an earlier result set. Third, the linear relationship (LR) supplements and extends the equality relationship and is more expressive. Fourth, the between relationship (BR) is introduced for predicates of the form "col between p1 and p2" and "col ≥ p1 and col ≤ p2", in which p1 and p2 are in a between relationship.

Transaction logic is formally defined below. O_i denotes the i-th operation in the transaction template; p_{i,j} denotes the j-th parameter of O_i; r_{i,j} denotes the j-th returned item of O_i; both i and j are counted from 1.

Definition 2, transaction logic (Transaction logic): For a transaction template, the transaction logic consists of transaction structure information and parameter dependency information, as detailed below:

· Transaction structure information:

#1 The probability that each branch in a branch structure is executed.

#2 The average number of executions of the operations in a loop structure.

· Parameter dependency information (for each parameter p_{i,j}):

$1 [p_{i,l}, p_{i,j}, BR, Δ]

$2 A list of dep-items, where dep-item ∈ {[p_{m,n}, p_{i,j}, ER, ξ], [p_{m,n}, p_{i,j}, LR, ξ, (a,b)], [r_{x,y}, p_{i,j}, ER, ξ], [r_{x,y}, p_{i,j}, LR, ξ, (a,b)], [r_{x,y}, p_{i,j}, IR, ξ]}; m ≤ i; x < i; if m = i, then n < j.

$3 A list of [p_{i,j}, LR, ξ, (a,b)]; O_i must be an operation in a loop structure.

If a parameter p_{i,j} has dependency $1, then neither $2 nor $3 is needed, because p_{i,j} can be expressed as (p_{i,l} + Δ), where Δ is the average increment computed from the load trace. In each dependency, ξ denotes the probability that the dependency is satisfied, and (a, b) are the two coefficients that describe the linear relationship. Dependency $3 describes the linear relationship between the values of the same parameter across consecutive executions of a loop operation.

Transaction logic reflects the business logic of the application layer and does not change frequently, so there is no need to analyze load traces covering a long period of time. Because the transaction logic of every transaction template is analyzed in the same way and independently of the others, the extraction algorithm below is described for a single transaction template. The algorithm has six steps; its input is a transaction template and the K transaction instances of that template in the load trace. The algorithm is as follows:

Step 1: Extract the transaction structure information. The load trace is traversed to count how many times each operation of the transaction template is executed, from which the execution probability of each branch and the average number of executions of each loop operation are computed.
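Step 1 can be sketched in Python as follows. The trace representation (one list of executed operation identifiers per transaction instance) and the exact definitions used here are assumptions: the branch probability is taken as the fraction of instances in which an operation runs at least once, and the loop average as the mean number of runs over the instances that execute it.

```python
from collections import Counter


def structure_info(instances):
    """Return (execution probability, average executions) per operation."""
    executed_in = Counter()  # number of instances that run the operation
    total_runs = Counter()   # total number of runs over all instances
    for operations in instances:
        for op, runs in Counter(operations).items():
            executed_in[op] += 1
            total_runs[op] += runs
    k = len(instances)
    probability = {op: executed_in[op] / k for op in executed_in}
    average_runs = {op: total_runs[op] / executed_in[op] for op in executed_in}
    return probability, average_runs


# Hypothetical trace of K = 4 instances: O2 and O3 are branches, O4 is a loop operation.
trace = [
    ["O1", "O2", "O4", "O4", "O4"],
    ["O1", "O3", "O4"],
    ["O1", "O2", "O4", "O4"],
    ["O1", "O2"],
]
probability, average_runs = structure_info(trace)
```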

Step 2: Determine BR. The present invention first identifies all parameter pairs <p_{i,l}, p_{i,j}> in the transaction template that satisfy BR and then traverses the load trace to obtain the average increment Δ. It then builds dependency $1 for p_{i,j}, after which Steps 3-6 can skip p_{i,j}.

Step 3: Collect ER and IR information. For each parameter p_{i,j} in the transaction template, the present invention traverses every parameter p_{m,n} that precedes it and counts the number of transactions in which the pair satisfies ER (that is, p_{i,j} = p_{m,n}). Similarly, it traverses the preceding returned items r_{x,y} and counts the number of transactions that satisfy ER and IR, respectively.

Step 4: Collect LR information. LR applies only to numeric parameters and returned items, and the returned items must come from an operation that filters on the primary key. Because computing the LR coefficients (a, b) requires two transaction instances, the present invention randomly selects N groups of transaction instances (two per group) from the K transaction instances and computes the LR coefficients (a, b) of each parameter pair for every group. An LR with coefficients (1, 0) is ignored here because it is already represented by ER.
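The coefficient computation of Step 4 can be sketched as follows, assuming the linear relationship has the form y = a*x + b, where x is the earlier parameter or returned item and y is the dependent parameter. The function names and the observation format are hypothetical; exact rational arithmetic is used so that identical coefficients from different groups compare equal.

```python
import random
from collections import Counter
from fractions import Fraction


def lr_coefficients(first, second):
    """Solve y = a*x + b from two (x, y) observations; None if x does not vary."""
    (x1, y1), (x2, y2) = first, second
    if x1 == x2:
        return None
    a = Fraction(y2 - y1, x2 - x1)
    b = y1 - a * x1
    return (a, b)


def collect_lr(observations, n_groups, rng):
    """Count how often each (a, b) is produced by n_groups random instance pairs."""
    counts = Counter()
    for _ in range(n_groups):
        first, second = rng.sample(observations, 2)
        coefficients = lr_coefficients(first, second)
        if coefficients is not None and coefficients != (1, 0):  # (1, 0) is ER
            counts[coefficients] += 1
    return counts


# Hypothetical observations of one parameter pair over K = 20 instances, y = 2x + 3.
observations = [(x, 2 * x + 3) for x in range(1, 21)]
counts = collect_lr(observations, n_groups=50, rng=random.Random(1))
```

Dividing a count by N gives one plausible estimate of the probability ξ of the corresponding LR dependency; the description does not state the estimator, so this is an assumption.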

Step 5: Determine ER, IR, and LR through a trade-off. With the statistics obtained in Steps 3-4, dependency $2 can easily be constructed for each parameter p_{i,j}. However, a parameter may have many candidate dependencies, some of them with a very small probability ξ, so the present invention makes a trade-off among them to eliminate noise and reduce subsequent computation. It selects the most important dependencies, such as those with higher probability, and ensures that the probabilities of the selected dependencies sum to less than 1. ER is preferred, because the experiments showed ER to be much more important than IR and LR.

Step 6: Construct the LR for loop structures. The same operation runs multiple times in a loop structure, and the present invention uses dependency $3 to describe how its parameters change. The load trace is traversed to compute the change in parameter values between consecutive executions of the loop operation. The coefficients (a, b) are computed as in Step 4, and dependency $3 is then constructed from the statistics.

If Steps 3-4 encounter an operation inside a loop, only the load trace of the first execution of the loop is used.

Data access distribution in the present invention:

The data access distribution has a significant effect on the conflict intensity of the transactions in a load and on the cache hit rate of the database system, so it has always been regarded as an important load characteristic. This part first describes the basic skewed data access distribution, then describes the dynamics and the continuity of the data access distribution, and finally presents the algorithm for generating candidate parameters.

The data access distribution of the synthetic load depends on the parameter values instantiated in the transaction templates. Without loss of generality, assume that every predicate that determines the data access distribution in an OLTP application can be written in the form "col op para". The present invention represents the skewed data access distribution (S-Dist) of each parameter using the high frequency items (HFI) and the histogram statistics (HS) extracted from the load trace. The HFI records the H hottest data items, that is, those with the highest frequency. The domain of the attribute is divided evenly into I intervals, and the frequency and cardinality of the parameters in each interval (excluding those already in the HFI) are recorded as the HS. Figure 3 shows an example of an S-Dist in which H and I are both 5 and the corresponding attribute is an integer with domain [0, 2000]. In the HFI, the hottest item is 57, with a frequency of 0.17; in the first interval of the HS there are 20 distinct parameter values with a total access frequency of 0.08.

The data items in an S-Dist come from a load trace that ran on real data, but the same data do not necessarily exist in the synthetic database generated by the present invention, so the HFI must first undergo data conversion. Suppose the attribute generator is "index = ranInt[1,400], value = index*5", where 400 is the attribute cardinality. In the first step, the attribute generator is used to regenerate the data items in the HFI; as shown in Figure 4, 57 is replaced by 195. In the second step, the cumulative probability array ("cumu prob" in Figure 4) is computed from the frequencies of the high frequency items and of all intervals. Finally, a random number between 0 and 1 is generated and mapped to an entry of the cumulative probability array, which selects a suitable parameter to complete the predicate. Figure 4 gives two examples of selecting a parameter value.
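The selection through the cumulative probability array can be sketched in Python as follows. Only the item 195 with frequency 0.17 and the first interval frequency 0.08 come from the example above; all other items and frequencies are hypothetical, as are the function names. The sketch returns either a hot item or the identifier of an interval, from which a parameter is then generated.

```python
import bisect
import itertools


def build_cumulative(hfi, hs_frequencies):
    """Entries (hot items first, then intervals) and their cumulative frequencies."""
    entries = [("item", value) for value, _ in hfi]
    entries += [("interval", i) for i in range(len(hs_frequencies))]
    frequencies = [freq for _, freq in hfi] + list(hs_frequencies)
    return entries, list(itertools.accumulate(frequencies))


def pick_entry(entries, cumulative, r):
    """Map a random number r in [0, 1) to an entry of the cumulative array."""
    position = bisect.bisect_right(cumulative, r * cumulative[-1])
    return entries[min(position, len(entries) - 1)]


hfi = [(195, 0.17), (840, 0.12), (1210, 0.10), (60, 0.08), (1975, 0.05)]
hs_frequencies = [0.08, 0.10, 0.12, 0.09, 0.09]
entries, cumulative = build_cumulative(hfi, hs_frequencies)
```

For instance, `pick_entry(entries, cumulative, 0.10)` falls into the first slot and selects the hot item 195, while a random number of 0.55 falls just past the hot items and selects interval 0.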

In addition, to control the cardinality of the parameters generated in each interval, the present invention redefines the random index generator as

[formula image not reproduced]

where cdn_i is the cardinality of the target interval, cdn_avg is the average cardinality of each interval, and minIdx_i is the minimum index of the target interval. In Figure 4, the random index generator for interval 2 is

[formula image not reproduced]

where cdn₂ = 50 and minIdx₂ = 80*2+1 = 161.

Although the parameter in the example above is an integer, the method of the present invention is general. For all numeric parameters on non-key attributes, the S-Dist and the parameter generation are exactly the same. For parameters on key attributes there is a small difference. Because primary keys in the synthetic database are generated sequentially, the domain of a key attribute in the synthetic database may differ from that in the real database. Therefore, when collecting the S-Dist statistics, the present invention uses the domain of the key attribute in the real database to divide the intervals and build the HS, whereas during synthetic load generation it uses the domain of the key attribute in the synthetic database to generate parameters. For string parameters, the main difference is how the intervals are divided: when the HS is built, the interval of a string parameter is computed as h % I, where h is the hash value of the parameter. The index-to-value converter is the same as the one used for synthetic database generation.

Dynamics. If the data access distribution changes over time, an S-Dist is inaccurate or even completely wrong. Suppose there is a table with 100 records and a load that lasts 100 seconds. During the i-th second, all database requests of the load access only the i-th record of the table, and the throughput of the database is stable throughout. If a single S-Dist is used to represent the whole data access process, it shows no hot data and a perfectly uniform access distribution, which is clearly far from the truth. With a synthetic load generated from this S-Dist, the intensity of transaction conflicts on the database would be much lower than under the real load. The present invention therefore proposes D-Dist, which adds a description of dynamics on top of S-Dist. First, the load trace of a parameter is divided into multiple time windows of equal length according to the log timestamps. Then a separate S-Dist is generated for the parameter trace in each time window, and the D-Dist of the whole parameter trace is defined as the list of these S-Dists. Finally, a symbolic parameter is instantiated using the S-Dist that corresponds to the generation time. In addition, for numeric parameters, the data range used within a time window may be much smaller than the domain of the attribute. To improve the precision of the HS, the intervals can be divided according to the data range of the current window. In that case, the corresponding index range of each interval must also be used when generating parameters.
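The 100-record example above can be reproduced with a small Python sketch that builds one S-Dist per time window. The trace format (a list of (timestamp, value) pairs), the dictionary layout of an S-Dist, and the function names are assumptions; windows that contain no requests are simply skipped in this sketch.

```python
from collections import Counter


def s_dist(values, h, num_intervals, low, high):
    """HFI (top-h items with frequency) and HS (frequency, cardinality per interval)."""
    counts = Counter(values)
    total = len(values)
    hot = counts.most_common(h)
    hot_values = {value for value, _ in hot}
    width = (high - low) / num_intervals
    frequency = [0.0] * num_intervals
    cardinality = [0] * num_intervals
    for value, count in counts.items():
        if value in hot_values:
            continue  # items already in the HFI are excluded from the HS
        i = min(int((value - low) / width), num_intervals - 1)
        frequency[i] += count / total
        cardinality[i] += 1
    return {"hfi": [(v, c / total) for v, c in hot],
            "hs": list(zip(frequency, cardinality))}


def d_dist(trace, window_seconds, **s_dist_args):
    """Split (timestamp, value) pairs into equal windows; one S-Dist per window."""
    windows = {}
    for timestamp, value in trace:
        windows.setdefault(int(timestamp // window_seconds), []).append(value)
    return [s_dist(windows[w], **s_dist_args) for w in sorted(windows)]


# During second i (0-based here), only record i + 1 is accessed, five times.
trace = [(i + 0.1 * k, i + 1) for i in range(100) for k in range(5)]
per_window = d_dist(trace, 1, h=1, num_intervals=5, low=1, high=100)
overall = s_dist([value for _, value in trace], h=1, num_intervals=5, low=1, high=100)
```

In every window the single hot item has frequency 1.0, whereas in the global S-Dist the hottest item has frequency 0.01, which is why the global distribution looks uniform.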

Continuity. In some applications the hotness of data is closely related to time: a data item may be accessed continuously for a period of time. The present invention calls this the continuity of the data access distribution. D-Dist captures the skew of data access within each time window but ignores the continuity of data access between consecutive time windows. When it is used to generate a synthetic load, the data accessed in consecutive time windows may be completely different, which leads to a lower cache hit rate. The present invention therefore proposes C-Dist, which adds a description of continuity on top of D-Dist. When collecting statistics, the present invention computes the repetition rate between the high frequency items of the current time window and those of the previous time window, as well as the parameter repetition rate of every interval. Figure 5 adds the repetition rates of the HFI and the HS to the example of Figure 3. In this example, the repetition rate of the HFI is 0.6, meaning that three high frequency items are retained from the previous time window. The repetition rates of the five intervals are 0, 0.33, 0.5, 0.46, and 0.56, respectively. Assuming cdn₁ = 15, interval 1 contains 15*0.33 ≈ 5 parameters that also appeared in the previous time window.
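The repetition rate can be computed with a few lines of Python. Apart from the item 57, the item values below are hypothetical and are chosen so that three of the five hot items repeat, as in the example; the function name is an assumption.

```python
def repetition_rate(current, previous):
    """Fraction of the distinct items of the current window also seen in the previous one."""
    current, previous = set(current), set(previous)
    if not current:
        return 0.0
    return len(current & previous) / len(current)


previous_hfi = [57, 95, 640, 20, 1500]     # hypothetical hot items of the previous window
current_hfi = [57, 310, 95, 1200, 640]     # three of the five hot items are retained
hfi_rate = repetition_rate(current_hfi, previous_hfi)

repeated_in_interval_1 = round(15 * 0.33)  # cdn_1 = 15, repetition rate 0.33
```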

To guarantee the parameter repetition rates in a C-Dist, candidate parameters must be generated in advance for each time window. Algorithm 1 gives the detailed generation procedure. Lines 1-2 generate all high frequency items so that the expected repetition rate is met. Lines 3-10 traverse all parameters of the previous time window and select repeated parameters for each interval until the repetition rate of the interval is satisfied. In line 6, the index of a parameter is derived from its value in order to identify the interval it belongs to in the current time window; if the index is outside the index domain of the current time window, the parameter is ignored. For a string parameter such as "296#dgtckuy", the part before the "#" character is the required index. Finally, lines 11-13 generate random parameters that are added to each interval to meet its cardinality requirement. Based on these candidate parameters, parameters are instantiated with the parameter generation mechanism of Figure 4: once an interval has been chosen, one of its candidate parameters is selected at random as the output. If the candidate parameters were generated online during synthetic load generation, the load generator could become a performance bottleneck and distort the evaluation results. The candidate parameters of all time windows can therefore be generated offline, stored on disk, and read as needed while the synthetic load is generated.
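The interval part of the candidate generation can be sketched as follows. This is a simplified illustration and not Algorithm 1 itself: the high frequency items are omitted, candidates are represented directly by their indexes, every interval is assumed to span the same number of indexes, and the previous window is assumed to offer enough reusable and enough unused indexes. The cardinalities 20, 15, and 50 and the repetition rates come from the examples above; the remaining values and all names are hypothetical.

```python
import random


def candidates_for_window(previous, cdn, rep_rate, min_idx, width, rng):
    """One list of candidate indexes per interval for the current time window."""
    previous = set(previous)
    result = []
    for i, target in enumerate(cdn):
        low, high = min_idx[i], min_idx[i] + width - 1
        # Indexes of the previous window that fall into this interval.
        reusable = sorted(idx for idx in previous if low <= idx <= high)
        n_repeat = min(round(target * rep_rate[i]), len(reusable))
        repeated = rng.sample(reusable, n_repeat)
        # Fresh indexes must not appear in the previous window, otherwise
        # the repetition rate would exceed its target.
        fresh_pool = [idx for idx in range(low, high + 1) if idx not in previous]
        fresh = rng.sample(fresh_pool, target - n_repeat)
        result.append(repeated + fresh)
    return result


min_idx = [80 * i + 1 for i in range(5)]   # 1, 81, 161, 241, 321
cdn = [20, 15, 50, 13, 25]                 # cardinality of each interval
rep_rate = [0, 0.33, 0.5, 0.46, 0.56]      # repetition rate of each interval
previous = [m + k for m in min_idx for k in range(30)]
window = candidates_for_window(previous, cdn, rep_rate, min_idx, 80, random.Random(3))
```

Generating these lists ahead of time for all windows matches the offline strategy described above; at run time only a random choice within the selected interval remains.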

Load generation in this embodiment:

Given the transaction logic of each transaction template and the data access distribution of each parameter, the load generator (Workload Generator) in Figure 1 is responsible for producing a synthetic load that satisfies the specified configuration. Efficiently generating a synthetic load with high concurrency and high throughput in a distributed environment is also a basic requirement for the load generator. The details of load generation are described below at three levels: threading model, transaction execution, and parameter instantiation.

Threading model. When deploying the load generator, the user can configure the number of test nodes and the number of test threads on each node to simulate concurrency. A separate database connection is established for each test thread. Two execution models are implemented for issuing transactions from the test threads: no-wait loop and fixed throughput. In the no-wait loop model, every test thread issues transactions continuously, with no think time between requests. In the fixed throughput model, the user specifies either a fixed request throughput or a throughput scale factor. If a throughput scale factor is specified, the target throughput of each time window is the throughput of that window multiplied by the scale factor. The test threads reach the required throughput by controlling the think time between transaction requests. When the required throughput exceeds the maximum that the test threads can achieve, the execution model degenerates into the no-wait loop. These execution models allow the present invention to build scalable synthetic loads.
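One simple way to derive the think time in the fixed throughput model is sketched below. It assumes that the target throughput is split evenly among the test threads and that each thread sleeps for whatever remains of its per-transaction time budget; the description does not specify the pacing rule, so this is an illustration with hypothetical function names.

```python
def window_target(window_throughput, scale_factor):
    """Target throughput of a window: window throughput times the scale factor."""
    return window_throughput * scale_factor


def think_time(target_tps, num_threads, last_txn_seconds):
    """Seconds a thread should wait before issuing its next transaction."""
    per_thread_interval = num_threads / target_tps  # time budget per transaction
    # A negative remainder means the target is unreachable: no-wait loop.
    return max(0.0, per_thread_interval - last_txn_seconds)


target = window_target(window_throughput=500, scale_factor=2)     # 1000 tps
pause = think_time(target, num_threads=20, last_txn_seconds=0.005)
no_pause = think_time(target, num_threads=20, last_txn_seconds=0.05)
```

With 20 threads and a target of 1000 transactions per second, each thread has a budget of 20 ms per transaction, so a 5 ms transaction is followed by about 15 ms of think time, while a 50 ms transaction leaves no think time at all.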

Transaction execution. A test thread invokes the different transaction types according to the transaction mix extracted from the load trace. The transaction mix is adjusted periodically, once per time window. When a transaction is executed, the structure information of the transaction logic determines whether the operations in a branch structure are executed and how many times the operations in a loop structure are executed. To execute an SQL operation, the present invention first instantiates all its parameters one by one and then sends the operation with concrete parameter values to the test database. After the operation has been executed, its result set and parameters are saved as intermediate state so that other parameters can be generated for subsequent operations within the same transaction instance.

Parameter instantiation. When a parameter is instantiated, the consistency of the transaction logic is ensured first, and the data access distribution of the synthetic load second. A parameter falls into one of the following cases. Case 1: If the parameter has only dependency $1, its value is computed directly from the increment Δ and the associated smaller parameter. Case 2: If the parameter has only dependency $2, the present invention first tries to instantiate it by randomly selecting one of the dependency items according to their probabilities; if no dependency item is selected, the parameter is instantiated from the data access distribution. Case 3: If the parameter has both dependency $2 and dependency $3, the corresponding operation must be in a loop structure. For the first execution of the loop, the parameter is instantiated from dependency $2 and the data access distribution, as in Case 2. For subsequent executions of the loop, the present invention first tries, according to the probabilities, to instantiate the parameter with dependency $3; if no dependency item is selected, it falls back to dependency $2 and the data access distribution.
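The three cases can be sketched in Python as follows. The data layout is an assumption made for illustration: a dependency item is a tuple (source key, relationship, probability ξ, coefficients), the intermediate state is a dictionary of earlier parameter values and result sets, and `sample_distribution` stands for the data access distribution sampler.

```python
import random


def apply_dependency(dep, state, rng):
    """Compute a parameter value from one dependency item."""
    source, relationship, _probability, coefficients = dep
    value = state[source]
    if relationship == "ER":
        return value
    if relationship == "LR":
        a, b = coefficients
        return a * value + b
    if relationship == "IR":
        return rng.choice(value)  # value is a list taken from an earlier result set
    raise ValueError(relationship)


def try_dependencies(deps, state, rng):
    """Select at most one dependency item according to the probabilities."""
    r, accumulated = rng.random(), 0.0
    for dep in deps:
        accumulated += dep[2]
        if r < accumulated:
            return apply_dependency(dep, state, rng)
    return None  # no dependency item selected


def instantiate(param, state, rng, sample_distribution, loop_iteration=0):
    if "br" in param:                                  # Case 1: dependency $1
        smaller, delta = param["br"]
        return state[smaller] + delta
    if loop_iteration > 0 and "loop_lr" in param:      # Case 3, later iterations: $3
        value = try_dependencies(param["loop_lr"], state, rng)
        if value is not None:
            return value
    value = try_dependencies(param.get("deps", []), state, rng)  # dependency $2
    if value is not None:
        return value
    return sample_distribution(rng)                    # data access distribution


rng = random.Random(5)
state = {"p1_1": 5, "p2_1": 7, "r1_1": [11, 12, 13]}
```

For example, `instantiate({"br": ("p1_1", 10)}, state, rng, None)` yields 15, and a parameter without any dependency is always drawn from the supplied distribution sampler.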

In summary, transaction execution and parameter instantiation are independent across all test threads, so the load generator of the present invention can be deployed on multiple nodes to efficiently generate a synthetic load with high concurrency and high throughput while satisfying the required load characteristics and configuration.

Example

Experimental environment

Experimental hardware configuration: 4 physical nodes, each with 2 Intel Xeon Silver 4110 CPUs @ 2.1 GHz, 120 GB of memory, and 4 TB of RAID-5 storage with a 4 GB RAID cache. The physical nodes communicate over 10 Gigabit Ethernet.

Experiment 1: Using the TPC-C load as the object to be simulated and varying the scale factor, the fidelity of the synthetic load is assessed by comparing the throughput, latency, CPU utilization, and disk utilization of the real load (the TPC-C load) and the synthetic load on the same database. The database is PostgreSQL, and the number of concurrent database requests equals the scale factor. The results are shown in Figures 6a-6d. Figure 6a shows the transaction throughput of the real and synthetic loads; the throughput of the two loads is very similar, with a maximum deviation of 6.29%. Figure 6b shows the average latency and the 95% latency; on both metrics the synthetic load is very close to the real load, with a maximum deviation of only 8.99%. Figures 6c and 6d show the CPU and disk utilization of the two loads, respectively. The resource consumption of PostgreSQL when executing the synthetic load is consistent with that under the real load, which further confirms the high fidelity of the synthetic load generated by the present invention.

Experiment 2: This experiment demonstrates the ability of the data access distributions (S-Dist, D-Dist, and C-Dist) to describe the skew, dynamics, and continuity of data access. Because the data access distributions of existing benchmark loads are usually neither dynamic nor continuous, the evaluation loads are built on YCSB. All runs use a MySQL database that contains one test table from YCSB. The test table has 10⁶ records, and the number of concurrent database requests is 20.

The evaluation load in Figure 7 has a single transaction type. The transaction consists of five pairs of read and write operations; each pair first reads a record and then updates it. The extended YCSB load runs for 90 seconds in three phases, and the data requests of each phase are chosen randomly among 10³ records. In the first phase, the data access distribution is a Zipf distribution with parameter s = 1; the second phase is also a Zipf distribution, but with s = 1.2; the third phase is a uniform distribution. Figure 7 shows how the transaction throughput, latency, and number of deadlocks change over time for YCSB and for the loads generated by the present invention. When D-Dist is used, the synthetic load generated by the present invention tracks the real load generated by YCSB in throughput, latency, and number of deadlocks, which shows that D-Dist describes the dynamics of the load well. Since within each time window D-Dist is represented by an S-Dist, this also shows that S-Dist captures the skew of the load well. A global S-Dist, however, performs poorly, because it is defined over the entire duration of the load and does not account for its dynamic changes.

The evaluation load in Figure 8 is the YCSB single-row update transaction, run for 100 seconds with a time window of 1 second. The data requests in each time window are drawn from 10³ random records, and 50% of the records selected for each time window are the same as those of the previous window. MySQL's Innodb_buffer_pool_size is set to 16 MB. Figure 8 shows the throughput and the increment of Innodb_buffer_pool_reads for the loads generated by YCSB and by the present invention. Innodb_buffer_pool_reads counts the logical reads that InnoDB could not satisfy from the buffer pool and had to read directly from disk. The results show that the load generated with D-Dist has a markedly higher disk access rate than YCSB and a lower throughput. This is because D-Dist cannot capture the continuity of the data access distribution, so the data requests in successive time windows are almost completely different and the cache hit rate is low. The load generated with C-Dist performs consistently with YCSB, which indicates that C-Dist characterizes the continuity of data access well.
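The window-to-window continuity used in this experiment can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual load driver: record ids are taken to be integers in [0, 10⁶), and the helper name `next_window` is hypothetical.

```python
import random

TABLE_SIZE = 10**6      # size of the YCSB test table (from the text)
WINDOW_RECORDS = 10**3  # records accessed in each 1-second window (from the text)
OVERLAP = 0.5           # share of records carried over from the previous window


def next_window(previous, rng):
    """Build the record set for the next window.

    Half of the records are kept from `previous`; the rest are fresh ids
    that did not appear in `previous`, so the overlap is exactly 50%.
    For the first window, pass an empty set.
    """
    keep_count = int(WINDOW_RECORDS * OVERLAP) if previous else 0
    kept = set(rng.sample(sorted(previous), keep_count))
    fresh = set()
    while len(kept) + len(fresh) < WINDOW_RECORDS:
        candidate = rng.randrange(TABLE_SIZE)
        if candidate not in previous:
            fresh.add(candidate)
    return kept | fresh
```

A generator that redraws all 10³ records independently in every window would almost never revisit the previous window's records, which matches the low cache hit rate reported for D-Dist.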

Experiment 3: With the number of transaction instances (K) equal to the number of transaction instance groups (N), K and N are varied together, and the performance of the transaction logic extraction algorithm is observed through its execution time and memory consumption. The results are shown in Figure 9. When K and N are both 10⁴, extracting the transaction logic takes only 2.1 seconds and consumes 1.1 GB of memory. As K and N increase, execution time and memory consumption grow almost linearly. Overall, the proposed transaction logic extraction is efficient and completes within a few seconds.

Experiment 4: The length of the load trace is varied, and the performance of the C-Dist extraction algorithm is observed through its execution time and memory consumption. The results are shown in Figure 10. The extraction time of C-Dist grows linearly with the length of the load trace, while memory consumption stays constant. This is because C-Dist is a window-based data access distribution, so the load trace of each time window can be removed from memory once it has been processed. In Figure 10, when the TPC-C load trace covers 10⁴ seconds (transaction throughput 3610.3, log volume 33.8 GB), the extraction time of C-Dist is 678.7 s and the memory consumption is 1.2 GB. Since the longest load cycle considered in practical evaluations is generally one day, the present invention can effectively support the performance evaluation of high-throughput loads.
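The constant-memory behaviour follows from processing one time window at a time. The Python sketch below illustrates only that streaming pattern. The per-window summary it emits (number of distinct keys and the most frequent key) is a hypothetical stand-in, not the C-Dist representation of the invention, and the trace is assumed to be an iterable of (timestamp, key) pairs in time order.

```python
from collections import Counter
from itertools import groupby


def window_summaries(trace, window_seconds=1.0):
    """Yield (window_id, distinct_keys, most_frequent_key) per time window.

    Only the records of the current window are counted at any moment; the
    counter is rebuilt for every window, so memory use depends on the
    window size rather than on the total trace length.
    """
    def window_of(record):
        return int(record[0] // window_seconds)

    for window_id, records in groupby(trace, key=window_of):
        counts = Counter(key for _, key in records)
        top_key, _ = counts.most_common(1)[0]
        yield window_id, len(counts), top_key
```

Because `trace` can be a lazy iterator, for example one that reads a log file line by line, the full trace never needs to be held in memory.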

The scope of protection of the present invention is not limited to the above embodiments. Variations and advantages that would occur to those skilled in the art without departing from the spirit and scope of the inventive concept are included in the present invention, and the scope of protection is defined by the appended claims.

Claims

1. An application-oriented transaction load generation system, characterized by comprising a database generation module and a load generation module, wherein:

The database generation module is used to generate a test database by acquiring database schema and data features;

The load generation module generates a synthetic load similar to the real load by analyzing the load trace of the real application; the load generation module includes:

A transaction logic analyzer, which extracts transaction logic information by analyzing the full load trace over a short period of time;

A data access distribution analyzer, which extracts data access distribution information and throughput information by analyzing partial load traces over a long period of time;

A load generator, which uses the previously extracted transaction logic information, data access distribution information, and throughput information to instantiate the parameter values in the transaction template and thereby generate a synthetic load; when executing a transaction, the load generator uses the structure information of the transaction logic to determine whether the operations in a branch structure need to be executed and how many times the operations in a loop structure are executed; to execute an SQL operation, all of its parameters are first instantiated one by one, and the operation with concrete parameter values is then sent to the test database; after the operation is executed, the result set and the parameters are saved as intermediate state for generating other parameters in subsequent operations within the same transaction instance.

2. The application-oriented transaction load generation system of claim 1, wherein the data features are obtained automatically by a data feature extractor using SQL queries.

3. The application-oriented transaction load generation system of claim 1, wherein the number of test nodes and the number of test threads on each node can be configured in the load generator to simulate concurrency, and a separate database connection is established for each test thread.

4. The application-oriented transaction load generation system according to claim 1, wherein, in the load generator: if a parameter has only the dependency $1, its value is calculated directly from the increment Δ and the associated smaller parameter; if the parameter has only dependencies $2, the load generator first tries to instantiate the parameter by randomly selecting one of the dependencies according to their probabilities and, when none of the dependencies is selected, instantiates the parameter using the data access distribution; if the parameter has both dependencies $2 and $3, the corresponding operation must be in a loop structure: on the first execution of the loop, the parameter is instantiated based on the dependencies $2 and the data access distribution; on subsequent executions, the load generator first tries to instantiate the parameter using the dependencies $3 according to their probabilities and, if no dependency is selected, instantiates the parameter using the dependencies $2 and the data access distribution.

5. An application-oriented transaction load generation method, comprising the following steps:

Step A: Generate a test database by acquiring the database schema and data characteristics;

Step B: Generate a synthetic load similar to the real load by analyzing the load trace of the real application, including the following sub-steps:

Step B1: Extract transaction logic information by analyzing the complete load trace over a short period of time;

Step B2: Extract data access distribution information and throughput information by analyzing partial load traces over a long period of time;

Step B3: Using the previously extracted transaction logic information, data access distribution information, and throughput information, instantiate the parameter values in the transaction template to generate a synthetic load; when the load generator executes a transaction, it uses the structure information of the transaction logic to determine whether the operations in a branch structure need to be executed and how many times the operations in a loop structure are executed; to execute an SQL operation, all of its parameters are first instantiated one by one, and the operation with concrete parameter values is then sent to the test database; after an operation is executed, the result set and the parameters are saved as intermediate state for generating other parameters in subsequent operations within the same transaction instance.

6. The application-oriented transaction load generation method according to claim 5, wherein in step A, the test database comprises a plurality of tables satisfying the primary key constraints, the foreign key constraints, and the data characteristics of the non-key attributes; step A specifically includes the following steps:

Step A1: Generate primary keys in sequence;

Step A2: When generating a foreign key, randomly generate it within the value range of the primary key it refers to;

Step A3: Generate the values of non-key attributes using a random attribute generator, which comprises a random index generator and an index-to-value converter, while satisfying the expected data characteristics.

7. The application-oriented transaction load generation method according to claim 6, characterized in that, before the key values are generated, their value domains are first determined: in the first step, if the primary key consists of a single attribute, its value domain is [1, s], where s is the size of the table; in the second step, the value domain of a foreign key attribute is determined by the primary key it refers to; in the third step, the value domains of the non-foreign-key attributes in a composite primary key are processed: if there is only one non-foreign-key attribute in the composite primary key, then the value domain of this attribute is

8. The application-oriented transaction load generation method of claim 6, wherein the output of the random index generator is an integer from 1 to n, where n is the cardinality of the attribute; given an index, the index-to-value converter deterministically maps it to a value in the attribute's value domain; different index-to-value converters are used depending on the data type of the attribute: for numeric types, a linear function is used that maps the indices uniformly onto the attribute's value domain; for string types, seed strings that meet the length requirement are randomly generated, a seed string is first selected according to the input index, and the index and the selected seed string are then concatenated as the output value.

9. The application-oriented transaction load generation method according to claim 5, wherein in the step B1, the transaction logic extraction algorithm comprises:

Step B11: Count the number of executions of each operation in the transaction template by traversing the load trace, and from these counts calculate the execution probability of each branch and the average number of executions of each loop operation;

Step B12: Identify all parameter pairs <p_{i,l}, p_{i,j}> in the transaction template that satisfy BR, and then traverse the load trace to obtain the average increment Δ; build a dependency $1 for p_{i,j};

Step B13: For each parameter p_{i,j} in each transaction template, traverse each parameter p_{m,n} that precedes it and count the transactions in which the parameter pair satisfies ER; similarly, traverse each preceding result-set element r_{x,y} and count the transactions that satisfy ER and IR, respectively;

Step B14: randomly select N transaction instance groups from the K transaction instances, two in each group; then calculate the LR coefficients (a, b) of the parameter pairs in each group of transactions;

Step B15: Use the statistics obtained in steps B13-B14 to construct dependencies $2 for each parameter p_{i,j};

Step B16: In a loop structure, the same operation is run multiple times, and dependencies are used to describe how its parameters change; by traversing the load trace, calculate the changes in parameter values across consecutively executed loop operations; the coefficients (a, b) are calculated in the same way as in step B14, and dependencies $3 are then constructed from the statistics.

10. The application-oriented transaction load generation method according to claim 5, wherein in step B, the method specifically comprises:

Step B21: Generate all high-frequency terms that meet the expected repetition rate;

Step B22: traverse all parameters in the previous time window, and select repetition parameters for each interval until the parameter repetition rate on the interval is satisfied;

Step B23: derive the index of the parameter according to the value of the parameter, thereby identifying the interval to which it belongs in the current time window; if the index is not in the index domain of the current time window, ignore the parameter;

Step B24: Generate random parameters and add them to each interval to meet the cardinality requirement; based on the candidate parameters, use the parameter generation mechanism to instantiate the parameters: within a given interval, a candidate parameter is simply selected at random as the output.
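For illustration only, the random index generator and the index-to-value converters recited in claim 8 might be sketched as below. This is a non-authoritative Python sketch: the function names, the lowercase seed alphabet, the number of seed strings, and the modulo rule for selecting a seed are all assumptions, since the claim does not fix these details.

```python
import random
import string


def random_index(cardinality, rng):
    """Random index generator: an integer from 1 to n (n = attribute cardinality)."""
    return rng.randint(1, cardinality)


def numeric_converter(low, high, cardinality):
    """Linear function mapping indices 1..cardinality uniformly onto [low, high]."""
    step = (high - low) / (cardinality - 1) if cardinality > 1 else 0.0

    def convert(index):
        return low + (index - 1) * step

    return convert


def string_converter(seed_count, seed_length, rng):
    """Converter for string attributes, built from random seed strings.

    Assumption: the seed is chosen as index modulo the number of seeds,
    and the index is placed before the seed in the output value.
    """
    seeds = [
        "".join(rng.choices(string.ascii_lowercase, k=seed_length))
        for _ in range(seed_count)
    ]

    def convert(index):
        return f"{index}{seeds[index % seed_count]}"

    return convert
```

Both converters are deterministic once constructed, so the same index always yields the same attribute value, and distinct indices yield distinct values.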