CN115048464A

CN115048464A - User operation behavior data detection method and device and electronic equipment

Info

Publication number: CN115048464A
Application number: CN202110251231.XA
Authority: CN
Inventors: 顾强; 孙小娟; 屈林波; 丁乐
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Jiangsu Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Jiangsu Co Ltd
Priority date: 2021-03-08
Filing date: 2021-03-08
Publication date: 2022-09-13

Abstract

The invention provides a method and a device for detecting user operation behavior data, and electronic equipment, belonging to the technical field of computers. The method comprises: collecting user operation behavior data; performing entity extraction on the user operation behavior data to obtain entity identification data; performing feature selection and feature dimension reduction on the entity identification data to obtain dimension-reduced feature data; performing cluster analysis on the feature data to obtain classified data of various operation behaviors; and analyzing the classified data with an anomaly detection algorithm to obtain normal data of the user's normal operation behaviors and abnormal data of the user's abnormal operation behaviors. By performing entity extraction, feature selection, feature dimension reduction, cluster analysis and anomaly detection on the user operation behavior data, the invention can effectively detect the abnormal data of the user's abnormal operation behaviors.

Description

User operation behavior data detection method, device and electronic device

Technical Field

The present invention relates to the field of computer technology, and in particular to a method, a device and an electronic device for detecting user operation behavior data.

Background

In the prior art, anomaly detection systems play an important role in discovering violations in a network. Because it is difficult to extract abnormal traffic directly from massive data, existing anomaly detection equipment randomly samples all traffic data and further analyzes the abnormal traffic found in the sample. However, the traffic data of normal user behavior in the network far exceeds the abnormal traffic data, so random sampling misses a large amount of abnormal traffic. In practice, using traditional machine learning, deep learning algorithms or random sampling for anomaly detection mainly has the following problems: parameters are difficult to set, too many assumptions are required, and there are many restrictions on data content.

Summary of the Invention

The present invention provides a method, a device and an electronic device for detecting user operation behavior data, to solve the problems in the prior art that anomaly detection of user behavior misses a large amount of abnormal traffic and that the algorithms adopted for anomaly detection have parameters that are difficult to set, require too many assumptions and place many restrictions on data content, and to realize real-time monitoring of user operation behavior and prediction of possible illegal operations.

The present invention provides a method for detecting user operation behavior data, comprising:

collecting user operation behavior data, where the user operation behavior data is used to analyze whether the user's operation behavior is abnormal;

performing entity extraction on the user operation behavior data to obtain entity identification data, where the entity identification data is used to extract data related to the user's abnormal operation behavior;

performing feature selection and feature dimension reduction on the entity identification data to obtain dimension-reduced feature data, where the feature data is data in which feature extraction and data compression have been achieved through feature selection and feature dimension reduction;

performing cluster analysis on the feature data to obtain classified data of various operation behaviors, where the classified data is used to classify the various operation behaviors of the user;

performing data analysis on the classified data using an anomaly detection algorithm to obtain normal data of the user's normal operation behavior and abnormal data of the user's abnormal operation behavior.

According to the method for detecting user operation behavior data provided by the present invention, collecting the user operation behavior data includes:

collecting the user operation behavior data based on a first database, where the first database stores relational data and log data recording the various operation behaviors of the user;

the user operation behavior data includes one or more of the following: the start/end time of each user operation, the specific steps of the operation, the order of operations, and the final result of the operation.

According to the method for detecting user operation behavior data provided by the present invention, performing entity extraction on the user operation behavior data to obtain the entity identification data includes:

labeling part of the user operation behavior data to serve as training data, and training an entity extraction model using a neural network;

performing entity extraction on the user operation behavior data based on the entity extraction model to obtain the entity identification data; wherein,

the first layer of the entity extraction model is a word embedding layer, which is used to train the input word sequence into word vectors for output;

the second layer of the entity extraction model feeds the word vectors output by the first layer into a BiLSTM layer for training, to learn the relationship between words and output labels; the BiLSTM layer includes a forward LSTM network and a backward LSTM network, which are connected through an output layer;

the third layer of the entity extraction model is an attention model applied to the output sequence of the BiLSTM layer, which is used to handle the labeling problem so that the entity extraction model better focuses on local features and highlights the important role of keywords;

the fourth layer of the entity extraction model is a CRF layer applied after the attention mechanism, which is used to output transition scores between labels through a transition matrix and to obtain the optimal label sequence based on the transition rules of each label and the validity of the label syntax.
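The role of the CRF layer's transition matrix can be illustrated with a minimal Viterbi decoding sketch. This is not the patented model: the label set (`O`, `B-OP`, `I-OP`), the emission scores (standing in for the output of the BiLSTM and attention layers) and the transition scores are all hypothetical values chosen for illustration, and start/end transitions are omitted for brevity.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label index sequence.

    emissions[t][j]   : score of label j at position t (assumed to come from
                        the BiLSTM + attention layers)
    transitions[i][j] : score of moving from label i to label j
                        (the CRF transition matrix)
    """
    num_labels = len(transitions)
    score = list(emissions[0])          # best score ending in each label so far
    backpointers: List[List[int]] = []  # best previous label for each step

    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_labels):
            best_prev = max(range(num_labels),
                            key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


LABELS = ["O", "B-OP", "I-OP"]  # hypothetical label set
FORBIDDEN = -1e4                # large penalty for an invalid transition
TRANSITIONS = [
    [0.0, 0.0, FORBIDDEN],  # from O: "O -> I-OP" is not valid label syntax
    [0.0, 0.0, 0.0],        # from B-OP
    [0.0, 0.0, 0.0],        # from I-OP
]
EMISSIONS = [               # hypothetical per-token label scores
    [2.0, 0.5, 0.1],
    [0.2, 0.5, 1.0],        # taken alone, this token would be labeled I-OP
    [0.1, 0.2, 1.5],
]

best_path = viterbi_decode(EMISSIONS, TRANSITIONS)
decoded_labels = [LABELS[i] for i in best_path]
```

Choosing the highest emission score per token would give `O, I-OP, I-OP`, which is not a valid label sequence. With the transition penalty, decoding instead yields `O, B-OP, I-OP`, which is the kind of label-syntax constraint the CRF layer is described as enforcing.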

According to the method for detecting user operation behavior data provided by the present invention, performing feature selection and feature dimension reduction on the entity identification data to obtain the dimension-reduced feature data includes:

aggregating the entity identification data with data stored in a second database, where the second database stores data on the handling of user services;

processing outliers and duplicate values that appear in the data;

performing feature selection on the processed data, and storing the selected and filtered feature selection data;

calculating, based on the feature selection data, a covariance matrix that characterizes the correlation of the data, and performing eigendecomposition on it to obtain a set of eigenvalues and eigenvectors;

projecting the set of eigenvalues and eigenvectors onto a feature matrix to obtain the dimension-reduced feature data, and storing the feature data.
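The covariance and eigendecomposition steps above correspond to standard principal component analysis, which can be sketched with NumPy as follows. The sketch assumes the usual PCA reading of the projection step (the centered data is projected onto the eigenvectors with the largest eigenvalues); the function name `pca_reduce`, the random input data and the choice of two components are illustrative assumptions, not values from the patent.

```python
import numpy as np


def pca_reduce(X: np.ndarray, n_components: int) -> np.ndarray:
    """Reduce X (samples x features) to n_components dimensions with PCA."""
    X_centered = X - X.mean(axis=0)          # remove the per-feature mean
    cov = np.cov(X_centered, rowvar=False)   # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    components = eigvecs[:, order]           # features x n_components
    return X_centered @ components           # project the data


# Hypothetical data: 200 samples, 5 features, two of them strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.01 * rng.normal(size=200)

Z = pca_reduce(X, n_components=2)
```

The resulting columns of `Z` are uncorrelated and ordered by decreasing variance, so redundant features such as the correlated pair above are compressed into a single direction.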

According to the method for detecting user operation behavior data provided by the present invention, performing cluster analysis on the feature data to obtain the classified data of various operation behaviors includes:

dividing, based on a K-means density clustering algorithm, the set of feature data into objects belonging to different clusters according to feature similarity, including placing data with similar features in the same cluster and placing data with dissimilar features outside the cluster;

performing data analysis based on the density of the feature data distribution to obtain the classified data of various operation behaviors;

the K-means density clustering algorithm presets a threshold before clustering, calculates weights based on the density of the feature data, the average intra-cluster distance and the inter-cluster distance, calculates distances between the feature data using a weighted Euclidean distance, and selects the initial cluster centers from the calculated density, weights and distances of the feature data, thereby obtaining the initial input parameters of the K-means density clustering algorithm.
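This section does not give the exact formulas for the weights or for combining density and distance, so the following is only a simplified sketch of density-aware selection of initial cluster centers. It makes several assumptions: per-feature weights are supplied by the caller rather than derived from intra-cluster and inter-cluster distances, density is the number of neighbors within a preset radius (standing in for the preset threshold), and each further center maximizes density multiplied by the distance to the nearest center already chosen. All function names and data values are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def weighted_distance(a: Point, b: Point, weights: Sequence[float]) -> float:
    """Weighted Euclidean distance between two points."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def local_density(points: Sequence[Point], idx: int, radius: float,
                  weights: Sequence[float]) -> int:
    """Number of other points within `radius` of points[idx]."""
    return sum(
        1 for j, q in enumerate(points)
        if j != idx and weighted_distance(points[idx], q, weights) <= radius
    )


def select_initial_centers(points: Sequence[Point], k: int, radius: float,
                           weights: Sequence[float]) -> List[int]:
    """Pick k point indices that are both dense and far from each other."""
    n = len(points)
    density = [local_density(points, i, radius, weights) for i in range(n)]
    centers = [max(range(n), key=lambda i: density[i])]  # densest point first

    def spread_score(i: int) -> float:
        nearest = min(weighted_distance(points[i], points[c], weights)
                      for c in centers)
        return density[i] * nearest  # isolated points have density 0

    while len(centers) < k:
        centers.append(max(range(n), key=spread_score))
    return centers


# Hypothetical data: two dense groups and one isolated outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (20.0, 20.0)]
centers = select_initial_centers(points, k=2, radius=0.5, weights=(1.0, 1.0))
```

With this data, one center is chosen from each dense group and the isolated point is never selected, because its density is zero. The selected indices can then be passed as the initial centers of an ordinary K-means run.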

According to the method for detecting user operation behavior data provided by the present invention, performing data analysis on the classified data based on an anomaly detection algorithm to obtain the normal data of the user's normal operation behavior and the abnormal data of the user's illegal operation behavior includes:

scoring the classified data for anomalies with three anomaly detection algorithms, namely Isolation Forest, One-Class SVM and Local Outlier Factor, to obtain the corresponding anomaly scores;

weighting and normalizing the anomaly scores output by the three anomaly detection algorithms to obtain a ranking of the anomaly scores for all users;

determining, according to the ranking of the anomaly scores, the normal data of the user's normal operation behavior and the abnormal data of the user's illegal operation behavior.
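The weighting, normalization and ranking steps can be sketched as below. The sketch assumes that each detector's raw scores have already been computed and oriented so that a higher value means more anomalous; the min-max normalization, the weights, the user identifiers and the score values are hypothetical, since this section does not specify them.

```python
from typing import List, Sequence, Tuple


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def rank_users(user_ids: Sequence[str],
               detector_scores: Sequence[Sequence[float]],
               weights: Sequence[float]) -> List[Tuple[str, float]]:
    """Combine per-detector scores into one ranking, most anomalous first."""
    normalized = [min_max(scores) for scores in detector_scores]
    total_weight = sum(weights)
    combined = [
        sum(w * norm[i] for w, norm in zip(weights, normalized)) / total_weight
        for i in range(len(user_ids))
    ]
    return sorted(zip(user_ids, combined), key=lambda pair: pair[1], reverse=True)


# Hypothetical raw scores from the three detectors (higher = more anomalous).
users = ["u1", "u2", "u3", "u4"]
isolation_forest_scores = [0.10, 0.20, 0.90, 0.15]
one_class_svm_scores = [1.0, 1.5, 4.0, 1.2]
lof_scores = [1.0, 1.1, 3.0, 0.9]

ranking = rank_users(
    users,
    [isolation_forest_scores, one_class_svm_scores, lof_scores],
    weights=(0.4, 0.3, 0.3),
)
```

Here `u3` is ranked first because all three detectors give it their highest score. Where the cut between normal and abnormal data falls in the ranking (for example a fixed number of top-ranked users or a score threshold) is a separate choice that the sketch leaves open.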

According to the method for detecting user operation behavior data provided by the present invention, after performing data analysis on the classified data based on an anomaly detection algorithm to obtain the normal data of the user's normal operation behavior and the abnormal data of the user's illegal operation behavior, the method further includes:

if the data is determined to be abnormal data of the user's illegal operation behavior, notifying the system administrator and the relevant technical personnel by email and SMS, and starting a disaster recovery mechanism for part of the abnormal data to resolve the anomaly.

The present invention also provides a device for detecting user operation behavior data, comprising:

a data collection module, configured to collect user operation behavior data, where the user operation behavior data is data describing the various operation behaviors of the user;

an entity extraction module, configured to perform entity extraction on the user operation behavior data to obtain entity identification data, where the entity identification data is data related to abnormal data extracted from the user operation behavior data;

a feature selection module, configured to perform feature selection and feature dimension reduction on the entity identification data to obtain dimension-reduced feature data, where the feature data is data in which feature extraction and data compression have been achieved through feature selection and feature dimension reduction;

a cluster analysis module, configured to perform cluster analysis on the feature data to obtain classified data of various operation behaviors, where the classified data is used to classify the various operation behaviors of the user;

an anomaly detection module, configured to perform data analysis on the classified data using an anomaly detection algorithm to obtain normal data of the user's normal operation behavior and abnormal data of the user's illegal operation behavior.

The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps of any of the above methods for detecting user operation behavior data.

The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any of the above methods for detecting user operation behavior data.

The method, device and electronic device for detecting user operation behavior data provided by the present invention can effectively detect the abnormal data of the user's abnormal operation behavior by performing entity extraction, feature selection, feature dimension reduction, cluster analysis and anomaly detection algorithm analysis on the user operation behavior data.

Description of Drawings

In order to explain the technical solutions of the present invention or of the prior art more clearly, the accompanying drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of the method for detecting user operation behavior data provided by the present invention;

FIG. 2 is a schematic flowchart of the entity extraction step provided by the present invention;

FIG. 3 is a schematic structural diagram of the entity extraction model provided by the present invention;

FIG. 4 is a schematic flowchart of the feature processing step provided by the present invention;

FIG. 5 is a schematic flowchart of the cluster analysis step provided by the present invention;

FIG. 6 is a schematic flowchart of the anomaly scoring step provided by the present invention;

FIG. 7 is a schematic structural diagram of the device for detecting user operation behavior data provided by the present invention;

FIG. 8 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The terms "first", "second" and the like in the description, the claims and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein.

With the arrival of the big data era, the number of Internet users has increased sharply, the amount of data in networks is growing massively, and cheating and violations in network system security are increasing year by year. Network security incidents are becoming more severe, cheating and violations have risen sharply, and discovering possible threats in the network has become increasingly important. By studying and analyzing user operations, users who perform illegal operations can be discovered as early as possible, thereby safeguarding network security and normal system operation. Therefore, discovering illegal operations in user behavior through anomaly detection has become a problem that urgently needs to be solved.

Because network prevention mechanisms still need to be improved, security monitoring and the detection of cheating have become increasingly important. Anomaly detection collects information about user operations and analyzes the collected information to detect whether a violation has occurred. In the prior art, commonly used anomaly detection methods are mainly machine learning and deep learning methods, such as decision trees, random forests, support vector machines (SVM), AdaBoost, GBDT (gradient boosting decision tree) and neural networks.

In the prior art, using traditional machine learning, deep learning algorithms or random sampling for anomaly detection mainly has the following problems in practice:

First, parameters are difficult to set.

Traditional anomaly detection algorithms have difficulty finding optimal parameters, especially proximity-based methods. These algorithms quantify the degree of abnormality through the concept of outlier strength; their time consumption and complexity increase with the number of dimensions, parameter search is difficult, and a great deal of time is spent determining model parameters during modeling.

Second, feature engineering may not be accurate enough, and some algorithms make too many assumptions.

Many methods already exist for analyzing abnormal behavior using user operation logs. However, feature engineering has not been described systematically and in detail, and the statistical features of the relevant categories cannot be used in a single-layer classification model, which limits detection efficiency. Traditional classification algorithms include logistic regression (LR), support vector machines (SVM), naive Bayes (NB), K-nearest neighbors (KNN) and so on. A logistic regression model does not perform well when the feature space is large and is prone to overfitting. When there are many observed variables, the classification efficiency of a support vector machine is not high, and it is difficult to find a suitable kernel function. A naive Bayes model is sensitive to the representation of the input data and requires prior probabilities to be calculated. The K-nearest neighbor model has high time and space complexity, takes a long time to run and is inefficient. In addition, these algorithms cannot achieve low variance and low bias at the same time. For example, naive Bayes is a high-bias, low-variance classifier, whereas the K-nearest neighbor model is a low-bias, high-variance classifier. Therefore, anomaly detection systems based on these traditional machine learning algorithms generally cannot strike a balance between detection rate and false alarm rate.

Third, manual maintenance is costly.

Whether a user's operation is compliant sometimes has to be judged by professionals with experience in this area. Manual operation and maintenance has a high labor cost: the more complex the system, the more manpower is required and the higher the cost, and manual operation and maintenance cannot provide uninterrupted 24-hour anomaly monitoring.

Fourth, there are many restrictions on data content.

Traditional anomaly detection algorithms require already aggregated numerical data for training, but in some systems the stored user log files may consist mostly of unstructured text data. This data contains a large amount of important information, and failing to extract it has a considerable impact on the results. Traditional anomaly detection methods lack such an information extraction step and cannot process this text data.

Therefore, in view of the above problems in the prior art, the present invention provides a method, a device and an electronic device for detecting user operation behavior data. By analyzing the various behavior data of users' front-end operations and combining data mining techniques, user behavior is monitored in real time and possible illegal operations are predicted, so that the abnormal data of the user's abnormal operation behavior can be detected effectively.

The technical terms involved in the present invention are described below:

(1) Information extraction

Information extraction (IE) structures the information contained in text into a table-like form. The input of an information extraction system is raw text, and the output is information points in a fixed format. Information points are extracted from a variety of documents and then integrated in a unified form. This is the main task of information extraction. The benefit of integrating information in a unified form is that it is easy to check and compare. Information extraction does not attempt to understand the whole document; it only analyzes the parts of the document that contain relevant information. Which information is relevant is determined by the domain scope fixed when the system is designed.

Information extraction tasks mainly include entity extraction, relation extraction and so on. Entity extraction, also known as named entity recognition (NER), refers to identifying named references to entities with specific meanings in unstructured text and annotating their categories (such as person names, place names, organization names, monetary amounts, etc.). In terms of specific subcategories, the task of entity recognition is to identify named entities of three major categories (entity, time and number) and seven subcategories (person name, organization name, place name, time, date, currency and percentage) in the text to be processed.

Entity recognition usually involves two tasks: identifying the boundaries of entity words and identifying the categories of entity words. The emphasis differs between Chinese and English. Entity information in English has fairly obvious features, typically a capitalized first letter, so the NER task for English is relatively easy and focuses more on identifying entity categories. Entity recognition in Chinese is more difficult: it must address not only entity categories but also entity boundaries.
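The two sub-tasks, finding entity boundaries and assigning entity categories, can be made concrete with a tagged token sequence. The sketch below assumes the common BIO tagging scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity), which the text above does not prescribe; the tokens, tags and the function name `extract_entities` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def extract_entities(tokens: Sequence[str],
                     tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, category) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if any.
        if current_tokens and current_type is not None:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()                                   # a new entity starts here
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)              # entity continues
        else:
            flush()                                   # boundary reached
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ["Zhang", "Wei", "logged", "in", "from", "Nanjing"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
entities = extract_entities(tokens, tags)
```

In this example the `B-`/`I-` prefixes carry the boundary information (two tokens form one person name) and the suffixes `PER` and `LOC` carry the category.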

(2) Cluster analysis

Cluster analysis is the analytical process of grouping a collection of physical or abstract objects into multiple classes composed of similar objects. It is an important human behavior.

The goal of cluster analysis is to group collected data on the basis of similarity. Clustering has its origins in many fields, including mathematics, computer science, statistics, biology and economics. Many clustering techniques have been developed in different application fields; they are used to describe data, to measure the similarity between different data sources, and to assign data sources to different clusters.

(3) Anomaly detection

In data mining, anomaly detection is the identification of items, events or observations that do not match an expected pattern or the other items in a dataset. Anomalous items often correspond to problems such as bank fraud, structural defects, medical problems and text errors. Anomalies are also known as outliers, novelties, noise, deviations and exceptions.

There are three broad categories of anomaly detection methods. Unsupervised anomaly detection methods assume that most instances in the dataset are normal and detect anomalies in unlabeled test data by looking for the instances that fit the rest of the data least well. Supervised anomaly detection methods require a dataset labeled "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherent class imbalance of anomaly detection). Semi-supervised anomaly detection methods build a model of normal behavior from a given normal training dataset and then test how likely it is that a test instance was generated by the learned model.

The method, device and electronic device for detecting user operation behavior data according to the present invention are described below with reference to FIG. 1 to FIG. 8.

FIG. 1 is a schematic flowchart of the method for detecting user operation behavior data provided by the present invention. As shown in the figure, the method for detecting user operation behavior data includes:

Step 101: collect user operation behavior data, where the user operation behavior data is used to analyze whether the user's operation behavior is abnormal.

Optionally, all user operation behavior data, including data at the front-end system level, can be collected based on a first database (the system database), where the first database stores relational data and log data recording the various operation behaviors of the user.

The user operation behavior data includes, but is not limited to, one or more of the following: the start/end time of each user operation, the specific steps of the operation, the order of operations, and the final result of the operation.

Step 102: perform entity extraction on the user operation behavior data to obtain entity identification data, where the entity identification data is used to extract data related to the user's abnormal operation behavior.

Optionally, entity extraction can be performed on the collected user operation behavior data with the improved LSTM-CRF entity extraction algorithm of the present invention, so that data related to abnormal operation behavior, such as the names of user operation behaviors, can be extracted from a large amount of unstructured text data.

Because the system contains a large amount of user front-end operation log data, a large volume of historical data and only a small amount of abnormal data, some entity extraction work must be done on the operation behaviors recorded in the log data. The present invention uses entity recognition techniques from natural language processing, combined with deep learning algorithms, to extract entity information. A BiLSTM bidirectional recurrent neural network and an attention mechanism are added to the extraction, which moves away from the traditional approach of manually annotating logs, saves a large amount of labor cost and achieves a high accuracy rate.

Step 103: Perform feature selection and feature dimensionality reduction on the entity identification data to obtain dimension-reduced feature data, where the feature data is data on which feature extraction and data compression have been achieved through feature selection and dimensionality reduction.

Optionally, to address the excessively high dimensionality of the entity identification data, feature dimensionality reduction based on PCA (principal component analysis) can be used. This reduces the complexity of the prediction model, lowers the weight of features that matter little to the model, removes missing data, and improves the accuracy of subsequent modeling.

The data collected in the system has many features and may suffer from the "curse of dimensionality": key factors and data are drowned out and cannot be mined, so prediction accuracy hits a bottleneck and is hard to improve further, while the high-dimensional, massive data makes the prediction model ever more complex and slows computation. To address these problems, the present invention applies PCA-based dimensionality reduction to the high-dimensional data, which improves prediction accuracy, reduces model complexity, and achieves feature extraction and data compression.

Step 104: Perform cluster analysis on the feature data to obtain classification data for the various operation behaviors, where the classification data is used to group the user's operation behaviors into categories.

Optionally, the density-based K-means clustering algorithm can be used to divide the set of user operation data into clusters, so that operation behaviors within the same cluster have highly similar features while objects in different clusters differ substantially, continuing until all points have been assigned. Cluster analysis makes it possible not only to delineate and identify sparse and dense regions of the operation data quickly and in real time, but also to discover isolated clusters and isolated points promptly, and thus to uncover the underlying relationships hidden in the data.

Building on the dimension-reduced data from step 103, the present invention performs cluster analysis on the user operation data using an improved K-means algorithm that jointly considers sample density, average intra-cluster distance, and inter-cluster distance. It looks for the correlations inherent in dynamic data and extracts the informational value of dynamic operation data, thereby supporting prediction and decision-making for user operation compliance. Clustering-based analysis of operation data reduces the cost of manually sampling users at random, improves the effectiveness of user data mining, and improves the accuracy of anomaly analysis. It thereby overcomes the fragmented and local nature of traditional anomaly analysis and is expected to become a major direction in user behavior analysis.

Step 105: Analyze the classification data with an anomaly detection algorithm to obtain normal data for the user's normal operation behavior and abnormal data for the user's abnormal operation behavior.

Optionally, user operation behavior can be scored by a weighted fusion of the scores of three anomaly detection algorithms (isolation forest, One-Class SVM, and local outlier factor). This analyzes the clustering results of step 104 in greater depth in order to comprehensively identify and evaluate the abnormal user operations most likely to affect the system.

The main task of anomaly detection is to pick out low-probability abnormal data points from a data set of mostly normal user data. The present invention integrates isolation forest, One-Class SVM, and local outlier factor. Each algorithm produces an anomaly score for every operation behavior, and the three scores are combined by weighted fusion, so the decision no longer depends on a single anomaly detection algorithm, which can greatly improve prediction accuracy and efficiency.
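The weighted fusion described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each detector has already produced one score per operation (higher meaning more anomalous), and the min-max normalization, the weights, the threshold, and the sample scores are all hypothetical choices that the text does not specify.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(iforest, ocsvm, lof, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of three normalized score lists (higher = more anomalous)."""
    normalized = [min_max(s) for s in (iforest, ocsvm, lof)]
    total = sum(weights)
    return [
        sum(w * col[i] for w, col in zip(weights, normalized)) / total
        for i in range(len(iforest))
    ]


# Hypothetical scores for five operations; the last one looks unusual to all detectors.
iforest_scores = [0.10, 0.12, 0.11, 0.09, 0.80]
ocsvm_scores = [1.0, 1.2, 0.9, 1.1, 6.0]
lof_scores = [1.0, 1.1, 1.0, 0.9, 4.5]

fused = fuse_scores(iforest_scores, ocsvm_scores, lof_scores, weights=(0.4, 0.3, 0.3))
THRESHOLD = 0.5  # assumed cut-off, not taken from the source
abnormal = [i for i, s in enumerate(fused) if s > THRESHOLD]
```

Normalizing before fusing keeps one detector's score scale from dominating the others; other normalizations (for example rank-based) would fit the same scheme.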

In summary, in order to analyze the compliance of user operation behavior more accurately and to predict and warn of problems that may arise in the future, the present invention collects the various data generated by front-end user operations and models all user operation behavior data of interest by combining entity recognition, feature selection, feature dimensionality reduction, text cluster analysis, and anomaly detection algorithms, so that abnormal data for abnormal user operation behavior can be detected effectively.

Steps 101 to 105 are described below through specific embodiments.

FIG. 2 is a schematic flowchart of the entity extraction step provided by the present invention, and FIG. 3 is a schematic structural diagram of the entity extraction model provided by the present invention. As shown in FIG. 2 and FIG. 3, in step 102 above, performing entity extraction on the user operation behavior data to obtain entity identification data includes:

Step 201: Annotate part of the user operation behavior data as training data, and train the entity extraction model with a neural network.

In addition to relational data, the system database retains a large number of log files that record the user's operations, so the relevant operation-behavior entity information needs to be extracted from these log files. Extracting it by manual screening or annotation would consume a great deal of labor, and accuracy could not be guaranteed.

The present invention therefore uses entity recognition from natural language processing, combined with a deep learning algorithm, to extract entity information from the log files stored in the system database. Specifically, part of the data is first annotated as training data and used to train the entity extraction model with a neural network, so that the network learns the syntactic and lexical features of the log files; the trained model is then used to make predictions on further data.

Step 202: Based on the entity extraction model, perform entity extraction on the user operation behavior data to obtain entity identification data.

Text in user operation logs often varies in length and contains many irrelevant network terms, which severely degrades the recognition performance of traditional entity extraction models. The present invention therefore optimizes and adjusts the traditional NER (Named Entity Recognition) model as follows:

(1) The first layer of the entity extraction model

The first layer of the entity extraction model is a word embedding layer, which turns the input word sequence into word vectors.

Specifically, the present invention trains word vectors with the CBOW (continuous bag-of-words) model of Word2Vec (word to vector, a family of models for producing word vectors). The CBOW model positions each word in the vector space by analyzing its context, and the word vector of each word is output as the input at the corresponding time step of the next neural network layer.
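To make the role of context in CBOW concrete, the sketch below builds the (context words, target word) pairs that a CBOW-style model would be trained on. It illustrates only the data preparation, not the embedding training itself; the window size and the sample tokens are hypothetical.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip a lone token with no neighbours
            pairs.append((context, target))
    return pairs


# Hypothetical tokenized log entry.
log_tokens = ["user", "opens", "account", "query", "page"]
pairs = cbow_pairs(log_tokens, window=1)
```

In training, the model would average the context word vectors and learn to predict the target word, which is what ties each word's vector to the contexts it appears in.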

(2) The second layer of the entity extraction model

The second layer feeds the word vectors output by the first layer into a BiLSTM (Bi-directional Long Short-Term Memory) layer, which is trained to learn the relationship between words and output labels. The BiLSTM layer comprises a forward LSTM (Long Short-Term Memory) network and a backward LSTM network connected through an output layer. The forward and backward LSTMs each produce a hidden output sequence; these are concatenated to form the complete hidden sequence at each time step, which serves as the input to the next layer. The hidden states produced by the BiLSTM layer form the matrix $H = \{h_1, h_2, \ldots, h_j\}$.

The algorithmic improvement at the BiLSTM layer (shown in FIG. 3) is described below.

Traditional NER models use a unidirectional LSTM, which can only record inputs before time step $t$ and has no access to information from future time steps. When the text is short, the model must make better use of the few features available and take the surrounding context into account in order to capture them effectively.

The bidirectional LSTM (BiLSTM) addresses this problem. A BiLSTM consists of two unidirectional LSTMs, one forward and one backward, connected by an output layer. The forward LSTM takes the data in through the input layer and computes and propagates it in the normal order to produce results at the output layer. The backward LSTM processes the same sequence in reverse order. During training, the network propagates the error back layer by layer to the input layer and updates the parameters of each layer accordingly. Because the bidirectional model considers sequence information from both past and future time steps, it records complete past and future context at every time step, and its predictions remain relatively accurate even when the text is short.
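The forward/backward concatenation can be sketched in NumPy as below. This is an untrained, minimal LSTM cell with random weights, intended only to show how the two passes are run and aligned; the dimensions, the parameter layout, and the helper names (`lstm_pass`, `bilstm`, `make`) are assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_pass(inputs, W, b, hidden):
    """Run one LSTM over `inputs` (T x D); return hidden states (T x hidden)."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    for x in inputs:
        z = W @ np.concatenate([h, x]) + b   # pre-activations of all four gates
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)


def bilstm(inputs, fwd_params, bwd_params, hidden):
    """Concatenate forward and backward hidden states at every time step."""
    forward = lstm_pass(inputs, *fwd_params, hidden)
    # Feed the reversed sequence, then flip back so row t matches time step t.
    backward = lstm_pass(inputs[::-1], *bwd_params, hidden)[::-1]
    return np.concatenate([forward, backward], axis=1)   # T x (2 * hidden)


rng = np.random.default_rng(0)
T, D, HID = 5, 8, 4                      # hypothetical sizes
x = rng.normal(size=(T, D))              # stand-in for T word vectors
make = lambda: (rng.normal(scale=0.1, size=(4 * HID, HID + D)), np.zeros(4 * HID))
H = bilstm(x, make(), make(), HID)
```

Each row of `H` plays the role of one hidden state $h_j$: its first half summarizes the words up to position $j$, its second half the words from position $j$ onward.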

(3) The third layer of the entity extraction model

The third layer adds an attention mechanism (attention model) on top of the output sequence of the BiLSTM layer. It addresses the labeling problem by helping the entity extraction model focus on local features and emphasize the role of keywords: different weights are assigned to the outputs of the BiLSTM layer, and the new output vector is obtained by summing the products of each feature vector and its corresponding weight.

For the model output vector at time $i$, the model uses the attention weight distribution vector to compute a weighted sum of the hidden-layer outputs of the encoded source sequence, giving the source-sequence encoding for the current output:

$$c_i = \sum_{j} a_{ij} h_j$$

Here $c_i$ is the new character feature vector output by the attention mechanism, computed as the sum of the products of each feature vector $h_j$ output by the preceding model and its corresponding weight $a_{ij}$. The weight $a_{ij}$ is computed from the previous character feature vector $c_{i-1}$ and $h_j$ by the two formulas below. The attention layer thus multiplies the outputs at all time steps by their corresponding weights and sums them to form the final output:

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}$$

$$e_{ij} = v_a \tanh(w_a c_{i-1} + w_b h_j)$$

where $v_a$, $w_a$, and $w_b$ are weights.
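A minimal NumPy sketch of this additive attention step is given below, under stated assumptions: the scores $e_{ij}$ are normalized with a softmax to obtain $a_{ij}$ (the usual choice for additive attention), the weights are random rather than learned, and all shapes are hypothetical.

```python
import numpy as np


def additive_attention(H, c_prev, W_a, W_b, v_a):
    """H: (T, n) hidden states h_j; c_prev: (n,) previous context c_{i-1}."""
    # e_ij = v_a . tanh(W_a c_{i-1} + W_b h_j): one score per position j
    e = np.tanh(c_prev @ W_a.T + H @ W_b.T) @ v_a
    a = np.exp(e - e.max())      # subtract the max for numerical stability
    a /= a.sum()                 # softmax normalization (assumed)
    c = a @ H                    # c_i = sum_j a_ij h_j
    return c, a


rng = np.random.default_rng(0)
T, n, k = 6, 8, 5                                # hypothetical sizes
H = rng.normal(size=(T, n))                      # stand-in for BiLSTM outputs
c_prev = np.zeros(n)
W_a = rng.normal(size=(k, n))
W_b = rng.normal(size=(k, n))
v_a = rng.normal(size=k)
c_i, a_i = additive_attention(H, c_prev, W_a, W_b, v_a)
```

The vector `a_i` is the attention distribution over the `T` positions, and `c_i` is the resulting weighted sum of hidden states.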

The attention coefficient $a_{ij}$ mentioned above is also referred to as the perceptron: it measures the relationship between the hidden state $h_j$ generated by the BiLSTM and position $i$ of the output label. The hidden layer contains both global information about the text and local keyword information, and the output state of the current time step is obtained by weighted summation. A linear transformation is then applied so that the result matches the label dimension, and softmax (which converts the network output into probabilities) gives the final output vector. To obtain higher accuracy, the attention model used here is of the additive type.

The attention model introduced by the present invention is widely applicable across deep learning and helps the NER model focus on local features and capture the key points of very short text. With the attention model, the model concentrates on the words near the word being labeled and gives less weight to distant or irrelevant words. The probability distribution values are the attention values that the attention model assigns to each word, and they show which region the attention model is focusing on.

Combined with BiLSTM, the transformation function that synthesizes the intermediate semantics of the whole sentence is:

$$C_i = \sum_{j=1}^{L_x} a_{ij} h_j$$

The current state $C_i$ of the attention model is jointly determined by the length $L_x$ of the input sentence, the attention coefficients $a_{ij}$, and the state $h_j$ of the $j$-th word. The update of the attention model is governed by the attention coefficients: the more attention an output item gives to an input item, the larger the corresponding $a_{ij}$.

(4) The fourth layer of the entity extraction model

Specifically, a CRF is applied after the attention mechanism; Viterbi decoding can then be used to obtain the best label sequence and output the optimal solution.
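Viterbi decoding over CRF-style scores can be sketched as follows. The emission and transition scores here are hypothetical stand-ins for what the preceding layers and a trained CRF would supply; the three labels (0 = O, 1 = B, 2 = I) are likewise assumed for illustration.

```python
def viterbi(emissions, transitions):
    """emissions[t][y]: score of label y at step t; transitions[p][y]: score of p -> y."""
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for emit in emissions[1:]:
        step, ptr = [], []
        for y in range(n_labels):
            # Best previous label for reaching label y at this step.
            best_prev = max(range(n_labels), key=lambda p: score[p] + transitions[p][y])
            step.append(score[best_prev] + transitions[best_prev][y] + emit[y])
            ptr.append(best_prev)
        score = step
        backptr.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda y: score[y])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical scores: labels 0 = O, 1 = B, 2 = I; the transition O -> I is forbidden.
emissions = [[1.0, 0.8, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]
transitions = [[0.0, 0.0, -1e4], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
best_path = viterbi(emissions, transitions)
```

Picking the highest emission at each step independently would give O, I, O, which contains the forbidden O -> I transition; decoding over the whole sequence returns B, I, O instead, which is the kind of label consistency the CRF layer contributes.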

Thus, whereas the prior art uses an LSTM-CRF entity extraction algorithm, the present invention uses a BiLSTM-CRF entity extraction algorithm that improves on LSTM-CRF, where the BiLSTM is built from a bidirectional LSTM network structure. CRF is a commonly used sequence labeling algorithm suitable for tasks such as part-of-speech tagging, word segmentation, and named entity recognition. The BiLSTM+CRF model of the present invention combines the two, so that the model can take the dependencies between neighboring positions in the sequence into account as a CRF does, while also having the feature extraction and fitting capacity of an LSTM.

In summary, the present invention applies entity extraction from natural language processing to the collection of user operation data and adapts the named entity recognition model to the particular characteristics of user log text: starting from the prior-art LSTM-CRF named entity recognition model, the unidirectional LSTM is replaced by a bidirectional LSTM and an attention model is added. The improved model is applied to data-oriented user behavior analysis, performing named entity recognition on the text describing the operation steps of interest. This helps the anomaly detection system mine valuable information efficiently and captures the features of massive log data effectively.

FIG. 4 is a schematic flowchart of the feature processing steps provided by the present invention. As shown in FIG. 4, in step 103 above, performing feature selection and feature dimensionality reduction on the entity identification data to obtain dimension-reduced feature data includes:

Once the entity identification data extracted in step 102 is combined with the discrete data stored in the second database, the combined data may suffer from the "curse of dimensionality". On the one hand, key factors and data are drowned out and cannot be mined, so prediction accuracy hits a bottleneck and is hard to improve further. On the other hand, high-dimensional, massive data makes the prediction model increasingly complex and computation increasingly slow, forcing computing capacity to be expanded continually and wasting computing resources. To keep improving prediction accuracy and reduce model complexity, it is therefore necessary to reduce the dimensionality of the high-dimensional data before constructing the feature vector set. The present invention uses feature selection and feature dimensionality reduction based on principal component analysis (PCA) to achieve feature extraction and data compression, as follows:

Step 401: According to the system database and actual business requirements, consolidate the entity identification data with the data stored in a second database, where the second database stores data on the handling of user services.

Specifically, the two kinds of data, namely the data stored in the system database (the processed entity identification data) and the data stored in the second database (the data on the handling of user services), are loaded and aggregated together.

Step 402: Process outliers and duplicate values in the data.

Specifically, abnormal values in the data are handled first. For example, records whose performance data exceeds the threshold of the normal range are removed by direct deletion. Duplicates in the data are then handled; duplicate values may be caused by the platform program being started repeatedly or by problems during the data loading stage. A merging method can be used: the attribute values of records are compared, and records with equal values are merged into a single record.
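A minimal sketch of step 402 follows, assuming records are plain dictionaries. The field name `latency_ms`, the range bounds, and the sample records are hypothetical; the text does not name the performance fields or thresholds.

```python
def clean_records(records, field, low, high):
    """Drop records whose `field` lies outside [low, high], then merge exact duplicates."""
    in_range = [r for r in records if low <= r[field] <= high]
    seen = set()
    merged = []
    for r in in_range:
        key = tuple(sorted(r.items()))   # records with equal attribute values share a key
        if key not in seen:
            seen.add(key)
            merged.append(r)
    return merged


records = [
    {"user": "u1", "op": "query", "latency_ms": 120},
    {"user": "u1", "op": "query", "latency_ms": 120},     # duplicate of the first record
    {"user": "u2", "op": "update", "latency_ms": 99999},  # outside the normal range
    {"user": "u3", "op": "query", "latency_ms": 80},
]
cleaned = clean_records(records, "latency_ms", 0, 5000)
```

Here two records remain: the out-of-range record is deleted outright and the two identical `u1` records are merged into one.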

Step 403: Perform feature selection on the processed data, and store the selected and filtered feature selection data.

In machine learning, feature selection generally serves two purposes: first, to reduce the number of features and speed up training; second, to reduce noisy features and thereby improve the model's accuracy on the test set. Many feature selection algorithms are in common use, such as the chi-square test and mutual information.

Specifically, for discrete data the selection result is obtained with discrete methods, mainly the chi-square test and mutual information; for continuous data it is obtained with continuous methods, mainly the Pearson correlation coefficient and the Fisher score. The selected and filtered feature data is stored to support further data analysis.
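For the continuous case, a filter based on the Pearson correlation coefficient can be sketched as below. The reference variable `target`, the threshold of 0.5, and the column names are assumptions for illustration; the text does not state which variable the features are correlated against or what cut-off is used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_features(columns, target, min_abs_corr=0.5):
    """Keep the columns whose absolute correlation with `target` reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= min_abs_corr]


columns = {
    "ops_per_hour": [1, 2, 3, 4, 5],      # strongly related to the target
    "constant_flag": [7, 7, 7, 7, 7],     # carries no information
    "noise": [1, -1, 0, -1, 1],           # uncorrelated with the target
}
target = [2, 4, 6, 8, 10]
selected = select_features(columns, target)
```

Only `ops_per_hour` survives the filter; the constant and uncorrelated columns are dropped before dimensionality reduction.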

Step 404: Based on the feature selection data, compute the covariance matrix that characterizes the correlations in the data, and perform eigendecomposition on it to obtain a set of eigenvalues and eigenvectors.

Step 405: Form the feature (projection) matrix from the set of eigenvalues and eigenvectors, project the data onto it to obtain the dimension-reduced feature data, and store the feature data.

Specifically, the dimensionality reduction can be implemented with the principal component analysis (PCA) algorithm, and the stored dimension-reduced feature data can serve as the data basis for a deep learning prediction system and a big data analysis and processing system.

The principal component analysis (PCA) algorithm is as follows:

Principal component analysis is performed on the front-end user operation behavior data to obtain the reduced-dimension principal components. All operation behavior data is arranged into a sample matrix of size $m \times k$:

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mk} \end{pmatrix}$$

Center the sample matrix, where $\bar{x}$ is the mean of the samples:

$$x_i \leftarrow x_i - \bar{x}$$

Compute the variance of the feature data set:

$$\mathrm{Var}(X) = \sum_{i} x_i x_i^{\mathsf T}, \quad x_i \in X$$

where $X$ denotes the set of feature data $x_i$.

Compute the eigenvalues of the covariance matrix, take the eigenvectors corresponding to the $d$ largest eigenvalues, and output the projection matrix. Suppose the transformed coordinate system is $\{w_1, w_2, \ldots, w_d\}$, where the $w_j$ are orthonormal basis vectors and $W = (w_1, w_2, \ldots, w_d)$. After dimensionality reduction, the projection of the feature data $x_i$ in the low-dimensional coordinate system is $z_i = (z_{i1}, z_{i2}, \ldots, z_{id})$. Reconstructing $x_i$ from $z_i$ gives:

$$\hat{x}_i = \sum_{j=1}^{d} z_{ij} w_j$$

The distance between the reconstruction $\hat{x}_i$ and the original $x_i$ is:

$$\sum_{i} \left\| \sum_{j=1}^{d} z_{ij} w_j - x_i \right\|_2^2 = \sum_{i} z_i^{\mathsf T} z_i - 2 \sum_{i} z_i^{\mathsf T} W^{\mathsf T} x_i + \mathrm{const} \propto -\operatorname{tr}\left( W^{\mathsf T} \Big( \sum_{i} x_i x_i^{\mathsf T} \Big) W \right)$$

where $\mathrm{const}$ is a constant and can be ignored.

To achieve dimensionality reduction, the expression above should be minimized. Since $\sum_{i} x_i x_i^{\mathsf T}$ is the covariance matrix, the minimal feature dimension is obtained from:

$$\min_{W} \; -\operatorname{tr}\left( W^{\mathsf T} \Big( \sum_{i} x_i x_i^{\mathsf T} \Big) W \right) \quad \text{s.t.} \quad W^{\mathsf T} W = I$$

Solving this constrained problem yields the principal components after PCA dimensionality reduction.
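The PCA procedure (centering, covariance, eigendecomposition, projection onto the top $d$ eigenvectors) can be sketched in NumPy as follows. The sketch assumes that rows are samples and columns are features, and it divides the covariance by the number of samples, which rescales the eigenvalues but does not change the eigenvectors; the synthetic data is hypothetical.

```python
import numpy as np


def pca(X, d):
    """X: (n_samples, n_features). Returns (Z, W): projections and projection matrix."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = Xc.T @ Xc / Xc.shape[0]            # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]    # indices of the d largest eigenvalues
    W = eigvecs[:, order]                    # columns w_1 .. w_d (orthonormal)
    return Xc @ W, W


# Synthetic data: the second feature is almost a multiple of the first.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.01 * rng.normal(size=200),
    0.01 * rng.normal(size=200),
])
Z, W = pca(X, d=2)
```

Because two of the three features are nearly collinear, almost all of the variance ends up in the first column of `Z`, which is the compression effect the text describes.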

In summary, when the dimensionality of the collected user operation data set is too high, an effective data model cannot be built; at the data presentation layer, large volumes of high-dimensional data cause the computational complexity of data processing algorithms to grow exponentially, or even lead to a dimension explosion, seriously degrading system efficiency. PCA dimensionality reduction is a feature processing and data compression method that lowers the data dimension while retaining as much of the main information in the original data as possible. It retains enough information to distinguish between categories, stores data efficiently, reduces data complexity, and leaves room for the data set to be extended.

Further, the present invention can implement steps 101 to 103 through a combination of modules with different functions. For example, the system may provide the following modules:

A core database, which stores the data collected by the platform and provides the data basis for the other modules. A data preprocessing module, which fills missing values in the raw data, removes redundant data, encodes non-numeric data, and performs normalization and centering to unify the data structure for subsequent computation. A data dimensionality reduction and compression module, which uses principal component analysis (PCA) to reduce the dimensionality and volume of the data and provides data support for the deep learning prediction model. A data feature extraction module, which extracts features according to the criteria appropriate to each data type, capturing the key information in the data and providing the data basis for big data analysis and processing.
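As an illustration of what the data preprocessing module might do to a single numeric column, the sketch below fills missing values, normalizes, and centers. Filling with the column mean and using min-max normalization are assumptions; the text names the operations but not the specific methods.

```python
def preprocess(column):
    """Fill missing values (None) with the column mean, min-max normalize, then center."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    filled = [mean if v is None else v for v in column]
    lo, hi = min(filled), max(filled)
    span = (hi - lo) or 1.0               # avoid dividing by zero for a constant column
    scaled = [(v - lo) / span for v in filled]
    center = sum(scaled) / len(scaled)
    return [v - center for v in scaled]


# Hypothetical column with one missing value.
result = preprocess([10, None, 30, 20])
```

The missing entry is replaced by the mean (20), the column is scaled to the range 0 to 1, and the mean is subtracted, giving values centered on zero as PCA expects.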

The functional modules above are only an example of how the present invention may implement steps 101 to 103; the present invention is not limited to these modules.

FIG. 5 is a schematic flowchart of the cluster analysis step provided by the present invention. As shown in FIG. 5, in step 104 above, performing cluster analysis on the feature data to obtain classification data for the various operation behaviors includes:

Step 501: Based on the density-based K-means clustering algorithm, divide the set of feature data into clusters according to feature similarity, so that data with similar features falls in the same cluster and data with dissimilar features falls outside it.

Optionally, the density-based K-means clustering algorithm presets a threshold before clustering, computes weights from the density of the feature data, the average intra-cluster distance, and the inter-cluster distance, computes distances between feature data points using a weighted Euclidean distance, and selects the initial cluster centers from the computed density, weights, and distances, thereby obtaining the initial input parameters of the algorithm.

Step 502: Perform data analysis based on the density of the feature data distribution to obtain classification data for the various operation behaviors.

The present invention takes the feature data obtained after feature selection as its object of study and uses the density-based K-means clustering algorithm to analyze and mine user operation behavior data, dividing user operations into multiple clusters. Most of these operations are compliant, and cluster analysis finds the clusters formed by compliant operations; data for non-compliant operations tends to lie outside these clusters, so clustering makes it possible to discover non-compliant operations automatically.

The density-based K-means clustering algorithm is described in detail below.

The basic idea of the classical K-means clustering algorithm is as follows. Given the number of clusters $k$, first select $k$ sample points at random from the data set as the initial cluster centers. Then compute the distance from every sample point to each of the $k$ centers and assign each sample to the nearest center, forming $k$ clusters. Next, compute the mean of each cluster to obtain the new cluster centers. Repeat this process until the cluster centers no longer change or the number of iterations reaches a preset value, at which point the algorithm ends.

The K-means algorithm can use the Euclidean distance to measure the distance between samples, calculated as follows:

$$d(x_i, x_j) = \sqrt{\sum_{p=1}^{m} (x_{ip} - x_{jp})^2}$$

where $x_i = \{x_{i1}, x_{i2}, \ldots, x_{im}\}$ and $x_j = \{x_{j1}, x_{j2}, \ldots, x_{jm}\}$ are any two sample points of dimension $m$, and $x_{ip}$ is the value of sample $i$ in the $p$-th dimension.
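The classical procedure and the Euclidean distance described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by the invention; the function names `euclidean` and `kmeans`, the `seed` argument, and the optional `centers` argument are assumptions introduced for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter=100, seed=0, centers=None):
    """Classical K-means on a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Random initial centers unless the caller supplies them (hypothetical option).
    centers = list(centers) if centers else rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest center.
        labels = [min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        if new_centers == centers:  # centers no longer change: stop
            break
        centers = new_centers
    return centers, labels
```

Passing `centers` explicitly is what allows a seeding procedure to replace the random initialization without changing the iteration itself.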

The present invention improves on the classical K-means algorithm described above, as follows.

The classical K-means clustering algorithm has certain limitations. Because its initial cluster centers are chosen at random, the clustering results are unstable, easily fall into a local optimum, and are susceptible to noise points. The user must also set the value of k before clustering, so the algorithm adapts poorly. To address these problems, the present invention proposes an improved K-means algorithm based on distance and weight. The weight combines the sample density, the average intra-cluster distance, and the inter-cluster distance, and the sample distance is a weighted Euclidean distance, which increases the discrimination between data attributes and reduces the influence of outliers. The initial cluster centers are then selected from the computed sample density, sample weight, and distance, which gives the initial input parameters of the K-means clustering algorithm.

The specific steps are as follows:

Step 1: For a given data set $D$, compute the density of every sample in $D$ and the weight $w$ of every sample in $D$. Select the sample with the highest density in $D$ as the first initial cluster center $c_1$ and add it to the set of cluster centers $C$, so that $C = \{c_1\}$. Then delete from $D$ all points whose distance to $c_1$ is less than $\mathrm{MeanDist}(D)$.

The density of a sample is calculated as:

The weight $w$ of each sample is calculated as:

$\mathrm{MeanDist}(D)$ is calculated as:

Step 2: Select the point $x_i$ with the largest value of $\tau_i = \omega_i \cdot d_\omega(x_i, c_1)$ as the second initial cluster center, denoted $c_2$, where $\omega_i$ is the weight of sample $x_i$ and $d_\omega$ is the weighted Euclidean distance. Add $c_2$ to $C$, so that $C = \{c_1, c_2\}$. As in Step 1, delete from $D$ all points whose distance to $c_2$ is less than $\mathrm{MeanDist}(D)$.

Step 3: Select the point $x_{i'}$ with the largest value of $\tau_{i'} = \omega_{i'} \cdot d_\omega(x_{i'}, c_2)$ as the third initial cluster center, denoted $c_3$. Add $c_3$ to $C$, so that $C = \{c_1, c_2, c_3\}$, and delete from $D$ all points whose distance to $c_3$ is less than $\mathrm{MeanDist}(D)$. Repeat this process until $D$ becomes empty. At that point $C = \{c_1, c_2, \ldots, c_k\}$, which gives $k$ initial cluster centers, namely the sample points in $C$.

Step 4: Taking the initial cluster centers and the number of clusters obtained above as input, run K-means clustering on the given data set $D$ until the cluster centers no longer change.

Step 5: Output the final clustering result.
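The seeding in Steps 1 to 3 can be sketched as follows. The formulas for density, weight, and MeanDist(D) are not reproduced in this text, so the sketch substitutes simple stand-ins that are assumptions, not the invention's definitions: MeanDist(D) is taken as the mean of all pairwise distances, the density of a sample as the number of other samples closer than MeanDist(D), the weight as equal to the density, and the distance as the unweighted Euclidean distance. The function names are hypothetical.

```python
import math


def dist(a, b):
    """Unweighted Euclidean distance (stand-in for the weighted distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_dist(points):
    """Assumed MeanDist(D): mean of all pairwise distances."""
    n = len(points)
    total = sum(dist(points[i], points[j]) for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def select_initial_centers(points):
    """Return the initial cluster centers; their count gives k."""
    n = len(points)
    if n < 2:
        return list(points)
    radius = mean_dist(points)
    # Assumed density: number of other samples closer than MeanDist(D).
    density = [sum(1 for j in range(n) if j != i and dist(points[i], points[j]) < radius)
               for i in range(n)]
    weight = density  # assumption: the weight is taken to be the density itself
    remaining = set(range(n))
    # Step 1: the densest sample becomes the first center.
    current = max(remaining, key=lambda i: density[i])
    centers = [points[current]]
    while True:
        # Delete the newest center and every sample closer to it than MeanDist(D).
        remaining = {i for i in remaining
                     if i != current and dist(points[i], points[current]) >= radius}
        if not remaining:
            break
        # Steps 2-3: the next center maximizes tau_i = w_i * d(x_i, c_prev).
        prev = current
        current = max(remaining, key=lambda i: weight[i] * dist(points[i], points[prev]))
        centers.append(points[current])
    return centers
```

Under these assumptions the number of clusters is a by-product of the seeding: it equals the number of centers returned, so no value of k has to be supplied in advance.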

In summary, the analysis of user operation compliance based on the density clustering algorithm can mine regularities in user operation behavior intelligently, avoiding the high cost of manual review and the inability of manual prediction to guarantee accuracy and timeliness. The improved K-means algorithm excludes the influence of isolated points, addresses the poor noise resistance of the classical K-means algorithm and its tendency to fall into local optima, and improves the stability of the algorithm.

FIG. 6 is a schematic flowchart of the anomaly scoring steps provided by the present invention. In step 105 above, performing data analysis on the classification data based on an anomaly detection algorithm to obtain normal data of the user's normal operation behavior and abnormal data of the user's non-compliant operation behavior includes:

Step 601: Score the classification data for anomalies with each of three anomaly detection algorithms, namely isolation forest, One-Class SVM, and local outlier factor, to obtain the corresponding anomaly scores.

Specifically, the cluster analysis of step 104 summarizes the user's various operation behaviors and mines the regularities in them. The present invention analyzes the clustering results of step 104 in more depth and uses them to detect whether a user operation is abnormal. The main task of anomaly detection is to extract low-probability abnormal data points from a normal user data set. These points arise not from random deviation but from entirely different mechanisms, such as faults, threats, and intrusions, and such events occur far less often than normal events. There are many anomaly detection algorithms; although all of them aim to separate normal data from abnormal data as far as possible, their principles differ. The present invention uses three algorithms, isolation forest, One-Class SVM, and local outlier factor, to perform the anomaly detection task.

The three algorithms, isolation forest, One-Class SVM, and local outlier factor, are described in detail below.

(1) Isolation forest

The isolation forest algorithm is an anomaly detection algorithm based on partitioning and ensemble learning. Its design exploits two characteristics of abnormal data: there is far less abnormal data than normal data, and the attribute values of abnormal data differ markedly from those of normal data. The core of the algorithm is to sample at random and construct a number of isolation trees (iTree), which together form an isolation forest (iForest). The main steps in constructing an isolation forest are as follows:

Step 1: From a training set of continuous-valued data, randomly select $m$ sample data points of dimension $n$ as the subsample set $D = \{d_1, d_2, \ldots, d_m\}$, which serves as the root node of the tree.

Step 2: Randomly select a dimension $A$ and a split point $p$, where $p$ lies between the minimum and maximum values of dimension $A$ in the current subsample set.

Step 3: Partition each data point $d_i$ of the subsample set by its value $d_i(A)$ in dimension $A$: if $d_i(A) < p$, assign it to the left subtree; otherwise assign it to the right subtree.

Step 4: Repeat Steps 2 and 3 to keep constructing new left and right subtrees until one of the following conditions holds: ① only one data point, or several identical data points, remains in $D$, so it cannot be divided further; ② the isolation tree reaches the height limit.

Step 5: Repeat the above steps until the number of isolation trees reaches the specified number $N$; these isolation trees form an isolation forest.
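The construction steps above can be sketched in plain Python. This is an illustrative sketch rather than the invention's implementation: the dictionary-based tree layout, the function names, and the height limit of ceil(log2(m)) are assumptions. Scoring is reduced to the mean path length, on the premise that anomalous points are isolated after fewer splits.

```python
import math
import random


def build_tree(points, height, limit, rng):
    """Recursively build one isolation tree as nested dicts (Steps 2-4)."""
    # Stop when the node cannot be split further or the height limit is reached.
    if height >= limit or len(points) <= 1 or all(p == points[0] for p in points):
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))   # random dimension A
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    split = rng.uniform(lo, hi)           # random split point p between min and max
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, height + 1, limit, rng),
            "right": build_tree(right, height + 1, limit, rng)}


def build_forest(data, n_trees, sample_size, seed=0):
    """Build n_trees isolation trees, each on a random subsample (Steps 1 and 5)."""
    rng = random.Random(seed)
    m = min(sample_size, len(data))
    limit = math.ceil(math.log2(m)) if m > 1 else 0  # assumed height limit
    return [build_tree(rng.sample(data, m), 0, limit, rng) for _ in range(n_trees)]


def path_length(tree, point):
    """Number of splits traversed before the point reaches a leaf."""
    depth = 0
    while "dim" in tree:
        tree = tree["left"] if point[tree["dim"]] < tree["split"] else tree["right"]
        depth += 1
    return depth


def mean_path_length(forest, point):
    """Average path length over all trees; shorter suggests more anomalous."""
    return sum(path_length(t, point) for t in forest) / len(forest)
```

A point far from the bulk of the data tends to be separated by the first few random splits, so its mean path length is short compared with that of a point inside a dense region.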

(2) One-Class SVM

One-Class SVM treats the one-class classification problem as a special two-class problem. It converts the classical SVM problem of finding a separating hyperplane with maximum margin in the feature space into the problem of maximizing the margin between the hyperplane and the origin, so the optimization problem becomes:

where $\omega$ is the normal vector of the hyperplane, $i$ is the sample index, $\xi_i$ is a slack variable, $\rho$ is the intercept of the hyperplane, $v \in (0, 1]$ is the preset proportion of negative samples, and $l$ is the total number of samples. The product $vl$ determines the penalty coefficient, and $v$ is an upper bound on the fraction of boundary support vectors and a lower bound on the fraction of all support vectors. Training a One-Class SVM requires only positive samples, which helps achieve a high anomaly recognition rate. The algorithm is therefore mainly used to estimate the distribution of high-dimensional data and is suited to machine learning problems such as training-sample screening and anomaly detection when the numbers of positive and negative training samples are imbalanced.
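For reference, the symbols defined above match the widely used $\nu$-formulation of One-Class SVM. Assuming that this standard formulation is the one intended, the optimization problem can be written as below, where $\Phi$ (a symbol not named in the text) denotes the mapping of a sample $x_i$ into the feature space:

```latex
\min_{\omega,\,\xi,\,\rho}\quad
  \frac{1}{2}\lVert \omega \rVert^{2}
  + \frac{1}{v\,l}\sum_{i=1}^{l}\xi_i
  - \rho
\qquad \text{s.t.}\quad
  \omega \cdot \Phi(x_i) \ge \rho - \xi_i,\quad
  \xi_i \ge 0,\quad i = 1,\ldots,l
```

In this form the slack variables are penalized with coefficient $1/(vl)$, which is how $v$ and $l$ together control the trade-off between a large margin from the origin and the number of training samples allowed to fall on the wrong side of the hyperplane.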

(3) Local outlier factor (LOF)

The LOF algorithm decides whether each point $p$ is an outlier from the density of $p$ and of its neighboring points: the lower the density of $p$, the more likely $p$ is an outlier. Take any point $p$ in the thresholded point cloud. Its $k$-th distance $d_k(p)$ is defined as:

$d_k(p) = d(p, o)$;

where $o$ is the $k$-th nearest point to $p$ and $d(p, o)$ is the distance between point $p$ and point $o$.

Given $d_k(p)$, the $k$-th distance neighborhood of $p$ is defined as the set of all points whose distance to $p$ is less than or equal to $d_k(p)$, that is:

$N_k(p) = \{q \in D \setminus \{p\} \mid d(p, q) \le d_k(p)\}$;

where $N_k(p)$ is the $k$-th distance neighborhood of point $p$, $q$ is a neighbor of point $p$, and $D \setminus \{p\}$ denotes the point cloud without point $p$.

The $k$-th reachability distance from point $p$ to point $o$ is:

$d_r(p, o) = \max\{d_k(o), d(p, o)\}$;

This means that the $k$ points nearest to $o$ all have the same reachability distance to $o$, equal to $d_k(o)$.

According to the above definitions, the local reachability density of point $p$ is expressed as:

$$\mathrm{lrd}_k(p) = \frac{|N_k(p)|}{\sum_{o \in N_k(p)} d_r(p, o)}$$

Taking the ratio of the local reachability density of each point $o$ (a neighbor of point $p$) to the local reachability density of point $p$ gives the following comparison factor, the local outlier factor, which is then used to detect outliers:

$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}$$

The closer this ratio is to 1, the more similar the density of $p$ is to that of its neighbors, and $p$ probably belongs to the same cluster as its neighborhood. The further the ratio falls below 1, the higher the density of $p$ relative to its neighbors, and $p$ is a dense point. The further the ratio rises above 1, the lower the density of $p$ relative to its neighbors, and the more likely $p$ is an outlier. A suitable threshold is therefore chosen by inspecting the LOF values, and the points within the value range are retained as the target point cloud with the outliers removed.
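The LOF definitions above translate directly into a short, brute-force Python sketch. It assumes the standard LOF definitions, distinct points (so that no reachability sum is zero), and $k$ smaller than the number of points; the function name `lof_scores` is hypothetical.

```python
import math


def lof_scores(points, k):
    """Return the local outlier factor of every point (brute force, O(n^2))."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted(d[i][j] for j in range(n) if j != i)
        k_dist.append(others[k - 1])  # k-th distance d_k(p)
        # N_k(p): every other point within the k-th distance, ties included.
        neigh.append([j for j in range(n) if j != i and d[i][j] <= k_dist[i]])

    def reach(i, o):
        """Reachability distance d_r(p, o) = max(d_k(o), d(p, o))."""
        return max(k_dist[o], d[i][o])

    # Local reachability density of each point.
    lrd = [len(neigh[i]) / sum(reach(i, o) for o in neigh[i]) for i in range(n)]
    # LOF: mean ratio of the neighbors' densities to the point's own density.
    return [sum(lrd[o] for o in neigh[i]) / (len(neigh[i]) * lrd[i]) for i in range(n)]
```

On a regular grid with one distant point, the interior grid points score close to 1 while the distant point scores far above 1, which matches the interpretation of the ratio given above.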

Step 602: Weight and normalize the anomaly scores output by the three anomaly detection algorithms to obtain a ranking of the anomaly scores for all users.

Specifically, for different data sources it is hard to know in advance which type of anomaly detection algorithm will give the best result, so an ensemble of the three algorithms, isolation forest, One-Class SVM, and local outlier factor, is used to comprehensively identify and evaluate the abnormal users most likely to affect the system. Running the three algorithms yields three separate anomaly scores for every user. Weighting and normalizing these three results gives the final anomaly score ranking for all users.

Each algorithm computes an independent anomaly score for user $i$. Denote the results of isolation forest, One-Class SVM, and local outlier factor as $S_1$, $S_2$, and $S_3$, with corresponding weights $P_1$, $P_2$, and $P_3$. The final anomaly score Score is:

Step 603: According to the ranking of the anomaly scores, determine the normal data of the user's normal operation behavior and the abnormal data of the user's non-compliant operation behavior.

Ranking by the final anomaly score Score thus makes it possible to comprehensively identify and evaluate the abnormal user operations most likely to affect the system.
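The weighting and ranking in Steps 602 and 603 can be sketched as follows. The exact scoring formula is not reproduced in this text, so the sketch assumes a weighted sum of min-max normalized scores, with higher values meaning more anomalous; the function names and the default equal weights are illustrative assumptions.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]


def fuse_scores(s1, s2, s3, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed fusion: weighted sum of three normalized score lists, one entry per user."""
    normalized = [min_max(s) for s in (s1, s2, s3)]
    return [sum(w * col[i] for w, col in zip(weights, normalized))
            for i in range(len(s1))]


def rank_users(users, final_scores):
    """Users sorted from most to least anomalous."""
    ordered = sorted(zip(users, final_scores), key=lambda t: t[1], reverse=True)
    return [u for u, _ in ordered]
```

Normalizing before weighting keeps one algorithm's raw score range from dominating the sum; if an algorithm reports lower values for more anomalous points, its scores would need to be negated first.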

In summary, user behavior analysis based on anomaly detection predicts a user operation compliance score through weighted fusion of three anomaly detection algorithms. This ensemble approach comprehensively identifies and evaluates the abnormal users most likely to affect the system and separates normal data from abnormal data as accurately as possible.

Further, in step 105 above, after performing data analysis on the classification data based on the anomaly detection algorithm to obtain the normal data of the user's normal operation behavior and the abnormal data of the user's non-compliant operation behavior, the method further includes:

Specifically, the result predicted in step 603 is evaluated. If an abnormal user operation is predicted, the system administrator and the relevant technical personnel are notified by email and SMS. In addition, to reduce the recurrence of such non-compliant operations, the detailed values of each indicator in the alert data are analyzed. For example, if an operation occurs too many times or lasts too long, that operation may have too many defects. Based on the analysis of the indicator data, a disaster recovery mechanism is started for some abnormal situations, for example by automatically starting containerized services on the platform's standby nodes.

The present invention can thus raise alerts for possible abnormal situations and, for some abnormal scenarios, enable the disaster recovery mechanism to try to resolve the anomaly, reduce its impact on the user experience, or give operations staff more time to locate and solve the problem.
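The email side of the alerting step could look like the sketch below, which only composes the message and leaves delivery out. It is an assumption-laden illustration: the score threshold, the `build_alert` name, the subject wording, and the recipient address in the usage are all hypothetical, and the text itself derives abnormal users from the score ranking rather than from a fixed threshold.

```python
from email.message import EmailMessage


def build_alert(user_scores, threshold, recipients):
    """Return an EmailMessage listing users at or above the threshold, or None."""
    flagged = sorted((u for u, s in user_scores.items() if s >= threshold),
                     key=lambda u: user_scores[u], reverse=True)
    if not flagged:
        return None  # nothing abnormal, so no alert is produced
    msg = EmailMessage()
    msg["Subject"] = f"Abnormal user operations detected: {len(flagged)}"
    msg["To"] = ", ".join(recipients)
    # One line per flagged user, most anomalous first.
    msg.set_content("\n".join(f"{u}: score {user_scores[u]:.2f}" for u in flagged))
    return msg
```

For example, `build_alert({"u1": 0.2, "u2": 0.95}, 0.8, ["admin@example.com"])` returns a message that lists only `u2`; sending it, sending the SMS, and starting the disaster recovery mechanism depend on the deployment and are not shown.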

The apparatus for detecting user operation behavior data provided by the present invention is described below. The apparatus described below and the method for detecting user operation behavior data described above may be read in correspondence with each other.

FIG. 7 is a schematic structural diagram of the apparatus for detecting user operation behavior data provided by the present invention. An apparatus 700 for detecting user operation behavior data includes a data collection module 710, an entity extraction module 720, a feature selection module 730, a cluster analysis module 740, and an anomaly detection module 750, wherein:

the data collection module 710 is configured to collect user operation behavior data, the user operation behavior data being data that describes the user's various operation behaviors;

the entity extraction module 720 is configured to perform entity extraction on the user operation behavior data to obtain entity identification data, the entity identification data being data related to abnormal data that is extracted from the user operation behavior data;

the feature selection module 730 is configured to perform feature selection and feature dimensionality reduction on the entity identification data to obtain dimensionality-reduced feature data, the feature data being data in which feature extraction and data compression have been achieved through feature selection and feature dimensionality reduction;

the cluster analysis module 740 is configured to perform cluster analysis on the feature data to obtain classification data of the various operation behaviors, the classification data being used to classify the user's various operation behaviors;

the anomaly detection module 750 is configured to perform data analysis on the classification data with an anomaly detection algorithm to obtain normal data of the user's normal operation behavior and abnormal data of the user's non-compliant operation behavior.

Optionally, the data collection module 710 collects the user operation behavior data from a first database, which stores relational data and log data recording the user's various operation behaviors. The user operation behavior data includes one or more of the following: the start and end times of the user's operations, the specific steps of each operation, the order of operations, and the final result of each operation.

Optionally, the entity extraction module 720 is further configured to perform the following steps:

Optionally, the feature selection module 730 is further configured to perform the following steps:

Optionally, the cluster analysis module 740 is further configured to perform the following steps:

Optionally, the anomaly detection module 750 is further configured to perform the following steps:

Further, the apparatus 700 for detecting user operation behavior data also includes a system alarm module (not shown in the figure).

The alarm module is configured to, if abnormal data of non-compliant user operation behavior is identified, notify the system administrator and the relevant technical personnel by email and SMS, and to start a disaster recovery mechanism for some of the abnormal data in order to resolve the anomaly.

FIG. 8 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 8, the electronic device may include a processor 810, a communications interface 820, a memory 830, and a communication bus 840, where the processor 810, the communications interface 820, and the memory 830 communicate with one another over the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to execute the method for detecting user operation behavior data, the method comprising:

In addition, when the logic instructions in the memory 830 are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

In another aspect, the present invention further provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium. The computer program comprises program instructions which, when executed by a computer, enable the computer to perform the method for detecting user operation behavior data provided above, the method comprising:

performing data analysis on the classification data with an anomaly detection algorithm to obtain normal data of the user's normal operation behavior and abnormal data of the user's abnormal operation behavior.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the method for detecting user operation behavior data provided above, the method comprising:

The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

From the description of the embodiments above, a person skilled in the art will clearly understand that each embodiment can be implemented by software together with the necessary general-purpose hardware platform, or alternatively by hardware. On this understanding, the above technical solutions in essence, or the part that contributes to the prior art, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, and some of their technical features may be replaced by equivalents, without causing the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for detecting user operation behavior data, characterized by comprising:

collecting user operation behavior data, the user operation behavior data being used to analyze whether a user's operation behavior is abnormal;

performing entity extraction on the user operation behavior data to obtain entity identification data, the entity identification data being used to extract data related to the user's abnormal operation behavior;

performing feature selection and feature dimensionality reduction on the entity identification data to obtain dimensionality-reduced feature data, the feature data being data in which feature extraction and data compression have been achieved through feature selection and feature dimensionality reduction;

performing cluster analysis on the feature data to obtain classification data of various operation behaviors, the classification data being used to classify the user's various operation behaviors;

performing data analysis on the classification data with an anomaly detection algorithm to obtain normal data of the user's normal operation behavior and abnormal data of the user's abnormal operation behavior.

2. The method for detecting user operation behavior data according to claim 1, wherein collecting the user operation behavior data comprises:

collecting the user operation behavior data from a first database, the first database storing relational data and log data recording the user's various operation behaviors;

wherein the user operation behavior data includes one or more of the following: the start and end times of the user's operations, the specific steps of each operation, the order of operations, and the final result of each operation.

3. The method for detecting user operation behavior data according to claim 1, wherein performing entity extraction on the user operation behavior data to obtain entity identification data comprises:

labeling part of the user operation behavior data as training data and training an entity extraction model with a neural network;

performing entity extraction on the user operation behavior data based on the entity extraction model to obtain the entity identification data; wherein,

the first layer of the entity extraction model is a word embedding layer, used to convert the input word sequence into word vectors as output;

the second layer of the entity extraction model is a BiLSTM layer, which takes the word vectors output by the first layer as input and is trained to learn the relationship between words and output labels, the BiLSTM layer comprising a forward LSTM network and a backward LSTM network that are connected through an output layer;

the third layer of the entity extraction model is an attention model applied to the output sequence of the BiLSTM layer, used in the labeling task so that the entity extraction model focuses better on local features and emphasizes the role of keywords;

the fourth layer of the entity extraction model is a CRF layer placed after the attention mechanism, used to output the transition scores between labels through a transition matrix and to obtain the best label sequence based on the transition rules of each label and the validity of the label syntax.

4. The method for detecting user operation behavior data according to claim 1, wherein performing feature selection and feature dimensionality reduction on the entity identification data to obtain the dimensionality-reduced feature data comprises:

aggregating the entity identification data with data stored in a second database, the second database storing data on the handling of user services;

handling outliers and duplicate values in the data;

performing feature selection on the processed data and storing the selected feature selection data;

computing a covariance matrix representing data correlation from the feature selection data and performing eigendecomposition on it to obtain a set of eigenvalues and eigenvectors;

projecting the set of eigenvalues and eigenvectors onto a feature matrix to obtain the dimensionality-reduced feature data, and storing the feature data.

5. The method for detecting user operation behavior data according to claim 1, wherein performing cluster analysis on the feature data to obtain classification data of various operation behaviors comprises:

based on the K-means density clustering algorithm, dividing the set of feature data into objects belonging to different clusters according to feature similarity, including assigning data with similar features to the same cluster and assigning data with dissimilar features outside that cluster;

performing data analysis based on the density of the feature data distribution to obtain classification data of various operation behaviors;

wherein the K-means density clustering algorithm presets a threshold before clustering, calculates weights based on the density of the feature data, the average distance within a cluster and the distance between clusters, and uses the weighted Euclidean distance to calculate the distance between feature data; the initial cluster centers are selected by calculating the density, weight and distance of the feature data, thereby obtaining the initial input parameters of the K-means density clustering algorithm.
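
The claim does not give the exact weighting formulas, so the following is only a simplified sketch of the general idea: density is counted within a preset radius, initial centers are chosen as dense points far from the centers already chosen, and the clustering itself uses a weighted Euclidean distance. The function names, the uniform feature weights `w`, the radius value and the selection rule (density multiplied by distance) are all assumptions, not the claimed method.

```python
import numpy as np

def weighted_dist(A, B, w):
    """Weighted Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((w * diff ** 2).sum(axis=2))

def density_init(X, k, radius, w):
    """Choose k initial centers: dense points far from centers already chosen."""
    D = weighted_dist(X, X, w)
    density = (D < radius).sum(axis=1)        # neighbours within the preset threshold
    chosen = [int(np.argmax(density))]
    while len(chosen) < k:
        nearest = D[:, chosen].min(axis=1)    # distance to the closest chosen center
        chosen.append(int(np.argmax(density * nearest)))
    return X[chosen].copy()

def kmeans(X, centers, w, n_iter=50):
    """Standard K-means iterations using the weighted distance."""
    for _ in range(n_iter):
        labels = weighted_dist(X, centers, w).argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = weighted_dist(X, centers, w).argmin(axis=1)
    return labels, centers

# Three synthetic, well-separated groups standing in for feature data (hypothetical).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
w = np.ones(2)                                # uniform feature weights (assumption)
centers0 = density_init(X, k=3, radius=1.0, w=w)
labels, centers = kmeans(X, centers0, w)
```

Seeding from dense, mutually distant points avoids the sensitivity to random initial centers that plain K-means has, which appears to be the motivation for the density-based variant.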

6. The method for detecting user operation behavior data according to claim 1, wherein performing data analysis on the classified data based on an anomaly detection algorithm to obtain normal data of the user's normal operation behavior and abnormal data of the user's illegal operation behavior comprises:

using three anomaly detection algorithms, namely Isolation Forest, One-Class SVM and Local Outlier Factor, to separately score the classified data for anomalies and obtain the corresponding anomaly score values;

weighting and normalizing the anomaly score values output by the three anomaly detection algorithms to obtain a ranking of the anomaly score values for all users;

determining, according to the ranking of the anomaly score values, the normal data of the user's normal operation behavior and the abnormal data of the user's illegal operation behavior.
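
The weighting, normalization and ranking steps can be sketched as follows. The sketch assumes that each of the three detectors has already produced one score per user and that a higher score means more anomalous; the equal weights, the min-max normalization, the cut-off fraction and the example scores are hypothetical, since the claim does not specify them.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to the range [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def fuse_scores(iforest, ocsvm, lof, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three per-user anomaly scores (higher = more anomalous)."""
    stacked = np.vstack([min_max(iforest), min_max(ocsvm), min_max(lof)])
    w = np.asarray(weights, dtype=float)
    combined = w @ stacked / w.sum()          # weighted average per user
    ranking = np.argsort(-combined)           # most anomalous user first
    return combined, ranking

def split_by_rank(ranking, top_fraction):
    """Treat the top-ranked fraction of users as abnormal, the rest as normal."""
    n_abnormal = max(1, int(round(len(ranking) * top_fraction)))
    return ranking[:n_abnormal], ranking[n_abnormal:]

# Hypothetical scores for five users from the three detectors.
iforest_scores = [0.10, 0.20, 0.90, 0.15, 0.30]
ocsvm_scores = [1.0, 2.0, 8.0, 1.5, 2.5]
lof_scores = [1.0, 1.1, 3.0, 1.05, 1.2]
combined, ranking = fuse_scores(iforest_scores, ocsvm_scores, lof_scores)
abnormal, normal = split_by_rank(ranking, top_fraction=0.2)
```

Normalizing before weighting matters because the three detectors produce scores on different scales; without it, the detector with the widest range would dominate the ranking.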

7. The method for detecting user operation behavior data according to claim 1, wherein, after performing data analysis on the classified data based on an anomaly detection algorithm to obtain normal data of the user's normal operation behavior and abnormal data of the user's illegal operation behavior, the method further comprises:

if the data is determined to be abnormal data of the user's illegal operation behavior, notifying the system administrator and relevant technical personnel by email and text message, and activating a disaster recovery mechanism for some of the abnormal data to resolve the abnormality.

8. A detection device for user operation behavior data, comprising:

a data collection module for collecting user operation behavior data, where the user operation behavior data is data describing various user operation behaviors;

an entity extraction module, configured to perform entity extraction on the user operation behavior data to obtain entity identification data, where the entity identification data is data related to abnormal data extracted from the user operation behavior data;

a feature selection module, configured to perform feature selection and feature dimensionality reduction on the entity identification data to obtain dimensionality-reduced feature data, where the feature data is data obtained through feature selection and feature dimensionality reduction so as to realize feature extraction and data compression;

a cluster analysis module, configured to perform cluster analysis on the feature data to obtain classification data of various operation behaviors, and the classification data is used to classify various operation behaviors of the user;

an anomaly detection module, configured to perform data analysis on the classified data by using an anomaly detection algorithm to obtain normal data of the user's normal operation behavior and abnormal data of the user's illegal operation behavior.

9. An electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method for detecting user operation behavior data according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for detecting user operation behavior data according to any one of claims 1 to 7.