CN115048276B - Log grouping method and device and electronic equipment

Info

Publication number: CN115048276B
Application number: CN202210553506.XA
Authority: CN
Inventors: 吴雷; 谭佐艳; 罗聪; 陈川; 辛晨; 李忆蕾
Original assignee: China Telecom Cloud Technology Co Ltd
Current assignee: China Telecom Cloud Technology Co Ltd
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2025-08-15
Anticipated expiration: 2042-05-20
Also published as: CN115048276A

Abstract

The application relates to a log grouping method, a log grouping device, and electronic equipment, which address the problem that existing log grouping does not suit real log analysis scenarios and therefore yields inaccurate log analysis results. The method comprises: determining a first target field sequence and a second target field sequence corresponding to a log to be grouped, wherein the first target field sequence comprises at least one numerical value type target field and the second target field sequence comprises at least one text type target field; calculating a first distance between the first target field sequence and a first reference field sequence corresponding to a first log group, and a second distance between the second target field sequence and a second reference field sequence corresponding to the first log group; and adding the log to be grouped into the first log group if the sum of the first distance and the second distance is smaller than or equal to a preset threshold value. The method helps improve the accuracy of the log analysis result.

Description

Log grouping method and device and electronic equipment

Technical Field

The present application relates to the field of information security technologies, and in particular, to a method and an apparatus for log grouping, and an electronic device.

Background

In network security operation and maintenance, operations staff typically discover security risks such as network attacks and vulnerabilities by analyzing logs, and then adopt corresponding security policies to protect against the discovered risks.

However, faced with massive volumes of logs, operators can hardly complete manual analysis in a short time. For this reason, the prior art generally adopts log text clustering: logs are divided into groups according to the similarity of the text data they contain, and analysis is then performed on the resulting groups, which improves the efficiency of manual analysis.

A log is typically heterogeneous data; in other words, a log may contain text data as well as other types of data. The prior art groups logs by their text data only and thus considers a single type of data, so the resulting groups do not suit real log analysis scenarios and the log analysis results are inaccurate.

Disclosure of Invention

The application provides a log grouping method, a log grouping device and electronic equipment, which are used for improving the accuracy of log grouping and thereby the accuracy of log analysis results.

In a first aspect, the present application provides a method of log grouping, the method comprising:

determining a first target field sequence and a second target field sequence corresponding to a log to be grouped, wherein the first target field sequence comprises at least one numerical value type target field, and the second target field sequence comprises at least one text type target field;

calculating a first distance between the first target field sequence and a first reference field sequence corresponding to a first log group, wherein the first reference field sequence comprises at least one numerical value class reference field;

calculating a second distance between the second target field sequence and a second reference field sequence corresponding to the first log group, wherein the second reference field sequence comprises at least one text type reference field;

and if the sum of the first distance and the second distance is smaller than or equal to a preset threshold value, adding the log to be grouped into the first log group.

By the method, logs are grouped based on the different types of fields they contain, which effectively improves the accuracy of the grouping result and, in turn, the efficiency of subsequent log analysis and the accuracy of the analysis results. On the one hand, the distances for the numerical value fields and the text fields are calculated separately, which fits actual application scenarios more closely; the method further allows the weights used in the distance calculation to be adjusted dynamically, so that more accurate log groups are obtained and the calculated distances are easier to interpret. On the other hand, the first reference field sequence and the second reference field sequence represent, respectively, the numerical value type fields and the text type fields of the logs in the first log group, which effectively improves processing performance in online processing scenarios and saves calculation time and calculation resources.

In one possible design, the determining the first target field sequence and the second target field sequence corresponding to the log to be grouped includes determining a first value class field corresponding to each classification field in the log to be grouped according to a mapping relation between the classification field and the value class field, taking each value class field in the log to be grouped and each determined first value class field as a value class target field forming the first target field sequence, and taking each text class field in the log to be grouped as a text class target field forming the second target field sequence.

By the method, taking the classification fields of the log into account, mapping each classification field to a numerical value class field effectively reduces the number of field types for which a distance must be calculated, and also helps reduce the time consumed by the subsequent distance calculation and improve its efficiency.

In one possible design, the calculating the first distance between the first target field sequence and the first reference field sequence corresponding to the first log group includes: determining an attribute corresponding to each value class target field in the first target field sequence and an attribute corresponding to each value class reference field in the first reference field sequence corresponding to the first log group; calculating a field distance between the value class target field and the value class reference field corresponding to the same attribute, to obtain a plurality of field distances between the value class target fields and the value class reference fields; and taking the sum of the plurality of field distances as the first distance between the first target field sequence and the first reference field sequence.

By the method, the first distance is obtained by calculating the field distances between the numerical value class target fields and the numerical value class reference fields corresponding to the same attribute one by one, and the calculated first distance is more fit for the actual application scene and has better interpretation significance.

In one possible design, the calculating the second distance between the second target field sequence and the second reference field sequence corresponding to the first log group includes: determining an attribute corresponding to each text class target field in the second target field sequence and an attribute corresponding to each text class reference field in the second reference field sequence corresponding to the first log group; calculating a minimum edit distance between the text class target field and the text class reference field corresponding to the same attribute, to obtain a plurality of minimum edit distances between the text class target fields and the text class reference fields; and performing a weighted summation of the plurality of minimum edit distances to obtain the second distance between the second target field sequence and the second reference field sequence.

By the method, in some scenes, a plurality of editing distances between the text class reference field and the text class target field corresponding to the same attribute can be obtained through calculation, the minimum editing distance is selected from the plurality of editing distances, the minimum editing distances corresponding to different attributes are calculated one by one in the mode, the minimum editing distances are weighted and summed to obtain a second distance, and the method can be more fit with the actual application scene.

In one possible design, after the adding the log to be grouped to the first log group if the sum of the first distance and the second distance is less than or equal to a preset threshold, the method further comprises creating a second log group if the sum of the first distance and the second distance is greater than the preset threshold and no other log group exists, and adding the log to be grouped to the second log group.

The method can be suitable for an online scene, namely, the self-adaptive dynamic grouping is realized, and further, the logs are added into the log grouping meeting the condition.
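The threshold test and the creation of a new group can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `assign_to_group`, the callable `distance_sum` (standing in for the sum of the first and second distances), and the toy one-field distance are all hypothetical, and trying the existing groups in order is an assumption.

```python
from typing import Callable, List, Sequence


def assign_to_group(
    log: dict,
    groups: List[List[dict]],
    distance_sum: Callable[[dict, Sequence[dict]], float],
    threshold: float,
) -> int:
    """Add `log` to the first group within the threshold, else open a new group.

    Returns the index of the group that received the log.
    """
    for index, group in enumerate(groups):
        # Sum of first and second distance, delegated to the caller (assumption).
        if distance_sum(log, group) <= threshold:
            group.append(log)
            return index
    # No existing group accepted the log: create a new group for it.
    groups.append([log])
    return len(groups) - 1


def toy_distance_sum(log: dict, group: Sequence[dict]) -> float:
    """Hypothetical stand-in: distance of one numeric field to the group mean."""
    center = sum(member["x"] for member in group) / len(group)
    return abs(log["x"] - center)


groups: List[List[dict]] = []
indices = [
    assign_to_group({"x": x}, groups, toy_distance_sum, threshold=1.0)
    for x in (1.0, 1.5, 10.0)
]
```

In this toy run the first two logs share a group and the third opens a second group, which mirrors the adaptive, online behaviour described above.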

In one possible design, the adding the log to be grouped to the first log group includes determining a first label corresponding to the first log group and adding the first label to the log to be grouped.

By the method, the method can be suitable for an online scene, the self-adaptive dynamic grouping is realized, the logs are added to the log grouping meeting the conditions, and the labels corresponding to the grouping are also added to the log, so that the relevance among the log grouping is established, and the method is more suitable for an actual application scene, particularly some complex log analysis scenes.

In one possible design, after adding the first label to the log to be grouped, the method further comprises: in response to removal of the first label from the log to be grouped in the first log group, calculating, for each log in the first log group, the sum value of the first distance and the second distance between that log and the first reference field sequence and the second reference field sequence, respectively; determining, among the other logs of the first log group, each associated log whose calculated sum value is greater than the calculated sum value of the log to be grouped; and removing the first label of each associated log, wherein the other logs are the logs in the first log group other than the log to be grouped.

By the method, when the label of a certain log is deleted, the other logs in the log group corresponding to that label are determined, the distance between the fields of each of those logs and the reference fields of the log group is calculated, and the label is also deleted from every log whose calculated distance is greater than the distance between the deleted log and the reference fields. Because this label deletion relies on the labels added to the logs, it reflects both the relevance of logs in different groups and the relevance of logs in the same group: when a certain log is deleted, its label can be deleted, and other similar logs are deleted accordingly based on that label, which improves the efficiency of log analysis and the accuracy of the analysis results.
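The label removal rule can be sketched as follows, assuming the per-log distance sums for the group have already been calculated. The function name `remove_label_cascade`, the log identifiers, and the dictionary layout are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def remove_label_cascade(
    labels: Dict[str, Set[str]],
    distance_sums: Dict[str, float],
    target_id: str,
    label: str,
) -> List[str]:
    """Remove `label` from the target log and from farther logs in its group.

    `distance_sums` holds, for each log in the group, the sum of its first and
    second distance to the group's reference field sequences.
    Returns the identifiers of the logs whose label was removed.
    """
    target_sum = distance_sums[target_id]
    labels[target_id].discard(label)
    removed = [target_id]
    for log_id, total in distance_sums.items():
        # Only other logs that lie farther from the reference than the target.
        if log_id != target_id and total > target_sum:
            labels[log_id].discard(label)
            removed.append(log_id)
    return removed


labels = {"a": {"g1"}, "b": {"g1"}, "c": {"g1", "g2"}}
distance_sums = {"a": 0.2, "b": 0.5, "c": 0.9}
removed = remove_label_cascade(labels, distance_sums, target_id="b", label="g1")
```

Here log "a" is closer to the reference than "b" and keeps its label, while "c" is farther and loses label "g1" but keeps its label for the other group.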

In a second aspect, the present application provides an apparatus for log grouping, the apparatus comprising:

a determining module, configured to determine a first target field sequence and a second target field sequence corresponding to a log to be grouped, wherein the first target field sequence comprises at least one numerical value type target field, and the second target field sequence comprises at least one text type target field;

a first calculation module, configured to calculate a first distance between the first target field sequence and a first reference field sequence corresponding to a first log group, wherein the first reference field sequence comprises at least one numerical value class reference field;

a second calculation module, configured to calculate a second distance between the second target field sequence and a second reference field sequence corresponding to the first log group, wherein the second reference field sequence comprises at least one text class reference field;

and a grouping module, configured to add the log to be grouped into the first log group if the sum of the first distance and the second distance is smaller than or equal to a preset threshold value.

In one possible design, the determining module is specifically configured to determine, according to a mapping relationship between the classification field and the value class field, a first value class field corresponding to each classification field in the log to be grouped, take each value class field in the log to be grouped and each determined first value class field as a value class target field forming a first target field sequence, and take each text class field in the log to be grouped as a text class target field forming a second target field sequence.

In one possible design, the first calculation module is specifically configured to determine an attribute corresponding to each value class target field in the first target field sequence and an attribute corresponding to each value class reference field in the first reference field sequence corresponding to the first log group, calculate a field distance between the value class target field and the value class reference field corresponding to the same attribute to obtain a plurality of field distances between the value class target fields and the value class reference fields, and use the sum of the plurality of field distances as the first distance between the first target field sequence and the first reference field sequence.

In one possible design, the second calculation module is specifically configured to determine an attribute corresponding to each text class target field in the second target field sequence and an attribute corresponding to each text class reference field in the second reference field sequence corresponding to the first log group, calculate a minimum edit distance between the text class target field and the text class reference field corresponding to the same attribute to obtain a plurality of minimum edit distances between the text class target fields and the text class reference fields, and perform a weighted summation of the plurality of minimum edit distances to obtain the second distance between the second target field sequence and the second reference field sequence.

In one possible design, the grouping module is further configured to create a second log group and add the log to be grouped to the second log group if the sum of the first distance and the second distance is greater than the preset threshold and no other log group exists.

In one possible design, the grouping module is specifically configured to determine a first label corresponding to the first log group, and add the first label to the log to be grouped.

In one possible design, the grouping module is further configured to: in response to removal of the first label from the log to be grouped in the first log group, calculate, for each log in the first log group, the sum value of the first distance and the second distance between that log and the first reference field sequence and the second reference field sequence, respectively; determine, among the other logs in the first log group, each associated log whose calculated sum value is greater than the calculated sum value of the log to be grouped; and remove the first label of each associated log, wherein the other logs are the logs in the first log group other than the log to be grouped.

In a third aspect, the present application provides an electronic device, including:

A memory for storing a computer program;

A processor, configured to implement the method steps of log grouping described above when executing the computer program stored in the memory.

In a fourth aspect, the present application provides a computer readable storage medium having stored therein a computer program which when executed by a processor performs the method steps of a log grouping as described above.

For the technical effects of each of the second to fourth aspects, and of each possible design thereof, refer to the technical effects described above for the first aspect or for each possible design of the first aspect; the detailed description is not repeated here.

Drawings

FIG. 1 is a flow chart of a method for log grouping provided by the present application;

FIG. 2 is a schematic diagram of an exemplary log provided by the present application;

FIG. 3 is a flow chart of an online log data processing method according to the present application;

FIG. 4 is a schematic view of a removable label according to the present application;

FIG. 5 is a flow chart of a method for removing tags according to the present application;

FIG. 6 is a schematic diagram of an apparatus for log grouping according to the present application;

FIG. 7 is a schematic diagram of a structure of an electronic device according to the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings. The specific method of operation in the method embodiment may also be applied to the device embodiment or the system embodiment.

In the description of the present application, "plurality" is understood as "at least two". "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: A alone, both A and B, and B alone. "A and B are connected" may indicate either that A and B are directly connected or that A and B are connected through C. In addition, in the description of the present application, the words "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance or order.

The embodiment of the application provides a method, a device and electronic equipment for grouping logs, which are used for improving the accuracy of grouping logs.

According to the method provided by the embodiment of the application, a first target field sequence and a second target field sequence corresponding to a log to be grouped are first determined, wherein the first target field sequence comprises at least one numerical value type target field and the second target field sequence comprises at least one text type target field. Then, a first distance between the first target field sequence and a first reference field sequence corresponding to a first log group, and a second distance between the second target field sequence and a second reference field sequence corresponding to the first log group, are calculated. If the sum of the first distance and the second distance is smaller than or equal to a preset threshold value, the log to be grouped is added into the first log group. By the method, logs are grouped by combining their numerical value fields and text fields, and the logs within each resulting group are more closely related, so that the accuracy of the log analysis result is improved.

Further, after the log to be grouped is added to the first log group, a first label corresponding to the first log group may also be added to the log to be grouped. In this way, logs in the same log group carry the same label, which establishes the relevance among logs within a log group. In addition, one log may belong to several groups, in which case the log carries several labels, which establishes the relevance among logs in different log groups.

Based on the addition of the labels, the deletion of a certain label of the log can be realized, namely, when deleting a certain log, the corresponding label can be deleted, and other similar logs are correspondingly deleted based on the label, so that the analysis efficiency of log analysis and the accuracy of analysis results can be improved.

It should be noted that, the technical features included in the embodiment of the present application may be used in any combination, and those skilled in the art should understand that, from the practical application situation, the technical scheme obtained by reasonably combining the technical features in the embodiment of the present application may also solve the same technical problem or achieve the same technical effect.

The method provided by the embodiment of the application is further described in detail below with reference to the accompanying drawings.

Referring to fig. 1, the embodiment of the application provides a log grouping method, which specifically comprises the following steps:

Step 101, determining a first target field sequence and a second target field sequence corresponding to a log to be grouped;

The first target field sequence comprises at least one numerical class target field, and the second target field sequence comprises at least one text class target field.

Before executing step 101, a first value class field corresponding to each classification field in the log to be grouped may be determined according to a mapping relationship between the classification field and the value class field.

And then, taking each numerical value class field in the log to be grouped and each determined first numerical value class field as a numerical value class target field forming a first target field sequence, and taking each text class field in the log to be grouped as a text class target field forming a second target field sequence.

Specifically, the log to be grouped is a security log, which may include fields of various types, such as numerical class fields, classification fields, and text class fields. In detail, the numerical class fields include the source IP (Internet Protocol) address, event time, status code, number of requested data bytes, etc.; the classification fields include the rule ID (identifier), attack type, action, threat level, etc.; and the text class fields include the request URI (Uniform Resource Identifier), User-Agent (UA), etc.

For example, fig. 2 shows an exemplary log, which may be a log produced under the default configuration of OpenWAF, an open source WAF (Web Application Firewall) project.

It should be noted that the log shown in fig. 2 is an exemplary log, and the specific matters related thereto are also used in the following exemplary description.

In the log shown in fig. 2, the source IP is "192.168.0.1", the event time is "2021-10-11 16:12:42", the attack type is "SQLi", the action is "Deny", the threat level is "High", the rule ID is "10001", the request URI is "/index.phpp=1%20or%201=1", and the User-Agent is "curl/7.29.0", as listed in table 1 below.

Attribute	Field
Source IP	192.168.0.1
Event time	2021-10-11 16:12:42
Attack type	SQLi
Action	Deny
Threat level	High
Rule ID	10001
URI	/index.phpp=1%20or%201=1
User-Agent	curl/7.29.0

TABLE 1

Wherein the left column represents each attribute in the log, the right column represents each field in the log, and each field has a one-to-one correspondence with each attribute.

As shown in table 1, the fields corresponding to the attack type, the action, the threat level and the rule ID are classification fields of the log; the numerical class field corresponding to each classification field can be obtained according to the mapping relationship between classification fields and numerical class fields. The mapping may specifically use label encoding, that is, N classification fields are mapped one by one to the integers in the interval [0, N-1], where N is an integer greater than or equal to 1. Taking the threat level attribute as an example, if the threat level is preset to take four fields, the four fields are mapped to the four numerical class fields 0, 1, 2 and 3 respectively; when the threat level "High" corresponds to the numerical class field "3", the field "High" in the log shown in fig. 2 yields the numerical class field "3" after mapping.

In addition to the mapping process, the fields corresponding to the two attributes of source IP and event time can be converted to integers (rounding processing), and the processed fields are shown in table 2 below.
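A minimal sketch of the label encoding described above. Only the value "High" and its code 3 come from the example; the other three threat level names and the helper `build_label_encoder` are hypothetical placeholders.

```python
from typing import Dict, Sequence


def build_label_encoder(categories: Sequence[str]) -> Dict[str, int]:
    """Map N category values one by one onto the integers 0 .. N-1."""
    return {category: code for code, category in enumerate(categories)}


# Hypothetical ordering of the four threat levels; only "High" -> 3 is
# taken from the example log.
THREAT_LEVELS = ["Info", "Low", "Medium", "High"]
threat_encoder = build_label_encoder(THREAT_LEVELS)

encoded_threat_level = threat_encoder["High"]
```

The same helper could be applied to the attack type, action and rule ID attributes, each with its own list of known values.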

Attribute	Field
Source IP	3232235521
Event time	1633939962
Attack type	0
Action	1
Threat level	3
Rule ID	0
URI	/index.phpp=1%20or%201=1
User-Agent	curl/7.29.0

TABLE 2

Through rounding, the source IP changes from the original 192.168.0.1 to 3232235521 and the event time changes from the original 2021-10-11 16:12:42 to 1633939962; through mapping, the attack type is mapped from the original SQLi to 0, the action from the original Deny to 1, the threat level from the original High to 3, and the rule ID from the original 10001 to 0.
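The two integer conversions can be reproduced with the Python standard library, as sketched below. The text does not state the time zone of the event time; a UTC+8 offset is assumed here because it reproduces the table 2 value exactly. The function names are hypothetical.

```python
import ipaddress
from datetime import datetime, timedelta, timezone


def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def event_time_to_epoch(text: str, utc_offset_hours: int = 8) -> int:
    """Convert 'YYYY-MM-DD HH:MM:SS' to Unix epoch seconds.

    The UTC offset is an assumption, not something the source specifies.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(parsed.timestamp())


source_ip_value = ip_to_int("192.168.0.1")
event_time_value = event_time_to_epoch("2021-10-11 16:12:42")
```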

Through the processing, the classification field originally existing in the log is converted into the numerical value type field, so that the field type of the log is simplified while the content of the log is not lost, and further the calculation time and calculation resources required by subsequent log analysis are reduced.

In other words, the log to be grouped can obtain d target fields corresponding to the log to be grouped through the above processing, wherein d is an integer greater than or equal to 2, and the target fields comprise a numerical value class target field and a text class target field. Here, since the attributes corresponding to the respective target fields are different, if a single attribute is taken as one dimension of the target fields, the log to be grouped corresponds to the target field with d dimensions, and in the embodiment of the present application, the attribute of each target field is represented by a dimension, and the description thereof will not be repeated.

Further, the numerical class target fields of the log to be grouped constitute a first target field sequence, and the text class target fields of the log to be grouped constitute a second target field sequence. The numerical class target fields and the text class target fields may be collectively referred to herein as target fields, and the target fields in the first target field sequence and the second target field sequence are arranged in order of dimension.

For example, the log shown in table 2 includes target fields of eight dimensions. Taking the order from top to bottom in table 2, so that the first dimension is the source IP and the eighth dimension is the User-Agent, the first target field sequence of the log may be {3232235521,1633939962,0,1,3,0}, and the second target field sequence of the log may be {/index.phpp=1%20or%201=1, curl/7.29.0}.
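Splitting a processed log into the two target field sequences can be sketched as follows, using the table 2 values. Representing the log as a dictionary keyed by attribute name, and the names `NUMERIC_ATTRIBUTES`, `TEXT_ATTRIBUTES` and `split_target_sequences`, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

# Dimension order follows table 2 from top to bottom.
NUMERIC_ATTRIBUTES = [
    "Source IP", "Event time", "Attack type", "Action", "Threat level", "Rule ID",
]
TEXT_ATTRIBUTES = ["URI", "User-Agent"]


def split_target_sequences(
    record: Dict[str, Union[int, str]],
) -> Tuple[List[int], List[str]]:
    """Return (first target field sequence, second target field sequence)."""
    first = [int(record[name]) for name in NUMERIC_ATTRIBUTES]
    second = [str(record[name]) for name in TEXT_ATTRIBUTES]
    return first, second


record = {
    "Source IP": 3232235521,
    "Event time": 1633939962,
    "Attack type": 0,
    "Action": 1,
    "Threat level": 3,
    "Rule ID": 0,
    "URI": "/index.phpp=1%20or%201=1",
    "User-Agent": "curl/7.29.0",
}
first_sequence, second_sequence = split_target_sequences(record)
```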

Based on the method, a first target field sequence and a second target field sequence corresponding to the log to be grouped can be determined.

Step 102, calculating a first distance between the first target field sequence and a first reference field sequence corresponding to a first log group, wherein the first reference field sequence comprises at least one numerical class reference field;

In the embodiment of the application, firstly, the sequence position of each numerical class target field in the first target field sequence and the sequence position of each numerical class reference field in the first reference field sequence corresponding to the first log group are determined; then, the field distance between the numerical class target field and the numerical class reference field at the same sequence position is calculated, which yields a plurality of field distances between the numerical class target fields and the numerical class reference fields; finally, the sum of the field distances is taken as the first distance between the first target field sequence and the first reference field sequence.

It should be noted that, although the first log group is described in the embodiment of the present application, the first log group should not be construed as limiting the number of log groups. It can be understood that there may be a plurality of log groups; only the processing of one log group is described here, and the description is not repeated for the other log groups.

Specifically, the first log group includes at least one log, if the first log group includes only one log, the value class reference field in the first reference field sequence corresponding to the first log group is the value class field of the log, and if the first log group includes a plurality of logs, the value class reference field in the first reference field sequence corresponding to the first log group is the average value of the value class fields of the plurality of logs, which is described in detail below with reference to the formula.

For a first log group Data = {$L_1, L_2, \ldots, L_n$} containing n logs, where n is greater than or equal to 2 and $L_1, L_2, \ldots, L_n$ are the individual logs in the first log group, each log corresponds to fields of d dimensions, d being an integer greater than or equal to 2; the d dimensions comprise a dimensions of numerical class fields and b dimensions of text class fields. For each dimension i = 1, …, a, the average value of the numerical class fields of the i-th dimension over the n logs in the first log group is calculated and used as the numerical class reference field of the i-th dimension of the first log group, as given by the following formula:

$$C^{i} = \frac{1}{n}\sum_{j=1}^{n} L_{j}^{i}$$

Wherein $C^{i}$ is the numerical class reference field of the i-th dimension corresponding to the first log group Data, $L_{j}^{i}$ is the numerical class field of the i-th dimension corresponding to the j-th log in the first log group Data, and n is the total number of logs in the first log group Data.

By the method, the first reference field sequence corresponding to the first log group can be obtained, and the first reference field sequence comprises numerical class reference fields of a dimensions.
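The per-dimension averaging described above can be sketched as follows. The function name `numeric_reference_sequence` and the list-of-lists representation of a group are assumptions for illustration.

```python
from typing import List, Sequence


def numeric_reference_sequence(group: Sequence[Sequence[float]]) -> List[float]:
    """Average each numerical dimension over all logs in the group.

    `group` holds one first-field sequence (length a) per log.
    """
    n = len(group)
    if n == 0:
        raise ValueError("a log group must contain at least one log")
    a = len(group[0])
    if any(len(log) != a for log in group):
        raise ValueError("all logs must have the same number of numerical dimensions")
    return [sum(log[i] for log in group) / n for i in range(a)]


reference = numeric_reference_sequence([[1.0, 2.0], [3.0, 6.0]])
single = numeric_reference_sequence([[5.0, 7.0]])
```

With a single log in the group, the averaging returns that log's own values, which matches the single-log case described earlier.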

It can be appreciated that the first target field sequence corresponding to the log to be grouped also includes a numerical class target field having an a-dimension. Here, for the i= 1~a dimension, calculating the weighted difference square between the i-th dimension value class reference field in the first reference field sequence and the i-th dimension value class target field in the first target field sequence, performing sum-of-squares calculation on the a-weighted difference squares calculated by the a-dimension, and taking the calculated result as a first distance between the first target field sequence and the first reference field sequence, specifically referring to the following formula:

Wherein L1 is a first distance between the first target field sequence and the first reference field sequence, L ⁱ is a value class target field of the ith dimension in the first target field sequence, C ⁱ is a value class reference field of the ith dimension in the first reference field sequence, w _i is a weight of the ith dimension, and the weight is set according to the actual situation, for example, the dimension of the source IP is important, and then the weight of the dimension is increased.

It is noted that the first distance characterizes the distance between the first target field sequence and the first reference field sequence: a smaller first distance indicates that the first target field sequence is more similar to the first reference field sequence, and a larger first distance indicates that it is less similar.

By the above method, the first distance between the first target field sequence and the first reference field sequence is obtained.
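A minimal sketch of the first-distance computation follows. It assumes the distance is the plain sum of the weighted squared differences, with no square root applied afterwards; that reading, the function name `first_distance` and the sample values are assumptions made for illustration.

```python
from typing import Sequence


def first_distance(
    target: Sequence[float],
    reference: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Sum of weighted squared differences over the a numerical dimensions."""
    if not (len(target) == len(reference) == len(weights)):
        raise ValueError("target, reference and weights must have equal length")
    return sum(w * (t - c) ** 2 for t, c, w in zip(target, reference, weights))


# Hypothetical values: the first dimension is weighted twice as heavily.
target = [12.0, 400.0]
reference = [10.0, 403.0]
weights = [2.0, 1.0]
print(first_distance(target, reference, weights))  # 2*4 + 1*9 = 17.0
```

Raising the weight of one dimension makes a mismatch in that dimension push the log away from the group more strongly, which is the effect the source IP example relies on.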

Step 103, calculating a second distance between the second target field sequence and a second reference field sequence corresponding to the first log packet, wherein the second reference field sequence at least comprises a text type reference field;

In the embodiment of the application, firstly, the sequence position of each text type target field in a second target field sequence is determined, the sequence position of each text type reference field in a second reference field sequence corresponding to a first log packet is determined, then the minimum editing distance between the text type target field and the text type reference field corresponding to the same sequence position is calculated, a plurality of minimum editing distances between each text type target field and each text type reference field are obtained, and finally, weighted summation is carried out on the plurality of minimum editing distances, so that a second distance between the second target field sequence and the second reference field sequence is obtained.

Specifically, the first log group includes at least one log. If the first log group includes only one log, the text class reference field in the second reference field sequence corresponding to the first log group is the text class field of that log. If the first log group includes a plurality of logs, the text class reference field in the second reference field sequence corresponding to the first log group is formed from the text class fields of those logs, as described in detail below with reference to the formula.

For a first log group Data = {L_1, L_2, …, L_n} containing n logs, where n is greater than or equal to 2 and L_1, L_2, …, L_n are the single logs in the first log group, each single log corresponds to fields of d dimensions, where d is an integer greater than or equal to 2, and the d dimensions comprise numerical class fields of a dimensions and text class fields of b dimensions. Here, for each dimension k = 1, …, b, the text class fields of the k-th dimension of the n logs in the first log group are collected, giving n text class fields for the k-th dimension; t text class fields are randomly selected from these n text class fields, and the t text class fields together serve as the text class reference field of the first log group in the k-th dimension, as given by the following formula:

$$C^{k}_{Data} = [\mathrm{text}_{1}, \mathrm{text}_{2}, \ldots, \mathrm{text}_{t}]$$

Where C^k_Data is the text class reference field corresponding to the k-th dimension of the first log group Data, and text_1, text_2, …, text_t are the text class fields of the k-th dimension of any t logs in the first log group Data.

By the above method, the second reference field sequence corresponding to the first log group is obtained, and the second reference field sequence comprises text class reference fields of b dimensions. It can be appreciated that the second target field sequence corresponding to the log to be grouped likewise includes text class target fields of b dimensions. Here, for each dimension k = 1, …, b, the minimum edit distance between the text class reference fields of the k-th dimension in the second reference field sequence and the text class target field of the k-th dimension in the second target field sequence is calculated, the b minimum edit distances obtained for the b dimensions are combined by weighted summation, and the result is taken as the second distance between the second target field sequence and the second reference field sequence, as given by the following formula:

$$L2 = \sum_{k=1}^{b} w_{k} \min_{v \in C^{k}} E\left(L^{k}, v\right)$$

Where L2 is the second distance between the second target field sequence and the second reference field sequence, L^k is the text class target field of the k-th dimension in the second target field sequence, C^k is the single or multiple text class reference fields of the k-th dimension in the second reference field sequence, v is a single text class reference field of the k-th dimension in the second reference field sequence, E(L^k, v) is the edit distance between L^k and v, and w_k is the weight of the k-th dimension. The weight is set according to the actual situation; for example, if the request URI dimension is important, the weight of that dimension is increased.

It is noted that the second distance characterizes the distance between the second target field sequence and the second reference field sequence: a smaller second distance indicates that the second target field sequence is more similar to the second reference field sequence, and a larger second distance indicates that it is less similar.

By the above method, the second distance between the second target field sequence and the second reference field sequence is obtained.
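The second-distance computation can be sketched as below, assuming the edit distance E is the standard Levenshtein distance and that, for each dimension, the smallest distance to any of that dimension's reference fields is used. The function names and sample strings are hypothetical.

```python
from typing import Sequence


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def second_distance(
    target: Sequence[str],
    reference: Sequence[Sequence[str]],
    weights: Sequence[float],
) -> float:
    """Weighted sum, over the b text dimensions, of the minimum edit distance."""
    return sum(
        w * min(edit_distance(field, v) for v in candidates)
        for field, candidates, w in zip(target, reference, weights)
    )


# Hypothetical log with b = 2 text fields (request URI and user agent).
target = ["/login", "curl/7.0"]
reference = [["/login", "/logout"], ["curl/7.1"]]
weights = [2.0, 1.0]
print(second_distance(target, reference, weights))  # 2*0 + 1*1 = 1.0
```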

Step 104, if the sum of the first distance and the second distance is smaller than or equal to a preset threshold value, adding the log to be grouped into the first log group.

The first distance in step 102 and the second distance in step 103 are summed to obtain a target distance between the log to be grouped and the first log group, and the specific calculation formula is as follows:

L=L1+L2

Wherein L1 is a first distance, L2 is a second distance, and L is a target distance.

In the embodiment of the application, it is judged whether the target distance is less than or equal to a preset threshold. If the target distance is less than or equal to the preset threshold, the log to be grouped is added to the first log group. If the target distance is greater than the preset threshold and no other log group exists, a second log group is created and the log to be grouped is added to the second log group. If the target distance is greater than the preset threshold and other log groups exist, the target distance between the log to be grouped and each of the other log groups is calculated according to steps 101 to 104, and the log to be grouped is added to each log group whose calculated target distance is less than or equal to the preset threshold.
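The threshold decision can be sketched as follows, taking the target distances to the existing groups as already computed. The sketch assumes a new group is also created when other groups exist but none is within the threshold, a case the text does not spell out; the function names and values are hypothetical.

```python
from typing import List


def choose_groups(distances: List[float], threshold: float) -> List[int]:
    """Indices of existing groups whose target distance is within the threshold."""
    return [idx for idx, d in enumerate(distances) if d <= threshold]


def assign(
    log_id: str,
    groups: List[List[str]],
    distances: List[float],
    threshold: float,
) -> List[int]:
    """Add the log to every matching group, or open a new group if none matches."""
    matches = choose_groups(distances, threshold)
    if not matches:
        groups.append([log_id])  # assumption: open a new group when nothing matches
        return [len(groups) - 1]
    for idx in matches:
        groups[idx].append(log_id)
    return matches


groups = [["a"], ["b"]]
print(assign("c", groups, [5.0, 1.5], threshold=2.0))  # [1]
print(assign("d", groups, [9.0, 7.0], threshold=2.0))  # [2]
print(groups)  # [['a'], ['b', 'c'], ['d']]
```

Because every group within the threshold receives the log, one log can end up in several groups.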

Further, if the log to be grouped is added to the first log group, a first tag corresponding to the first log group is also determined, and the first tag is then added to the log to be grouped. In addition, since a new log has been added to the first log group, the first reference field sequence and the second reference field sequence corresponding to the first log group are updated accordingly.

The idea of updating the first reference field sequence is the same as in step 102. Specifically, for each dimension i = 1, …, a, the average value of the numerical class fields corresponding to the i-th dimension of the n+1 logs in the first log group is calculated, as given by the following formula:

$$C^{i}_{Data} = \frac{n \cdot C^{i}_{Data} + L^{i}}{n + 1}$$

Where C^i_Data on the left of the equation is the numerical class reference field corresponding to the i-th dimension of the updated first log group, C^i_Data on the right of the equation is the numerical class reference field corresponding to the i-th dimension of the first log group before updating, n is the number of logs in the first log group before updating, (n+1) is the number of logs in the first log group after updating, and L^i is the numerical class target field corresponding to the i-th dimension of the log to be grouped.
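Because the reference field is a mean, it can be updated from the previous mean and the log count alone, without revisiting the earlier logs. A minimal sketch, with a hypothetical function name and sample values:

```python
from typing import List, Sequence


def update_numeric_reference(
    reference: Sequence[float], n: int, new_log: Sequence[float]
) -> List[float]:
    """Fold one new log into a per-dimension mean computed over n logs."""
    if n < 1:
        raise ValueError("n must be the number of logs already in the group")
    return [(n * c + x) / (n + 1) for c, x in zip(reference, new_log)]


# Mean over n = 3 logs, then a fourth log arrives.
reference = [20.0, 400.0]
print(update_numeric_reference(reference, 3, [40.0, 0.0]))  # [25.0, 300.0]
```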

The update of the second reference field sequence is described below, taking the j-th dimension as an example. The text class target field L^j of the j-th dimension corresponding to the log to be grouped is appended to the original text class reference fields [text_1, text_2, …, text_t] of that dimension in the second reference field sequence, forming a new field sequence S = [text_1, text_2, …, text_t, L^j]. Here, text_1 is the first element of the field sequence S, text_2 is the second element, and so on, and L^j is the (t+1)-th element. For each element S_i in the field sequence S, i = 1, …, t+1, the edit distances between S_i and the other elements of S are calculated, giving a set D_i = [E(S_i, S_1), …, E(S_i, S_{t+1})] of t edit distances (the distance from S_i to itself is excluded). Then the variance of the set corresponding to each element is calculated:

$$\sigma_{i}^{2} = \frac{1}{t} \sum_{j \neq i} \left(E_{ij} - \bar{E}_{i}\right)^{2}$$

Where E_ij is the edit distance between the i-th element and the j-th element of S, and \bar{E}_i is the average of the t edit distances in the set D_i. After all variances have been calculated, the element with the largest variance is removed from the field sequence S, and the field sequence with that element removed is used as the updated second reference field sequence.

As can be seen from this update, the updated second reference field sequence retains the text class reference fields with relatively small variance, which keeps the text characteristics of the text class reference fields in the second reference field sequence stable.
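A minimal sketch of this update for one text dimension follows. It assumes Levenshtein edit distance and a population variance (division by t); ties are broken by keeping the earliest maximum. The function names and strings are hypothetical.

```python
from typing import List


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def update_text_reference(reference: List[str], new_field: str) -> List[str]:
    """Append the new field, then drop the element whose distances vary most."""
    s = reference + [new_field]
    variances = []
    for i, item in enumerate(s):
        dists = [edit_distance(item, other) for j, other in enumerate(s) if j != i]
        mean = sum(dists) / len(dists)
        variances.append(sum((d - mean) ** 2 for d in dists) / len(dists))
    drop = max(range(len(s)), key=lambda i: variances[i])
    return s[:drop] + s[drop + 1:]


reference = ["/login", "/logon", "/login2"]
print(update_text_reference(reference, "/log"))  # ['/login', '/logon', '/log']
```

In this example "/login2" has distances 1, 2 and 3 to the other elements, the largest spread of the four, so it is the element removed and the reference keeps t = 3 entries.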

By the method, real-time update of the log grouping is realized, the method is suitable for a scene of online iterative processing, timeliness of log analysis is guaranteed, and efficiency and accuracy of the log analysis are improved.

In addition, the method of adding the same label to the logs in the same log group can establish the relevance among the logs of the same type in the same group. In addition, one log may be added to a plurality of log packets, in which case the labels corresponding to each of the plurality of log packets are added to the log, and thus, the association of the same log in different packets is also established.

In order to facilitate a better understanding of the solutions provided by the embodiments of the present application, the application of the above process in a practical application scenario is further described below with reference to the accompanying drawings.

Fig. 3 shows a flow chart of the online processing of log data. It should be noted that online processing is one applicable scenario, and the following method is equally applicable to offline processing.

S301, acquiring a first log;

The first log is a log to be grouped.

S302, calculating the distance between a first log and a data center of a log group;

The data center is a collective name for the first reference field sequence and the second reference field sequence corresponding to the log group, and the distance is the target distance.

S303, judging whether the distance is greater than a preset threshold value;

If the distance is greater than the preset threshold, step S305 is executed; if the distance is less than or equal to the preset threshold, step S304 is executed.

S304, adding the first log to the log group, and updating the data center and the label corresponding to the log group;

S305, creating a new log packet, adding the first log into the new log packet, and updating the data center and the label corresponding to the log packet.
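The flow S301 to S305 can be sketched as a simple streaming loop. To stay short, the sketch uses numerical fields only, an unweighted squared distance, a first-match policy and generated labels; all of these, and the names `LogGroup` and `process`, are simplifying assumptions. Equality with the threshold is treated as a match, following step 104.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogGroup:
    label: str
    center: List[float]                      # numerical part of the data center
    logs: List[List[float]] = field(default_factory=list)


def squared_distance(log: List[float], center: List[float]) -> float:
    return sum((x - c) ** 2 for x, c in zip(log, center))


def process(log: List[float], groups: List[LogGroup], threshold: float) -> str:
    """Handle one incoming log and return the label it receives."""
    for group in groups:                                   # S302
        if squared_distance(log, group.center) <= threshold:   # S303
            n = len(group.logs)                            # S304: join and update
            group.center = [(n * c + x) / (n + 1) for c, x in zip(group.center, log)]
            group.logs.append(log)
            return group.label
    new_group = LogGroup(f"group-{len(groups) + 1}", list(log), [log])  # S305
    groups.append(new_group)
    return new_group.label


groups: List[LogGroup] = []
stream = [[1.0, 1.0], [1.5, 1.0], [10.0, 10.0], [1.2, 0.8]]
print([process(log, groups, threshold=1.0) for log in stream])
# ['group-1', 'group-1', 'group-2', 'group-1']
```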

In one possible design, if a specific label needs to be added to a log and to the logs of the same type as that log, the log group in which the log is located is first determined, and the specific label is then added to all the logs in that log group. Here, the specific label may be a label indicating that the log does not need attention, or a label indicating that the log needs close attention.

By the method, the self-adaption of log analysis is realized, and the efficiency of log analysis can be effectively improved.

In one possible design, if the label lb of one log L_k is to be removed, and the label lb of the logs of the same type as that log is to be removed as well, this may be accomplished by determining the log groups that correspond to the label lb and contain the log L_k, and calculating the target distance between the log L_k and each of these log groups as the final target distance. For ease of understanding, take one log group as an example: the final target distance between the log L_k and the log group is calculated, then the target distances between the other logs in the log group and the log group are calculated to obtain a plurality of target distances, the logs whose target distance is greater than the final target distance are selected as the logs to be removed, and the label lb is removed from these logs.

For example, fig. 4 shows a schematic diagram of removing the labels, in which the circle center is the reference field of the log group, the distance between log A and the circle center is the final target distance, and the distances between log B, log C, log D, log E and log F and the circle center are target distances.
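The label-removal rule can be sketched as below, taking the target distance of every log to the group's reference field as already computed. The distances assigned to logs A to F are invented for illustration and are not read from fig. 4; the function name is hypothetical.

```python
from typing import Dict, Set


def remove_label(
    labels: Dict[str, Set[str]],
    distances: Dict[str, float],
    log_id: str,
    label: str,
) -> Set[str]:
    """Remove `label` from `log_id` and from every log farther from the center."""
    final_distance = distances[log_id]
    affected = {log_id} | {
        other for other, d in distances.items()
        if other != log_id and d > final_distance
    }
    for name in affected:
        labels[name].discard(label)
    return affected


# Hypothetical distances from each log to the group's reference field.
distances = {"A": 3.0, "B": 1.0, "C": 4.5, "D": 2.0, "E": 5.0, "F": 3.0}
labels = {name: {"lb"} for name in distances}
print(sorted(remove_label(labels, distances, "A", "lb")))  # ['A', 'C', 'E']
```

Log F sits at exactly the same distance as log A and keeps its label, because the comparison is strictly "greater than".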

Further, in order to facilitate a better understanding of the solutions provided by the embodiments of the present application, the application of the above process in a practical application scenario is further described below with reference to the accompanying drawings.

As shown in fig. 5, a flowchart for removing the tag is shown.

S501, removing a label lb of the log L _k;

S502, calculating the distance between the log L _k and the data center of each log packet containing the log L _k;

S503, removing all logs with the distance from the data center greater than the distance from the log L _k to the data center from the log group containing the log L _k;

S504, updating the data center of the log group from which logs were removed;

S505, updating the preset threshold value of the log group from which logs were removed;

The preset threshold is a distance threshold.

S506, judging whether the weight is to be adjusted;

If it is determined that the weight is to be adjusted, S507 is executed, and if it is determined that the weight is not to be adjusted, S510 is executed.

S507, judging whether the weight is to be increased;

If it is judged that the weight is to be increased, S509 is executed; if it is judged that the weight is not to be increased, S508 is executed.

S508, reducing the corresponding weight according to the learning rate;

S509, increasing the corresponding weight according to the learning rate;

S510, completing the process of removing the label lb.
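The branch S506 to S509 can be sketched as follows. The text does not give the update rule, so the multiplicative step `weight * (1 ± learning_rate)` is purely an assumption, as are the function and parameter names.

```python
def adjust_weight(
    weight: float, learning_rate: float, adjust: bool, increase: bool
) -> float:
    """Return the dimension weight after the optional adjustment step."""
    if not adjust:                              # S506: no adjustment requested
        return weight
    if increase:                                # S507 -> S509
        return weight * (1.0 + learning_rate)   # assumed multiplicative increase
    return weight * (1.0 - learning_rate)       # S508: assumed multiplicative decrease


print(adjust_weight(2.0, 0.5, adjust=True, increase=True))    # 3.0
print(adjust_weight(2.0, 0.5, adjust=True, increase=False))   # 1.0
print(adjust_weight(2.0, 0.5, adjust=False, increase=True))   # 2.0
```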

Generally, removing a label means that the label is erroneous. Once a label is found to be erroneous, the detection and removal of that label from similar logs can be completed adaptively, so that unexpected situations are handled well; that is, the method suits the requirements of various services and of security operation and maintenance, and effectively improves the efficiency of log analysis. In addition, real-time interaction with the logs is realized based on the labels, and by dynamically adjusting the parameters corresponding to the log groups, the updated log groups can quickly adapt to the requirements of different services and different security operation and maintenance.

The method provided by the embodiment of the application can achieve the following technical effects:

1. The distances of the numerical class fields and the text class fields are calculated separately, which better fits actual application scenarios. The weights used in the distance calculation can also be adjusted dynamically, so that more accurate log groups are obtained and the calculated distances are easier to interpret;

2. A reference field is provided for each log group and represents the characteristics of the logs in that group, which effectively improves online processing performance, saves calculation time and computing resources, and improves the accuracy of the log groups to a certain extent;

3. The method is suitable for online scenarios: through dynamic grouping, the same log can be added to several different log groups and receives the labels corresponding to each of those groups, so that associations among the log groups are established. This better fits actual scenarios and suits complex log analysis scenarios.

Based on the same inventive concept, the application also provides a device for log grouping, which is used for improving the accuracy of log grouping, solving the problem that the log analysis result is inaccurate due to the fact that the existing log grouping is not suitable for a real log analysis scene, and being beneficial to improving the accuracy of the log analysis result, and referring to fig. 6, the device comprises:

A determining module 601, configured to determine a first target field sequence and a second target field sequence corresponding to a log to be grouped, wherein the first target field sequence at least comprises a numerical value type target field, and the second target field sequence at least comprises a text type target field;

A first calculation module 602, configured to calculate a first distance between the first target field sequence and a first reference field sequence corresponding to a first log packet, where the first reference field sequence includes at least one numeric class reference field;

A second calculation module 603, configured to calculate a second distance between the second target field sequence and a second reference field sequence corresponding to the first log packet, where the second reference field sequence includes at least one text-based reference field;

And a grouping module 604, configured to add the log to be grouped to the first log group if the sum of the first distance and the second distance is less than or equal to a preset threshold.

In one possible design, the determining module 601 is specifically configured to determine, according to a mapping relationship between the classification field and the value class field, a first value class field corresponding to each classification field in the log to be grouped, take each value class field in the log to be grouped and each determined first value class field as a value class target field forming a first target field sequence, and take each text class field in the log to be grouped as a text class target field forming a second target field sequence.
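The split performed by the determining module can be sketched as follows. The mapping table, the field names and the alphabetical field order are hypothetical; a real deployment would supply its own mapping between classification fields and numerical values.

```python
from typing import Dict, List, Tuple, Union

FieldValue = Union[int, float, str]

# Hypothetical mapping from classification fields to numerical values.
CATEGORY_MAP: Dict[str, Dict[str, float]] = {
    "method": {"GET": 0.0, "POST": 1.0, "PUT": 2.0},
}


def split_fields(log: Dict[str, FieldValue]) -> Tuple[List[float], List[str]]:
    """Build the first (numerical) and second (text) target field sequences."""
    numeric: List[float] = []
    text: List[str] = []
    for name in sorted(log):                    # fixed order keeps dimensions aligned
        value = log[name]
        if name in CATEGORY_MAP:                # classification field -> mapped number
            numeric.append(CATEGORY_MAP[name][str(value)])
        elif isinstance(value, (int, float)):   # numerical class field
            numeric.append(float(value))
        else:                                   # text class field
            text.append(str(value))
    return numeric, text


log = {"method": "POST", "status": 404, "uri": "/admin", "bytes": 512}
print(split_fields(log))  # ([512.0, 1.0, 404.0], ['/admin'])
```

A category value missing from the mapping raises `KeyError` in this sketch; how unknown categories should be handled is left open here.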

In one possible design, the first calculation module 602 is specifically configured to determine an attribute corresponding to each value class target field in the first target field sequence and an attribute corresponding to each value class reference field in the first reference field sequence corresponding to the first log group, calculate a field distance between the value class target field and the value class reference field corresponding to the same attribute to obtain a plurality of field distances between the value class target fields and the value class reference fields, and use the sum of the plurality of field distances as the first distance between the first target field sequence and the first reference field sequence.

In one possible design, the second calculating module 603 is specifically configured to determine an attribute corresponding to each text class target field in the second target field sequence and an attribute corresponding to each text class reference field in the second reference field sequence corresponding to the first log packet, calculate a minimum edit distance between the text class target field and the text class reference field corresponding to the same attribute to obtain a plurality of minimum edit distances between each text class target field and each text class reference field, and perform weighted summation for the plurality of minimum edit distances to obtain a second distance between the second target field sequence and the second reference field sequence.

In one possible design, the grouping module 604 is further configured to create a second log group and add the log to be grouped to the second log group if the sum of the first distance and the second distance is greater than the preset threshold and no other log group exists.

In one possible design, the grouping module 604 is specifically configured to determine a first tag corresponding to the first log packet, and add the first tag to the log to be grouped.

In one possible design, the grouping module 604 is further configured to: in response to removal of the first label of the log to be grouped in the first log group, calculate, for each log in the first log group, the sum of the first distance and the second distance between that log and the first reference field sequence and the second reference field sequence, respectively; determine, among the other logs of the first log group, the associated logs whose calculated sum is greater than the calculated sum of the log to be grouped; and remove the first label of the associated logs, the other logs being the logs in the first log group other than the log to be grouped.

Based on the device, the distances of the numerical value fields and the text fields are calculated separately, so that the device can be more fit with the actual application scene, and further, the weight in the distance calculation is dynamically adjusted, so that more accurate log grouping is obtained, and the device has better interpretation significance on the calculated distances.

Based on the same inventive concept, the embodiment of the present application further provides an electronic device, where the electronic device may implement the function of the foregoing log grouping device, and referring to fig. 7, the electronic device includes:

At least one processor 701, and a memory 702 connected to the at least one processor 701, in which the specific connection medium between the processor 701 and the memory 702 is not limited in the embodiment of the present application, and in fig. 7, the connection between the processor 701 and the memory 702 through the bus 700 is taken as an example. Bus 700 is shown in bold lines in fig. 7, and the manner in which the other components are connected is illustrated schematically and not by way of limitation. The bus 700 may be divided into an address bus, a data bus, a control bus, etc., and is represented by only one thick line in fig. 7 for convenience of representation, but does not represent only one bus or one type of bus. Alternatively, the processor 701 may be referred to as a controller, and the names are not limited.

In an embodiment of the present application, the memory 702 stores instructions executable by the at least one processor 701, and the at least one processor 701 may perform the log grouping method as previously discussed by executing the instructions stored by the memory 702. The processor 701 may implement the functions of the various modules in the apparatus shown in fig. 6.

The processor 701 is the control center of the apparatus. It may connect the various parts of the entire control device using various interfaces and lines, and performs the various functions of the apparatus and processes data by running or executing the instructions stored in the memory 702 and invoking the data stored in the memory 702, thereby monitoring the apparatus as a whole.

In one possible design, processor 701 may include one or more processing units, and processor 701 may integrate an application processor and a modem processor, wherein the application processor primarily processes operating systems, user interfaces, application programs, and the like, and the modem processor primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 701. In some embodiments, processor 701 and memory 702 may be implemented on the same chip, or they may be implemented separately on separate chips in some embodiments.

The processor 701 may be a general purpose processor such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the log grouping method disclosed in connection with the embodiment of the application can be directly embodied as a hardware processor or a combination of hardware and software modules in the processor.

The memory 702 is a non-volatile computer-readable storage medium that can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 702 may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), programmable read-only memory (Programmable Read Only Memory, PROM), read-only memory (Read-Only Memory, ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), magnetic memory, magnetic disk, optical disk, and the like. The memory 702 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 702 in embodiments of the present application may also be circuitry or any other device/system capable of implementing a memory function for storing program instructions and/or data.

By programming the processor 701, the code corresponding to the log grouping method described in the foregoing embodiment may be solidified into a chip, so that the chip can execute the steps of the log grouping method of the embodiment shown in fig. 1 at the time of operation. How to design and program the processor 701 is a technology well known to those skilled in the art, and will not be described in detail herein.

Based on the same inventive concept, embodiments of the present application also provide a storage medium storing computer instructions that, when run on a computer, cause the computer to perform the log grouping method discussed above.

In some possible embodiments, aspects of the log grouping method provided by the present application may also be implemented in the form of a program product comprising program code for causing the control apparatus to carry out the steps in the log grouping method according to the various exemplary embodiments of the present application as described in the present specification when the program product is run on an apparatus.

It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus/system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A log grouping method, characterized in that the method comprises:

Determining a first target field sequence and a second target field sequence corresponding to the log to be grouped, wherein the first target field sequence includes at least one numerical target field, and the second target field sequence includes at least one text target field;

Determining an attribute corresponding to each numerical target field in the first target field sequence, and an attribute corresponding to each numerical reference field in the first reference field sequence corresponding to the first log group, calculating field distances between the numerical target fields and the numerical reference fields corresponding to the same attribute, obtaining multiple field distances between the numerical target fields and the numerical reference fields, and using the sum of the multiple field distances as the first distance between the first target field sequence and the first reference field sequence;

Determining an attribute corresponding to each text-class target field in the second target field sequence, and an attribute corresponding to each text-class reference field in the second reference field sequence corresponding to the first log group, calculating a minimum edit distance between the text-class target field and the text-class reference field corresponding to the same attribute, obtaining multiple minimum edit distances between the text-class target field and the text-class reference field, and performing a weighted summation on the multiple minimum edit distances to obtain a second distance between the second target field sequence and the second reference field sequence;

If the sum of the first distance and the second distance is less than or equal to a preset threshold, adding the log to be grouped to the first log group.

2. The method according to claim 1, wherein determining the first target field sequence and the second target field sequence corresponding to the log to be grouped comprises:

Determining, according to a mapping relationship between classification fields and numerical fields, a first numerical field corresponding to each classification field in the log to be grouped;

Using each numerical field in the log to be grouped and each determined first numerical field as a numerical target field constituting the first target field sequence;

Using each text field in the log to be grouped as a text target field constituting the second target field sequence.
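As a rough sketch of claim 2, a log could be split into the two target field sequences as shown below. The mapping table, the attribute names, and the rule that any remaining string value counts as a text field are all hypothetical choices made for this example.

```python
# Hypothetical mapping from classification field values to numerical values.
CATEGORY_MAPPING = {
    "severity": {"low": 1, "medium": 2, "high": 3},
    "protocol": {"tcp": 6, "udp": 17},
}


def build_target_sequences(log: dict, mapping: dict = CATEGORY_MAPPING):
    """Return (numerical target fields, text target fields), keyed by attribute."""
    numeric, text = {}, {}
    for attr, value in log.items():
        if attr in mapping:
            # Classification field: replace it with its mapped numerical field.
            numeric[attr] = mapping[attr][value]
        elif isinstance(value, (int, float)):
            # Field that is already numerical.
            numeric[attr] = value
        else:
            # Everything else is treated as a text field (assumption).
            text[attr] = str(value)
    return numeric, text
```

For example, `build_target_sequences({"severity": "high", "bytes": 512, "message": "login failed"})` yields the numerical fields `{"severity": 3, "bytes": 512}` and the text fields `{"message": "login failed"}`. A classification value missing from the mapping raises `KeyError` in this sketch; how such values should be handled is not specified in the claim.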

3. The method according to claim 1, characterized in that after adding the log to be grouped to the first log group if the sum of the first distance and the second distance is less than or equal to a preset threshold, the method further comprises:

If the sum of the first distance and the second distance is greater than the preset threshold and no other log group exists, creating a second log group and adding the log to be grouped to the second log group.
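One possible reading of claim 3 is a loop that tries every existing log group and creates a new one only when none accepts the log. In the sketch below, the `distance` callable stands in for the summed first and second distance, and using the new log itself as the reference of the newly created group is an assumption that the claim does not state.

```python
def assign_to_group(log, groups: list, distance, threshold: float) -> int:
    """Add `log` to the first matching group, or create a new group.

    groups:   list of dicts with keys 'reference' and 'members'
    distance: callable(log, reference) -> summed distance (placeholder)
    Returns the index of the group that received the log.
    """
    for index, group in enumerate(groups):
        if distance(log, group["reference"]) <= threshold:
            group["members"].append(log)
            return index
    # No existing group is close enough: create a new group.
    # Assumption: the new log serves as that group's reference.
    groups.append({"reference": log, "members": [log]})
    return len(groups) - 1
```

With integers standing in for logs and `distance=lambda a, b: abs(a - b)`, assigning 10, 12 and 20 in turn with a threshold of 3 places the first two in group 0 and creates group 1 for the third.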

4. The method according to any one of claims 1 to 3, wherein adding the log to be grouped into the first log group comprises:

Determining a first tag corresponding to the first log group, and adding the first tag to the log to be grouped.

5. The method according to claim 4, characterized in that after adding the first tag to the log to be grouped, the method further comprises:

In response to removal of the first tag from the log to be grouped in the first log group, calculating, for each log in the first log group, the sum of the first distance and the second distance between that log and the first reference field sequence and the second reference field sequence, respectively;

Determining, among the other logs in the first log group, an associated log whose calculated sum is greater than the calculated sum of the log to be grouped, and removing the first tag from the associated log, wherein the other logs are the logs in the first log group other than the log to be grouped.
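The tag handling of claims 4 and 5 could be sketched as follows: when the group tag is removed from one log, every other log in the group that lies farther from the group reference also loses the tag. The data layout (a dict of log records, each with a `tags` set) and the `total_distance` callable, which stands in for the summed first and second distance to the group reference, are assumptions of this example.

```python
def propagate_tag_removal(group_logs: dict, removed_id, tag, total_distance) -> list:
    """Remove `tag` from one log and from all farther logs in the same group.

    group_logs:     dict mapping log id -> record containing a 'tags' set
    total_distance: callable(record) -> summed distance to the group reference
    Returns the ids of the associated logs whose tag was also removed.
    """
    baseline = total_distance(group_logs[removed_id])
    group_logs[removed_id]["tags"].discard(tag)

    # Associated logs: other logs whose summed distance exceeds the baseline.
    associated = [
        log_id
        for log_id, record in group_logs.items()
        if log_id != removed_id and total_distance(record) > baseline
    ]
    for log_id in associated:
        group_logs[log_id]["tags"].discard(tag)
    return associated
```

The rationale suggested by the claim is that if a log at a given distance does not belong in the group, logs that are even farther from the reference are unlikely to belong either, while closer logs keep their tag.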

6. A device for log grouping, characterized in that the device comprises:

a determination module, configured to determine a first target field sequence and a second target field sequence corresponding to a log to be grouped, wherein the first target field sequence includes at least one numerical target field, and the second target field sequence includes at least one text target field;

a first calculation module, configured to determine an attribute corresponding to each numerical target field in the first target field sequence, and an attribute corresponding to each numerical reference field in the first reference field sequence corresponding to the first log group, calculate field distances between the numerical target fields and the numerical reference fields corresponding to the same attribute, obtain multiple field distances between the numerical target fields and the numerical reference fields, and use the sum of the multiple field distances as a first distance between the first target field sequence and the first reference field sequence;

a second calculation module, configured to determine an attribute corresponding to each text target field in the second target field sequence, and an attribute corresponding to each text reference field in the second reference field sequence corresponding to the first log group, calculate a minimum edit distance between the text target field and the text reference field corresponding to the same attribute, obtain multiple minimum edit distances between the text target fields and the text reference fields, and perform a weighted summation of the multiple minimum edit distances to obtain a second distance between the second target field sequence and the second reference field sequence;

an adding module, configured to add the log to be grouped to the first log group if the sum of the first distance and the second distance is less than or equal to a preset threshold.

7. An electronic device, comprising:

a memory, configured to store a computer program;

a processor, configured to implement the method steps of any one of claims 1 to 5 when executing the computer program stored in the memory.

8. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method steps described in any one of claims 1 to 5 are implemented.