CN112235264A

CN112235264A - Network traffic identification method and device based on deep migration learning

Info

Publication number: CN112235264A
Application number: CN202011042795.4A
Authority: CN
Inventors: 王进; 王丽宏; 陈训逊; 呼啸; 肖佃艳; 何跃鹰; 李政; 陈少鹏; 俞宙; 何清林; 孙中豪; 谷杰铭
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2020-09-28
Filing date: 2020-09-28
Publication date: 2021-01-15
Anticipated expiration: 2040-09-28
Also published as: CN112235264B

Abstract

Embodiments of the invention provide a network traffic identification method and device based on deep migration learning, which relate to the technical field of network security and can identify novel network traffic. The technical scheme comprises the following steps: extracting message information and communication behavior information of a preset number of data packets from the network traffic to be identified; calculating the distance between the message information and communication behavior information of the network traffic to be identified and the cluster center of each class cluster, wherein each class cluster comprises the message information and communication behavior information of network traffic of one category; when the shortest of the calculated distances is smaller than a preset distance, obtaining the target category of the class cluster corresponding to the shortest distance; and inputting the message two-dimensional data matrix corresponding to the message information and the behavior two-dimensional data matrix corresponding to the communication behavior information into a network traffic identification model of the target category to determine whether the network traffic to be identified is malicious traffic.

Description

Network traffic identification method and device based on deep migration learning

Technical Field

The invention relates to the technical field of network security, in particular to a network traffic identification method and device based on deep migration learning.

Background

With the rapid development of fifth-generation mobile communication (5G), the Internet of Things, the industrial Internet and other new network technologies, and with increasingly diverse application scenarios, network terminals are taking more varied forms and their number is growing exponentially. Once network attacks initiated by malicious devices, such as remote control, information theft and denial of service, successfully intrude into a network, they pose a significant threat to the information security of the users of network terminals. The network security risks faced by network terminals are therefore increasingly prominent.

At present, most network attacks rely on network communication to achieve their malicious purpose. If the protocol type of the network traffic generated by an attack can be accurately identified, and whether the traffic constitutes a network attack can be judged from that protocol type, the attacked target systems and devices can be determined and effective countermeasures can be taken.

However, existing network monitoring and analysis techniques, such as port identification and deep packet inspection, need samples with classification labels to train a classification network in advance. In a novel network application scenario, novel network traffic of an unknown protocol type lacks labeled samples, so it is impossible to detect whether such traffic is malicious.

Disclosure of Invention

The embodiment of the invention aims to provide a network traffic identification method and device based on deep migration learning, so as to solve the problem that novel network traffic cannot be identified. The specific technical scheme is as follows:

in a first aspect, an embodiment of the present invention provides a network traffic identification method based on deep migration learning, where the method includes:

extracting message information and communication behavior information of a preset number of data packets from network traffic to be identified, wherein the network traffic to be identified comprises network traffic generated in a session establishment stage and network traffic transmitted based on the established session;

calculating the distance between the message information and the communication behavior information of the network flow to be identified and the clustering center of each cluster, wherein each cluster comprises the message information and the communication behavior information of the network flow of one category;

when the shortest distance in the calculated distances is smaller than a preset distance, obtaining a target category of a category cluster corresponding to the shortest distance;

inputting a message two-dimensional data matrix corresponding to the message information and a behavior two-dimensional data matrix corresponding to the behavior information into the network traffic identification model of the target category, and determining whether the network traffic to be identified is malicious traffic;

the network traffic identification model of the target category is a model constructed by a deep migration learning method on the basis of a pre-training model corresponding to a target protocol type matched with the target category; the pre-training model corresponding to the target protocol type is a model obtained by training a deep learning model with a training sample set corresponding to the target protocol type; the training sample set corresponding to the target protocol type comprises: a sample two-dimensional data matrix of each sample network traffic of the target protocol type and the normal or malicious label corresponding to that sample network traffic, where the sample two-dimensional data matrix of each sample network traffic includes: a sample message two-dimensional data matrix and a sample behavior two-dimensional data matrix constructed respectively from the message information and the communication behavior information of the preset number of data packets in the sample network traffic.

Optionally, before the message information and the communication behavior information of a preset number of data packets are extracted from the network traffic to be identified, the method further includes:

obtaining sample information sets of known protocol types, each sample information set of a known protocol type comprising: the message information and the communication behavior information of a preset number of data packets in the sample network traffic of the known protocol type;

dividing the pre-collected network traffic of undetermined protocol types by taking a session as a unit to obtain a plurality of unidentified network traffic;

extracting message information and communication behavior information of a preset number of data packets from each unidentified network flow;

clustering the message information and the communication behavior information of the plurality of unidentified network flows to obtain a cluster of each category;

for each category, calculating the maximum mean discrepancy (MMD) between the class cluster of the category and the sample information set of each known protocol type, determining the known protocol type whose sample information set has the minimum MMD with the class cluster of the category, and taking the determined known protocol type as the protocol type matched with the category;

and constructing a network traffic identification model of the category by a deep migration learning method on the basis of a pre-training model corresponding to the protocol type matched with the category.

Optionally, the network traffic identification model of the target category is constructed through the following steps:

step one, inputting the sample two-dimensional data matrix of the sample network traffic of the target protocol type into the pre-training model corresponding to the target protocol type;

step two, obtaining the output result of the pre-training model corresponding to the target protocol type;

step three, calculating a loss value according to the output result, the normal or malicious label corresponding to the sample network traffic of the target protocol type, and the MMD between the sample information set of the target protocol type and the class cluster of the target category;

step four, if it is determined based on the loss value that the pre-training model corresponding to the target protocol type has converged, taking the pre-training model corresponding to the target protocol type as the network traffic identification model of the target category;

step five, if it is determined based on the loss value that the pre-training model corresponding to the target protocol type has not converged, adjusting the model parameters of the fully connected layer of the pre-training model based on the loss value, and returning to step one.
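The five steps above can be sketched as a small fine-tuning loop. This is only an illustrative sketch under several assumptions that the source does not state: the frozen layers of the pre-training model are represented by precomputed feature vectors, the fully connected layer is reduced to a single logistic unit, the loss is taken as binary cross-entropy plus a weighted MMD term (`lam` is a hypothetical weight), and convergence is judged by the change in loss falling below a hypothetical tolerance `tol`. The MMD is treated as a precomputed scalar, so it shifts the loss but contributes no gradient here.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune_fc(features: Sequence[Sequence[float]], labels: Sequence[int],
                 weights: List[float], bias: float, mmd_value: float,
                 lam: float = 0.1, lr: float = 0.5, tol: float = 1e-4,
                 max_iters: int = 1000) -> Tuple[List[float], float, float]:
    """Adjust only the final-layer parameters until the loss stops changing."""
    prev_loss = None
    loss = float("inf")
    n = len(labels)
    for _ in range(max_iters):
        # Steps one and two: feed the samples forward and collect the outputs.
        probs = [sigmoid(sum(w * f for w, f in zip(weights, x)) + bias)
                 for x in features]
        # Step three: loss from outputs, labels and the (assumed additive) MMD term.
        eps = 1e-12
        bce = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                   for p, y in zip(probs, labels)) / n
        loss = bce + lam * mmd_value
        # Step four: stop when the loss has (approximately) converged.
        if prev_loss is not None and abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
        # Step five: update only the final-layer parameters, then repeat.
        grad_w = [sum((p - y) * x[i] for p, y, x in zip(probs, labels, features)) / n
                  for i in range(len(weights))]
        grad_b = sum(p - y for p, y in zip(probs, labels)) / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias, loss
```

In a real system the forward pass would run the whole pre-training model and a deep learning framework would compute the gradients; the sketch only shows the control flow of the five steps.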

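Optionally, the MMD between the class cluster of a category and the sample information set of a known protocol type is calculated by the following formula:

$$\mathrm{MMD}_{\mathcal{H}}(D_{ti}, D_{sk}) = \left\| \frac{1}{n_{ti}} \sum_{x \in D_{ti}} \Phi(x) - \frac{1}{n_{sk}} \sum_{x' \in D_{sk}} \Phi(x') \right\|_{\mathcal{H}}$$

wherein $D_{ti}$ is the class cluster of category $i$, $D_{sk}$ is the sample information set of known protocol type $k$, $n_{ti}$ is the number of unidentified network traffic flows corresponding to $D_{ti}$, $n_{sk}$ is the number of sample network traffic flows corresponding to $D_{sk}$, and the subscript $\mathcal{H}$ denotes that the distance is measured after $\Phi(\cdot)$ maps the data into the reproducing kernel Hilbert space (RKHS).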
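As a rough illustration of how such an MMD might be evaluated and used for protocol matching, the sketch below estimates it with the kernel trick instead of an explicit mapping Φ. The Gaussian kernel, the bandwidth `sigma`, the flattening of each flow's information into one numeric vector, and all function names are assumptions made for illustration; the source does not specify them.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); an assumed kernel choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], sigma: float) -> float:
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd(cluster: Sequence[Vector], samples: Sequence[Vector], sigma: float = 1.0) -> float:
    """Biased MMD estimate: the norm of the difference of the two mean embeddings."""
    squared = (mean_kernel(cluster, cluster, sigma)
               - 2.0 * mean_kernel(cluster, samples, sigma)
               + mean_kernel(samples, samples, sigma))
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative rounding


def match_protocol(cluster: Sequence[Vector],
                   sample_sets: Dict[str, Sequence[Vector]],
                   sigma: float = 1.0) -> str:
    """Return the known protocol type whose sample set has the smallest MMD."""
    return min(sample_sets, key=lambda name: mmd(cluster, sample_sets[name], sigma))
```

The squared norm expands into three averaged kernel terms, which is why no explicit feature map is needed.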

Optionally, after the pre-training model corresponding to the protocol type matched with the category is used as a basis to construct the network traffic identification model of the category through a deep migration learning method, the method further includes:

aiming at each unidentified network flow of the category, constructing a message two-dimensional data matrix according to the message information of the unidentified network flow, and constructing a behavior two-dimensional data matrix according to the communication behavior information of the unidentified network flow;

and inputting the constructed message two-dimensional data matrix and the behavior two-dimensional data matrix into the network traffic identification model of the category, and determining whether the unidentified network traffic is malicious traffic.

Optionally, the network traffic identification model of the target category includes: a first convolution layer, a second convolution layer, a fully connected layer and an output layer; the network traffic identification model of the target category identifies whether the network traffic to be identified is malicious traffic through the following steps:

the first convolution layer convolves the message two-dimensional data matrix with a two-dimensional convolution kernel to obtain a first feature map;

the second convolution layer convolves the behavior two-dimensional data matrix with a two-dimensional convolution kernel to obtain a second feature map;

the fully connected layer integrates the first feature map and the second feature map to obtain a third feature map;

and the output layer applies a preset classification algorithm to the third feature map to obtain and output whether the network traffic to be identified is malicious traffic.
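The two-branch structure described above can be sketched as a plain forward pass. This is a minimal sketch, not the patented model: the kernel sizes, the ReLU activations, the sigmoid output used as the "preset classification algorithm", and every parameter name are assumptions. The 0/1 output encoding (0 malicious, 1 normal) follows the optional encoding given in the detailed description.

```python
import math
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """'Valid' 2D convolution as used in CNNs (cross-correlation, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu_flatten(feature_map: Matrix) -> List[float]:
    return [max(v, 0.0) for row in feature_map for v in row]


def identify(message_matrix: Matrix, behavior_matrix: Matrix,
             msg_kernel: Matrix, beh_kernel: Matrix,
             fc_weights: Matrix, fc_bias: List[float],
             out_weights: List[float], out_bias: float) -> int:
    """Return 0 for malicious traffic and 1 for normal traffic."""
    first = conv2d(message_matrix, msg_kernel)        # first feature map
    second = conv2d(behavior_matrix, beh_kernel)      # second feature map
    merged = relu_flatten(first) + relu_flatten(second)
    # The fully connected layer integrates both branches into a third feature vector.
    third = [max(sum(w * x for w, x in zip(row, merged)) + b, 0.0)
             for row, b in zip(fc_weights, fc_bias)]
    score = sum(w * h for w, h in zip(out_weights, third)) + out_bias
    prob_normal = 1.0 / (1.0 + math.exp(-score))
    return 1 if prob_normal >= 0.5 else 0
```

Each row of `fc_weights` must have as many entries as the two flattened feature maps combined.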

Optionally, after obtaining the target category of the class cluster corresponding to the shortest distance, the method further includes:

determining the protocol type of the network traffic to be identified as the target protocol type matched with the target category;

if the target protocol type is a protocol type in a preset white list, determining that the network traffic to be identified is trusted network traffic, wherein the preset white list comprises the protocol type of the trusted network traffic;

and if the target protocol type is a protocol type in a preset blacklist, determining that the network traffic to be identified is untrusted network traffic, wherein the preset blacklist comprises the protocol type of the untrusted network traffic.
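The white list and blacklist check can be sketched as a simple lookup. The protocol names, the return labels, the precedence of the white list when a protocol appears on both lists, and the "undetermined" result for a protocol on neither list are assumptions; the source does not cover those cases.

```python
from typing import AbstractSet


def judge_by_protocol(target_protocol: str,
                      whitelist: AbstractSet[str],
                      blacklist: AbstractSet[str]) -> str:
    """Classify traffic as trusted or untrusted from its matched protocol type."""
    if target_protocol in whitelist:
        return "trusted"
    if target_protocol in blacklist:
        return "untrusted"
    # Assumed fallback: the source does not say what happens in this case.
    return "undetermined"
```

For example, with hypothetical lists `{"proto_a"}` and `{"proto_x"}`, traffic matched to `"proto_x"` would be reported as untrusted.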

In a second aspect, an embodiment of the present invention provides a network traffic identification device based on deep migration learning, where the device includes:

the data acquisition module is used for extracting message information and communication behavior information of a preset number of data packets from network traffic to be identified, wherein the network traffic to be identified comprises network traffic generated in a session establishment stage and network traffic transmitted based on the established session;

the distance calculation module is used for calculating the distance between the message information and the communication behavior information of the network flow to be identified and the clustering center of each cluster, and each cluster comprises the message information and the communication behavior information of one type of network flow;

the classification module is used for obtaining the target class of the class cluster corresponding to the shortest distance when the shortest distance in the calculated distances is smaller than a preset distance;

the traffic identification module is used for inputting the message two-dimensional data matrix corresponding to the message information and the behavior two-dimensional data matrix corresponding to the behavior information into the network traffic identification model of the target category, and determining whether the network traffic to be identified is malicious traffic;

Optionally, the apparatus further comprises: a dividing module, an unknown protocol clustering module, a protocol type matching module and an identification model building module;

the data acquisition module is further configured to obtain sample information sets of known protocol types before extracting message information and communication behavior information of a preset number of data packets from the network traffic to be identified, where each sample information set of a known protocol type includes: the message information and the communication behavior information of a preset number of data packets in the sample network traffic of the known protocol type;

the dividing module is used for dividing the pre-collected network traffic of undetermined protocol type by taking a session as a unit to obtain a plurality of unidentified network traffic;

the data acquisition module is also used for extracting message information and communication behavior information of a preset number of data packets from each unidentified network flow;

the unknown protocol clustering module is used for clustering the message information and the communication behavior information of the plurality of unidentified network flows to obtain clusters of various categories;

the protocol type matching module is used for calculating, for each category, the maximum mean discrepancy (MMD) between the class cluster of the category and the sample information set of each known protocol type, determining the known protocol type whose sample information set has the minimum MMD with the class cluster of the category, and taking the determined known protocol type as the protocol type matched with the category;

the identification model construction module is used for constructing the network traffic identification model of the category by a deep migration learning method on the basis of a pre-training model corresponding to the protocol type matched with the category.

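Optionally, the identification model building module is specifically configured to construct the network traffic identification model of the target category through steps one to five described in the first aspect, with the MMD calculated as described in the first aspect.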

Optionally, the apparatus further comprises: a data matrix construction module;

the data matrix construction module is used for, after the network traffic identification model of the category is constructed by a deep migration learning method on the basis of the pre-training model corresponding to the protocol type matched with the category, constructing, for each unidentified network traffic of the category, a message two-dimensional data matrix according to the message information of the unidentified network traffic and a behavior two-dimensional data matrix according to the communication behavior information of the unidentified network traffic;

the traffic identification module is further configured to input the constructed two-dimensional data matrix of the message and the two-dimensional data matrix of the behavior into the network traffic identification model of the category, and determine whether the unidentified network traffic is malicious traffic.

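Optionally, the network traffic identification model of the target category includes: a first convolution layer, a second convolution layer, a fully connected layer and an output layer; the traffic identification module is specifically configured to identify whether the network traffic to be identified is malicious traffic through these layers, in the manner described in the first aspect.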

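Optionally, the apparatus further comprises a traffic determination module, configured to:

after the target category of the class cluster corresponding to the shortest distance is obtained, determine the protocol type of the network traffic to be identified as the target protocol type matched with the target category, and determine whether the network traffic to be identified is trusted or untrusted according to the preset white list and the preset blacklist, as described in the first aspect.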

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

and the processor is used for realizing the steps of any network traffic identification method based on deep migration learning when executing the program stored in the memory.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when executed by a processor, the computer program implements any of the steps of the deep migration learning based network traffic identification method described above.

In a fifth aspect, an embodiment of the present invention further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute any of the above-mentioned network traffic identification methods based on deep migration learning.

The technical scheme of the embodiment of the invention can bring at least the following beneficial effects. Because the message two-dimensional data matrix and the behavior two-dimensional data matrix input into the network traffic identification model can be extracted automatically, traffic features do not need to be designed and extracted manually, which improves the efficiency of identifying the protocol type of network traffic. Network traffic of an unknown protocol type is classified to obtain the target category closest to it, and is then identified with the network traffic identification model of that target category. The network traffic categories and sample distribution in a novel application scenario are obtained by unsupervised clustering, which overcomes the bottleneck that supervised learning is impossible because novel networks lack labeled samples of unknown network protocols. By comparing the distribution differences of the clustered network traffic, the pre-training model of the known network protocol that matches each class cluster is found and used for transfer learning, which improves the identification accuracy for network traffic of unknown protocols.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings without creative effort.

Fig. 1 is a flowchart of a network traffic identification method based on deep migration learning according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a network traffic identification model according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of another network traffic identification model according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of another network traffic identification model according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of another network traffic identification model according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a network traffic identification apparatus based on deep migration learning according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of another network traffic identification apparatus based on deep migration learning according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to identify a novel network traffic of an unknown protocol type, the embodiment of the invention provides a network traffic identification method based on deep learning, and the method can be applied to electronic equipment, wherein the electronic equipment can be equipment with data processing capability, such as a mobile phone, a computer, a tablet computer and the like. As shown in fig. 1, the method includes the following steps.

Step 101, extracting message information and communication behavior information of a preset number of data packets from network traffic to be identified.

In one embodiment, a preset number of data packets in the network traffic to be identified may be collected by a probe deployed in bypass mode on a preset network node, and the message information and communication behavior information of each collected data packet may then be obtained.

Optionally, the data acquisition probe may collect the network traffic in bypass mode by optical splitting or traffic splitting, divide the network traffic by session, obtain the message information of the data packets in the network traffic of each session, and monitor the communication behavior information of those data packets. The probe then stores the obtained message information in a database as pcap files, where one pcap file corresponds to the set of data packets of one session, and stores the obtained communication behavior information in the database as logs. The probe can collect network traffic without affecting traffic transmission or the service applications of the network, and it can collect data across the whole network.
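Dividing traffic by session can be sketched as grouping packets under a direction-independent key. This is only an illustration: the `Packet` record, its field names, and the use of the two endpoints plus the transport protocol as the session key are assumptions, since the source does not say how a session is delimited.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    timestamp: float
    payload: bytes


SessionKey = Tuple[Tuple[str, int], Tuple[str, int], str]


def session_key(packet: Packet) -> SessionKey:
    """Order the two endpoints so both directions map to the same session."""
    first, second = sorted([(packet.src_ip, packet.src_port),
                            (packet.dst_ip, packet.dst_port)])
    return (first, second, packet.protocol)


def split_by_session(packets: List[Packet]) -> Dict[SessionKey, List[Packet]]:
    """Group packets by session, keeping each session in timestamp order."""
    sessions: Dict[SessionKey, List[Packet]] = defaultdict(list)
    for packet in sorted(packets, key=lambda p: p.timestamp):
        sessions[session_key(packet)].append(packet)
    return dict(sessions)
```

Each resulting packet list plays the role of one per-session pcap file in the description above.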

When acquiring the information, the electronic device can read the pcap file from the database and extract the payloads of a preset number of data packets to obtain the message information, and can read the log corresponding to the session to which the network traffic to be identified belongs to obtain the communication behavior information. The message information is part of the payload of the data packet.

Specifically, the communication of a network session may be divided into two phases: the first is a session establishment phase and the second is a data transmission phase. Optionally, the network session in the embodiment of the present invention may be an encrypted network session or a plaintext network session.

The communication of an encrypted network session can likewise be divided into two phases: the first is a plaintext phase for establishing the connection, which may be called the session establishment phase and includes handshaking, authentication and key exchange, during which a session key is generated; in the second phase, the transmitted data is encrypted with the key generated in the first phase.

Therefore, the network traffic in the embodiment of the present invention includes the network traffic generated in the session establishment stage and the network traffic transmitted based on the established session.

Illustratively, the preset number may be 6. When the number of data packets in the network traffic is smaller than the preset number, data packets whose values are all 0 may be appended until the number of data packets equals the preset number.
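The padding rule can be sketched as follows. The function name and the length of each appended all-zero packet (`zero_packet_length`) are assumptions; the source only says that the appended packets have the value 0.

```python
from typing import List


def first_packets(payloads: List[bytes], preset_number: int = 6,
                  zero_packet_length: int = 1) -> List[bytes]:
    """Keep the first `preset_number` payloads, appending all-zero packets if short."""
    selected = list(payloads[:preset_number])
    missing = preset_number - len(selected)
    # bytes(n) yields n zero bytes; the length is an illustrative assumption.
    selected.extend(bytes(zero_packet_length) for _ in range(missing))
    return selected
```

A session with only two packets therefore yields six entries, the last four being zero-valued placeholders.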

Step 102, calculating the distance between the message information and the communication behavior information of the network traffic to be identified and the cluster center of each class cluster.

Each class cluster comprises the message information and communication behavior information of a preset number of data packets in network traffic of one category.

For example, the Euclidean distances between the message information and communication behavior information of the network traffic to be identified and the cluster centers of the class clusters may be calculated.

Step 103, when the shortest distance in the calculated distances is smaller than a preset distance, obtaining the target category of the class cluster corresponding to the shortest distance.

Optionally, when the shortest distance in the calculated distances is not less than the preset distance, it is determined that the network traffic to be identified is unknown network traffic.

It can be understood that when the shortest distance is less than the preset distance, the network traffic to be identified is similar to the network traffic of the corresponding category, and it may be determined that the network traffic to be identified belongs to that category. When the shortest distance is not less than the preset distance, the network traffic to be identified is not similar to the network traffic of any category.
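Steps 102 and 103 can be sketched as a nearest-center lookup with a distance threshold. The sketch assumes that the message information and communication behavior information of a flow have already been flattened into one numeric feature vector; that encoding, the category names, and the function name are illustrative assumptions.

```python
import math
from typing import Dict, Optional, Sequence


def assign_category(feature: Sequence[float],
                    centers: Dict[str, Sequence[float]],
                    preset_distance: float) -> Optional[str]:
    """Return the closest category, or None if the traffic is unknown."""
    best_category: Optional[str] = None
    best_distance = math.inf
    for category, center in centers.items():
        # Euclidean distance between the flow and this cluster center.
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(feature, center)))
        if distance < best_distance:
            best_category, best_distance = category, distance
    # Only a shortest distance strictly below the threshold yields a target category.
    return best_category if best_distance < preset_distance else None
```

A return value of `None` corresponds to the case where the traffic is treated as unknown network traffic.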

Step 104, inputting the message two-dimensional data matrix corresponding to the message information and the behavior two-dimensional data matrix corresponding to the behavior information into a network traffic identification model of the target category, and determining whether the network traffic to be identified is malicious traffic.

The network traffic identification model of the target category is a model constructed by a deep migration learning method on the basis of a pre-training model corresponding to a target protocol type matched with the target category. The pre-training model corresponding to the target protocol type is a model obtained by training a deep learning model with a training sample set corresponding to the target protocol type. The training sample set corresponding to the target protocol type comprises a sample two-dimensional data matrix of each sample network traffic of the target protocol type and the normal or malicious label corresponding to that sample network traffic, where the sample two-dimensional data matrix of each sample network traffic includes a sample message two-dimensional data matrix and a sample behavior two-dimensional data matrix constructed respectively from the message information and the communication behavior information of the preset number of data packets in the sample network traffic.

Optionally, the deep learning model may include: convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and the like.

In one embodiment, after the message two-dimensional data matrix and the behavior two-dimensional data matrix are input into the network traffic identification model of the target category, the output result of the network traffic identification model is obtained. Optionally, the output result may be 0 or 1, where 0 represents that the network traffic to be identified is malicious traffic, and 1 represents that the network traffic to be identified is normal traffic.

Optionally, before the network traffic identification model is input, the two-dimensional message data matrix and the two-dimensional behavior data matrix may be preprocessed respectively, and then the preprocessed two-dimensional message data matrix and the preprocessed two-dimensional behavior data matrix are input into the network traffic identification model. For example, the pre-processing may be a normalization of the two-dimensional data matrix.
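One possible form of this preprocessing is min-max scaling of each matrix to the range [0, 1]. The source only says that the matrices may be normalized, so the scaling method and the function name below are assumptions.

```python
from typing import List, Sequence


def normalize(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Min-max scale a two-dimensional data matrix to [0, 1]."""
    values = [v for row in matrix for v in row]
    low, high = min(values), max(values)
    if high == low:
        # A constant matrix carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - low) / (high - low) for v in row] for row in matrix]
```

For byte-valued message matrices, dividing every entry by 255 would be a comparable alternative.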

The technical scheme of the embodiment of the invention can bring at least the following beneficial effects. Because the message two-dimensional data matrix and the behavior two-dimensional data matrix input into the network traffic identification model can be extracted automatically, traffic features do not need to be designed and extracted manually, which improves the efficiency of identifying the protocol type of network traffic. Network traffic of an unknown protocol type is classified to obtain the target category closest to it, and is then identified with the network traffic identification model of that target category. The network traffic categories and sample distribution in a novel application scenario are obtained by unsupervised clustering, which overcomes the bottleneck that supervised learning is impossible because novel networks lack labeled samples of unknown network protocols. By comparing the distribution differences of the clustered network traffic, the pre-training model of the known network protocol that matches each class cluster is found and used for transfer learning, which improves the identification accuracy for network traffic of unknown protocols.

In the embodiment of the present invention, before step 104, a message two-dimensional data matrix and a behavior two-dimensional data matrix of the network traffic to be identified may also be constructed.

In one embodiment, for each of the preset number of data packets, message information of a first preset length may be extracted from the data packet; the message information of the first preset length of each data packet is then assembled, in the order of the preset number of data packets, into the message two-dimensional data matrix.

Optionally, constructing an m₁×m₁ message two-dimensional data matrix from r₁ data packets includes: for each data packet, extracting in order the first m₁·k bytes of the message information of the data packet, where k < m₁ and m₁ = k·r₁. If the message information of the data packet is shorter than m₁·k bytes, the message information is zero-padded. The m₁·k bytes of the first data packet are filled into columns 1 to k of the two-dimensional matrix, the m₁·k bytes of the second data packet are filled into columns k+1 to 2k, and so on, until the m₁·k bytes of the r₁-th data packet are filled into columns m₁−k+1 to m₁.

In the embodiment of the present invention, m₁ can be set according to actual needs, for example m₁ = 42. With r₁ = 6 (so that k = 7) and a first preset length of 294 bytes, the 42 × 42 message two-dimensional data matrix may be constructed as follows: the first 294 bytes of the message information of each data packet are extracted, and message information shorter than 294 bytes is zero-padded. The 294 bytes of the first data packet fill columns 1 to 7 of the two-dimensional matrix, the 294 bytes of the second data packet fill columns 8 to 14, and so on, until the 294 bytes of the sixth data packet fill columns 36 to 42.
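The column-block filling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_message_matrix` is hypothetical, and the order in which a packet's bytes are laid out inside its block of columns (column by column here) is an assumption, since the text does not specify it.

```python
def build_message_matrix(packets, m=42, r=6):
    """Build an m x m matrix from the first r packets, k = m // r columns each."""
    if m % r != 0:
        raise ValueError("m must be a multiple of r")
    k = m // r              # columns reserved for each packet (7 when m=42, r=6)
    block_len = m * k       # bytes kept per packet (294 when m=42, r=6)
    matrix = [[0] * m for _ in range(m)]
    for p in range(r):
        raw = packets[p] if p < len(packets) else b""
        # Truncate long payloads and zero-pad short ones to exactly block_len bytes.
        data = raw[:block_len].ljust(block_len, b"\x00")
        for idx, byte in enumerate(data):
            # Assumption: bytes run down one column, then continue in the next.
            col = p * k + idx // m
            row = idx % m
            matrix[row][col] = byte
    return matrix


# Six synthetic packets; packet i is filled with the byte value i + 1.
example_packets = [bytes([i + 1]) * 300 for i in range(6)]
example_matrix = build_message_matrix(example_packets)
```

With these defaults, columns 1 to 7 hold the first packet and columns 36 to 42 hold the sixth, matching the 42 × 42 example in the text.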

In one embodiment, for each of the preset number of data packets, specified information in the communication behavior information of the data packet may be extracted; the specified information of the data packets is then arranged, in the order of the data packets, into the behavior two-dimensional data matrix.

Optionally, the specified information may be: statistical information, data packet length, timestamp difference of adjacent data packets, and data packet sequence information. The statistical information may include the session communication port, the total number of data packets in the session, the direction of the data packets, the session communication duration, and the like; the sequence information may be a sequence number. The m₂ × m₂ behavior two-dimensional matrix is constructed as follows: the first i columns of the matrix correspond to the lengths of the r₂ data packets, columns i+1 to i+j correspond to the timestamp differences of adjacent data packets among the r₂ data packets, columns i+j+1 to i+j+l correspond to the sequence information of the r₂ data packets, and columns i+j+l+1 to n correspond to the statistical information of the r₂ data packets, where i, j, l, and i+j+l are each less than n.

In the embodiment of the present invention, m₂ can be set according to actual needs, for example m₂ = 32. With r₂ = 12, the 32 × 32 behavior two-dimensional data matrix is: columns 1 to 6 of the matrix correspond to the lengths of the first 6 data packets; columns 8 to 14 correspond to the timestamp differences of adjacent data packets among the first 6 data packets; columns 15 to 21 correspond to the sequence information of the first 6 data packets; and columns 22 to 32 correspond to the statistical information of the first 6 data packets.

Because the information of the input model in the embodiment of the invention is the message two-dimensional data matrix and the behavior two-dimensional data matrix, the message information and the communication behavior information of the network flow can be simultaneously embodied, and the method is more suitable for the structural form of the network flow data.

In the embodiment of the present invention, the message information includes: the original messages of a preset number of data packets in the network traffic to be identified. The communication behavior information includes at least one of the following: statistical information of the preset number of data packets in the network traffic to be identified, data packet sequence information, data packet length, data packet timestamps, and timestamp differences of adjacent data packets.

The timestamp of a data packet may be its sending time, and the timestamp difference of adjacent data packets may be, starting from the second data packet of the session, the difference between the sending time of each data packet and the sending time of the previous data packet.
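As a small illustration of the timestamp difference, the sketch below computes the gap between each data packet and the one sent immediately before it, starting from the second packet. The function name `timestamp_deltas` and the sample times are hypothetical, and reading the difference as "each packet minus the preceding packet" is an assumption about the intended meaning.

```python
def timestamp_deltas(send_times):
    """Return the gap between each packet and the packet sent just before it."""
    return [later - earlier for earlier, later in zip(send_times, send_times[1:])]


# Four packets sent at 0.0 s, 0.5 s, 2.0 s and 2.25 s within one session.
deltas = timestamp_deltas([0.0, 0.5, 2.0, 2.25])
print(deltas)  # [0.5, 1.5, 0.25]
```

A session with n packets yields n - 1 differences, so the first packet contributes no value of its own.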

The message information of a data packet may be understood as information reflecting the content of the data packet, that is, information reflecting the static characteristics of the data packet in the communication process. For example, the message information may include field values of the data packet, header information of the data packet, and the like.

The communication behavior information of the data packet may be understood as attribute information of the data packet related to the communication process, that is, information reflecting the dynamic characteristics of the data packet during the communication process.

The technical scheme of the embodiment of the invention can also bring the following beneficial effects: because the message information and the communication behavior information of the data packet in the network flow can fully reflect the protocol type of the network flow, the message information and the communication behavior information of the data packet in the network flow to be identified can be obtained, and the network flow to be identified can be identified more accurately.

In the embodiment of the present invention, before the step 101, a network traffic identification model of each category may be further constructed, and specifically, the following steps may be included.

Step 1, obtaining a sample information set of each known protocol type. The sample information set of each known protocol type comprises: message information and communication behavior information of a preset number of data packets in the sample network traffic of the known protocol type.

Optionally, the sample information set of each protocol type may further include: the message two-dimensional data matrix and the behavior two-dimensional data matrix of the sample network traffic of the protocol type.

For example, the source domain contains sample network traffic of N known protocol types, denoted P_s1, ..., P_sN. The sample network traffic is divided by protocol type into N sample information sets: D_s1, D_s2, …, D_sN. Each sample information set contains message information and communication behavior information of both normal network traffic and malicious network traffic.

Optionally, for each known protocol type, a deep learning model may be trained with the training sample set of the known protocol type to obtain a pre-training model of the known protocol type, so that N independent models are formed. Each pre-training model can identify whether network traffic of its known protocol type is normal traffic or malicious traffic.

In an alternative implementation, N may be set to 15.

Step 2, dividing pre-collected network traffic of undetermined protocol types by session to obtain a plurality of unidentified network traffic.

Step 3, extracting the message information and the communication behavior information of the preset number of data packets from each unidentified network traffic.

Step 4, clustering the message information and the communication behavior information of the plurality of unidentified network traffic to obtain a class cluster for each category.

Optionally, the K-means clustering algorithm may be used for clustering, or other clustering algorithms may be used, which is not specifically limited in this embodiment of the present invention. For example, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm may be used.

The clustering result is M class clusters: D_t1, D_t2, …, D_tM. In an optional implementation, M may be set to 7.
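The clustering in step 4 can be illustrated with a plain K-means loop. This is a minimal sketch under stated assumptions: each unidentified network traffic is assumed to have already been reduced to a fixed-length numeric feature vector, and the names `kmeans` and `squared_distance`, the random initialisation, and the iteration cap are illustrative choices rather than details taken from the text.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Cluster points into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct points drawn at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # leave an empty cluster in place
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, labels


# Two well-separated groups of hypothetical 2-D feature vectors.
features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(features, k=2)
```

The returned centers are the cluster centers against which new traffic is later compared, and the members of each label form one class cluster D_ti.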

Step 5, for each category, calculating the Maximum Mean Discrepancy (MMD) between the class cluster of the category and the sample information set of each known protocol type, determining the known protocol type whose sample information set has the smallest MMD to the class cluster of the category, and taking the determined known protocol type as the protocol type matched with the category. The protocol type of the network traffic of the category is then determined to be the protocol type matched with the category.

In one embodiment, the MMD between a class cluster of a class and a sample information set of a known protocol type can be calculated by equation (1):

where n_sk is the number of sample network traffic items in D_sk, and H indicates that the distance is measured after the mapping Φ(·) maps the data into a reproducing kernel Hilbert space (RKHS).
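The body of equation (1) is not reproduced in this text. As a hedged reconstruction, assuming the standard empirical MMD estimator and assuming that n_ti denotes the number of unidentified network traffic items in the class cluster D_ti, a form consistent with the symbols defined here would be:

```latex
\mathrm{MMD}(D_{ti}, D_{sk}) =
\left\| \frac{1}{n_{ti}} \sum_{x_t \in D_{ti}} \Phi(x_t)
      - \frac{1}{n_{sk}} \sum_{x_s \in D_{sk}} \Phi(x_s) \right\|_{H}
```

The summation variables x_t and x_s are notation introduced only for this sketch.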

For example, n_ti may be 1000 and n_sk may be 700.
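The matching in step 5 can be sketched numerically as follows. This is an illustration under assumptions, not the patented computation: it uses the common biased estimate of the squared MMD with a Gaussian (RBF) kernel, the kernel and its `gamma` value are arbitrary choices, and the function names and protocol labels are hypothetical.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian kernel; the kernel choice is an assumption of this sketch."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(rbf_kernel(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(cluster, samples, gamma=0.5):
    """Biased estimate of the squared MMD between two sets of vectors."""
    return (mean_kernel(cluster, cluster, gamma)
            + mean_kernel(samples, samples, gamma)
            - 2.0 * mean_kernel(cluster, samples, gamma))


def match_protocol(cluster, sample_sets, gamma=0.5):
    """Return the known protocol type whose sample set is closest to the cluster."""
    return min(sample_sets,
               key=lambda name: mmd_squared(cluster, sample_sets[name], gamma))


# Hypothetical feature vectors for one class cluster and two known protocol types.
cluster = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.1)]
sample_sets = {
    "proto_a": [(0.0, 0.0), (0.1, 0.2)],
    "proto_b": [(5.0, 5.0), (5.1, 4.9)],
}
best = match_protocol(cluster, sample_sets)
```

Because each pairwise term is evaluated explicitly, the cost grows with the product of the set sizes, so sets of the sizes mentioned above (1000 and 700) would involve on the order of a million kernel evaluations per comparison.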

Step 6, constructing the network traffic identification model of the category by a deep migration learning method, based on the pre-training model corresponding to the protocol type matched with the category.

Step 7, for each unidentified network traffic of the category, constructing a message two-dimensional data matrix according to the message information of the unidentified network traffic, and constructing a behavior two-dimensional data matrix according to the communication behavior information of the unidentified network traffic.

Step 8, inputting the constructed message two-dimensional data matrix and behavior two-dimensional data matrix into the network traffic identification model of the category, and determining whether the unidentified network traffic is malicious traffic.

In the embodiment of the invention, the type and the sample distribution of the network flow of the unknown network protocol are obtained by the unsupervised clustering method, so that the bottleneck that the novel network lacks unknown network protocol sample labels and can not be supervised and learned can be effectively overcome. By comparing the sample distribution difference, the pre-training model of the known network protocol most similar to each unknown protocol is found out for transfer learning, and the accuracy of unknown protocol identification can be improved to the greatest extent.

In the embodiment of the invention, the mode of obtaining the network traffic identification models of all categories by the transfer learning method is the same. The following describes a procedure of constructing a network traffic recognition model, taking the construction of a network traffic recognition model of a target class as an example.

Step one, inputting a sample two-dimensional data matrix of sample network flow of a target protocol type into a pre-training model corresponding to the target protocol type.

Step two, obtaining an output result of the pre-training model corresponding to the target protocol type.

Optionally, the output result of the pre-training model may be 1 or 0, where 1 indicates that the input network traffic is normal traffic and 0 indicates that the input network traffic is malicious traffic.

Step three, calculating a loss value according to the output result, the normal or malicious label corresponding to the sample network traffic of the target protocol type, and the MMD between the sample information set of the target protocol type and the class cluster of the target category.

In an optional embodiment, to bring the data distributions of D_ti and D_sk closer, the loss function may be calculated based on the Deep Domain Confusion (DDC) method and minimized by gradient descent.

The loss function may be formula (2).

where L is the loss function; the classification loss is computed on D_sk, the sample information set of known protocol type k, and on the output of the network traffic identification model for that sample information set; λ is a preset hyper-parameter; and D_ti is the class cluster of category i. The MMD can be calculated with reference to equation (1).
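The body of formula (2) is likewise not reproduced in this text. As a hedged reconstruction that follows the usual DDC formulation, and assuming the hypothetical symbols L_C for the classification loss and ŷ_sk for the model output on D_sk, the loss would take a form such as:

```latex
L = L_C\left(D_{sk}, \hat{y}_{sk}\right) + \lambda \, \mathrm{MMD}^2\left(D_{sk}, D_{ti}\right)
```

Whether the MMD term is squared is an assumption carried over from the DDC method; the first term rewards correct normal/malicious classification on the labeled source samples, and the second penalizes distribution mismatch between the source set and the class cluster.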

In the embodiment of the invention, an adaptation layer is added between the source domain and the target domain and a domain confusion loss is added to the loss function, so that the model learns to classify while reducing the distribution difference between the source domain and the target domain, thereby achieving domain adaptation. The value of the hyper-parameter λ determines the strength of the domain confusion.

For example, as shown in fig. 2, the network traffic identification model includes a first convolutional layer 201, a second convolutional layer 202, a first fully-connected layer 203, a second fully-connected layer 204, a third fully-connected layer 205, a domain adaptation layer 206, a fourth fully-connected layer 207, and an output layer 208. Wherein, the first convolution layer 201 and the second convolution layer 202 both include 5 layers, and the first full connection layer 203 and the second full connection layer 204 both include 3 layers. The domain adaptation layer 206 is used to compute the MMD between the sample information set of the target protocol type and the class cluster of the target class.

Step four, if the pre-training model corresponding to the target protocol type is determined to have converged based on the loss value, determining that the network traffic identification model of the target category is the pre-training model corresponding to the target protocol type.

In one embodiment, if the difference between the loss value calculated this time and the loss value calculated last time is less than a preset difference, it is determined that the model is converged. And if the difference value between the loss value calculated this time and the loss value calculated last time is not less than the preset difference value, determining that the model is not converged.

In another embodiment, if the loss value calculated this time is smaller than a preset value, it is determined that the model converges. And if the loss value calculated this time is not less than the preset value, determining that the model is not converged.
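The two convergence criteria can be written as simple predicates. This is a minimal sketch: the function names and the default thresholds are hypothetical, and taking the absolute value of the loss difference is an assumption, since the text speaks only of "the difference".

```python
def converged_by_delta(previous_loss, current_loss, preset_difference=1e-4):
    """First criterion: the loss changed by less than a preset difference."""
    # Assumption: the magnitude of the change is compared, not its sign.
    return abs(current_loss - previous_loss) < preset_difference


def converged_by_value(current_loss, preset_value=0.05):
    """Second criterion: the loss itself fell below a preset value."""
    return current_loss < preset_value


# Hypothetical loss values from two consecutive training iterations.
print(converged_by_delta(0.31250, 0.31245))  # True
print(converged_by_value(0.31245))           # False
```

The first criterion stops training when progress stalls, regardless of how large the loss still is; the second stops only once an absolute quality target has been reached.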

The embodiment of the invention is based on a deep migration learning method, migrates the message information and the communication behavior information of the known network protocol to construct the network flow identification model of the unknown network protocol, and can quickly and effectively realize the identification capability of the malicious flow of the unknown network protocol by utilizing the existing network protocol knowledge.

The migration network can be regarded as a dual-channel structure in which the two channels share weights. The target domain may be a category, and the source domain is the known protocol type that matches the category. Channel A takes source-domain samples as input; fig. 2 may be regarded as channel A, in which a domain adaptation layer is added between the fully-connected layers to determine the data distribution difference between the source domain and the target domain, which may also be referred to as the domain adaptation loss.

It will be appreciated that, to adapt two different domains, the distribution difference between them needs to be evaluated; using the MMD algorithm, the samples of the different domains can be mean-embedded into the RKHS to estimate the difference between the probability distributions of the two domains.

Channel B takes network traffic of the target domain as input and, compared with channel A, has no domain adaptation layer. The weight of each of its network layers is the same as the corresponding weight in channel A.

The source-domain data produces an output through channel A, and a classification loss is computed from that output and the labels; minimizing this loss drives the model toward more accurate outputs. Therefore, in the embodiment of the present invention, after the source-domain data is output through channel A, the classification loss value is calculated against the labels and its error is back-propagated together with the domain adaptation loss. Since the target-domain data is unlabeled, channel B does not perform task-specific loss calculation or back propagation. Target domain adaptation is thereby achieved.

That is, the migration process of the model may be regarded as training channel A; the network layers of the trained channel A other than the adaptation layer constitute channel B, namely the network traffic identification model.

In an implementation manner of the embodiment of the present invention, as shown in fig. 3, the network traffic identification model of the target class includes: a first convolution layer 301, a second convolution layer 302, a full-link layer 303, and an output layer 304; the network traffic identification model of the target category identifies whether the network traffic to be identified is malicious traffic or not through the following steps:

Step (1), the first convolutional layer 301 convolves the message two-dimensional data matrix with a two-dimensional convolution kernel to obtain a first feature map.

Optionally, the first convolutional layer 301 may include one or more layers. A convolutional layer extracts features of the input data.

Step (2), the second convolutional layer 302 convolves the behavior two-dimensional data matrix with a two-dimensional convolution kernel to obtain a second feature map.

Optionally, the second convolutional layer 302 may include one or more layers. A convolutional layer extracts features of the input data.

In an optional embodiment, the first convolutional layer 301 and the second convolutional layer 302 may share weights.

Step (3), the fully-connected layer 303 integrates the first feature map and the second feature map to obtain a third feature map.

Optionally, the fully-connected layer 303 may include one or more layers. The fully-connected layer 303 maps its input feature maps to the dimension of the categories.

In one implementation, the network traffic identification model in the embodiment of the present invention may be based on the convolutional neural network AlexNet, with 5 fully-connected layers added after the convolutional layers, as shown in fig. 2.

Step (4), the output layer 304 processes the third feature map with a preset classification algorithm, and obtains and outputs whether the network traffic to be identified is malicious traffic.

In one embodiment, the classification algorithm employed by the output layer may be, for example, a Softmax regression algorithm, which normalizes the output of the model to determine whether the network traffic to be identified is normal traffic or malicious traffic.
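Steps (1) to (4) can be traced with a toy forward pass. This is a sketch under assumptions, not the model of fig. 3: each convolutional layer is reduced to a single kernel without activation or bias, the fully-connected layer is a single linear map over the concatenated feature maps, and all function names, kernels, and weights are hypothetical.

```python
import math


def conv2d_valid(matrix, kernel):
    """2-D 'valid' cross-correlation, the operation CNN convolution layers apply."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(matrix) - kh + 1, len(matrix[0]) - kw + 1
    return [[sum(matrix[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def flatten(feature_map):
    return [value for row in feature_map for value in row]


def dense(inputs, weights, biases):
    """One fully-connected layer: a weighted sum plus a bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Normalise scores into probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(message_matrix, behavior_matrix, msg_kernel, beh_kernel, weights, biases):
    first_map = conv2d_valid(message_matrix, msg_kernel)     # step (1)
    second_map = conv2d_valid(behavior_matrix, beh_kernel)   # step (2)
    merged = flatten(first_map) + flatten(second_map)        # joint input to step (3)
    scores = dense(merged, weights, biases)                  # step (3)
    return softmax(scores)                                   # step (4)


# Tiny hypothetical inputs: a 3x3 message matrix and a 2x2 behavior matrix.
probs = forward([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 1], [1, 1]],
                [[1, 0], [0, 1]], [[1, 1], [1, 1]],
                weights=[[0] * 5, [0] * 5], biases=[0, 0])
```

The two output probabilities sum to 1; which index stands for normal traffic and which for malicious traffic is a labeling convention fixed at training time.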

The technical scheme of the embodiment of the invention can also bring the following beneficial effects: the network flow identification model is combined with the static characteristics (message information) and the dynamic characteristics (communication behavior information) of the message to identify whether the network flow is malicious flow or not, so that the accuracy of model identification is improved.

In the embodiment of the present invention, the structure of the network traffic identification model is not limited to the structure shown in fig. 2 or fig. 3, and the structure of the network traffic identification model may be determined according to actual requirements. Examples of network traffic recognition models for other architectures are given below.

Optionally, in the network traffic identification model, a pooling layer may further be added after each convolutional layer. For example, as shown in fig. 4, the network traffic identification model includes: a first convolutional layer 401, a first pooling layer 402 after the first convolutional layer 401, a second convolutional layer 403, a second pooling layer 404 after the second convolutional layer 403, a fully-connected layer 405, and an output layer 406.

In an alternative embodiment, the first convolutional layer 401 and the second convolutional layer 403 may be shared by weight; the first pooling layer 402 and the second pooling layer 404 may be weight-shared.

Optionally, the traffic identification model may further include a plurality of pooling layers. For example, as shown in fig. 5, the network traffic identification model includes: a first convolutional layer 501, a first pooling layer 502, a first convolutional layer 501, a second pooling layer 503, a second convolutional layer 504, a third pooling layer 505, a second convolutional layer 504, a fourth pooling layer 506, a full-link layer 507, and an output layer 508.

The full connection layer in the network traffic identification model can be connected with each convolution layer and each pooling layer so as to retain the identification result of each network layer to a greater extent.

Besides being based on a convolutional neural network, the network traffic identification model in the embodiment of the present invention may also be based on other neural networks, which is not specifically limited in the embodiment of the present invention.

The technical scheme of the embodiment of the invention can also bring the following beneficial effects: the preprocessed two-dimensional data matrix of the sample message and the preprocessed two-dimensional data matrix of the sample behavior corresponding to the same session are respectively used as input data of two sub-models, static characteristics and dynamic behavior characteristics of network flow are deeply learned through local perception and weight sharing, weight parameters of the models are independently learned during training, and accurate recognition results can be achieved.

In the embodiment of the present invention, after the step 103, network traffic may be further classified based on a black and white list mechanism. The preset white list includes a protocol type of trusted network traffic, the preset black list includes a protocol type of untrusted network traffic, such as a protocol type of attack or abnormal network traffic, and the gray list includes a protocol type that does not belong to either the white list or the black list.

The white list and the black list can be established based on traditional traffic identification methods such as port identification and Deep Packet Inspection (DPI) identification.

Optionally, the specific classification manner includes:

Step (I), determining that the protocol type of the network traffic to be identified is the target protocol type matched with the target category.

Step (II), if the target protocol type is a protocol type in the preset white list, determining that the network traffic to be identified is trusted network traffic.

Step (III), if the target protocol type is a protocol type in the preset black list, determining that the network traffic to be identified is untrusted network traffic.

Step (IV), if the target protocol type is neither a protocol type in the preset white list nor a protocol type in the preset black list, determining that the network traffic to be identified is unknown network traffic.
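The list-based classification above reduces to two membership tests. In this minimal sketch the function name, the returned labels, and the protocol names in the example lists are all hypothetical; checking the white list before the black list is an assumption that only matters if a protocol type appears in both.

```python
def classify_by_lists(protocol_type, whitelist, blacklist):
    """Map a matched protocol type to trusted, untrusted, or unknown."""
    if protocol_type in whitelist:
        return "trusted"
    if protocol_type in blacklist:
        return "untrusted"
    return "unknown"  # in neither list, i.e. the gray-list case


# Hypothetical lists, e.g. built beforehand from port or DPI identification.
whitelist = {"proto_a", "proto_b"}
blacklist = {"proto_x"}
print(classify_by_lists("proto_a", whitelist, blacklist))  # trusted
print(classify_by_lists("proto_x", whitelist, blacklist))  # untrusted
print(classify_by_lists("proto_q", whitelist, blacklist))  # unknown
```

Using sets keeps each lookup constant-time, which matters when the check runs for every identified session.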

The scheme provided by the embodiment of the invention identifies whether the network traffic to be identified is malicious traffic. When a hacker uses network traffic to launch a network attack, the network traffic may be modified to carry malicious attack code; the scheme provided by the embodiment of the invention can identify such traffic. The scheme can therefore identify abnormal or malicious network traffic, which widens the application range of the embodiment of the invention.

Based on the same inventive concept, corresponding to the above method embodiment, an embodiment of the present invention provides a network traffic identification apparatus based on deep migration learning, and referring to fig. 6, the apparatus includes: a data acquisition module 601, a distance calculation module 602, a classification module 603, and a flow identification module 604.

The data acquisition module 601 is configured to extract message information and communication behavior information of a preset number of data packets from a to-be-identified network traffic, where the to-be-identified network traffic includes a network traffic generated in a session establishment stage and a traffic transmitted based on an established session;

a distance calculation module 602, configured to calculate distances between packet information and communication behavior information of network traffic to be identified and a cluster center of each cluster, where each cluster includes packet information and communication behavior information of network traffic of one category;

a classification module 603, configured to obtain a target class of a class cluster corresponding to a shortest distance when the shortest distance in the calculated distances is smaller than a preset distance;

the traffic identification module 604 is configured to input the message two-dimensional data matrix corresponding to the message information and the behavior two-dimensional data matrix corresponding to the behavior information into a network traffic identification model of a target category, and determine whether network traffic to be identified is malicious traffic;
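The work of the distance calculation module 602 and the classification module 603 can be sketched as a nearest-center lookup with a distance threshold. This is an illustration under assumptions: the Euclidean metric, the function names, the category labels, and returning `None` when no center is close enough are choices made for this sketch, since the text fixes neither the metric nor the handling of the far-away case.

```python
import math


def euclidean(a, b):
    """Euclidean distance; the metric is an assumption of this sketch."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_target_category(feature_vector, cluster_centers, preset_distance):
    """Return the nearest category, or None if even that center is too far away."""
    category, distance = min(
        ((name, euclidean(feature_vector, center))
         for name, center in cluster_centers.items()),
        key=lambda item: item[1],
    )
    return category if distance < preset_distance else None


# Hypothetical cluster centers for two categories.
centers = {"category_1": (0.0, 0.0), "category_2": (10.0, 10.0)}
target = select_target_category((1.0, 1.0), centers, preset_distance=5.0)
```

The selected category then determines which network traffic identification model the traffic identification module 604 feeds the two data matrices into.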

Optionally, as shown in fig. 7, the apparatus further includes: a dividing module 605, an unknown protocol clustering module 606, a protocol type matching module 607 and an identification model constructing module 608;

the data acquisition module 601 is further configured to obtain a sample information set of each known protocol type before extracting message information and communication behavior information of a preset number of data packets from the network traffic to be identified, where the sample information set of each known protocol type includes: message information and communication behavior information of a preset number of data packets in the sample network traffic of the known protocol type;

a dividing module 605, configured to divide pre-collected network traffic of an undetermined protocol type by taking a session as a unit, so as to obtain a plurality of unidentified network traffic;

the data acquisition module 601 is further configured to extract message information and communication behavior information of a preset number of data packets from each unidentified network traffic;

an unknown protocol clustering module 606, configured to cluster the message information and the communication behavior information of multiple unidentified network flows to obtain a cluster of each category;

the protocol type matching module 607 is configured to calculate, for each category, the maximum mean discrepancy (MMD) between the class cluster of the category and the sample information set of each known protocol type, determine the known protocol type whose sample information set has the smallest MMD to the class cluster of the category, and use the determined known protocol type as the protocol type matched with the category;

and the identification model construction module 608 is configured to construct a network traffic identification model of the category by a deep migration learning method based on a pre-training model corresponding to the protocol type matched with the category.

Optionally, the identification model building module 608 is specifically configured to:

inputting a sample two-dimensional data matrix of sample network traffic of a target protocol type into the pre-training model corresponding to the target protocol type;

if the pre-training model corresponding to the target protocol type is determined to have converged based on the loss value, determining that the network traffic identification model of the target category is the pre-training model corresponding to the target protocol type;

where n_sk is the number of sample network traffic items in D_sk.

Optionally, referring to fig. 7, the apparatus further includes: a data matrix construction module 609;

a data matrix construction module 609, configured to, after the network traffic identification model of the category has been constructed by the deep migration learning method based on the pre-training model corresponding to the protocol type matched with the category, construct, for each unidentified network traffic of the category, a message two-dimensional data matrix according to the message information of the unidentified network traffic and a behavior two-dimensional data matrix according to the communication behavior information of the unidentified network traffic;

the traffic identification module 604 is further configured to input the constructed two-dimensional data matrix of the packet and the two-dimensional data matrix of the behavior into the network traffic identification model of the category, and determine whether the unrecognized network traffic is malicious traffic.

Optionally, the network traffic identification model of the target class includes: a first convolution layer, a second convolution layer, a full connection layer and an output layer; the flow identification module 604 is specifically configured to:

the first convolutional layer convolves the message two-dimensional data matrix with a two-dimensional convolution kernel to obtain a first feature map;

the second convolutional layer convolves the behavior two-dimensional data matrix with a two-dimensional convolution kernel to obtain a second feature map;

the fully-connected layer integrates the first feature map and the second feature map to obtain a third feature map;

and the output layer processes the third feature map with a preset classification algorithm, and obtains and outputs whether the network traffic to be identified is malicious traffic.

Optionally, as shown in fig. 7, the apparatus further includes: a flow determination module 610, the flow determination module 610 configured to:

if the target protocol type is the protocol type in a preset white list, determining that the network traffic to be identified is the trusted network traffic, wherein the preset white list comprises the protocol type of the trusted network traffic;

and if the target protocol type is the protocol type in a preset blacklist, determining that the network traffic to be identified is the untrustworthy network traffic, wherein the preset blacklist comprises the protocol type of the untrustworthy network traffic.

An embodiment of the present invention further provides an electronic device, as shown in fig. 8, which includes a processor 801, a communication interface 802, a memory 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 complete mutual communication through the communication bus 804,

a memory 803 for storing a computer program;

the processor 801 is configured to implement the method steps in the above-described method embodiments when executing the program stored in the memory 803.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above-mentioned deep migration learning-based network traffic identification methods.

In yet another embodiment, a computer program product containing instructions is provided, which, when run on a computer, causes the computer to execute any of the above-mentioned network traffic identification methods based on deep migration learning.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)), among others.

It is noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The embodiments in this specification are described in a related manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.

The above description covers only preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A network traffic identification method based on deep migration learning is characterized by comprising the following steps:

2. The method according to claim 1, wherein before the extracting message information and communication behavior information of a preset number of data packets from the network traffic to be identified, the method further comprises:

3. The method of claim 2, wherein the network traffic recognition model for the target class is constructed by:

4. A method according to claim 2 or 3, characterized by calculating the MMD between a class cluster of a class and a sample information set of a known protocol type by the following formula:

n_sk is the quantity of sample network traffic corresponding to D_sk,

5. The method according to claim 2, wherein after the pre-trained model corresponding to the protocol type matching the category is used as a basis to construct the network traffic recognition model of the category through a deep migration learning method, the method further comprises:

6. The method of claim 1, wherein the network traffic identification model for the target class comprises: a first convolution layer, a second convolution layer, a full connection layer and an output layer; the network traffic identification model of the target category identifies whether the network traffic to be identified is malicious traffic or not through the following steps:

7. The method according to claim 1, wherein after the obtaining the target class of the class cluster corresponding to the shortest distance, the method further comprises:

8. An apparatus for identifying network traffic based on deep migration learning, the apparatus comprising:

9. The apparatus of claim 8, further comprising: a dividing module, an unknown protocol clustering module, a protocol type matching module, and an identification model building module;

10. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;

the memory is configured to store a computer program;

the processor is configured to implement the method steps of any one of claims 1 to 7 when executing the program stored in the memory.