Disclosure of Invention
The invention aims to solve the problem that the prior art cannot effectively cope with concept drift in the network environment in network intrusion detection tasks, and provides a network intrusion detection method and system based on reliability sample selection. The network intrusion detection method based on reliability sample selection comprises the following steps:
In the initial training phase:
S1, dividing a network intrusion data set, wherein the network intrusion data set comprises an original training data set, an online training data set and a test data set;
S2, carrying out data enhancement on the samples according to the risk and reliability of the original training samples obtained in S1, and selecting high-quality samples from the enhanced data;
S3, inputting the high-quality sample data selected in the S2 into a model for training to obtain an initial training model;
In the online training stage:
S4, detecting whether the distribution of the input sample and the original sample is the same according to the input online training data, so as to judge whether concept drift occurs;
S5, selecting a certain number of reliability samples by means of an attention mechanism according to the judgment result of S4 and updating the training data set: when concept drift occurs, a self-attention model is used to calculate the attention weights of the old samples and the new samples, a reliability threshold is determined from the attention weights, samples whose attention weights are higher than the threshold are screened out as candidate samples, the candidate samples are sorted, the first k candidates are selected as reliability samples, and k samples are deleted from the old samples; when no concept drift occurs, the attention weights, threshold judgment, candidate screening, sorting and top-k selection are carried out in the same way, but no old samples are deleted; in both cases the selected reliability samples are merged into the current training set and the labels of the training set are updated;
S6, when the reliability samples in the input samples are insufficient, selecting samples with high model loss values from the non-reliability samples as supplementary samples;
S7, inputting the training data sets updated in the S5 and the S6 into the model for training until the online training data are completely input;
In the test phase:
S8, inputting test data into the model, outputting a classification prediction result of the sample by the model, and evaluating the performance of the model by comparing the prediction result of the model with the real label.
Preferably, S1 comprises the steps of:
S1.1, inputting the original training data into a pre-training model to generate pseudo labels and score maps of the data. Let the original training data set be $D = \{x_i\}_{i=1}^{N}$, where $x_i$ represents the i-th sample, and let $f_{\text{pre}}(\cdot)$ denote the pre-training model. For each sample, the pseudo label $\hat{y}_i$ and score map $s_i$ are generated as
$(\hat{y}_i, s_i) = f_{\text{pre}}(x_i)$;
S1.2, according to the generated pseudo label $\hat{y}_i$ and score map $s_i$, computing the risk $r_i$ and reliability $c_i$ of each sample in the original training data set:
$r_i = g(\hat{y}_i, s_i)$, $c_i = h(\hat{y}_i, s_i)$;
wherein g(·) and h(·) are the calculation functions of risk and reliability, respectively.
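The concrete forms of g(·) and h(·) are not fixed above. A minimal sketch follows, assuming reliability is taken as the maximum softmax score of the score map and risk as the normalized prediction entropy; both function choices and all names are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def reliability(score_map: np.ndarray) -> np.ndarray:
    """h(.): assumed here to be the maximum class probability of each sample."""
    return score_map.max(axis=1)

def risk(score_map: np.ndarray) -> np.ndarray:
    """g(.): assumed here to be the normalized prediction entropy of each sample."""
    p = np.clip(score_map, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return entropy / np.log(score_map.shape[1])

# score_map: (N, num_classes) softmax outputs of the pre-training model
score_map = np.array([[0.9, 0.1], [0.55, 0.45]])
pseudo_labels = score_map.argmax(axis=1)   # pseudo label = predicted class
r, c = risk(score_map), reliability(score_map)
```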
Preferably, S2 comprises the steps of:
S2.1, based on the risk and reliability of the samples obtained in S1.2, genetic programming is used to enhance the training data. t iterations are performed; in each iteration a sample $x_i$ and its corresponding label are randomly selected, and a mutation or crossover operation is applied to the sample according to its risk and reliability:
Mutation operation: $x_i' = (1 - m) \odot x_i + m \odot \tilde{x}_i$;
wherein $m$ is a binary mask in which each element equals 1 with mutation rate $\mu$ and 0 otherwise, and $\tilde{x}_i$ denotes the mutated feature values;
Crossover operation: $x_i' = c \odot x_i + (1 - c) \odot x_j$;
wherein $c$ is a binary mask in which each element equals 1 with crossover rate $\gamma$ and 0 otherwise, and $x_j$ is a second randomly selected sample. The enhanced data set is $D_{\text{aug}} = \{x_i'\}_{i=1}^{N'}$, where $N'$ is the number of samples after enhancement;
S2.2, based on the enhanced data $D_{\text{aug}}$ obtained in S2.1, re-evaluating the reliability $c_i'$ of each sample, calculating the average reliability $\bar{c}$ of all samples, and selecting the samples satisfying $c_i' > \bar{c}$ as high-quality samples to form the high-quality sample set $D_{\text{hq}}$.
Preferably, in S4, the Kolmogorov-Smirnov test is used to detect the distribution difference between the new samples and the old samples. Let the input new samples be $X_{\text{new}}$ and the old samples be $X_{\text{old}}$. The KS statistic $S = \sup_x \left| F_{\text{new}}(x) - F_{\text{old}}(x) \right|$ and the corresponding P value $p$ are calculated, where $F_{\text{new}}$ and $F_{\text{old}}$ are the empirical cumulative distribution functions of the new and old samples; when $p$ is smaller than the concept drift judgment threshold drift_threshold, concept drift is deemed to have occurred.
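A minimal sketch of this drift check using SciPy's two-sample Kolmogorov-Smirnov test; applying the test per feature, declaring drift when any P value falls below the threshold, and the 0.05 default are assumptions, since the text only specifies the KS statistic, the P value, and the threshold comparison:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(x_old: np.ndarray, x_new: np.ndarray,
                 drift_threshold: float = 0.05) -> bool:
    """Return True if the KS test flags a distribution difference on any feature."""
    for j in range(x_old.shape[1]):                     # test each feature column
        statistic, p_value = ks_2samp(x_old[:, j], x_new[:, j])
        if p_value < drift_threshold:                   # concept drift judgment threshold
            return True
    return False

rng = np.random.default_rng(0)
drifted = detect_drift(rng.normal(0, 1, (500, 3)), rng.normal(2, 1, (500, 3)))
```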
Preferably, in S5, according to the concept drift determination result, the reliability samples are selected and updated in two cases:
S5.1, when concept drift occurs, a multi-head self-attention model is used to calculate the attention weights between the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$. The output of the self-attention model is expressed as follows:
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^{O}$;
wherein each head $\text{head}_i$ and its attention weight $A_i$ are calculated as:
$\text{head}_i = A_i \, V W_i^{V}$;
$A_i = \text{softmax}\!\left( \frac{Q W_i^{Q} (K W_i^{K})^{\top}}{\sqrt{d_k}} \right)$;
wherein $d_k$ is the dimension of each head, and $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are learnable weight matrices; the query, key and value matrices $Q$, $K$ and $V$ are obtained from the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$.
A reliability sample threshold $\theta$ is determined from the attention weights, samples whose attention weights are higher than the threshold are screened out as candidate samples, the candidate samples are sorted by attention weight, the first k candidates are selected as reliability samples $X_{r}$, and the remaining samples are taken as unreliable samples;
Then k samples $X_{\text{del}}$ are deleted from $X_{\text{old}}$, the selected reliability samples are merged into the current training set, and the labels of the training set are updated:
$D_{\text{train}} = (X_{\text{old}} \setminus X_{\text{del}}) \cup X_{r}$;
S5.2, when no concept drift occurs, the attention weights of the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$ are calculated with the same self-attention model, the reliability threshold $\theta$ is determined from the attention weights, samples whose attention weights are higher than the threshold are screened out as candidate samples, the candidate samples are sorted, the first k candidates are selected as reliability samples, the selected reliability samples are merged into the current training set without deleting any old samples, and the labels of the training set are updated.
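A minimal sketch of the attention-guided selection in S5. Using the new samples as queries against the old samples, averaging each new sample's attention weights into a single score, and taking the mean score as the threshold θ are all assumptions; the untrained torch.nn.MultiheadAttention module merely stands in for the learned self-attention model described above:

```python
import torch
import torch.nn as nn

def select_reliable(x_old: torch.Tensor, x_new: torch.Tensor, k: int,
                    num_heads: int = 1) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (reliability sample indices, unreliable sample indices) for the new samples."""
    d = x_new.shape[1]                              # embed_dim must be divisible by num_heads
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)
    q, kv = x_new.unsqueeze(0), x_old.unsqueeze(0)  # (1, n_new, d), (1, n_old, d)
    _, weights = attn(q, kv, kv)                    # attention weights: (1, n_new, n_old)
    scores = weights.mean(dim=2).squeeze(0)         # one score per new sample
    theta = scores.mean()                           # assumed reliability threshold
    candidates = torch.nonzero(scores > theta).squeeze(1)
    order = torch.argsort(scores[candidates], descending=True)
    reliable = candidates[order][:k]                # top-k candidates by attention score
    mask = torch.ones(len(scores), dtype=torch.bool)
    mask[reliable] = False
    return reliable, torch.nonzero(mask).squeeze(1)

x_old, x_new = torch.randn(200, 8), torch.randn(100, 8)
reliable_idx, unreliable_idx = select_reliable(x_old, x_new, k=20)
```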
Preferably, in S6, when the number n of new reliability samples is smaller than the set number k of samples, the loss value $\ell_j$ of each non-reliability sample is calculated, $j = 1, \dots, M$, where $M$ is the number of non-reliability samples; the non-reliability samples are sorted by loss value, the k-n samples with the highest loss values are selected as supplementary samples, the supplementary samples are merged into the current training set, and the labels of the training set are updated.
Preferably, in S8, the test data set $D_{\text{test}}$ is input into the trained model, which outputs the classification prediction result $\hat{y}$; by comparing the prediction result $\hat{y}$ of the model with the real label $y$, the performance of the model is evaluated using indexes including accuracy, precision, recall, and F1 score.
The embodiment of the application provides a network intrusion detection system based on reliability sample selection, which comprises a processor and a memory;
a memory for storing a computer program;
and a processor for implementing any of the method steps when executing the program stored on the memory.
The invention performs data enhancement on network intrusion data through genetic programming to generate more representative samples. High-quality samples are then selected through reliability evaluation for initial model training, which ensures the quality and representativeness of the training data, improves the initial performance of the model, and effectively alleviates the problems of uneven sample quality and unbalanced data distribution in network intrusion detection data;
When concept drift is detected during online training of the model, the invention adopts a reliability sample selection strategy guided by an attention mechanism. This strategy preferentially selects the most valuable samples for online model updating, so that the model quickly adapts to new data distributions. Through the attention mechanism, the model can dynamically adjust the selection of samples, better adapt to changes in the data distribution, and effectively solve the adaptability problem of the model when facing such changes;
When the number of reliability samples is insufficient, the invention selects samples with high model loss values from the non-reliability samples as supplementary samples. This strategy effectively exploits the uncertainty information of the model, selects the samples most helpful for improving model performance, and solves the problem of how to make effective use of the available data to improve model performance when the number of samples is limited.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For network traffic attacks in the network intrusion detection problem, an attacker aims to bypass the detection system by various means (such as data packet tampering or disguising as normal traffic), so that the network system performs illegal operations or leaks sensitive information. In this embodiment, a normal behavior sample in the network traffic is defined as a normal sample and a sample containing attack behavior as an attack sample. The system aims to accurately distinguish normal samples from attack samples by analyzing network traffic, so that potential network attacks are discovered and prevented in time;
as shown in fig. 1, the network intrusion detection method based on reliability sample selection provided in this embodiment includes the following steps.
S1, dividing the network intrusion data set NSL-KDD. The NSL-KDD data set comprises a training data set and a test data set; the training data set contains 125,973 sample records in total and is divided into an original training data set of 25,194 records and an online training data set of 100,779 records, while the test data set contains 22,544 records. Each record in the data set comprises 41 features and an attack type label (normal or anomaly), the features including basic features, content features and traffic-based features;
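A minimal sketch of the split used in this embodiment, assuming the 125,973 NSL-KDD training records have already been loaded and encoded into an array; whether the original/online split is random or sequential is not stated in the text, so a shuffled split is assumed, and all names are illustrative:

```python
import numpy as np

def split_nsl_kdd(train: np.ndarray, n_original: int = 25_194, seed: int = 0):
    """Split the NSL-KDD training records into the original and online training sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(train))
    original_train = train[idx[:n_original]]    # 25,194 records for initial training
    online_train = train[idx[n_original:]]      # remaining 100,779 records for online training
    return original_train, online_train
```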
The risk and reliability of the original training samples are then evaluated: the original training data are input into a pre-training model to generate pseudo labels and score maps of the data. Let the original training data set be $D = \{x_i\}_{i=1}^{N}$, where $x_i$ represents the i-th sample, and let $f_{\text{pre}}(\cdot)$ denote the pre-training model. For each sample, the pseudo label $\hat{y}_i$ and score map $s_i$ are generated as
$(\hat{y}_i, s_i) = f_{\text{pre}}(x_i)$;
Based on the generated pseudo label $\hat{y}_i$ and score map $s_i$, the risk $r_i$ and reliability $c_i$ of each sample are calculated:
$r_i = g(\hat{y}_i, s_i)$, $c_i = h(\hat{y}_i, s_i)$;
wherein g(·) and h(·) are the calculation functions of risk and reliability, respectively.
S2, carrying out data enhancement on the samples according to the risk and reliability of the original training samples obtained in S1, and selecting high-quality samples from the enhanced data;
In particular, according to the risk $r_i$ and reliability $c_i$ of the samples, genetic programming is used to perform data enhancement on the training data. t iterations are performed; in each iteration a sample $x_i$ and its corresponding label are randomly selected, and a mutation or crossover operation is applied to the sample according to its risk and reliability:
Mutation operation: $x_i' = (1 - m) \odot x_i + m \odot \tilde{x}_i$;
wherein $m$ is a binary mask in which each element equals 1 with mutation rate $\mu$ and 0 otherwise, and $\tilde{x}_i$ denotes the mutated feature values; here the mutation rate $\mu$ is 0.1;
Crossover operation: $x_i' = c \odot x_i + (1 - c) \odot x_j$;
wherein $c$ is a binary mask in which each element equals 1 with crossover rate $\gamma$ and 0 otherwise, and $x_j$ is a second randomly selected sample; here the crossover rate $\gamma$ is 0.5. The enhanced data set is $D_{\text{aug}} = \{x_i'\}_{i=1}^{N'}$, where $N'$ is the number of samples after enhancement;
Based on the enhanced data $D_{\text{aug}}$, the reliability $c_i'$ of each sample is re-evaluated, the average reliability $\bar{c}$ of all samples is calculated, and the samples satisfying $c_i' > \bar{c}$ are selected as high-quality samples to form the high-quality sample set $D_{\text{hq}}$.
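A minimal sketch of the mask-based mutation and crossover with μ = 0.1 and γ = 0.5. How risk and reliability decide which operator a sample receives, and the Gaussian perturbation used for mutation, are assumptions (here, samples whose risk exceeds the median are mutated and the rest are crossed over):

```python
import numpy as np

def augment(x: np.ndarray, risk: np.ndarray, mu: float = 0.1, gamma: float = 0.5,
            t: int = 10_000, seed: int = 0) -> np.ndarray:
    """Generate t enhanced samples by binary-mask mutation or crossover."""
    rng = np.random.default_rng(seed)
    enhanced = []
    for _ in range(t):
        i = rng.integers(len(x))
        if risk[i] > np.median(risk):                 # assumed rule: mutate risky samples
            m = rng.random(x.shape[1]) < mu           # mutation mask (rate mu = 0.1)
            child = np.where(m, x[i] + rng.normal(0.0, 0.1, x.shape[1]), x[i])
        else:                                         # otherwise cross over with a second parent
            j = rng.integers(len(x))
            c = rng.random(x.shape[1]) < gamma        # crossover mask (rate gamma = 0.5)
            child = np.where(c, x[i], x[j])
        enhanced.append(child)
    return np.vstack(enhanced)
```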
S3, inputting the high-quality samples obtained in S2 into the model and training with a stochastic gradient descent (SGD) optimizer at a learning rate of 0.001; the initial model training is completed after 250 iterations. Training is carried out with the PyTorch framework on an NVIDIA GTX 3060 GPU.
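A minimal PyTorch sketch of the initial training step with the stated optimizer and learning rate; the network architecture is not given in the text, so a small fully connected classifier over the 41 NSL-KDD features is assumed:

```python
import torch
import torch.nn as nn

def train_initial(x: torch.Tensor, y: torch.Tensor, iterations: int = 250) -> nn.Module:
    """Train an assumed MLP on the high-quality sample set with SGD (lr = 0.001)."""
    model = nn.Sequential(nn.Linear(41, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    for _ in range(iterations):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    return model
```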
S4, according to the input online training data, detecting whether the distribution of the input samples is the same as that of the original samples so as to judge whether concept drift occurs. The Kolmogorov-Smirnov test is used to detect the distribution difference between the new samples and the old samples. Let the input new samples be $X_{\text{new}}$ and the old samples be $X_{\text{old}}$. The KS statistic $S = \sup_x \left| F_{\text{new}}(x) - F_{\text{old}}(x) \right|$ and the corresponding P value $p$ are calculated; when $p$ is smaller than the concept drift judgment threshold drift_threshold, concept drift is deemed to have occurred.
S5, according to the judgment result of S4, selecting reliability samples from the input samples and updating the model:
When concept drift occurs, a multi-head self-attention model is used to calculate the attention weights between the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$. The output of the self-attention model is expressed as follows:
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^{O}$;
wherein each head $\text{head}_i$ and its attention weight $A_i$ are calculated as:
$\text{head}_i = A_i \, V W_i^{V}$;
$A_i = \text{softmax}\!\left( \frac{Q W_i^{Q} (K W_i^{K})^{\top}}{\sqrt{d_k}} \right)$;
wherein $d_k$ is the dimension of each head, and $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are learnable weight matrices; the query, key and value matrices $Q$, $K$ and $V$ are obtained from the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$.
The reliability sample threshold $\theta$ is then determined from the attention weights, samples whose attention weights are higher than the threshold are screened out as candidate samples, the candidate samples are sorted by attention weight, the first k candidates are selected as reliability samples $X_{r}$, and the remaining samples are taken as unreliable samples;
Then k samples $X_{\text{del}}$ are deleted from $X_{\text{old}}$, the selected reliability samples are merged into the current training set, and the labels of the training set are updated:
$D_{\text{train}} = (X_{\text{old}} \setminus X_{\text{del}}) \cup X_{r}$;
When no concept drift occurs, the attention weights of the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$ are calculated with the same self-attention model, the reliability threshold $\theta$ is determined from the attention weights, samples whose attention weights are higher than the threshold are screened out as candidate samples, the candidate samples are sorted, the first k candidates are selected as reliability samples, the selected reliability samples are merged into the current training set without deleting any old samples, and the labels of the training set are updated.
S6, when the reliability samples in the input samples are insufficient, selecting samples with high model loss values from the non-reliability samples as supplementary samples: when the number n of new reliability samples is smaller than the set number k of samples, the loss value $\ell_j$ of each non-reliability sample is calculated, the non-reliability samples are sorted by loss value, the k-n samples with the highest loss values are selected as supplementary samples, the supplementary samples are merged into the current training set, and the labels of the training set are updated.
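A minimal sketch of this supplement step; scoring the non-reliability samples with a per-sample cross-entropy loss against the model's own pseudo labels is an assumption, since the loss function is not specified in the text:

```python
import torch
import torch.nn.functional as F

def supplement(model: torch.nn.Module, x_unreliable: torch.Tensor,
               k: int, n: int) -> torch.Tensor:
    """Return indices of the k-n non-reliability samples with the highest loss values."""
    with torch.no_grad():
        logits = model(x_unreliable)
        pseudo = logits.argmax(dim=1)                      # assumed pseudo labels
        losses = F.cross_entropy(logits, pseudo, reduction="none")
    return torch.argsort(losses, descending=True)[: max(k - n, 0)]
```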
S7, inputting the training data sets updated in S5 and S6 into the model for training until the online training data have been completely input. The samples in the online training data set are trained in segments, with 5,000 samples input each time, and the model training is completed after 20 iterations.
S8, the test data set $D_{\text{test}}$ is input into the trained model, which outputs the classification prediction result $\hat{y}$; by comparing the prediction result $\hat{y}$ of the model with the real label $y$, the performance of the model is evaluated using indexes including accuracy, precision, recall, and F1 score.
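A minimal evaluation sketch using scikit-learn for the four reported indexes; treating the attack (anomaly) class encoded as 1 as the positive label is an assumption:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred) -> dict:
    """Compare model predictions with the real labels on the test set."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),   # attack class assumed positive (1)
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

metrics = evaluate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
```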
The method in this embodiment is compared with other classical continual learning methods, namely SSF, AOC-IDS, EWC and LwF. The specific performance is shown in Table 1:
TABLE 1 Performance comparison of the inventive method with other continual learning methods

| Method | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- | --- |
| The invention | 92.5 | 91.2 | 96.1 | 93.6 |
| SSF | 90.5 | 89.2 | 94.7 | 91.9 |
| AOC-IDS | 81.7 | 78.9 | 92.5 | 85.2 |
| EWC | 81.7 | 89.1 | 77.4 | 82.7 |
| LwF | 82.7 | 89.2 | 79.1 | 83.8 |
As can be seen from Table 1, the invention completes the network intrusion detection task better than the compared methods: it improves the accuracy by 2%, meaning that normal and intrusion samples are identified more accurately; improves the precision by 1%, meaning that predicted intrusions are more reliable and the false alarm rate is lower; improves the recall by 1.4%, meaning that intrusion samples are detected more comprehensively and fewer intrusions are missed; and improves the F1 score by 1.7%, meaning that a better balance is achieved between precision and recall and the overall performance is better;
In addition, FIG. 2 compares the heat maps of the confusion matrices of the baseline algorithm and the network intrusion detection system based on reliability sample selection. During network intrusion detection, the number of samples correctly predicted as attack traffic increases and the number of attack samples incorrectly predicted as normal traffic decreases, so the system has a better ability to identify attack traffic; likewise, the number of samples correctly identified as normal traffic increases and the number of normal samples incorrectly judged as attack traffic decreases, so the system also has a better ability to identify normal traffic.
TABLE 2 Performance comparison of the inventive method when the sample selection strategies are applied sequentially

| Method | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- | --- |
| Baseline | 89.3 | 89.1 | 92.4 | 90.7 |
| Baseline+HS | 91.5 | 88.8 | 97.5 | 92.9 |
| Baseline+HS+AS | 92.1 | 89.3 | 97.7 | 93.4 |
| Baseline+HS+AS+URS | 92.5 | 91.2 | 96.1 | 93.6 |
Note: HS denotes high-quality sample selection, AS denotes attention-mechanism-guided reliability sample selection, and URS denotes the uncertain-sample supplement strategy;
As can be seen from Table 2, each of the three proposed strategies improves the performance of the baseline algorithm. High-quality sample selection helps the algorithm gain 2.2% in accuracy and 2.2% in F1 score; although precision drops by 0.3%, recall improves by 5.1%. The reliability sample selection strategy guided by the attention mechanism helps the model improve on the baseline across the board, obtaining higher results on all evaluation indexes. The supplement strategy based on uncertain samples helps the model achieve a better balance between precision and recall;
In addition, FIG. 3 compares the impact of the high-quality sample selection strategy on initial model training: when the strategy is adopted, the number of samples correctly predicted as attack traffic increases and the number incorrectly predicted as normal traffic decreases, so the initially trained model identifies attack traffic better.
TABLE 3 Performance comparison of the inventive method for different numbers k of selected reliability samples

| k | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- | --- |
| 50 | 91.5 | 89.7 | 96.1 | 92.8 |
| 150 | 92.1 | 89.5 | 97.3 | 93.3 |
| 250 | 92.2 | 89.4 | 97.7 | 93.4 |
| 350 | 92.5 | 91.2 | 96.1 | 93.6 |
| 450 | 92.4 | 90.5 | 96.9 | 93.5 |
As can be seen from Table 3, the network intrusion detection performance was evaluated for k = 50, 150, 250, 350 and 450. Selecting a larger number of reliability samples generally yields better accuracy and recall but lowers precision. When k = 350, an F1 score of 93.6% and an accuracy of 92.5% are achieved, which are 0.8% and 1% higher respectively than the k = 50 results, and a better balance is achieved between precision and recall; therefore k = 350 is used in the experimental evaluation.
The embodiment of the disclosure also provides a network intrusion detection system based on the reliability sample selection, which comprises a processor and a memory;
a memory for storing a computer program;
A processor for implementing any of the above steps of the network intrusion detection method based on reliability sample selection when executing the program stored in the memory;
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, or alternatives falling within the spirit and principles of the invention.