Disclosure of Invention
The invention aims to solve the problem that the prior art cannot effectively cope with concept drift in the network environment in network intrusion detection tasks, and provides a network intrusion detection method and system based on reliability sample selection. The network intrusion detection method based on reliability sample selection comprises the following steps:
In the initial training phase:
S1, dividing a network intrusion data set, wherein the network intrusion data set comprises an original training data set, an online training data set and a test data set;
S2, carrying out data enhancement on the samples according to the risk and reliability of the original training samples obtained in S1, and selecting high-quality samples from the enhanced data;
S3, inputting the high-quality sample data selected in the S2 into a model for training to obtain an initial training model;
In the online training stage:
S4, detecting whether the distribution of the input sample and the original sample is the same according to the input online training data, so as to judge whether concept drift occurs;
S5, selecting a certain number of reliability samples by means of an attention mechanism according to the judgment result of S4 and updating the training data set: when concept drift occurs, a self-attention model is used to calculate the attention weights of the old samples and the new samples, a reliability threshold is determined from the attention weights, samples whose attention weights are higher than the threshold are screened out as candidate samples, the candidate samples are sorted, the first k candidates are selected as reliability samples, and k samples are deleted from the old samples; when no concept drift occurs, the attention weights, threshold judgment, candidate screening, sorting and top-k selection are carried out in the same way, but no old samples are deleted; in both cases the selected reliability samples are merged into the current training set and the labels of the training set are updated;
S6, when the reliability samples in the input samples are insufficient, selecting samples with high model loss values from the non-reliability samples as supplementary samples;
S7, inputting the training data sets updated in the S5 and the S6 into the model for training until the online training data are completely input;
In the test phase:
S8, inputting test data into the model, outputting a classification prediction result of the sample by the model, and evaluating the performance of the model by comparing the prediction result of the model with the real label.
Preferably, S1 comprises the steps of:
S1.1, inputting the original training data into a pre-training model to generate pseudo labels and score maps of the data. Let the original training data set be $D = \{x_i\}_{i=1}^{N}$, where $x_i$ represents the i-th sample, and let $f_{\text{pre}}(\cdot)$ denote the pre-training model. For each sample, the pseudo label $\hat{y}_i$ and score map $s_i$ are generated as
$(\hat{y}_i, s_i) = f_{\text{pre}}(x_i)$;
S1.2, according to the generated pseudo label $\hat{y}_i$ and score map $s_i$, computing the risk $r_i$ and reliability $c_i$ of each sample in the original training data set:
$r_i = g(\hat{y}_i, s_i)$, $c_i = h(\hat{y}_i, s_i)$;
wherein g(·) and h(·) are the calculation functions of risk and reliability, respectively.
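The concrete forms of g(·) and h(·) are not fixed above. A minimal sketch follows, assuming reliability is taken as the maximum softmax score of the score map and risk as the normalized prediction entropy; both function choices and all names are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def reliability(score_map: np.ndarray) -> np.ndarray:
    """h(.): assumed here to be the maximum class probability of each sample."""
    return score_map.max(axis=1)

def risk(score_map: np.ndarray) -> np.ndarray:
    """g(.): assumed here to be the normalized prediction entropy of each sample."""
    p = np.clip(score_map, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return entropy / np.log(score_map.shape[1])

# score_map: (N, num_classes) softmax outputs of the pre-training model
score_map = np.array([[0.9, 0.1], [0.55, 0.45]])
pseudo_labels = score_map.argmax(axis=1)   # pseudo label = predicted class
r, c = risk(score_map), reliability(score_map)
```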
Preferably, S2 comprises the steps of:
S2.1, based on the risk and reliability of the samples obtained in S1.2, genetic programming is used to enhance the training data. t iterations are performed; in each iteration a sample $x_i$ and its corresponding label are randomly selected, and a mutation or crossover operation is applied to the sample according to its risk and reliability:
Mutation operation: $x_i' = (1 - m) \odot x_i + m \odot \tilde{x}_i$;
wherein $m$ is a binary mask in which each element equals 1 with mutation rate $\mu$ and 0 otherwise, and $\tilde{x}_i$ denotes the mutated feature values;
Crossover operation: $x_i' = c \odot x_i + (1 - c) \odot x_j$;
wherein $c$ is a binary mask in which each element equals 1 with crossover rate $\gamma$ and 0 otherwise, and $x_j$ is a second randomly selected sample. The enhanced data set is $D_{\text{aug}} = \{x_i'\}_{i=1}^{N'}$, where $N'$ is the number of samples after enhancement;
S2.2, based on the enhanced data $D_{\text{aug}}$ obtained in S2.1, re-evaluating the reliability $c_i'$ of each sample, calculating the average reliability $\bar{c}$ of all samples, and selecting the samples satisfying $c_i' > \bar{c}$ as high-quality samples to form the high-quality sample set $D_{\text{hq}}$.
Preferably, in S4, the Kolmogorov-Smirnov test is used to detect the distribution difference between the new samples and the old samples. Let the input new samples be $X_{\text{new}}$ and the old samples be $X_{\text{old}}$. The KS statistic $S = \sup_x \left| F_{\text{new}}(x) - F_{\text{old}}(x) \right|$ and the corresponding P value $p$ are calculated, where $F_{\text{new}}$ and $F_{\text{old}}$ are the empirical cumulative distribution functions of the new and old samples; when $p$ is smaller than the concept drift judgment threshold drift_threshold, concept drift is deemed to have occurred.
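A minimal sketch of this drift check using SciPy's two-sample Kolmogorov-Smirnov test; applying the test per feature, declaring drift when any P value falls below the threshold, and the 0.05 default are assumptions, since the text only specifies the KS statistic, the P value, and the threshold comparison:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(x_old: np.ndarray, x_new: np.ndarray,
                 drift_threshold: float = 0.05) -> bool:
    """Return True if the KS test flags a distribution difference on any feature."""
    for j in range(x_old.shape[1]):                     # test each feature column
        statistic, p_value = ks_2samp(x_old[:, j], x_new[:, j])
        if p_value < drift_threshold:                   # concept drift judgment threshold
            return True
    return False

rng = np.random.default_rng(0)
drifted = detect_drift(rng.normal(0, 1, (500, 3)), rng.normal(2, 1, (500, 3)))
```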
Preferably, in S5, according to the concept drift determination result, the reliability samples are selected and updated in two cases:
S5.1, when concept drift occurs, a multi-head self-attention model is used to calculate the attention weights between the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$. The output of the self-attention model is expressed as follows:
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^{O}$;
wherein each head $\text{head}_i$ and its attention weight $A_i$ are calculated as:
$\text{head}_i = A_i \, V W_i^{V}$;
$A_i = \text{softmax}\!\left( \frac{Q W_i^{Q} (K W_i^{K})^{\top}}{\sqrt{d_k}} \right)$;
wherein $d_k$ is the dimension of each head, and $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are learnable weight matrices; the query, key and value matrices $Q$, $K$ and $V$ are obtained from the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$.
A reliability sample threshold $\theta$ is determined from the attention weights, samples whose attention weights are higher than the threshold are screened out as candidate samples, the candidate samples are sorted by attention weight, the first k candidates are selected as reliability samples $X_{r}$, and the remaining samples are taken as unreliable samples;
Then k samples $X_{\text{del}}$ are deleted from $X_{\text{old}}$, the selected reliability samples are merged into the current training set, and the labels of the training set are updated:
$D_{\text{train}} = (X_{\text{old}} \setminus X_{\text{del}}) \cup X_{r}$;
S5.2, when no concept drift occurs, the attention weights of the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$ are calculated with the same self-attention model, the reliability threshold $\theta$ is determined from the attention weights, samples whose attention weights are higher than the threshold are screened out as candidate samples, the candidate samples are sorted, the first k candidates are selected as reliability samples, the selected reliability samples are merged into the current training set without deleting any old samples, and the labels of the training set are updated.
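A minimal sketch of the attention-guided selection in S5. Using the new samples as queries against the old samples, averaging each new sample's attention weights into a single score, and taking the mean score as the threshold θ are all assumptions; the untrained torch.nn.MultiheadAttention module merely stands in for the learned self-attention model described above:

```python
import torch
import torch.nn as nn

def select_reliable(x_old: torch.Tensor, x_new: torch.Tensor, k: int,
                    num_heads: int = 1) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (reliability sample indices, unreliable sample indices) for the new samples."""
    d = x_new.shape[1]                              # embed_dim must be divisible by num_heads
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)
    q, kv = x_new.unsqueeze(0), x_old.unsqueeze(0)  # (1, n_new, d), (1, n_old, d)
    _, weights = attn(q, kv, kv)                    # attention weights: (1, n_new, n_old)
    scores = weights.mean(dim=2).squeeze(0)         # one score per new sample
    theta = scores.mean()                           # assumed reliability threshold
    candidates = torch.nonzero(scores > theta).squeeze(1)
    order = torch.argsort(scores[candidates], descending=True)
    reliable = candidates[order][:k]                # top-k candidates by attention score
    mask = torch.ones(len(scores), dtype=torch.bool)
    mask[reliable] = False
    return reliable, torch.nonzero(mask).squeeze(1)

x_old, x_new = torch.randn(200, 8), torch.randn(100, 8)
reliable_idx, unreliable_idx = select_reliable(x_old, x_new, k=20)
```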
Preferably, in S6, when the number n of new reliability samples is smaller than the set number k of samples, the loss value $\ell_j$ of each non-reliability sample is calculated, $j = 1, \dots, M$, where $M$ is the number of non-reliability samples; the non-reliability samples are sorted by loss value, the k-n samples with the highest loss values are selected as supplementary samples, the supplementary samples are merged into the current training set, and the labels of the training set are updated.
Preferably, in S8, the test data set $D_{\text{test}}$ is input into the trained model, which outputs the classification prediction result $\hat{y}$; by comparing the prediction result $\hat{y}$ of the model with the real label $y$, the performance of the model is evaluated using indexes including accuracy, precision, recall, and F1 score.
The embodiment of the application provides a network intrusion detection system based on reliability sample selection, which comprises a processor and a memory;
a memory for storing a computer program;
and a processor for implementing any of the method steps when executing the program stored on the memory.
The invention performs data enhancement on network intrusion data through genetic programming to generate more representative samples. High-quality samples are then selected through reliability evaluation for initial model training, which ensures the quality and representativeness of the training data, improves the initial performance of the model, and effectively alleviates the problems of uneven sample quality and unbalanced data distribution in network intrusion detection data;
When concept drift is detected during online training of the model, the invention adopts a reliability sample selection strategy guided by an attention mechanism. This strategy preferentially selects the most valuable samples for online model updating, so that the model quickly adapts to new data distributions. Through the attention mechanism, the model can dynamically adjust the selection of samples, better adapt to changes in the data distribution, and effectively solve the adaptability problem of the model when facing such changes;
When the number of reliability samples is insufficient, the invention selects samples with high model loss values from the non-reliability samples as supplementary samples. This strategy effectively exploits the uncertainty information of the model, selects the samples most helpful for improving model performance, and solves the problem of how to make effective use of the available data to improve model performance when the number of samples is limited.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For network traffic attacks in the network intrusion detection problem, an attacker aims to bypass the detection system by various means (such as data packet tampering or disguising as normal traffic), so that the network system performs illegal operations or leaks sensitive information. In this embodiment, a normal behavior sample in the network traffic is defined as a normal sample and a sample containing attack behavior as an attack sample. The system aims to accurately distinguish normal samples from attack samples by analyzing network traffic, so that potential network attacks are discovered and prevented in time;
as shown in fig. 1, the network intrusion detection method based on reliability sample selection provided in this embodiment includes the following steps.
S1, dividing the network intrusion data set NSL-KDD. The NSL-KDD data set comprises a training data set and a test data set; the training data set contains 125,973 sample records in total and is divided into an original training data set of 25,194 records and an online training data set of 100,779 records, while the test data set contains 22,544 records. Each record in the data set comprises 41 features and an attack type label (normal or anomaly), the features including basic features, content features and traffic-based features;
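A minimal sketch of the split used in this embodiment, assuming the 125,973 NSL-KDD training records have already been loaded and encoded into an array; whether the original/online split is random or sequential is not stated in the text, so a shuffled split is assumed, and all names are illustrative:

```python
import numpy as np

def split_nsl_kdd(train: np.ndarray, n_original: int = 25_194, seed: int = 0):
    """Split the NSL-KDD training records into the original and online training sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(train))
    original_train = train[idx[:n_original]]    # 25,194 records for initial training
    online_train = train[idx[n_original:]]      # remaining 100,779 records for online training
    return original_train, online_train
```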
The risk and reliability of the original training samples are then evaluated: the original training data are input into a pre-training model to generate pseudo labels and score maps of the data. Let the original training data set be $D = \{x_i\}_{i=1}^{N}$, where $x_i$ represents the i-th sample, and let $f_{\text{pre}}(\cdot)$ denote the pre-training model. For each sample, the pseudo label $\hat{y}_i$ and score map $s_i$ are generated as
$(\hat{y}_i, s_i) = f_{\text{pre}}(x_i)$;
Based on the generated pseudo label $\hat{y}_i$ and score map $s_i$, the risk $r_i$ and reliability $c_i$ of each sample are calculated:
$r_i = g(\hat{y}_i, s_i)$, $c_i = h(\hat{y}_i, s_i)$;
wherein g(·) and h(·) are the calculation functions of risk and reliability, respectively.
S2, carrying out data enhancement on the samples according to the risk and reliability of the original training samples obtained in S1, and selecting high-quality samples from the enhanced data;
In particular, according to the risk $r_i$ and reliability $c_i$ of the samples, genetic programming is used to perform data enhancement on the training data. t iterations are performed; in each iteration a sample $x_i$ and its corresponding label are randomly selected, and a mutation or crossover operation is applied to the sample according to its risk and reliability:
Mutation operation: $x_i' = (1 - m) \odot x_i + m \odot \tilde{x}_i$;
wherein $m$ is a binary mask in which each element equals 1 with mutation rate $\mu$ and 0 otherwise, and $\tilde{x}_i$ denotes the mutated feature values; here the mutation rate $\mu$ is 0.1;
Crossover operation: $x_i' = c \odot x_i + (1 - c) \odot x_j$;
wherein $c$ is a binary mask in which each element equals 1 with crossover rate $\gamma$ and 0 otherwise, and $x_j$ is a second randomly selected sample; here the crossover rate $\gamma$ is 0.5. The enhanced data set is $D_{\text{aug}} = \{x_i'\}_{i=1}^{N'}$, where $N'$ is the number of samples after enhancement;
Based on the enhanced data $D_{\text{aug}}$, the reliability $c_i'$ of each sample is re-evaluated, the average reliability $\bar{c}$ of all samples is calculated, and the samples satisfying $c_i' > \bar{c}$ are selected as high-quality samples to form the high-quality sample set $D_{\text{hq}}$.
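A minimal sketch of the mask-based mutation and crossover with μ = 0.1 and γ = 0.5. How risk and reliability decide which operator a sample receives, and the Gaussian perturbation used for mutation, are assumptions (here, samples whose risk exceeds the median are mutated and the rest are crossed over):

```python
import numpy as np

def augment(x: np.ndarray, risk: np.ndarray, mu: float = 0.1, gamma: float = 0.5,
            t: int = 10_000, seed: int = 0) -> np.ndarray:
    """Generate t enhanced samples by binary-mask mutation or crossover."""
    rng = np.random.default_rng(seed)
    enhanced = []
    for _ in range(t):
        i = rng.integers(len(x))
        if risk[i] > np.median(risk):                 # assumed rule: mutate risky samples
            m = rng.random(x.shape[1]) < mu           # mutation mask (rate mu = 0.1)
            child = np.where(m, x[i] + rng.normal(0.0, 0.1, x.shape[1]), x[i])
        else:                                         # otherwise cross over with a second parent
            j = rng.integers(len(x))
            c = rng.random(x.shape[1]) < gamma        # crossover mask (rate gamma = 0.5)
            child = np.where(c, x[i], x[j])
        enhanced.append(child)
    return np.vstack(enhanced)
```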
S3, inputting the high-quality samples obtained in S2 into the model and training with a stochastic gradient descent (SGD) optimizer at a learning rate of 0.001; the initial model training is completed after 250 iterations. Training is carried out with the PyTorch framework on an NVIDIA GTX 3060 GPU.
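A minimal PyTorch sketch of the initial training step with the stated optimizer and learning rate; the network architecture is not given in the text, so a small fully connected classifier over the 41 NSL-KDD features is assumed:

```python
import torch
import torch.nn as nn

def train_initial(x: torch.Tensor, y: torch.Tensor, iterations: int = 250) -> nn.Module:
    """Train an assumed MLP on the high-quality sample set with SGD (lr = 0.001)."""
    model = nn.Sequential(nn.Linear(41, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    for _ in range(iterations):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    return model
```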
S4, according to the input online training data, detecting whether the distribution of the input samples is the same as that of the original samples so as to judge whether concept drift occurs. The Kolmogorov-Smirnov test is used to detect the distribution difference between the new samples and the old samples. Let the input new samples be $X_{\text{new}}$ and the old samples be $X_{\text{old}}$. The KS statistic $S = \sup_x \left| F_{\text{new}}(x) - F_{\text{old}}(x) \right|$ and the corresponding P value $p$ are calculated; when $p$ is smaller than the concept drift judgment threshold drift_threshold, concept drift is deemed to have occurred.
S5, according to the judgment result of S4, selecting reliability samples from the input samples and updating the model:
When concept drift occurs, a multi-head self-attention model is used to calculate the attention weights between the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$. The output of the self-attention model is expressed as follows:
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^{O}$;
wherein each head $\text{head}_i$ and its attention weight $A_i$ are calculated as:
$\text{head}_i = A_i \, V W_i^{V}$;
$A_i = \text{softmax}\!\left( \frac{Q W_i^{Q} (K W_i^{K})^{\top}}{\sqrt{d_k}} \right)$;
wherein $d_k$ is the dimension of each head, and $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are learnable weight matrices; the query, key and value matrices $Q$, $K$ and $V$ are obtained from the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$.
The reliability sample threshold $\theta$ is then determined from the attention weights, samples whose attention weights are higher than the threshold are screened out as candidate samples, the candidate samples are sorted by attention weight, the first k candidates are selected as reliability samples $X_{r}$, and the remaining samples are taken as unreliable samples;
Then k samples $X_{\text{del}}$ are deleted from $X_{\text{old}}$, the selected reliability samples are merged into the current training set, and the labels of the training set are updated:
$D_{\text{train}} = (X_{\text{old}} \setminus X_{\text{del}}) \cup X_{r}$;
When no concept drift occurs, the attention weights of the old samples $X_{\text{old}}$ and the new samples $X_{\text{new}}$ are calculated with the same self-attention model, the reliability threshold $\theta$ is determined from the attention weights, samples whose attention weights are higher than the threshold are screened out as candidate samples, the candidate samples are sorted, the first k candidates are selected as reliability samples, the selected reliability samples are merged into the current training set without deleting any old samples, and the labels of the training set are updated.
S6, when the reliability samples in the input samples are insufficient, selecting samples with high model loss values from the non-reliability samples as supplementary samples: when the number n of new reliability samples is smaller than the set number k of samples, the loss value $\ell_j$ of each non-reliability sample is calculated, the non-reliability samples are sorted by loss value, the k-n samples with the highest loss values are selected as supplementary samples, the supplementary samples are merged into the current training set, and the labels of the training set are updated.
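A minimal sketch of this supplement step; scoring the non-reliability samples with a per-sample cross-entropy loss against the model's own pseudo labels is an assumption, since the loss function is not specified in the text:

```python
import torch
import torch.nn.functional as F

def supplement(model: torch.nn.Module, x_unreliable: torch.Tensor,
               k: int, n: int) -> torch.Tensor:
    """Return indices of the k-n non-reliability samples with the highest loss values."""
    with torch.no_grad():
        logits = model(x_unreliable)
        pseudo = logits.argmax(dim=1)                      # assumed pseudo labels
        losses = F.cross_entropy(logits, pseudo, reduction="none")
    return torch.argsort(losses, descending=True)[: max(k - n, 0)]
```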
S7, inputting the training data sets updated in S5 and S6 into the model for training until the online training data have been completely input. The samples in the online training data set are trained in segments, with 5,000 samples input each time, and the model training is completed after 20 iterations.
S8, the test data set $D_{\text{test}}$ is input into the trained model, which outputs the classification prediction result $\hat{y}$; by comparing the prediction result $\hat{y}$ of the model with the real label $y$, the performance of the model is evaluated using indexes including accuracy, precision, recall, and F1 score.
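A minimal evaluation sketch using scikit-learn for the four reported indexes; treating the attack (anomaly) class encoded as 1 as the positive label is an assumption:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred) -> dict:
    """Compare model predictions with the real labels on the test set."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),   # attack class assumed positive (1)
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

metrics = evaluate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
```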
The method in this embodiment is compared with other classical continual learning methods, namely SSF, AOC-IDS, EWC and LwF. The specific performance is shown in Table 1:
TABLE 1 Performance comparison of the inventive method with other continual learning methods

| Method | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- | --- |
| The invention | 92.5 | 91.2 | 96.1 | 93.6 |
| SSF | 90.5 | 89.2 | 94.7 | 91.9 |
| AOC-IDS | 81.7 | 78.9 | 92.5 | 85.2 |
| EWC | 81.7 | 89.1 | 77.4 | 82.7 |
| LwF | 82.7 | 89.2 | 79.1 | 83.8 |
As can be seen from Table 1, the invention completes the network intrusion detection task better than the compared methods: it improves the accuracy by 2%, meaning that normal and intrusion samples are identified more accurately; improves the precision by 1%, meaning that predicted intrusions are more reliable and the false alarm rate is lower; improves the recall by 1.4%, meaning that intrusion samples are detected more comprehensively and fewer intrusions are missed; and improves the F1 score by 1.7%, meaning that a better balance is achieved between precision and recall and the overall performance is better;
In addition, FIG. 2 compares the heat maps of the confusion matrices of the baseline algorithm and the network intrusion detection system based on reliability sample selection. During network intrusion detection, the number of samples correctly predicted as attack traffic increases and the number of attack samples incorrectly predicted as normal traffic decreases, so the system has a better ability to identify attack traffic; likewise, the number of samples correctly identified as normal traffic increases and the number of normal samples incorrectly judged as attack traffic decreases, so the system also has a better ability to identify normal traffic.
TABLE 2 Performance comparison of the inventive method when the sample selection strategies are applied sequentially

| Method | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- | --- |
| Baseline | 89.3 | 89.1 | 92.4 | 90.7 |
| Baseline+HS | 91.5 | 88.8 | 97.5 | 92.9 |
| Baseline+HS+AS | 92.1 | 89.3 | 97.7 | 93.4 |
| Baseline+HS+AS+URS | 92.5 | 91.2 | 96.1 | 93.6 |
Note: HS denotes high-quality sample selection, AS denotes attention-mechanism-guided reliability sample selection, and URS denotes the uncertain-sample supplement strategy;
As can be seen from Table 2, each of the three proposed strategies improves the performance of the baseline algorithm. High-quality sample selection helps the algorithm gain 2.2% in accuracy and 2.2% in F1 score; although precision drops by 0.3%, recall improves by 5.1%. The reliability sample selection strategy guided by the attention mechanism helps the model improve on the baseline across the board, obtaining higher results on all evaluation indexes. The supplement strategy based on uncertain samples helps the model achieve a better balance between precision and recall;
In addition, FIG. 3 compares the impact of the high-quality sample selection strategy on initial model training: when the strategy is adopted, the number of samples correctly predicted as attack traffic increases and the number incorrectly predicted as normal traffic decreases, so the initially trained model identifies attack traffic better.
TABLE 3 Performance comparison of the inventive method for different numbers k of selected reliability samples

| k | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- | --- |
| 50 | 91.5 | 89.7 | 96.1 | 92.8 |
| 150 | 92.1 | 89.5 | 97.3 | 93.3 |
| 250 | 92.2 | 89.4 | 97.7 | 93.4 |
| 350 | 92.5 | 91.2 | 96.1 | 93.6 |
| 450 | 92.4 | 90.5 | 96.9 | 93.5 |
As can be seen from Table 3, the network intrusion detection performance was evaluated for k = 50, 150, 250, 350 and 450. Selecting a larger number of reliability samples generally yields better accuracy and recall but lowers precision. When k = 350, an F1 score of 93.6% and an accuracy of 92.5% are achieved, which are 0.8% and 1% higher respectively than the k = 50 results, and a better balance is achieved between precision and recall; therefore k = 350 is used in the experimental evaluation.
The embodiment of the disclosure also provides a network intrusion detection system based on the reliability sample selection, which comprises a processor and a memory;
a memory for storing a computer program;
A processor for implementing any of the above steps of the network intrusion detection method based on reliability sample selection when executing the program stored in the memory;
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, or alternatives falling within the spirit and principles of the invention.