Disclosure of Invention
The invention aims to provide a breath sound identification method based on an event level detection technology, which solves the problem that existing methods are inaccurate in breath sound event detection.
The invention provides a breath sound identification method based on an event level detection technology, which comprises the following steps:
Acquiring and preprocessing breath sound data to construct a training set;
Constructing an initial model based on a hierarchical token semantic audio Transformer, adding position encoding to the input of each layer of the Transformer encoder, and introducing a masked multi-head self-attention mechanism in each layer of the Transformer decoder;
Training the initial model based on the training set to obtain a breath sound recognition model;
And identifying the breath sound based on the breath sound identification model to obtain an identification result.
According to the breath sound identification method based on the event level detection technology, the breath sound identification model is constructed based on the hierarchical token semantic audio Transformer to detect abnormal breath sound events, so that the detection accuracy and speed of breath sound events are improved and the diagnosis efficiency for clinical respiratory diseases is improved.
Optionally, the breath sound data includes continuous adventitious sounds and discontinuous adventitious sounds.
Optionally, when the breath sound data is preprocessed, the audio signals in the dataset are resampled at 16 kHz and converted into mel spectrograms.
Optionally, when the breath sound data is preprocessed, the mel spectrogram is masked using time and frequency masking techniques before the frequency band of the selected region is shifted.
Optionally, the initial model is composed of a hierarchical token semantic audio Transformer, a Transformer encoder, a Transformer decoder and a feedforward neural network.
Optionally, when the breath sound recognition model is obtained by training the initial model on the training set, features of the mel spectrogram are extracted by the hierarchical token semantic audio Transformer, the extracted features combined with one-dimensional position encoding are further processed by the Transformer encoder, event representations are generated by the Transformer decoder, the generated events are finally converted into event detection results by a feedforward neural network, and the event-level loss is calculated using the Hungarian algorithm.
Optionally, when the event-level loss is calculated using the Hungarian algorithm, the loss function includes a position loss function and a classification loss function.
Optionally, when the initial model is trained based on the training set to obtain the breath sound recognition model, the training set is divided into training data and verification data, the initial model is trained based on the training data, and performance evaluation is performed on the trained initial model based on the verification data.
Optionally, when performance evaluation is performed on the trained initial model based on the verification data, the evaluation indices include the positive predictive value, the sensitivity, and the harmonic mean of the positive predictive value and the sensitivity.
Optionally, when the breath sounds are identified based on the breath sound identification model to obtain an identification result, the identification result is an event type and an event position represented by the breath sounds.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. Unless otherwise defined, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. As used herein, the word "comprising" and the like means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof without precluding other elements or items.
The embodiment of the invention provides a breath sound identification method based on an event level detection technology, which comprises the following technical scheme:
S1, acquiring and preprocessing breathing sound data to construct a training set;
S2, constructing an initial model based on a hierarchical token semantic audio Transformer (Hierarchical Token Semantic Audio Transformer, HTS-AT), adding position encoding to the input of each layer of the Transformer encoder (Transformer Encoder), and introducing a masked multi-head self-attention mechanism in each layer of the Transformer decoder (Transformer Decoder);
S3, training an initial model based on a training set to obtain a breathing sound identification model;
S4, recognizing the breath sounds based on the breath sound recognition model to obtain a recognition result.
In practice, when S1 is performed, the source of the acquired breath sound data is the HF_Lung_V1 public dataset, collected from 279 patients and comprising 9765 recordings, each with a duration of 15 seconds, which is the largest public lung sound recording dataset so far. The dataset includes 34095 inhalation (Inhalation) and 18349 exhalation (Exhalation) events, 13883 continuous adventitious sounds (Continuous Adventitious Sounds, CAS) and 15606 discontinuous adventitious sounds (Discontinuous Adventitious Sounds, DAS); the continuous adventitious sounds include 8457 wheeze events, 686 stridor events and 4740 rhonchus events, while the discontinuous adventitious sounds are all crackle events. The 9765 recordings were divided into 7809 training data and 1956 test data according to the split provided by the dataset authors.
The breath sound data for training and testing are preprocessed: the audio signals in the dataset are resampled at 16 kHz, with the parameters set to an n_fft of 1024, a window size of 1024, a hop size of 323 and a mel filter bank size of 64. As shown in fig. 1, the data are converted into mel spectrograms, and each spectrogram is masked using time and frequency masking techniques and then shifted along the frequency band, thereby achieving data augmentation.
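As a minimal preprocessing sketch (assuming the torchaudio library; the masking widths and the shift amount below are illustrative values that are not specified in the text), the resampling, mel-spectrogram conversion and augmentation may be written as:

import torch
import torchaudio

def preprocess(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    # Resample to 16 kHz
    wav = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16000)
    # Convert to a mel spectrogram with the parameters given above
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=1024, win_length=1024, hop_length=323, n_mels=64
    )(wav)
    mel = torchaudio.transforms.AmplitudeToDB()(mel)
    # Time and frequency masking (SpecAugment-style); mask widths are assumed
    mel = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)(mel)
    mel = torchaudio.transforms.TimeMasking(time_mask_param=32)(mel)
    # One simple way to realize the frequency-band shift (shift amount assumed)
    mel = torch.roll(mel, shifts=2, dims=-2)
    return mel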
In fact, when performing S2, note that in some breath sound event detection approaches, each sound event category requires a separate model for detection. Thus, on the HF_Lung_V1 dataset, 4 models are required for inhalation, exhalation, CAS and DAS, respectively, which is called the single-class multi-model approach. This approach clearly increases the demand for computing resources and memory and is difficult to apply in clinical practice. In addition, breath sound event detection is a typical polyphonic sound event detection (SED) task, requiring the simultaneous identification of multiple overlapping breath sound events. To this end, we propose a multi-class single model, REDT (Respiratory Sound Event Detection Transformer), which uses only one model to detect all respiratory events. As shown in fig. 2, the REDT model consists of HTS-AT, a Transformer encoder, a Transformer decoder and a feed-forward neural network (Feedforward Neural Network, FFN).
S21, HTS-AT is used for extracting the features of the mel spectrogram. HTS-AT is based on the Transformer deep learning architecture, whose core is a self-attention mechanism that enables the model to handle long-range dependencies in sequence data and to process all elements of a sequence in parallel, which is difficult to achieve with conventional recurrent and convolutional neural networks. In audio processing tasks such as breath sound event detection, adopting a Transformer as the feature extraction part makes it possible to exploit its strong sequence modeling capability to extract the key features of the audio signal, thereby improving the accuracy and robustness of event detection. Compared with a convolutional neural network, the Transformer is more flexible in processing sequence data and can better capture timing information and long-distance dependencies in the audio signal.
S22, the Transformer encoder is used for extracting position information. As shown in fig. 3, the Transformer encoder is formed by stacking a plurality of identical encoder layers (Encoder Layer), each of which internally contains a multi-head self-attention mechanism, position encoding, two layer-normalization layers with two residual connections, and a feed-forward neural network. In sharp contrast to the standard Transformer, which merges the position encoding only into the initial input, we add the position encoding to the input of each layer to increase its localization capability.
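A minimal sketch of this per-layer position-encoding injection, assuming PyTorch building blocks and leaving out the exact layer hyperparameters of the embodiment:

import torch
import torch.nn as nn

class EncoderWithPerLayerPE(nn.Module):
    # Transformer encoder that re-adds the position encoding at the input of
    # every layer, instead of only at the initial input.
    def __init__(self, d_model: int, nhead: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        # x and pos both have shape (batch, sequence_length, d_model)
        for layer in self.layers:
            x = layer(x + pos)  # position encoding injected at every layer
        return x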
S23, the Transformer decoder is used for generating event representations. As shown in fig. 3, the input of the Transformer decoder is first converted into vectors by the embedding layer, while position encoding is added to capture positional information. The input data is then processed through N decoder layers, each of which contains a masked multi-head self-attention mechanism that ensures no future position information is revealed when the sequence is generated, a cross-attention module in which the decoder attends to the output of the encoder and makes decoding predictions, and a feed-forward neural network that applies a further non-linear transformation to the output of the attention layers. Finally, the output layer is a linear layer followed by a softmax function that generates, for each timestamp, a probability distribution over the breath sound events.
S24, the FFN is the most basic neural network structure, composed of several layers, each containing several neurons. In an FFN, information flows from the input layer to the hidden layer and then from the hidden layer to the output layer; this flow is unidirectional, with no feedback connections or loops. Two FFNs are used in the present model, one for multi-label classification and one for timestamp prediction.
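A sketch of the two prediction heads, with assumed feature and class dimensions (four event classes plus an "empty" class, and normalized onset/offset outputs):

import torch.nn as nn

class PredictionHeads(nn.Module):
    # One head for multi-label event classification, one for timestamp prediction.
    def __init__(self, d_model: int = 256, num_classes: int = 4):
        super().__init__()
        # Class head: event classes plus one "empty" (no-event) class
        self.class_head = nn.Linear(d_model, num_classes + 1)
        # Timestamp head: normalized (onset, offset) for each event representation
        self.time_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 2), nn.Sigmoid(),
        )

    def forward(self, event_repr):
        # event_repr: (batch, num_events, d_model) from the Transformer decoder
        return self.class_head(event_repr), self.time_head(event_repr)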
In fact, when the initial model is trained based on the training data in S3, as shown in fig. 4, features of the mel spectrogram are extracted by the hierarchical token semantic audio Transformer, the extracted features combined with one-dimensional position encoding are further processed by the Transformer encoder, event representations are then generated by the Transformer decoder, the generated events are finally converted into event detection results by the FFN, and the event-level loss is calculated using the Hungarian algorithm.
S31, extracting features
To increase computational efficiency, HTS-AT partitions the mel spectrogram into smaller patches and then extracts features using a window attention mechanism, which reduces training time and computational resources while maintaining high performance and a lightweight model.
1) In the Swin Transformer, the Patch Partition module divides an input RGB image of size H×W×3 into non-overlapping patches of the same size, resulting in a dimension of N×(P²·3). Each P²·3 patch is regarded as a patch token, and N patches are obtained in total, i.e., the length of the effective Transformer input sequence. The patch partitioning method used in the Swin Transformer is designed for RGB images and is not directly applicable to audio mel spectrograms, whose time and frequency axes do not play the same role as the horizontal and vertical axes of an image; therefore, a time→frequency→window order is used for the patch embedding of the mel spectrogram.
The audio mel spectrogram is divided into patch tokens by a Patch-Embed CNN with a kernel size of (P×P), and the tokens are then fed into the Transformer in sequence. Unlike an image, the width and height of the audio mel spectrogram represent different information, namely the time and frequency axes. In general, the duration is significantly longer than the frequency span. To better capture the relationships between frequencies within the same time frame, we first divide the mel spectrogram into patch windows w1, w2, ..., and then split each window into patches. The token sequence is arranged in the order of time, then frequency, and finally window, so that patches with different frequencies in the same time frame are adjacent in the input sequence (see the sketch at the end of S31 below).
2) The patch tokens map the features to an arbitrary dimension (set to D); since the number of patches is N, a matrix of dimension N×D is obtained. For groups 2, 3 and 4, a patch-merge layer is employed to reduce the sequence length. This merging operation combines four adjacent patches into one, so that the number of channels becomes four times larger (4D). A linear layer is then added to project the 4D channels down to 2D, and after the four network groups the shape of the patch tokens is reduced by a factor of 8 along each axis. Thus, GPU memory consumption drops exponentially after each group.
For each Swin Transformer block in a group, a window attention mechanism is employed to reduce the computational complexity. First, the attention window is divided into aw1, aw2, ..., awk, each window containing M×M patches. We then calculate the attention matrix only within each attention window, so that we have k window attention matrices instead of one global attention matrix. For an audio patch token sequence of size F×T and an initial latent dimension D, the computational complexity of the two mechanisms in a single Transformer block is as follows:
$\Omega(\mathrm{MSA}) = 4\,FT\,D^{2} + 2\,(FT)^{2}D$ ;
$\Omega(\mathrm{W\text{-}MSA}) = 4\,FT\,D^{2} + 2\,M^{2}\,FT\,D$ ;
where the window attention reduces the second complexity term by a factor of $FT/M^{2}$. For audio patch tokens arranged in a time-frequency-window sequence, each window attention module calculates the relationships within a specific range of consecutive frequency bins and time frames. As the network goes deeper, the Patch-Merge layer merges adjacent windows, enabling the attention relationships to be calculated over a larger range.
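A minimal sketch of the patch-embedding and token-ordering step described in 1) above, assuming PyTorch and illustrative values for the patch size and embedding dimension; the grouping of tokens into attention windows and the patch-merge layers are omitted:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Divides the mel spectrogram into (P x P) patch tokens with a convolution
    # and orders them so that patches sharing a time frame are adjacent.
    def __init__(self, patch_size: int = 4, in_chans: int = 1, embed_dim: int = 96):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, frequency, time)
        x = self.proj(mel)                        # (batch, D, F/P, T/P)
        x = x.permute(0, 3, 2, 1)                 # (batch, T/P, F/P, D)
        x = x.reshape(x.size(0), -1, x.size(-1))  # token sequence, time-major
        return x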
S32, generating sound events
For class localization on the time axis, the extracted features are combined with one-dimensional position encoding and further processed by the Transformer encoder, which can better attend to the time-domain information. Finally, the predicted timestamps and event classes are produced by the Transformer decoder.
1) One-dimensional position encoding is integrated into our model. This encoding mechanism enables the network to learn the temporal localization of individual sound elements in depth, thereby preventing important information from being missed. Essentially, it enhances the temporal perception of the model and ensures a comprehensive analysis of the audio data. The corresponding formulas can be expressed as:
$P(t, f, 2i) = \sin\!\left( t / 10000^{2i/d} \right)$ ;
$P(t, f, 2i+1) = \cos\!\left( t / 10000^{2i/d} \right)$ ;
where t and f are the time and frequency positions in the mel spectrogram, i is the dimension index, and d is the number of Transformer attention units. Using the above formulas, we derive a position encoding P with the same shape as the feature map $X \in \mathbb{R}^{T\times F\times d}$ newly extracted by HTS-AT, where T denotes the dimension of the time axis, F denotes the dimension of the frequency axis, and d denotes the number of channels.
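A minimal sketch, assuming NumPy and an even channel number d, of how such a time-axis position encoding can be computed and broadcast to the (T, F, d) shape of the feature map:

import numpy as np

def time_position_encoding(T: int, F: int, d: int) -> np.ndarray:
    # Sinusoidal encoding over the time axis, repeated along the frequency axis.
    pe = np.zeros((T, d))
    t = np.arange(T)[:, None]            # time positions
    k = np.arange(0, d, 2)[None, :]      # even channel indices (2i in the formula)
    pe[:, 0::2] = np.sin(t / 10000 ** (k / d))
    pe[:, 1::2] = np.cos(t / 10000 ** (k / d))
    # Broadcast to the same (T, F, d) shape as the HTS-AT feature map
    return np.repeat(pe[:, None, :], F, axis=1)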
2) X and P are flattened along the time and frequency axes to obtain a d×TF feature map and position encoding, which are then input into the Transformer encoder. The Transformer treats sound event detection as a set prediction problem and assumes that each event is independent. Thus, the standard autoregressive decoding mechanism employed in machine translation tasks is discarded. Instead, the decoder takes N learned embeddings (called event queries) as input and outputs N event representations in parallel, where N is a hyperparameter larger than the typical number of events in an audio clip. Finally, the events from the decoder are converted into event detection results by the prediction FFNs.
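The parallel decoding with learned event queries can be sketched as follows (a DETR-style illustration with assumed values for the model width, the number of heads, layers and queries, not the exact decoder configuration of this embodiment):

import torch
import torch.nn as nn

class EventDecoder(nn.Module):
    # N learned event queries attend to the encoder memory and are decoded
    # in parallel into N event representations.
    def __init__(self, d_model: int = 256, nhead: int = 8,
                 num_layers: int = 6, num_queries: int = 20):
        super().__init__()
        self.event_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        # memory: (batch, T*F, d_model), the flattened encoder output
        queries = self.event_queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.decoder(queries, memory)  # (batch, num_queries, d_model)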
S33, calculating event loss
In order to calculate the event-level loss, a matching between the target events and the predicted events is required; this matching is obtained by the Hungarian algorithm, and unmatched predictions are marked as "empty". The loss function consists of a position loss and a classification loss:
$L = L_{loc} + L_{cls}$ ;
The position loss is calculated only for predictions that are not matched to the "empty" event ($c_i \neq \varnothing$), and is a linear combination of the L1 norm and the IoU loss between the target and predicted position vectors:
$L_{loc} = \sum_{i=1}^{N} \mathbb{1}_{\{c_i \neq \varnothing\}} \left[ \lambda_{L1} \left\| y_i - \hat{y}_{\hat{\sigma}(i)} \right\|_{1} + \lambda_{iou}\, L_{iou}\!\left( y_i, \hat{y}_{\hat{\sigma}(i)} \right) \right]$ ;
where $\lambda_{L1}, \lambda_{iou} \in \mathbb{R}$ are hyperparameters, $y_i$ and $\hat{y}_{\hat{\sigma}(i)}$ are the target and matched predicted position vectors, $\hat{\sigma}$ is the assignment given by the matching process, and N is the number of predictions. The classification loss is the cross entropy between the labels and the predictions:
$L_{cls} = -\sum_{i=1}^{N} \log \hat{p}_{\hat{\sigma}(i)}\!\left( c_{i} \right)$ ;
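The bipartite matching step itself can be sketched with SciPy's Hungarian-algorithm implementation; the cost values below are illustrative only:

import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost_matrix: np.ndarray):
    # cost_matrix[i, j]: matching cost between predicted event i and target event j
    # (for example a classification cost plus an L1/IoU position cost).
    pred_idx, tgt_idx = linear_sum_assignment(cost_matrix)
    return list(zip(pred_idx, tgt_idx))

# Illustrative usage with 3 predictions and 2 target events (assumed costs):
costs = np.array([[0.2, 0.9],
                  [0.8, 0.1],
                  [0.5, 0.7]])
matches = hungarian_match(costs)  # prediction 0 -> target 0, prediction 1 -> target 1;
                                  # prediction 2 stays unmatched and is marked "empty"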
S34, evaluating training results
When the performance of the trained initial model is evaluated based on the verification data, the Jaccard similarity is used to evaluate the network training effect. The prediction result for each segment is compared with the corresponding ground truth, and the Jaccard similarity between the ground truth and the predicted event is calculated. Based on this similarity, the predictions are classified as follows: if the Jaccard similarity is greater than 0.5, the prediction is considered a true positive (True Positive, TP); if the similarity is between 0 and 0.5, the prediction is considered a false negative (False Negative, FN); and if the similarity is zero, the prediction is considered a false positive (False Positive, FP). Note that true negatives (True Negatives, TN) cannot be defined in this event detection task. To evaluate the performance of our model, we use the positive predictive value (Positive Predictive Value, PPV), the sensitivity (Se), and the harmonic mean of PPV and Se (F1 score).
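A small sketch of this Jaccard-based scoring of a single prediction against its reference event, using (onset, offset) intervals in seconds:

def jaccard(pred, ref):
    # Jaccard similarity between a predicted and a reference (onset, offset) interval
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

def classify_prediction(pred, ref):
    # Maps a prediction to TP / FN / FP according to the thresholds described above
    j = jaccard(pred, ref)
    if j > 0.5:
        return "TP"   # sufficient overlap: true positive
    elif j > 0.0:
        return "FN"   # partial overlap: counted as a false negative
    else:
        return "FP"   # no overlap: false positive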
1. Positive predictive value
The positive predictive value evaluates the accuracy of the model in detecting breath sound events, i.e., how many of the events detected by the model are correct. It is calculated by the following formula:
$PPV = \dfrac{TP}{TP + FP}$ ;
Where TP represents the number of samples correctly predicted to be positive and FP represents the number of samples incorrectly predicted to be positive.
2. Sensitivity
The sensitivity evaluates the ability of the model to detect actual breath sound events, i.e., how many of the actual events are correctly detected. It is calculated by the following formula:
$Se = \dfrac{TP}{TP + FN}$ ;
Where TP represents the number of samples that are correctly predicted to be positive and FN represents the number of samples that are incorrectly predicted to be negative.
3. F1 Score
The F1 score comprehensively considers the positive predictive value and the sensitivity of the model and is a balanced performance index. It is calculated by the following formula:
$F1 = \dfrac{2 \times PPV \times Se}{PPV + Se}$ ;
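The three evaluation indices can then be computed directly from the event counts, as in the following sketch:

def evaluation_metrics(tp: int, fp: int, fn: int):
    # Positive predictive value, sensitivity and their harmonic mean (F1 score)
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    se = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * ppv * se / (ppv + se) if (ppv + se) else 0.0
    return ppv, se, f1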
These metrics are critical to the evaluation and optimization of breath sound event detection models, and by optimizing these metrics, the clinical utility value of the model can be improved, ensuring reliable and effective diagnostic support in actual use.
S35, breath sound identification model comparison experiment
There are currently different methods to detect breath sound events, including single-class multi-model methods and multi-class single-model methods. Multi-class single-model event detection is closer to clinical medical application and is relatively easy to deploy; however, single-class multi-model approaches generally achieve higher scores than multi-class single-model event detection. Two sets of comparative experiments, a multi-class single-model comparison and a single-class multi-model comparison, were therefore performed to demonstrate the superiority and practicality of our model.
1) The results of three multi-class single models, CRNN, TCN and REDT, on the HF_Lung_V1 dataset were compared using the lung SED index (JIE_F1) and the event-based index (TBE_F1), as shown in Table 1 below. The REDT model is higher than the CRNN and TCN models on all evaluation indices. For the JIE_F1 index, the REDT model improves the score by more than 40% on average across all event detections compared with the baseline of the best CRNN model.
TABLE 1 Comparison of different multi-class single models
2) The detection results of five single-class multi-model methods, LSTM, BiGRU, CNN-GRU, CNN-BiGRU and multi-branch TCN, on the HF_Lung_V1 dataset were compared with the REDT model using the lung SED index (JIE_F1), as shown in Table 2 below. The REDT model is far ahead of the single-class multi-models in the JIE_F1 score for each event class; compared with the most advanced multi-branch TCN model, the REDT model is only 2.7% lower in the inhalation score and scores much higher on the other event classes. In summary, the results show that the REDT multi-class single model based on event-level detection performs better in breath sound event detection than the single-class multi-models and reaches the state of the art.
TABLE 2 Comparison of single-class multi-models and the multi-class single model
In fact, when S4 is executed, as shown in fig. 5, the breath sounds are identified based on the breath sound identification model to obtain an identification result, and the identification result is the event type and event position represented by the breath sounds. On the vertical axis, segment C represents continuous adventitious sound events, segment D represents discontinuous adventitious sound events, segment I represents inhalation events, and segment E represents exhalation events. The estimated label bar is the identification result of the REDT model, and the reference label bar is the comprehensive judgment of several senior physicians. As can be seen, the two are very close, indicating that REDT has high accuracy in identifying breath sounds.
While embodiments of the present invention have been described in detail hereinabove, it will be apparent to those skilled in the art that various modifications and variations can be made to these embodiments. It is to be understood that such modifications and variations are within the scope and spirit of the present invention as set forth in the following claims. Moreover, the invention described herein is capable of other embodiments and of being practiced or of being carried out in various ways.