
CN119601036A - A Breathing Sound Recognition Method Based on Event-Level Detection Technology - Google Patents

A Breathing Sound Recognition Method Based on Event-Level Detection Technology

Info

Publication number
CN119601036A
CN119601036A (Application CN202510122041.6A)
Authority
CN
China
Prior art keywords
event
respiratory sound
level detection
detection technology
respiratory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510122041.6A
Other languages
Chinese (zh)
Inventor
张明辉
董高杨
王建鸿
沈雨飞
吴佳凯
孙萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang University
Original Assignee
Nanchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang University filed Critical Nanchang University
Priority to CN202510122041.6A priority Critical patent/CN119601036A/en
Publication of CN119601036A publication Critical patent/CN119601036A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract


The present invention provides a method for identifying respiratory sounds based on event-level detection technology, and belongs to the technical field of audio signal recognition. Aiming at the problem that the existing methods are inaccurate in detecting respiratory sound events, the present invention provides a method for identifying respiratory sounds based on event-level detection technology, which builds a respiratory sound recognition model based on hierarchical token semantic audio Transformer to detect abnormal respiratory sound events, thereby improving the detection accuracy and speed of respiratory sound events, and thus improving the diagnostic efficiency of clinical respiratory diseases.

Description

Breath sound identification method based on event level detection technology
Technical Field
The invention relates to the technical field of audio signal identification, in particular to a breath sound identification method based on an event level detection technology.
Background
Respiratory diseases have a high incidence and a high mortality rate, so early detection, early diagnosis and early treatment are of positive clinical significance. Respiratory sounds are the noises generated during breathing and reveal normal or abnormal conditions of the respiratory system; common abnormal respiratory sounds emitted from the lungs include crackles, wheezes, rhonchi and stridor. The identification of abnormal breath sounds plays a critical role in clinical medicine. Traditional breath sound identification methods rely mainly on a handheld stethoscope; they are affected by the diagnostic experience and subjective judgment of doctors and cannot be analyzed quantitatively. In recent years, researchers have used computer algorithms to process and analyze breath sounds to achieve automated and quantitative breath sound recognition.
Currently, breath sound recognition mainly adopts frame-based event detection: each audio frame is assigned an event category, and consecutive frame-level predictions are then aggregated to identify the boundaries of a sound event. In this frame-level approach, the input audio signal is divided into segments of fixed length, and the sound events in each segment are classified separately. Such a segment-and-classify strategy discards the temporal information of events and cannot effectively identify consecutively occurring events. Sound event detection (Sound Event Detection, SED) techniques detect and classify specific sound events in a given audio signal and determine the start and end times of those events. Because SED describes the timing of events, it is more effective at detecting consecutively occurring events.
Applying sound event detection techniques to breath sound recognition helps to further improve the diagnostic efficiency for respiratory diseases; however, such methods are still under-explored. The present invention therefore provides a solution to this problem.
Disclosure of Invention
The invention aims to provide a breath sound identification method based on an event level detection technology, which can solve the problem that the existing method is inaccurate in breath sound event detection.
The invention provides a breath sound identification method based on an event level detection technology, which comprises the following steps:
Acquiring and preprocessing breath sound data to construct a training set;
Constructing an initial model based on a hierarchical token semantic audio Transformer, adding position encoding to the input of each layer of the Transformer encoder, and introducing a masked multi-head self-attention mechanism in each layer of the Transformer decoder;
Training the initial model based on the training set to obtain a breath sound recognition model;
And identifying the breath sound based on the breath sound identification model to obtain an identification result.
According to the breath sound identification method based on the event-level detection technology, the breath sound recognition model is constructed based on the hierarchical token semantic audio Transformer to detect abnormal breath sound events, so that the detection accuracy and speed of breath sound events are improved, and the diagnostic efficiency for clinical respiratory diseases is improved.
Optionally, the breath sound data include continuous adventitious sounds and discontinuous adventitious sounds.
Optionally, when preprocessing the breath sound data, the audio signals in the dataset are resampled at 16 kHz and converted into mel spectrograms.
Optionally, when preprocessing the breath sound data, selected regions of the mel spectrogram are masked using time and frequency masking techniques before the frequency band is shifted.
Optionally, the initial model consists of a hierarchical token semantic audio Transformer, a Transformer encoder, a Transformer decoder, and a feedforward neural network.
Optionally, when the breath sound recognition model is obtained by training the initial model on the training set, features of the mel spectrogram are extracted by the hierarchical token semantic audio Transformer, the extracted features combined with one-dimensional position encoding are further trained by the Transformer encoder, event representations are generated by the Transformer decoder, the generated events are finally converted into event detection results by the feedforward neural network, and event-level losses are calculated using the Hungarian algorithm.
Optionally, when event-level losses are calculated using the Hungarian algorithm, the loss function includes a position loss function and a classification loss function.
Optionally, when the initial model is trained based on the training set to obtain the breath sound recognition model, the training set is divided into training data and verification data, the initial model is trained based on the training data, and performance evaluation is performed on the trained initial model based on the verification data.
Optionally, when performing performance evaluation of the trained initial model on the verification data, the evaluation metrics comprise the positive predictive value, the sensitivity, and the harmonic mean of the positive predictive value and the sensitivity.
Optionally, when the breath sounds are identified based on the breath sound identification model to obtain an identification result, the identification result is an event type and an event position represented by the breath sounds.
Drawings
FIG. 1 is a diagram of a Mel spectrogram coding process;
FIG. 2 is a diagram showing the structural components of a REDT model;
FIG. 3 is a diagram of the structure of the Transformer encoder and decoder;
FIG. 4 is a diagram of a REDT model training process;
FIG. 5 is a diagram of REDT model identification results.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. Unless otherwise defined, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. As used herein, the word "comprising" and the like means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof without precluding other elements or items.
The embodiment of the invention provides a breath sound identification method based on an event level detection technology, which comprises the following technical scheme:
S1, acquiring and preprocessing breathing sound data to construct a training set;
S2, constructing an initial model based on a Hierarchical Token Semantic Audio Transformer (HTS-AT), adding position encoding to the input of each layer of the Transformer encoder (Transformer Encoder), and introducing a masked multi-head self-attention mechanism in each layer of the Transformer decoder (Transformer Decoder);
S3, training an initial model based on a training set to obtain a breathing sound identification model;
S4, recognizing the breath sounds based on the breath sound recognition model to obtain a recognition result.
In practice, when S1 is performed, the breath sound data are acquired from the HF_Lung_V1 public dataset, recorded from 279 patients and comprising 9765 recordings, each 15 seconds long, which is the largest public lung sound recording dataset so far. The dataset includes 34095 inhalation (Inhalation) events and 18349 exhalation (Exhalation) events, 13883 continuous adventitious sound (Continuous Adventitious Sounds, CAS) events and 15606 discontinuous adventitious sound (Discontinuous Adventitious Sounds, DAS) events; the continuous adventitious sounds comprise 8457 wheeze events, 686 stridor events and 4740 rhonchi events, and the discontinuous adventitious sounds are all crackle events. The 9765 recordings were divided into 7809 training recordings and 1956 test recordings according to the split provided by the dataset authors.
The respiratory sound data for training and testing are preprocessed: the audio signals in the dataset are resampled at 16 kHz, with n_fft set to 1024, the window size to 1024, the hop size to 323 and the number of mel filters to 64. As shown in fig. 1, the data are converted into mel spectrograms, and each spectrogram is masked using time and frequency masking techniques and then band-shifted, thereby achieving data augmentation.
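For illustration only, a minimal preprocessing sketch along the lines described above is shown below (Python with librosa; the file name, masking widths and the simple band shift are illustrative assumptions, not values fixed by this embodiment):

```python
# Sketch: 16 kHz resampling, mel spectrogram (n_fft=1024, window 1024,
# hop 323, 64 mel bands), then SpecAugment-style time/frequency masking
# and a simple band shift for augmentation. Illustrative only.
import numpy as np
import librosa

def preprocess(path, sr=16000, n_fft=1024, hop=323, n_mels=64):
    y, _ = librosa.load(path, sr=sr)                       # resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, win_length=n_fft,
        hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel)                        # (n_mels, frames)

def augment(mel, max_f=8, max_t=32, rng=np.random.default_rng(0)):
    mel = mel.copy()
    f0 = rng.integers(0, mel.shape[0] - max_f)             # frequency masking
    mel[f0:f0 + rng.integers(1, max_f), :] = mel.min()
    t0 = rng.integers(0, mel.shape[1] - max_t)             # time masking
    mel[:, t0:t0 + rng.integers(1, max_t)] = mel.min()
    return np.roll(mel, rng.integers(-2, 3), axis=0)       # simple band shift

mel = augment(preprocess("breath_recording.wav"))           # hypothetical file
```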
In fact, when performing S2, note that in some breath sound event detection tasks each sound event category requires its own separate model. On the HF_Lung_V1 dataset this means 4 models are required, for inhalation, exhalation, CAS and DAS respectively, which is called the single-class multi-model approach. This approach clearly increases the demand for computing resources and memory and is difficult to apply in clinical practice. In addition, breath sound event detection is a typical polyphonic SED task, requiring simultaneous identification of multiple overlapping breath sound events. To this end, we propose a multi-class single model, REDT (Respiratory Sound Event Detection Transformer), which uses only one model to detect all respiratory events. As shown in fig. 2, the REDT model consists of HTS-AT, a Transformer encoder (Transformer Encoder), a Transformer decoder (Transformer Decoder) and a feedforward neural network (Feedforward Neural Network, FFN).
S21, HTS-AT is used to extract features from the mel spectrogram. HTS-AT is based on the Transformer deep learning architecture, whose core is a self-attention mechanism that enables the model to handle long-range dependencies in sequence data and to process all elements of a sequence in parallel, which is difficult to achieve with conventional recurrent and convolutional neural networks. In audio processing tasks such as respiratory sound event detection, using a Transformer as the feature extraction component exploits its strong sequence modeling capability to extract the key features of the audio signal, thereby improving the accuracy and robustness of event detection. Compared with a convolutional neural network, the Transformer is more flexible in processing sequence data and better captures the temporal information and long-range dependencies in the audio signal.
S22, the Transformer encoder is used to extract position information. As shown in fig. 3, the Transformer encoder is formed by stacking several identical encoder layers; each encoder layer contains a multi-head self-attention mechanism, position encoding, two layer-normalization layers with two residual connections, and a feed-forward neural network. In sharp contrast to the standard Transformer, which merges the position encoding only into the initial input, we add the position encoding to the input of every layer to increase its localization capability.
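For illustration only, the per-layer position-encoding idea of S22 could be sketched as follows (PyTorch; the model dimension, number of heads and number of layers are illustrative assumptions):

```python
# Sketch: unlike a standard Transformer, the same positional encoding is
# re-added to the input of every encoder layer. Illustrative only.
import torch
import torch.nn as nn

class PerLayerPEEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)])

    def forward(self, x, pos):               # x, pos: (batch, seq, d_model)
        for layer in self.layers:
            x = layer(x + pos)               # re-inject position encoding per layer
        return x

enc = PerLayerPEEncoder()
feats = torch.randn(2, 100, 256)             # features from the backbone
pe = torch.randn(2, 100, 256)                # position encoding, same shape
out = enc(feats, pe)                          # (2, 100, 256)
```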
S23, the Transformer decoder is used to generate event representations. As shown in fig. 3, the input of the Transformer decoder is first converted to vectors by an embedding layer, while position encoding is added to capture positional information. The input is then processed through N decoder layers; each decoder layer contains a masked multi-head self-attention, which ensures that no future position information is revealed when the sequence is generated, an encoder-decoder attention, through which the decoder attends to the output of the encoder and makes decoding predictions, and a feed-forward neural network that applies a further non-linear transformation to the output of the attention layers. Finally, the output layer is a linear layer followed by a softmax function that generates, for each timestamp, a probability distribution over the breath sound events.
S24, the FFN is the most basic neural network structure, composed of several layers, each containing multiple neurons. In an FFN, information flows from the input layer to the hidden layers and then to the output layer; the flow is unidirectional, with no feedback connections or loops. Two FFNs are used in the present model, one for multi-label classification and one for timestamp prediction.
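For illustration only, a minimal sketch of how the decoder of S23 and the two FFNs of S24 could fit together is shown below (PyTorch; the dimensions, number of event queries and class count are illustrative assumptions, and the masked self-attention of S23 would be supplied via the decoder's tgt_mask argument, left at its default here):

```python
# Sketch: N learned event queries attend to the encoder memory; two small
# feed-forward heads map each event representation to class logits and a
# normalized (start, end) timestamp pair. Illustrative only.
import torch
import torch.nn as nn

class EventHead(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=3,
                 n_queries=20, n_classes=4):
        super().__init__()
        self.queries = nn.Embedding(n_queries, d_model)     # learned event queries
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.cls_ffn = nn.Linear(d_model, n_classes + 1)    # +1 for the "empty" class
        self.pos_ffn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 2), nn.Sigmoid())             # normalized (start, end)

    def forward(self, memory):                               # memory: (B, TF, d)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        events = self.decoder(q, memory)                     # (B, N, d)
        return self.cls_ffn(events), self.pos_ffn(events)    # logits, timestamps

logits, times = EventHead()(torch.randn(2, 100, 256))
```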
In fact, when the initial model is trained on the training data in S3, as shown in fig. 4, features of the mel spectrogram are extracted by the hierarchical token semantic audio Transformer, the extracted features combined with one-dimensional position encoding are further trained by the Transformer encoder, event representations are then generated by the Transformer decoder, the generated events are finally converted into event detection results by the FFN, and event-level losses are calculated using the Hungarian algorithm.
S31, extracting features
To increase computational efficiency, HTS-AT partitions the mel spectrogram into smaller patches and then extracts features with a window attention mechanism, which reduces training time and computational resources while maintaining high performance and a lightweight model.
1) In the Swin Transformer, the Patch Partition module divides an input RGB image of size H × W × 3 into non-overlapping patches of the same size, resulting in a tensor of dimension N × (P² × 3). Each P² × 3 patch is regarded as a patch token, and N patches are obtained in total, i.e. the length of the effective Transformer input sequence. The patch partitioning used in the Swin Transformer is designed for RGB images and is not directly applicable to audio mel spectrograms, whose time and frequency axes do not correspond to the horizontal and vertical axes of an image; a time → frequency → window order is therefore used for patch embedding of the mel spectrogram.
The audio mel spectrogram is divided into patch tokens by a Patch-Embed CNN with kernel size (P × P), and the tokens are then fed sequentially into the Transformer. Unlike an image, the width and height of the audio mel spectrogram carry different information, namely the time and frequency axes. In general, the duration is significantly longer than the frequency span, so to better capture the relationship between frequencies within the same time frame we first divide the mel spectrogram into patch windows w1, w2, ..., wn and then split the patches within each window. The token sequence is ordered by time, then frequency, then window, so that patches at different frequencies within the same time frame are adjacent in the input sequence.
2) A linear embedding maps each patch token to an arbitrary dimension (set to D), so with N patches a matrix of dimension N × D is obtained. For groups 2, 3 and 4, a patch-merge layer is employed to reduce the sequence size. The merging operation combines four adjacent patches into one, so the number of channels becomes four times larger (4D); a linear layer is then added to project the 4D channels down to 2D channels. After the four network groups, each spatial dimension of the patch token map is reduced by a factor of 8 (from T/P × F/P to T/8P × F/8P), so GPU memory consumption drops sharply after each group.
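For illustration only, the patch embedding and patch merging described above could be sketched as follows (PyTorch; the patch size P, dimension D and input shape are illustrative assumptions):

```python
# Sketch: a Conv2d with kernel and stride P cuts the mel spectrogram into
# P x P patch tokens of dimension D, and a patch-merge step concatenates
# 2 x 2 neighbours (4D channels) and projects them to 2D channels.
import torch
import torch.nn as nn

P, D = 4, 96
patch_embed = nn.Conv2d(1, D, kernel_size=P, stride=P)       # Patch-Embed CNN
merge_proj = nn.Linear(4 * D, 2 * D)                          # 4D -> 2D channels

mel = torch.randn(1, 1, 64, 256)                              # (B, 1, freq, time)
tokens = patch_embed(mel)                                     # (1, D, 16, 64)

# Patch merging: group each 2x2 neighbourhood, concatenate along channels.
B, C, F, T = tokens.shape
merged = tokens.reshape(B, C, F // 2, 2, T // 2, 2)           # split into 2x2 blocks
merged = merged.permute(0, 2, 4, 3, 5, 1).reshape(B, F // 2, T // 2, 4 * C)
merged = merge_proj(merged)                                   # (1, 8, 32, 2D)
```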
For each Swin Transformer block in a group, a window attention mechanism is employed to reduce computational complexity. First, the feature map is divided into attention windows aw1, aw2, ..., awk, each containing M × M patches. We then compute the attention matrix only within each attention window, so we have k window attention matrices instead of one global attention matrix. For an audio patch token map of size F × T and an initial latent dimension D, the computational complexity of the two mechanisms in a single Transformer block is as follows:
$\Omega(\mathrm{MSA}) = 4\,FTD^{2} + 2(FT)^{2}D$;
$\Omega(\mathrm{W\text{-}MSA}) = 4\,FTD^{2} + 2M^{2}FTD$;
where the window attention reduces the second complexity term by a factor of $FT/M^{2}$. For audio patch tokens arranged in a time-frequency-window sequence, each window attention module computes relationships within a specific range of consecutive frequency bins and time frames. As the network deepens, the Patch-Merge layer merges adjacent windows, enabling attention relationships to be computed over a larger region.
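As a hypothetical worked example (the values are chosen purely for illustration and are not taken from this embodiment): for an $F \times T = 8 \times 256$ patch token map with $D = 96$ and window size $M = 8$, the second term of the global attention costs $2(FT)^{2}D = 2\cdot 2048^{2}\cdot 96 \approx 8.1\times 10^{8}$ operations, whereas the windowed version costs $2M^{2}FTD = 2\cdot 64\cdot 2048\cdot 96 \approx 2.5\times 10^{7}$, a reduction by the factor $FT/M^{2} = 32$.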
S32, generating sound event
For class localization on the time axis, the extracted features are combined with a one-dimensional position encoding and further trained by the Transformer encoder, which can better attend to the time-domain information. Finally, the Transformer decoder produces the predicted timestamps used for localization and classification.
1) A one-dimensional position encoding is integrated into our model. This encoding mechanism enables the network to learn the temporal localization of individual sound elements in depth and prevents important information from being missed. Essentially, it enhances the temporal perception of the model and ensures a comprehensive analysis of the audio data. The corresponding formulas can be expressed as:
$PE_{(t,\,2i)} = \sin\left(t / 10000^{2i/d}\right)$;
$PE_{(t,\,2i+1)} = \cos\left(t / 10000^{2i/d}\right)$;
where t and f are the time and frequency indices in the mel spectrogram, i is the dimension index, and d is the number of Transformer attention units. Using the above formulas, we can derive a position encoding $P$ with the same shape as $X \in \mathbb{R}^{T\times F\times d}$, where $X$ is the new feature map extracted by HTS-AT, T is the dimension of the time axis, F is the dimension of the frequency axis, and d is the number of channels.
2) $X$ and $P$ are flattened along the time and frequency axes to obtain a d × TF feature map and position encoding, which are then input into the Transformer encoder. The Transformer treats sound event detection as a set prediction problem and assumes that each event is independent. Thus, the standard autoregressive decoding mechanism employed in machine translation is discarded. Instead, the decoder takes N learned embeddings (called event queries) as input and outputs N event representations in parallel, where N is a hyper-parameter larger than the typical number of events in an audio clip. Finally, the events from the decoder are converted into event detection results, such as the event class and its start and end timestamps, by the prediction FFNs.
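For illustration only, the one-dimensional position encoding and the flattening step could be sketched as follows (PyTorch; the shapes are illustrative assumptions, and the feature map is kept channel-last for simplicity):

```python
# Sketch: a (T, d) sinusoidal time encoding is broadcast over the F frequency
# bins so that P matches the (T, F, d) HTS-AT feature map X, and both are
# flattened into a TF x d sequence for the encoder. Illustrative only.
import torch

def time_position_encoding(T, F, d):
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1)           # (T, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)                   # even dims
    angle = t / torch.pow(10000.0, i / d)                            # (T, d/2)
    pe = torch.zeros(T, d)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe.unsqueeze(1).expand(T, F, d)                           # broadcast over F

T, F, d = 32, 8, 256
X = torch.randn(T, F, d)                     # feature map from HTS-AT
P = time_position_encoding(T, F, d)
encoder_input = (X + P).reshape(T * F, d)    # flattened TF x d sequence
```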
S33, calculating event loss
In order to calculate event-level losses, a matching between the target events and the predicted events is required; this matching is obtained by the Hungarian algorithm, and unmatched predictions are labeled as "empty". The loss function consists of a position loss and a classification loss:
$\mathcal{L} = \sum_{i=1}^{N}\left[\mathcal{L}_{cls}(i) + \mathbb{1}_{\{c_{i}\neq\varnothing\}}\,\mathcal{L}_{pos}(i)\right]$;
The position loss is calculated only for predictions that are not matched to the "empty" event ($c_{i}\neq\varnothing$); it is a linear combination of the L1 norm and an IoU penalty between the target and predicted position vectors:
$\mathcal{L}_{pos}(i) = \lambda_{IoU}\,\mathcal{L}_{IoU}\left(b_{i},\hat{b}_{\hat{\sigma}(i)}\right) + \lambda_{L1}\left\lVert b_{i}-\hat{b}_{\hat{\sigma}(i)}\right\rVert_{1}$;
where $\lambda_{IoU},\lambda_{L1}\in\mathbb{R}$ are hyper-parameters, $\hat{\sigma}$ is the assignment given by the matching process, and N is the number of predictions. The classification loss is the cross entropy between the label and the prediction:
$\mathcal{L}_{cls}(i) = -\log\hat{p}_{\hat{\sigma}(i)}(c_{i})$;
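For illustration only, the Hungarian matching underlying the event-level loss could be sketched as follows (Python with scipy's linear_sum_assignment; the cost weights and shapes are illustrative assumptions):

```python
# Sketch: pair predicted events with target events using a combined
# class/position cost; unmatched predictions are then treated as "empty".
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_events(pred_probs, pred_pos, tgt_cls, tgt_pos, w_l1=1.0):
    # pred_probs: (N, C+1) class probabilities, pred_pos: (N, 2) start/end
    # tgt_cls: (M,) target class ids, tgt_pos: (M, 2) target start/end
    cls_cost = -pred_probs[:, tgt_cls]                         # (N, M)
    l1_cost = np.abs(pred_pos[:, None, :] - tgt_pos[None, :, :]).sum(-1)
    cost = cls_cost + w_l1 * l1_cost
    pred_idx, tgt_idx = linear_sum_assignment(cost)            # optimal assignment
    return pred_idx, tgt_idx                                    # matched pairs

probs = np.random.rand(10, 5)
probs /= probs.sum(1, keepdims=True)
pred_idx, tgt_idx = match_events(probs, np.random.rand(10, 2),
                                 np.array([0, 2]), np.random.rand(2, 2))
```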
S34, evaluating training results
When evaluating the performance of the trained initial model on the verification data, the network training effect is evaluated with the Jaccard similarity. The prediction for each segment is compared with the corresponding ground truth, and the Jaccard similarity between the ground truth and the predicted event is calculated. Based on this similarity, a prediction is classified as a true positive (True Positive, TP) if the Jaccard similarity is greater than 0.5, as a false negative (False Negative, FN) if the similarity is between 0 and 0.5, and as a false positive (False Positive, FP) if the similarity is zero. Note that true negatives (True Negatives, TN) cannot be defined in this event detection task. To evaluate model performance, we use the positive predictive value (Positive Predictive Value, PPV), the sensitivity (Se), and the harmonic mean of PPV and Se (F1 score).
1. Positive predictive value
The positive predictive value evaluates the accuracy of the model in detecting breath sound events, i.e. how many of the events detected by the model are correct. It is calculated by the following formula:
$PPV = \frac{TP}{TP + FP}$;
Where TP represents the number of samples correctly predicted to be positive and FP represents the number of samples incorrectly predicted to be positive.
2. Sensitivity
Sensitivity evaluates the ability of the model to detect actual respiratory sound events, i.e. how many of the actual events are correctly detected. It is calculated by:
$Se = \frac{TP}{TP + FN}$;
Where TP represents the number of samples that are correctly predicted to be positive and FN represents the number of samples that are incorrectly predicted to be negative.
3. F1 Score
The F1 score jointly considers the positive predictive value and the sensitivity of the model and is a balanced performance index. Its calculation formula is as follows:
$F1 = \frac{2 \times PPV \times Se}{PPV + Se}$;
These metrics are critical to the evaluation and optimization of breath sound event detection models, and by optimizing these metrics, the clinical utility value of the model can be improved, ensuring reliable and effective diagnostic support in actual use.
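For illustration only, the Jaccard-based matching rule and the three metrics could be sketched as follows (Python; the example intervals are illustrative assumptions):

```python
# Sketch: 1-D Jaccard similarity between a predicted and a reference event
# interval, and PPV / Se / F1 computed from TP / FP / FN counts, following
# the 0.5 rule given in the text. Illustrative only.
def jaccard(pred, ref):
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

def scores(tp, fp, fn):
    ppv = tp / (tp + fp) if tp + fp else 0.0
    se = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ppv * se / (ppv + se) if ppv + se else 0.0
    return ppv, se, f1

sim = jaccard((1.2, 2.8), (1.0, 2.5))        # overlap of two event intervals
if sim > 0.5:
    tp, fn, fp = 1, 0, 0                      # true positive
elif sim > 0:
    tp, fn, fp = 0, 1, 0                      # false negative
else:
    tp, fn, fp = 0, 0, 1                      # false positive
print(scores(tp, fp, fn))
```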
S35, breath sound identification model comparison experiment
There are currently different approaches to detecting respiratory sound events, including single-class multi-model methods and multi-class single-model methods. Multi-class single-model event detection is closer to clinical medical application and relatively easy to deploy; however, single-class multi-model methods tend to score higher than multi-class single-model event detection. Two sets of comparative experiments, single-class multi-model and multi-class single-model, were therefore performed to demonstrate the superiority and practicality of our model.
1) The results of three multi-class single models, CRNN, TCN and REDT, were compared on the HF_Lung_V1 dataset using the lung SED index (JIE_F1) and the tolerance-based event index (TBE_F1), as shown in Table 1 below. The REDT model outperforms the CRNN and TCN models on all evaluation indices. For the JIE_F1 index, compared with the best CRNN baseline, the REDT model improves the score by more than 40% on average across all event detections.
TABLE 1 comparison of different multiclass Single models
2) The detection results of five single-class multi-model methods, LSTM, BiGRU, CNN-GRU, CNN-BiGRU and multi-branch TCN, on the HF_Lung_V1 dataset were compared with the REDT model using the lung SED index (JIE_F1), as shown in Table 2 below. The REDT model is far ahead of four of the single-class multi-models in JIE_F1 score for every event class; compared with the most advanced multi-branch TCN model, REDT scores only 2.7% lower on inhalation and much higher on the other event classes. In summary, the results show that the event-level REDT single model performs better in respiratory sound event detection than the other single-class multi-models and reaches the state of the art.
TABLE 2 comparison of single-class multiple models and multiple-class single models
In fact, when S4 is executed, as shown in fig. 5, the breath sounds are identified by the breath sound recognition model, and the recognition result is the event type and event position represented by the breath sounds. On the vertical axis, track C represents continuous adventitious sound events, track D represents discontinuous adventitious sound events, track I represents inhalation events, and track E represents exhalation events. The estimated-label bar is the recognition result of the REDT model, and the reference-label bar is the combined judgment of several senior physicians. As can be seen, the two are very close, indicating that REDT identifies breath sounds with high accuracy.
While embodiments of the present invention have been described in detail hereinabove, it will be apparent to those skilled in the art that various modifications and variations can be made to these embodiments. It is to be understood that such modifications and variations are within the scope and spirit of the present invention as set forth in the following claims. Moreover, the invention described herein is capable of other embodiments and of being practiced or of being carried out in various ways.

Claims (10)

1. A respiratory sound recognition method based on event-level detection technology, characterized by comprising: acquiring and preprocessing respiratory sound data to construct a training set; constructing an initial model based on a hierarchical token semantic audio Transformer, adding position encoding to the input of each layer of the Transformer encoder, and introducing a masked multi-head self-attention mechanism in each layer of the Transformer decoder; training the initial model on the training set to obtain a respiratory sound recognition model; and recognizing respiratory sounds based on the respiratory sound recognition model to obtain a recognition result.

2. The respiratory sound recognition method based on event-level detection technology according to claim 1, characterized in that the respiratory sound data include continuous adventitious sounds and discontinuous adventitious sounds.

3. The respiratory sound recognition method based on event-level detection technology according to claim 1, characterized in that, when preprocessing the respiratory sound data, the audio signals in the dataset are resampled at 16 kHz and converted into mel spectrograms.

4. The respiratory sound recognition method based on event-level detection technology according to claim 3, characterized in that, when preprocessing the respiratory sound data, selected regions of the mel spectrogram are masked using time and frequency masking techniques before the frequency band is shifted.

5. The respiratory sound recognition method based on event-level detection technology according to claim 1, characterized in that the initial model consists of a hierarchical token semantic audio Transformer, a Transformer encoder, a Transformer decoder and a feedforward neural network.

6. The respiratory sound recognition method based on event-level detection technology according to claim 5, characterized in that, when the respiratory sound recognition model is obtained by training the initial model on the training set, features of the mel spectrogram are extracted by the hierarchical token semantic audio Transformer, the extracted features combined with one-dimensional position encoding are further trained by the Transformer encoder, event representations are generated by the Transformer decoder, the generated events are finally converted into event detection results by the feedforward neural network, and event-level losses are calculated using the Hungarian algorithm.

7. The respiratory sound recognition method based on event-level detection technology according to claim 6, characterized in that, when event-level losses are calculated using the Hungarian algorithm, the loss function includes a position loss function and a classification loss function.

8. The respiratory sound recognition method based on event-level detection technology according to claim 1, characterized in that, when the respiratory sound recognition model is obtained by training the initial model on the training set, the training set is divided into training data and verification data, the initial model is trained on the training data, and the performance of the trained initial model is evaluated on the verification data.

9. The respiratory sound recognition method based on event-level detection technology according to claim 8, characterized in that the evaluation metrics for the performance evaluation of the trained initial model on the verification data include the positive predictive value, the sensitivity, and the harmonic mean of the positive predictive value and the sensitivity.

10. The respiratory sound recognition method based on event-level detection technology according to claim 1, characterized in that, when respiratory sounds are recognized based on the respiratory sound recognition model to obtain a recognition result, the recognition result is the event type and event position represented by the respiratory sounds.
CN202510122041.6A 2025-01-26 2025-01-26 A Breathing Sound Recognition Method Based on Event-Level Detection Technology Pending CN119601036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510122041.6A CN119601036A (en) 2025-01-26 2025-01-26 A Breathing Sound Recognition Method Based on Event-Level Detection Technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510122041.6A CN119601036A (en) 2025-01-26 2025-01-26 A Breathing Sound Recognition Method Based on Event-Level Detection Technology

Publications (1)

Publication Number Publication Date
CN119601036A true CN119601036A (en) 2025-03-11

Family

ID=94839045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510122041.6A Pending CN119601036A (en) 2025-01-26 2025-01-26 A Breathing Sound Recognition Method Based on Event-Level Detection Technology

Country Status (1)

Country Link
CN (1) CN119601036A (en)

Citations (14)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347787A (en) * 2020-11-06 2021-02-09 平安科技(深圳)有限公司 Method, device and equipment for classifying aspect level emotion and readable storage medium
CN114391807A (en) * 2021-12-17 2022-04-26 珠海脉动时代健康科技有限公司 Sleep breathing disorder analysis method, device, equipment and readable medium
CN115827913A (en) * 2022-11-21 2023-03-21 中国人民解放军陆军工程大学 Positioning detection method and device for acoustic event
US11776240B1 (en) * 2023-01-27 2023-10-03 Fudan University Squeeze-enhanced axial transformer, its layer and methods thereof
CN116246654A (en) * 2023-02-13 2023-06-09 南昌大学 Breathing sound automatic classification method based on improved Swin-transducer
CN116364055A (en) * 2023-05-31 2023-06-30 中国科学院自动化研究所 Speech generation method, device, equipment and medium based on pre-training language model
CN117253505A (en) * 2023-07-27 2023-12-19 武汉轻工大学 Method and device for detecting abnormal sound of machine by combining transducer improved AE
CN117198339A (en) * 2023-09-04 2023-12-08 湖北文理学院 Health monitoring methods, devices, equipment and storage media based on voiceprint recognition
CN118211100A (en) * 2024-04-01 2024-06-18 腾讯科技(深圳)有限公司 Classification processing method and device
CN118260635A (en) * 2024-04-01 2024-06-28 腾讯科技(深圳)有限公司 Object classification processing method, device, computer equipment and storage equipment
CN118335125A (en) * 2024-04-22 2024-07-12 长沙汇致医疗科技有限公司 Method and device for identifying heart and lung sound signals and intelligent stethoscope
CN118447836A (en) * 2024-05-09 2024-08-06 上海大学 Semi-supervision-based sound event detection method, system, terminal and medium
CN119174600A (en) * 2024-09-09 2024-12-24 昆明理工大学 Intelligent pneumonia prediction system based on cough sound and lung breath sound
CN119181388A (en) * 2024-11-22 2024-12-24 浙江大学 A method and system for classifying respiratory sounds based on mel-spectrogram

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KE CHEN ET AL.: "HTS-AT: A HIERARCHICAL TOKEN-SEMANTIC AUDIO TRANSFORMER FOR SOUND CLASSIFICATION AND DETECTION", ICASSP 2022, 31 December 2022 (2022-12-31), pages 1 - 3 *

Similar Documents

Publication Publication Date Title
CN111629663B (en) Method for diagnosing respiratory diseases by analyzing cough sounds using disease signatures
Ren et al. Prototype learning for interpretable respiratory sound analysis
Chen et al. Classify respiratory abnormality in lung sounds using STFT and a fine-tuned ResNet18 network
Zhang et al. Towards open respiratory acoustic foundation models: Pretraining and benchmarking
Chen et al. Automatic multi-level in-exhale segmentation and enhanced generalized S-transform for wheezing detection
Luo et al. Croup and pertussis cough sound classification algorithm based on channel attention and multiscale Mel-spectrogram
Chen et al. Supervised and self-supervised pretraining based COVID-19 detection using acoustic breathing/cough/speech signals
CN118299030A (en) Causal relationship analysis method and system for lung sound and AECOPD symptoms
CN116842460A (en) Cough-related disease identification method and system based on attention mechanism and residual neural network
CN119181388B (en) A method and system for classifying respiratory sounds based on mel-spectrogram
Tiwari et al. Deep lung auscultation using acoustic biomarkers for abnormal respiratory sound event detection
Pandey et al. Review of acoustic features and computational Models in lung Disease diagnosis
Ariyanti et al. Abnormal respiratory sound identification using audio-spectrogram vision transformer
Liu et al. Respiratory sounds feature learning with deep convolutional neural networks
Khandhan et al. Deep learning model for chronic obstructive pulmonary disease through breathing sound
Melms et al. Training one model to detect heart and lung sound events from single point auscultations
Dong et al. Respiratory sounds classification by fusing the time-domain and 2D spectral features
Patel et al. Different Transfer Learning Approaches for Recognition of Lung Sounds
CN119601036A (en) A Breathing Sound Recognition Method Based on Event-Level Detection Technology
Hakki et al. Wheeze events detection using convolutional recurrent neural network
Deivasikamani et al. Covid cough classification using knn classification algorithm
Anupama et al. Detection of Chronic Lung Disorders using Deep Learning
CN113730755A (en) Mechanical ventilation man-machine asynchronous detection and identification method based on attention mechanism
Khanaghavalle et al. A Deep Learning Framework for Multiclass Categorization of Pulmonary Diseases
EP et al. Machine Learning Based Lung Sound Analysis Using MFCC Features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination