
WO2024260357A1 - Content-aware audio noise management - Google Patents

Content-aware audio noise management

Info

Publication number
WO2024260357A1
WO2024260357A1 (PCT/CN2024/100004)
Authority
WO
WIPO (PCT)
Prior art keywords
noise
content
audio signal
aware
frequency
Prior art date
Application number
PCT/CN2024/100004
Other languages
French (fr)
Inventor
Shaofan YANG
Kai Li
Chunghsin YEH
Ripinder SINGH
Original Assignee
Dolby Laboratories Licensing Corporation
Dolby International Ab
Shaofan YANG
Priority date
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corporation, Dolby International Ab, and Shaofan YANG
Publication of WO2024260357A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/81Detection of presence or absence of voice signals for discriminating voice from music
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/84Detection of presence or absence of voice signals for discriminating voice from noise

Definitions

  • Various example embodiments relate to audio signal processing and enhancement.
  • Noise reduction processing typically aims to significantly reduce the background and/or broadband noise with minimal reduction to the audio signal quality.
  • Such processing can be configured to remove various noise components including, but not limited to, tape hiss, microphone background noise, hum noise, wind noise, etc.
  • a proper amount of noise reduction typically depends on the type of useful audio signal, the type of noise, and the acceptable loss to the useful audio signal.
  • noise reduction processing can increase the signal-to-noise ratio (SNR) by 5 to 20 dB while retaining high audio quality for the remaining useful audio signal.
  • Techniques are described for processing audio signals. Examples found herein provide for systems, apparatus, devices, and methods to perform content-aware audio noise management. Some embodiments can beneficially be used to automatically manage the noise floor in mixed content types, e.g., including voice, music, and noise sounds. At least some embodiments can be adapted for such applications as real-time communications, content creation, and audio capture by user devices.
  • a noise management method comprising: performing audio event segmentation on an input audio signal to generate content-aware segmentation information; estimating noise floor levels associated with the input audio signal based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal; and applying noise suppression to the input audio signal to generate an output audio signal, the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels.
  • the audio event segmentation is performed in real time.
  • the performing includes processing the frame-wise speech, music, and noise confidences to identify a continuous sequence of audio events corresponding to the input audio signal, the audio events being selected from a plurality of event clusters, each of which is associated with one or more classes selected from the group consisting of a vocal class, a music class, and a noise class.
  • the performing includes a moving-average-convergence-and-divergence processing of the frame-wise speech, music, and noise confidences.
  • the estimating comprises updating the fixed-length portion using a first-in-first-out buffer configured to receive the frequency-bin data.
  • the estimating comprises truncating the sorted frequency-bin data to a size determined with an adaptive percentile estimator based on the content-aware segmentation information.
  • the adaptive percentile estimator is configured to select different respective percentiles for a speech event, a music event, and a noise event, the determined size being based on a selected one of the different respective percentiles.
  • the applying comprises computing suppression gain values based on the calculated bin-wise signal-to-noise-ratio values and further based on the estimated noise floor levels.
  • the computing comprises: computing a respective first suppression gain value based on a respective one of the bin-wise signal-to-noise-ratio values; computing a respective second suppression gain value based on the respective first suppression gain value and further based on a respective one of the estimated noise floor levels; and computing a product of the respective first suppression gain value and the respective second suppression gain value.
  • the applying comprises applying different respective aggressiveness settings of the noise suppression to different types of audio content based on the content-aware segmentation information.
  • the method further comprises converting the input audio signal to the frequency-bin data using a Fourier transform and applying an inverse Fourier transform to the frequency bins after having applied the noise suppression thereto to generate the output audio signal.
  • a non-transitory computer-readable medium storing instructions that, when executed by an electronic processor, cause the electronic processor to perform operations comprising any one of the above methods.
  • a noise management apparatus comprising: at least one processor; and at least one memory including program code, wherein the at least one memory and the program code are configured to, with the at least one processor, cause the apparatus at least to: perform audio event segmentation on an input audio signal to generate content-aware segmentation information; estimate noise floor levels associated with the input audio signal based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal; and apply noise suppression to the input audio signal to generate an output audio signal, the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels.
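  • To make the overall flow concrete, the following Python sketch strings together simplified stand-ins for the three stages (frequency-domain analysis, noise floor estimation, and suppression). The fixed low-percentile floor and Wiener-style gain here are illustrative placeholders, not the content-aware components described above.
```python
# Minimal, self-contained sketch of the overall flow: framing/FFT, a crude
# per-bin noise-floor estimate, per-bin suppression gains, and inverse FFT
# with overlap-add.  The percentile floor and Wiener-style gain are simplified
# stand-ins for the content-aware components described in this disclosure.
import numpy as np

def simple_noise_management(x, frame_len=512, hop=256):
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    X = np.array([np.fft.rfft(x[t * hop:t * hop + frame_len] * window)
                  for t in range(n_frames)])                 # spectra X[t, k]
    mag = np.abs(X)

    # Stand-in noise floor: a low percentile of each bin's history.
    noise_floor = np.percentile(mag, 20, axis=0)              # N[k]

    # Stand-in suppression gain: Wiener-like mapping of bin-wise SNR.
    snr = (mag ** 2) / np.maximum(noise_floor ** 2, 1e-12)
    gain = snr / (snr + 1.0)                                  # G[t, k] in (0, 1)

    # Apply gains and resynthesize by overlap-add of inverse FFTs.
    Y = X * gain
    y = np.zeros((n_frames - 1) * hop + frame_len)
    for t in range(n_frames):
        y[t * hop:t * hop + frame_len] += np.fft.irfft(Y[t], frame_len)
    return y
```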
  • Various aspects of the present disclosure are directed at audio signal processing and provide improvements in at least the technical fields of audio processing, audio event classification and segmentation, noise estimation, noise suppression, and the like.
  • Some embodiments disclosed herein may be generally described as techniques, where the term “technique” may refer to system (s) , device (s) , method (s) , computer-readable instruction (s) , module (s) , component (s) , hardware logic, and/or operation (s) as suggested in the context presented below.
  • FIG. 1 is a block diagram illustrating an example content-aware noise management system in which various aspects of the present disclosure can be practiced.
  • FIG. 2 is a block diagram illustrating an example circuit implementation of the content-aware noise management system of FIG. 1 according to some aspects of the present disclosure.
  • FIG. 3 is a block diagram illustrating a workflow implemented in the audio-event classification and segmentation component of the content-aware noise management system of FIG. 1 according to some aspects of the present disclosure.
  • FIG. 4 is a schematic diagram pictorially illustrating audio event clusters used in the workflow of FIG. 3 according to some aspects of the present disclosure.
  • FIG. 5 is a flowchart illustrating a noise management method according to some aspects of the present disclosure.
  • FIG. 6A illustrates a schematic block diagram of an example device architecture that can be used to implement various aspects of the present disclosure.
  • FIG. 6B illustrates a schematic block diagram of an example CPU implemented in the device architecture of FIG. 6A that can be used to implement various aspects of the present disclosure.
  • the term “includes” and its variants are to be read as open-ended terms that mean “includes but is not limited to. ”
  • the term “or” is to be read as “and/or” unless the context clearly indicates otherwise. Such terms are to be read as having an inclusive meaning.
  • “A and B” may mean at least the following: “both A and B” , “at least both A and B” .
  • “A or B” may mean at least the following: “at least A” , “at least B” , “both A and B” , “at least both A and B” .
  • “A and/or B” may mean at least the following: “A and B” , “A or B” .
  • Noise suppression is an audio process that removes background noise from a captured signal.
  • Millions of internet-connected devices are used for an increasing variety of daily life scenarios that involve audio capture, such as calls, user content capture/creation, virtual music lessons, and live-streaming of new or evolving forms of media content (e.g., podcasts, audiobooks, extended reality content, virtual worlds, etc. ) .
  • Speech-centric noise management algorithms designed specifically for speech may disadvantageously lead to over-suppression effects when applied to music or other audio signals having useful non-speech components. Accordingly, example embodiments disclosed herein provide content-aware audio noise management systems and methods that can beneficially be used, e.g., to manage a noise floor in mixed content types. At least some embodiments can be adapted for such applications as real-time communications, content creation, and audio capture by user devices.
  • FIG. 1 is a block diagram illustrating an example content-aware noise management system 100 in which various aspects of the present disclosure can be practiced.
  • the content-aware noise management system 100 includes an audio-event classification and segmentation component 130.
  • the audio-event classification and segmentation component 130 receives input audio signals 102 and processes the input audio signals 102 to generate content-aware segmentation information.
  • the content-aware noise management system 100 also includes a content-aware noise-estimation component 110 and a content-aware noise suppression component 120.
  • the audio-event classification and segmentation component 130 is connected to the content-aware noise-estimation component 110 and the content-aware noise suppression component 120 via an information path 128, over which the content-aware segmentation information is provided thereto.
  • the content-aware noise-estimation component 110 and the content-aware noise suppression component 120 are also connected to one another via an intermediate path 112.
  • the audio-event classification and segmentation component 130 generates the content-aware segmentation information in real time
  • real time refers to a computer-based process that controls or monitors a corresponding environment by receiving data, processing the received data, and generating a response sufficiently quickly to affect or characterize the environment without significant delay.
  • real-time responses are often understood to be on the order of milliseconds, or sometimes microseconds.
  • the content-aware noise-estimation component 110 receives the input audio signals 102 and processes the received input audio signals 102 based on the content-aware segmentation information to generate intermediate output signals, which are then provided to the content-aware noise suppression component 120 via the path 112.
  • the content-aware noise suppression component 120 receives the intermediate output signals and processes the received intermediate output signals based on the content-aware segmentation information to generate output audio signals 122.
  • the processing implemented in the content-aware noise-estimation component 110 and the content-aware noise suppression component 120 is configured to generate the output audio signals 122, wherein the noise suppression is performed in frequency sub-bands (bins) having a selected frequency resolution and is based at least in part on the content-aware segmentation information.
  • the estimated noise data are provided in the intermediate output signals transmitted via the path 112.
  • FIG. 2 is a block diagram illustrating an example circuit implementation of the content-aware noise management system 100 according to some aspects of the present disclosure.
  • the content-aware noise management system 100 includes a fast Fourier transform (FFT) module 202.
  • the FFT module 202 is a shared module for the content-aware noise-estimation component 110 and the content-aware noise suppression component 120.
  • the FFT module 202 can be replaced by another suitable time-domain to frequency-domain converter, such as a filter bank.
  • the FFT module 202 is connected to the downstream modules of the content-aware noise-estimation component 110 and the audio-event classification and segmentation component 130 via an input bus 204.
  • the FFT module 202 receives the input audio signals 102 and converts the received input audio signals 102 into corresponding frequency-domain signals, which are transmitted downstream via the bus 204.
  • the frequency-domain signals are represented by a plurality of frequency bins.
  • Example conversion operations performed in the FFT module 202 include (i) partitioning an audio signal waveform, x [t], provided via the input audio signals 102 into windowed, half-overlapping frames and (ii) applying a Fourier transform to each of the frames to generate a respective spectral block of bins.
  • the amplitudes of the content of the spectral blocks are collectively denoted as X [t, k], where k represents the frequency index, and t represents the time index.
  • the quantity X [t, k] for a fixed time index t is referred to as a frame of frequency bins corresponding to time t.
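  • As an illustration of this framing and indexing convention, a short Python sketch is shown below; the frame length and window choice are assumptions for illustration only.
```python
# Illustrative framing only: half-overlapping Hann-windowed frames and the
# X[t, k] indexing used above (t = time/frame index, k = frequency-bin index).
import numpy as np

def to_spectral_blocks(x, frame_len=512):
    hop = frame_len // 2                           # half-overlapping frames
    window = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.abs(np.array([np.fft.rfft(f) for f in frames]))   # amplitudes X[t, k]
```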
  • the audio-event classification and segmentation component 130 also includes a multi-classification module 240 and an energy module 244, both of which receive the frequency bins via the bus 204.
  • the audio-event classification and segmentation component 130 further includes an audio event segmentation module 248.
  • the multi-classification module 240 and the energy module 244 are connected to the audio event segmentation module 248 via a first path 242 and a second path 246, respectively.
  • the multi-classification module 240 processes the frequency-domain signals (on the bus 204) to generate a corresponding stream of classification information, which is fed to the audio event segmentation module 248 via the first path 242.
  • the energy module 244 processes the frequency-domain signals (on the bus 204) to generate a corresponding stream of energy information, which is fed to the audio event segmentation module 248 via the second path 246.
  • the audio event segmentation module 248 processes the stream of classification information received via the first path 242 and the stream of energy information received via the second path 246 to generate the above-described content-aware segmentation information.
  • the frequency bins corresponding to a current audio frame and the adjacent historical audio frame sequence are processed in the multi-classification module 240 to generate the corresponding classification information including a speech confidence C S [t], a music confidence C M [t], and a noise confidence C N [t] of the current frame as follows:
  • the audio event segmentation module 248 uses the above-described confidences and the energy information to generate the content-aware segmentation information using a suitable segmentation function f segmentation , for example, as follows:
  • E S [t], E M [t], E N [t] represent the sound, music, and noise energies, respectively.
  • the segmentation function f segmentation is based on the moving average convergence/divergence (MACD) processing, which is described in more detail below, e.g., in reference to FIG. 3 and Eqs. (16) - (24) .
  • the content-aware noise-estimation component 110 also includes a First-In-First-Out (FIFO) buffer 212, a sorting module 216, and a percentile estimator 220 that are serially connected.
  • the FIFO buffer 212 has a fixed size selected to contain historical frames of bins in addition to the current frame of bins.
  • the FIFO buffer 212 receives the frequency-domain signals (on the bus 204) and continuously refreshes the frequency-bin data stored therein by replacing the oldest frame of bins with the newest frame of bins provided via the frequency-domain signals (on the bus 204) .
  • the FIFO buffer 212 is connected to the sorting module 216 via a third path 214.
  • the sorting module 216 is further connected to the percentile estimator 220 via a fourth path 218.
  • the percentile estimator 220 additionally receives the above-described content-aware segmentation information from the audio-event classification and segmentation component 130.
  • the sorting module 216 applies a sorting function, f sort , to the current contents of the FIFO buffer 212, which are made available through the third path 214.
  • the sorting operation is individually implemented for each bin frequency.
  • the sorted frequency-bin data produced by the sorting module 216 are represented by the left side of Eq. (3) and are provided to the percentile estimator 220 via the fourth path 218.
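  • A minimal sketch of this buffering and per-bin sorting step is shown below; the buffer length T_buf and the use of magnitude frames are assumptions made for illustration.
```python
# Sketch of the FIFO buffer and the per-bin sorting function f_sort: the
# buffer holds the last T_buf frames of bin amplitudes, and sorting is done
# independently for each bin frequency k.
import numpy as np
from collections import deque

T_BUF = 100                                        # buffer length in frames (assumed)
fifo = deque(maxlen=T_BUF)                         # oldest frame is dropped automatically

def push_and_sort(frame_mag):
    """frame_mag: 1-D array of bin amplitudes X[t, k] for the current frame."""
    fifo.append(frame_mag)
    buf = np.stack(list(fifo))                     # shape (frames_in_buffer, n_bins)
    return np.sort(buf, axis=0)                    # ascending sort per bin k
```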
  • the energy in each frequency bin represents the noise level, but the minimum and maximum levels may vary significantly.
  • the minimum level usually underestimates the noise floor and is rather sensitive to outliers.
  • the maximum level may represent a transient impulse and, as such, may lead to an overestimation of the noise floor.
  • the maximum level can also be susceptible to outliers. Therefore, a certain percentile of the sorted FIFO buffer can typically better represent the noise floor in the noise segment.
  • the FIFO buffer is not permanently occupied with speech, e.g., due to the sentence silences inherently present in any speech. Therefore, for a portion of the time (e.g., quantified as a percentage) , the energy in each frequency bin is at the noise floor level even in speech segments.
  • the percentile estimator 220 is configured to adaptively estimate the noise floor percentiles, p [t], of the sorted FIFO buffer based on the segment classification as speech, music, or noise.
  • the corresponding mathematical representation of the percentile estimation operation is as follows:
  • f percentile is the classification-sensitive percentile estimation function applied by the percentile estimator 220 to the sorted FIFO buffer based on the segment classification provided in the content-aware segmentation information received by the percentile estimator 220 from the audio-event classification and segmentation component 130.
  • the percentile estimator 220 is further configured to truncate the sorted FIFO buffer down to a size of index [t] entries, which is smaller than the temporal buffer length T buf .
  • the truncation operation can mathematically be represented as follows:
  • index [t] = floor (T buf × p [t] ) (6)
  • the left part of Eq. (5) provides the noise floor estimate computed by the adaptive percentile estimator 220.
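  • The sketch below illustrates the adaptive percentile selection and the Eq. (6) truncation; the per-event percentile values and the reduction of the truncated portion to its mean are assumptions, not values from this disclosure.
```python
# Sketch of the adaptive percentile estimator: pick a percentile p[t] per
# content type, truncate the sorted buffer to index[t] = floor(T_buf * p[t]),
# and reduce the retained (lowest-level) portion to a per-bin floor estimate.
import numpy as np

PERCENTILE_BY_EVENT = {"speech": 0.10, "music": 0.05, "noise": 0.50}   # assumed values

def estimate_noise_floor(sorted_buf, event):
    """sorted_buf: (T_buf, n_bins) amplitudes, ascending along axis 0."""
    T_buf = sorted_buf.shape[0]
    p = PERCENTILE_BY_EVENT[event]                 # p[t] from the segmentation information
    index = max(1, int(np.floor(T_buf * p)))       # Eq. (6): floor(T_buf * p[t])
    truncated = sorted_buf[:index]                 # keep only the lowest-level frames per bin
    return truncated.mean(axis=0)                  # per-bin noise floor estimate (assumed reduction)
```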
  • the content-aware noise-estimation component 110 further includes a noise floor update module 224 and a noise update controller 228.
  • the noise floor update module 224 is connected to the percentile estimator 220 via a fifth path 222 and is also connected to the noise update controller 228 via a control path 226.
  • the noise update controller 228 also receives the content-aware segmentation information.
  • the noise floor update module 224 performs recursive noise floor updates based on the noise floor estimates computed by the percentile estimator 220 and received therefrom via the fifth path 222.
  • the recursive noise floor updates are performed in accordance with Eq. (7) :
  • α [t] is a dynamic smoothing factor provided to the noise floor update module 224, via the control path 226, by the noise update controller 228.
  • the updated noise floor estimates obtained by the noise floor update module 224 are transmitted to the content-aware noise suppression component 120 via an intermediate output path 230.
  • the frequency-domain signals and the updated noise floor estimates represent first and second signal components, respectively, of the intermediate output signals transmitted via the intermediate path 112 (also see FIG. 1) .
  • α 0 ∈ [0, 1] is a constant
  • β [t] is a time-dependent update coefficient derived as follows:
  • DR [t] represents the dynamic range between the maximum level bins and the minimum level bins. In one example, DR [t] is calculated as follows:
  • Eqs. (3) - (9) are applicable to the input having the amplitudes obtained via a time-frequency transformation, such as a Fourier transform.
  • the time-frequency transformation produces a frequency representation (such as the frequency-domain signals on the bus 204) of the input audio signals 102 with sufficient frequency resolution to support processing of audio signals with a rich harmonics content. For example, some musical instruments have many harmonics of fundamental frequencies that are lower than about 100 Hz.
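  • Before turning to the suppression component, the sketch below shows one possible form of the recursive update of Eqs. (7)-(9); the exact derivation of the smoothing factor from α 0 and DR [t] is not reproduced here, so the mapping used below is an assumption.
```python
# Sketch of a recursive noise-floor update of the Eq. (7) form
#   N[t, k] = a[t] * N[t-1, k] + (1 - a[t]) * N_hat[t, k],
# with the dynamic smoothing factor a[t] modulated by the frame's dynamic
# range DR[t].  A large dynamic range (likely active content) slows the
# update; the specific mapping below is an illustrative assumption.
import numpy as np

def update_noise_floor(N_prev, N_hat, frame_mag, alpha_0=0.8):
    eps = 1e-12
    dr_db = 20.0 * np.log10((frame_mag.max() + eps) / (frame_mag.min() + eps))   # DR[t] in dB
    activity = np.clip(dr_db / 60.0, 0.0, 1.0)          # assumed normalization of DR[t]
    alpha = alpha_0 + (1.0 - alpha_0) * activity        # a[t] in [alpha_0, 1]
    return alpha * N_prev + (1.0 - alpha) * N_hat
```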
  • the content-aware noise suppression component 120 includes an SNR calculation module 260, a first suppression gain calculation module 264, a second suppression gain calculation module 270, an aggressiveness controller 274, a multiplication module 278, a gain application module 282, and an inverse FFT (IFFT) module 290.
  • the input signals received by the content-aware noise suppression component 120 include the above-described intermediate output signals and the content-aware segmentation information.
  • the first signal component of the intermediate output signals is applied to the SNR calculation module 260 and the gain application module 282.
  • the second signal component of the intermediate output signals is applied to the SNR calculation module 260 and the second suppression gain calculation module 270.
  • the output signals generated by the IFFT module 290 are the output audio signals 122.
  • the SNR calculation module 260 is connected to the first suppression gain calculation module 264 via a sixth path 262.
  • the first suppression gain calculation module 264 is further connected to the second suppression gain calculation module 270 and the multiplication module 278 via a seventh path 266.
  • the second suppression gain calculation module 270 is further connected to the multiplication module 278 via an eighth path 272.
  • the aggressiveness controller 274 is connected to the multiplication module 278 via a ninth path 276.
  • the multiplication module 278 is further connected to the gain application module 282 via a tenth path 280.
  • the gain application module 282 is further connected to the IFFT module 290 via an eleventh path 284.
  • mapping function f suppression can be designed based on different suitable nonlinear curves.
  • the calculated first suppression gain values are used to compute the residual signal X [t, k] · G sup [t, k] , which is sent to the second suppression gain calculation module 270 and the multiplication module 278 via the seventh path 266.
  • G fur [t, k] denotes the second suppression gain values
  • f further is a second mapping function.
  • the values of the residual signal X [t, k] · G sup [t, k] are received from the first suppression gain calculation module 264 via the seventh path 266.
  • the values N [t, k] are received with the second signal component (on the path 230) of the intermediate output signals.
  • the second mapping function f further is configured to compare the values of X [t, k] · G sup [t, k] and N [t, k] and to derive the values of G fur [t, k] based on the difference.
  • the second mapping function f further can be designed based on suitable nonlinear curves.
  • the values of G fur [t, k] are sent to the multiplication module 278 via the eighth path 272.
  • the aggressiveness controller 274 is configured to apply different aggressiveness settings to different types of content to reduce deleterious effects of possible over-suppression of relevant content.
  • the aggressiveness controller 274 derives a content-aware gain, G ca [t, k] , as follows:
  • G ca [t, k] denotes the content-aware gain values
  • f content-aware is a content-aware gain function
  • the values of G [t, k] are sent to the gain application module 282 via the tenth path 280.
  • the output bins Y [t, k] are then directed to the IFFT module 290 via the eleventh path 284.
  • the IFFT module 290 generates the output audio signals 122 via the IFFT operation applied to frames of the output bins Y [t, k] .
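  • The following sketch walks through this gain chain for one frame; the Wiener-like curves, the comparison of the residual to the noise floor, and the aggressiveness values are illustrative assumptions rather than the disclosed f suppression , f further , and f content-aware functions.
```python
# Sketch of the per-bin gain chain: bin-wise SNR -> first gain G_sup ->
# residual X * G_sup compared to the noise floor N -> second gain G_fur ->
# content-aware aggressiveness applied to the combined gain.
import numpy as np

AGGRESSIVENESS = {"noise": 1.0, "speech": 0.9, "music": 0.5}   # assumed settings

def suppression_gain(X_mag, N, event, eps=1e-12):
    snr = (X_mag ** 2) / np.maximum(N ** 2, eps)               # bin-wise SNR
    g_sup = snr / (snr + 1.0)                                  # first gain (Wiener-like curve)
    residual = X_mag * g_sup                                   # X[t, k] * G_sup[t, k]
    ratio = residual / np.maximum(N, eps)                      # compare residual to the floor
    g_fur = ratio / (ratio + 1.0)                              # second gain from the comparison
    a = AGGRESSIVENESS[event]                                  # content-aware aggressiveness
    return np.clip((g_sup * g_fur) ** a, 0.0, 1.0)             # a < 1 relaxes suppression (e.g., music)

# Usage: Y[t, k] = X[t, k] * suppression_gain(np.abs(X[t]), N[t], event), followed by the IFFT.
```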
  • FIG. 3 is a block diagram illustrating a workflow 300 implemented in the audio-event classification and segmentation component 130 of the content-aware noise management system 100 according to some aspects of the present disclosure.
  • An input 302 to the workflow 300 includes the above-described input audio signals 102 and/or frequency-domain signals (on the bus 204) .
  • An output 332 of the workflow 300 includes the above-described content-aware segmentation information.
  • the workflow 300 includes an audio analysis and classification block 310, a moving-average-convergence-and-divergence (MACD) block 320, and an audio event segmentation block 330.
  • In the audio analysis and classification block 310, the audio signal of each frame of the input 302 is analyzed to obtain the energies and/or is classified to generate raw confidence (s) .
  • In the MACD block 320, the energies or confidences are post-processed using a suitable MACD algorithm.
  • In the audio event segmentation block 330, the audio events of the input 302 are segmented based on real-time MACD-processing results to generate the output 332.
  • Operations of the MACD block 320 include calculating a difference between short-term smoothing and long-term smoothing of a specific value to predict the trend momentum.
  • v [t] represents raw confidences generated by the multi-classification module 240, such as the speech confidence C S [t], the music confidence C M [t], and the noise confidence C N [t] or the raw frame energy E dB [t] .
  • the coefficients α short , α long , α diff ∈ [0, 1] represent the short-term smoothing factor, the long-term smoothing factor, and the smoothing factor of convergence/divergence, respectively. These coefficients are hyperparameters of the MACD algorithm.
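  • One standard way to realize such smoothing is sketched below; the elided Eqs. may differ in detail, so the recursion and default coefficients are assumptions.
```python
# Sketch of MACD-style post-processing of a frame-wise value v[t] (a raw
# confidence or E_dB[t]): short-term and long-term exponential moving
# averages plus a smoothed convergence/divergence term mma_diff[t].
class MACDSmoother:
    def __init__(self, alpha_short=0.7, alpha_long=0.95, alpha_diff=0.8):
        # Larger alpha = longer memory; the short/long pair trades off
        # onset/offset sensitivity against segmentation latency.
        self.a_s, self.a_l, self.a_d = alpha_short, alpha_long, alpha_diff
        self.mma_short = self.mma_long = self.mma_diff = 0.0

    def update(self, v):
        self.mma_short = self.a_s * self.mma_short + (1.0 - self.a_s) * v
        self.mma_long = self.a_l * self.mma_long + (1.0 - self.a_l) * v
        self.mma_diff = (self.a_d * self.mma_diff
                         + (1.0 - self.a_d) * (self.mma_short - self.mma_long))
        return self.mma_diff
```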
  • an audio signal waveform x [t] is first partitioned into windowed, half-overlapping frames, and then the frame data are converted to the frequency domain using either a filter bank or a time-frequency transformer, such as the FFT module 202.
  • the amplitudes of the content of the spectral blocks are collectively denoted as X [t, k] , where k represents the frequency index, and t represents the time index.
  • the energy module 244 is used to calculate the total energy (expressed in dB) of each frame in accordance with the following mathematical expression:
  • in this expression, E dB [t] denotes the total frame energy in dB, and the summation runs over the total number of bins.
  • the absolute value of E dB [t] may differ for different recording devices, e.g., due to differences in preset system gains, recording distances, and/or speaker loudness. Therefore, it is not advisable to set a threshold for determining an audio event directly based on E dB [t] .
  • mma diff [t] represents the difference between the short-term moving average and the long-term moving average of E dB [t] , which can be used to determine a general audio event in a more robust manner, e.g., because mma diff [t] is not directly dependent on the absolute value of E dB [t] .
  • the MACD processing of E dB [t] implemented in the MACD block 320 can be represented as follows:
  • α short and α long are paired factors that define a tradeoff between sensitivity to onset/offset and latency of segmentation.
  • the coefficients α short and α long are selectable based on the specific attributes of the MACD application.
  • operations of the audio event segmentation block 330 include segmenting a general audio event based on the following condition:
  • th mma_onset is the threshold used to determine the onset of an event
  • th mma_offset is the threshold used to determine the offset of an event
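  • A simple hysteresis loop over the mma_diff trajectory, as sketched below, captures this onset/offset behavior; the threshold values are placeholders that would be tuned per application.
```python
# Sketch of onset/offset segmentation of a general audio event from the MACD
# output: the event starts when mma_diff exceeds th_mma_onset and ends when it
# drops below th_mma_offset (hysteresis avoids rapid toggling).
def segment_general_event(mma_diff_seq, th_mma_onset=0.1, th_mma_offset=-0.05):
    """Yields True for frames inside an event, False otherwise."""
    active = False
    for d in mma_diff_seq:
        if not active and d > th_mma_onset:
            active = True                    # onset detected
        elif active and d < th_mma_offset:
            active = False                   # offset detected
        yield active
```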
  • FIG. 4 is a schematic diagram pictorially illustrating audio event clusters used in the workflow 300 according to some aspects of the present disclosure.
  • various audio events are assigned to a plurality of clusters distributed over several audio classes including a vocal class 410, a music class 420, and a noise class 430.
  • the vocal class 410 includes a non-speech cluster 412 and a speech cluster 414.
  • the music class 420 includes a vocal cluster 422, an instrument cluster 424, and a mixture cluster 426.
  • the noise class 430 includes a selective noise cluster 432, a transient noise cluster 434, and an “other noise” cluster 436.
  • Example members of some of the clusters are listed in FIG. 4 as an illustration. Note that some of the clusters may span more than one class. For example, the vocal cluster 422 spans both the vocal class 410 and the music class 420. In other examples, other suitable clustering schemes may also be used.
  • a current frame and the adjacent historical frame sequence are fed into the multi-classification module 240 to generate raw confidences related to specific classes, such as the speech confidence C S [t] , the music confidence C M [t] , and the noise confidence C N [t] , as described above in reference to FIG. 2 and Eq. (1) .
  • the MACD processing of the raw speech confidences C S [t] implemented in the MACD block 320 can be formatted as follows:
  • the speech event can then be segmented in the audio event segmentation block 330 using the following condition:
  • th mma_onset , th mma_offset , th short_onset , and th short_offset are the selectable thresholds that depend on the specific application.
  • the audio event segmentation block 330 is configured to perform segmentation of music and noise events using a similar approach, e.g., as exemplified by Eqs. (23) -(24) .
  • the selection of values for the smoothing factors α short , α long , α diff and for the thresholds th mma_onset , th mma_offset , th short_onset , and th short_offset will typically differ for different ones of the classes 410, 420, 430 to properly balance the sensitivity and latency aspects of the workflow 300.
  • FIG. 5 is a flowchart illustrating example noise management methods 500 according to some aspects of the present disclosure.
  • the methods 500 may be implemented using the content-aware noise management system 100.
  • the methods 500 may be performed by a processor, which may be configured to perform methods 500 via machine-executable instructions.
  • the methods 500 may be broken into various blocks or partitions, such as blocks 502, 504, 506, 508, and 510.
  • the various process blocks illustrated in FIG. 5 provide examples of various methods disclosed herein, and it is understood that some blocks may be removed, added, combined, or modified without departing from the spirit of the present disclosure.
  • processing of the various blocks which may be described as processes, methods, steps, blocks, operations, or functions, may commence at block 502.
  • the method 500 includes converting the input audio signals 102 to the frequency-bin data.
  • operations of the block 502 include performing a Fourier transform, e.g., implemented with FFT module 202. Processing may proceed from block 502 to block 504.
  • the method 500 also includes performing audio event segmentation on the converted input audio signals.
  • Operations of the block 504 include generating the content-aware segmentation information (on the path 128) corresponding to the input audio signals 102.
  • the audio event segmentation is performed in real time.
  • the content-aware segmentation information (on the path 128) includes frame-wise speech, music, and noise confidences.
  • the operations of the block 504 also include processing the frame-wise speech, music, and noise confidences to identify a continuous sequence of audio events corresponding to the input audio signals 102.
  • the audio events are selected from a plurality of event clusters, each of which is associated with one or more classes selected from the group consisting of a vocal class 410, a music class 420, and a noise class 430.
  • the processing operation is configured to mitigate an appearance of false selections in the identified continuous sequence.
  • operations of the block 504 also include the moving-average-convergence-and-divergence processing 320 of the frame-wise speech, music, and noise confidences.
  • the method 500 also includes estimating the noise floor levels.
  • the noise floor levels (on the path 230) are associated with the input audio signal 102 and are computed based on the content-aware segmentation information (on the path 128) and further based on sorting the frequency-bin data (on the bus 204) corresponding to a fixed-length portion of the input audio signal 102.
  • Operations of the block 506 include updating the fixed-length portion using the FIFO buffer 212 configured to receive the frequency-bin data (on the bus 204) .
  • the operations of the block 506 also include truncating the sorted frequency-bin data 218 to a size determined with the adaptive percentile estimator 220 based on the content-aware segmentation information (on the path 128) .
  • the adaptive percentile estimator 220 is configured to select different respective percentiles for a speech event, a music event, and a noise event, and the determined size is based on a selected one of the different respective percentiles.
  • the operations of the block 506 also include updating the noise floor levels (on the path 230) using the noise floor update module 224 and the noise update controller 228.
  • the noise update controller 228 is configured to start and stop the updating based on audio event data included in the content-aware segmentation information (on the path 128) .
  • the method 500 also includes applying noise suppression to the input signal 102 represented by the frequency-bin data.
  • the noise suppression is performed in frequency bins having a selected frequency resolution and is based at least in part on the content-aware segmentation information (on the path 128) and the estimated noise floor levels (on the path 230) .
  • Operations of the block 508 include calculating bin-wise SNR values based on the amplitudes of the frequency bins and further based on the estimated noise floor levels (on the path 230) .
  • the operations of the block 508 also include computing suppression gain values based on the calculated bin-wise SNR values and further based on the estimated noise floor levels (on the path 230) .
  • such computing includes: (i) computing a respective first suppression gain value based on a respective one of the bin-wise signal-to-noise-ratio values; (ii) computing a respective second suppression gain value based on the respective first suppression gain value and further based on a respective one of the estimated noise floor levels; and (iii) computing a product of the respective first suppression gain value and the respective second suppression gain value.
  • the operations of the block 508 also include applying different respective aggressiveness settings of the noise suppression to different types of audio content based on the content-aware segmentation information (on the path 128) .
  • at least some of the operations of the block 508 are performed using the frequency bins having a second selected frequency resolution that is different from the first selected frequency resolution.
  • the second selected frequency resolution may be higher than the first selected frequency resolution.
  • the operations of the block 508 may also include converting the estimated noise floor levels (on the path 230) from the first selected frequency resolution to the second selected frequency resolution.
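  • One simple way to perform such a conversion is to interpolate the coarse-resolution floor over the finer bin grid, as sketched below; linear interpolation over normalized bin centers is an assumption.
```python
# Sketch of converting per-bin noise-floor estimates from a coarse frequency
# resolution (used for estimation) to a finer one (used for suppression).
import numpy as np

def convert_resolution(noise_floor_coarse, n_bins_fine):
    n_coarse = len(noise_floor_coarse)
    coarse_centers = np.linspace(0.0, 1.0, n_coarse)    # normalized bin centers
    fine_centers = np.linspace(0.0, 1.0, n_bins_fine)
    return np.interp(fine_centers, coarse_centers, noise_floor_coarse)
```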
  • the method 500 also includes applying an inverse Fourier transform to the output frequency bins. Operations of the block 510 also include concatenating time-domain signal portions generated via the inverse Fourier transforms to generate the output audio signal 122.
  • FIG. 6A illustrates a schematic block diagram of an example device architecture 600 (e.g., an apparatus 600) that may be used to implement various aspects of the present disclosure.
  • the architecture 600 includes but is not limited to servers and client devices, systems, and methods as described in reference to FIGS. 1-5.
  • the architecture 600 includes a central processing unit (CPU) 601 which is capable of performing various processes in accordance with a program stored in, for example, a read only memory (ROM) 602 or a program loaded from, for example, a storage unit 608 to a random-access memory (RAM) 603.
  • the CPU 601 may be, for example, an electronic processor 601. In the RAM 603, the data required when the CPU 601 performs the various processes is also stored, as required.
  • the CPU 601, ROM 602, and RAM 603 are connected to one another via a bus 604.
  • An input/output interface 605 is also connected to the bus 604.
  • the following components are connected to the I/O interface 605: an input unit 606, that may include a keyboard, a mouse, or the like; an output unit 607 that may include a display, such as a liquid crystal display (LCD) and one or more speakers; a storage unit 608 including a hard disk, or another suitable storage device; and a communication unit 609 including a network interface card, such as a network card (e.g., wired or wireless) .
  • the input unit 606 includes one or more microphones in different positions (depending on the host device) enabling capture of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other suitable formats) .
  • the output unit 607 includes systems with various numbers of speakers.
  • the output unit 607 (depending on the capabilities of the host device) can render audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other suitable formats) .
  • the communication unit 609 is configured to communicate with other devices (e.g., via a network) .
  • a drive 610 is also connected to the I/O interface 605, as required.
  • a removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a flash drive or another suitable removable medium, is mounted on the drive 610, so that a computer program read therefrom is installed into the storage unit 608, as required.
  • the processes described above may be implemented as computer software programs or on a computer-readable storage medium.
  • embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program including program code for performing methods.
  • the computer program may be downloaded and mounted from the network via the communication unit 609, and/or installed from the removable medium 611, as shown in FIG. 6A.
  • FIG. 6B illustrates a schematic block diagram of the CPU 601 implemented in the architecture 600 of FIG. 6A according to one example.
  • the CPU 601 includes an electronic processor 620 and a memory 621.
  • the electronic processor 620 is electrically and/or communicatively connected to the memory 621 for bidirectional communication.
  • the memory 621 stores software 622.
  • the memory 621 may be located internally to the electronic processor 620, such as for an internal cache memory or some other internally located ROM, RAM, or flash memory.
  • the memory 621 may be located externally to the electronic processor 620, such as in the ROM 602, the RAM 603, flash memory, or the removable medium 611, or another non-transitory computer readable medium that is contemplated for the architecture 600.
  • the electronic processor 620 runs the software 622 stored in the memory 621 to perform, among other things, various methods and operations associated with the content-aware noise management described above in reference to FIGS. 1-5.
  • EEE1 A noise management method comprising: performing audio event segmentation on an input audio signal to generate content-aware segmentation information; estimating noise floor levels associated with the input audio signal based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal; and applying noise suppression to the input audio signal to generate an output audio signal, the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels.
  • EEE2 The method according to EEE1, wherein the audio event segmentation is performed in real time.
  • EEE3 The method according to EEE1 or EEE2, wherein the content-aware segmentation information includes frame-wise speech, music, and noise confidences.
  • EEE4 The method according to EEE3, wherein the performing includes processing the frame-wise speech, music, and noise confidences to identify a continuous sequence of audio events corresponding to the input audio signal, the audio events being selected from a plurality of event clusters, each of which is associated with one or more classes selected from the group consisting of a vocal class, a music class, and a noise class.
  • EEE5. The method according to EEE4, wherein the processing is configured to mitigate an appearance of false selections in the identified continuous sequence.
  • EEE6 The method according to any one of EEE3 to EEE5, wherein the performing includes a moving-average-convergence-and-divergence processing of the frame-wise speech, music, and noise confidences.
  • EEE7 The method according to any one of EEE1 to EEE6, wherein the estimating comprises updating the fixed-length portion using a first-in-first-out buffer configured to receive the frequency-bin data.
  • EEE8 The method according to any one of EEE1 to EEE6, wherein the estimating comprises truncating the sorted frequency-bin data to a size determined with an adaptive percentile estimator based on the content-aware segmentation information.
  • EEE9 The method according to EEE8, wherein the adaptive percentile estimator is configured to select different respective percentiles for a speech event, a music event, and a noise event, the determined size being based on a selected one of the different respective percentiles.
  • EEE10 The method according to any one of EEE1 to EEE6, wherein the estimating comprises updating the noise floor levels using an update controller configured to start and stop the updating based on audio event data included in the content-aware segmentation information.
  • EEE11 The method according to any one of EEE1 to EEE10, wherein the applying comprises calculating bin-wise signal-to-noise-ratio values based on amplitudes of the frequency bins and the estimated noise floor levels.
  • EEE12 The method according to EEE11, wherein the applying further comprises computing suppression gain values based on the calculated bin-wise signal-to-noise-ratio values and further based on the estimated noise floor levels.
  • EEE13 The method according to EEE12, wherein, for each of the frequency bins, the computing comprises: computing a respective first suppression gain value based on a respective one of the bin-wise signal-to-noise-ratio values; computing a respective second suppression gain value based on the respective first suppression gain value and further based on a respective one of the estimated noise floor levels; and computing a product of the respective first suppression gain value and the respective second suppression gain value.
  • EEE14 The method according to any one of EEE1 to EEE13, wherein the applying comprises applying different respective aggressiveness settings of the noise suppression to different types of audio content based on the content-aware segmentation information.
  • EEE15 The method according to any one of EEE1 to EEE14, wherein the estimating is performed using the frequency bins having a first selected frequency resolution; and wherein the applying is performed using the frequency bins having a second selected frequency resolution that is different from the first selected frequency resolution.
  • EEE16 The method according to EEE15, wherein the second selected frequency resolution is higher than the first selected frequency resolution.
  • EEE17 The method according to EEE16, wherein the estimating further comprises converting the estimated noise floor levels from the first selected frequency resolution to the second selected frequency resolution.
  • EEE18 The method according to any one of EEE1 to EEE17, further comprising: converting the input audio signal to the frequency-bin data using a Fourier transform; and applying an inverse Fourier transform to the frequency bins after having applied the noise suppression thereto to generate the output audio signal.
  • EEE19 A non-transitory computer-readable medium storing instructions that, when executed by an electronic processor, cause the electronic processor to perform operations comprising the method of any one of EEE1 to EEE18.
  • EEE20 A noise management apparatus comprising: at least one processor; and at least one memory including program code, wherein the at least one memory and the program code are configured to, with the at least one processor, cause the apparatus at least to: perform audio event segmentation on an input audio signal to generate content-aware segmentation information; estimate noise floor levels associated with the input audio signal based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal; and apply noise suppression to the input audio signal to generate an output audio signal, the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels.
  • Some embodiments may be implemented as circuit-based processes, including possible implementation on a single integrated circuit.
  • Some embodiments can be embodied in the form of methods and apparatuses for practicing those methods. Some embodiments can also be embodied in the form of program code recorded in tangible media, such as magnetic recording media, optical recording media, solid state memory, floppy diskettes, CD-ROMs, hard drives, or any other non-transitory machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the patented invention (s) .
  • Some embodiments can also be embodied in the form of program code, for example, stored in a non-transitory machine-readable storage medium including being loaded into and/or executed by a machine, wherein, when the program code is loaded into and executed by a machine, such as a computer or a processor, the machine becomes an apparatus for practicing the patented invention (s) .
  • When implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.
  • figure numbers and/or figure reference labels in the claims is intended to identify one or more possible embodiments of the claimed subject matter in order to facilitate the interpretation of the claims. Such use is not to be construed as necessarily limiting the scope of those claims to the embodiments shown in the corresponding figures.
  • the conjunction “if” may also or alternatively be construed to mean “when” or “upon” or “in response to determining” or “in response to detecting, ” which construal may depend on the corresponding specific context.
  • the phrase “if it is determined” or “if [a stated condition] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event] ” or “in response to detecting [the stated condition or event] . ”
  • Couple, ” “coupling, ” “coupled, ” “connect, ” “connecting, ” or “connected” refer to any manner known in the art or later developed in which energy is allowed to be transferred between two or more elements, and the interposition of one or more additional elements is contemplated, although not required. Conversely, the terms “directly coupled, ” “directly connected, ” etc., imply the absence of such additional elements.
  • the term compatible means that the element communicates with other elements in a manner wholly or partially specified by the standard and would be recognized by other elements as sufficiently capable of communicating with the other elements in the manner specified by the standard.
  • the compatible element does not need to operate internally in a manner specified by the standard.
  • processors and/or “controllers, ” may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software.
  • the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared.
  • processor or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC) , field programmable gate array (FPGA) , read only memory (ROM) for storing software, random access memory (RAM) , and nonvolatile storage. Other hardware, conventional and/or custom, may also be included.
  • any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
  • circuitry may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) ; (b) combinations of hardware circuits and software, such as (as applicable) : (i) a combination of analog and/or digital hardware circuit (s) with software/firmware and (ii) any portions of hardware processor (s) with software (including digital signal processor (s) ) , software, and memory (ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) ; and (c) hardware circuit (s) and/or processor (s) , such as a microprocessor (s) or a portion of a microprocessor (s) , that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
  • circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
  • circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
  • any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure.
  • any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

A noise management method and a noise management apparatus are provided. The noise management method includes: performing audio event segmentation on an input audio signal to generate content-aware segmentation information (504); estimating noise floor levels associated with the input audio signal based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal (506); and applying noise suppression to the input audio signal to generate an output audio signal, the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels (508).

Description

CONTENT-AWARE AUDIO NOISE MANAGEMENT TECHNICAL FIELD
Various example embodiments relate to audio signal processing and enhancement.
BACKGROUND
Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted as prior art by inclusion in this section.
Noise reduction processing typically aims to significantly reduce the background and/or broadband noise with minimal reduction to the audio signal quality. Such processing can be configured to remove various noise components including, but not limited to, tape hiss, microphone background noise, hum noise, wind noise, etc. A proper amount of noise reduction typically depends on the type of useful audio signal, the type of noise, and the acceptable loss to the useful audio signal. In some cases, noise reduction processing can increase the signal‐to‐noise ratio (SNR) by 5 to 20 dB while retaining high audio quality for the remaining useful audio signal.
It is with respect to these and other considerations that the disclosure made herein is presented.
BRIEF SUMMARY OF SOME SPECIFIC EMBODIMENTS
Techniques are described for processing audio signals. Examples found herein provide for systems, apparatus, devices, and methods to perform content-aware audio noise management. Some embodiments can beneficially be used to automatically manage the noise floor in mixed content types, e.g., including voice, music, and noise sounds. At least some embodiments can be adapted for such applications as real-time communications, content creation, and audio capture by user devices.
According to an example embodiment, provided is a noise management method comprising: performing audio event segmentation on an input audio signal to generate content-aware segmentation information; estimating noise floor levels associated with the input audio signal  based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal; and applying noise suppression to the input audio signal to generate an output audio signal, the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels.
In some embodiments of the above method, the audio event segmentation is performed in real time.
In some embodiments of any of the above methods, the performing includes processing the frame-wise speech, music, and noise confidences to identify a continuous sequence of audio events corresponding to the input audio signal, the audio events being selected from a plurality of event clusters, each of which is associated with one or more classes selected from the group consisting of a vocal class, a music class, and a noise class.
In some embodiments of any of the above methods, the performing includes a moving-average-convergence-and-divergence processing of the frame-wise speech, music, and noise confidences.
In some embodiments of any of the above methods, the estimating comprises updating the fixed-length portion using a first-in-first-out buffer configured to receive the frequency-bin data.
In some embodiments of any of the above methods, the estimating comprises truncating the sorted frequency-bin data to a size determined with an adaptive percentile estimator based on the content-aware segmentation information.
In some embodiments of any of the above methods, the adaptive percentile estimator is configured to select different respective percentiles for a speech event, a music event, and a noise event, the determined size being based on a selected one of the different respective percentiles.
In some embodiments of any of the above methods, the applying comprises computing suppression gain values based on the calculated bin-wise signal-to-noise-ratio values and further based on the estimated noise floor levels.
In some embodiments of any of the above methods, for each of the frequency bins, the computing comprises: computing a respective first suppression gain value based on a respective one of the bin-wise signal-to-noise-ratio values; computing a respective second suppression gain value based on the respective first suppression gain value and further based on a respective one of the estimated noise floor levels; and computing a product of the respective first suppression gain value and the respective second suppression gain value.
In some embodiments of any of the above methods, the applying comprises applying different respective aggressiveness settings of the noise suppression to different types of audio content based on the content-aware segmentation information.
In some embodiments of any of the above methods, the method further comprises converting the input audio signal to the frequency-bin data using a Fourier transform and applying an inverse Fourier transform to the frequency bins after having applied the noise suppression thereto to generate the output audio signal.
According to another example embodiment, provided is a non-transitory computer-readable medium storing instructions that, when executed by an electronic processor, cause the electronic processor to perform operations comprising any one of the above methods.
According to yet another example embodiment, provided is a noise management apparatus comprising: at least one processor; and at least one memory including program code, wherein the at least one memory and the program code are configured to, with the at least one processor, cause the apparatus at least to: perform audio event segmentation on an input audio signal to generate content-aware segmentation information; estimate noise floor levels associated with the input audio signal based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal; and apply noise suppression to the input audio signal to generate an output audio signal, the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels.
Various aspects of the present disclosure are directed at audio signal processing and provide improvements in at least the technical fields of audio processing, audio event classification and segmentation, noise estimation, noise suppression, and the like.
Some embodiments disclosed herein may be generally described as techniques, where the term “technique” may refer to system (s) , device (s) , method (s) , computer-readable instruction (s) , module (s) , component (s) , hardware logic, and/or operation (s) as suggested in the context presented below.
Features and technical benefits other than those explicitly described above will be apparent from a reading of the following Detailed Description and a review of the referred-to drawings. This Summary is provided to introduce a selection of techniques in a simplified form and is not intended to identify key or essential features of the claimed subject matter, which are defined by the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other more detailed and specific features of various embodiments are more fully disclosed in the following description, reference being had to the accompanying drawings, in which:
FIG. 1 is a block diagram illustrating an example content-aware noise management system in which various aspects of the present disclosure can be practiced.
FIG. 2 is a block diagram illustrating an example circuit implementation of the content-aware noise management system of FIG. 1 according to some aspects of the present disclosure.
FIG. 3 is a block diagram illustrating a workflow implemented in the audio-event classification and segmentation component of the content-aware noise management system of FIG. 1 according to some aspects of the present disclosure.
FIG. 4 is a schematic diagram pictorially illustrating audio event clusters used in the workflow of FIG. 3 according to some aspects of the present disclosure.
FIG. 5 is a flowchart illustrating a noise management method according to some aspects of the present disclosure.
FIG. 6A illustrates a schematic block diagram of an example device architecture that can be used to implement various aspects of the present disclosure.
FIG. 6B illustrates a schematic block diagram of an example CPU implemented in the device architecture of FIG. 6A that can be used to implement various aspects of the present disclosure.
DETAILED DESCRIPTION
In the following description, numerous details are set forth, such as audio device configurations, timings, operations, and the like, in order to provide an understanding of one or more aspects of the present disclosure. It will be readily apparent to one skilled in the art that these specific details are merely examples and not intended to limit the scope of this application.
As used herein, the term “includes” and its variants are to be read as open-ended terms that mean “includes but is not limited to. ” The term “or” is to be read as “and/or” unless the context clearly indicates otherwise. Such terms are to be read as having an inclusive meaning. For example, “A and B” may mean at least the following: “both A and B” , “at least both A and B” . As another example, “A or B” may mean at least the following: “at least A” , “at least B” , “both A and B” , “at least both A and B” . As another example, “A and/or B” may mean at least the following: “A and B” , “A or B” . When an exclusive-or is intended, such will be specifically noted (e.g., “either A or B” ,
“at most one of A and B” ) . The term “based on” is to be read as “based at least in part on. ” The term “one example implementation” and “an example implementation” are to be read as “at least one example implementation. ” The term “another implementation” is to be read as “at least one other implementation. ” The terms “determined, ” “determines, ” or “determining” are to be read as obtaining, receiving, computing, calculating, estimating, predicting, or deriving. In addition, in the following description and claims, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skills in the art to which this disclosure belongs.
Various acronyms that may appear throughout this disclosure and in the associated claims and/or drawings are listed below. Other commonly used acronyms and terms of art may be excluded from this list in the interest of brevity. Thus, a short list of acronyms is provided below as an easy reference for the reader.
MACD –Moving Average Convergence and Divergence.
FIFO –First In First Out.
SNR –Signal‐to‐Noise Ratio.
FFT –Fast Fourier Transform.
IFFT –Inverse FFT.
Noise suppression is an audio process that removes background noise from a captured signal. Millions of internet-connected devices are used for an increasing variety of daily life scenarios that involve audio capture, such as calls, user content capture/creation, virtual music lessons, and live-streaming of new or evolving forms of media content (e.g., podcasts, audiobooks, extended reality content, virtual worlds, etc. ) . In these scenarios, it is desirable for a noise management system that implements noise suppression to provide not only clear speech but to also be effective for richer sound experiences, such as high-fidelity music, virtual meetings, live-streaming, user-generated content, etc.
Speech-centric noise management algorithms designed specifically for speech may disadvantageously lead to over-suppression effects when applied to music or other audio signals having useful non-speech components. Accordingly, example embodiments disclosed herein provide content-aware audio noise management systems and methods that can beneficially be used, e.g., to manage a noise floor in mixed content types. At least some embodiments can be adapted for such applications as real-time communications, content creation, and audio capture by user devices.
FIG. 1 is a block diagram illustrating an example content-aware noise management system 100 in which various aspects of the present disclosure can be practiced. The content-aware noise management system 100 includes an audio-event classification and segmentation component 130. The audio-event classification and segmentation component 130 receives input audio signals 102 and processes the input audio signals 102 to generate content-aware segmentation information. The content-aware noise management system 100 also includes a content-aware noise-estimation component 110 and a content-aware noise suppression component 120. The audio-event classification and segmentation component 130 is connected to the content-aware noise-estimation component 110 and the content-aware noise suppression component 120 via an information path 128, over which the content-aware segmentation information is provided thereto. The content-aware noise-estimation component 110 and the content-aware noise suppression component 120 are also connected to one another via an intermediate path 112. In some examples, the audio-event  classification and segmentation component 130 generates the content-aware segmentation information in real time.
As used herein, the term “real time” refers to a computer-based process that controls or monitors a corresponding environment by receiving data, processing the received data, and generating a response sufficiently quickly to affect or characterize the environment without significant delay. In the context of control or processing software, real-time responses are often understood to be on the order of milliseconds, or sometimes microseconds.
The content-aware noise-estimation component 110 receives the input audio signals 102 and processes the received input audio signals 102 based on the content-aware segmentation information to generate intermediate output signals, which are then provided to the content-aware noise suppression component 120 via the path 112. The content-aware noise suppression component 120 receives the intermediate output signals and processes the received intermediate output signals based on the content-aware segmentation information to generate output audio signals 122. In some examples, the processing implemented in the content-aware noise-estimation component 110 and the content-aware noise suppression component 120 is configured to generate the output audio signals 122, wherein the noise suppression is performed in frequency sub-bands (bins) having a selected frequency resolution and is based at least in part on the content-aware segmentation information. The estimated noise data are provided in the intermediate output signals transmitted via the path 112.
FIG. 2 is a block diagram illustrating an example circuit implementation of the content-aware noise management system 100 according to some aspects of the present disclosure. In the example shown, the content-aware noise management system 100 includes a fast Fourier transform (FFT) module 202. The FFT module 202 is a shared module for the content-aware noise-estimation component 110 and the content-aware noise suppression component 120. In other implementations, the FFT module 202 can be replaced by another suitable time-domain to frequency-domain converter, such as a filter bank. In the example shown, the FFT module 202 is connected to the downstream modules of the content-aware noise-estimation component 110 and the audio-event classification and segmentation component 130 via an input bus 204.
In operation, the FFT module 202 receives the input audio signals 102 and converts the received input audio signals 102 into corresponding frequency-domain signals, which are transmitted downstream via the bus 204. In some examples, the frequency-domain signals are represented by a plurality of frequency bins. Example conversion operations performed in the FFT module 202 include (i) partitioning an audio signal waveform, x [t], provided via the input audio signals 102 into windowed, half-overlapping frames and (ii) applying a Fourier transform to each of the frames to generate a respective spectral block of bins. The amplitudes of the content of the spectral blocks are collectively denoted as X [t, k], where k represents the frequency index, and t represents the time index. The quantity X [t, k] for a fixed time index t is referred to as a frame of frequency bins corresponding to time t.
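For illustration, a minimal sketch of this framing and transform stage is given below, assuming Hann-windowed, half-overlapping frames and NumPy's FFT; the function name frame_spectra and the frame length are illustrative choices rather than part of the described system.

```python
import numpy as np

def frame_spectra(x, frame_len=1024):
    """Split waveform x into windowed, half-overlapping frames and return
    the per-frame bin amplitudes X[t, k] (illustrative sketch only)."""
    hop = frame_len // 2                      # half-overlapping frames
    window = np.hanning(frame_len)
    n_frames = max(0, (len(x) - frame_len) // hop + 1)
    X = np.empty((n_frames, frame_len // 2 + 1))
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len] * window
        X[t] = np.abs(np.fft.rfft(frame))     # amplitude of each frequency bin
    return X
```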
In the example shown, the audio-event classification and segmentation component 130 also includes a multi-classification module 240 and an energy module 244, both of which receive the frequency bins via the bus 204. The audio-event classification and segmentation component 130 further includes an audio event segmentation module 248. The multi-classification module 240 and the energy module 244 are connected to the audio event segmentation module 248 via a first path 242 and a second path 246, respectively. In operation, the multi-classification module 240 processes the frequency-domain signals (on the bus 204) to generate a corresponding stream of classification information, which is fed to the audio event segmentation module 248 via the first path 242. The energy module 244 processes the frequency-domain signals (on the bus 204) to generate a corresponding stream of energy information, which is fed to the audio event segmentation module 248 via the second path 246. The audio event segmentation module 248 processes the stream of classification information received via the first path 242 and the stream of energy information received via the second path 246 to generate the above-described content-aware segmentation information.
In one example, the frequency bins corresponding to a current audio frame and the adjacent historical audio frame sequence are processed in the multi-classification module 240 to generate the corresponding classification information including a speech confidence CS [t], a music confidence CM [t], and a noise confidence CN [t] of the current frame as follows:
where K represents the total number of bins, and T represents the length of historical frames used for classification purposes. Herein, the term “noise” refers to the noise floor, such as the room noise, fan noise, or electrical noise, which does not include transient noise. The audio event segmentation module 248 uses the above-described confidences and the energy information to generate the content-aware segmentation information using a suitable segmentation function fsegmentation, for example, as follows:
where ES [t], EM [t], EN [t] represent the speech, music, and noise energies, respectively. In one example, the segmentation function fsegmentation is based on the moving average convergence/divergence (MACD) processing, which is described in more detail below, e.g., in reference to FIG. 3 and Eqs. (16) - (24) .
In the example shown, the content-aware noise-estimation component 110 also includes a First-In-First-Out (FIFO) buffer 212, a sorting module 216, and a percentile estimator 220 that are serially connected. The FIFO buffer 212 has a fixed size selected to contain historical frames of bins in addition to the current frame of bins. In operation, the FIFO buffer 212 receives the frequency-domain signals (on the bus 204) and continuously refreshes the frequency-bin data stored therein by replacing the oldest frame of bins with the newest frame of bins provided via the frequency-domain signals (on the bus 204) . The FIFO buffer 212 is connected to the sorting module 216 via a third path 214. The sorting module 216 is further connected to the percentile estimator 220 via a fourth path 218. The percentile estimator 220 additionally receives the above-described content-aware segmentation information from the audio-event classification and segmentation component 130.
In one example, the sorting module 216 applies a sorting function, fsort, to the current contents of the FIFO buffer 212, which are made available through the third path 214. A mathematical expression for this sorting operation is as follows:
where the sorting operation is individually implemented for each bin frequency. The sorted frequency-bin data produced by the sorting module 216 are represented by the left side of Eq. (3) and are provided to the percentile estimator 220 via the fourth path 218.
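The buffer-and-sort stage can be sketched as follows, assuming a simple Python deque as the FIFO and NumPy for the per-bin sorting; the class name BinHistory and the buffer length are hypothetical.

```python
import numpy as np
from collections import deque

class BinHistory:
    """Fixed-length FIFO of frames of bin amplitudes with per-bin sorting
    (hypothetical helper; the buffer length t_buf is an assumed parameter)."""
    def __init__(self, t_buf):
        self.frames = deque(maxlen=t_buf)   # oldest frame of bins drops out automatically

    def push(self, frame_bins):
        # Replace the oldest frame with the newest frame of bins.
        self.frames.append(np.asarray(frame_bins, dtype=float))

    def sorted_bins(self):
        # Sort the buffered amplitudes independently for each frequency bin k,
        # in ascending order, as in the sorting function of Eq. (3).
        buf = np.stack(self.frames)         # shape (frames_in_buffer, K)
        return np.sort(buf, axis=0)
```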
In noise segments, the energy in each frequency bin represents the noise level, but the minimum and maximum levels may vary significantly. The minimum level usually underestimates the noise floor and is rather sensitive to outliers. The maximum level may represent a transient impulse and, as such, may lead to an overestimation of the noise floor. In addition, the maximum level can also be susceptible to outliers. Therefore, a certain percentile of the sorted FIFO buffer can typically better represent the noise floor in the noise segment. In speech segments, for example, the FIFO buffer is not permanently occupied with speech, e.g., due to the sentence silences inherently present in any speech. Therefore, for a portion of the time (e.g., quantified as a percentage) , the energy in each frequency bin is at the noise floor level in speech segments. In music segments, the energy in a frequency bin may never drop to the noise floor level because the harmonics of some specific instruments, such as the didgeridoo, may last a relatively long time. In such cases, even the minimum level can be an overestimation of the noise floor. Consequently, in some cases, the noise estimation can be stopped in the music segments to avoid subsequent over-suppression of useful sound.
Different content types typically have different respective silence proportions. Therefore, in some examples, the percentile estimator 220 is configured to adaptively estimate the noise floor percentiles, p [t], of the sorted FIFO buffer based on the segment classification as speech, music, or noise. The corresponding mathematical representation of the percentile estimation operation is as follows:
where fpercentile is the classification-sensitive percentile estimation function applied by the percentile estimator 220 to the sorted FIFO buffer based on the segment classification provided in the content-aware segmentation information received by the percentile estimator 220 from the audio-event classification and segmentation component 130.
The percentile estimator 220 is further configured to truncate the sorted FIFO buffer down to a duration index [t] size, which is smaller than the temporal buffer length Tbuf. The truncation operation can mathematically be represented as follows:
where the duration index [t] is expressed as:
index [t]=floor (Tbuf·p [t] )         (6)
The left side of Eq. (5) provides the noise floor estimate computed by the adaptive percentile estimator 220.
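A minimal sketch of the adaptive percentile estimation and truncation of Eqs. (4) - (6) is shown below; the per-class percentile values are illustrative assumptions (in practice, the music case may bypass the estimate entirely, as noted above).

```python
import numpy as np

# Illustrative per-class percentiles; the actual values are tuning choices.
PERCENTILES = {"speech": 0.10, "music": 0.0, "noise": 0.50}

def estimate_noise_floor(sorted_bins, segment_class):
    """Truncate the per-bin sorted buffer at index floor(T_buf * p) and take
    that order statistic as the noise-floor estimate (sketch of Eqs. (4)-(6))."""
    t_buf = sorted_bins.shape[0]
    p = PERCENTILES[segment_class]
    index = int(np.floor(t_buf * p))
    index = min(index, t_buf - 1)             # keep the index inside the buffer
    return sorted_bins[index, :]              # one estimate per frequency bin
```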
The content-aware noise-estimation component 110 further includes a noise floor update module 224 and a noise update controller 228. The noise floor update module 224 is connected to the percentile estimator 220 via a fifth path 222 and is also connected to the noise update controller 228 via a control path 226. The noise update controller 228 also receives the content-aware segmentation information. In operation, the noise floor update module 224 performs recursive noise floor updates based on the noise floor estimates computed by the percentile estimator 220 and received therefrom via the fifth path 222. In one example, the recursive noise floor updates are performed in accordance with Eq. (7) :
where α [t] is a dynamic smoothing factor provided to the noise floor update module 224, via the control path 226, by the noise update controller 228. The updated noise floor estimates obtained by the noise floor update module 224, e.g., as described above, are transmitted to the content-aware noise suppression component 120 via an intermediate output path 230. The frequency-domain signals and the updated noise floor estimates represent first and second signal components, respectively, of the intermediate output signals transmitted via the intermediate path 112 (also see FIG. 1) .
In one example, the noise update controller 228 computes the dynamic smoothing factor as follows:
α [t]=α0+β [t] · (1-α0)         (8)
where α0∈ [0, 1] is a constant; and β [t] is a time-dependent update coefficient derived as follows:
where the threshold value is selected to constrain capture of the transient sound and to reduce or substantially avoid errors in the noise floor estimation. The quantity DR [t] represents the dynamic range between the maximum level bins and the minimum level bins. In one example, DR [t] is calculated as follows:
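Because Eqs. (7) and (9) are not reproduced above, the following sketch shows one plausible reading of the recursive update with the dynamic smoothing factor of Eq. (8): the update is frozen when the frame dynamic range suggests transient content. The dB-based dynamic-range computation and the threshold value are assumptions, not the disclosed formulas.

```python
import numpy as np

def update_noise_floor(prev_floor, floor_est, X_frame, alpha0=0.9, dr_threshold=30.0):
    """Recursive noise-floor update with a dynamic smoothing factor
    (assumed reading of Eqs. (7)-(9); alpha0 and dr_threshold are illustrative)."""
    eps = 1e-12
    # Dynamic range between the maximum- and minimum-level bins of the frame (in dB).
    dr = 20.0 * np.log10((np.max(X_frame) + eps) / (np.min(X_frame) + eps))
    # Freeze the update (alpha -> 1) when the dynamic range suggests a transient.
    beta = 1.0 if dr > dr_threshold else 0.0
    alpha = alpha0 + beta * (1.0 - alpha0)    # Eq. (8)
    return alpha * prev_floor + (1.0 - alpha) * floor_est
```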
Eqs. (3) - (9) are applicable to the input having the amplitudes obtained via a time-frequency transformation, such as a Fourier transform. A person of ordinary skill in the pertinent art will readily understand how to adapt Eqs. (3) - (9) to alternative input formats, such as power (i.e., squared amplitude) or amplitude expressed in dB. In a preferred implementation, the time-frequency transformation produces a frequency representation (such as the frequency-domain signals on the bus 204) of the input audio signals 102 with sufficient frequency resolution to support processing of audio signals with a rich harmonics content. For example, some musical instruments have many harmonics of fundamental frequencies that are lower than about 100 Hz.
The content-aware noise suppression component 120 includes an SNR calculation module 260, a first suppression gain calculation module 264, a second suppression gain calculation module 270, an aggressiveness controller 274, a multiplication module 278, a gain application module 282, and an inverse FFT (IFFT) module 290. The input signals received by the content-aware noise suppression component 120 include the above-described intermediate output signals and the content-aware segmentation information. The first signal component of the intermediate output signals is applied to the SNR calculation module 260 and the gain application module 282. The second signal component of the intermediate output signals is applied to the SNR calculation module 260 and the second suppression gain calculation module 270. The output signals generated by the IFFT module 290 are the output audio signals 122.
The SNR calculation module 260 is connected to the first suppression gain calculation module 264 via a sixth path 262. The first suppression gain calculation module 264 is further connected to the second suppression gain calculation module 270 and the multiplication module 278  via a seventh path 266. The second suppression gain calculation module 270 is further connected to the multiplication module 278 via an eighth path 272. The aggressiveness controller 274 is connected to the multiplication module 278 via a ninth path 276. The multiplication module 278 is further connected to the gain application module 282 via a tenth path 280. The gain application module 282 is further connected to the IFFT module 290 via an eleventh path 284.
In one example, the SNR calculation module 260 is configured to calculate SNR values based on the following mathematical expression:
SNR [t, k]=10log10 { (X [t, k] 2-N[t, k]2) /N [t, k] 2}     (10)
with the values of X [t, k] and N [t, k] being provided by the first and second signal components, respectively, of the intermediate output signals. The calculated SNR values are then sent to the first suppression gain calculation module 264 via the sixth path 262.
The first suppression gain calculation module 264 is configured to calculate a first suppression gain, Gsup [t, k], based on the following mathematical expression:
Gsup [t, k] =fsuppression (SNR [t, k] )        (11)
with the SNR values being received from the SNR calculation module 260 via the sixth path 262. In different examples, the mapping function fsuppression can be designed based on different suitable nonlinear curves. The calculated first suppression gain values are used to compute the residual signal X [t, k]·Gsup [t, k] , which is sent to the second suppression gain calculation module 270 and the multiplication module 278 via the seventh path 266.
The second suppression gain calculation module 270 is configured to remove the residual noise from the residual signal X [t, k] ·Gsup [t, k] based on the following mathematical expression: 
Gfur [t, k] =ffurther (X [t, k] ·Gsup [t, k] , N [t, k] )      (12)
where Gfur [t, k] denotes the second suppression gain values; and ffurther is a second mapping function. The values of the residual signal X [t, k] ·Gsup [t, k] are received from the first suppression gain calculation module 264 via the seventh path 266. The values N [t, k] are received with the second signal component (on the path 230) of the intermediate output signals. The second mapping function ffurther is configured to compare the values of X [t, k] ·Gsup [t, k] and N [t, k] and to derive the values of Gfur [t, k] based on the difference. In different examples, the second mapping function  ffurther can be designed based on suitable nonlinear curves. The values of Gfur [t, k] are sent to the multiplication module 278 via the eighth path 272.
The aggressiveness controller 274 is configured to apply different aggressiveness settings to different types of content to reduce deleterious effects of possible over-suppression of relevant content. In one example, the aggressiveness controller 274 derives a content-aware gain, Gca [t, k] , as follows:
where Gca [t, k] denotes the content-aware gain values; and fcontent-aware is a content-aware gain function. The values of Gca [t, k] are sent to the multiplication module 278 via the ninth path 276.
The multiplication module 278 is configured to compute a final noise suppression gain, G [t, k] , by multiplying the respective gain values received via the paths 266, 272, and 276 as follows:
G [t, k]=Gsup [t, k] ·Gfur [t, k] ·Gca [t, k]       (14)
The values of G [t, k] are sent to the gain application module 282 via the tenth path 280. The gain application module 282 is configured to apply the received gain values to the bins X [t, k] to compute the corresponding output bins Y [t, k] as follows:
Y [t, k] =X [t, k] ·G [t, k]         (15)
The output bins Y [t, k] , with the original phases, are then directed to the IFFT module 290 via the eleventh path 284. The IFFT module 290 generates the output audio signals 122 via the IFFT operation applied to frames of the output bins Y [t, k] .
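The bin-wise gain chain of Eqs. (10) - (15) can be sketched as follows; the nonlinear curves standing in for fsuppression and ffurther, and the scalar content-aware gain, are illustrative placeholders rather than the disclosed mapping functions.

```python
import numpy as np

def suppress_frame(X, N, content_gain, min_gain=0.1):
    """Sketch of the bin-wise gain chain of Eqs. (10)-(15); the mapping curves
    and the content-aware gain are illustrative placeholders."""
    eps = 1e-12
    snr_db = 10.0 * np.log10(np.maximum(X**2 - N**2, eps) / (N**2 + eps))   # Eq. (10)
    snr_lin = 10.0 ** (snr_db / 10.0)
    # Placeholder Wiener-like curve standing in for f_suppression.
    g_sup = np.clip(snr_lin / (1.0 + snr_lin), min_gain, 1.0)               # Eq. (11)
    # Placeholder for f_further: suppress further where the residual stays near N.
    g_fur = np.where(X * g_sup <= N, min_gain, 1.0)                          # Eq. (12)
    g = g_sup * g_fur * content_gain                                         # Eq. (14)
    return X * g                                                             # Eq. (15)
```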
FIG. 3 is a block diagram illustrating a workflow 300 implemented in the audio-event classification and segmentation component 130 of the content-aware noise management system 100 according to some aspects of the present disclosure. An input 302 to the workflow 300 includes the above-described input audio signals 102 and/or frequency-domain signals (on the bus 204) . An output 332 of the workflow 300 includes the above-described content-aware segmentation information.
The workflow 300 includes an audio analysis and classification block 310, a moving-average-convergence-and-divergence (MACD) block 320, and an audio event segmentation block 330. In the audio analysis and classification block 310, the audio signal of each frame of the input 302 is analyzed to obtain the energies and/or is classified to generate raw confidence (s) . In the MACD block 320, the energies or confidences are post-processed using a suitable MACD algorithm. In the audio event segmentation block 330, the audio events of the input 302 are segmented based on real-time MACD-processing results to generate the output 332.
Operations of the MACD block 320 include calculating a difference between short-term smoothing and long-term smoothing of a specific value to predict the trend momentum. In one example, the MACD processing implemented in the MACD block 320 recursively generates four outputs based on the value v [t] as follows:
mashort [t] =αshort·mashort [t-1] + (1-αshort) ·v [t]     (16)
malong [t] =αlong·malong [t-1] + (1-αlong) ·v [t]     (17)
madiff [t] =mashort [t] -malong [t]                 (18)
mmadiff [t] =αdiff·mmadiff [t-1] + (1-αdiff) ·madiff [t]    (19)
where mashort [t] denotes a short-term moving average; malong [t] denotes a long-term moving average; madiff [t] denotes a difference between the short-term moving average and the long-term moving average; and mmadiff [t] denotes a moving average of the difference. In some examples, v [t] represents raw confidences generated by the multi-classification module 240, such as the speech confidence CS [t], the music confidence CM [t], and the noise confidence CN [t] or the raw frame energy EdB [t] . The coefficients αshort, αlong, αdiff∈ [0, 1] represent the short-term smoothing factor, the long-term smoothing factor, and the smoothing factor of convergence/divergence, respectively. These coefficients are hyperparameters of the MACD algorithm.
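A direct transcription of the MACD recursion of Eqs. (16) - (19) might look like the following; the smoothing-factor defaults are illustrative hyperparameters.

```python
class Macd:
    """Recursive MACD tracker for a frame-wise value v[t] (Eqs. (16)-(19));
    the default smoothing factors are illustrative hyperparameters."""
    def __init__(self, alpha_short=0.9, alpha_long=0.99, alpha_diff=0.9):
        self.a_s, self.a_l, self.a_d = alpha_short, alpha_long, alpha_diff
        self.ma_short = self.ma_long = self.mma_diff = 0.0

    def update(self, v):
        self.ma_short = self.a_s * self.ma_short + (1 - self.a_s) * v        # Eq. (16)
        self.ma_long = self.a_l * self.ma_long + (1 - self.a_l) * v          # Eq. (17)
        ma_diff = self.ma_short - self.ma_long                               # Eq. (18)
        self.mma_diff = self.a_d * self.mma_diff + (1 - self.a_d) * ma_diff  # Eq. (19)
        return self.ma_short, self.ma_long, ma_diff, self.mma_diff
```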
In some examples, an audio signal waveform x [t] is first partitioned into windowed, half-overlapping frames, and then the frame data are converted to the frequency domain using either a filter bank or a time-frequency transformer, such as the FFT module 202. As explained previously, the amplitudes of the content of the spectral blocks are collectively denoted as X [t, k] , where k represents the frequency index, and t represents the time index.
In some examples of the workflow 300, the energy module 244 is used to calculate the total energy (expressed in dB) of each frame in accordance with the following mathematical expression:
where K is the total number of bins. The absolute value of EdB [t] may differ for different recording devices, e.g., due to differences in preset system gains, recording distances, and/or speaker loudness. Therefore, it is not advisable to set a threshold for determining an audio event directly based on EdB [t]. However, based on the above-described MACD processing, the mmadiff [t] represents the smoothed difference between the short-term moving average and the long-term moving average of EdB [t] , which can be used to determine a general audio event in a more robust manner, e.g., because mmadiff [t] is not directly dependent on the absolute value of EdB [t] .
Based on the above considerations, the MACD processing of EdB [t] implemented in the MACD block 320 can be represented as follows:
where αshort and αlong are paired factors that define a tradeoff between sensitivity to onset/offset and latency of segmentation. The coefficients αshort and αlong are selectable based on the specific attributes of the MACD application.
In some examples of the workflow 300, operations of the audio event segmentation block 330 include segmenting a general audio event based on the following condition:
where thmma_onset is the threshold used to determine the onset of an event, and thmma_offset is the threshold used to determine the offset of an event.
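Since Eq. (22) is not reproduced above, the sketch below shows one plausible onset/offset (hysteresis) reading of the condition on mmadiff [t]; the threshold values are illustrative.

```python
def segment_general_event(mma_diff_seq, th_onset=3.0, th_offset=-3.0):
    """Mark frames as inside a general audio event using onset/offset thresholds
    on mma_diff[t] (an assumed reading of Eq. (22); thresholds are illustrative)."""
    active = False
    labels = []
    for m in mma_diff_seq:
        if not active and m > th_onset:       # rising energy trend -> event onset
            active = True
        elif active and m < th_offset:        # falling trend -> event offset
            active = False
        labels.append(active)
    return labels
```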
FIG. 4 is a schematic diagram pictorially illustrating audio event clusters used in the workflow 300 according to some aspects of the present disclosure. In the example shown, various audio events are assigned to a plurality of clusters distributed over several audio classes including a vocal class 410, a music class 420, and a noise class 430. The vocal class 410 includes a non-speech  cluster 412 and a speech cluster 414. The music class 420 includes a vocal cluster 422, an instrument cluster 424, and a mixture cluster 426. The noise class 430 includes a selective noise cluster 432, a transient noise cluster 434, and an “other noise” cluster 436. Example members of some of the clusters are listed in FIG. 4 as an illustration. Note that some of the clusters may span more than one class. For example, the vocal cluster 422 spans both the vocal class 410 and the music class 420. In other examples, other suitable clustering schemes may also be used.
In practice, it is challenging to identify the cluster to which an audio signal belongs with 100%accuracy due to the above-mentioned overlaps between clusters and classes. For a live communication system (such as a system carrying audio or video calls, live streaming, and the like) , stationary room background noise, device fan noise, electrical device noise, and the like, are the noise types that can be identified and suppressed with relatively high confidence. However, for transient noise, it may usually be more difficult to unambiguously determine the user preference, whether to suppress or share. Therefore, in designing a noise classifier, a subset that is smaller than the whole noise class 430 can be selected as the training data for the noise classification. In this manner, any overlaps between noise and speech or music can be mitigated to obtain a more robust selective noise classifier.
In some examples of the workflow 300, a current frame and the adjacent historical frame sequence are fed into the multi-classification module 240 to generate raw confidences related to specific classes, such as the speech confidence CS [t] , the music confidence CM [t] , and the noise confidence CN [t] , as described above in reference to FIG. 2 and Eq. (1) . In such examples, the MACD processing of the raw speech confidences CS [t] implemented in the MACD block 320 can be expressed as follows:
The speech event can then be segmented in the audio event segmentation block 330 using the following condition:
where thmma_onset, thmma_offset, thshort_onset, and thshort_offset are the selectable thresholds that depend on the specific application.
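One assumed reading of the speech-event condition of Eq. (24), combining the short-term confidence average with its MACD trend, is sketched below; all threshold values are illustrative.

```python
def segment_speech(ma_short_seq, mma_diff_seq,
                   th_short_onset=0.6, th_short_offset=0.4,
                   th_mma_onset=0.05, th_mma_offset=-0.05):
    """Assumed reading of Eq. (24): a speech event starts when both the short-term
    confidence average and its MACD trend exceed their onset thresholds, and ends
    when both drop below their offset thresholds. Thresholds are illustrative."""
    active = False
    labels = []
    for ma_s, mma in zip(ma_short_seq, mma_diff_seq):
        if not active and ma_s > th_short_onset and mma > th_mma_onset:
            active = True
        elif active and ma_s < th_short_offset and mma < th_mma_offset:
            active = False
        labels.append(active)
    return labels
```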
In some examples, the audio event segmentation block 330 is configured to perform segmentation of music and noise events using a similar approach, e.g., as exemplified by Eqs. (23) - (24) . In such examples, the selection of values for the smoothing factors αshort, αlong, αdiff and for the thresholds thmma_onset, thmma_offset, thshort_onset, and thshort_offset will typically differ for different ones of the classes 410, 420, 430 to properly balance the sensitivity and latency aspects of the workflow 300.
FIG. 5 is a flowchart illustrating an example noise management method 500 according to some aspects of the present disclosure. The method 500 may be implemented using the content-aware noise management system 100. The method 500 may be performed by a processor, which may be configured to perform the method 500 via machine-executable instructions. The method 500 may be broken into various blocks or partitions, such as blocks 502, 504, 506, 508, and 510. The various process blocks illustrated in FIG. 5 provide examples of various methods disclosed herein, and it is understood that some blocks may be removed, added, combined, or modified without departing from the spirit of the present disclosure. In some examples, processing of the various blocks, which may be described as processes, methods, steps, blocks, operations, or functions, may commence at block 502.
At a block 502, “Convert input audio signal (s) to frequency-bin data, ” the method 500 includes converting the input audio signals 102 to the frequency-bin data. In some examples, operations of the block 502 include performing a Fourier transform, e.g., implemented with FFT module 202. Processing may proceed from block 502 to block 504.
At a block 504, “Perform audio event segmentation on converted input audio signals, ” the method 500 also includes performing audio event segmentation on the converted input audio signals. Operations of the block 504 include generating the content-aware segmentation information (on the path 128) corresponding to the input audio signals 102. In some examples, the audio event segmentation is performed in real time. In some examples, the content-aware segmentation information (on the path 128) includes frame-wise speech, music, and noise confidences. The operations of the block 504 also include processing the frame-wise speech, music, and noise confidences to identify a continuous sequence of audio events corresponding to the input audio signals 102. The audio events are selected from a plurality of event clusters, each of which is associated with one or more classes selected from the group consisting of a vocal class 410, a music  class 420, and a noise class 430. In some examples, the processing operation is configured to mitigate an appearance of false selections in the identified continuous sequence. In some examples, operations of the block 504 also include the moving-average-convergence-and-divergence processing 320 of the frame-wise speech, music, and noise confidences.
At a block 506, “Estimate noise floor levels based on segmentation information, ” the method 500 also includes estimating the noise floor levels. The noise floor levels (on the path 230) are associated with the input audio signal 102 and are computed based on the content-aware segmentation information (on the path 128) and further based on sorting the frequency-bin data (on the bus 204) corresponding to a fixed-length portion of the input audio signal 102. Operations of the block 506 include updating the fixed-length portion using the FIFO buffer 212 configured to receive the frequency-bin data (on the bus 204) . The operations of the block 506 also include truncating the sorted frequency-bin data 218 to a size determined with the adaptive percentile estimator 220 based on the content-aware segmentation information (on the path 128) . In some examples, the adaptive percentile estimator 220 is configured to select different respective percentiles for a speech event, a music event, and a noise event, and the determined size is based on a selected one of the different respective percentiles. The operations of the block 506 also include updating the noise floor levels (on the path 230) using the noise floor update module 224. In some examples, the noise update controller 228 is configured to start and stop the updating based on audio event data included in the content-aware segmentation information (on the path 128) .
At a block 508, “Applying noise suppression to frequency-bin data, ” the method 500 also includes applying noise suppression to the input signal 102 represented by the frequency-bin data. In some examples, the noise suppression is performed in frequency bins having a selected frequency resolution and is based at least in part on the content-aware segmentation information (on the path 128) and the estimated noise floor levels (on the path 230) . Operations of the block 508 include calculating bin-wise SNR values based on the amplitudes of the frequency bins and further based on the estimated noise floor levels (on the path 230) . The operations of the block 508 also include computing suppression gain values based on the calculated bin-wise SNR values and further based on the estimated noise floor levels (on the path 230) . In some examples, such computing includes: (i) computing a respective first suppression gain value based on a respective one of the bin-wise signal-to-noise-ratio values; (ii) computing a respective second suppression gain value based on the  respective first suppression gain value and further based on a respective one of the estimated noise floor levels; and (iii) computing a product of the respective first suppression gain value and the respective second suppression gain value. The operations of the block 508 also include applying different respective aggressiveness settings of the noise suppression to different types of audio content based on the content-aware segmentation information (on the path 128) .
In some examples, at least some of the operations of the block 506 are performed using the frequency bins having a first selected frequency resolution. In contrast, at least some of the operations of the block 508 are performed using the frequency bins having a second selected frequency resolution that is different from the first selected frequency resolution. The second selected frequency resolution may be higher than the first selected frequency resolution. In such examples, the operations of the block 508 may also include converting the estimated noise floor levels (on the path 230) from the first selected frequency resolution to the second selected frequency resolution.
At a block 510, “Apply inverse Fourier transform to frequency bins subjected to noise suppression to generate output audio signal (s) , ” the method 500 also includes applying an inverse Fourier transform to the output frequency bins. Operations of the block 510 also include concatenating time-domain signal portions generated via the inverse Fourier transforms to generate the output audio signal 122.
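Tying the blocks together, the following end-to-end sketch reuses the hypothetical helpers from the earlier sketches (BinHistory, estimate_noise_floor, update_noise_floor, suppress_frame); the classify_frame callable stands in for the multi-classification and MACD-based segmentation of block 504, and the overlap-add synthesis is simplified.

```python
import numpy as np

def noise_manage(x, classify_frame, frame_len=1024):
    """End-to-end sketch of blocks 502-510: FFT, segmentation, noise-floor
    estimation, suppression, inverse FFT. classify_frame is an assumed callable
    returning 'speech', 'music', or 'noise' per frame."""
    hop = frame_len // 2
    window = np.hanning(frame_len)
    history = BinHistory(t_buf=50)                          # hypothetical helper
    noise_floor = np.full(frame_len // 2 + 1, 1e-6)
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        spec = np.fft.rfft(x[start:start + frame_len] * window)        # block 502
        amps = np.abs(spec)
        label = classify_frame(amps)                                    # block 504 (simplified)
        history.push(amps)
        floor_est = estimate_noise_floor(history.sorted_bins(), label)  # block 506
        noise_floor = update_noise_floor(noise_floor, floor_est, amps)
        y = suppress_frame(amps, noise_floor, content_gain=1.0)         # block 508
        out_spec = y * np.exp(1j * np.angle(spec))                      # keep original phases
        out[start:start + frame_len] += np.fft.irfft(out_spec) * window # block 510
    return out
```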
FIG. 6A illustrates a schematic block diagram of an example device architecture 600 (e.g., an apparatus 600) that may be used to implement various aspects of the present disclosure. The architecture 600 includes but is not limited to servers and client devices, systems, and methods as described in reference to FIGS. 1-5. As shown, the architecture 600 includes a central processing unit (CPU) 601 which is capable of performing various processes in accordance with a program stored in, for example, a read only memory (ROM) 602 or a program loaded from, for example, a storage unit 608 to a random-access memory (RAM) 603. The CPU 601 may be, for example, an electronic processor 601. In the RAM 603, the data required when the CPU 601 performs the various processes is also stored, as required. The CPU 601, ROM 602, and RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input unit 606, that may include a keyboard, a mouse, or the like; an output unit 607 that may include a display, such as a liquid crystal display (LCD) and one or more speakers; a storage unit 608 including a hard disk, or another suitable storage device; and a communication unit 609 including a network interface card, such as a network card (e.g., wired or wireless) .
In some implementations, the input unit 606 includes one or more microphones in different positions (depending on the host device) enabling capture of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other suitable formats) .
In some implementations, the output unit 607 includes systems with various numbers of speakers. The output unit 607 (depending on the capabilities of the host device) can render audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other suitable formats) .
In some embodiments, the communication unit 609 is configured to communicate with other devices (e.g., via a network) . A drive 610 is also connected to the I/O interface 605, as required. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, a flash drive or another suitable removable medium, is mounted on the drive 610, so that a computer program read therefrom is installed into the storage unit 608, as required. A person skilled in the pertinent art will understand that, although the apparatus 600 is described as including the above-described components, in various applications, it is possible to add, remove, and/or replace some of these components, and all such modifications or alterations fall within the scope of the present disclosure.
In accordance with example embodiments of the present disclosure, the processes described above may be implemented as computer software programs or on a computer-readable storage medium. For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program including program code for performing methods. In such embodiments, the computer program may be downloaded and mounted from the network via the communication unit 609, and/or installed from the removable medium 611, as shown in FIG. 6A.
FIG. 6B illustrates a schematic block diagram of the CPU 601 implemented in the architecture 600 of FIG. 6A according to one example. In the example shown, the CPU 601  includes an electronic processor 620 and a memory 621. The electronic processor 620 is electrically and/or communicatively connected to the memory 621 for bidirectional communication. The memory 621 stores software 622. In some examples, the memory 621 may be located internally to the electronic processor 620, such as for an internal cache memory or some other internally located ROM, RAM, or flash memory. In other examples, the memory 621 may be located externally to the electronic processor 620, such as in the ROM 602, the RAM 603, flash memory, or the removable medium 611, or another non-transitory computer readable medium that is contemplated for the architecture 600. In some instances, the electronic processor 620 runs the software 622 stored in the memory 621 to perform, among other things, various methods and operations associated with the content-aware noise management described above in reference to FIGS. 1-5.
A person of ordinary skill in the pertinent art will readily recognize that the present invention by no means is limited to the example embodiments described above. On the contrary, many modifications and variations are possible and are considered to be within the scope of the appended claims. Various aspects and implementations of the present disclosure may also be appreciated from the following enumerated example embodiments (EEEs) , which are not claims, and which may represent systems, methods, and devices, all arranged in accordance with various aspects of the present disclosure.
EEE1. A noise management method, comprising: performing audio event segmentation on an input audio signal to generate content-aware segmentation information; estimating noise floor levels associated with the input audio signal based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal; and applying noise suppression to the input audio signal to generate an output audio signal, the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels.
EEE2. The method according to EEE1, wherein the audio event segmentation is performed in real time.
EEE3. The method according to EEE1 or EEE2, wherein the content-aware segmentation information includes frame-wise speech, music, and noise confidences.
EEE4. The method according to EEE3, wherein the performing includes processing the frame-wise speech, music, and noise confidences to identify a continuous sequence of audio events corresponding to the input audio signal, the audio events being selected from a plurality of event clusters, each of which is associated with one or more classes selected from the group consisting of a vocal class, a music class, and a noise class.
EEE5. The method according to EEE4, wherein the processing is configured to mitigate an appearance of false selections in the identified continuous sequence.
EEE6. The method according to any one of EEE3 to EEE5, wherein the performing includes a moving-average-convergence-and-divergence processing of the frame-wise speech, music, and noise confidences.
EEE7. The method according to any one of EEE1 to EEE6, wherein the estimating comprises updating the fixed-length portion using a first-in-first-out buffer configured to receive the frequency-bin data.
EEE8. The method according to any one of EEE1 to EEE6, wherein the estimating comprises truncating the sorted frequency-bin data to a size determined with an adaptive percentile estimator based on the content-aware segmentation information.
EEE9. The method according to EEE8, wherein the adaptive percentile estimator is configured to select different respective percentiles for a speech event, a music event, and a noise event, the determined size being based on a selected one of the different respective percentiles.
EEE10. The method according to any one of EEE1 to EEE6, wherein the estimating comprises updating the noise floor levels using an update controller configured to start and stop the updating based on audio event data included in the content-aware segmentation information.
EEE11. The method according to any one of EEE1 to EEE10, wherein the applying comprises calculating bin-wise signal-to-noise-ratio values based on amplitudes of the frequency bins and the estimated noise floor levels.
EEE12. The method according to EEE11, wherein the applying further comprises computing suppression gain values based on the calculated bin-wise signal-to-noise-ratio values and further based on the estimated noise floor levels.
EEE13. The method according to EEE12, wherein, for each of the frequency bins, the computing comprises: computing a respective first suppression gain value based on a respective one of the bin-wise signal-to-noise-ratio values; computing a respective second suppression gain value based on the respective first suppression gain value and further based on a respective one of the estimated noise floor levels; and computing a product of the respective first suppression gain value and the respective second suppression gain value.
EEE14. The method according to any one of EEE1 to EEE13, wherein the applying comprises applying different respective aggressiveness settings of the noise suppression to different types of audio content based on the content-aware segmentation information.
EEE15. The method according to any one of EEE1 to EEE14, wherein the estimating is performed using the frequency bins having a first selected frequency resolution; and wherein the applying is performed using the frequency bins having a second selected frequency resolution that is different from the first selected frequency resolution.
EEE16. The method according to EEE15, wherein the second selected frequency resolution is higher than the first selected frequency resolution.
EEE17. The method according to EEE16, wherein the estimating further comprises converting the estimated noise floor levels from the first selected frequency resolution to the second selected frequency resolution.
EEE18. The method according to any one of EEE1 to EEE17, further comprising: converting the input audio signal to the frequency-bin data using a Fourier transform; and applying an inverse Fourier transform to the frequency bins after having applied the noise suppression thereto to generate the output audio signal.
EEE19. A non-transitory computer-readable medium storing instructions that, when executed by an electronic processor, cause the electronic processor to perform operations comprising the method of any one of EEE1 to EEE18.
EEE20. A noise management apparatus, comprising: at least one processor; and at least one memory including program code, wherein the at least one memory and the program code are configured to, with the at least one processor, cause the apparatus at least to: perform audio event segmentation on an input audio signal to generate content-aware segmentation information; estimate noise floor levels associated with the input audio signal based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal; and apply noise suppression to the input audio signal to generate an output audio signal, the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels.
With regard to the processes, systems, methods, heuristics, etc. described herein, it should be understood that, although the steps of such processes, etc. have been described as occurring according to a certain ordered sequence, such processes could be practiced with the described steps performed in an order other than the order described herein. It further should be understood that certain steps could be performed simultaneously, that other steps could be added, or that certain steps described herein could be omitted. In other words, the descriptions of processes herein are provided for the purpose of illustrating certain embodiments and should in no way be construed so as to limit the claims.
Accordingly, it is to be understood that the above description is intended to be illustrative and not restrictive. Many embodiments and applications other than the examples provided would be apparent upon reading the above description. The scope should be determined, not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. It is anticipated and intended that future developments will occur in the technologies discussed herein, and that the disclosed systems and methods will be incorporated into such future embodiments. In sum, it should be understood that the application is capable of modification and variation.
All terms used in the claims are intended to be given their broadest reasonable constructions and their ordinary meanings as understood by those knowledgeable in the technologies described herein unless an explicit indication to the contrary is made herein. In particular, use of the singular articles such as “a,” “the,” “said,” etc. should be read to recite one or more of the indicated elements unless a claim recites an explicit limitation to the contrary.
The Abstract of the Disclosure is provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in various embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments incorporate more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in fewer than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separately claimed subject matter.
While this disclosure includes references to illustrative embodiments, this specification is not intended to be construed in a limiting sense. Various modifications of the described embodiments, as well as other embodiments within the scope of the disclosure, which are apparent to persons skilled in the art to which the disclosure pertains, are deemed to lie within the principle and scope of the disclosure, e.g., as expressed in the following claims.
Some embodiments may be implemented as circuit-based processes, including possible implementation on a single integrated circuit.
Some embodiments can be embodied in the form of methods and apparatuses for practicing those methods. Some embodiments can also be embodied in the form of program code recorded in tangible media, such as magnetic recording media, optical recording media, solid state memory, floppy diskettes, CD-ROMs, hard drives, or any other non-transitory machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the patented invention(s). Some embodiments can also be embodied in the form of program code, for example, stored in a non-transitory machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer or a processor, the machine becomes an apparatus for practicing the patented invention(s). When implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.
Unless explicitly stated otherwise, each numerical value and range should be interpreted as being approximate as if the word “about” or “approximately” preceded the value or range.
The use of figure numbers and/or figure reference labels in the claims is intended to identify one or more possible embodiments of the claimed subject matter in order to facilitate the interpretation of the claims. Such use is not to be construed as necessarily limiting the scope of those claims to the embodiments shown in the corresponding figures.
Although the elements in the following method claims, if any, are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
Reference herein to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the disclosure. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments. The same applies to the term “implementation.”
Unless otherwise specified herein, the use of the ordinal adjectives “first,” “second,” “third,” etc., to refer to an object of a plurality of like objects merely indicates that different instances of such like objects are being referred to, and is not intended to imply that the like objects so referred to have to be in a corresponding order or sequence, either temporally, spatially, in ranking, or in any other manner.
Unless otherwise specified herein, in addition to its plain meaning, the conjunction “if” may also or alternatively be construed to mean “when” or “upon” or “in response to determining” or “in response to detecting,” which construal may depend on the corresponding specific context. For example, the phrase “if it is determined” or “if [a stated condition] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event].”
Also, for purposes of this description, the terms “couple,” “coupling,” “coupled,” “connect,” “connecting,” or “connected” refer to any manner known in the art or later developed in which energy is allowed to be transferred between two or more elements, and the interposition of one or more additional elements is contemplated, although not required. Conversely, the terms “directly coupled,” “directly connected,” etc., imply the absence of such additional elements.
As used herein in reference to an element and a standard, the term “compatible” means that the element communicates with other elements in a manner wholly or partially specified by the standard and would be recognized by other elements as sufficiently capable of communicating with the other elements in the manner specified by the standard. The compatible element does not need to operate internally in a manner specified by the standard.
The functions of the various elements shown in the figures, including any functional blocks labeled as “processors” and/or “controllers,” may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and nonvolatile storage. Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
As used in this application, the terms “circuit” and “circuitry” may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry); (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation. This definition of “circuitry” applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term “circuitry” also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term “circuitry” also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
It should be appreciated by those of ordinary skill in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in a computer-readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
“BRIEF SUMMARY OF SOME SPECIFIC EMBODIMENTS” in this specification is intended to introduce some example embodiments, with additional embodiments being described in “DETAILED DESCRIPTION” and/or in reference to one or more drawings. “BRIEF SUMMARY OF SOME SPECIFIC EMBODIMENTS” is not intended to identify essential elements or features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.

Claims (20)

  1. A noise management method (500), comprising:
    performing (504) audio event segmentation on an input audio signal to generate content-aware segmentation information;
    estimating (506) noise floor levels associated with the input audio signal based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal; and
    applying (508) noise suppression to the input audio signal to generate an output audio signal, the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels.
  2. The method of claim 1, wherein the audio event segmentation is performed in real time.
  3. The method of claim 1 or 2, wherein the content-aware segmentation information includes frame-wise speech, music, and noise confidences.
  4. The method of claim 3, wherein the performing (504) includes processing the frame-wise speech, music, and noise confidences to identify a continuous sequence of audio events corresponding to the input audio signal, the audio events being selected from a plurality of event clusters, each of which is associated with one or more classes selected from the group consisting of a vocal class, a music class, and a noise class.
  5. The method of claim 4, wherein the processing is configured to mitigate an appearance of false selections in the identified continuous sequence.
  6. The method of any one of claims 3-5, wherein the performing (504) includes a moving-average-convergence-and-divergence processing of the frame-wise speech, music, and noise confidences.
  7. The method of any one of claims 1-6, wherein the estimating (506) comprises updating the fixed-length portion using a first-in-first-out buffer configured to receive the frequency-bin data.
  8. The method of any one of claims 1-6, wherein the estimating (506) comprises truncating the sorted frequency-bin data to a size determined with an adaptive percentile estimator (220) based on the content-aware segmentation information.
  9. The method of claim 8, wherein the adaptive percentile estimator is configured to select different respective percentiles for a speech event, a music event, and a noise event, the determined size being based on a selected one of the different respective percentiles.
  10. The method of any one of claims 1-6, wherein the estimating (506) comprises updating the noise floor levels using an update controller configured to start and stop the updating based on audio event data included in the content-aware segmentation information.
  11. The method of any one of claims 1-10, wherein the applying (508) comprises calculating bin-wise signal-to-noise-ratio values based on amplitudes of the frequency bins and the estimated noise floor levels.
  12. The method of claim 11, wherein the applying (508) further comprises computing suppression gain values based on the calculated bin-wise signal-to-noise-ratio values and further based on the estimated noise floor levels.
  13. The method of claim 12, wherein, for each of the frequency bins, the computing comprises:
    computing a respective first suppression gain value based on a respective one of the bin-wise signal-to-noise-ratio values;
    computing a respective second suppression gain value based on the respective first suppression gain value and further based on a respective one of the estimated noise floor levels; and
    computing a product of the respective first suppression gain value and the respective second suppression gain value.
  14. The method of any one of claims 1-13, wherein the applying (508) comprises applying different respective aggressiveness settings of the noise suppression to different types of audio content based on the content-aware segmentation information.
  15. The method of any one of claims 1-14,
    wherein the estimating (506) is performed using the frequency bins having a first selected frequency resolution; and
    wherein the applying (508) is performed using the frequency bins having a second selected frequency resolution that is different from the first selected frequency resolution.
  16. The method of claim 15, wherein the second selected frequency resolution is higher than the first selected frequency resolution.
  17. The method of claim 16, wherein the estimating (506) further comprises converting the estimated noise floor levels from the first selected frequency resolution to the second selected frequency resolution.
  18. The method of any one of claims 1-17, further comprising:
    converting (502) the input audio signal to the frequency-bin data using a Fourier transform; and
    applying (510) an inverse Fourier transform to the frequency bins after having applied the noise suppression thereto to generate the output audio signal.
  19. A non-transitory computer-readable medium storing instructions that, when executed by an electronic processor, cause the electronic processor to perform operations comprising the method of any one of claims 1-18.
  20. A noise management apparatus (100), comprising:
    at least one processor (601); and
    at least one memory (621) including program code (622),
    wherein the at least one memory and the program code are configured to, with the at least one processor, cause the apparatus at least to:
    perform (504) audio event segmentation on an input audio signal (102) to generate content-aware segmentation information;
    estimate (506) noise floor levels associated with the input audio signal based on the content-aware segmentation information and further based on sorting frequency-bin data corresponding to a fixed-length portion of the input audio signal; and
    apply (508) noise suppression to the input audio signal to generate an output audio signal (122), the noise suppression being performed in frequency bins having a selected frequency resolution and being based at least in part on the content-aware segmentation information and the estimated noise floor levels.
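The following fragment, provided purely for illustration and not as part of the claims, sketches how the content-dependent percentile selection of claims 8-9 and the two-stage gain computation of claims 11-13 might be expressed. The per-event percentile values and both gain formulas are hypothetical choices made for this sketch, not values or formulas taken from the disclosure.

```python
# Illustrative-only sketch of claims 8-9 and 11-13.

# Hypothetical per-event percentiles; the claims specify only that different
# percentiles are selected for speech, music, and noise events.
EVENT_PERCENTILES = {"speech": 5.0, "music": 2.0, "noise": 50.0}

def truncation_size(num_frames, event_label):
    # Claims 8-9: truncate the sorted frequency-bin history to a size derived
    # from the percentile selected for the current audio event.
    pct = EVENT_PERCENTILES.get(event_label, 10.0)
    return max(1, int(num_frames * pct / 100.0))

def suppression_gain(bin_amplitude, noise_floor, floor_weight=0.5):
    # Claims 11-13: a first gain driven by the bin-wise SNR, a second gain
    # driven by the estimated noise floor, and their product as the final
    # per-bin gain. Both formulas below are illustrative assumptions.
    snr = bin_amplitude / max(noise_floor, 1e-12)
    gain_snr = snr / (snr + 1.0)
    gain_floor = 1.0 - floor_weight * noise_floor / (noise_floor + 1.0)
    return gain_snr * gain_floor
```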
PCT/CN2024/100004 2023-06-20 2024-06-19 Content-aware audio noise management WO2024260357A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
ES202330513 2023-06-20
ESP202330513 2023-06-20
ES202330520 2023-06-22
ESP202330520 2023-06-22

Publications (1)

Publication Number Publication Date
WO2024260357A1 true WO2024260357A1 (en) 2024-12-26

Family

ID=93934883

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/100004 WO2024260357A1 (en) 2023-06-20 2024-06-19 Content-aware audio noise management

Country Status (1)

Country Link
WO (1) WO2024260357A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110081026A1 (en) * 2009-10-01 2011-04-07 Qualcomm Incorporated Suppressing noise in an audio signal
US20120157870A1 (en) * 2009-07-07 2012-06-21 Koninklijke Philips Electronics N.V. Noise reduction of breathing signals
US20180374496A1 (en) * 2015-12-16 2018-12-27 Dolby Laboratories Licensing Corporation Suppression of breath in audio signals
WO2022034139A1 (en) * 2020-08-12 2022-02-17 Dolby International Ab Automatic detection and attenuation of speech-articulation noise events

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 24825251
Country of ref document: EP
Kind code of ref document: A1
