US7562013B2 - Method for recovering target speech based on amplitude distributions of separated signals - Google Patents
- Publication number
- US7562013B2 (Application No. US10/572,427)
- Authority
- US
- United States
- Prior art keywords
- spectra
- spectrum
- target speech
- split
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Definitions
- the present invention relates to a method for recovering target speech by extracting estimated spectra of the target speech, while resolving permutation ambiguity based on shapes of amplitude distributions of split spectra that are obtained by use of the Independent Component Analysis (ICA).
- ICA: Independent Component Analysis
- the frequency-domain ICA has an advantage of providing good convergence as compared to the time-domain ICA.
- problems associated with the ICA-specific scaling or permutation ambiguity exist at each frequency bin of the separated signals, and all these problems need to be resolved in the frequency domain.
- Examples addressing the above issues include a method wherein the scaling problems are resolved by use of split spectra and the permutation problems are resolved by analyzing the envelope curve of a split spectrum series at each frequency. This is referred to as the envelope method.
- See, for example, “An Approach to Blind Source Separation based on Temporal Structure of Speech Signals” by N. Murata, S. Ikeda, and A. Ziehe, Neurocomputing, Elsevier, October 2001, Vol. 41, No. 1-4, pp. 1-24.
- the envelope method is often ineffective depending on sound collection conditions. Also, the correspondence between the separated signals and the sound sources (speech and noise) is ambiguous in this method; it is therefore difficult to identify which of the resultant split spectra after permutation correction corresponds to the target speech and which to the noise. For this reason, specific judgment criteria need to be defined in order to extract the estimated spectra for the target speech as well as for the noise from the split spectra.
- the objective of the present invention is to provide a method for recovering target speech based on shapes of amplitude distributions of split spectra obtained by use of blind signal separation, wherein the target speech is recovered by extracting estimated spectra of the target speech while resolving permutation ambiguity of the split spectra obtained through the ICA.
- blind signal separation means a technology for separating and recovering a target sound signal from mixed sound signals emitted from a plurality of sound sources.
- a method for recovering target speech based on shapes of amplitude distributions of split spectra obtained by use of blind signal separation comprises: a first step of receiving target speech emitted from a sound source and a noise emitted from another sound source and forming mixed signals of the target speech and the noise at a first microphone and at a second microphone, the microphones being provided at separate locations; a second step of performing the Fourier transform of the mixed signals from the time domain to the frequency domain, decomposing the mixed signals into two separated signals U1 and U2 by use of the Independent Component Analysis, and, based on transmission path characteristics of four different paths from the two sound sources to the first and second microphones, generating from the separated signal U1 a pair of split spectra v11 and v12, which were received at the first and second microphones respectively, and from the separated signal U2 another pair of split spectra v21 and v22, which were received at the first and second microphones respectively; and a third step of extracting estimated spectra Z* corresponding to the target speech and estimated spectra Z corresponding to the noise by applying criteria based on the shapes of the amplitude distributions of the split spectra v11, v12, v21, and v22, and generating a recovered spectrum group of the target speech from the estimated spectra Z*.
- the target speech emitted from one sound source and the noise emitted from another sound source are received at the first and second microphones provided at separate locations. At each microphone, a mixed signal of the target speech and the noise is formed.
- a statistical method, such as the ICA, may be employed in order to decompose the mixed signals into two independent components, one of which corresponds to the target speech and the other to the noise.
- the mixed signals include convoluted sounds due to reflection and reverberation. Therefore, the Fourier transform of the mixed signals from the time domain to the frequency domain is performed so that they can be treated as in the case of instantaneous mixing, and the frequency-domain ICA is employed to obtain the separated signals U1 and U2 corresponding to the target speech and the noise respectively.
- an amplitude distribution of a spectrum refers to an amplitude distribution of a spectrum series at each frequency.
- the spectra v11 and v12 correspond to one sound source, and
- the spectra v21 and v22 correspond to the other sound source. Therefore, by first obtaining the amplitude distributions for v11 and v22 (or for v12 and v21) and then examining the shape of the amplitude distribution of each of the two spectra, it is possible to assign the one whose amplitude distribution is close to the super-Gaussian to the estimated spectrum Z* corresponding to the target speech, and the other, which has a relatively low kurtosis and a narrow base, to the estimated spectrum Z corresponding to the noise. Thereafter, the recovered spectrum group of the target speech can be generated from all the extracted estimated spectra Z*, and the target speech can be recovered by performing the inverse transform of the estimated spectra Z* back to the time domain.
- the shape of the amplitude distribution of each of the split spectra v11, v12, v21, and v22 is evaluated by means of the entropy E of the amplitude distribution.
- the amplitude distribution is related to a probability density function which shows the frequency of occurrence of a main amplitude value; thus, the shape of the amplitude distribution may be considered to represent uncertainty of the amplitude value.
- entropy E may be employed for this purpose. The entropy E is smaller when the amplitude distribution is close to the super-Gaussian than when it has a relatively low kurtosis and a narrow base. Therefore, the entropy for speech is small, and the entropy for a noise is large.
- a kurtosis may also be employed for a quantitative evaluation of the shape of the amplitude distribution. However, it is less preferable because its results are not robust in the presence of outliers.
- a kurtosis is expressed with up to the fourth-order moment, whereas entropy is expressed as a weighted summation of all of the moments (0th, 1st, 2nd, 3rd, . . . ) by the Taylor expansion. Therefore, entropy is a statistical measure that contains a kurtosis as one of its parts.
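- As a hedged illustration of this relationship (not an equation from the patent): for a zero-mean, unit-variance amplitude y, kurtosis is a fourth-order statistic, while a moment expansion of negentropy, truncated here at fourth order following the Hyvarinen and Oja reference cited in this patent, weights the third- and fourth-order moments and, in its untruncated form, all higher moments as well:

```latex
% Kurtosis: a fourth-order statistic (zero-mean, unit-variance y)
\mathrm{kurt}(y) = E[y^4] - 3\,(E[y^2])^2 = E[y^4] - 3
% Moment (Gram-Charlier) approximation of negentropy, truncated at
% fourth order; the full expansion involves all higher moments:
J(y) \approx \tfrac{1}{12}\,(E[y^3])^2 + \tfrac{1}{48}\,\mathrm{kurt}(y)^2
```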
- the entropy E is obtained by using the amplitude distribution of the real part of each of the split spectra v11, v12, v21, and v22. Since the amplitude distributions of the real part and the imaginary part of each of the split spectra have similar shapes, the entropy E may be obtained by use of either one. The real part is preferable because it represents the actual signal intensities of the speech or the noise in the split spectra.
- alternatively, the entropy may be obtained by using the variable waveform of the absolute value of each of the split spectra v11, v12, v21, and v22. When the variable waveform of the absolute value is used, the variable range is limited to positive values with 0 inclusive, thereby greatly reducing the calculation load for obtaining the entropy.
- the entropy E for the spectrum v11 is denoted as E11,
- the entropy E for the spectrum v22 is denoted as E22, and
- the criteria are given as follows: if the difference ΔE = E11 − E22 is negative, the split spectrum v11 is extracted as the estimated spectrum Z* of the target speech; if ΔE is positive, the split spectrum v21 is extracted as the estimated spectrum Z*.
- in this way, the estimated spectra Z* and Z corresponding to the target speech and the noise are determined respectively. Therefore, it is possible to recover the target speech by extracting its estimated spectra, while resolving permutation ambiguity without effects arising from transmission paths or sound collection conditions.
- input operations by means of speech recognition in a noisy environment, such as voice commands or input for OA (office automation), storage management in logistics, and operation of car navigation systems, may be able to replace conventional input operations by use of fingers, touch sensors, or keyboards.
- FIG. 1 is a block diagram showing a target speech recovering apparatus employing the method for recovering target speech based on shapes of amplitude distributions of split spectra obtained by use of blind signal separation according to one embodiment of the present invention.
- FIG. 2 is an explanatory view showing a signal flow in which a recovered spectrum is generated from the target speech and the noise per the method in FIG. 1.
- FIG. 3(A) is a graph showing the real part of a split spectrum series corresponding to the target speech
- FIG. 3(B) is a graph showing the real part of a split spectrum series corresponding to the noise
- FIG. 3(C) is a graph showing the amplitude distribution of the real part of the split spectrum series corresponding to the target speech
- FIG. 3(D) is a graph showing the amplitude distribution of the real part of the split spectrum series corresponding to the noise.
- a target speech recovering apparatus 10 which employs a method for recovering target speech based on shapes of amplitude distributions of split spectra obtained through blind signal separation according to one embodiment of the present invention, comprises two sound sources 11 and 12 (one of which is a target speech source and the other is a noise source, although they are not identified), a first microphone 13 and a second microphone 14 , which are provided at separate locations for receiving mixed signals transmitted from the two sound sources, a first amplifier 15 and a second amplifier 16 for amplifying the mixed signals received at the microphones 13 and 14 respectively, a recovering apparatus body 17 for separating the target speech and the noise from the mixed signals entered through the amplifiers 15 and 16 and outputting recovered signals of the target speech and the noise, a recovered signal amplifier 18 for amplifying the recovered signals outputted from the recovering apparatus body 17 , and a loudspeaker 19 for outputting the amplified recovered signals.
- These elements are described in detail below.
- for the first and second microphones 13 and 14, microphones with a frequency range wide enough to receive signals over the audible range (10-20000 Hz) may be used.
- for the amplifiers 15 and 16, amplifiers with frequency band characteristics that allow non-distorted amplification of audible signals may be used.
- the recovering apparatus body 17 comprises A/D converters 20 and 21 for digitizing the mixed signals entered through the amplifiers 15 and 16 , respectively.
- the recovering apparatus body 17 further comprises a split spectra generating apparatus 22 , equipped with a signal separating arithmetic circuit and a spectrum splitting arithmetic circuit.
- the signal separating arithmetic circuit performs the Fourier transform of the digitized mixed signals from the time domain to the frequency domain, and decomposes the mixed signals into two separated signals U1 and U2 by means of the FastICA.
- Based on the transmission path characteristics of the four possible paths from the two sound sources 11 and 12 to the first and second microphones 13 and 14, the spectrum splitting arithmetic circuit generates from the separated signal U1 one pair of split spectra v11 and v12, which were received at the first microphone 13 and the second microphone 14 respectively, and generates from the separated signal U2 another pair of split spectra v21 and v22, which were received at the first microphone 13 and the second microphone 14 respectively.
- the recovering apparatus body 17 further comprises: a recovered spectra extracting circuit 23 for extracting estimated spectra Z* corresponding to the target speech and estimated spectra Z corresponding to the noise, and for generating and outputting a recovered spectrum group of the target speech from the estimated spectra Z*, wherein the split spectra v11, v12, v21, and v22 generated by the split spectra generating apparatus 22 are analyzed by applying criteria based on the shape of the amplitude distribution of each of v11, v12, v21, and v22, which depends on the transmission path characteristics of the four different paths from the two sound sources 11 and 12 to the first and second microphones 13 and 14; and a recovered signal generating circuit 24 for performing the inverse Fourier transform of the recovered spectrum group from the frequency domain to the time domain to generate the recovered signal.
- the split spectra generating apparatus 22 equipped with the signal separating arithmetic circuit and the spectrum splitting arithmetic circuit, the recovered spectra extracting circuit 23 , and the recovered signal generating circuit 24 may be structured by loading programs for executing each circuit's functions on, for example, a personal computer. Also, it is possible to load the programs on a plurality of microcomputers and form a circuit for collective operation of these microcomputers.
- the entire recovering apparatus body 17 may be structured by incorporating the A/D converters 20 and 21 into the personal computer.
- for the recovered signal amplifier 18, an amplifier that allows analog conversion and non-distorted amplification of audible signals may be used.
- a loudspeaker that allows non-distorted output of audible signals may be used for the loudspeaker 19 .
- the method for recovering target speech based on the shape of the amplitude distribution of each of the split spectra obtained through blind signal separation comprises: the first step of receiving a signal s1(t) from the sound source 11 and a signal s2(t) from the sound source 12 at the first and second microphones 13 and 14 and forming mixed signals x1(t) and x2(t) at the first microphone 13 and at the second microphone 14 respectively; the second step of performing the Fourier transform of the mixed signals x1(t) and x2(t) from the time domain to the frequency domain, decomposing the mixed signals into two separated signals U1 and U2 by means of the Independent Component Analysis, and, based on the transmission path characteristics of the four possible paths from the sound sources 11 and 12 to the first and second microphones 13 and 14, generating from the separated signal U1 one pair of split spectra v11 and v12, which were received at the first microphone 13 and the second microphone 14 respectively, and from the separated signal U2 another pair of split spectra v21 and v22, which were received at the first microphone 13 and the second microphone 14 respectively; and the third step of extracting the estimated spectra Z* corresponding to the target speech while resolving permutation, and generating the recovered spectrum group of the target speech.
- the signal s1(t) from the sound source 11 and the signal s2(t) from the sound source 12 are assumed to be statistically independent of each other.
- As in Equation (1), when the signals from the sound sources 11 and 12 are convoluted, it is difficult to separate the signals s1(t) and s2(t) from the mixed signals x1(t) and x2(t) in the time domain. Therefore, the mixed signals x1(t) and x2(t) are divided into short time intervals (frames) and are transformed from the time domain to the frequency domain for each frame as in Equation (2):

x(ω, k) = Σ_{t=0..M−1} w(t) x(t + kτ) e^{−jωt} (k = 0, 1, . . . , K−1) (2)
- M is the number of samples in a frame
- w(t) is a window function
- τ is a frame interval
- K is the number of frames.
- the time interval can be about several tens of milliseconds. In this way, it is also possible to treat the spectra as a group of spectrum series by laying out the components at each frequency in the order of frames.
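- The framing and per-frame transform described above can be sketched as follows; this is an illustration under assumed parameter values (M = 256 samples, τ = 64 samples, a Hamming window), not code from the patent:

```python
import numpy as np

def framed_spectra(x, M=256, tau=64):
    """Divide the mixed signal x(t) into K frames of M samples spaced
    tau samples apart, apply the window function w(t) to each frame,
    and take the discrete Fourier transform of each frame.  The result
    has one row per frame k, so each column is a spectrum series over
    the frames at one frequency."""
    w = np.hamming(M)                    # window function w(t)
    K = 1 + (len(x) - M) // tau          # number of frames K
    frames = np.stack([x[k * tau : k * tau + M] * w for k in range(K)])
    return np.fft.rfft(frames, axis=1)   # x(omega, k), frames along axis 0
```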
- the mixed signal spectra x(ω, k) and the corresponding spectra of the signals s1(t) and s2(t) are related to each other in the frequency domain as in Equation (3):
- x ( ⁇ , k ) G ( ⁇ ) s ( ⁇ , k ) (3)
- s( ⁇ ,k) is the discrete Fourier transform of a windowed s(t)
- G( ⁇ ) is a complex number matrix that is the discrete Fourier transform of G(t).
- H( ⁇ ) is defined later in Equation (10)
- Q( ⁇ ) is a whitening matrix
- P is a matrix representing permutation with only one element in each row and each column being 1 and all the other elements being 0
- the two nodes where the separated signal spectra U1(ω, k) and U2(ω, k) are outputted are referred to as node 1 and node 2.
- g 11 ( ⁇ ) is a transfer function from the sound source 11 to the first microphone 13
- g 21 ( ⁇ ) is a transfer function from the sound source 11 to the second microphone 14
- g 12 ( ⁇ ) is a transfer function from the sound source 12 to the first microphone 13
- g 22 ( ⁇ ) is a transfer function from the sound source 12 to the second microphone 14 .
- Each of the four spectra v11(ω, k), v12(ω, k), v21(ω, k), and v22(ω, k) shown in FIG. 2 is determined uniquely by an exclusive combination of one sound source and one transmission path, in spite of permutation. Amplitude ambiguity remains in the separated signal spectra Un(ω, k) as in Equations (13) and (16), but not in the split spectra, as shown in Equations (14), (15), (17) and (18).
- FIGS. 3(A) and 3(B) show the real part of a split spectrum series corresponding to speech and the real part of a split spectrum series corresponding to a noise, respectively.
- FIGS. 3(C) and 3(D) show the shape of the amplitude distribution of the real part of the split spectrum series corresponding to the speech shown in FIG. 3(A) and to the noise shown in FIG. 3(B) , respectively.
- an amplitude distribution of a spectrum refers to an amplitude distribution of a spectrum series over k at each ⁇ .
- The shape of the amplitude distribution of each of v11 and v22 may be evaluated by using the entropy E, which is defined in Equation (19) as follows:

Eij(ω) = −Σ_{n=1..N} pij(ω, ln) log pij(ω, ln) (19)
- ln indicates the n-th interval when the amplitude distribution range is divided into N equal intervals for the real part of v11 and v22, and
- qij(ω, ln) is the frequency of occurrence within the n-th interval.
- E11 is the entropy for v11, and
- E22 is the entropy for v22.
- if the difference ΔE = E11 − E22 is negative, it is judged that permutation is not occurring; thus, v11 is assigned to the estimated spectrum Z* corresponding to the target speech, and v22 is assigned to the estimated spectrum Z corresponding to the noise.
- if the difference ΔE is positive, v21 is assigned to the estimated spectrum Z* corresponding to the target speech, and v12 is assigned to the estimated spectrum Z corresponding to the noise.
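- The entropy criterion of Equations (19) and (20) and the ΔE decision can be sketched as follows; the bin count N and the function names are illustrative assumptions:

```python
import numpy as np

def entropy(spectrum_series, N=100):
    """Entropy E of the amplitude distribution of one split spectrum
    series at a single frequency: divide the amplitude range of the
    real part into N equal intervals, count the frequency of
    occurrence q in each interval, normalize q to a probability p
    (Equation (20)), and return -sum(p log p) (Equation (19))."""
    amplitudes = np.real(spectrum_series)
    q, _ = np.histogram(amplitudes, bins=N)  # q(omega, l_n), n = 1..N
    p = q / q.sum()                          # Equation (20)
    p = p[p > 0]                             # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p))

def extract_target(v11, v21, v22):
    """Permutation decision at one frequency: if dE = E11 - E22 is
    negative, no permutation occurred and v11 is the estimated
    target-speech spectrum Z*; if positive, v21 is taken instead."""
    dE = entropy(v11) - entropy(v22)
    return v11 if dE < 0 else v21
```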
- the recovered signal of the target speech y(t) is thus obtained by performing the inverse Fourier transform of the recovered spectrum group {y(ω, k) | k = 0, 1, . . . , K−1} for each frame back to the time domain, and then taking the summation over all the frames as in Equation (21).
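- A sketch of this frame-wise inverse transform and summation (Equation (21)); the overlap-add arrangement and the omission of window compensation are assumptions of this illustration:

```python
import numpy as np

def recover_signal(Y, M=256, tau=64):
    """Inverse-transform the recovered spectrum group {y(omega, k)}
    frame by frame and sum the frames at their original positions.
    Y has one row per frame, as produced by framed_spectra above.
    Exact reconstruction would additionally require window/overlap
    normalization, omitted here for brevity."""
    K = Y.shape[0]
    y = np.zeros((K - 1) * tau + M)
    for k in range(K):
        y[k * tau : k * tau + M] += np.fft.irfft(Y[k], n=M)
    return y
```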
- Experiments for recovering target speech were conducted in an office 747 cm long, 628 cm wide, and 269 cm high with a reverberation time of about 400 msec, as well as in a conference room of the same volume with a different reverberation time of about 800 msec.
- Two microphones were placed 10 cm apart.
- a noise source was placed at a location 150 cm away from one microphone in a direction 10° outward with respect to a line originating from the microphone and normal to a line connecting the two microphones.
- a speaker was placed at a location 30 cm away from the other microphone in a direction 10° outward with respect to a line originating from the other microphone and normal to a line connecting the two microphones.
- the collected data were discretized with an 8000 Hz sampling frequency and 16-bit resolution.
- the Fourier transform was performed with a 32 msec frame length and an 8 msec frame interval (256 samples and 64 samples, respectively, at the 8000 Hz sampling frequency), using the Hamming window for the window function.
- the FastICA algorithm was employed for the frequency range of 200-3500 Hz.
- the initial weights were estimated by using random numbers in the range of (−1, 1), with up to 1000 iterations and a convergence condition of CC > 0.999999.
- the noise source was a loudspeaker emitting noise recorded from a road during high-speed vehicle driving, as well as two types of non-stationary noise (“classical” and “station”) selected from the NTT Noise Database (Ambient Noise Database for Telephonometry, NTT Advanced Technology Inc., Sep. 1, 1996). Noise levels of 70 dB and 80 dB at the center of the microphones were selected. At the target speech source, each of two speakers (one male and one female) spoke three different words, each word lasting about 3 seconds.
- the spectra v11 and v22 obtained from the separated signal spectra U1 and U2, which had been obtained through the FastICA algorithm, were visually inspected to see whether they were separated well enough to allow a judgment of whether permutation occurred at each frequency. At some low frequencies, the judgment could not be made due to unsatisfactory separation.
- when the noise level was 70 dB, the unsatisfactory separation rate was 0.9% in a non-reverberation room, 1.89% in the office, and 3.38% in the conference room; when the noise level was 80 dB, it was 2.3% in the non-reverberation room, 9.5% in the office, and 12.3% in the conference room.
- the present method shows permutation correction rates of greater than 99% in all the situations, thereby demonstrating robustness against noise and reverberation effects. Better waveforms and sounds were obtained with the present method than with the envelope method.
- Experiments for recovering target speech were conducted in a vehicle running at high speed (90-100 km/h) with the windows closed, the air conditioner (AC) on, and rock music being played from the two front loudspeakers and two side loudspeakers.
- a microphone for receiving the target speech was placed in front of and 35 cm away from a speaker sitting in the passenger seat.
- a microphone for receiving the noise was placed 15 cm away from the microphone for receiving the target speech in a direction toward the window or toward the center.
- the noise level was 73 dB.
- the experimental conditions such as speakers, words, microphones, a separation algorithm, and a sampling frequency were the same as those in Example 1.
- the spectra v11 and v22 obtained from the separated signal spectra U1 and U2, which had been obtained through the FastICA algorithm, were visually inspected to see whether they were separated well enough to allow a judgment of whether permutation occurred at each frequency.
- In Example 2, as in Example 1, the frequencies at which unsatisfactory separation had occurred were removed, and the permutation correction capability was evaluated for each of the three methods: the method according to the present invention, the envelope method, and the locational information method. The results are shown in Table 2.
- with the envelope method, the permutation correction rates are slightly less than 90% and differ by a few percent depending on the location of the microphone for receiving the noise.
- with the present method, the permutation correction rates are greater than 99% regardless of the location of the microphone for receiving the noise.
- with the locational information method, the permutation correction rates are about 80%, lower than the results obtained with the present method or the envelope method. The present method is capable of correcting permutation without relying on information about the sound sources' locations, thereby implying a wider application range.
- the entropy E12 may be used instead of E11, and the entropy E21 may be used instead of E22.
- although the entropy E above is obtained based on the amplitude distribution of the real part of each of the spectra v11, v12, v21, and v22, it is also possible to obtain the entropy E based on the amplitude distribution of the imaginary part.
- alternatively, the entropy E may be obtained based on the variable waveform of the absolute value of each of the spectra v11, v12, v21, and v22.
Description
- (1) if the difference ΔE is negative, the split spectrum v11 is extracted as the estimated spectrum Z*; and
- (2) if the difference ΔE is positive, the split spectrum v21 is extracted as the estimated spectrum Z*.
Among the entropies obtained for the split spectra v11, v12, v21, and v22, the entropies E11 and E12 correspond to one sound source, and the entropies E21 and E22 correspond to the other sound source. The entropies E11 and E12 are therefore considered to be essentially equivalent, as are the entropies E21 and E22. Accordingly, the entropy E11 may be used as the entropy corresponding to the one sound source, and the entropy E22 as the entropy corresponding to the other sound source. After obtaining the entropies E11 and E22 for v11 and v22 respectively, it is possible to assign the smaller one to the target speech and the larger one to the noise. As a result, v11 is assigned to the estimated spectrum Z* if the difference ΔE is negative, i.e. E11 < E22, and v21 is assigned to the estimated spectrum Z* if the difference ΔE is positive, i.e. E11 > E22.
x(t)=G(t)*s(t) (1)
where s(t)=[s1(t), s2(t)]T, x(t)=[x1(t), x2(t)]T, * is a convolution operator, and G(t) represents the transfer functions from the sound sources to the microphones.
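For illustration only, the convolutive mixing of Equation (1) can be simulated as follows; the impulse responses g_ij and the truncation to the signal length are assumptions of this sketch:

```python
import numpy as np

def mix(s1, s2, g11, g12, g21, g22):
    """Equation (1): each microphone observes the sum of the two source
    signals convolved with the impulse response of the corresponding
    path, where g_ij is the path from sound source j to microphone i.
    All arguments are 1-D numpy arrays."""
    n = len(s1)
    x1 = np.convolve(g11, s1)[:n] + np.convolve(g12, s2)[:n]  # first microphone
    x2 = np.convolve(g21, s1)[:n] + np.convolve(g22, s2)[:n]  # second microphone
    return x1, x2
```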
The mixed signals are divided into short time intervals (frames) and transformed from the time domain to the frequency domain for each frame as in Equation (2):

x(ω, k) = Σ_{t=0..M−1} w(t) x(t + kτ) e^{−jωt} (k = 0, 1, . . . , K−1) (2)

where ω (=0, 2π/M, . . . , 2π(M−1)/M) is a normalized frequency, M is the number of samples in a frame, w(t) is a window function, τ is a frame interval, and K is the number of frames. For example, the time interval can be about several tens of milliseconds. In this way, it is also possible to treat the spectra as a group of spectrum series by laying out the components at each frequency in the order of frames.
x(ω, k)=G(ω)s(ω, k) (3)
where s(ω,k) is the discrete Fourier transform of a windowed s(t), and G(ω) is a complex number matrix that is the discrete Fourier transform of G(t).
u(ω, k)=H(ω)Q(ω)x(ω, k) (4)
where u(ω,k)=[U1(ω,k),U2(ω,k)]T.
H(ω)Q(ω)G(ω)=PD(ω) (5)
where H(ω) is defined later in Equation (10), Q(ω) is a whitening matrix, P is a permutation matrix with only one element in each row and each column being 1 and all the other elements being 0, and D(ω)=diag[d1(ω), d2(ω)] is a diagonal matrix representing the amplitude ambiguity. These ambiguities need to be resolved in order to obtain meaningful separated signals for recovery.
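Concretely, in the two-source case the two ambiguities of Equation (5) take the following form; the matrices below are illustrative, not taken from the patent:

```latex
% Permutation: P is either the identity (no permutation) or the swap
P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
% Scaling: D(\omega) rescales each output by an arbitrary complex factor
D(\omega) = \begin{pmatrix} d_1(\omega) & 0 \\ 0 & d_2(\omega) \end{pmatrix}
% so H(\omega)Q(\omega)G(\omega) = P\,D(\omega) delivers the sources in
% swapped order (P) with arbitrary amplitudes (D).
```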
where f(|un(ω,k)|2) is a nonlinear function, f′(|un(ω,k)|2) is the derivative of f(|un(ω,k)|2), * is a conjugate sign, and K is the number of frames.
CC = hn−T(ω) hn+(ω) ≅ 1 (8)
is satisfied (for example, CC becomes greater than or equal to 0.9999), where hn− and hn+ denote the normalized weight vector before and after the update. Further, h2(ω) is orthogonalized with h1(ω) as in Equation (9):
h2(ω) = h2(ω) − h1(ω)h1−T(ω)h2(ω) (9)
and normalized as in Equation (7) again.
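The convergence check of Equation (8) and the orthogonalization of Equation (9) can be sketched as follows; the threshold value and the use of conjugated inner products for complex spectra are assumptions, and the FastICA weight update itself (Equation (6)) is not shown:

```python
import numpy as np

def converged(h_prev, h_new, threshold=0.9999):
    """Equation (8): the weight vector h_n has converged when the inner
    product of the previous and updated normalized vectors is ~1."""
    return np.abs(np.vdot(h_prev, h_new)) >= threshold

def orthogonalize(h2, h1):
    """Equation (9): make h2 orthogonal to the already-estimated h1
    (Gram-Schmidt; np.vdot conjugates its first argument), then
    renormalize as in Equation (7)."""
    h2 = h2 - h1 * np.vdot(h1, h2)
    return h2 / np.linalg.norm(h2)
```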
which is used in Equation (4) to calculate the separated signal spectra u(ω, k)=[U1(ω, k), U2(ω, k)]T at each frequency. As shown in FIG. 2, the two nodes where the separated signal spectra U1(ω, k) and U2(ω, k) are outputted are referred to as node 1 and node 2.
Then, the split spectra for the above separated signal spectra Un(ω,k) are generated as in Equations (14) and (15):
which show that the split spectra at each node are expressed as the product of the spectrum s1(ω, k) and a transfer function, or the product of the spectrum s2(ω, k) and a transfer function. Note here that g11(ω) is a transfer function from the sound source 11 to the first microphone 13, g21(ω) from the sound source 11 to the second microphone 14, g12(ω) from the sound source 12 to the first microphone 13, and g22(ω) from the sound source 12 to the second microphone 14,
and the split spectra at the node 2 are generated similarly as in Equations (17) and (18).
In the above, the spectra v11(ω, k) and v12(ω, k) generated at the node 1 correspond to one sound source, and the spectra v21(ω, k) and v22(ω, k) generated at the node 2 correspond to the other sound source, each determined uniquely by an exclusive combination of one sound source and one transmission path.
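One common reading of the split-spectrum construction (consistent with the Gotanda et al. reference cited below) back-projects each separated component through the inverse of the overall separation matrix; the following sketch rests on that assumption:

```python
import numpy as np

def split_spectra(U1, U2, H, Q):
    """Back-project each separated signal spectrum U_n(omega, k) to
    both microphones at one frequency.  A = (H Q)^-1 maps the separated
    components back to the microphone observations, so A[j, n] * U_n is
    the part of U_n as received at microphone j+1, which removes the
    amplitude ambiguity of U_n."""
    A = np.linalg.inv(H @ Q)               # 2x2 matrix at this frequency
    v11, v12 = A[0, 0] * U1, A[1, 0] * U1  # U1 as seen at microphones 1 and 2
    v21, v22 = A[0, 1] * U2, A[1, 1] * U2  # U2 as seen at microphones 1 and 2
    return v11, v12, v21, v22
```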
where pij(ω, ln) (n=1, 2, . . . , N) is a probability, which is equivalent to qij(ω, ln) (n=1, 2, . . . , N) normalized as in the following Equation (20). Here, ln indicates the n-th interval when the amplitude distribution range is divided into N equal intervals for the real part of v11 and v22, and qij(ω, ln) is the frequency of occurrence within the n-th interval.

pij(ω, ln) = qij(ω, ln) / Σ_{n′=1..N} qij(ω, ln′) (20)
TABLE 1

| Noise Level | Correction Method | Office | Conference Room |
|---|---|---|---|
| 70 dB | Envelope Method | 93.1% | 96.0% |
| 70 dB | Locational Information Method | 94.2% | 57.7% |
| 70 dB | Present Method | 99.9% | 99.9% |
| 80 dB | Envelope Method | 93.1% | 90.7% |
| 80 dB | Locational Information Method | 88.3% | 55.0% |
| 80 dB | Present Method | 99.8% | 99.8% |
TABLE 2

| Microphone Position | Envelope Method | Locational Information Method | Present Method |
|---|---|---|---|
| Microphone for Noise, toward Window | 86.6% | 80.4% | 99.4% |
| Microphone for Noise, toward Center | 89.6% | 76.6% | 99.4% |
Claims (5)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2003324733A JP4496379B2 (en) | 2003-09-17 | 2003-09-17 | Reconstruction method of target speech based on shape of amplitude frequency distribution of divided spectrum series |
JP2003-324733 | 2003-09-17 | ||
PCT/JP2004/012898 WO2005029467A1 (en) | 2003-09-17 | 2004-08-31 | A method for recovering target speech based on amplitude distributions of separated signals |
Publications (2)
Publication Number | Publication Date |
---|---|
US20070100615A1 US20070100615A1 (en) | 2007-05-03 |
US7562013B2 true US7562013B2 (en) | 2009-07-14 |
Family
ID=34372753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/572,427 Expired - Fee Related US7562013B2 (en) | 2003-09-17 | 2004-08-31 | Method for recovering target speech based on amplitude distributions of separated signals |
Country Status (3)
Country | Link |
---|---|
US (1) | US7562013B2 (en) |
JP (1) | JP4496379B2 (en) |
WO (1) | WO2005029467A1 (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070208560A1 (en) * | 2005-03-04 | 2007-09-06 | Matsushita Electric Industrial Co., Ltd. | Block-diagonal covariance joint subspace typing and model compensation for noise robust automatic speech recognition |
US20080189103A1 (en) * | 2006-02-16 | 2008-08-07 | Nippon Telegraph And Telephone Corp. | Signal Distortion Elimination Apparatus, Method, Program, and Recording Medium Having the Program Recorded Thereon |
US20090150146A1 (en) * | 2007-12-11 | 2009-06-11 | Electronics & Telecommunications Research Institute | Microphone array based speech recognition system and target speech extracting method of the system |
US20100002899A1 (en) * | 2006-08-01 | 2010-01-07 | Yamaha Coporation | Voice conference system |
US20100092000A1 (en) * | 2008-10-10 | 2010-04-15 | Kim Kyu-Hong | Apparatus and method for noise estimation, and noise reduction apparatus employing the same |
US20100128897A1 (en) * | 2007-03-30 | 2010-05-27 | Nat. Univ. Corp. Nara Inst. Of Sci. And Tech. | Signal processing device |
US20100274554A1 (en) * | 2005-06-24 | 2010-10-28 | Monash University | Speech analysis system |
US20100296665A1 (en) * | 2009-05-19 | 2010-11-25 | Nara Institute of Science and Technology National University Corporation | Noise suppression apparatus and program |
US20120310637A1 (en) * | 2011-06-01 | 2012-12-06 | Parrot | Audio equipment including means for de-noising a speech signal by fractional delay filtering, in particular for a "hands-free" telephony system |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3827317B2 (en) * | 2004-06-03 | 2006-09-27 | 任天堂株式会社 | Command processing unit |
JP4449871B2 (en) | 2005-01-26 | 2010-04-14 | ソニー株式会社 | Audio signal separation apparatus and method |
JP4556875B2 (en) * | 2006-01-18 | 2010-10-06 | ソニー株式会社 | Audio signal separation apparatus and method |
ATE527833T1 (en) | 2006-05-04 | 2011-10-15 | Lg Electronics Inc | IMPROVE STEREO AUDIO SIGNALS WITH REMIXING |
JP2008039694A (en) * | 2006-08-09 | 2008-02-21 | Toshiba Corp | Signal count estimation system and method |
KR100891666B1 (en) | 2006-09-29 | 2009-04-02 | 엘지전자 주식회사 | Apparatus for processing audio signal and method thereof |
JP5232791B2 (en) | 2006-10-12 | 2013-07-10 | エルジー エレクトロニクス インコーポレイティド | Mix signal processing apparatus and method |
WO2008060111A1 (en) | 2006-11-15 | 2008-05-22 | Lg Electronics Inc. | A method and an apparatus for decoding an audio signal |
JP5463143B2 (en) | 2006-12-07 | 2014-04-09 | エルジー エレクトロニクス インコーポレイティド | Audio signal decoding method and apparatus |
EP2122612B1 (en) | 2006-12-07 | 2018-08-15 | LG Electronics Inc. | A method and an apparatus for processing an audio signal |
JP5642339B2 (en) * | 2008-03-11 | 2014-12-17 | トヨタ自動車株式会社 | Signal separation device and signal separation method |
WO2009151578A2 (en) * | 2008-06-09 | 2009-12-17 | The Board Of Trustees Of The University Of Illinois | Method and apparatus for blind signal recovery in noisy, reverberant environments |
US8073634B2 (en) * | 2008-09-22 | 2011-12-06 | University Of Ottawa | Method to extract target signals of a known type from raw data containing an unknown number of target signals, interference, and noise |
KR101233271B1 (en) * | 2008-12-12 | 2013-02-14 | 신호준 | Method for signal separation, communication system and voice recognition system using the method |
JP5375400B2 (en) * | 2009-07-22 | 2013-12-25 | ソニー株式会社 | Audio processing apparatus, audio processing method and program |
JP2011081293A (en) * | 2009-10-09 | 2011-04-21 | Toyota Motor Corp | Signal separation device and signal separation method |
CN102447993A (en) * | 2010-09-30 | 2012-05-09 | Nxp股份有限公司 | Sound scene manipulation |
CN102543098B (en) * | 2012-02-01 | 2013-04-10 | 大连理工大学 | Frequency domain voice blind separation method for multi-frequency-band switching call media node (CMN) nonlinear function |
US10497381B2 (en) | 2012-05-04 | 2019-12-03 | Xmos Inc. | Methods and systems for improved measurement, entity and parameter estimation, and path propagation effect measurement and mitigation in source signal separation |
WO2013166439A1 (en) * | 2012-05-04 | 2013-11-07 | Setem Technologies, Llc | Systems and methods for source signal separation |
WO2014145960A2 (en) | 2013-03-15 | 2014-09-18 | Short Kevin M | Method and system for generating advanced feature discrimination vectors for use in speech recognition |
JP6539829B1 (en) * | 2018-05-15 | 2019-07-10 | 角元 純一 | How to detect voice and non-voice level |
JP7159767B2 (en) * | 2018-10-05 | 2022-10-25 | 富士通株式会社 | Audio signal processing program, audio signal processing method, and audio signal processing device |
CN113077808B (en) * | 2021-03-22 | 2024-04-26 | 北京搜狗科技发展有限公司 | Voice processing method and device for voice processing |
CN113576527A (en) * | 2021-08-27 | 2021-11-02 | 复旦大学 | Method for judging ultrasonic input by using voice control |
- 2003-09-17: JP JP2003324733A patent/JP4496379B2/en, status: not_active Expired - Fee Related
- 2004-08-31: US US10/572,427 patent/US7562013B2/en, status: not_active Expired - Fee Related
- 2004-08-31: WO PCT/JP2004/012898 patent/WO2005029467A1/en, status: active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002023776A (en) | 2000-07-13 | 2002-01-25 | Univ Kinki | A method for discriminating speaker speech and non-speech noise in blind separation and a method for specifying speaker speech channels |
Non-Patent Citations (6)
A. Cichocki and S. Amari, "Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications", 1st Edition, 2002, John Wiley & Sons, Ltd, pp. 128-175.
A. Hyvarinen and E. Oja, "Independent Component Analysis: Algorithms and Applications", Neural Networks Research Centre, Helsinki University of Technology, Pergamon Press, Jun. 2000, vol. 13, No. 4-5, pp. 1-31.
E. Bingham and A. Hyvarinen, "A Fast Fixed-Point Algorithm for Independent Component Analysis of Complex Valued Signals", International Journal of Neural Systems, vol. 10, No. 1, Feb. 2000, World Scientific Publishing Company, pp. 1-8.
K. Nobu et al., "Noise Cancellation Based on Split Spectra by Using Sounds Location", Journal of Robotics and Mechatronics, vol. 15, No. 1, 2003, pp. 15-23.
N. Gotanda et al., "Permutation Correction and Speech Extraction Based on Split Spectrum Through FastICA", 4th International Symposium on Independent Component Analysis and Blind Signal Separation, Apr. 2003, Nara, Japan, pp. 379-384.
N. Murata et al., "An Approach to Blind Source Separation Based on Temporal Structure of Speech Signals", Neurocomputing, Oct. 2001, vol. 41, Elsevier Science B.V., pp. 1-24.
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070208560A1 (en) * | 2005-03-04 | 2007-09-06 | Matsushita Electric Industrial Co., Ltd. | Block-diagonal covariance joint subspace typing and model compensation for noise robust automatic speech recognition |
US7729909B2 (en) * | 2005-03-04 | 2010-06-01 | Panasonic Corporation | Block-diagonal covariance joint subspace tying and model compensation for noise robust automatic speech recognition |
US20100274554A1 (en) * | 2005-06-24 | 2010-10-28 | Monash University | Speech analysis system |
US20080189103A1 (en) * | 2006-02-16 | 2008-08-07 | Nippon Telegraph And Telephone Corp. | Signal Distortion Elimination Apparatus, Method, Program, and Recording Medium Having the Program Recorded Thereon |
US8494845B2 (en) * | 2006-02-16 | 2013-07-23 | Nippon Telegraph And Telephone Corporation | Signal distortion elimination apparatus, method, program, and recording medium having the program recorded thereon |
US8462976B2 (en) * | 2006-08-01 | 2013-06-11 | Yamaha Corporation | Voice conference system |
US20100002899A1 (en) * | 2006-08-01 | 2010-01-07 | Yamaha Coporation | Voice conference system |
US20100128897A1 (en) * | 2007-03-30 | 2010-05-27 | Nat. Univ. Corp. Nara Inst. Of Sci. And Tech. | Signal processing device |
US8488806B2 (en) * | 2007-03-30 | 2013-07-16 | National University Corporation NARA Institute of Science and Technology | Signal processing apparatus |
US20090150146A1 (en) * | 2007-12-11 | 2009-06-11 | Electronics & Telecommunications Research Institute | Microphone array based speech recognition system and target speech extracting method of the system |
US8249867B2 (en) * | 2007-12-11 | 2012-08-21 | Electronics And Telecommunications Research Institute | Microphone array based speech recognition system and target speech extracting method of the system |
US20100092000A1 (en) * | 2008-10-10 | 2010-04-15 | Kim Kyu-Hong | Apparatus and method for noise estimation, and noise reduction apparatus employing the same |
US9159335B2 (en) | 2008-10-10 | 2015-10-13 | Samsung Electronics Co., Ltd. | Apparatus and method for noise estimation, and noise reduction apparatus employing the same |
US20100296665A1 (en) * | 2009-05-19 | 2010-11-25 | Nara Institute of Science and Technology National University Corporation | Noise suppression apparatus and program |
US20120310637A1 (en) * | 2011-06-01 | 2012-12-06 | Parrot | Audio equipment including means for de-noising a speech signal by fractional delay filtering, in particular for a "hands-free" telephony system |
US8682658B2 (en) * | 2011-06-01 | 2014-03-25 | Parrot | Audio equipment including means for de-noising a speech signal by fractional delay filtering, in particular for a “hands-free” telephony system |
Also Published As
Publication number | Publication date |
---|---|
US20070100615A1 (en) | 2007-05-03 |
JP2005091732A (en) | 2005-04-07 |
JP4496379B2 (en) | 2010-07-07 |
WO2005029467A1 (en) | 2005-03-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7562013B2 (en) | Method for recovering target speech based on amplitude distributions of separated signals | |
US10127922B2 (en) | Sound source identification apparatus and sound source identification method | |
US10319390B2 (en) | Method and system for multi-talker babble noise reduction | |
US7315816B2 (en) | Recovering method of target speech based on split spectra using sound sources' locational information | |
US9008329B1 (en) | Noise reduction using multi-feature cluster tracker | |
CN103811020B (en) | A kind of intelligent sound processing method | |
US9093079B2 (en) | Method and apparatus for blind signal recovery in noisy, reverberant environments | |
EP1891624B1 (en) | Multi-sensory speech enhancement using a speech-state model | |
US7533017B2 (en) | Method for recovering target speech based on speech segment detection under a stationary noise | |
US10002623B2 (en) | Speech-processing apparatus and speech-processing method | |
US10622008B2 (en) | Audio processing apparatus and audio processing method | |
US20110029309A1 (en) | Signal separating apparatus and signal separating method | |
JP4496378B2 (en) | Restoration method of target speech based on speech segment detection under stationary noise | |
JP2002023776A (en) | A method for discriminating speaker speech and non-speech noise in blind separation and a method for specifying speaker speech channels | |
Al-Ali et al. | Enhanced forensic speaker verification using multi-run ICA in the presence of environmental noise and reverberation conditions | |
Gotanda et al. | Permutation correction and speech extraction based on split spectrum through FastICA | |
KR101658001B1 (en) | Online target-speech extraction method for robust automatic speech recognition | |
Minipriya et al. | Review of ideal binary and ratio mask estimation techniques for monaural speech separation | |
JP2007047427A (en) | Audio processing device | |
Kalamani et al. | Modified least mean square adaptive filter for speech enhancement | |
Ishibashi et al. | Blind source separation for human speeches based on orthogonalization of joint distribution of observed mixture signals | |
CN118571219B (en) | Method, device, equipment and storage medium for enhancing personnel dialogue in seat cabin | |
Wang et al. | Real time hearing enhancement in crowded social environments with noise gating | |
Sang et al. | Supervised sparse coding strategy in hearing aids | |
JP6059112B2 (en) | Sound source separation device, method and program thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: KITAKYUSHU FOUNDATION FOR THE ADVANCEMENT OF INDUS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GOTANDA, HIROMU;KANEDA, KEIICHI;KOYA, TAKESHI;REEL/FRAME:017673/0704 Effective date: 20060224 Owner name: KINKI UNIVERSITY, JAPAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GOTANDA, HIROMU;KANEDA, KEIICHI;KOYA, TAKESHI;REEL/FRAME:017673/0704 Effective date: 20060224 |
|
AS | Assignment |
Owner name: KITAKYUSHU FOUNDATION FOR THE ADVANCEMENT OF INDUS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:KINKI UNIVERSITY;REEL/FRAME:022780/0957 Effective date: 20090526 |
|
FEPP | Fee payment procedure |
Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
REMI | Maintenance fee reminder mailed | ||
LAPS | Lapse for failure to pay maintenance fees | ||
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20170714 |