CN106297772A

CN106297772A - Detection method is attacked in the playback of voice signal distorted characteristic based on speaker introducing

Info

Publication number: CN106297772A
Application number: CN201610716612.XA
Authority: CN
Inventors: 任延珍; 方众; 王立洁; 张月丹; 陈思仪
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2016-08-24
Filing date: 2016-08-24
Publication date: 2017-01-04
Anticipated expiration: 2036-08-24
Also published as: CN106297772B

Abstract

The invention discloses a playback attack detection method based on the distortion characteristics of a speech signal introduced by a loudspeaker. First, the speech to be detected is preprocessed, and the voiced sound frame is retained; feature extraction is performed for each voiced sound frame in the preprocessed speech signal, Obtain the eigenvector based on the linear distortion and nonlinear distortion characteristics of the speech signal; average the eigenvectors of all voiced frames to form a statistical eigenvector and obtain the feature model of the speech to be tested; then extract the eigenvector of the training speech sample to obtain the training Speech feature model, and use the trained speech feature model to train the SVM model to obtain a speech model library; finally, perform SVM pattern matching on the feature model of the speech to be tested and the trained speech model library, and output the judgment result. The invention can realize real-time and effective detection of playback voice.

Description

Playback Attack Detection Method Based on the Distortion Characteristics of Speech Signal Introduced by Loudspeaker

技术领域technical field

本发明属于数字媒体处理领域，涉及一种回放攻击检测方法，特别涉及一种判别语音是否为回放攻击的语音内容安全的方法。The invention belongs to the field of digital media processing, and relates to a playback attack detection method, in particular to a method for judging whether a voice is a voice content safe for a playback attack.

背景技术Background technique

生物特征作为生物体固有属性这一优势导致生物识别技术由之出现，说话人识别隶属于生物识别，是根据使用者的语音样本来实现身份认证。由于语音相对其他生物特征，具有拾音设备简单、随时随地可用、数据量小等优势，声纹验证技术从提出后已经经过了60多年的发展并且取得了巨大的进步，得到了广泛的应用。但是，目前针对声纹的身份认证系统却面临着各种伪装攻击，包括录音回放、语音合成、语音转换和语音模仿四种方式，其中回放攻击是指攻击者使用录音设备录制合法用户进入认证系统时的语音，然后在系统的拾音器端通过扬声器回放，达到伪装用户进入系统的目的。由于现有录音设备的低廉性和便携性，使得这一攻击操作简便，实现容易，录音回放攻击已成为最广泛威胁性最大的伪装攻击手段。现有的主流的说话人识别平台对于回放攻击的错误接受率极高，这表明录音回放攻击对于声纹认证平台的安全有极大的威胁性，由此可见如何实现录音回放攻击检测成为基于声纹的身份认证系统中急需解决的一个重要问题。The advantage of biological characteristics as an inherent attribute of living organisms has led to the emergence of biometric technology. Speaker recognition is part of biometrics, which is based on the user's voice samples to achieve identity authentication. Compared with other biological characteristics, voice has the advantages of simple voice pickup equipment, usable anytime and anywhere, and small data volume. Voiceprint verification technology has been developed for more than 60 years since it was proposed, and has made great progress, and has been widely used. However, the current identity authentication system for voiceprints is facing various masquerade attacks, including recording playback, speech synthesis, speech conversion, and voice imitation. The playback attack refers to the attacker using a recording device to record a legitimate user entering the authentication system. The real-time voice is then played back through the loudspeaker at the pickup end of the system, so as to achieve the purpose of disguising the user and entering the system. Due to the low cost and portability of existing recording equipment, this attack is easy to operate and easy to implement. Recording and playback attacks have become the most widespread and most threatening means of masquerading attacks. The existing mainstream speaker recognition platform has a very high false acceptance rate for replay attacks, which shows that recording replay attacks pose a great threat to the security of the voiceprint authentication platform. This shows how to implement recording replay attack detection based on voice It is an important problem that urgently needs to be solved in the identity authentication system of the pattern.

自录音回放攻击出现以来，国内外仅有少数研究团队对此进行了研究，其主要技术成果集中在2011年以前，近年来发展缓慢。并且现有研究成果对于语音采样频率，系统存储空间，语音采集环境等条件有严格要求和限制，也无法达到准确率高，实时性强的识别效果，所以均不能广泛适用于现有声纹识别平台。Since the recording playback attack appeared, only a few research teams at home and abroad have studied it, and their main technical achievements were concentrated before 2011, and the development has been slow in recent years. Moreover, the existing research results have strict requirements and restrictions on the voice sampling frequency, system storage space, voice collection environment and other conditions, and cannot achieve high accuracy and real-time recognition effects, so they cannot be widely applied to existing voiceprint recognition platforms. .

语音信号频谱图可以准确直观的反映出语音信号被修改前后的变化和差异，而回放攻击的过程相比于原始语音引入了麦克风采集，数字压缩和扬声器播放三个环节，每一个环节都可能会引起语音信号的改变。所以根据三个环节中语音信号频谱图的变化进行分析，提出基于语音信号频谱特性的回放攻击检测算法，可以设计实现出具有良好普适性、实时性和较高准确性的回放攻击检测算法。The spectrogram of the voice signal can accurately and intuitively reflect the changes and differences before and after the voice signal is modified. Compared with the original voice, the playback attack process introduces three links: microphone collection, digital compression, and speaker playback, and each link may be damaged. cause a change in the speech signal. Therefore, according to the analysis of the changes in the speech signal spectrum in the three links, a replay attack detection algorithm based on the speech signal spectrum characteristics can be designed and realized with good universality, real-time performance and high accuracy.

发明内容Contents of the invention

本发明针对现有声纹识别系统无法抵抗回放攻击的安全漏洞，提供了一种基于扬声器引入的语音信号失真特性的回放攻检测方法。Aiming at the safety loophole that the existing voiceprint recognition system cannot resist the replay attack, the present invention provides a replay attack detection method based on the distortion characteristics of the voice signal introduced by the loudspeaker.

本发明所采用的技术方案是：一种基于扬声器引入的语音信号失真特性的回放攻检测方法，其特征在于，包括以下步骤：The technical solution adopted in the present invention is: a playback attack detection method based on the distortion characteristics of the speech signal introduced by the loudspeaker, which is characterized in that, comprising the following steps:

步骤1：对待检测语音进行预处理，保留其中的浊音帧；Step 1: Preprocessing the speech to be detected, retaining voiced frames;

步骤2：针对预处理后语音信号中的每一个浊音帧进行特征提取，得到基于语音信号线性失真和非线性失真特性的特征向量；Step 2: perform feature extraction for each voiced sound frame in the preprocessed speech signal, and obtain a feature vector based on the linear distortion and nonlinear distortion characteristics of the speech signal;

步骤3：所有的浊音帧的特征向量求平均值，形成统计特征向量，获得待测语音的特征模型；Step 3: the feature vectors of all voiced frames are averaged to form a statistical feature vector to obtain a feature model of the speech to be tested;

步骤4：提取训练语音样本的特征向量，获得训练语音特征模型，并利用该训练语音特征模型来训练SVM模型，获得语音模型库；Step 4: extract the feature vector of the training speech sample, obtain the training speech feature model, and use the training speech feature model to train the SVM model, and obtain the speech model library;

步骤5：将待测语音的特征模型与已训练好的语音模型库进行SVM模式匹配，输出判决结果。Step 5: Perform SVM pattern matching between the feature model of the speech to be tested and the trained speech model library, and output the judgment result.

作为优选，步骤1所述对待检测语音进行预处理，是使用汉明窗对语音信号进行分帧加窗处理，帧长为70ms，保留其中的浊音帧。Preferably, the preprocessing of the speech to be detected in step 1 is to use a Hamming window to perform frame division and window processing on the speech signal, the frame length is 70 ms, and the voiced sound frames are retained.

作为优选，步骤2所述针对预处理后语音信号中的每一个浊音帧进行特征提取，是提取基于语音信号线性失真和非线性失真特性的26维特征向量。Preferably, the feature extraction for each voiced sound frame in the preprocessed speech signal in step 2 is to extract a 26-dimensional feature vector based on the linear distortion and nonlinear distortion characteristics of the speech signal.

作为优选，所述提取基于语音信号线性失真特征向量，由低频比、低频方差、低频差分方差、低频拟合和全局低频比五种特征，共计10维向量组成；Preferably, the extraction is based on the speech signal linear distortion feature vector, consisting of five features of low frequency ratio, low frequency variance, low frequency difference variance, low frequency fitting and global low frequency ratio, consisting of a total of 10 dimensional vectors;

所述低频比其中X(f)为对每一帧的快速傅里叶变换；The low frequency ratio Wherein X(f) is the fast Fourier transform of each frame;

所述低频方差其中 The low frequency variance in

所述低频差分方差其中 The low frequency difference variance in

所述低频拟合是利用6维拟合特征对于0～500Hz的FFT采样点进行拟合，拟合公式为其中x为0～500Hz的FFT采样点，a_i表示拟合的系数；The low-frequency fitting is to use the 6-dimensional fitting feature to fit the FFT sampling points of 0-500 Hz, and the fitting formula is Where x is the FFT sampling point from 0 to 500Hz, and a _i represents the fitting coefficient;

所述全局低频比 The global low frequency ratio

作为优选，所述提取基于语音信号非线性失真特征向量，包括总谐波失真、削波比和音色向量三种特征，共计16维特征向量；Preferably, the extraction is based on the nonlinear distortion feature vector of the speech signal, including three features of total harmonic distortion, clipping ratio and timbre vector, with a total of 16 dimensional feature vectors;

所述总谐波失真其中X(f)为每一帧的快速傅氏变换，f₁为基音频率，f_i为i倍基音频率；The total harmonic distortion in X(f) is the fast Fourier transform of each frame, f ₁ is the pitch frequency, and f _i is i times the pitch frequency;

所述削波比其中x为时域谱，len为时域谱长度；The clipping ratio in x is the time-domain spectrum, len is the length of the time-domain spectrum;

所述音色向量 The Timbre Vector

作为优选，步骤3所述统计特征向量，是26维统计特征向量。Preferably, the statistical feature vector in step 3 is a 26-dimensional statistical feature vector.

作为优选，步骤4所述训练语音样本，来自若干设备和若干位录制者，包括回放语音和原始语音。Preferably, the training speech samples in step 4 come from several devices and several recorders, including playback speech and original speech.

作为优选，步骤4中在提取训练语音样本特征向量以后，利用LIBSVM对训练语音样本集中的特征数据库进行二分类训练，所述特征数据库由训练语音样本特征向量组成。Preferably, in step 4, after extracting the feature vectors of the training speech samples, LIBSVM is used to perform binary classification training on the feature database in the training speech sample set, the feature database is composed of the feature vectors of the training speech samples.

本发明的有益效果是：本发明可以集成于现有的声纹识别平台，实现对回放语音实时有效的检测，为当前信息时代的司法取证、电子商务、金融系统等领域提供安全有效的身份认证技术支持。The beneficial effects of the present invention are: the present invention can be integrated into the existing voiceprint recognition platform to realize real-time and effective detection of the playback voice, and provide safe and effective identity authentication for judicial evidence collection, e-commerce, financial systems and other fields in the current information age Technical Support.

附图说明Description of drawings

图1是本发明实施例的算法总体流程图；Fig. 1 is the overall flowchart of the algorithm of the embodiment of the present invention;

图2是本发明实施例的特征提取流程图；Fig. 2 is a feature extraction flowchart of an embodiment of the present invention;

图3是本发明实施例的回放攻击引入的差异对比图；Fig. 3 is a comparison diagram of differences introduced by replay attacks according to an embodiment of the present invention;

图4是本发明实施例的加速度频率响应曲线图；Fig. 4 is the acceleration frequency response graph of the embodiment of the present invention;

图5是本发明实施例的描述低频衰减失真的频谱图；FIG. 5 is a spectrum diagram describing low-frequency attenuation distortion according to an embodiment of the present invention;

图6是本发明实施例的描述高频谐波失真的频谱图。FIG. 6 is a spectrum diagram describing high-frequency harmonic distortion according to an embodiment of the present invention.

具体实施方式detailed description

为了便于本领域普通技术人员理解和实施本发明，下面结合附图及实施例对本发明作进一步的详细描述，应当理解，此处所描述的实施示例仅用于说明和解释本发明，并不用于限定本发明。In order to facilitate those of ordinary skill in the art to understand and implement the present invention, the present invention will be described in further detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the implementation examples described here are only used to illustrate and explain the present invention, and are not intended to limit this invention.

本发明实施例中涉及的相关术语解释如下：Related terms involved in the embodiments of the present invention are explained as follows:

1)回放攻击：利用录音设备录制说话人的声音,然后对说话人识别系统播放这段录音,从而使得说话人识别系统判断其为说话人。1) Playback attack: Use recording equipment to record the speaker's voice, and then play this recording to the speaker recognition system, so that the speaker recognition system can judge that it is the speaker.

2)信号频谱：信号各分量的幅度或相位关于频率的函数。2) Signal spectrum: The amplitude or phase of each component of the signal is a function of frequency.

3)线性失真：由电路的线性电抗组件对不同频率的响应不同而引起的幅度或者相位的失真，输出信号中不会有新的频率分量3) Linear distortion: the amplitude or phase distortion caused by the linear reactance components of the circuit responding differently to different frequencies, and there will be no new frequency components in the output signal

4)非线性失真：输出信号中产生新的谐波成分，表现为输出信号与输入信号不成线性关系。4) Non-linear distortion: New harmonic components are generated in the output signal, which shows that the output signal is not linearly related to the input signal.

5)基音：在复音中,频率最低的声音叫做基音，乐音的音调是由基音的频率决定的。5) Pitch: In polyphony, the sound with the lowest frequency is called the pitch, and the pitch of the musical tone is determined by the frequency of the pitch.

本发明是基于语音信号频谱特性的回放攻击检测算法，以扬声器对语音信号造成的线性失真和非线性失真特性为技术原理，提取相应特征向量，并采用SVM进行分类判决，可实现对回放语音实时有效的检测。The present invention is a replay attack detection algorithm based on the spectral characteristics of voice signals. It uses the linear distortion and nonlinear distortion characteristics caused by the loudspeaker to the voice signal as the technical principle, extracts the corresponding feature vectors, and uses SVM for classification and judgment, which can realize real-time playback of voice effective detection.

请见图1，是本发明的算法流程图，参照该图所示，对一段语音的回放攻击检测过程有如下步骤：See also Fig. 1, is the algorithm flowchart of the present invention, with reference to shown in this figure, the playback attack detection process to a section of voice has the following steps:

步骤1：针对待检测语音，首先使用汉明窗对信号进行分帧加窗处理，帧长为70ms，保留其中的浊音帧。Step 1: For the speech to be detected, first use the Hamming window to frame and window the signal, the frame length is 70ms, and retain the voiced sound frame.

步骤2：针对预处理后语音信号中的每一个浊音帧进行特征提取，得到基于语音信号线性失真和非线性失真特性的26维特征向量。Step 2: Perform feature extraction for each voiced sound frame in the preprocessed speech signal to obtain a 26-dimensional feature vector based on the linear distortion and nonlinear distortion characteristics of the speech signal.

如图3所示，回放攻击的过程相比于原始语音引入了麦克风采集，数字压缩和扬声器播放三个环节，其中扬声器对语音信号的影响最为显着，且具有多个性能评价指标，扬声器放音的过程对于语音信号的影响可以分为线性失真和非线性失真两种。As shown in Figure 3, compared with the original voice, the playback attack process introduces three links: microphone acquisition, digital compression, and speaker playback. Among them, the impact of the speaker on the voice signal is the most significant, and it has multiple performance evaluation indicators. The influence of the sound process on the speech signal can be divided into two types: linear distortion and nonlinear distortion.

线性失真是由于电路中存在线性组件，其阻抗随频率的不同而不同，从而导致系统对不同频率的信号分量的放大倍数与延迟时间不同。线性失真会使不同频率信号分量的大小及相对时间关系发生变化，但不会产生输入信号所没有的新的频率成分。Linear distortion is due to the existence of linear components in the circuit, and its impedance varies with frequency, which causes the system to have different amplification factors and delay times for signal components of different frequencies. Linear distortion will change the magnitude and relative time relationship of different frequency signal components, but will not produce new frequency components that the input signal does not have.

如图5所示，上面为原始语音，下面为回放语音，线性失真在扬声器上主要体现为低频部分的衰减现象。如图4所示，由于声辐射和加速度成正比，因此把扬声器纸盆的固有频率设计得低于工作频率，扬声器工作在质量工作区,当Qm＝1时频率响应较为平坦。在这种工作状态下，扬声器会出现明显低频衰减。As shown in Figure 5, the top is the original voice, and the bottom is the playback voice. The linear distortion is mainly reflected in the attenuation of the low frequency part on the speaker. As shown in Figure 4, since the sound radiation is directly proportional to the acceleration, the natural frequency of the speaker cone is designed to be lower than the operating frequency, and the speaker works in the mass working area. When Qm=1, the frequency response is relatively flat. In this working state, the loudspeaker will experience obvious low-frequency attenuation.

非线性失真是由于电路中的非线性组件或进入非线性区域而引起的。非线性失真的主要特征是产生了输入信号所没有的新的频率的成分。可以分为谐波失真和瞬态互调失真。Non-linear distortion is caused by non-linear components in the circuit or entering non-linear regions. The main characteristic of nonlinear distortion is the generation of new frequency components that were not present in the input signal. Can be divided into harmonic distortion and transient intermodulation distortion.

谐波失真指原有频率的各种倍频的有害干扰。如图6所示为一段原始语音信号和相应的回放语音信号，由于放大器不够理想，输出的信号除了包含放大的输入成分之外，还新添了一些原信号的整数倍的频率成分(谐波)，致使输出波形走样。Harmonic distortion refers to the harmful interference of various multiples of the original frequency. As shown in Figure 6, it is a section of original voice signal and corresponding playback voice signal. Since the amplifier is not ideal enough, the output signal not only contains the amplified input component, but also adds some frequency components (harmonics) that are integer multiples of the original signal. ), causing the output waveform to be out of shape.

由于晶体管工作特性不稳定，易受温度等因素影响而产生失真，因此会采用大深度的负反馈。为了减小由深度负反馈所引起的高频振荡，晶体管放大器一般要在前置推动级晶体管的基极和集电极之间加入一个小电容，使高频段的相位稍为滞后，称为滞后价或称分补价。当输入信号含有速度很高的瞬态脉冲时，电容来不及充电，线路是处于没有负反馈状态。由于输入讯号没有和负回输讯号相减，造成讯号过强，这些过强讯号会使放大线路瞬时过载，结果使输出讯号出现削波现象。Due to the unstable working characteristics of the transistor and the distortion caused by factors such as temperature, a deep negative feedback is adopted. In order to reduce the high-frequency oscillation caused by deep negative feedback, the transistor amplifier generally needs to add a small capacitor between the base and collector of the pre-driver transistor, so that the phase of the high-frequency band is slightly delayed, which is called lagging price or Weigh sub-price. When the input signal contains a high-speed transient pulse, the capacitor has no time to charge, and the circuit is in a state of no negative feedback. Because the input signal is not subtracted from the negative feedback input signal, the signal is too strong. These too strong signals will momentarily overload the amplification circuit, resulting in clipping of the output signal.

请见图2，本实施例基于线性失真原理和非线性失真原理特征提取过程如下：Please see Figure 2, the feature extraction process of this embodiment based on the principle of linear distortion and the principle of nonlinear distortion is as follows:

基于线性失真现象提出的特征均是在500Hz范围下进行处理，从而达到更好的区分效果。这里我们提出了低频比、低频方差，低频差分方差，低频拟合和全局低频比五种特征，共计10维向量来描述线性失真中的低频衰减特性。The features proposed based on the linear distortion phenomenon are all processed in the range of 500Hz, so as to achieve better discrimination effect. Here we propose five features: low frequency ratio, low frequency variance, low frequency difference variance, low frequency fitting and global low frequency ratio, with a total of 10 dimensional vectors to describe the low frequency attenuation characteristics in linear distortion.

①低频比(Low Spectral Ratio)①Low Spectral Ratio

回放语音信号在250～350Hz的范围内谱峰分布低于原始语音,而在接近500Hz的范围时又高于原始语音,所以用250～350Hz的特征参数比上400～500Hz特征参数可以最明显的区分两者。The spectral peak distribution of the playback speech signal is lower than the original speech in the range of 250-350Hz, and higher than the original speech in the range of 500Hz, so the characteristic parameters of 250-350Hz can be compared with the characteristic parameters of 400-500Hz. distinguish between the two.

公式1所示,其中X(f)为对每一帧的快速傅里叶变换。Equation 1, where X(f) is the fast Fourier transform for each frame.

②低频方差(Low Spectral Variance)②Low Spectral Variance

低频方差用于描述信号在低频区域的波动情况。首先对500Hz以内的FFT采样点进行统计,在帧长为70ms的情况下,16kHz的采样点总共有1120个,在0～500Hz以内的采样点共计35个；The low-frequency variance is used to describe the fluctuation of the signal in the low-frequency region. First, count the FFT sampling points within 500Hz. In the case of a frame length of 70ms, there are a total of 1120 sampling points at 16kHz, and a total of 35 sampling points within 0-500Hz;

③低频差分方差(Low Spectral Difference Variance)③Low Spectral Difference Variance

一阶差分常用来描述数据的变化程度。这里，通过一阶差分的方差值，来更准确的描述低频部分的数据波动程度。First-order differences are often used to describe the degree of change in data. Here, the variance value of the first-order difference is used to more accurately describe the degree of data fluctuation in the low-frequency part.

④低频曲线拟合(Low Spectral Curve Fit)④Low Spectral Curve Fit

利用6维拟合特征对于0～500Hz的FFT采样点进行拟合。The 6-dimensional fitting feature is used to fit the FFT sampling points from 0 to 500 Hz.

其中x为0～500Hz的FFT采样点，a_i表示拟合的系数；Where x is the FFT sampling point from 0 to 500Hz, and a _i represents the fitting coefficient;

⑤全局低频比(Global Low Spectral Ratio)⑤Global Low Spectral Ratio

此特征的提出是基于现有的频带特征检测算法和扬声器对语音信号的衰减作用,通过对原有算法的改进使其具有广泛适用性。低频比例特征的提取验证了语音信号总体在低频部分衰减的特点。This feature is proposed based on the existing frequency band feature detection algorithm and the attenuation effect of the speaker on the speech signal, and it has wide applicability through the improvement of the original algorithm. The extraction of low-frequency proportional features verifies the attenuation characteristics of the overall speech signal in the low-frequency part.

其中X(f)为每一帧的快速傅氏变换,本实验所使用的音频信号采样频率均为16kHz,衰减部分主要发生在500Hz以下。Among them, X(f) is the fast Fourier transform of each frame. The audio signal sampling frequency used in this experiment is 16kHz, and the attenuation part mainly occurs below 500Hz.

对于非线性失真现象，提取总谐波失真，削波比和音色向量三种特征，共计16维特征向量，用于描述非线性失真中的高频谐波失真和瞬态互调失真现象。①总谐波失真(Total Harmonic Distortion)For nonlinear distortion phenomena, three features, total harmonic distortion, clipping ratio and timbre vector, are extracted, with a total of 16-dimensional feature vectors, which are used to describe high-frequency harmonic distortion and transient intermodulation distortion in nonlinear distortion. ①Total Harmonic Distortion (Total Harmonic Distortion)

此特征的提出是基于扬声器对于语音高频部分的谐波失真现象。各次谐波的方均根值与基波方均根值的比例称为该次谐波的谐波含量。所有谐波的方均根值的方和根与基波方均根值的比例称为总谐波失真The proposal of this feature is based on the harmonic distortion phenomenon of the loudspeaker to the high-frequency part of the voice. The ratio of the root mean square value of each harmonic to the root mean square value of the fundamental is called the harmonic content of the harmonic. The ratio of the root sum of the root mean square values of all harmonics to the root mean square value of the fundamental is called total harmonic distortion

其中X(f)为每一帧的快速傅氏变换。f₁为基音频率，f_i为i倍基音频率。where X(f) is the fast Fourier transform of each frame. f ₁ is the pitch frequency, and f _i is i times the pitch frequency.

②削波比(Clipping Ratio)② Clipping Ratio

将时域谱绝对值的平均值和最大值作比，用来量化由瞬态互调失真带来的削波现象。The average value of the absolute value of the time-domain spectrum is compared with the maximum value to quantify the clipping phenomenon caused by the transient intermodulation distortion.

其中x为时域谱，len为时域谱长度。Where x is the time-domain spectrum, and len is the length of the time-domain spectrum.

③音色向量(Timbre Vector)③Timbre Vector

回放信号与原始信号在谐波上差异明显。音色主要由各个谐波(泛音)的相对大小决定。音色向量可以描述谐波的相对大小关系。The playback signal is significantly different from the original signal in terms of harmonics. Timbre is mainly determined by the relative size of each harmonic (overtone). Timbre vectors can describe the relative size relationship of harmonics.

步骤3：分别对每一个浊音帧提取完特征向量后，将所有的浊音帧的特征向量求平均值，形成26维统计特征向量。Step 3: After extracting the feature vector for each voiced sound frame, average the feature vectors of all the voiced sound frames to form a 26-dimensional statistical feature vector.

步骤4.1：输入训练样本集，训练样本集中的训练音频来自多种设备和多位录制者,并包括回放语音和原始语音；如图2所示，对训练样本集中的所有语音样本提取26维统计特征向量。Step 4.1: Input the training sample set, the training audio in the training sample set comes from multiple devices and multiple recorders, and includes playback voice and original voice; as shown in Figure 2, extract 26-dimensional statistics for all voice samples in the training sample set Feature vector.

步骤4.2：语音的判定问题实际上是二分类问题，所以使用的模型为SVM；在提取出特征向量以后，利用LIBSVM对训练样本集中的特征数据库进行二分类训练。Step 4.2: The speech determination problem is actually a binary classification problem, so the model used is SVM; after extracting the feature vector, use LIBSVM to perform binary classification training on the feature database in the training sample set.

步骤5：将待测语音样本的特征模型与已训练好的语音模型库进行SVM模式匹配，进一步输出判决结果。Step 5: Perform SVM pattern matching on the feature model of the speech sample to be tested and the trained speech model library, and further output the judgment result.

步骤5.1：提取待测语音特征向量；Step 5.1: extracting the speech feature vector to be tested;

步骤5.2：将待测样本特征向量与已有的语音模型库进行模式匹配，得到判决标准，进一步输出判决结果。Step 5.2: Match the feature vector of the sample to be tested with the existing speech model library to obtain the judgment standard, and further output the judgment result.

将待测样本特征向量与已有的语音模型库进行模式匹配，训练过的SVM模型具有区分原始语音和回放语音的分类边界，可以实现对待测样本进行二分类，进一步输出判决结果，判决为回放/原始。The feature vector of the sample to be tested is pattern-matched with the existing speech model library. The trained SVM model has a classification boundary that distinguishes the original speech from the playback speech, which can realize the binary classification of the sample to be tested, and further output the judgment result, and the judgment is playback /original.

为了验证本算法的有效性，设置三个实验来进行测试；In order to verify the effectiveness of the algorithm, three experiments are set up for testing;

实验1：不同年龄段以及不同性别的用户在频率、语调等声音特点方面差异较大，所以对不同用户人群进行分类测试，分别为18岁以下、18—40岁和40岁以上三个年龄段，每个年龄段都分别有男性录制者和女性录制者；不同用户人群分类测试结果请见下表1；Experiment 1: Users of different age groups and different genders have great differences in voice characteristics such as frequency and intonation. Therefore, different user groups are classified into three age groups, namely under 18 years old, 18-40 years old and over 40 years old. , there are male recorders and female recorders in each age group; please see Table 1 below for the test results of different user groups;

表1不同用户人群分类测试结果Table 1 Classification test results of different user groups

测试分组test group Age1(<18)Age1 (<18) Age2(18-40)Age2(18-40) Age3(>40)Age3(>40) 平均average 测试指标Test indicators ARAR ARAR ARAR ARAR 男male 100％100% 99.2054％99.2054% 98.2％98.2% 99.14％99.14% 女Female 97.7941％97.7941% 98.3％98.3% 98.8372％98.8372% 98.32％98.32% 平均average 98.08％98.08% 98.69％98.69% 98.525％98.525% 98.68％98.68%

实验2：不同扬声器的物理结构不同,其扬声器的频响曲线相对不同，针对扬声器的测试可以验证不同主流设备的识别情况，测试设备分别为华为，iPhone，三星，魅族，谷歌nexus；不同扬声器分类测试结果请见下表2；Experiment 2: Different speakers have different physical structures, and their frequency response curves are relatively different. The test for speakers can verify the recognition of different mainstream devices. The test devices are Huawei, iPhone, Samsung, Meizu, and Google nexus; different speakers are classified The test results are shown in Table 2 below;

表2不同扬声器分类测试结果Table 2 Classification test results of different loudspeakers

设备类型Equipment type 样本数量Number of samples FARFAR ARAR iPhone5siPhone5s 172172 8.55％8.55% 91.45％91.45% 华为Huawei 171171 2.34％2.34% 97.66％97.66% NexusNexus 155155 0.65％0.65% 99.35％99.35% 魅族meizu 175175 1.15％1.15% 98.85％98.85% 三星Samsung 254254 3.15％3.15% 96.85％96.85% 平均average 185.4185.4 3.17％3.17% 96.83％96.83%

实验3：文献[1]中的算法是目前提出的较为优秀的回放攻击检测算法，所以将本发明的方法与文献[1]的算法进行对比测试，以验证本算法对于识别率的提升，算法对比测试结果请见下表3；Experiment 3: The algorithm in document [1] is an excellent replay attack detection algorithm currently proposed, so the method of the present invention is compared with the algorithm in document [1] to verify the improvement of the recognition rate of this algorithm. The comparison test results are shown in Table 3 below;

表3算法对比测试Table 3 Algorithm comparison test

实验结果表明，本发明提供的算法对于不同用户人群和不同扬声器设备均具有良好的检测通用性，并且算法的平均识别正确率率高达98％以上，相较于现有算法平均82％的识别率有了显着的提升。Experimental results show that the algorithm provided by the present invention has good detection versatility for different user groups and different loudspeaker devices, and the average recognition accuracy rate of the algorithm is as high as 98%, compared with the average recognition rate of 82% of the existing algorithm There has been a significant improvement.

文献[1]Villalba,Jesús,and Eduardo Lleida."Detecting replay attacksfrom far-field recordings on speaker verification systems."European Workshopon Biometrics and Identity Management.Springer Berlin Heidelberg,2011.Literature [1] Villalba, Jesús, and Eduardo Lleida. "Detecting replay attacks from far-field recordings on speaker verification systems." European Workshop on Biometrics and Identity Management. Springer Berlin Heidelberg, 2011.

应当理解的是，本说明书未详细阐述的部分均属于现有技术。It should be understood that the parts not described in detail in this specification belong to the prior art.

应当理解的是，上述针对较佳实施例的描述较为详细，并不能因此而认为是对本发明专利保护范围的限制，本领域的普通技术人员在本发明的启示下，在不脱离本发明权利要求所保护的范围情况下，还可以做出替换或变形，均落入本发明的保护范围之内，本发明的请求保护范围应以所附权利要求为准。It should be understood that the above-mentioned descriptions for the preferred embodiments are relatively detailed, and should not therefore be considered as limiting the scope of the patent protection of the present invention. Within the scope of protection, replacements or modifications can also be made, all of which fall within the protection scope of the present invention, and the scope of protection of the present invention should be based on the appended claims.

Claims

1. a playback attack detection method based on the speech signal distortion characteristics introduced by loudspeaker, it is characterized in that, comprising the following steps:

Step 1: Preprocessing the speech to be detected, retaining voiced frames;

Step 2: perform feature extraction for each voiced sound frame in the preprocessed speech signal, and obtain a feature vector based on the linear distortion and nonlinear distortion characteristics of the speech signal;

Step 3: the feature vectors of all voiced frames are averaged to form a statistical feature vector to obtain a feature model of the speech to be tested;

Step 4: extract the feature vector of the training speech sample, obtain the training speech feature model, and use the training speech feature model to train the SVM model, and obtain the speech model library;

Step 5: Perform SVM pattern matching between the feature model of the speech to be tested and the trained speech model library, and output the judgment result.

2. the playback attack detection method based on the speech signal distortion characteristic that loudspeaker introduces according to claim 1, it is characterized in that: the speech to be detected described in step 1 is preprocessed, is to use Hamming window to carry out framing adding to speech signal Window processing, the frame length is 70ms, and the voiced frames in it are reserved.

3. the playback attack detection method based on the speech signal distortion characteristic that loudspeaker introduces according to claim 1, is characterized in that: the described step 2 carries out feature extraction for each voiced sound frame in the preprocessed speech signal, is to extract based on 26-dimensional eigenvectors of the linear distortion and nonlinear distortion characteristics of speech signals.

4. the playback attack detection method based on the speech signal distortion characteristic that loudspeaker introduces according to claim 1 or 3, is characterized in that: described extraction is based on speech signal linear distortion feature vector, by low frequency ratio, low frequency variance, low frequency differential variance , low-frequency fitting and global low-frequency ratio five features, a total of 10-dimensional vector composition;

The low frequency ratio Wherein X(f) is the fast Fourier transform of each frame;

The low frequency variance in

The low frequency difference variance in

The low-frequency fitting is to use the 6-dimensional fitting feature to fit the FFT sampling points of 0-500 Hz, and the fitting formula is Where x is the FFT sampling point from 0 to 500Hz, and a _i represents the fitting coefficient;

The global low frequency ratio

5. the playback attack detection method based on the speech signal distortion characteristic that loudspeaker introduces according to claim 1 or 3, is characterized in that: described extraction is based on speech signal non-linear distortion feature vector, comprises total harmonic distortion, clipping ratio and timbre vector, a total of 16 dimensional feature vectors;

The total harmonic distortion in X(f) is the fast Fourier transform of each frame, f ₁ is the pitch frequency, and f _i is i times the pitch frequency;

The clipping ratio in x is the time-domain spectrum, len is the length of the time-domain spectrum;

The Timbre Vector

6. The playback attack detection method based on the distortion characteristics of the speech signal introduced by the loudspeaker according to claim 1, characterized in that: the statistical feature vector in step 3 is a 26-dimensional statistical feature vector.

7. the playback attack detection method based on the speech signal distortion characteristics introduced by loudspeaker according to claim 1, characterized in that: the training speech samples described in step 4, from several devices and several recorders, including playback speech and original speech .

8. the playback attack detection method based on the speech signal distortion characteristic that speaker introduces according to claim 1 or 7, it is characterized in that: in step 4, after extracting training speech sample feature vector, utilize LIBSVM to the feature in training speech sample set The database is trained for binary classification, and the feature database is composed of feature vectors of training speech samples.