SQuId: Measuring Speech Naturalness in Many Languages

Sellam, Thibault; Bapna, Ankur; Camp, Joshua; Mackinnon, Diana; Parikh, Ankur P.; Riesa, Jason

doi:10.1109/ICASSP49357.2023.10094909

arXiv:2210.06324 (cs)

[Submitted on 12 Oct 2022 (v1), last revised 1 Jun 2023 (this version, v2)]


Abstract: Much of text-to-speech research relies on human evaluation, which incurs heavy costs and slows down the development process. The problem is particularly acute in heavily multilingual applications, where recruiting and polling judges can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales, the largest effort of this type to date. The main insight is that training one model on many locales consistently outperforms mono-locale baselines. We present our task and the model, and show that it outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, i.e., locales for which there is no fine-tuning data. Through a series of analyses, we highlight the role of non-linguistic effects, such as sound artifacts, in cross-locale transfer. Finally, we present the effect of our design decisions (e.g., model size, pre-training diversity, and language rebalancing) through several ablation experiments.

Comments:	Accepted at ICASSP 2023, with additional material in the appendix
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2210.06324 [cs.CL]
	(or arXiv:2210.06324v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2210.06324
Related DOI:	https://doi.org/10.1109/ICASSP49357.2023.10094909

Submission history

From: Thibault Sellam
[v1] Wed, 12 Oct 2022 15:43:09 UTC (457 KB)
[v2] Thu, 1 Jun 2023 14:51:00 UTC (415 KB)
