
Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages

Isha Pandey1  Pranav Gaikwad2  Amruta Parulekar1  Ganesh Ramakrishnan1
1 Indian Institute of Technology Bombay, India
2 BharatGen, India
Abstract

High-quality speech generation for low-resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. While some recent generative approaches omit explicit duration modeling, often at the cost of longer training times, we retain and explore this module to better understand its impact in the linguistically rich and data-scarce landscape of India.

We train a non-autoregressive Continuous Normalizing Flow (CNF) Mathieu and Nickel (2020) based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation. Our comparative analysis on speech-infilling tasks reveals nuanced trade-offs: infilling-based predictors improve intelligibility in some languages, while speaker-prompted predictors better preserve speaker characteristics in others. These findings inform the design and selection of duration strategies tailored to specific languages and tasks, underscoring the continued value of interpretable components such as duration prediction in adapting advanced generative architectures to low-resource, multilingual settings.



1 Introduction

Large-scale generative models have made remarkable progress across domains, but achieving high-quality and context-sensitive speech generation for diverse linguistic landscapes remains a challenge, particularly for languages with rich morphology, significant dialectal variation, and phonetic structures that diverge considerably from high-resource languages like English. Indian languages exemplify this complexity. Directly applying architectures developed for high-resource settings to Indian languages is often ineffective due to their intricate prosody, agglutinative or inflectional morphology, and the general lack of standardized orthographies across dialects. Moreover, state-of-the-art speech generation systems are typically trained on massive datasets, often exceeding 60,000 hours, whereas our work investigates these challenges using a much smaller dataset of approximately 2,000 hours or less, reflecting the true low-resource conditions common to many Indian languages.

In this work, we conduct a focused investigation into the role of duration prediction within the speech generation pipeline, aiming to better understand its influence on prosody and speaker-related features in Indian languages. As described in Section 3.2, we employ a Continuous Normalizing Flow (CNF)-based audio model inspired by the Voicebox architecture Mathieu and Nickel (2020); Le et al. (2023) and implement two distinct duration prediction strategies for comparison. Rather than optimizing solely for end-task performance, our goal is to explore how architectural choices, particularly in duration modeling, interact with the linguistic diversity and low-resource realities of Indian speech, offering insights that may guide future model design and adaptation.

Our study centers on adapting the duration prediction mechanism within the Voicebox-inspired framework. In typical systems such as Voicebox (see Section 3.4.1), durations are predicted from text and explicit timing targets obtained from a pre-trained forced alignment model. However, this approach can be unreliable in low-resource contexts like Indian languages, which exhibit rich prosodic variation in rhythm, stress, and intonation that forced alignments may fail to capture accurately. To address this, we adopt a prompt-based duration prediction strategy inspired by recent work in prosody modeling such as PFlow Kim et al. (2023). As discussed in Section 3.4, this approach replaces external alignment inputs with a three-second speaker prompt, which, when processed with cross-attention, enables the model to extract speaker-specific prosodic cues directly from reference audio. This design allows for more contextually appropriate and natural duration predictions and better captures the expressive prosody characteristic of Indian languages. Our adaptation aims to evaluate the feasibility and impact of prompt-based duration modeling in a multilingual, low-resource setting where alignment-dependent techniques often struggle.

For the audio generation component, detailed in Section 3.3, we adopt the non-autoregressive CNF-based modeling approach used in Voicebox. This enables the transformation of a simple latent distribution into the complex distribution of missing speech segments, conditioned on both text and surrounding audio context, i.e., modeling $p(\text{missing data} \mid \text{context})$. Training is performed using the flow-matching objective, which supports efficient and scalable optimization of CNFs via vector field regression. While the audio model architecture closely follows Voicebox, we modify the training strategy to improve learning robustness given the limited and diverse nature of the Indian language data.

We evaluate the two proposed duration prediction strategies through detailed empirical analysis, using the datasets described in Section 4.1. As detailed in Section 5, our experiments focus on speech infilling tasks, including both continuous sentence completion and cross-sentence completion, conducted across multiple Indian languages. To assess performance, we use Word Error Rate (WER) and the Sim-o metric to evaluate intelligibility and similarity to ground-truth speaker characteristics, respectively. Additionally, we conduct human evaluations to complement these objective metrics. The results reveal language-dependent trade-offs: in some cases, prompt-based predictors improve intelligibility, while in others they better preserve speaker-specific prosodic traits. These observations highlight how prosody, language structure, and modeling decisions collectively influence speech generation performance across diverse Indian languages.

2 Related work

Foundations in Large-Scale Generative Speech Models: There have been several advancements in zero-shot text-to-speech (TTS) in recent years. Our work is primarily based on Voicebox Le et al. (2023), a text-guided multilingual speech generation model that uses non-autoregressive flow matching for speech infilling given audio context and text. It can perform a variety of tasks including noise removal, style transfer, and zero-shot TTS. Flow matching Lipman et al. (2023) is a framework for training continuous normalizing flows by regressing a vector field that transports a source distribution to a target data distribution along predetermined probability paths. This simulation-free approach allows for faster training and sampling and better generalization than diffusion paths. Several other architectures exist for text-to-speech synthesis. NaturalSpeech Tan et al. (2022) uses a variational autoencoder for end-to-end text-to-speech modeling with strategies such as phoneme pretraining, differentiable duration modeling, bidirectional prior-posterior modeling, and a memory mechanism to improve speech quality and minimize artifacts. VALL-E Wang et al. (2023) uses a neural codec language model over discrete audio codec tokens to generate high-quality, personalized zero-shot speech by framing TTS as a conditional language modeling task. Tacotron 2 Shen et al. (2018) uses a recurrent sequence-to-sequence feature prediction network to map character embeddings to mel spectrograms and generate high-quality speech from input text. WaveNet van den Oord et al. (2016) is a fully probabilistic, autoregressive deep generative model for natural speech synthesis, in which the predicted distribution of each audio sample is conditioned on all previous samples.

Speech Generation for Low-Resource Languages, Particularly Indian Languages: Despite rapid developments in zero-shot text-to-speech synthesis for high-resource languages, the literature on low-resource languages, and Indian languages in particular, is limited. Panda et al. (2020) highlight that TTS technology covers less than 60% of the official languages of India, while research on the unofficial languages is yet to begin. They also highlight the challenges in designing Indian-language TTS systems arising from the linguistic variation across these languages. IndicTTS Kumar et al. (2023) was the first study to train and evaluate TTS systems based on Transformers Vaswani et al. (2023) and HiFi-GAN Kong et al. (2020) on Indian languages; however, it does not include multi-speaker or speaker-specific TTS. Since then, several new multi-speaker datasets such as IndicVoices-R Sankar et al. (2024) have been released, making it crucial to use these datasets to develop speaker-specific TTS systems for Indian languages.

Innovations in Duration Prediction and Prosody Modeling: Duration prediction is a popular method for converting text to natural-sounding speech. FastSpeech Ren et al. (2019) extracts attention alignments from an encoder-decoder teacher model for phoneme duration prediction; these durations are then used by a length regulator to expand the source phoneme sequence to match the target mel spectrogram length. VITS Kim et al. (2021) proposes a stochastic duration predictor to synthesize speech with diverse rhythms from input text, emphasizing that the same text can be naturally spoken in different ways. Non-Attentive Tacotron Shen et al. (2021) replaces the attention-based alignment mechanism in Tacotron 2 Shen et al. (2018) with an explicit duration predictor, which allows for utterance-wide and per-phoneme duration control at inference time. Beyond direct duration prediction, other strategies have emerged for prosody and style transfer. Style tokens Wang et al. (2018) are embeddings jointly trained with Tacotron Shen et al. (2018) that learn a wide range of expressiveness. RAD-TTS Shih et al. (2021) uses normalizing flows and robust alignment learning, and models speech rhythm as a separate generative distribution to enable improved prosody transfer. For our work, we drew inspiration from these duration modeling techniques to develop an effective duration predictor that works without relying on external forced alignments.

Non-Autoregressive Audio Modeling: Our audio model is based on Voicebox Le et al. (2023), but several other TTS models use non-autoregressive (NAR) modeling, such as FastSpeech Ren et al. (2019) and FastSpeech 2 Ren et al. (2022). HiFi-GAN Kong et al. (2020) is a state-of-the-art NAR vocoder that models the periodic patterns in audio. More recently, flow-based NAR TTS models such as Glow-TTS Kim et al. (2020), which requires no external alignments, have emerged. Finally, EfficientTTS Miao et al. (2020) focuses on efficient NAR speech generation using a new monotonic alignment modeling approach.

3 Methodology

3.1 Background

Flow Matching with Optimal Transport: Continuous Normalizing Flows (CNFs) provide a powerful framework for learning complex data distributions by transforming a simple prior distribution $p_0$ into a target data distribution $p_1$. This transformation is achieved through a time-dependent vector field $v_t: [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$, which constructs a flow $\phi_t$ governed by the ordinary differential equation:

$$\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)); \quad \phi_0(x) = x$$

For a given flow $\phi_t$, we can derive the probability path $p_t(x)$ using the change-of-variables formula:

$$p_t(x) = p_0\!\left(\phi_t^{-1}(x)\right)\left|\det\left(\frac{\partial \phi_t^{-1}}{\partial x}(x)\right)\right|$$

To train the neural network parameters $\theta$ that define our vector field $v_t(x;\theta)$, we employ the Flow Matching objective:

$$\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t,\, p_t(x)} \left\lVert u_t(x) - v_t(x;\theta) \right\rVert^2$$

However, directly computing this objective is challenging as we lack prior knowledge of $p_t$ or $u_t$. To address this, we utilize the Conditional Flow Matching (CFM) objective, which provides a practical training approach:

$$\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t,\, q(x_1),\, p_t(x|x_1)} \left\lVert u_t(x|x_1) - v_t(x;\theta) \right\rVert^2$$

For our implementation, we specifically adopt the optimal transport (OT) path, which defines the conditional probability path and vector field as $p_t(x|x_1) = \mathcal{N}\!\left(x \mid t x_1,\, (1-(1-\sigma_{min})t)^2 I\right)$ and $u_t(x|x_1) = \dfrac{x_1 - (1-\sigma_{min})x}{1-(1-\sigma_{min})t}$.

The OT path is particularly advantageous as it ensures points move with constant speed and direction, leading to more stable training and efficient inference. This choice simplifies the learning process while maintaining the model’s expressive power.
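To make this concrete, the following is a minimal PyTorch sketch of a single flow-matching training step on the OT path. The `model` argument stands in for any vector-field network taking the noisy state and the flow step; it is an assumption for illustration, not the exact implementation used here.

```python
import torch

def ot_cfm_loss(model, x1, sigma_min=1e-4):
    """One conditional flow-matching step on the OT path.

    x1: a batch of target data, shape (B, N, F).
    model: any network mapping (x_t, t) -> predicted vector field of the same shape.
    """
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device).view(b, 1, 1)   # flow step t ~ U[0, 1]
    x0 = torch.randn_like(x1)                            # sample from the prior p0 = N(0, I)

    # OT conditional path: x_t ~ N(t * x1, (1 - (1 - sigma_min) t)^2 I)
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # Corresponding target vector field u_t(x_t | x1) = x1 - (1 - sigma_min) * x0
    u_t = x1 - (1 - sigma_min) * x0

    v_t = model(x_t, t.view(b))                          # predicted vector field
    return ((v_t - u_t) ** 2).mean()
```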

3.2 Problem Formulation and Model Training

Our work adapts the approach from Voicebox Le et al. (2023) to Indian languages. Given speech-transcript pairs $(x, y)$, we train a model for text-guided speech generation through in-context learning. The model learns speech infilling: predicting missing speech segments given surrounding audio and the text transcript. Using a binary mask $m$, we split the audio into missing ($x_{mis}$) and context ($x_{ctx}$) portions, training the model to learn $p(x_{mis} \mid y, x_{ctx})$. Following Voicebox Le et al. (2023), we use an audio model and a duration predictor for fine-grained alignment control.

The system works with audio frames $x = (x_1, \ldots, x_N)$, a character sequence $y = (y_1, \ldots, y_M)$, and per-character durations $l = (l_1, \ldots, l_M)$. The frame-level transcript $z$ is derived by repeating each character $y_j$ according to its duration $l_j$. For any input pair $(x, y)$, we obtain $l$ and $z$ through forced alignment using a speech recognition model. The final prediction combines the audio model $q(x_{mis} \mid z, x_{ctx})$ and the duration model $q(l_{mis} \mid y, l_{ctx})$, where $l_{mis}$ and $l_{ctx}$ represent the masked and unmasked portions of the duration sequence. Additionally, inspired by PFlow Kim et al. (2023), we explored an alternative duration predictor that takes a variable-length unmasked audio portion (minimum 2 seconds) to predict the duration distribution for each character in the text. This approach learns $q(l_{mis} \mid y, x_{ctx})$ directly, eliminating the need for explicit duration context $l_{ctx}$.
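As a small illustration of how the frame-level transcript $z$ is built from $y$ and $l$, the sketch below repeats each character id by its duration; the tensor layout is an assumption for illustration.

```python
import torch

def frame_level_transcript(y: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
    """Expand a character sequence y = (y_1..y_M) into a frame-level
    transcript z by repeating each character id y_j for l_j frames."""
    return torch.repeat_interleave(y, l)

# Example: characters [7, 3, 9] with durations [2, 1, 3] frames
y = torch.tensor([7, 3, 9])
l = torch.tensor([2, 1, 3])
z = frame_level_transcript(y, l)   # tensor([7, 7, 3, 9, 9, 9])
```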

3.3 Audio Model

Following Voicebox Le et al. (2023), we implement a Continuous Normalizing Flow (CNF) model that learns the distribution of missing audio frames given the context. The audio is represented as an 80-dimensional log Mel spectrogram ($x_i \in \mathbb{R}^{80}$). Given a context spectrogram $x_{ctx} \in \mathbb{R}^{N \times F}$, flow state $x_t \in \mathbb{R}^{N \times F}$, character sequence $z \in [K]^N$ (where $K$ is the number of character classes), and time step $t \in [0,1]$, we employ a Transformer to parameterize the vector field $v_t$. The model is trained using a masked version of the flow-matching objective:

$$L_{\text{audio-CFM}}(\theta) = \mathbb{E}_{t,\, m,\, q(x,z),\, p_0(x_0)} \left\lVert u_t(x_t \mid x) - v_t(x_t, x_{ctx}, z; \theta) \right\rVert^2 \qquad (1)$$

where the masked context $x_{ctx}$ is defined as:

$$x_i^{ctx} = \begin{cases} 0 & \text{if } m_i = 1, \\ x_i & \text{if } m_i = 0, \end{cases}$$

and the binary mask $m_i$ identifies the frames to be predicted (masked portion) versus the available context frames (unmasked portion). The input to the Transformer is then constructed by concatenating the three sequences $(x_t, x_{ctx}, z_{emb})$ along the feature dimension:

$$H_c = \text{Proj}([x_t; x_{ctx}; z_{emb}]) \in \mathbb{R}^{N \times D},$$

where $D$ is the Transformer embedding dimension.

To condition on the flow step $t$, a sinusoidal positional encoding maps $t \in [0,1]$ to a vector $h_t \in \mathbb{R}^D$. The final input to the Transformer is $H_c$ concatenated with $h_t$:

$$\widetilde{H}_c \in \mathbb{R}^{(N+1) \times D}.$$

The Transformer outputs $v_t(x_t, x_{ctx}, z; \theta) \in \mathbb{R}^{N \times F}$, which corresponds to the original sequence length $N$.

Unlike Voicebox training, our loss function interpolates between the masked and unmasked portions, applying a weighted sum. Specifically, we assign a weight of 0.9 to the loss computed on the masked frames and a weight of 0.1 to the loss on the unmasked (context) frames. This strategy allows the model to prioritize learning the content of the segments that require prediction while simultaneously being regularized to maintain the integrity and quality of the provided audio context.
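A minimal sketch of this weighted loss is given below. The 0.9/0.1 weights are stated above, but how each region is normalized before weighting is not, so the per-region averaging here is an assumption.

```python
import torch

def weighted_cfm_loss(v_pred, u_target, mask, w_masked=0.9, w_context=0.1):
    """Weighted flow-matching regression loss.

    v_pred, u_target: (B, N, F) predicted and target vector fields.
    mask: (B, N) binary mask, 1 for frames to be infilled, 0 for context frames.
    Masked frames are weighted 0.9, context frames 0.1.
    """
    sq_err = ((v_pred - u_target) ** 2).mean(dim=-1)   # (B, N) per-frame error
    m = mask.float()
    masked_loss = (sq_err * m).sum() / m.sum().clamp(min=1)
    context_loss = (sq_err * (1 - m)).sum() / (1 - m).sum().clamp(min=1)
    return w_masked * masked_loss + w_context * context_loss
```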

3.4 Duration Prediction

The Voicebox model Le et al. (2023) requires accurate character durations for high-quality speech generation. For Indian languages, predicting these durations presents unique challenges due to the rich diversity in regional pronunciations and speaking styles. Through extensive experimentation, we explored different approaches to duration prediction, analyzing their impact on both intelligibility and speaker similarity in the generated speech.

Dataset # Speakers # Lang. Lang. Scope
IndicVoices Javed et al. (2024) 22563 22 Indic only
IndicSuperb Javed et al. (2022) 1218 12 Indic only
IndicTTS Kumar et al. (2023) 44 22 Indic only
Spring R et al. (2023) 7609 10 Indic only
FLEURS Conneau et al. (2022) Not specified 102 International
Mucs Diwan et al. (2021) Not specified 6 Indic only
Shrutilipi Bhogale et al. (2023a) Not specified 12 Indic only
SYSPIN Abhayjeet et al. (2025) 18 9 Indic only
Common Voice Ardila et al. (2020) Not specified 60 International
Vaani Team (2025) 112394 54 Indic only
Dhwani Shah et al. (2025) Not specified 40 Indic only
Bhashini (Multiple datasets) Not specified 19 Indic only
OpenSLR (Multiple datasets) Not specified 25 International
Table 1: Number of Speakers, Languages, and Language Scope in Various Speech Datasets

3.4.1 Voicebox-style Infilling Duration Predictor

We use the duration predictor from Voicebox Le et al. (2023) to predict durations from text and feed them to the audio model. Mirroring the audio model, it models $q(l \mid y, l_{ctx})$ using a conditional vector field, where $l$ is the duration sequence, $l_{ctx}$ the context durations, and $y$ the phonetic transcript. It swaps $(x, x_{ctx}, z)$ from the audio model with $(l, l_{ctx}, y)$. Training is done using a masked version of the Conditional Flow Matching (CFM) loss.

We follow the regression-based duration modeling approach promoted in Voicebox, regressing the masked durations $l_{mis}$ given $l_{ctx}$ and $y$. In our implementation, learnable embeddings of the text characters and the corresponding durations are concatenated along the frame dimension, and this concatenated representation is projected down using a linear layer. To capture local temporal patterns, we introduce a convolutional network before passing the features to the Transformer. The Transformer architecture is similar to that of the audio model but with fewer parameters. Unlike Voicebox Le et al. (2023), we train the duration predictor with an MSE loss over log-scaled durations of the masked phones.
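The masked log-duration regression can be sketched as follows; the tensor shapes and the clamping of zero durations are assumptions made for illustration.

```python
import torch

def masked_log_duration_loss(pred_log_dur, target_dur, dur_mask):
    """MSE over log-scaled durations, restricted to masked characters.

    pred_log_dur: (B, M) predicted log durations.
    target_dur:   (B, M) forced-alignment durations in frames.
    dur_mask:     (B, M) 1 where the duration was masked (to be predicted).
    """
    target_log = torch.log(target_dur.float().clamp(min=1))
    sq_err = (pred_log_dur - target_log) ** 2
    m = dur_mask.float()
    return (sq_err * m).sum() / m.sum().clamp(min=1)
```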

3.4.2 Speaker-Prompted Duration Predictor

Accurately modeling phoneme durations is critical for synthesizing natural-sounding speech, yet it presents a significant challenge for Indian languages. The difficulty is exacerbated by the wide variety of dialects and pronunciations across different regions of the country, which introduces considerable variability in phoneme durations within our training data. To address this high variance and improve the robustness and generalization of duration prediction under data constraints, we developed an enhanced duration predictor. The design is inspired by approaches that leverage speech prompts for conditioning, such as the method explored in PFlow Kim et al. (2023).

Conventional approaches condition duration prediction directly on text and rely on forced-aligned durations as contextual input, which can be unreliable in low-resource settings and often leads to unnatural prosody. We instead treat these alignments as weak supervision during training. Rather than conditioning the model on hard-aligned durations, we use a 3-second mel spectrogram segment $x_p$, randomly sampled from the context mel spectrogram $x_{ctx}$, together with the text sequence $c$ as input. This allows the model to implicitly learn prosodic patterns and speaker-specific durations from a real mel prompt, guided by weak alignment signals, without depending on them explicitly at inference time.

The core of our model is a transformer-based encoder $f_{enc}$ that produces a speaker-conditioned text representation $h_c = f_{enc}(x_p, c)$. The encoder embeds the text sequence $c$ using learnable embeddings and projects the mel spectrogram $x_p$ via a linear layer to match the text embedding dimensionality. Cross-attention is then applied from the text tokens $c$ to the speech frames in $x_p$, enabling the model to fuse prosodic cues into the text representation.

The resulting representation $h_c$ is passed through two feed-forward layers with non-linear activation to predict log-duration values for each token in the input sequence of length $N$. The model is trained using a mean squared error (MSE) loss between the predicted and ground-truth log-scaled durations:

$$\mathcal{L}_{\text{dur}} = \frac{1}{N}\sum_{i=1}^{N}\left(\log \hat{d}_i - \log d_i\right)^2,$$

where $\hat{d}_i$ and $d_i$ are the predicted and target durations for the $i$-th token, respectively.
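A compact sketch of the speaker-prompted predictor is shown below. It collapses the 12-layer prompted text encoder of Section 4.3 into a single cross-attention layer and two feed-forward layers, so the layer counts and dimensions are illustrative assumptions rather than the actual configuration.

```python
import torch
import torch.nn as nn

class SpeakerPromptedDurationPredictor(nn.Module):
    """Minimal sketch: text tokens cross-attend to a 3-second mel prompt,
    then two feed-forward layers predict per-token log durations."""
    def __init__(self, n_chars=256, n_mels=80, d_model=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model)
        self.mel_proj = nn.Linear(n_mels, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, chars, mel_prompt):
        # chars: (B, M) character ids; mel_prompt: (B, T, 80) prompt spectrogram
        h_text = self.char_emb(chars)                 # (B, M, D)
        h_prompt = self.mel_proj(mel_prompt)          # (B, T, D)
        # queries = text tokens, keys/values = speech prompt frames
        h_cond, _ = self.cross_attn(h_text, h_prompt, h_prompt)
        log_dur = self.ffn(h_cond).squeeze(-1)        # (B, M) predicted log durations
        return log_dur
```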

Audios generated by the audio model using these predicted durations were considerably better in both intelligibility and speaker similarity than those generated using durations from an external forced alignment module.

4 Experimental Setup

4.1 Datasets

Several transcribed Indian-language datasets were used for training the model to achieve a balanced representation of urban and rural speakers and recording devices. This created diversity in vocabulary, content, and recording channels, including a mix of read speech, voice commands, extempore discussions, and both wide- and narrow-band recordings. Table 1 details the publicly available datasets used for training the model.

4.2 Data Preprocessing

4.2.1 Text Normalization:

The transcripts, coming from multiple datasets, contained inconsistencies and were normalized to remove punctuation, extraneous spaces, and other spurious characters. Sentences containing words in the Latin script were also removed from the FLEURS and Spring datasets to create homogenized monolingual datasets for each target language.

4.2.2 Audio Filtering:

Some audios were unintelligible, noisy, or had transcription errors. To remove such audios, all audios were transcribed using the IndicWhisper Bhogale et al. (2023b) model, and the Word Error Rate (WER) between these transcripts and the ground-truth transcripts was calculated. All audios with a WER above 0.2 were discarded as low-intelligibility audios. For Hindi, this reduced the dataset from 1.3 million audios to 1 million. From the discarded audios, those with average CTC alignment scores greater than 0.9 were added back to the dataset, adding roughly 30k more audios for Hindi. Table 2 depicts the number of hours of data remaining for every language after filtering.
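A sketch of this filtering step is shown below, assuming a `samples` list with `audio`, `text`, and precomputed `ctc_score` fields and an `asr_transcribe` wrapper around an ASR model such as IndicWhisper; both are hypothetical interfaces used only for illustration.

```python
from jiwer import wer  # pip install jiwer

def filter_audios(samples, asr_transcribe, wer_threshold=0.2, ctc_threshold=0.9):
    """Keep audios whose ASR transcript has WER <= 0.2 against the ground truth,
    then re-admit discarded audios whose average CTC alignment score exceeds 0.9."""
    kept, discarded = [], []
    for s in samples:
        hyp = asr_transcribe(s["audio"])
        if wer(s["text"], hyp) <= wer_threshold:
            kept.append(s)
        else:
            discarded.append(s)
    kept += [s for s in discarded if s.get("ctc_score", 0.0) > ctc_threshold]
    return kept
```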

Language Hours
Hindi 2913
Marathi 1962
Tamil 1190
Telugu 932
Bengali 1224
Table 2: Training dataset sizes in no. of hours after filtering
Table 3: WERs after transcribing generated speech through IndicConformer, where a sentence is 50% masked
Language Dataset Ground Truth durations Infill style durations Pflow style durations
Tamil IndicSuperb 0.2794 0.2806 0.2622
Mucs 0.3438 0.3431 0.3269
overall 0.31160 0.31185 0.29455
Telugu Fluers 0.4832 0.47 0.4938
IndicSuperb 0.3661 0.4057 0.431
Mucs 0.3842 0.3655 0.3894
overall 0.41117 0.41373 0.43807
Bengali Fluers 0.4136 0.4391 0.4103
IndicSuperb 0.1948 0.2133 0.1936
overall 0.3042 0.3262 0.30195
Hindi Fluers 0.1797 0.1915 0.2051
IndicSuperb 0.127 0.1542 0.1574
Mucs 0.2022 0.2334 0.2368
overall 0.16963 0.19303 0.19977
Marathi Fluers 0.3962 0.348 0.3526
IndicSuperb 0.1896 0.1862 0.39
Mucs 0.1214 0.1381 0.351
overall 0.23573 0.22410 0.36453

4.3 Model

Following Voicebox, we employed a 103M-parameter audio model: a Transformer based on the architecture proposed by Vaswani et al. (2023), consisting of 12 layers. Each layer uses multi-head self-attention with 16 heads and a feed-forward network with a hidden dimension of 512. Positional information is incorporated using Rotary Positional Embedding (RoPE) Su et al. (2023) applied within the self-attention mechanism, and we use RMSNorm Zhang and Sennrich (2019) for layer normalization. Additionally, the model incorporates UNet-like skip connections Ronneberger et al. (2015), where outputs from the first half of the layers are concatenated and linearly combined with the inputs of the corresponding layers in the second half.
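The UNet-like skip pattern over the transformer stack can be sketched as follows. The standard PyTorch encoder layer is used as a stand-in, and RoPE, RMSNorm, and the flow-step conditioning are omitted, so this is a structural sketch rather than the actual model.

```python
import torch
import torch.nn as nn

class UNetTransformer(nn.Module):
    """Sketch of UNet-like skip connections: the output of layer i in the first
    half is concatenated with the input of the mirrored layer in the second half
    and linearly recombined before that layer runs."""
    def __init__(self, d_model=512, n_heads=16, n_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512,
                                       batch_first=True)
            for _ in range(n_layers)
        )
        half = n_layers // 2
        self.skip_proj = nn.ModuleList(nn.Linear(2 * d_model, d_model) for _ in range(half))

    def forward(self, x):
        skips = []
        half = len(self.layers) // 2
        for i, layer in enumerate(self.layers):
            if i < half:
                x = layer(x)
                skips.append(x)                       # store first-half outputs
            else:
                skip = skips.pop()                    # mirrored first-half output
                x = self.skip_proj[i - half](torch.cat([x, skip], dim=-1))
                x = layer(x)
        return x
```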

The Voicebox-style duration model uses the same architecture as the audio model, with 16 heads, 512 embedding/FFN dimensions, and 12 layers, totalling 51M parameters.

For the speaker-prompted duration predictor, we employ a speech-prompted text encoder inspired by Kim et al. (2023) and a convolution-based duration predictor. The text encoder processes the input speech prompt and text embeddings, which are first linearly projected to a shared feature space and concatenated. A prenetwork of three convolutional layers processes this combined sequence, and its output is split back into the prompt and text segments. To differentiate modalities and provide positional context, we add positional encodings to each segment, defined as the sum of standard absolute positional encodings and a unique learnable embedding for prompt or text. These processed representations are then fed into a 12-layer Transformer (8 heads, 512 hidden dimension), whose attention mechanism is configured so that text tokens can attend to the speech prompt tokens. The text encoder's output provides a speaker-conditional hidden representation.
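The positional scheme for the prompted encoder, sinusoidal positional encodings summed with a learnable per-modality embedding, can be sketched as follows; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityPositionalEncoding(nn.Module):
    """Sketch: sum of standard sinusoidal positional encodings and a learnable
    per-modality embedding (one vector for prompt frames, one for text tokens)."""
    def __init__(self, d_model=512, max_len=4096):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2)
                        * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.modality_emb = nn.Embedding(2, d_model)   # 0 = speech prompt, 1 = text

    def forward(self, x, modality: int):
        # x: (B, L, D) segment of either prompt frames or text tokens
        L = x.size(1)
        return x + self.pe[:L] + self.modality_emb.weight[modality]
```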

The duration predictor is a shallow convolutional network. It takes the speaker-conditional hidden representation produced by the text encoder (prior to its final linear projection) to determine token durations. This model has 84M parameters in total. All models are trained in FP32.

4.4 Training

The audio models are trained for 750K to 1M updates, depending on the language, with an effective batch size of 256K frames. For training efficiency, audio length is capped at 1,000 frames and randomly chunked if it exceeds this threshold. Duration models are trained for 200K updates with an effective batch size of 200K frames. The AdamW optimizer is used with a peak learning rate of 2e-4, linearly warmed up for 5K steps and linearly decayed over the rest of training. To encourage robustness of the audio model, a probabilistic masking strategy is applied to the input mel representations. With 50% probability, a random percentage r of the sequence frames is masked, where r is sampled from a uniform distribution U[30, 100]. With the remaining 50% probability, either the entire sequence is masked (with 90% probability, i.e., 45% overall) or no masking is applied (with 10% probability, i.e., 5% overall).
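The masking schedule can be sketched as below. Whether the partially masked frames form a single contiguous span is not stated explicitly, so the contiguous-span choice here is an assumption.

```python
import random
import torch

def sample_mask(n_frames: int) -> torch.Tensor:
    """Probabilistic masking: with p=0.5 mask a random r% span, r ~ U[30, 100];
    otherwise mask everything (45% overall) or nothing (5% overall)."""
    mask = torch.zeros(n_frames, dtype=torch.bool)
    u = random.random()
    if u < 0.5:
        r = random.uniform(0.3, 1.0)
        span = max(1, int(r * n_frames))
        start = random.randint(0, n_frames - span)
        mask[start:start + span] = True
    elif u < 0.95:
        mask[:] = True          # mask the entire sequence
    # else: leave the sequence unmasked
    return mask
```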

4.5 Evaluation Metrics

4.5.1 Speaker similarity:

To ensure that the generated audio was in the prompt speaker's voice, we used several metrics to analyze the speaker similarity between the prompt and the generated audio.

  1. Sim-o: The ECAPA-TDNN Desplanques et al. (2020) model, a speaker verification model trained on the VoxCeleb2 dataset Chung et al. (2018), was used to obtain embeddings for the prompt and generated audio. Cosine similarity between these embeddings gives the similarity score; higher Sim-o scores indicate higher speaker similarity between the two audios (see the sketch after this list).

  2. Similarity Mean Opinion Score (SMOS): 30 human evaluators were employed per language to score the similarity between the prompt and generated audio on a scale of 1 to 5, with 1 being completely dissimilar and 5 being exactly the same. We randomly picked 40 audios for scoring, and each audio was annotated by 10 people. Appendix B contains the specific instructions and scoring guidelines given to the annotators for SMOS.
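A sketch of how a Sim-o style score can be computed with a pretrained ECAPA-TDNN from SpeechBrain is shown below; the specific checkpoint and preprocessing are assumptions and may differ from the setup actually used.

```python
import torch
from speechbrain.pretrained import EncoderClassifier  # pip install speechbrain

# Pretrained ECAPA-TDNN speaker-verification model trained on VoxCeleb.
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def sim_o(prompt_wav: torch.Tensor, generated_wav: torch.Tensor) -> float:
    """Cosine similarity between speaker embeddings of two 16 kHz waveforms (1, T)."""
    e1 = encoder.encode_batch(prompt_wav).squeeze()
    e2 = encoder.encode_batch(generated_wav).squeeze()
    return torch.nn.functional.cosine_similarity(e1, e2, dim=0).item()
```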

4.5.2 Intelligibility

To ensure the quality of generated audios, we used several metrics to evaluate their intelligibility.

  1. Word Error Rate (WER): The IndicConformer AI4Bharat (2024) model was used to transcribe the generated audios, and the WER was calculated between these transcriptions and the ground-truth transcriptions. A higher WER indicates a lower-quality generation.

  2. Quality Mean Opinion Score (QMOS): 30 human evaluators were employed to score the quality of the generated audio on a scale of 1 to 5, with 1 being completely unintelligible and 5 being perfectly coherent. We randomly picked 40 audios for scoring, and each audio was annotated by 10 people. Appendix A contains the specific instructions and scoring guidelines given to the annotators for QMOS.

5 Results

The input to our Voicebox-style audio model was a speaker-specific speech prompt and the ground-truth text for the segment to be generated. For this text, durations were obtained in three ways: ground-truth forced-aligned durations, P-Flow-style speaker-prompted durations, and Voicebox-style infill durations. These durations were then used to generate speaker-specific speech with the audio model. To measure intelligibility, WERs were calculated for the generated speech. To measure speaker similarity between the generated and original speech, Sim-o scores between the two audio parts were calculated.

5.1 Continuous Sentence Completion

We used each instance of the test set as a separate testing sample. Each sentence was split into two halves, with the latter half masked. The first half served as the speaker-specific prompt, while the ground-truth text of the masked second half was provided as the text input. Table 3 depicts the dataset-wise WERs of the generated speech for every duration prediction technique. To mitigate the bias introduced by errors from the ASR model used for transcription, we also report WERs of audios generated using ground-truth forced-aligned durations. The P-Flow duration predictor offers gains over infill-style durations for Tamil and Bengali; however, its performance degrades significantly for Marathi. Table 4 depicts the Sim-o scores between the original and generated audios for every duration prediction technique. We observe that the P-Flow duration predictor improves over the infill-style durations for Tamil, Telugu, and Bengali.

Language GT Sim-o Infil Sim-o PFlow Sim-o
Marathi 0.6539 0.6481 0.6393
Tamil 0.6902 0.6833 0.6925
Telugu 0.6178 0.6292 0.633
Bengali 0.6682 0.6573 0.6706
Hindi 0.6512 0.6466 0.6344
Table 4: Similarity scores across languages for continuous sentence completion of GT, Infil, and PFlow systems

5.2 Human Evaluations

Human evaluations were used to obtain SMOS and QMOS scores. Detailed human annotation instructions are in Appendices A and B. Table 5 contains SMOS and QMOS scores, which show that the speaker-prompted duration predictor performs better than the infilling duration predictor for Hindi and Tamil.

Lang. Score GT Infill Pflow
Hindi QMOS Naturalness 4.3449 3.9742 4.3349
Intelligibility 4.4646 4.2397 4.1313
SMOS Similarity 3.9867 3.57 4.1356
Overall 4.2654 3.9280 4.2006
Tamil QMOS Naturalness 4.4277 4.4792 4.4856
Intelligibility 4.4378 4.4682 4.6063
SMOS Similarity 4.5343 4.4121 4.5619
Overall 4.4666 4.4532 4.5513
Bengali QMOS Naturalness 4.1688 4.2678 4.1125
Intelligibility 4.4538 4.4748 4.3788
SMOS Similarity 3.8065 3.9685 3.9176
Overall 4.1430 4.2370 4.1363
Table 5: QMOS and SMOS scores for Hindi, Tamil and Bengali outputs.

6 Conclusion

We have analyzed different techniques for obtaining speaker-specific durations for zero-shot speaker-specific TTS. We compared a Voicebox-style infilling duration predictor and a P-Flow-style speaker-prompted duration predictor. The speaker-prompted duration predictor offered considerable benefits over the infilling duration predictor for Tamil, in both intelligibility and speaker similarity. Additionally, as a general trend, the speaker-prompted duration predictor led to better speaker similarity, while the infilling duration predictor led to better intelligibility. This trade-off gives us insight into the importance of the duration prediction process and how it affects not only the intelligibility but also the speaker similarity of generated audios. We can leverage this knowledge to choose a suitable duration predictor based on the use case, and to develop a new duration predictor that combines properties of both speaker-prompted and infilling-style duration predictors to improve both intelligibility and speaker similarity.

References

Appendix A: QMOS guidelines

6.1 Naturalness Evaluation

Definition: Naturalness refers to how lifelike and fluid the synthesized speech sounds.

Score Description
5 Completely natural, indistinguishable from a human speaker.
4 Mostly natural, but with slight unnatural elements.
3 Moderately natural, with noticeable synthetic artifacts or monotony.
2 Mostly unnatural, robotic or artificial-sounding.
1 Completely unnatural, heavily robotic, or difficult to listen to.
Table 6: Description of naturalness scores.

6.2 Intelligibility Evaluation

Definition: Intelligibility measures how easily the speech can be understood, regardless of how natural it sounds. It focuses on clarity and accuracy of pronunciation.

Score Description
5 Perfectly clear; every word is easily understood.
4 Mostly clear, but with minor pronunciation errors or distortions.
3 Somewhat clear; requires some effort to understand certain words.
2 Mostly unclear; many words are difficult to recognize.
1 Completely unintelligible; nearly impossible to understand.
Table 7: Description of intelligibility scores.

6.3 Guidelines

  1. Listen to each audio sample carefully. Replay if necessary.

  2. Rate the sample separately for Naturalness, Intelligibility, and Speaker Similarity on a scale from 1 to 5.

  3. Avoid bias by focusing on the specific criteria, not personal preference.

  4. Explain the score clearly. Justify the score by describing key factors such as errors, inconsistencies, or deviations from the expected standard.

6.4 Important guidelines

  • Use headphones for better sound quality.

  • Ensure a quiet environment to avoid distractions.

  • Rate objectively without comparing different speech styles or accents.

  • Do not assume meaning—rate based on what you actually hear.

Thank you for your contribution!

Appendix B: SMOS guidelines

6.5 Speaker Similarity Evaluation

Definition: Speaker similarity refers to how much the synthesized voice resembles the reference speaker’s voice in terms of timbre, pitch, and prosody. Ignore other factors like intelligibility and only focus on speaker similarity.

Score Description
5 Indistinguishable from the reference speaker.
4 Very similar, but with minor differences.
3 Moderately similar, but noticeable variations.
2 Weak similarity, with clear differences in voice identity.
1 Completely different from the reference speaker.

6.6 Guidelines

  1. Listen to each audio sample carefully. Replay if necessary.

  2. Rate the sample separately for Naturalness, Intelligibility, and Speaker Similarity on a scale from 1 to 5.

  3. Avoid bias by focusing on the specific criteria, not personal preference.

  4. Explain the score clearly. Justify the score by describing key factors such as errors, inconsistencies, or deviations from the expected standard.

6.7 Important guidelines

  • Use headphones for better sound quality.

  • Ensure a quiet environment to avoid distractions.

  • Rate objectively without comparing different speech styles or accents.

  • Do not assume meaning—rate based on what you actually hear.

Appendix C: Test set creation

We use the Vistaar test set for all our benchmarking. From this set, we select audio clips ranging from 3 to 10 seconds in duration. Test sets yielding fewer than 100 instances after this filtering are excluded. Table 8 lists the datasets and the number of samples used for benchmarking after filtering.

Table 8: Number of samples per dataset for each language.
Language Dataset Samples
Tamil IndicSuperb 2598
Mucs 1137
Total 3738
Telugu Fluers 374
IndicSuperb 2892
Mucs 1166
Total 4432
Bengali Fluers 787
IndicSuperb 2898
Total 3698
Hindi Fluers 360
IndicSuperb 1876
Mucs 917
Total 3170
Marathi Fluers 825
IndicSuperb 1006
Mucs 316
Total 2153
