
Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages

Isha Pandey1  Pranav Gaikwad2  Amruta Parulekar1  Ganesh Ramakrishnan1
1 Indian Institute of Technology Bombay, India
2 BharatGen, India
Abstract

High-quality speech generation for low-resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. While some recent generative approaches omit explicit duration modeling, often at the cost of longer training times, we retain and explore this module to better understand its impact in the linguistically rich and data-scarce landscape of India.

We train a non-autoregressive Continuous Normalizing Flow (CNF) Mathieu and Nickel (2020) based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation. Our comparative analysis on speech-infilling tasks reveals nuanced trade-offs: infilling-based predictors improve intelligibility in some languages, while speaker-prompted predictors better preserve speaker characteristics in others. These findings inform the design and selection of duration strategies tailored to specific languages and tasks, underscoring the continued value of interpretable components such as duration prediction in adapting advanced generative architectures to low-resource, multilingual settings.



1 Introduction

Large-scale generative models have made remarkable progress across domains, but achieving high-quality and context-sensitive speech generation for diverse linguistic landscapes remains a challenge, particularly for languages with rich morphology, significant dialectal variation, and phonetic structures that diverge considerably from high-resource languages like English. Indian languages exemplify this complexity. Directly applying architectures developed for high-resource settings to Indian languages is often ineffective due to their intricate prosody, agglutinative or inflectional morphology, and the general lack of standardized orthographies across dialects. Moreover, state-of-the-art speech generation systems are typically trained on massive datasets, often exceeding 60,000 hours, whereas our work investigates these challenges using a much smaller dataset of approximately 2,000 hours or less, reflecting the true low-resource conditions common to many Indian languages.

In this work, we conduct a focused investigation into the role of duration prediction within the speech generation pipeline, aiming to better understand its influence on prosody and speaker-related features in Indian languages. As described in Section 3.2, we employ a Continuous Normalizing Flow (CNF)-based audio model inspired by the Voicebox architecture Mathieu and Nickel (2020); Le et al. (2023) and implement two distinct duration prediction strategies for comparison. Rather than optimizing solely for end-task performance, our goal is to explore how architectural choices, particularly in duration modeling, interact with the linguistic diversity and low-resource realities of Indian speech, offering insights that may guide future model design and adaptation.

Our study centers on adapting the duration prediction mechanism within the Voicebox-inspired framework. In typical systems such as Voicebox (see Section 3.4.1), durations are predicted from text and explicit timing targets obtained from a pre-trained forced alignment model. However, this approach can be unreliable in low-resource contexts like Indian languages, which exhibit rich prosodic variation in rhythm, stress, and intonation that forced alignments may fail to capture accurately. To address this, we adopt a prompt-based duration prediction strategy inspired by recent work in prosody modeling such as PFlow Kim et al. (2023). As discussed in Section 3.4, this approach replaces external alignment inputs with a three-second speaker prompt, which, when processed with cross-attention, enables the model to extract speaker-specific prosodic cues directly from reference audio. This design allows for more contextually appropriate and natural duration predictions and better captures the expressive prosody characteristic of Indian languages. Our adaptation aims to evaluate the feasibility and impact of prompt-based duration modeling in a multilingual, low-resource setting where alignment-dependent techniques often struggle.

For the audio generation component, detailed in Section 3.3, we adopt the non-autoregressive CNF-based modeling approach used in Voicebox. This enables the transformation of a simple latent distribution into the complex distribution of missing speech segments, conditioned on both text and surrounding audio context, i.e., modeling $p(\text{missing data} \mid \text{context})$. Training is performed using the flow-matching objective, which supports efficient and scalable optimization of CNFs via vector field regression. While the audio model architecture closely follows Voicebox, we modify the training strategy to improve learning robustness given the limited and diverse nature of the Indian language data.

We evaluate the two proposed duration prediction strategies through detailed empirical analysis, using the datasets described in Section 4.1. As detailed in Section 5, our experiments focus on speech infilling tasks, including both continuous sentence completion and cross-sentence completion, conducted across multiple Indian languages. To assess performance, we use Word Error Rate (WER) and the Sim-o metric to evaluate intelligibility and similarity to ground-truth speaker characteristics, respectively. Additionally, we conduct human evaluations to complement these objective metrics. The results reveal language-dependent trade-offs: in some cases, prompt-based predictors improve intelligibility, while in others they better preserve speaker-specific prosodic traits. These observations highlight how prosody, language structure, and modeling decisions collectively influence speech generation performance across diverse Indian languages.

2 Related work

Foundations in Large-Scale Generative Speech Models: There have been several advancements in zero-shot text-to-speech (TTS) in recent years. Our work is primarily based on Voicebox Le et al. (2023), a text-guided multilingual speech generation model that uses non-autoregressive flow matching for speech infilling given audio context and text. It can perform a variety of tasks including noise removal, style transfer, and zero-shot TTS. Flow matching Lipman et al. (2023) is a framework for training continuous normalizing flows by regressing a vector field that transports a source distribution to a target data distribution along predetermined probability paths. This simulation-free approach allows for faster training and sampling and better generalization than diffusion paths. Several other architectures exist for text-to-speech synthesis. NaturalSpeech Tan et al. (2022) uses a variational autoencoder for end-to-end text-to-speech modeling with strategies such as phoneme pretraining, differentiable duration modeling, bidirectional prior-posterior modeling, and a memory mechanism to improve speech quality and minimize artifacts. VALL-E Wang et al. (2023) uses a neural codec language model over discrete audio codec tokens to generate high-quality, personalized zero-shot speech by framing TTS as a conditional language modeling task. Tacotron 2 Shen et al. (2018) uses a recurrent sequence-to-sequence feature prediction network to map character embeddings to mel spectrograms and generate high-quality speech from input text. WaveNet van den Oord et al. (2016) is a fully probabilistic, autoregressive deep generative model for natural speech synthesis, in which the predicted distribution of each audio sample is conditioned on all previous samples.

Speech Generation for Low-Resource Languages, Particularly Indian Languages: Despite rapid developments in zero-shot text-to-speech synthesis for high-resource languages, the literature on low-resource languages, and Indian languages in particular, is limited. Panda et al. (2020) highlight that TTS technology covers less than 60% of the official languages of India, while research on the unofficial languages is yet to begin. They also highlight the challenges in designing Indian-language TTS systems arising from the linguistic variation across these languages. IndicTTS Kumar et al. (2023) was the first study to train and evaluate TTS systems based on Transformers Vaswani et al. (2023) and HiFi-GAN Kong et al. (2020) on Indian languages; however, it does not include multi-speaker or speaker-specific TTS. Since then, several new multi-speaker datasets such as IndicVoices-R Sankar et al. (2024) have been released, making it crucial to use these datasets to develop speaker-specific TTS systems for Indian languages.

Innovations in Duration Prediction and Prosody Modeling: Duration prediction is a popular method for converting text to natural-sounding speech. FastSpeech Ren et al. (2019) extracts attention alignments from an encoder-decoder teacher model for phoneme duration prediction; these durations are then used by a length regulator to expand the source phoneme sequence to match the target mel spectrogram length. VITS Kim et al. (2021) proposes a stochastic duration predictor to synthesize speech with diverse rhythms from input text, emphasizing that the same text can be naturally spoken in different ways. Non-Attentive Tacotron Shen et al. (2021) replaces the attention-based alignment mechanism in Tacotron 2 Shen et al. (2018) with an explicit duration predictor, which allows for utterance-wide and per-phoneme duration control at inference time. Beyond direct duration prediction, other strategies have emerged for prosody and style transfer. Style tokens Wang et al. (2018) are embeddings jointly trained with Tacotron Shen et al. (2018) that learn a wide range of expressiveness. RAD-TTS Shih et al. (2021) uses normalizing flows and robust alignment learning, and models speech rhythm as a separate generative distribution to enable improved prosody transfer. For our work, we drew inspiration from these duration modeling techniques to develop an effective duration predictor that works without relying on external forced alignments.

Non-Autoregressive Audio Modeling: Our audio model is based on Voicebox Le et al. (2023), but several other TTS models use non-autoregressive (NAR) modeling, such as FastSpeech Ren et al. (2019) and FastSpeech 2 Ren et al. (2022). HiFi-GAN Kong et al. (2020) is a state-of-the-art NAR vocoder that models the periodic patterns in audio. More recently, flow-based NAR TTS models such as Glow-TTS Kim et al. (2020), which requires no external alignments, have emerged. Finally, EfficientTTS Miao et al. (2020) focuses on efficient NAR speech generation using a new monotonic alignment modeling approach.

3 Methodology

3.1 Background

Flow Matching with Optimal Transport: Continuous Normalizing Flows (CNFs) provide a powerful framework for learning complex data distributions by transforming a simple prior distribution $p_0$ into a target data distribution $p_1$. This transformation is achieved through a time-dependent vector field $v_t: [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$, which constructs a flow $\phi_t$ governed by the ordinary differential equation:

$$\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)); \quad \phi_0(x) = x$$

For a given flow $\phi_t$, we can derive the probability path $p_t(x)$ using the change-of-variables formula:

$$p_t(x) = p_0\!\left(\phi_t^{-1}(x)\right)\left|\det\left(\frac{\partial \phi_t^{-1}}{\partial x}(x)\right)\right|$$

To train the neural network parameters $\theta$ that define our vector field $v_t(x;\theta)$, we employ the Flow Matching objective:

$$\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t,\, p_t(x)} \left\lVert u_t(x) - v_t(x;\theta) \right\rVert^2$$

However, directly computing this objective is challenging as we lack prior knowledge of $p_t$ or $u_t$. To address this, we utilize the Conditional Flow Matching (CFM) objective, which provides a practical training approach:

$$\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t,\, q(x_1),\, p_t(x|x_1)} \left\lVert u_t(x|x_1) - v_t(x;\theta) \right\rVert^2$$

For our implementation, we specifically adopt the optimal transport (OT) path, which defines the conditional probability path and vector field as $p_t(x|x_1) = \mathcal{N}\!\left(x \mid t x_1,\, (1-(1-\sigma_{min})t)^2 I\right)$ and $u_t(x|x_1) = \dfrac{x_1 - (1-\sigma_{min})x}{1-(1-\sigma_{min})t}$.

The OT path is particularly advantageous as it ensures points move with constant speed and direction, leading to more stable training and efficient inference. This choice simplifies the learning process while maintaining the model’s expressive power.
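To make this concrete, the following is a minimal PyTorch sketch of a single flow-matching training step on the OT path. The `model` argument stands in for any vector-field network taking the noisy state and the flow step; it is an assumption for illustration, not the exact implementation used here.

```python
import torch

def ot_cfm_loss(model, x1, sigma_min=1e-4):
    """One conditional flow-matching step on the OT path.

    x1: a batch of target data, shape (B, N, F).
    model: any network mapping (x_t, t) -> predicted vector field of the same shape.
    """
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device).view(b, 1, 1)   # flow step t ~ U[0, 1]
    x0 = torch.randn_like(x1)                            # sample from the prior p0 = N(0, I)

    # OT conditional path: x_t ~ N(t * x1, (1 - (1 - sigma_min) t)^2 I)
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # Corresponding target vector field u_t(x_t | x1) = x1 - (1 - sigma_min) * x0
    u_t = x1 - (1 - sigma_min) * x0

    v_t = model(x_t, t.view(b))                          # predicted vector field
    return ((v_t - u_t) ** 2).mean()
```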

3.2 Problem Formulation and Model Training

Our work adapts the approach from Voicebox Le et al. (2023) to Indian languages. Given speech-transcript pairs $(x, y)$, we train a model for text-guided speech generation through in-context learning. The model learns speech infilling: predicting missing speech segments given surrounding audio and the text transcript. Using a binary mask $m$, we split the audio into missing ($x_{mis}$) and context ($x_{ctx}$) portions, training the model to learn $p(x_{mis} \mid y, x_{ctx})$. Following Voicebox Le et al. (2023), we use an audio model and a duration predictor for fine-grained alignment control.

The system works with audio frames $x = (x_1, \ldots, x_N)$, a character sequence $y = (y_1, \ldots, y_M)$, and per-character durations $l = (l_1, \ldots, l_M)$. The frame-level transcript $z$ is derived by repeating each character $y_j$ according to its duration $l_j$. For any input pair $(x, y)$, we obtain $l$ and $z$ through forced alignment using a speech recognition model. The final prediction combines the audio model $q(x_{mis} \mid z, x_{ctx})$ and the duration model $q(l_{mis} \mid y, l_{ctx})$, where $l_{mis}$ and $l_{ctx}$ represent the masked and unmasked portions of the duration sequence. Additionally, inspired by PFlow Kim et al. (2023), we explored an alternative duration predictor that takes a variable-length unmasked audio portion (minimum 2 seconds) to predict the duration distribution for each character in the text. This approach learns $q(l_{mis} \mid y, x_{ctx})$ directly, eliminating the need for explicit duration context $l_{ctx}$.
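As a small illustration of how the frame-level transcript $z$ is built from $y$ and $l$, the sketch below repeats each character id by its duration; the tensor layout is an assumption for illustration.

```python
import torch

def frame_level_transcript(y: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
    """Expand a character sequence y = (y_1..y_M) into a frame-level
    transcript z by repeating each character id y_j for l_j frames."""
    return torch.repeat_interleave(y, l)

# Example: characters [7, 3, 9] with durations [2, 1, 3] frames
y = torch.tensor([7, 3, 9])
l = torch.tensor([2, 1, 3])
z = frame_level_transcript(y, l)   # tensor([7, 7, 3, 9, 9, 9])
```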

3.3 Audio Model

Following Voicebox Le et al. (2023), we implement a Continuous Normalizing Flow (CNF) model that learns the distribution of missing audio frames given the context. The audio is represented as an 80-dimensional log Mel spectrogram ($x_i \in \mathbb{R}^{80}$). Given a context spectrogram $x_{ctx} \in \mathbb{R}^{N \times F}$, flow state $x_t \in \mathbb{R}^{N \times F}$, character sequence $z \in [K]^N$ (where $K$ is the number of character classes), and time step $t \in [0,1]$, we employ a Transformer to parameterize the vector field $v_t$. The model is trained using a masked version of the flow-matching objective:

$$L_{\text{audio-CFM}}(\theta) = \mathbb{E}_{t,\, m,\, q(x,z),\, p_0(x_0)} \left\lVert u_t(x_t \mid x) - v_t(x_t, x_{ctx}, z; \theta) \right\rVert^2 \qquad (1)$$

where the masked context $x_{ctx}$ is defined as:

$$x_i^{ctx} = \begin{cases} 0 & \text{if } m_i = 1, \\ x_i & \text{if } m_i = 0, \end{cases}$$

and the binary mask $m_i$ identifies the frames to be predicted (masked portion) versus the available context frames (unmasked portion). The input to the Transformer is then constructed by concatenating the three sequences $(x_t, x_{ctx}, z_{emb})$ along the feature dimension:

$$H_c = \text{Proj}([x_t; x_{ctx}; z_{emb}]) \in \mathbb{R}^{N \times D},$$

where $D$ is the Transformer embedding dimension.

To condition on the flow step $t$, a sinusoidal positional encoding maps $t \in [0,1]$ to a vector $h_t \in \mathbb{R}^D$. The final input to the Transformer is $H_c$ concatenated with $h_t$:

$$\widetilde{H}_c \in \mathbb{R}^{(N+1) \times D}.$$

The Transformer outputs $v_t(x_t, x_{ctx}, z; \theta) \in \mathbb{R}^{N \times F}$, which corresponds to the original sequence length $N$.

Unlike Voicebox training, our loss function interpolates between the masked and unmasked portions, applying a weighted sum. Specifically, we assign a weight of 0.9 to the loss computed on the masked frames and a weight of 0.1 to the loss on the unmasked (context) frames. This strategy allows the model to prioritize learning the content of the segments that require prediction while simultaneously being regularized to maintain the integrity and quality of the provided audio context.
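A minimal sketch of this weighted loss is given below. The 0.9/0.1 weights are stated above, but how each region is normalized before weighting is not, so the per-region averaging here is an assumption.

```python
import torch

def weighted_cfm_loss(v_pred, u_target, mask, w_masked=0.9, w_context=0.1):
    """Weighted flow-matching regression loss.

    v_pred, u_target: (B, N, F) predicted and target vector fields.
    mask: (B, N) binary mask, 1 for frames to be infilled, 0 for context frames.
    Masked frames are weighted 0.9, context frames 0.1.
    """
    sq_err = ((v_pred - u_target) ** 2).mean(dim=-1)   # (B, N) per-frame error
    m = mask.float()
    masked_loss = (sq_err * m).sum() / m.sum().clamp(min=1)
    context_loss = (sq_err * (1 - m)).sum() / (1 - m).sum().clamp(min=1)
    return w_masked * masked_loss + w_context * context_loss
```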

3.4 Duration Prediction

The Voicebox model Le et al. (2023) requires accurate character durations for high-quality speech generation. For Indian languages, predicting these durations presents unique challenges due to the rich diversity in regional pronunciations and speaking styles. Through extensive experimentation, we explored different approaches to duration prediction, analyzing their impact on both intelligibility and speaker similarity in the generated speech.

Dataset # Speakers # Lang. Lang. Scope
IndicVoices Javed et al. (2024) 22563 22 Indic only
IndicSuperb Javed et al. (2022) 1218 12 Indic only
IndicTTS Kumar et al. (2023) 44 22 Indic only
Spring R et al. (2023) 7609 10 Indic only
FLEURS Conneau et al. (2022) Not specified 102 International
Mucs Diwan et al. (2021) Not specified 6 Indic only
Shrutilipi Bhogale et al. (2023a) Not specified 12 Indic only
SYSPIN Abhayjeet et al. (2025) 18 9 Indic only
Common Voice Ardila et al. (2020) Not specified 60 International
Vaani Team (2025) 112394 54 Indic only
Dhwani Shah et al. (2025) Not specified 40 Indic only
Bhashini (Multiple datasets) Not specified 19 Indic only
OpenSLR (Multiple datasets) Not specified 25 International
Table 1: Number of Speakers, Languages, and Language Scope in Various Speech Datasets

3.4.1 Voicebox-style Infilling Duration Predictor

We use the duration predictor from Voicebox Le et al. (2023) to predict durations from text and feed them to the audio model. Mirroring the audio model, it models $q(l \mid y, l_{ctx})$ using a conditional vector field, where $l$ is the duration sequence, $l_{ctx}$ the context durations, and $y$ the phonetic transcript. It swaps $(x, x_{ctx}, z)$ from the audio model with $(l, l_{ctx}, y)$. Training is done using a masked version of the Conditional Flow Matching (CFM) loss.

We follow the regression-based duration modeling approach promoted in Voicebox, regressing the masked durations $l_{mis}$ given $l_{ctx}$ and $y$. In our implementation, learnable embeddings of the text characters and the corresponding durations are concatenated along the frame dimension, and this concatenated representation is projected down using a linear layer. To capture local temporal patterns, we introduce a convolutional network before passing the features to the Transformer. The Transformer architecture is similar to that of the audio model but with fewer parameters. Unlike Voicebox Le et al. (2023), we train the duration predictor with an MSE loss over log-scaled durations of the masked phones.
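The masked log-duration regression can be sketched as follows; the tensor shapes and the clamping of zero durations are assumptions made for illustration.

```python
import torch

def masked_log_duration_loss(pred_log_dur, target_dur, dur_mask):
    """MSE over log-scaled durations, restricted to masked characters.

    pred_log_dur: (B, M) predicted log durations.
    target_dur:   (B, M) forced-alignment durations in frames.
    dur_mask:     (B, M) 1 where the duration was masked (to be predicted).
    """
    target_log = torch.log(target_dur.float().clamp(min=1))
    sq_err = (pred_log_dur - target_log) ** 2
    m = dur_mask.float()
    return (sq_err * m).sum() / m.sum().clamp(min=1)
```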

3.4.2 Speaker-Prompted Duration Predictor

Accurately modeling phoneme durations is critical for synthesizing natural-sounding speech, yet it presents a significant challenge for Indian languages. The difficulty is exacerbated by the wide variety of dialects and pronunciations across different regions of the country, which introduces considerable variability in phoneme durations within our training data. To address this high variance and improve the robustness and generalization of duration prediction under data constraints, we developed an enhanced duration predictor. The design is inspired by approaches that leverage speech prompts for conditioning, such as the method explored in PFlow Kim et al. (2023).

Conventional approaches condition duration prediction directly on text and rely on forced-aligned durations as contextual input, which can be unreliable in low-resource settings and often leads to unnatural prosody. We instead treat these alignments as weak supervision during training. Rather than conditioning the model on hard-aligned durations, we use a 3-second mel spectrogram segment $x_p$, randomly sampled from the context mel spectrogram $x_{ctx}$, together with the text sequence $c$ as input. This allows the model to implicitly learn prosodic patterns and speaker-specific durations from a real mel prompt, guided by weak alignment signals, without depending on them explicitly at inference time.

The core of our model is a transformer-based encoder $f_{enc}$ that produces a speaker-conditioned text representation $h_c = f_{enc}(x_p, c)$. The encoder embeds the text sequence $c$ using learnable embeddings and projects the mel spectrogram $x_p$ via a linear layer to match the text embedding dimensionality. Cross-attention is then applied from the text tokens $c$ to the speech frames in $x_p$, enabling the model to fuse prosodic cues into the text representation.

The resulting representation $h_c$ is passed through two feed-forward layers with non-linear activation to predict log-duration values for each token in the input sequence of length $N$. The model is trained using a mean squared error (MSE) loss between the predicted and ground-truth log-scaled durations:

$$\mathcal{L}_{\text{dur}} = \frac{1}{N}\sum_{i=1}^{N}\left(\log \hat{d}_i - \log d_i\right)^2,$$

where $\hat{d}_i$ and $d_i$ are the predicted and target durations for the $i$-th token, respectively.
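A compact sketch of the speaker-prompted predictor is shown below. It collapses the 12-layer prompted text encoder of Section 4.3 into a single cross-attention layer and two feed-forward layers, so the layer counts and dimensions are illustrative assumptions rather than the actual configuration.

```python
import torch
import torch.nn as nn

class SpeakerPromptedDurationPredictor(nn.Module):
    """Minimal sketch: text tokens cross-attend to a 3-second mel prompt,
    then two feed-forward layers predict per-token log durations."""
    def __init__(self, n_chars=256, n_mels=80, d_model=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model)
        self.mel_proj = nn.Linear(n_mels, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, chars, mel_prompt):
        # chars: (B, M) character ids; mel_prompt: (B, T, 80) prompt spectrogram
        h_text = self.char_emb(chars)                 # (B, M, D)
        h_prompt = self.mel_proj(mel_prompt)          # (B, T, D)
        # queries = text tokens, keys/values = speech prompt frames
        h_cond, _ = self.cross_attn(h_text, h_prompt, h_prompt)
        log_dur = self.ffn(h_cond).squeeze(-1)        # (B, M) predicted log durations
        return log_dur
```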

Audios generated by the audio model using these predicted durations were considerably better in both intelligibility and speaker similarity than those generated using durations from an external forced alignment module.

4 Experimental Setup

4.1 Datasets

Several transcribed Indian-language datasets were used for training the model to achieve a balanced representation of urban and rural speakers and recording devices. This created diversity in vocabulary, content, and recording channels, including a mix of read speech, voice commands, extempore discussions, and both wide- and narrow-band recordings. Table 1 details the publicly available datasets used for training the model.

4.2 Data Preprocessing

4.2.1 Text Normalization:

The transcripts, coming from multiple datasets, contained inconsistencies and were normalized to remove punctuation, extraneous spaces, and other spurious characters. Sentences containing words in the Latin script were also removed from the FLEURS and Spring datasets to create homogenized monolingual datasets for each target language.

4.2.2 Audio Filtering:

Some audios were unintelligible, noisy, or had transcription errors. To remove such audios, all audios were transcribed using the IndicWhisper Bhogale et al. (2023b) model, and the Word Error Rate (WER) between these transcripts and the ground-truth transcripts was calculated. All audios with a WER above 0.2 were discarded as low-intelligibility audios. For Hindi, this reduced the dataset from 1.3 million audios to 1 million. From the discarded audios, those with average CTC alignment scores greater than 0.9 were added back to the dataset, adding roughly 30k more audios for Hindi. Table 2 depicts the number of hours of data remaining for every language after filtering.
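A sketch of this filtering step is shown below, assuming a `samples` list with `audio`, `text`, and precomputed `ctc_score` fields and an `asr_transcribe` wrapper around an ASR model such as IndicWhisper; both are hypothetical interfaces used only for illustration.

```python
from jiwer import wer  # pip install jiwer

def filter_audios(samples, asr_transcribe, wer_threshold=0.2, ctc_threshold=0.9):
    """Keep audios whose ASR transcript has WER <= 0.2 against the ground truth,
    then re-admit discarded audios whose average CTC alignment score exceeds 0.9."""
    kept, discarded = [], []
    for s in samples:
        hyp = asr_transcribe(s["audio"])
        if wer(s["text"], hyp) <= wer_threshold:
            kept.append(s)
        else:
            discarded.append(s)
    kept += [s for s in discarded if s.get("ctc_score", 0.0) > ctc_threshold]
    return kept
```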

Language Hours
Hindi 2913
Marathi 1962
Tamil 1190
Telugu 932
Bengali 1224
Table 2: Training dataset sizes in no. of hours after filtering
Table 3: WERs after transcribing generated speech through IndicConformer, where a sentence is 50% masked
Language Dataset Ground Truth durations Infill style durations Pflow style durations
Tamil IndicSuperb 0.2794 0.2806 0.2622
Mucs 0.3438 0.3431 0.3269
overall 0.31160 0.31185 0.29455
Telugu Fluers 0.4832 0.47 0.4938
IndicSuperb 0.3661 0.4057 0.431
Mucs 0.3842 0.3655 0.3894
overall 0.41117 0.41373 0.43807
Bengali Fluers 0.4136 0.4391 0.4103
IndicSuperb 0.1948 0.2133 0.1936
overall 0.3042 0.3262 0.30195
Hindi Fluers 0.1797 0.1915 0.2051
IndicSuperb 0.127 0.1542 0.1574
Mucs 0.2022 0.2334 0.2368
overall 0.16963 0.19303 0.19977
Marathi Fluers 0.3962 0.348 0.3526
IndicSuperb 0.1896 0.1862 0.39
Mucs 0.1214 0.1381 0.351
overall 0.23573 0.22410 0.36453

4.3 Model

Following Voicebox, we employed a 103M-parameter audio model: a Transformer based on the architecture proposed by Vaswani et al. (2023), consisting of 12 layers. Each layer uses multi-head self-attention with 16 heads and a feed-forward network with a hidden dimension of 512. Positional information is incorporated using Rotary Positional Embedding (RoPE) Su et al. (2023) applied within the self-attention mechanism, and we use RMSNorm Zhang and Sennrich (2019) for layer normalization. Additionally, the model incorporates UNet-like skip connections Ronneberger et al. (2015), where outputs from the first half of the layers are concatenated and linearly combined with the inputs of the corresponding layers in the second half.
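The UNet-like skip pattern over the transformer stack can be sketched as follows. The standard PyTorch encoder layer is used as a stand-in, and RoPE, RMSNorm, and the flow-step conditioning are omitted, so this is a structural sketch rather than the actual model.

```python
import torch
import torch.nn as nn

class UNetTransformer(nn.Module):
    """Sketch of UNet-like skip connections: the output of layer i in the first
    half is concatenated with the input of the mirrored layer in the second half
    and linearly recombined before that layer runs."""
    def __init__(self, d_model=512, n_heads=16, n_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512,
                                       batch_first=True)
            for _ in range(n_layers)
        )
        half = n_layers // 2
        self.skip_proj = nn.ModuleList(nn.Linear(2 * d_model, d_model) for _ in range(half))

    def forward(self, x):
        skips = []
        half = len(self.layers) // 2
        for i, layer in enumerate(self.layers):
            if i < half:
                x = layer(x)
                skips.append(x)                       # store first-half outputs
            else:
                skip = skips.pop()                    # mirrored first-half output
                x = self.skip_proj[i - half](torch.cat([x, skip], dim=-1))
                x = layer(x)
        return x
```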

The Voicebox-style duration model uses the same architecture as the audio model, with 16 heads, 512 embedding/FFN dimensions, and 12 layers, totalling 51M parameters.

For the speaker-prompted duration predictor, we employ a speech-prompted text encoder inspired by Kim et al. (2023) and a convolution-based duration predictor. The text encoder processes the input speech prompt and text embeddings, which are first linearly projected to a shared feature space and concatenated. A prenetwork of three convolutional layers processes this combined sequence, and its output is split back into the prompt and text segments. To differentiate modalities and provide positional context, we add positional encodings to each segment, defined as the sum of standard absolute positional encodings and a unique learnable embedding for prompt or text. These processed representations are then fed into a 12-layer Transformer (8 heads, 512 hidden dimension), whose attention mechanism is configured so that text tokens can attend to the speech prompt tokens. The text encoder's output provides a speaker-conditional hidden representation.
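The positional scheme for the prompted encoder, sinusoidal positional encodings summed with a learnable per-modality embedding, can be sketched as follows; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityPositionalEncoding(nn.Module):
    """Sketch: sum of standard sinusoidal positional encodings and a learnable
    per-modality embedding (one vector for prompt frames, one for text tokens)."""
    def __init__(self, d_model=512, max_len=4096):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2)
                        * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.modality_emb = nn.Embedding(2, d_model)   # 0 = speech prompt, 1 = text

    def forward(self, x, modality: int):
        # x: (B, L, D) segment of either prompt frames or text tokens
        L = x.size(1)
        return x + self.pe[:L] + self.modality_emb.weight[modality]
```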

The duration predictor is a shallow convolutional network. It takes the speaker-conditional hidden representation produced by the text encoder (prior to its final linear projection) to determine token durations. This model has 84M parameters in total. All models are trained in FP32.

4.4 Training

The audio models are trained for 750K to 1M updates, depending on the language, with an effective batch size of 256K frames. For training efficiency, audio length is capped at 1,000 frames and randomly chunked if it exceeds this threshold. Duration models are trained for 200K updates with an effective batch size of 200K frames. The AdamW optimizer is used with a peak learning rate of 2e-4, linearly warmed up for 5K steps and linearly decayed over the rest of training. To encourage robustness of the audio model, a probabilistic masking strategy is applied to the input mel representations. With 50% probability, a random percentage r of the sequence frames is masked, where r is sampled from a uniform distribution U[30, 100]. With the remaining 50% probability, either the entire sequence is masked (with 90% probability, i.e., 45% overall) or no masking is applied (with 10% probability, i.e., 5% overall).
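The masking schedule can be sketched as below. Whether the partially masked frames form a single contiguous span is not stated explicitly, so the contiguous-span choice here is an assumption.

```python
import random
import torch

def sample_mask(n_frames: int) -> torch.Tensor:
    """Probabilistic masking: with p=0.5 mask a random r% span, r ~ U[30, 100];
    otherwise mask everything (45% overall) or nothing (5% overall)."""
    mask = torch.zeros(n_frames, dtype=torch.bool)
    u = random.random()
    if u < 0.5:
        r = random.uniform(0.3, 1.0)
        span = max(1, int(r * n_frames))
        start = random.randint(0, n_frames - span)
        mask[start:start + span] = True
    elif u < 0.95:
        mask[:] = True          # mask the entire sequence
    # else: leave the sequence unmasked
    return mask
```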

4.5 Evaluation Metrics

4.5.1 Speaker similarity:

To ensure that the generated audio was in the prompt speaker's voice, we used several metrics to analyze the speaker similarity between the prompt and the generated audio.

  1. Sim-o: The ECAPA-TDNN Desplanques et al. (2020) model, a speaker verification model trained on the VoxCeleb2 dataset Chung et al. (2018), was used to obtain embeddings for the prompt and generated audio. Cosine similarity between these embeddings gives the similarity score; higher Sim-o scores indicate higher speaker similarity between the two audios (see the sketch after this list).

  2. Similarity Mean Opinion Score (SMOS): 30 human evaluators were employed per language to score the similarity between the prompt and generated audio on a scale of 1 to 5, with 1 being completely dissimilar and 5 being exactly the same. We randomly picked 40 audios for scoring, and each audio was annotated by 10 people. Appendix B contains the specific instructions and scoring guidelines given to the annotators for SMOS.
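A sketch of how a Sim-o style score can be computed with a pretrained ECAPA-TDNN from SpeechBrain is shown below; the specific checkpoint and preprocessing are assumptions and may differ from the setup actually used.

```python
import torch
from speechbrain.pretrained import EncoderClassifier  # pip install speechbrain

# Pretrained ECAPA-TDNN speaker-verification model trained on VoxCeleb.
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def sim_o(prompt_wav: torch.Tensor, generated_wav: torch.Tensor) -> float:
    """Cosine similarity between speaker embeddings of two 16 kHz waveforms (1, T)."""
    e1 = encoder.encode_batch(prompt_wav).squeeze()
    e2 = encoder.encode_batch(generated_wav).squeeze()
    return torch.nn.functional.cosine_similarity(e1, e2, dim=0).item()
```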

4.5.2 Intelligibility

To ensure the quality of generated audios, we used several metrics to evaluate their intelligibility.

  1. Word Error Rate (WER): The IndicConformer AI4Bharat (2024) model was used to transcribe the generated audios, and the WER was calculated between these transcriptions and the ground-truth transcriptions. A higher WER indicates a lower-quality generation.

  2. Quality Mean Opinion Score (QMOS): 30 human evaluators were employed to score the quality of the generated audio on a scale of 1 to 5, with 1 being completely unintelligible and 5 being perfectly coherent. We randomly picked 40 audios for scoring, and each audio was annotated by 10 people. Appendix A contains the specific instructions and scoring guidelines given to the annotators for QMOS.

5 Results

The input to our Voicebox-style audio model was a speaker-specific speech prompt and the ground-truth text for the segment to be generated. For this text, durations were obtained in three ways: ground-truth forced-aligned durations, P-Flow-style speaker-prompted durations, and Voicebox-style infill durations. These durations were then used to generate speaker-specific speech with the audio model. To measure intelligibility, WERs were calculated for the generated speech. To measure speaker similarity between the generated and original speech, Sim-o scores between the two audio parts were calculated.

5.1 Continuous Sentence Completion

We used each instance of the test set as a separate testing sample. Each sentence was split into two halves, with the latter half masked. The first half served as the speaker-specific prompt, while the ground-truth text of the masked second half was provided as the text input. Table 3 depicts the dataset-wise WERs of the generated speech for every duration prediction technique. To mitigate the bias introduced by errors from the ASR model used for transcription, we also report WERs of audios generated using ground-truth forced-aligned durations. The P-Flow duration predictor offers gains over infill-style durations for Tamil and Bengali; however, its performance degrades significantly for Marathi. Table 4 depicts the Sim-o scores between the original and generated audios for every duration prediction technique. We observe that the P-Flow duration predictor improves over the infill-style durations for Tamil, Telugu, and Bengali.

Language GT Sim-o Infil Sim-o PFlow Sim-o
Marathi 0.6539 0.6481 0.6393
Tamil 0.6902 0.6833 0.6925
Telugu 0.6178 0.6292 0.633
Bengali 0.6682 0.6573 0.6706
Hindi 0.6512 0.6466 0.6344
Table 4: Similarity scores across languages for continuous sentence completion of GT, Infil, and PFlow systems

5.2 Human Evaluations

Human evaluations were used to obtain SMOS and QMOS scores. Detailed human annotation instructions are in Appendices A and B. Table 5 contains SMOS and QMOS scores, which show that the speaker-prompted duration predictor performs better than the infilling duration predictor for Hindi and Tamil.

Lang. Score GT Infill Pflow
Hindi QMOS Naturalness 4.3449 3.9742 4.3349
Intelligibility 4.4646 4.2397 4.1313
SMOS Similarity 3.9867 3.57 4.1356
Overall 4.2654 3.9280 4.2006
Tamil QMOS Naturalness 4.4277 4.4792 4.4856
Intelligibility 4.4378 4.4682 4.6063
SMOS Similarity 4.5343 4.4121 4.5619
Overall 4.4666 4.4532 4.5513
Bengali QMOS Naturalness 4.1688 4.2678 4.1125
Intelligibility 4.4538 4.4748 4.3788
SMOS Similarity 3.8065 3.9685 3.9176
Overall 4.1430 4.2370 4.1363
Table 5: QMOS and SMOS scores for Hindi, Tamil and Bengali outputs.

6 Conclusion

We have analyzed different techniques for obtaining speaker-specific durations for zero-shot speaker-specific TTS. We compared a Voicebox-style infilling duration predictor and a P-Flow-style speaker-prompted duration predictor. The speaker-prompted duration predictor offered considerable benefits over the infilling duration predictor for Tamil, in both intelligibility and speaker similarity. Additionally, as a general trend, the speaker-prompted duration predictor led to better speaker similarity, while the infilling duration predictor led to better intelligibility. This trade-off gives us insight into the importance of the duration prediction process and how it affects not only the intelligibility but also the speaker similarity of generated audios. We can leverage this knowledge to choose a suitable duration predictor based on the use case, and to develop a new duration predictor that combines properties of both speaker-prompted and infilling-style duration predictors to improve both intelligibility and speaker similarity.

References

Appendix A: QMOS guidelines

6.1 Naturalness Evaluation

Definition: Naturalness refers to how lifelike and fluid the synthesized speech sounds.

Score Description
5 Completely natural, indistinguishable from a human speaker.
4 Mostly natural, but with slight unnatural elements.
3 Moderately natural, with noticeable synthetic artifacts or monotony.
2 Mostly unnatural, robotic or artificial-sounding.
1 Completely unnatural, heavily robotic, or difficult to listen to.
Table 6: Description of naturalness scores.

6.2 Intelligibility Evaluation

Definition: Intelligibility measures how easily the speech can be understood, regardless of how natural it sounds. It focuses on clarity and accuracy of pronunciation.

Score Description
5 Perfectly clear; every word is easily understood.
4 Mostly clear, but with minor pronunciation errors or distortions.
3 Somewhat clear; requires some effort to understand certain words.
2 Mostly unclear; many words are difficult to recognize.
1 Completely unintelligible; nearly impossible to understand.
Table 7: Description of intelligibility scores.

6.3 Guidelines

  1. Listen to each audio sample carefully. Replay if necessary.

  2. Rate the sample separately for Naturalness, Intelligibility, and Speaker Similarity on a scale from 1 to 5.

  3. Avoid bias by focusing on the specific criteria, not personal preference.

  4. Explain the score clearly. Justify the score by describing key factors such as errors, inconsistencies, or deviations from the expected standard.

6.4 Important guidelines

  • Use headphones for better sound quality.

  • Ensure a quiet environment to avoid distractions.

  • Rate objectively without comparing different speech styles or accents.

  • Do not assume meaning—rate based on what you actually hear.

Thank you for your contribution!

Appendix B: SMOS guidelines

6.5 Speaker Similarity Evaluation

Definition: Speaker similarity refers to how much the synthesized voice resembles the reference speaker’s voice in terms of timbre, pitch, and prosody. Ignore other factors like intelligibility and only focus on speaker similarity.

Score Description
5 Indistinguishable from the reference speaker.
4 Very similar, but with minor differences.
3 Moderately similar, but noticeable variations.
2 Weak similarity, with clear differences in voice identity.
1 Completely different from the reference speaker.

6.6 Guidelines

  1. Listen to each audio sample carefully. Replay if necessary.

  2. Rate the sample separately for Naturalness, Intelligibility, and Speaker Similarity on a scale from 1 to 5.

  3. Avoid bias by focusing on the specific criteria, not personal preference.

  4. Explain the score clearly. Justify the score by describing key factors such as errors, inconsistencies, or deviations from the expected standard.

6.7 Important guidelines

  • Use headphones for better sound quality.

  • Ensure a quiet environment to avoid distractions.

  • Rate objectively without comparing different speech styles or accents.

  • Do not assume meaning—rate based on what you actually hear.

Appendix C: Test set creation

We use the Vistaar test set for all our benchmarking. From this set, we select audio clips ranging from 3 to 10 seconds in duration. Test sets yielding fewer than 100 instances after this filtering are excluded. Table 8 lists the datasets and the number of samples used for benchmarking after filtering.

Table 8: Number of samples per dataset for each language.
Language Dataset Samples
Tamil IndicSuperb 2598
Mucs 1137
Total 3738
Telugu Fluers 374
IndicSuperb 2892
Mucs 1166
Total 4432
Bengali Fluers 787
IndicSuperb 2898
Total 3698
Hindi Fluers 360
IndicSuperb 1876
Mucs 917
Total 3170
Marathi Fluers 825
IndicSuperb 1006
Mucs 316
Total 2153
