Instant custom voice in Text-to-Speech lets users create personalized voice models from their own high-quality audio recordings. Personal voices can be generated quickly and then used to synthesize audio with the Cloud TTS API, including streaming and long-form text.
Language availability
Instant custom voice is supported in the following languages:
Language | BCP-47 Code | Consent Statement |
---|---|---|
Arabic (XA) | ar-XA | .أنا مالك هذا الصوت وأوافق على أن تستخدم Google هذا الصوت لإنشاء نموذج صوتي اصطناعي |
Bengali (India) | bn-IN | আমি এই ভয়েসের মালিক এবং আমি একটি সিন্থেটিক ভয়েস মডেল তৈরি করতে এই ভয়েস ব্যবহার করে Google-এর সাথে সম্মতি দিচ্ছি। |
Chinese (China) | cmn-CN | 我是此声音的拥有者并授权谷歌使用此声音创建语音合成模型 |
English (Australia) | en-AU | I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model. |
English (India) | en-IN | I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model. |
English (UK) | en-GB | I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model. |
English (US) | en-US | I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model. |
French (Canada) | fr-CA | Je suis le propriétaire de cette voix et j'autorise Google à utiliser cette voix pour créer un modèle de voix synthétique. |
French (France) | fr-FR | Je suis le propriétaire de cette voix et j'autorise Google à utiliser cette voix pour créer un modèle de voix synthétique. |
German (Germany) | de-DE | Ich bin der Eigentümer dieser Stimme und bin damit einverstanden, dass Google diese Stimme zur Erstellung eines synthetischen Stimmmodells verwendet. |
Gujarati (India) | gu-IN | હું આ વોઈસનો માલિક છું અને સિન્થેટિક વોઈસ મોડલ બનાવવા માટે આ વોઈસનો ઉપયોગ કરીને google ને હું સંમતિ આપું છું |
Hindi (India) | hi-IN | मैं इस आवाज का मालिक हूं और मैं सिंथेटिक आवाज मॉडल बनाने के लिए Google को इस आवाज का उपयोग करने की सहमति देता हूं |
Indonesian (Indonesia) | id-ID | Saya pemilik suara ini dan saya menyetujui Google menggunakan suara ini untuk membuat model suara sintetis. |
Italian (Italy) | it-IT | Sono il proprietario di questa voce e acconsento che Google la utilizzi per creare un modello di voce sintetica. |
Japanese (Japan) | ja-JP | 私はこの音声の所有者であり、Googleがこの音声を使用して音声合成モデルを作成することを承認します。 |
Kannada (India) | kn-IN | ನಾನು ಈ ಧ್ವನಿಯ ಮಾಲಿಕ ಮತ್ತು ಸಂಶ್ಲೇಷಿತ ಧ್ವನಿ ಮಾದರಿಯನ್ನು ರಚಿಸಲು ಈ ಧ್ವನಿಯನ್ನು ಬಳಸಿಕೊಂಡುಗೂಗಲ್ ಗೆ ನಾನು ಸಮ್ಮತಿಸುತ್ತೇನೆ. |
Korean (Korea) | ko-KR | 나는 이 음성의 소유자이며 구글이 이 음성을 사용하여 음성 합성 모델을 생성할 것을 허용합니다. |
Malayalam (India) | ml-IN | ഈ ശബ്ദത്തിന്റെ ഉടമ ഞാനാണ്, ഒരു സിന്തറ്റിക് വോയ്സ് മോഡൽ സൃഷ്ടിക്കാൻ ഈ ശബ്ദം ഉപയോഗിക്കുന്നതിന് ഞാൻ Google-ന് സമ്മതം നൽകുന്നു. |
Marathi (India) | mr-IN | मी या आवाजाचा मालक आहे आणि सिंथेटिक व्हॉइस मॉडेल तयार करण्यासाठी हा आवाज वापरण्यासाठी मी Google ला संमती देतो |
Dutch (Netherlands) | nl-NL | Ik ben de eigenaar van deze stem en ik geef Google toestemming om deze stem te gebruiken om een synthetisch stemmodel te maken. |
Polish (Poland) | pl-PL | Jestem właścicielem tego głosu i wyrażam zgodę na wykorzystanie go przez Google w celu utworzenia syntetycznego modelu głosu. |
Portuguese (Brazil) | pt-BR | Eu sou o proprietário desta voz e autorizo o Google a usá-la para criar um modelo de voz sintética. |
Russian (Russia) | ru-RU | Я являюсь владельцем этого голоса и даю согласие Google на использование этого голоса для создания модели синтетического голоса. |
Tamil (India) | ta-IN | நான் இந்த குரலின் உரிமையாளர் மற்றும் செயற்கை குரல் மாதிரியை உருவாக்க இந்த குரலை பயன்படுத்த குகல்க்கு நான் ஒப்புக்கொள்கிறேன். |
Telugu (India) | te-IN | నేను ఈ వాయిస్ యజమానిని మరియు సింతటిక్ వాయిస్ మోడల్ ని రూపొందించడానికి ఈ వాయిస్ ని ఉపయోగించడానికి googleకి నేను సమ్మతిస్తున్నాను. |
Thai (Thailand) | th-TH | ฉันเป็นเจ้าของเสียงนี้ และฉันยินยอมให้ Google ใช้เสียงนี้เพื่อสร้างแบบจำลองเสียงสังเคราะห์ |
Turkish (Turkey) | tr-TR | Bu sesin sahibi benim ve Google'ın bu sesi kullanarak sentetik bir ses modeli oluşturmasına izin veriyorum. |
Vietnamese (Vietnam) | vi-VN | Tôi là chủ sở hữu giọng nói này và tôi đồng ý cho Google sử dụng giọng nói này để tạo mô hình giọng nói tổng hợp. |
Spanish (Spain) | es-ES | Soy el propietario de esta voz y doy mi consentimiento para que Google la utilice para crear un modelo de voz sintética. |
Spanish (US) | es-US | Soy el propietario de esta voz y doy mi consentimiento para que Google la utilice para crear un modelo de voz sintética. |
Regional availability
Instant custom voice is available in the following Google Cloud regions:
Google Cloud region | Supported method | Launch readiness |
---|---|---|
global | Creation, Synthesis | Private Preview |
us | Synthesis | Private Preview |
eu | Synthesis | Private Preview |
asia-southeast1 | Synthesis | Private Preview |
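Voice creation is supported only in the global region, while synthesis is also available regionally. With the Python client library, a client can be pointed at a regional endpoint through client options. The snippet below is a minimal sketch; the regional hostname shown is an assumption based on the regions in the table, so confirm the exact endpoint name for your region before relying on it.
from google.api_core.client_options import ClientOptions
from google.cloud import texttospeech

# Assumed regional endpoint; verify the exact hostname for your region.
# The default endpoint (texttospeech.googleapis.com) serves the global region.
REGIONAL_ENDPOINT = "eu-texttospeech.googleapis.com"

# Create a client that sends synthesis requests to the regional endpoint.
regional_client = texttospeech.TextToSpeechClient(
    client_options=ClientOptions(api_endpoint=REGIONAL_ENDPOINT)
)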
Supported output formats
The default response format is LINEAR16, but other supported formats include:
API Method | Format |
---|---|
streaming | ALAW, MULAW, OGG_OPUS and PCM |
batch | ALAW, MULAW, MP3, OGG_OPUS and PCM |
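For example, to request MP3 output from batch synthesis with the Python client library, set the audio encoding in the audio config. This is a minimal sketch; the rest of the request (input text and voice selection) is omitted here for brevity.
from google.cloud import texttospeech

# Request MP3 output instead of the default LINEAR16 (batch synthesis only).
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    sample_rate_hertz=24000,
)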
Feature support and limitations
Feature | Support | Description |
---|---|---|
SSML | No | SSML tags to personalize synthetic audio |
Text-Based Prompting | Experimental | Use punctuation, pauses, and disfluencies to add natural flow and pacing to Text-to-Speech. |
Timestamps | No | Word-level timestamps |
Pause Tags | No | Introduce on-demand pauses to synthesized audio |
Pace Control | No | Adjust the speed of synthesized audio, from 0.25x speed to 2x speed. |
Pronunciation Control | No | Custom pronunciations of words or phrases using IPA or X-SAMPA phonetic encoding |
Use Chirp 3: Instant custom voice
The following sections explain how to use Chirp 3: Instant custom voice capabilities in the Text-to-Speech API.
Record consent and reference audio
Record the consent statement: To comply with legal and ethical guidelines for Instant custom voice, record the required consent statement as a mono WAV file, with LINEAR16 encoding and a 24 kHz sampling rate, in the appropriate language. (I am the owner of this voice, and I consent to Google using this voice to create a synthetic voice model.)
Record reference audio: Use your computer microphone to record up to 10 seconds of audio as a LINEAR16-encoded, mono WAV file at a 24 kHz sampling rate. There should be no background noise during the recording. Record the consent and reference audio in the same environment.
Store audio files: Save the recorded audio files in a designated Cloud Storage location.
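The REST sample in the next section sends the reference and consent recordings inline as base64-encoded bytes. The helper below is a minimal sketch, not part of the official sample, showing one way to check that a local recording is a 24 kHz mono LINEAR16 WAV file and encode it for the request; the function name load_wav_as_base64 is hypothetical.
import base64
import wave

def load_wav_as_base64(path: str) -> str:
    """Validate a 24 kHz mono 16-bit WAV file and return its bytes as base64.

    Hypothetical helper for preparing the reference and consent recordings;
    adapt it to however you store your audio. It encodes the entire WAV file,
    header included.
    """
    with wave.open(path, "rb") as wav_file:
        assert wav_file.getnchannels() == 1, "audio must be mono"
        assert wav_file.getframerate() == 24000, "audio must be sampled at 24 kHz"
        assert wav_file.getsampwidth() == 2, "audio must be 16-bit (LINEAR16)"
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")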
Create an instant custom voice using the REST API
An instant custom voice takes the form of a voice cloning key, which is a text-string representation of your reference audio data. This sample shows how to use the REST API to generate your voice cloning key. For more information about how this key is used to synthesize audio, see Synthesize with an instant custom voice using the REST API.
import requests, os, json


def create_instant_custom_voice_key(
    access_token, project_id, reference_audio_bytes, consent_audio_bytes
):
    url = "https://texttospeech.googleapis.com/v1beta1/voices:generateVoiceCloningKey"

    request_body = {
        "reference_audio": {
            "audio_config": {"audio_encoding": "LINEAR16", "sample_rate_hertz": 24000},
            "content": reference_audio_bytes,
        },
        "voice_talent_consent": {
            "audio_config": {"audio_encoding": "LINEAR16", "sample_rate_hertz": 24000},
            "content": consent_audio_bytes,
        },
        "consent_script": "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.",
        "language_code": "en-US",
    }

    try:
        headers = {
            "Authorization": f"Bearer {access_token}",
            "x-goog-user-project": project_id,
            "Content-Type": "application/json; charset=utf-8",
        }

        response = requests.post(url, headers=headers, json=request_body)
        response.raise_for_status()

        response_json = response.json()
        return response_json.get("voiceCloningKey")

    except requests.exceptions.RequestException as e:
        print(f"Error making API request: {e}")
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON response: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
Synthesize with an instant custom voice using the REST API
Use the voice cloning key to synthesize audio using the REST API.
import requests, os, json, base64
from IPython.display import Audio, display


def synthesize_text_with_cloned_voice(access_token, project_id, voice_key, text):
    url = "https://texttospeech.googleapis.com/v1beta1/text:synthesize"

    request_body = {
        "input": {
            "text": text
        },
        "voice": {
            "language_code": "en-US",
            "voice_clone": {
                "voice_cloning_key": voice_key,
            }
        },
        "audioConfig": {
            "audioEncoding": "LINEAR16",
            "sample_rate_hertz": 24000
        }
    }

    try:
        headers = {
            "Authorization": f"Bearer {access_token}",
            "x-goog-user-project": project_id,
            "Content-Type": "application/json; charset=utf-8"
        }

        response = requests.post(url, headers=headers, json=request_body)
        response.raise_for_status()

        response_json = response.json()
        audio_content = response_json.get("audioContent")

        if audio_content:
            display(Audio(base64.b64decode(audio_content), rate=24000))
        else:
            print("Error: Audio content not found in the response.")
            print(response_json)

    except requests.exceptions.RequestException as e:
        print(f"Error making API request: {e}")
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON response: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
Synthesize with an instant custom voice using the Python client library
This example uses the Python client library to perform instant custom voice synthesis, using a voice cloning key saved to the file voice_cloning_key.txt. To generate a voice cloning key, see Create an instant custom voice using the REST API.
from google.cloud import texttospeech
from google.cloud.texttospeech_v1beta1.services.text_to_speech import client


def perform_voice_cloning(
    voice_cloning_key: str,
    transcript: str,
    language_code: str,
    synthesis_output_path: str,
    tts_client: client.TextToSpeechClient,
) -> None:
    """Perform voice cloning and write output to a file.

    Args:
      voice_cloning_key: The voice cloning key.
      transcript: The transcript to synthesize.
      language_code: The language code.
      synthesis_output_path: The synthesis audio output path.
      tts_client: The TTS client to use.
    """
    voice_clone_params = texttospeech.VoiceCloneParams(
        voice_cloning_key=voice_cloning_key
    )
    voice = texttospeech.VoiceSelectionParams(
        language_code=language_code, voice_clone=voice_clone_params
    )
    request = texttospeech.SynthesizeSpeechRequest(
        input=texttospeech.SynthesisInput(text=transcript),
        voice=voice,
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16,
            sample_rate_hertz=24000,
        ),
    )

    response = tts_client.synthesize_speech(request)
    with open(synthesis_output_path, 'wb') as out:
        out.write(response.audio_content)
        print(f'Audio content written to file {synthesis_output_path}.')


if __name__ == '__main__':
    client = texttospeech.TextToSpeechClient()
    with open('voice_cloning_key.txt', 'r') as f:
        key = f.read()
    perform_voice_cloning(
        voice_cloning_key=key,
        transcript='Hello world!',
        language_code='en-US',
        synthesis_output_path='/tmp/output.wav',
        tts_client=client,
    )
Streaming synthesize with an instant custom voice using the Python client library
This example uses the Python client library to perform instant custom voice streaming synthesis, using a voice cloning key saved to voice_cloning_key.txt. To generate a voice cloning key, see Create an instant custom voice using the REST API.
import io
import wave
from google.cloud import texttospeech
from google.cloud.texttospeech_v1beta1.services.text_to_speech import client


def perform_voice_cloning_with_simulated_streaming(
    voice_cloning_key: str,
    simulated_streamed_text: list[str],
    language_code: str,
    synthesis_output_path: str,
    tts_client: client.TextToSpeechClient,
) -> None:
    """Perform streaming synthesis with a cloned voice and write the output to a file.

    Args:
      voice_cloning_key: The voice cloning key.
      simulated_streamed_text: The list of transcripts to synthesize, where each
        item represents a chunk of streamed text. This is used to simulate
        streamed text input and is not meant to be representative of real-world
        streaming usage.
      language_code: The language code.
      synthesis_output_path: The path to write the synthesis audio output to.
      tts_client: The TTS client to use.
    """
    voice_clone_params = texttospeech.VoiceCloneParams(
        voice_cloning_key=voice_cloning_key
    )
    streaming_config = texttospeech.StreamingSynthesizeConfig(
        voice=texttospeech.VoiceSelectionParams(
            language_code=language_code, voice_clone=voice_clone_params
        ),
        streaming_audio_config=texttospeech.StreamingAudioConfig(
            audio_encoding=texttospeech.AudioEncoding.PCM,
            sample_rate_hertz=24000,
        ),
    )
    config_request = texttospeech.StreamingSynthesizeRequest(
        streaming_config=streaming_config
    )

    # Request generator. Consider using Gemini or another LLM with output
    # streaming as a generator.
    def request_generator():
        yield config_request
        for text in simulated_streamed_text:
            yield texttospeech.StreamingSynthesizeRequest(
                input=texttospeech.StreamingSynthesisInput(text=text)
            )

    streaming_responses = tts_client.streaming_synthesize(request_generator())
    audio_buffer = io.BytesIO()
    for response in streaming_responses:
        print(f'Audio content size in bytes is: {len(response.audio_content)}')
        audio_buffer.write(response.audio_content)

    # Write collected audio outputs to a WAV file.
    with wave.open(synthesis_output_path, 'wb') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(24000)
        wav_file.writeframes(audio_buffer.getvalue())
        print(f'Audio content written to file {synthesis_output_path}.')


if __name__ == '__main__':
    client = texttospeech.TextToSpeechClient()
    with open('voice_cloning_key.txt', 'r') as f:
        key = f.read()
    perform_voice_cloning_with_simulated_streaming(
        voice_cloning_key=key,
        simulated_streamed_text=[
            'Hello world!',
            'This is the second text chunk.',
            'This simulates streaming text for synthesis.',
        ],
        language_code='en-US',
        synthesis_output_path='streaming_output.wav',
        tts_client=client,
    )
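The request generator above replays a fixed list of strings. In a real application you would typically yield text as it becomes available, for example from an LLM's streaming output or another incremental text source. The sketch below is a hypothetical variation under that assumption: stream_of_text stands in for any iterable that yields text chunks, and config_request is the same configuration request built in the sample above.
from typing import Iterable, Iterator
from google.cloud import texttospeech

def streaming_requests(
    config_request: texttospeech.StreamingSynthesizeRequest,
    stream_of_text: Iterable[str],
) -> Iterator[texttospeech.StreamingSynthesizeRequest]:
    """Yield the config request, then one synthesis request per incoming text chunk.

    stream_of_text is a placeholder for any source that produces text
    incrementally, such as sentences from a streamed LLM response.
    """
    yield config_request
    for text_chunk in stream_of_text:
        if text_chunk:  # skip empty chunks
            yield texttospeech.StreamingSynthesizeRequest(
                input=texttospeech.StreamingSynthesisInput(text=text_chunk)
            )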