Live API

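The Live API enables low-latency, two-way voice and video interactions with Gemini. With it, you can give end users natural, human-like voice conversations, including the ability to interrupt the model's responses with voice commands.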

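This document covers the basics of using the Live API, including its capabilities, starter examples, and code examples for basic use cases. To learn how to start an interactive conversation with the Live API, see Interactive conversations with the Live API. To learn which tools the Live API can use, see Built-in tools.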

Supported models

The Live API is supported in both the Google Gen AI SDK and Vertex AI Studio. Some features, such as text input and output, are available only through the Gen AI SDK.

You can use the Live API with the following models:

Model version	Availability level
`gemini-live-2.5-flash`	Private GA*
`gemini-live-2.5-flash-preview-native-audio`	Public preview

* Contact your Google account team representative to request access.

For more information, including technical specifications and limitations, see the Live API reference guide.

Starter examples

You can get started with the Live API using one of our examples:

Jupyter notebooks:

Demo applications and guides:

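Live API features

Real-time multimodal understanding: Talk with Gemini about what it sees in a video feed or through screen sharing, using built-in support for streaming audio and video.
Built-in tool use: Integrate tools such as function calling and Grounding with Google Search into your conversations for more practical and dynamic interactions.
Low-latency interactions: Have low-latency, human-like interactions with Gemini.
Multilingual support: Converse in 24 supported languages.
(GA versions only) Provisioned Throughput support: Use a fixed-cost, fixed-term subscription, available in several term lengths, to reserve throughput for supported generative AI models on Vertex AI, including the Live API.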

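Gemini 2.5 Flash with Live API also includes native audio, which is offered in public preview. Native audio introduces the following capabilities:

Affective dialog: The Live API understands the user's tone of voice and responds accordingly. The same words spoken in different ways can lead to very different, more nuanced conversations.
Proactive audio and context awareness: The Live API ignores ambient conversations and other irrelevant audio, and recognizes when to listen and when to stay silent.

For more information about native audio, see Built-in tools.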

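Supported audio formats

The Live API supports the following audio formats:

Input audio: raw 16-bit PCM audio at 16 kHz, little-endian
Output audio: raw 16-bit PCM audio at 24 kHz, little-endian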
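
As an illustration of what these formats mean at the byte level, here is a minimal sketch that packs floating-point samples into raw 16-bit little-endian PCM and computes the duration of a PCM buffer. The helper names `float_to_pcm16_le` and `pcm16_duration_seconds` are hypothetical and not part of any SDK, and the sketch assumes single-channel (mono) audio.

```python
import struct

INPUT_RATE_HZ = 16_000   # input sample rate listed above
OUTPUT_RATE_HZ = 24_000  # output sample rate listed above
BYTES_PER_SAMPLE = 2     # one 16-bit sample occupies two bytes


def float_to_pcm16_le(samples):
    """Convert floats in [-1.0, 1.0] to raw 16-bit little-endian PCM bytes."""
    ints = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # clamp out-of-range values
        ints.append(int(round(clipped * 32767)))
    # "<" selects little-endian byte order, "h" a signed 16-bit integer
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm16_duration_seconds(pcm_bytes, sample_rate_hz):
    """Duration of a mono 16-bit PCM buffer (assumption: one channel)."""
    return len(pcm_bytes) / (BYTES_PER_SAMPLE * sample_rate_hz)
```

At these rates, one second of mono input audio occupies 32,000 bytes and one second of mono output audio occupies 48,000 bytes.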

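Get text responses from audio input

You can send audio and receive text responses by converting the audio to 16-bit PCM, 16 kHz, mono format. The following example reads a WAV file and sends it in the correct format: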

Python

# Test file: https://storage.googleapis.com/generativeai-downloads/data/16000.wav
# Install helpers for converting files: pip install librosa soundfile

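import asyncio
import io
import os
from pathlib import Path
from google import genai
from google.genai import types
import soundfile as sf
import librosa

# Project ID and region are read from the environment here; set both variables before running.
client = genai.Client(
    vertexai=True,
    project=os.environ["GOOGLE_CLOUD_PROJECT"],
    location=os.environ["GOOGLE_CLOUD_LOCATION"],
)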
model = "gemini-live-2.5-flash"
config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:

        buffer = io.BytesIO()
        y, sr = librosa.load("sample.wav", sr=16000)
        sf.write(buffer, y, sr, format="RAW", subtype="PCM_16")
        buffer.seek(0)
        audio_bytes = buffer.read()

        # If already in correct format, you can use this:
        # audio_bytes = Path("sample.pcm").read_bytes()

        await session.send_realtime_input(
            audio=types.Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000")
        )

        async for response in session.receive():
            if response.text is not None:
                print(response.text)

if __name__ == "__main__":
    asyncio.run(main())
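
The example above sends the whole file in a single call. For longer recordings, one option is to split the PCM bytes into short chunks and send them one at a time. The following is a minimal sketch of that splitting step only; `split_pcm16_chunks` is a hypothetical helper, and the 100 ms chunk length is an assumed value, not a documented requirement.

```python
def split_pcm16_chunks(pcm_bytes, sample_rate_hz=16_000, chunk_ms=100):
    """Yield consecutive slices of mono 16-bit PCM, each about chunk_ms long."""
    # Two bytes per sample; chunk_ms is an assumed, tunable value
    bytes_per_chunk = sample_rate_hz * 2 * chunk_ms // 1000
    # Keep every chunk aligned to whole 2-byte samples
    bytes_per_chunk -= bytes_per_chunk % 2
    if bytes_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm_bytes), bytes_per_chunk):
        yield pcm_bytes[start:start + bytes_per_chunk]
```

Each chunk could then be wrapped in `types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")` and passed to `session.send_realtime_input`, in the same way the example above sends the whole buffer.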

Get voice responses from text input

Use this example to send text input and receive a synthesized speech response:

Python

import asyncio
import os
import numpy as np
from IPython.display import Audio, Markdown, display
from google import genai
from google.genai.types import (
  Content,
  LiveConnectConfig,
  Part,
  SpeechConfig,
  VoiceConfig,
  PrebuiltVoiceConfig,
)

# Project ID and region are read from the environment here; set both variables before running.
client = genai.Client(
  vertexai=True,
  project=os.environ["GOOGLE_CLOUD_PROJECT"],
  location=os.environ["GOOGLE_CLOUD_LOCATION"],
)

voice_name = "Aoede"

config = LiveConnectConfig(
  response_modalities=["AUDIO"],
  speech_config=SpeechConfig(
      voice_config=VoiceConfig(
          prebuilt_voice_config=PrebuiltVoiceConfig(
              voice_name=voice_name,
          )
      ),
  ),
)

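# Top-level `async with` works in a notebook (IPython/Jupyter); elsewhere, wrap this in an async function.
async with client.aio.live.connect(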
  model="gemini-live-2.5-flash",
  config=config,
) as session:
  text_input = "Hello? Gemini are you there?"
  display(Markdown(f"**Input:** {text_input}"))

  await session.send_client_content(
      turns=Content(role="user", parts=[Part(text=text_input)]))

  audio_data = []
  async for message in session.receive():
      # Skip messages that carry no server content
      server_content = message.server_content
      if (
          server_content
          and server_content.model_turn
          and server_content.model_turn.parts
      ):
          for part in server_content.model_turn.parts:
              if part.inline_data:
                  audio_data.append(
                      np.frombuffer(part.inline_data.data, dtype=np.int16)
                  )

  if audio_data:
      display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))

For more examples of sending text, see our getting started guide.
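
Outside a notebook, where `IPython.display.Audio` is not available, the received audio can be written to a WAV file instead. The following is a minimal sketch using Python's standard `wave` module; `save_pcm16_as_wav` is a hypothetical helper, and it assumes the output audio is single-channel, which the format description above does not state explicitly.

```python
import wave


def save_pcm16_as_wav(path, pcm_chunks, sample_rate_hz=24_000):
    """Write raw 16-bit little-endian PCM chunks to a WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # assumption: mono output
        wav_file.setsampwidth(2)            # 16-bit samples
        wav_file.setframerate(sample_rate_hz)
        for chunk in pcm_chunks:
            wav_file.writeframes(chunk)
```

In the example above, each `part.inline_data.data` value is a bytes object, so collecting those values in a list and passing the list as `pcm_chunks` would be one way to use this helper.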

Audio transcription

The Live API can transcribe both input and output audio. Use the following example to enable transcription:

Python

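import asyncio
import os
from google import genai
from google.genai import types

# Project ID and region are read from the environment here; set both variables before running.
client = genai.Client(
    vertexai=True,
    project=os.environ["GOOGLE_CLOUD_PROJECT"],
    location=os.environ["GOOGLE_CLOUD_LOCATION"],
)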
model = "gemini-live-2.5-flash"

config = {
    "response_modalities": ["AUDIO"],
    "input_audio_transcription": {},
    "output_audio_transcription": {}
}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        message = "Hello? Gemini are you there?"

        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": message}]}, turn_complete=True
        )

        async for response in session.receive():
            content = response.server_content
            if content is None:
                continue  # Not every message carries server content
            if content.model_turn:
                print("Model turn:", content.model_turn)
            if content.input_transcription:
                print("Input transcript:", content.input_transcription.text)
            if content.output_transcription:
                print("Output transcript:", content.output_transcription.text)

if __name__ == "__main__":
    asyncio.run(main())

WebSockets

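import base64
import json

import numpy as np
from IPython.display import Audio, Markdown, display
from websockets.asyncio.client import connect

# SERVICE_URL (the WebSocket endpoint) and bearer_token (a list whose first
# item is an access token) are assumed to be defined before this point.

# Set model generation_config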
CONFIG = {
    'response_modalities': ['AUDIO'],
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    'input_audio_transcription': {},
                    'output_audio_transcription': {}
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode("ascii"))

    # Send text message
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }

    await ws.send(json.dumps(msg))

    responses = []
    input_transcriptions = []
    output_transcriptions = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode())
        server_content = response.pop("serverContent", None)
        if server_content is None:
            break

        if (input_transcription := server_content.get("inputTranscription")) is not None:
            if (text := input_transcription.get("text")) is not None:
                input_transcriptions.append(text)
        if (output_transcription := server_content.get("outputTranscription")) is not None:
            if (text := output_transcription.get("text")) is not None:
                output_transcriptions.append(text)

        model_turn = server_content.pop("modelTurn", None)
        if model_turn is not None:
            parts = model_turn.pop("parts", None)
            if parts is not None:
                for part in parts:
                    pcm_data = base64.b64decode(part["inlineData"]["data"])
                    responses.append(np.frombuffer(pcm_data, dtype=np.int16))

        # End of turn
        turn_complete = server_content.pop("turnComplete", None)
        if turn_complete:
            break

    if input_transcriptions:
        display(Markdown(f"**Input transcription >** {''.join(input_transcriptions)}"))

    if responses:
        # Play the returned audio message
        display(Audio(np.concatenate(responses), rate=24000, autoplay=True))

    if output_transcriptions:
        display(Markdown(f"**Output transcription >** {''.join(output_transcriptions)}"))
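
The per-message handling in the receive loop above can also be factored into a standalone function, which makes it easier to test without a live connection. The following is a minimal sketch under that assumption; `parse_server_message` is a hypothetical helper, and it relies only on the JSON field names used in the example above (`serverContent`, `inputTranscription`, `outputTranscription`, `modelTurn`, `inlineData`, `turnComplete`).

```python
import base64
import json


def parse_server_message(raw_message):
    """Extract transcripts, audio bytes, and the end-of-turn flag from one message."""
    message = json.loads(raw_message)  # accepts str or bytes
    server_content = message.get("serverContent") or {}

    input_text = (server_content.get("inputTranscription") or {}).get("text")
    output_text = (server_content.get("outputTranscription") or {}).get("text")

    audio = b""
    model_turn = server_content.get("modelTurn") or {}
    for part in model_turn.get("parts") or []:
        inline_data = part.get("inlineData")
        # Parts without inline data (for example, text parts) are skipped
        if inline_data and "data" in inline_data:
            audio += base64.b64decode(inline_data["data"])

    return {
        "input_text": input_text,
        "output_text": output_text,
        "audio": audio,
        "turn_complete": bool(server_content.get("turnComplete")),
    }
```

Unlike the loop above, this sketch skips parts that have no `inlineData` rather than raising a `KeyError`, which is a design choice of the sketch and not documented API behavior.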
