Live API capabilities guide

This is a comprehensive guide to the capabilities and configurations available with the Live API. See the Get started with Live API page for an overview and sample code for common use cases.

Before you begin

Familiarize yourself with core concepts: If you haven't already done so, read the Get started with Live API page first. It introduces the fundamental principles of the Live API, how it works, and the differences between the available models and their corresponding audio generation methods (native audio or half-cascade).
Try the Live API in AI Studio: You may find it useful to try the Live API in Google AI Studio before you start building. To use the Live API in Google AI Studio, select Stream.

Establishing a connection

The following example shows how to create a connection with an API key:

Python

import asyncio
from google import genai

client = genai.Client()

model = "gemini-live-2.5-flash-preview"
config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        print("Session started")

if __name__ == "__main__":
    asyncio.run(main())

JavaScript

import { GoogleGenAI, Modality } from '@google/genai';

const ai = new GoogleGenAI({});
const model = 'gemini-live-2.5-flash-preview';
const config = { responseModalities: [Modality.TEXT] };

async function main() {

  const session = await ai.live.connect({
    model: model,
    callbacks: {
      onopen: function () {
        console.debug('Opened');
      },
      onmessage: function (message) {
        console.debug(message);
      },
      onerror: function (e) {
        console.debug('Error:', e.message);
      },
      onclose: function (e) {
        console.debug('Close:', e.reason);
      },
    },
    config: config,
  });

  // Send content...

  session.close();
}

main();

Interaction modalities

The following sections provide examples and supporting context for the different input and output modalities available in the Live API.

Sending and receiving text

Here's how you can send and receive text:

Python

import asyncio
from google import genai

client = genai.Client()
model = "gemini-live-2.5-flash-preview"

config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        message = "Hello, how are you?"
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": message}]}, turn_complete=True
        )

        async for response in session.receive():
            if response.text is not None:
                print(response.text, end="")

if __name__ == "__main__":
    asyncio.run(main())

JavaScript

import { GoogleGenAI, Modality } from '@google/genai';

const ai = new GoogleGenAI({});
const model = 'gemini-live-2.5-flash-preview';
const config = { responseModalities: [Modality.TEXT] };

async function live() {
  const responseQueue = [];

  async function waitMessage() {
    let done = false;
    let message = undefined;
    while (!done) {
      message = responseQueue.shift();
      if (message) {
        done = true;
      } else {
        await new Promise((resolve) => setTimeout(resolve, 100));
      }
    }
    return message;
  }

  async function handleTurn() {
    const turns = [];
    let done = false;
    while (!done) {
      const message = await waitMessage();
      turns.push(message);
      if (message.serverContent && message.serverContent.turnComplete) {
        done = true;
      }
    }
    return turns;
  }

  const session = await ai.live.connect({
    model: model,
    callbacks: {
      onopen: function () {
        console.debug('Opened');
      },
      onmessage: function (message) {
        responseQueue.push(message);
      },
      onerror: function (e) {
        console.debug('Error:', e.message);
      },
      onclose: function (e) {
        console.debug('Close:', e.reason);
      },
    },
    config: config,
  });

  const inputTurns = 'Hello how are you?';
  session.sendClientContent({ turns: inputTurns });

  const turns = await handleTurn();
  for (const turn of turns) {
    if (turn.text) {
      console.debug('Received text: %s\n', turn.text);
    }
    else if (turn.data) {
      console.debug('Received inline data: %s\n', turn.data);
    }
  }

  session.close();
}

async function main() {
  await live().catch((e) => console.error('got error', e));
}

main();

Incremental content updates

Use incremental updates to send text input, establish session context, or restore session context. For short contexts, you can send turn-by-turn interactions to represent the exact sequence of events:

Python

turns = [
    {"role": "user", "parts": [{"text": "What is the capital of France?"}]},
    {"role": "model", "parts": [{"text": "Paris"}]},
]

await session.send_client_content(turns=turns, turn_complete=False)

turns = [{"role": "user", "parts": [{"text": "What is the capital of Germany?"}]}]

await session.send_client_content(turns=turns, turn_complete=True)

JavaScript

let inputTurns = [
  { "role": "user", "parts": [{ "text": "What is the capital of France?" }] },
  { "role": "model", "parts": [{ "text": "Paris" }] },
]

session.sendClientContent({ turns: inputTurns, turnComplete: false })

inputTurns = [{ "role": "user", "parts": [{ "text": "What is the capital of Germany?" }] }]

session.sendClientContent({ turns: inputTurns, turnComplete: true })

For longer contexts, it's recommended to provide a single message summary to free up the context window for subsequent interactions. See Session resumption for another method for loading session context.
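As a rough sketch of the choice described above, the helper below returns the turn history unchanged when it is short and otherwise collapses it into a single summary turn. The name build_context_turns, the max_turns threshold, and the caller-supplied summary string are assumptions made for illustration; the Live API does not prescribe how the summary is produced.

```python
from typing import Dict, List

Turn = Dict[str, object]


def build_context_turns(history: List[Turn], summary: str, max_turns: int = 10) -> List[Turn]:
    """Return turns suitable for send_client_content(turns=...).

    Hypothetical helper: max_turns and the summary wording are illustrative choices.
    """
    if len(history) <= max_turns:
        # Short context: replay the turns one by one to keep the exact sequence.
        return list(history)
    # Long context: one user message carrying a summary of the earlier conversation.
    text = f"Summary of the conversation so far: {summary}"
    return [{"role": "user", "parts": [{"text": text}]}]


history = [
    {"role": "user", "parts": [{"text": "What is the capital of France?"}]},
    {"role": "model", "parts": [{"text": "Paris"}]},
]
short_context = build_context_turns(history, summary="Asked about France.")
long_context = build_context_turns(history * 10, summary="Asked about France.")
```

The returned list can be passed as turns to session.send_client_content with turn_complete=False, followed by the next user turn with turn_complete=True.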

Sending and receiving audio

The most common audio example, audio-to-audio, is covered in the Getting started guide.

Here's an audio-to-text example that reads a WAV file, sends it in the correct format, and receives text output:

Python

# Test file: https://storage.googleapis.com/generativeai-downloads/data/16000.wav
# Install helpers for converting files: pip install librosa soundfile
import asyncio
import io
from pathlib import Path
from google import genai
from google.genai import types
import soundfile as sf
import librosa

client = genai.Client()
model = "gemini-live-2.5-flash-preview"

config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:

        buffer = io.BytesIO()
        y, sr = librosa.load("sample.wav", sr=16000)
        sf.write(buffer, y, sr, format='RAW', subtype='PCM_16')
        buffer.seek(0)
        audio_bytes = buffer.read()

        # If already in correct format, you can use this:
        # audio_bytes = Path("sample.pcm").read_bytes()

        await session.send_realtime_input(
            audio=types.Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000")
        )

        async for response in session.receive():
            if response.text is not None:
                print(response.text)

if __name__ == "__main__":
    asyncio.run(main())

JavaScript

// Test file: https://storage.googleapis.com/generativeai-downloads/data/16000.wav
// Install helpers for converting files: npm install wavefile
import { GoogleGenAI, Modality } from '@google/genai';
import * as fs from "node:fs";
import pkg from 'wavefile';
const { WaveFile } = pkg;

const ai = new GoogleGenAI({});
const model = 'gemini-live-2.5-flash-preview';
const config = { responseModalities: [Modality.TEXT] };

async function live() {
  const responseQueue = [];

  async function waitMessage() {
    let done = false;
    let message = undefined;
    while (!done) {
      message = responseQueue.shift();
      if (message) {
        done = true;
      } else {
        await new Promise((resolve) => setTimeout(resolve, 100));
      }
    }
    return message;
  }

  async function handleTurn() {
    const turns = [];
    let done = false;
    while (!done) {
      const message = await waitMessage();
      turns.push(message);
      if (message.serverContent && message.serverContent.turnComplete) {
        done = true;
      }
    }
    return turns;
  }

  const session = await ai.live.connect({
    model: model,
    callbacks: {
      onopen: function () {
        console.debug('Opened');
      },
      onmessage: function (message) {
        responseQueue.push(message);
      },
      onerror: function (e) {
        console.debug('Error:', e.message);
      },
      onclose: function (e) {
        console.debug('Close:', e.reason);
      },
    },
    config: config,
  });

  // Send Audio Chunk
  const fileBuffer = fs.readFileSync("sample.wav");

  // Ensure audio conforms to API requirements (16-bit PCM, 16kHz, mono)
  const wav = new WaveFile();
  wav.fromBuffer(fileBuffer);
  wav.toSampleRate(16000);
  wav.toBitDepth("16");
  const base64Audio = wav.toBase64();

  // If already in correct format, you can use this:
  // const fileBuffer = fs.readFileSync("sample.pcm");
  // const base64Audio = Buffer.from(fileBuffer).toString('base64');

  session.sendRealtimeInput(
    {
      audio: {
        data: base64Audio,
        mimeType: "audio/pcm;rate=16000"
      }
    }

  );

  const turns = await handleTurn();
  for (const turn of turns) {
    if (turn.text) {
      console.debug('Received text: %s\n', turn.text);
    }
    else if (turn.data) {
      console.debug('Received inline data: %s\n', turn.data);
    }
  }

  session.close();
}

async function main() {
  await live().catch((e) => console.error('got error', e));
}

main();

Here's a text-to-audio example. You can receive audio by setting AUDIO as the response modality. This example saves the received data as a WAV file:

Python

import asyncio
import wave
from google import genai

client = genai.Client()
model = "gemini-live-2.5-flash-preview"

config = {"response_modalities": ["AUDIO"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        wf = wave.open("audio.wav", "wb")
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(24000)

        message = "Hello how are you?"
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": message}]}, turn_complete=True
        )

        async for response in session.receive():
            if response.data is not None:
                wf.writeframes(response.data)

            # Un-comment this code to print audio data info
            # if response.server_content.model_turn is not None:
            #      print(response.server_content.model_turn.parts[0].inline_data.mime_type)

        wf.close()

if __name__ == "__main__":
    asyncio.run(main())

JavaScript

import { GoogleGenAI, Modality } from '@google/genai';
import * as fs from "node:fs";
import pkg from 'wavefile';
const { WaveFile } = pkg;

const ai = new GoogleGenAI({});
const model = 'gemini-live-2.5-flash-preview';
const config = { responseModalities: [Modality.AUDIO] };

async function live() {
  const responseQueue = [];

  async function waitMessage() {
    let done = false;
    let message = undefined;
    while (!done) {
      message = responseQueue.shift();
      if (message) {
        done = true;
      } else {
        await new Promise((resolve) => setTimeout(resolve, 100));
      }
    }
    return message;
  }

  async function handleTurn() {
    const turns = [];
    let done = false;
    while (!done) {
      const message = await waitMessage();
      turns.push(message);
      if (message.serverContent && message.serverContent.turnComplete) {
        done = true;
      }
    }
    return turns;
  }

  const session = await ai.live.connect({
    model: model,
    callbacks: {
      onopen: function () {
        console.debug('Opened');
      },
      onmessage: function (message) {
        responseQueue.push(message);
      },
      onerror: function (e) {
        console.debug('Error:', e.message);
      },
      onclose: function (e) {
        console.debug('Close:', e.reason);
      },
    },
    config: config,
  });

  const inputTurns = 'Hello how are you?';
  session.sendClientContent({ turns: inputTurns });

  const turns = await handleTurn();

  // Combine audio data strings and save as wave file
  const combinedAudio = turns.reduce((acc, turn) => {
    if (turn.data) {
      const buffer = Buffer.from(turn.data, 'base64');
      const intArray = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.byteLength / Int16Array.BYTES_PER_ELEMENT);
      return acc.concat(Array.from(intArray));
    }
    return acc;
  }, []);

  const audioBuffer = new Int16Array(combinedAudio);

  const wf = new WaveFile();
  wf.fromScratch(1, 24000, '16', audioBuffer);
  fs.writeFileSync('output.wav', wf.toBuffer());

  session.close();
}

async function main() {
  await live().catch((e) => console.error('got error', e));
}

main();

Audio formats

Audio data in the Live API is always raw, little-endian, 16-bit PCM. Audio output always uses a sample rate of 24 kHz. Input audio is natively 16 kHz, but the Live API resamples it if needed, so any sample rate can be sent. To convey the sample rate of input audio, set the MIME type of each audio-containing Blob to a value like audio/pcm;rate=16000.
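To make the format concrete, here is a small standard-library sketch that packs float samples into raw little-endian 16-bit PCM, builds the matching MIME type, and computes the duration of a buffer. The helper names are hypothetical, and the duration calculation assumes mono audio, as in the examples above.

```python
import struct
from typing import Sequence


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert samples in [-1.0, 1.0] to raw little-endian 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm_mime_type(sample_rate_hz: int) -> str:
    """Build the MIME type that conveys the input sample rate."""
    return f"audio/pcm;rate={sample_rate_hz}"


def pcm16_duration_seconds(data: bytes, sample_rate_hz: int) -> float:
    """Duration of mono 16-bit PCM: two bytes per sample."""
    return len(data) / 2 / sample_rate_hz


pcm = float_to_pcm16([0.0, 1.0, -1.0])
mime = pcm_mime_type(16000)
# One second of 24 kHz output audio occupies 48000 bytes.
one_second = pcm16_duration_seconds(b"\x00" * 48000, 24000)
```

The bytes returned by float_to_pcm16 and the string from pcm_mime_type correspond to the data and mime_type arguments of types.Blob in the examples above.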

Audio transcriptions

You can enable transcription of the model's audio output by sending output_audio_transcription in the setup config. The transcription language is inferred from the model's response.

Python

import asyncio
from google import genai
from google.genai import types

client = genai.Client()
model = "gemini-live-2.5-flash-preview"

config = {
    "response_modalities": ["AUDIO"],
    "output_audio_transcription": {},
}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        message = "Hello? Gemini are you there?"

        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": message}]}, turn_complete=True
        )

        async for response in session.receive():
            content = response.server_content
            if content is None:
                continue  # skip messages that carry no server content
            if content.model_turn:
                print("Model turn:", content.model_turn)
            if content.output_transcription:
                print("Transcript:", content.output_transcription.text)

if __name__ == "__main__":
    asyncio.run(main())

JavaScript

import { GoogleGenAI, Modality } from '@google/genai';

const ai = new GoogleGenAI({});
const model = 'gemini-live-2.5-flash-preview';

const config = {
  responseModalities: [Modality.AUDIO],
  outputAudioTranscription: {}
};

async function live() {
  const responseQueue = [];

  async function waitMessage() {
    let done = false;
    let message = undefined;
    while (!done) {
      message = responseQueue.shift();
      if (message) {
        done = true;
      } else {
        await new Promise((resolve) => setTimeout(resolve, 100));
      }
    }
    return message;
  }

  async function handleTurn() {
    const turns = [];
    let done = false;
    while (!done) {
      const message = await waitMessage();
      turns.push(message);
      if (message.serverContent && message.serverContent.turnComplete) {
        done = true;
      }
    }
    return turns;
  }

  const session = await ai.live.connect({
    model: model,
    callbacks: {
      onopen: function () {
        console.debug('Opened');
      },
      onmessage: function (message) {
        responseQueue.push(message);
      },
      onerror: function (e) {
        console.debug('Error:', e.message);
      },
      onclose: function (e) {
        console.debug('Close:', e.reason);
      },
    },
    config: config,
  });

  const inputTurns = 'Hello how are you?';
  session.sendClientContent({ turns: inputTurns });

  const turns = await handleTurn();

  for (const turn of turns) {
    if (turn.serverContent && turn.serverContent.outputTranscription) {
      console.debug('Received output transcription: %s\n', turn.serverContent.outputTranscription.text);
    }
  }

  session.close();
}

async function main() {
  await live().catch((e) => console.error('got error', e));
}

main();

You can enable transcription of the audio input by sending input_audio_transcription in the setup config.

Python

import asyncio
from pathlib import Path
from google import genai
from google.genai import types

client = genai.Client()
model = "gemini-live-2.5-flash-preview"

config = {
    "response_modalities": ["TEXT"],
    "input_audio_transcription": {},
}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        audio_data = Path("16000.pcm").read_bytes()

        await session.send_realtime_input(
            audio=types.Blob(data=audio_data, mime_type='audio/pcm;rate=16000')
        )

        async for msg in session.receive():
            content = msg.server_content
            # server_content can be absent on some messages, so check it first.
            if content and content.input_transcription:
                print('Transcript:', content.input_transcription.text)

if __name__ == "__main__":
    asyncio.run(main())

JavaScript

import { GoogleGenAI, Modality } from '@google/genai';
import * as fs from "node:fs";
import pkg from 'wavefile';
const { WaveFile } = pkg;

const ai = new GoogleGenAI({});
const model = 'gemini-live-2.5-flash-preview';

const config = {
  responseModalities: [Modality.TEXT],
  inputAudioTranscription: {}
};

async function live() {
  const responseQueue = [];

  async function waitMessage() {
    let done = false;
    let message = undefined;
    while (!done) {
      message = responseQueue.shift();
      if (message) {
        done = true;
      } else {
        await new Promise((resolve) => setTimeout(resolve, 100));
      }
    }
    return message;
  }

  async function handleTurn() {
    const turns = [];
    let done = false;
    while (!done) {
      const message = await waitMessage();
      turns.push(message);
      if (message.serverContent && message.serverContent.turnComplete) {
        done = true;
      }
    }
    return turns;
  }

  const session = await ai.live.connect({
    model: model,
    callbacks: {
      onopen: function () {
        console.debug('Opened');
      },
      onmessage: function (message) {
        responseQueue.push(message);
      },
      onerror: function (e) {
        console.debug('Error:', e.message);
      },
      onclose: function (e) {
        console.debug('Close:', e.reason);
      },
    },
    config: config,
  });

  // Send Audio Chunk
  const fileBuffer = fs.readFileSync("16000.wav");

  // Ensure audio conforms to API requirements (16-bit PCM, 16kHz, mono)
  const wav = new WaveFile();
  wav.fromBuffer(fileBuffer);
  wav.toSampleRate(16000);
  wav.toBitDepth("16");
  const base64Audio = wav.toBase64();

  // If already in correct format, you can use this:
  // const fileBuffer = fs.readFileSync("sample.pcm");
  // const base64Audio = Buffer.from(fileBuffer).toString('base64');

  session.sendRealtimeInput(
    {
      audio: {
        data: base64Audio,
        mimeType: "audio/pcm;rate=16000"
      }
    }
  );

  const turns = await handleTurn();

  for (const turn of turns) {
    // The input transcription arrives in serverContent, independently of the model's reply.
    if (turn.serverContent && turn.serverContent.inputTranscription) {
      console.debug('Received input transcription: %s\n', turn.serverContent.inputTranscription.text);
    }
    if (turn.text) {
      console.debug('Received text: %s\n', turn.text);
    }
    else if (turn.data) {
      console.debug('Received inline data: %s\n', turn.data);
    }
  }

  session.close();
}

async function main() {
  await live().catch((e) => console.error('got error', e));
}

main();
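The sketch below joins transcription text into complete input and output transcripts. It assumes that the text may arrive in several fragments spread over multiple server messages, which is not stated above, and it uses SimpleNamespace stand-ins that only mimic the attribute layout used in the Python examples. The function name collect_transcripts is hypothetical.

```python
from types import SimpleNamespace
from typing import Dict, Iterable, List


def collect_transcripts(messages: Iterable[object]) -> Dict[str, str]:
    """Join input and output transcription fragments from a stream of server messages."""
    parts: Dict[str, List[str]] = {"input": [], "output": []}
    for msg in messages:
        content = getattr(msg, "server_content", None)
        if content is None:
            continue  # message without server content
        if getattr(content, "input_transcription", None):
            parts["input"].append(content.input_transcription.text or "")
        if getattr(content, "output_transcription", None):
            parts["output"].append(content.output_transcription.text or "")
    return {key: "".join(chunks) for key, chunks in parts.items()}


# Stand-in messages; a real session would yield these from session.receive().
fake_messages = [
    SimpleNamespace(server_content=SimpleNamespace(
        input_transcription=SimpleNamespace(text="Hello, "), output_transcription=None)),
    SimpleNamespace(server_content=None),
    SimpleNamespace(server_content=SimpleNamespace(
        input_transcription=SimpleNamespace(text="are you there?"), output_transcription=None)),
]
transcripts = collect_transcripts(fake_messages)
```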

Stream audio and video

Change voice and language

The Live API models each support a different set of voices. Half-cascade supports Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr. Native audio supports a much longer list (identical to the TTS model list). You can listen to all the voices in AI Studio.

To specify a voice, set the voice name within the speechConfig object as part of the session configuration:

Python

config = {
    "response_modalities": ["AUDIO"],
    "speech_config": {
        "voice_config": {"prebuilt_voice_config": {"voice_name": "Kore"}}
    },
}

JavaScript

const config = {
  responseModalities: [Modality.AUDIO],
  speechConfig: { voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" } } }
};
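A client can catch a misspelled voice name before connecting. The following sketch checks a name against the half-cascade voices listed above and builds the same config dictionary as the Python example. The helper name is hypothetical, the list may change over time, and it does not cover the longer native audio list.

```python
HALF_CASCADE_VOICES = (
    "Puck", "Charon", "Kore", "Fenrir", "Aoede", "Leda", "Orus", "Zephyr",
)


def half_cascade_audio_config(voice_name: str) -> dict:
    """Build an AUDIO session config for one of the half-cascade voices."""
    if voice_name not in HALF_CASCADE_VOICES:
        raise ValueError(
            f"Unknown half-cascade voice {voice_name!r}; "
            f"expected one of {', '.join(HALF_CASCADE_VOICES)}"
        )
    return {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {"prebuilt_voice_config": {"voice_name": voice_name}}
        },
    }


voice_config = half_cascade_audio_config("Puck")
```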

The Live API supports multiple languages.

To change the language, set the language code within the speechConfig object as part of the session configuration:

Python

config = {
    "response_modalities": ["AUDIO"],
    "speech_config": {
        "language_code": "de-DE"
    }
}

JavaScript

const config = {
  responseModalities: [Modality.AUDIO],
  speechConfig: { languageCode: "de-DE" }
};

Native audio capabilities

The following capabilities are only available with native audio. You can learn more about native audio in Choose a model and audio generation.

How to use native audio output

To use native audio output, configure one of the native audio models and set response_modalities to AUDIO.

See Sending and receiving audio for a full example.

Python

model = "gemini-2.5-flash-preview-native-audio-dialog"
config = types.LiveConnectConfig(response_modalities=["AUDIO"])

async with client.aio.live.connect(model=model, config=config) as session:
    # Send audio input and receive audio

JavaScript

const model = 'gemini-2.5-flash-preview-native-audio-dialog';
const config = { responseModalities: [Modality.AUDIO] };

async function main() {

  const session = await ai.live.connect({
    model: model,
    config: config,
    callbacks: ...,
  });

  // Send audio input and receive audio

  session.close();
}

main();

Affective dialog

This feature lets Gemini adapt its response style to the expression and tone of the input.

To use affective dialog, set the API version to v1alpha and set enable_affective_dialog to true in the setup message:

Python

client = genai.Client(http_options={"api_version": "v1alpha"})

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    enable_affective_dialog=True
)

JavaScript

const ai = new GoogleGenAI({ httpOptions: {"apiVersion": "v1alpha"} });

const config = {
  responseModalities: [Modality.AUDIO],
  enableAffectiveDialog: true
};

Note that affective dialog is currently only supported by the native audio output models.

Proactive audio

When this feature is enabled, Gemini can proactively decide not to respond if the content is not relevant.

To use it, set the API version to v1alpha, configure the proactivity field in the setup message, and set proactive_audio to true:

Python

client = genai.Client(http_options={"api_version": "v1alpha"})

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    proactivity={'proactive_audio': True}
)

JavaScript

const ai = new GoogleGenAI({ httpOptions: {"apiVersion": "v1alpha"} });

const config = {
  responseModalities: [Modality.AUDIO],
  proactivity: { proactiveAudio: true }
}

Note that proactive audio is currently only supported by the native audio output models.

Native audio output with thinking

Native audio output supports thinking capabilities, available through a separate model, gemini-2.5-flash-exp-native-audio-thinking-dialog.

See Sending and receiving audio for a full example.

Python

model = "gemini-2.5-flash-exp-native-audio-thinking-dialog"
config = types.LiveConnectConfig(response_modalities=["AUDIO"])

async with client.aio.live.connect(model=model, config=config) as session:
    # Send audio input and receive audio

JavaScript

const model = 'gemini-2.5-flash-exp-native-audio-thinking-dialog';
const config = { responseModalities: [Modality.AUDIO] };

async function main() {

  const session = await ai.live.connect({
    model: model,
    config: config,
    callbacks: ...,
  });

  // Send audio input and receive audio

  session.close();
}

main();

Voice Activity Detection (VAD)

Voice Activity Detection (VAD) allows the model to recognize when a person is speaking. This is essential for creating natural conversations, because it allows a user to interrupt the model at any time.

When VAD detects an interruption, the ongoing generation is canceled and discarded. Only the information already sent to the client is retained in the session history. The server then sends a BidiGenerateContentServerContent message to report the interruption.

The Gemini server then discards any pending function calls and sends a BidiGenerateContentServerContent message with the IDs of the canceled calls.

Python

async for response in session.receive():
    # server_content can be absent on some messages, so check it first.
    if response.server_content and response.server_content.interrupted:
        # The generation was interrupted

        # If realtime playback is implemented in your application,
        # you should stop playing audio and clear queued playback here.
        pass

JavaScript

const turns = await handleTurn();

for (const turn of turns) {
  if (turn.serverContent && turn.serverContent.interrupted) {
    // The generation was interrupted

    // If realtime playback is implemented in your application,
    // you should stop playing audio and clear queued playback here.
  }
}
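One way to act on the interrupted flag is to keep received audio in a client-side queue and empty it when an interruption is reported. The sketch below is a minimal illustration under that assumption: PlaybackQueue is a hypothetical class, and the SimpleNamespace objects stand in for the messages a real session would yield.

```python
from collections import deque
from types import SimpleNamespace


class PlaybackQueue:
    """Hypothetical client-side buffer of audio chunks waiting to be played."""

    def __init__(self) -> None:
        self.chunks = deque()

    def handle(self, response) -> None:
        content = getattr(response, "server_content", None)
        if content is not None and getattr(content, "interrupted", False):
            # The generation was interrupted: drop everything not yet played.
            self.chunks.clear()
            return
        data = getattr(response, "data", None)
        if data:
            self.chunks.append(data)


queue = PlaybackQueue()
queue.handle(SimpleNamespace(server_content=None, data=b"\x01\x02"))
queue.handle(SimpleNamespace(server_content=None, data=b"\x03\x04"))
queued_before = len(queue.chunks)
queue.handle(SimpleNamespace(server_content=SimpleNamespace(interrupted=True), data=None))
queued_after = len(queue.chunks)
```

A real player would also need to stop the chunk that is currently being played, which this sketch leaves out.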

Automatic VAD

By default, the model automatically performs VAD on a continuous audio input stream. VAD can be configured with the realtimeInputConfig.automaticActivityDetection field of the setup configuration.

When the audio stream is paused for more than a second (for example, because the user switched off the microphone), an audioStreamEnd event should be sent to flush any cached audio. The client can resume sending audio data at any time.

Python

# example audio file to try:
# URL = "https://storage.googleapis.com/generativeai-downloads/data/hello_are_you_there.pcm"
# !wget -q $URL -O sample.pcm
import asyncio
from pathlib import Path
from google import genai
from google.genai import types

client = genai.Client()
model = "gemini-live-2.5-flash-preview"

config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        audio_bytes = Path("sample.pcm").read_bytes()

        await session.send_realtime_input(
            audio=types.Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000")
        )

        # if stream gets paused, send:
        # await session.send_realtime_input(audio_stream_end=True)

        async for response in session.receive():
            if response.text is not None:
                print(response.text)

if __name__ == "__main__":
    asyncio.run(main())

JavaScript

// example audio file to try:
// URL = "https://storage.googleapis.com/generativeai-downloads/data/hello_are_you_there.pcm"
// !wget -q $URL -O sample.pcm
import { GoogleGenAI, Modality } from '@google/genai';
import * as fs from "node:fs";

const ai = new GoogleGenAI({});
const model = 'gemini-live-2.5-flash-preview';
const config = { responseModalities: [Modality.TEXT] };

async function live() {
  const responseQueue = [];

  async function waitMessage() {
    let done = false;
    let message = undefined;
    while (!done) {
      message = responseQueue.shift();
      if (message) {
        done = true;
      } else {
        await new Promise((resolve) => setTimeout(resolve, 100));
      }
    }
    return message;
  }

  async function handleTurn() {
    const turns = [];
    let done = false;
    while (!done) {
      const message = await waitMessage();
      turns.push(message);
      if (message.serverContent && message.serverContent.turnComplete) {
        done = true;
      }
    }
    return turns;
  }

  const session = await ai.live.connect({
    model: model,
    callbacks: {
      onopen: function () {
        console.debug('Opened');
      },
      onmessage: function (message) {
        responseQueue.push(message);
      },
      onerror: function (e) {
        console.debug('Error:', e.message);
      },
      onclose: function (e) {
        console.debug('Close:', e.reason);
      },
    },
    config: config,
  });

  // Send Audio Chunk
  const fileBuffer = fs.readFileSync("sample.pcm");
  const base64Audio = Buffer.from(fileBuffer).toString('base64');

  session.sendRealtimeInput(
    {
      audio: {
        data: base64Audio,
        mimeType: "audio/pcm;rate=16000"
      }
    }

  );

  // if stream gets paused, send:
  // session.sendRealtimeInput({ audioStreamEnd: true })

  const turns = await handleTurn();
  for (const turn of turns) {
    if (turn.text) {
      console.debug('Received text: %s\n', turn.text);
    }
    else if (turn.data) {
      console.debug('Received inline data: %s\n', turn.data);
    }
  }

  session.close();
}

async function main() {
  await live().catch((e) => console.error('got error', e));
}

main();

With send_realtime_input, the API responds to audio automatically based on VAD. While send_client_content adds messages to the model context in order, send_realtime_input is optimized for responsiveness at the expense of deterministic ordering.
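The one-second pause rule for audioStreamEnd can be tracked on the client with a small timer. This is a sketch only: StreamPauseTracker is a hypothetical class, and timestamps are passed in explicitly so that the logic can be exercised without a microphone or a live session.

```python
from typing import Optional


class StreamPauseTracker:
    """Decides when an audio_stream_end event is due after the audio stream goes quiet."""

    def __init__(self, pause_threshold_s: float = 1.0) -> None:
        self.pause_threshold_s = pause_threshold_s
        self.last_chunk_at: Optional[float] = None
        self.end_sent = False

    def on_chunk_sent(self, now: float) -> None:
        """Record that an audio chunk was sent; streaming has started or resumed."""
        self.last_chunk_at = now
        self.end_sent = False

    def should_send_stream_end(self, now: float) -> bool:
        """Return True once per pause, after the threshold has elapsed."""
        if self.last_chunk_at is None or self.end_sent:
            return False
        if now - self.last_chunk_at > self.pause_threshold_s:
            self.end_sent = True
            return True
        return False


tracker = StreamPauseTracker()
tracker.on_chunk_sent(now=0.0)
checks = [
    tracker.should_send_stream_end(now=0.5),  # pause still shorter than one second
    tracker.should_send_stream_end(now=1.5),  # pause exceeded the threshold
    tracker.should_send_stream_end(now=2.0),  # already signalled for this pause
]
```

In an application, now would typically come from time.monotonic(), and a True result would be followed by await session.send_realtime_input(audio_stream_end=True), as in the commented line of the Python example above.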

Automatic VAD configuration

For more control over the VAD activity, you can configure the following parameters. See the API reference for more info.

Python

from google.genai import types

config = {
    "response_modalities": ["TEXT"],
    "realtime_input_config": {
        "automatic_activity_detection": {
            "disabled": False, # default
            "start_of_speech_sensitivity": types.StartSensitivity.START_SENSITIVITY_LOW,
            "end_of_speech_sensitivity": types.EndSensitivity.END_SENSITIVITY_LOW,
            "prefix_padding_ms": 20,
            "silence_duration_ms": 100,
        }
    }
}

JavaScript

import { GoogleGenAI, Modality, StartSensitivity, EndSensitivity } from '@google/genai';

const config = {
  responseModalities: [Modality.TEXT],
  realtimeInputConfig: {
    automaticActivityDetection: {
      disabled: false, // default
      startOfSpeechSensitivity: StartSensitivity.START_SENSITIVITY_LOW,
      endOfSpeechSensitivity: EndSensitivity.END_SENSITIVITY_LOW,
      prefixPaddingMs: 20,
      silenceDurationMs: 100,
    }
  }
};

Disable automatic VAD

Alternatively, you can disable automatic VAD by setting realtimeInputConfig.automaticActivityDetection.disabled to true in the setup message. In this configuration, the client is responsible for detecting user speech and sending activityStart and activityEnd messages at the appropriate times. An audioStreamEnd isn't sent in this configuration. Instead, any interruption of the stream is marked by an activityEnd message.

Python

config = {
    "response_modalities": ["TEXT"],
    "realtime_input_config": {"automatic_activity_detection": {"disabled": True}},
}

async with client.aio.live.connect(model=model, config=config) as session:
    # ...
    await session.send_realtime_input(activity_start=types.ActivityStart())
    await session.send_realtime_input(
        audio=types.Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000")
    )
    await session.send_realtime_input(activity_end=types.ActivityEnd())
    # ...

JavaScript

const config = {
  responseModalities: [Modality.TEXT],
  realtimeInputConfig: {
    automaticActivityDetection: {
      disabled: true,
    }
  }
};

session.sendRealtimeInput({ activityStart: {} });

session.sendRealtimeInput({
  audio: {
    data: base64Audio,
    mimeType: "audio/pcm;rate=16000"
  }
});

session.sendRealtimeInput({ activityEnd: {} });
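
With automatic VAD disabled, the application has to decide where speech starts and ends. The following is a minimal sketch of one way to do that, assuming 16-bit little-endian mono PCM frames and a simple energy threshold. The function names, the threshold value, and the hangover frame count are illustrative assumptions and are not part of the Live API; the sketch only yields markers for where the activity messages would be sent.

```python
import struct
from typing import Iterable, Iterator, Tuple


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return (sum(s * s for s in samples) / count) ** 0.5


def detect_activity(
    frames: Iterable[bytes],
    threshold: float = 500.0,  # assumed value; tune for your input device
    hangover_frames: int = 3,  # quiet frames tolerated before ending activity
) -> Iterator[Tuple[str, bytes]]:
    """Yield ("activity_start", b""), ("audio", frame), ("activity_end", b"")."""
    speaking = False
    quiet = 0
    for frame in frames:
        loud = frame_rms(frame) >= threshold
        if not speaking:
            if loud:
                # First loud frame opens a new activity.
                speaking = True
                quiet = 0
                yield ("activity_start", b"")
                yield ("audio", frame)
            continue
        # Inside an activity: forward every frame, count trailing quiet frames.
        yield ("audio", frame)
        quiet = 0 if loud else quiet + 1
        if quiet >= hangover_frames:
            speaking = False
            yield ("activity_end", b"")
    if speaking:
        # The stream stopped mid-activity, so close it explicitly.
        yield ("activity_end", b"")
```

In a live session, each event would map to one of the send_realtime_input calls shown above: activity_start and activity_end to the corresponding activity messages, and audio to a Blob with mime type audio/pcm;rate=16000.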

Token count

You can find the total number of consumed tokens in the usageMetadata field of the returned server message.

Python

async for message in session.receive():
    # The server will periodically send messages that include UsageMetadata.
    if message.usage_metadata:
        usage = message.usage_metadata
        print(
            f"Used {usage.total_token_count} tokens in total. Response token breakdown:"
        )
        for detail in usage.response_tokens_details:
            match detail:
                case types.ModalityTokenCount(modality=modality, token_count=count):
                    print(f"{modality}: {count}")

JavaScript

const turns = await handleTurn();

for (const turn of turns) {
  if (turn.usageMetadata) {
    console.debug('Used %s tokens in total. Response token breakdown:\n', turn.usageMetadata.totalTokenCount);

    for (const detail of turn.usageMetadata.responseTokensDetails) {
      console.debug('%s\n', detail);
    }
  }
}
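
If you want the per-modality breakdown as data rather than printed text, a small helper can collect it. The sketch below assumes only the field names used in the samples above (total_token_count, response_tokens_details, modality, token_count). The FakeUsage and FakeModalityCount classes are hypothetical stand-ins so the example runs without the SDK, and treating a missing details list as empty is a defensive assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeModalityCount:
    """Hypothetical stand-in for one entry of response_tokens_details."""
    modality: str
    token_count: Optional[int] = None


@dataclass
class FakeUsage:
    """Hypothetical stand-in for a message's usage_metadata."""
    total_token_count: Optional[int] = None
    response_tokens_details: Optional[List[FakeModalityCount]] = None


def summarize_usage(usage) -> Dict[str, int]:
    """Return response token counts keyed by modality name."""
    totals: Dict[str, int] = {}
    # Assumption: the details list may be absent, so fall back to an empty list.
    for detail in usage.response_tokens_details or []:
        key = str(detail.modality)
        totals[key] = totals.get(key, 0) + (detail.token_count or 0)
    return totals


usage = FakeUsage(
    total_token_count=10,
    response_tokens_details=[
        FakeModalityCount("TEXT", 4),
        FakeModalityCount("AUDIO", 6),
    ],
)
print(summarize_usage(usage))
```

Inside the receive loop, the same helper could be called with message.usage_metadata whenever that field is set.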

Media resolution

You can specify the media resolution for the input media by setting the mediaResolution field as part of the session configuration:

Python

from google.genai import types

config = {
    "response_modalities": ["AUDIO"],
    "media_resolution": types.MediaResolution.MEDIA_RESOLUTION_LOW,
}

JavaScript

import { GoogleGenAI, Modality, MediaResolution } from '@google/genai';

const config = {
    responseModalities: [Modality.TEXT],
    mediaResolution: MediaResolution.MEDIA_RESOLUTION_LOW,
};

Limitations

Consider the following limitations of the Live API when you plan your project.

Response modalities

You can set only one response modality (TEXT or AUDIO) per session in the session configuration. Setting both results in a config error message. This means you can configure the model to respond with either text or audio, but not both in the same session.
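
Because a config with both modalities is rejected only once the session is set up, it can help to check the config locally first. The following is a minimal sketch, assuming a dict-style config with a "response_modalities" list of strings as in the Python samples on this page. The helper name check_response_modalities is hypothetical and not part of the SDK.

```python
from typing import Any, Mapping, Optional

ALLOWED_MODALITIES = {"TEXT", "AUDIO"}


def check_response_modalities(config: Mapping[str, Any]) -> Optional[str]:
    """Return the single configured modality, or None if none is set.

    Raises ValueError when more than one modality or an unknown value is given.
    """
    modalities = list(config.get("response_modalities") or [])
    if not modalities:
        # Nothing configured locally; the API's own default would apply.
        return None
    if len(modalities) > 1:
        raise ValueError(
            f"Only one response modality is allowed per session, got {modalities}"
        )
    modality = str(modalities[0]).upper()
    if modality not in ALLOWED_MODALITIES:
        raise ValueError(f"Unknown response modality: {modalities[0]!r}")
    return modality


print(check_response_modalities({"response_modalities": ["TEXT"]}))
```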

Client authentication

By default, the Live API provides only server-to-server authentication. If you're implementing your Live API application using a client-to-server approach, you need to use ephemeral tokens to mitigate security risks.

Session duration

Audio-only sessions are limited to 15 minutes, and sessions with both audio and video are limited to 2 minutes. However, you can configure different session management techniques to extend the session duration an unlimited number of times.

Context window

A session has a context window limit of:

128k tokens for native audio output models
32k tokens for other Live API models

Supported languages

The Live API supports the following languages.

Language	BCP-47 Code	Language	BCP-47 Code
German (Germany)	`de-DE`	English (Australia)*	`en-AU`
English (UK)*	`en-GB`	English (India)	`en-IN`
English (US)	`en-US`	Spanish (US)	`es-US`
French (France)	`fr-FR`	Hindi (India)	`hi-IN`
Portuguese (Brazil)	`pt-BR`	Arabic (Generic)	`ar-XA`
Spanish (Spain)*	`es-ES`	French (Canada)*	`fr-CA`
Indonesian (Indonesia)	`id-ID`	Italian (Italy)	`it-IT`
Japanese (Japan)	`ja-JP`	Turkish (Turkey)	`tr-TR`
Vietnamese (Vietnam)	`vi-VN`	Bengali (India)	`bn-IN`
Gujarati (India)*	`gu-IN`	Kannada (India)*	`kn-IN`
Marathi (India)	`mr-IN`	Malayalam (India)*	`ml-IN`
Tamil (India)	`ta-IN`	Telugu (India)	`te-IN`
Dutch (Netherlands)	`nl-NL`	Korean (South Korea)	`ko-KR`
Mandarin Chinese (China)*	`cmn-CN`	Polish (Poland)	`pl-PL`
Russian (Russia)	`ru-RU`	Thai (Thailand)	`th-TH`

Languages marked with an asterisk (*) are not available for Native audio.
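
If your application lets users pick a language, the table above can be encoded as a lookup. This is a minimal sketch built only from the codes listed on this page. The names SUPPORTED_LANGUAGES, NO_NATIVE_AUDIO, and native_audio_supported are hypothetical, and the sets would need updating whenever the table changes.

```python
# BCP-47 codes copied from the table above.
SUPPORTED_LANGUAGES = {
    "de-DE", "en-AU", "en-GB", "en-IN", "en-US", "es-US",
    "fr-FR", "hi-IN", "pt-BR", "ar-XA", "es-ES", "fr-CA",
    "id-ID", "it-IT", "ja-JP", "tr-TR", "vi-VN", "bn-IN",
    "gu-IN", "kn-IN", "mr-IN", "ml-IN", "ta-IN", "te-IN",
    "nl-NL", "ko-KR", "cmn-CN", "pl-PL", "ru-RU", "th-TH",
}

# Codes marked with an asterisk in the table.
NO_NATIVE_AUDIO = {
    "en-AU", "en-GB", "es-ES", "fr-CA",
    "gu-IN", "kn-IN", "ml-IN", "cmn-CN",
}


def native_audio_supported(code: str) -> bool:
    """Return True if the language code is listed without an asterisk."""
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"{code!r} is not in the supported language table")
    return code not in NO_NATIVE_AUDIO


print(native_audio_supported("vi-VN"))
print(native_audio_supported("en-GB"))
```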

What's next

Read the Tool use and Session management guides for essential information on using the Live API effectively.
Try the Live API in Google AI Studio.
For more information about the Live API models, see Gemini 2.0 Flash Live and Gemini 2.5 Flash Native Audio on the Models page.
Try more examples in the Live API cookbook, the Live API Tools cookbook, and the Live API Get Started script.