
support timed transcripts from tts #2580


Merged: 28 commits merged into theo/agents1.2 on Jul 1, 2025
Conversation

@longcw (Contributor) commented Jun 12, 2025

@longcw requested a review from a team on Jun 12, 2025 07:54
github-actions bot commented Jun 12, 2025

✅ Changeset File Detected

The following changeset entries were found:

  • patch - livekit-agents
  • patch - livekit-plugins-cartesia
  • patch - livekit-plugins-elevenlabs

Change description:
support aligned transcripts with timestamps from tts (#2580)

@@ -1089,6 +1076,12 @@ def _on_first_frame(_: asyncio.Future[None]) -> None:
                model_settings=model_settings,
            )
            tasks.append(tts_task)
            if (
                (tts := self.tts)
                and (tts.capabilities.timed_transcript or not tts.capabilities.streaming)
@longcw (Contributor, Author) commented:

I have a concern here: if a user creates a new AudioFrame in a customized tts_node but doesn't forward the timed transcripts, we may miss the text response. wdyt? @theomonnom
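
For illustration, a minimal sketch of that failure mode, assuming frames carry the transcripts in userdata (as in the snippets later in this thread); transform and the override shape are illustrative, not the actual API:

    # Hedged sketch: a customized tts_node that creates new AudioFrames.
    # `transform` is a hypothetical per-frame operation (e.g. resampling).
    async def tts_node(self, text, model_settings):
        async for frame in Agent.default.tts_node(self, text, model_settings):
            new_frame = transform(frame)  # returns a fresh AudioFrame
            # without this copy, the timed transcripts are silently dropped
            # and the text response may be lost downstream:
            new_frame.userdata["timed_transcripts"] = frame.userdata.get("timed_transcripts", [])
            yield new_frame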

@@ -233,8 +234,12 @@ def llm_node(
        return Agent.default.llm_node(self, chat_ctx, tools, model_settings)

    def transcription_node(
-        self, text: AsyncIterable[str], model_settings: ModelSettings
+        self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
    ) -> AsyncIterable[str] | Coroutine[Any, Any, AsyncIterable[str]] | Coroutine[Any, Any, None]:
@theomonnom (Member) commented:

TimedString is a str; to be fair, I'm not even sure we should encourage people to use the timed transcripts here.

Suggested change:
-        self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
+        self, text: AsyncIterable[str], model_settings: ModelSettings

@longcw (Contributor, Author) commented:

Do you have any suggestions for an alternative way for people to access the timed transcripts?

@theomonnom (Member) commented:

IMO we should just keep it implicit for now.

@longcw (Contributor, Author) commented:

> IMO we should just keep it implicit for now

I understand the concern that this may change in the future, but IMO we should expose the timed transcripts to users in some way; folks have been asking for this for a while. Any alternatives for this?
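
For context, a hedged sketch of what reading the timed transcripts in an overridden transcription_node could look like; the start_time/end_time attribute names are assumptions, not confirmed in this thread:

    async def transcription_node(self, text, model_settings):
        async for chunk in text:
            if isinstance(chunk, TimedString):
                # start_time/end_time are assumed attribute names
                print(f"{chunk} [{chunk.start_time:.2f}s - {chunk.end_time:.2f}s]")
            yield chunk  # TimedString is a str, so downstream consumers are unaffected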

            return

        if last_frame is not None:
            last_frame.user_data["timed_transcripts"] = timed_transcripts
@theomonnom (Member) commented Jun 16, 2025:

Is the idea to send the timed transcripts at the end of the segment? If so, maybe this could just be a new field on SynthesizedAudio. (This would also mean we don't have synchronized transcripts until the whole generation is done?)

@longcw (Contributor, Author) commented:

It's not sent at the end of the segment; it usually arrives at the start of the TTS output.

@longcw (Contributor, Author) commented:

The buffered timed_transcripts are always added to the next audio frame, not only to the last frame.
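
A minimal sketch of that buffering behavior, with assumed names (the event stream and its fields are illustrative):

    async def _forward_frames(tts_stream):
        timed_transcripts = []
        async for event in tts_stream:                         # assumed event stream
            timed_transcripts.extend(event.timed_transcripts)  # assumed field
            if event.frame is not None:
                # everything buffered so far goes onto the *next* outgoing frame
                event.frame.userdata["timed_transcripts"] = timed_transcripts
                timed_transcripts = []
                yield event.frame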

@tbachlechner commented:

Just adding my support for this one; excited to see you hopefully get this out soon!

@longcw requested a review from theomonnom on Jun 19, 2025 02:27
@theomonnom (Member) left a comment:

lgtm! this is awesome!

        async for audio_frame in tts_node:
            for text in audio_frame.userdata.get("timed_transcripts", []):
@theomonnom (Member) commented:

Let's move the key into a constant somewhere, e.g. in https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/types.py (with an lk. prefix).
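
A hedged sketch of that suggestion; the constant name is illustrative:

    # livekit-agents/livekit/agents/types.py (hypothetical constant name)
    ATTRIBUTE_TIMED_TRANSCRIPTS = "lk.timed_transcripts"

    # call site:
    async for audio_frame in tts_node:
        for text in audio_frame.userdata.get(ATTRIBUTE_TIMED_TRANSCRIPTS, []):
            ...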


    return url


def _to_timed_words(
@theomonnom (Member) commented:

It's a bit difficult to understand this code without actually running it. Not urgent, but it would be ideal to have some tests in place at some point.
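
A hedged sketch of the kind of test being asked for, assuming _to_timed_words groups per-character timings into word-level TimedStrings; the signature and attributes are assumptions, not the confirmed contract:

    def test_to_timed_words() -> None:
        chars = list("hi there")
        start_times = [i * 0.1 for i in range(len(chars))]
        durations = [0.1] * len(chars)
        words = _to_timed_words(chars, start_times, durations)  # assumed signature
        assert [str(w).strip() for w in words] == ["hi", "there"]
        assert words[0].start_time == 0.0  # assumed attribute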

@theomonnom (Member) commented:

Should we release a new version of livekit-rtc before merging?

@@ -41,6 +44,8 @@ class SynthesizedAudio:
class TTSCapabilities:
    streaming: bool
    """Whether this TTS supports streaming (generally using websockets)"""
    timed_transcript: bool = False
@theomonnom (Member) commented Jun 19, 2025:

We use aligned_transcript in other parts of the code (AgentSession has use_tts_aligned_transcript). Let's use aligned here too?

@longcw (Contributor, Author) commented:

Updated.
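
After the rename, the capability would read roughly as follows (the docstring on the new field is a sketch):

    class TTSCapabilities:
        streaming: bool
        """Whether this TTS supports streaming (generally using websockets)"""
        aligned_transcript: bool = False
        """Whether this TTS can return transcripts aligned with the audio (sketch)"""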

@sansjack commented:

> Should we release a new version of livekit-rtc before merging?

When is the ETA for this to be added?

@longcw changed the base branch from main to theo/agents1.2 on July 1, 2025 09:33
    def __init__(self):
        super().__init__(instructions="You are a helpful assistant.")

        self._closing_task: asyncio.Task[None] | None = None
@theomonnom (Member) commented:

Suggested change (remove the line):
-        self._closing_task: asyncio.Task[None] | None = None

        sample_rate=tts.sample_rate,
        num_channels=tts.num_channels,
    )
    self._wrapped_tts = tts
-   self._sentence_tokenizer = sentence_tokenizer or tokenize.blingfire.SentenceTokenizer()
+   self._sentence_tokenizer = sentence_tokenizer or tokenize.blingfire.SentenceTokenizer(
+       retain_format=True
@theomonnom (Member) commented:

Oh, actually we were not using retain_format for the StreamAdapter before, since it is only used to generate sentences. In the PR I did, I kept the basic.SentenceTokenizer inside the transcription-synchronization code.

@longcw (Contributor, Author) commented:

It was used in the agent's tts_node.

@longcw (Contributor, Author) commented:

Or maybe I added it in this PR; we need to retain the formatting if we use the timed transcript from the StreamAdapter.
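
A hedged illustration of why retain_format matters here; the exact tokenizer behavior shown is an assumption:

    tok = tokenize.blingfire.SentenceTokenizer(retain_format=True)
    # with retain_format=True, whitespace/newlines between sentences are assumed
    # to be preserved, so timed transcripts rebuilt from the tokenized sentences
    # still line up with the original LLM text:
    sentences = tok.tokenize("Hello!\nHow are you?")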

@theomonnom (Member) commented Jul 1, 2025:

Ah OK, so the synchronizer also needs the exact same formatting?

@longcw (Contributor, Author) commented:

No, it can be different; they process the sentences separately.

@theomonnom (Member) commented Jul 1, 2025:

I see, but I thought the aligned transcripts returned by the TTSs did not include newlines/special characters, so I assumed retain_format was not needed.

@theomonnom (Member) commented:

When using the StreamAdapter with OpenAI, the transcription_node input comes from the llm_node, right? In that case I really don't think we should wait for the TTS, since we have the opt-in flag use_tts_aligned_transcript.

@longcw (Contributor, Author) commented:

If use_tts_aligned_transcript is enabled, the input of the transcription_node comes from the TTS.
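
For reference, a hedged usage sketch of the opt-in flag; the llm and tts arguments are illustrative:

    session = AgentSession(
        llm=openai.LLM(),                 # illustrative
        tts=cartesia.TTS(),               # illustrative
        use_tts_aligned_transcript=True,  # transcription_node input now comes from the TTS
    )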

@longcw (Contributor, Author) commented:

What do you mean by "we shouldn't wait for the TTS" when using the StreamAdapter?

@theomonnom (Member) commented:

OK, that makes sense. So by default, even when we use the StreamAdapter, it'll use the LLM output for the transcription_node.


@longcw merged commit d870f87 into theo/agents1.2 on Jul 1, 2025
1 check passed
@longcw deleted the longc/timed-transcription branch on July 1, 2025 13:21
Successfully merging this pull request may close these issues:

  • Bug: TTS interruption causes discrepancy between spoken audio and displayed chat text