support timed transcripts from tts #2580
Conversation
✅ Changeset File Detected
The following changeset entries were found:
Change description:
@@ -1089,6 +1076,12 @@ def _on_first_frame(_: asyncio.Future[None]) -> None:
                model_settings=model_settings,
            )
            tasks.append(tts_task)
+           if (
+               (tts := self.tts)
+               and (tts.capabilities.timed_transcript or not tts.capabilities.streaming)
I have a concern here: if a user creates a new AudioFrame in a customized tts_node but doesn't forward the timed transcripts, we may be missing the text response. wdyt? @theomonnom
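For illustration, here is a minimal sketch of what forwarding would look like in a customized tts_node. The userdata attribute and the "timed_transcripts" key are taken from this PR's diff and may still change; the DSP step is just a placeholder.

```python
from collections.abc import AsyncIterable

from livekit import rtc
from livekit.agents import Agent, ModelSettings


class MyAgent(Agent):
    async def tts_node(
        self, text: AsyncIterable[str], model_settings: ModelSettings
    ) -> AsyncIterable[rtc.AudioFrame]:
        # run the default TTS pipeline, then post-process the audio
        async for frame in Agent.default.tts_node(self, text, model_settings):
            new_frame = rtc.AudioFrame(
                data=bytes(frame.data),  # imagine some DSP applied here
                sample_rate=frame.sample_rate,
                num_channels=frame.num_channels,
                samples_per_channel=frame.samples_per_channel,
            )
            # without this line, the timed transcripts attached by the TTS
            # are dropped, which is exactly the concern raised above
            new_frame.userdata["timed_transcripts"] = frame.userdata.get(
                "timed_transcripts", []
            )
            yield new_frame
```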
@@ -233,8 +234,12 @@ def llm_node(
        return Agent.default.llm_node(self, chat_ctx, tools, model_settings)

    def transcription_node(
-       self, text: AsyncIterable[str], model_settings: ModelSettings
+       self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
    ) -> AsyncIterable[str] | Coroutine[Any, Any, AsyncIterable[str]] | Coroutine[Any, Any, None]:
TimedString is a str; to be fair, I'm not even sure we should encourage people to use the timed transcripts here.
Suggested change:
-       self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
+       self, text: AsyncIterable[str], model_settings: ModelSettings
Do you have any suggestions for an alternative way for people to access the timed transcripts?
IMO we should just keep it implicit for now
> IMO we should just keep it implicit for now

I understand the concern that this may change in the future, but IMO we should expose the timed transcripts to users in some way; this has been requested by some folks for a while. Any alternatives for this?
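For context, exposing it through transcription_node would look roughly like this. This is only a sketch; the exact import path for TimedString and its start_time/end_time attribute names are assumptions here.

```python
from collections.abc import AsyncIterable

from livekit.agents import Agent, ModelSettings
from livekit.agents.voice import TimedString  # exact import path assumed


class MyAgent(Agent):
    async def transcription_node(
        self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
    ) -> AsyncIterable[str]:
        async for chunk in text:
            if isinstance(chunk, TimedString):
                # attribute names assumed for the per-chunk timings
                print(f"{str(chunk)!r}: {chunk.start_time:.2f}s -> {chunk.end_time:.2f}s")
            yield chunk
```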
                return

            if last_frame is not None:
                last_frame.user_data["timed_transcripts"] = timed_transcripts
Is the idea to send the timed transcripts at the end of the segment? If so, maybe this could just be a new field on SynthesizedAudio. (This would also mean we don't have synchronized transcripts until the whole generation is done?)
It's not at the end of the segment; it's usually at the start of the TTS output.
The buffered timed_transcripts are always added to the next audio frame, not only the last frame.
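Restating the buffering behavior in code, a rough sketch with illustrative names (not the actual implementation):

```python
from livekit import rtc
from livekit.agents.voice import TimedString  # exact import path assumed


async def forward_frames(tts_output, audio_sink) -> None:
    # transcripts are buffered until the next audio frame is pushed, then
    # attached to that frame's userdata; nothing waits for end of segment
    pending: list[TimedString] = []
    async for item in tts_output:
        if isinstance(item, TimedString):  # illustrative: a timing event
            pending.append(item)
            continue
        assert isinstance(item, rtc.AudioFrame)
        item.userdata["timed_transcripts"] = list(pending)
        pending.clear()
        await audio_sink.capture_frame(item)  # illustrative sink API
```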
just adding my support for this one, excited to see you hopefully get this out soon!
lgtm! this is awesome!
    async for audio_frame in tts_node:
        for text in audio_frame.userdata.get("timed_transcripts", []):
Let's move the key to a constant somewhere, e.g. https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/types.py (with an lk. prefix).
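For concreteness, the suggestion above might look like this (constant name and value are hypothetical):

```python
# in livekit-agents/livekit/agents/types.py, next to the existing constants
# (name and value are hypothetical)
ATTRIBUTE_TIMED_TRANSCRIPTS = "lk.timed_transcripts"
```

The example above would then read audio_frame.userdata.get(ATTRIBUTE_TIMED_TRANSCRIPTS, []) instead of hard-coding the string.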
    return url


def _to_timed_words(
It's a bit difficult to understand this code without actually running it. Not urgent, but it would be ideal to have some tests in place at some point.
Should we release a new version of livekit-rtc before merging?
@@ -41,6 +44,8 @@ class SynthesizedAudio:
 class TTSCapabilities:
     streaming: bool
     """Whether this TTS supports streaming (generally using websockets)"""
+    timed_transcript: bool = False
We use aligned_transcript in other parts of the code (AgentSession has use_tts_aligned_transcript). Let's use aligned here too?
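In other words, after the rename a plugin would advertise the capability like this (a sketch, assuming a plain field rename):

```python
from livekit.agents import tts

# a TTS plugin declaring support for aligned transcripts
capabilities = tts.TTSCapabilities(
    streaming=True,
    aligned_transcript=True,  # named timed_transcript in the current diff
)
```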
updated
This reverts commit b4104c3.
When is the ETA for this to be added?
    def __init__(self):
        super().__init__(instructions="You are a helpful assistant.")

        self._closing_task: asyncio.Task[None] | None = None
self._closing_task: asyncio.Task[None] | None = None
            sample_rate=tts.sample_rate,
            num_channels=tts.num_channels,
        )
        self._wrapped_tts = tts
-       self._sentence_tokenizer = sentence_tokenizer or tokenize.blingfire.SentenceTokenizer()
+       self._sentence_tokenizer = sentence_tokenizer or tokenize.blingfire.SentenceTokenizer(
+           retain_format=True
Oh, actually we were not using retain_format for the StreamAdapter before, since it is only used to generate a sentence. In the PR I did, I was actually keeping the basic.SentenceTokenizer inside the transcription synchronization code.
It was used in agent's tts_node.
Or maybe I added it in this PR; we need to retain the formatting if we use the timed transcript from the StreamAdapter.
Ah ok, the synchronizer also needs the exact same formatting?
no, it can be different. They process the sentences separately.
I see, but I thought the aligned transcripts returned by the TTSs did not include newlines/special characters, so I assumed retain_format was not needed.
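A quick illustration of the difference being discussed, using the basic tokenizer, which already exposes the flag (the outputs described in the comments are approximate):

```python
from livekit.agents import tokenize

text = "Hello there!\n\nThis is the   second sentence."

plain = tokenize.basic.SentenceTokenizer()
kept = tokenize.basic.SentenceTokenizer(retain_format=True)

# without retain_format, newlines and repeated whitespace are normalized away,
# so text rebuilt from the tokens won't line up with the original characters
print(plain.tokenize(text))
# with retain_format=True, the original formatting is preserved across tokens,
# which is what rebuilding an aligned transcript needs
print(kept.tokenize(text))
```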
When using the StreamAdapter with OpenAI, the transcription_node input is coming from the llm_node, right?
In this case I really don't think we should wait for the TTS, since we have the opt-in flag use_tts_aligned_transcript.
If use_tts_aligned_transcript is enabled, the input of the transcription_node is from the TTS.
What do you mean by "we shouldn't wait for the TTS when using the StreamAdapter"?
Ok, that makes sense. So by default, even if we use the StreamAdapter, it'll use the LLM output for the transcription_node.
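For anyone following along, the opt-in discussed above looks roughly like this (other session arguments elided):

```python
from livekit.agents import AgentSession

session = AgentSession(
    # ... stt / llm / tts / vad configuration ...
    # when enabled, transcription_node receives the TTS-aligned transcript
    # (TimedString items) instead of the llm_node text
    use_tts_aligned_transcript=True,
)
```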
Fix #2607, #2326 (comment)
needs livekit/python-sdks#456