@@ -306,16 +306,17 @@ async def _run(self, output_emitter: tts.AudioEmitter) -> None:

 async def _sentence_stream_task(ws: aiohttp.ClientWebSocketResponse) -> None:
     context_id = utils.shortuuid()
-    base_pkt = _to_cartesia_options(self._opts, streaming=True)
     async for ev in sent_tokenizer_stream:
-        token_pkt = base_pkt.copy()
+        # The opts may have changed between the time this class was instantiated and the time we start receiving
+        # sentences to synthesize. We use the latest options here by doing self._tts._opts instead of self._opts.
+        token_pkt = _to_cartesia_options(self._tts._opts, streaming=True)
Contributor:

Could you explain in what case you would want to update the options after the tts_node has started?

Contributor Author:

Series of Events:

  1. User Speaks ("I want to talk to Katie")
  2. llm_node_1 starts, calls update_options(voice=KATIE)
  3. tts_node_1 starts with voice=KATIE
  4. User interrupts the agent ("actually I want to speak to Max") -> llm_node_1 cancels, but tts_node_1 continues
  5. llm_node_2 starts, calls update_options(voice=MAX)
  6. tts_node_1 synthesizes the LLM response, but in the KATIE voice instead of the MAX voice

Desired Behavior:

  • At step 6, we want the TTS to synthesize in the MAX voice, not the KATIE voice

Please let me know if this is reasonable and/or you plan to allow this functionality.
I think it is reasonable to expect the TTS to synthesize with the most up-to-date options.
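The race in the steps above can be reproduced in a minimal plain-Python sketch (the class and attribute names here are illustrative stand-ins, not the actual livekit-agents or Cartesia plugin API): a stream that snapshots the options at creation keeps the stale voice after `update_options()` runs, while reading the parent TTS instance's current options would pick up the change.

```python
# Minimal repro sketch. TTS, SynthesizeStream, _TTSOptions, and
# update_options() are hypothetical stand-ins, not the real plugin classes.
from dataclasses import dataclass, replace


@dataclass
class _TTSOptions:
    voice: str


class TTS:
    def __init__(self, voice: str) -> None:
        self._opts = _TTSOptions(voice=voice)

    def update_options(self, *, voice: str) -> None:
        self._opts = replace(self._opts, voice=voice)

    def stream(self) -> "SynthesizeStream":
        return SynthesizeStream(self)


class SynthesizeStream:
    def __init__(self, tts: TTS) -> None:
        self._tts = tts
        self._opts = replace(tts._opts)  # snapshot taken when the stream starts

    def voice_in_use(self) -> str:
        # stale: reads the snapshot (self._opts), not the parent's latest options
        return self._opts.voice


tts = TTS(voice="KATIE")
stream = tts.stream()            # tts_node_1 starts with voice=KATIE
tts.update_options(voice="MAX")  # llm_node_2 updates the voice mid-flight
print(stream.voice_in_use())     # KATIE  <- the reported bug
print(tts._opts.voice)           # MAX    <- what self._tts._opts would give
```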

Contributor:

> llm_node_2 starts, calls update_options(voice=MAX)
> tts_node_1 synthesizes the LLM response, but in the KATIE voice instead of the MAX voice

Does this actually happen? A new generation will create a new TTS stream; ideally there should be a tts_node_2 for llm_node_2.

Contributor Author:

Perhaps only one LLM node persists.

The behavior can be replicated, though, by doing something like this:

  1. In the llm_node, call update_options with the new voice.
  2. The new voice is NOT reflected by the time we get to synthesizing; it is only picked up on the next turn.

If you make the change in this PR, the new voice will be reflected.
We need this by EOD, so we will be hacking a version of the Cartesia.TTS() plugin in the meantime.

Contributor:

I see: it's not applied because the tts_node is created in parallel with the llm_node, before update_options in the llm_node is called.

Instead of using options from the tts instance, we may still want each tts stream to have its own copy of the options. Maybe we should allow creating a new tts_node in the llm_node with the updated options; this would fix the issue for all TTS plugins.
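The "each stream keeps its own copy" idea can be sketched as follows (again with hypothetical names, using a frozen dataclass so a stream's options are immutable once taken): a running stream is unaffected by `update_options()`, while any stream created afterwards gets the new voice.

```python
# Sketch of per-stream option copies. Names are illustrative, not the real API.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class _TTSOptions:
    voice: str


class TTS:
    def __init__(self, voice: str) -> None:
        self._opts = _TTSOptions(voice=voice)

    def update_options(self, *, voice: str) -> None:
        # frozen dataclass: build a new options object instead of mutating
        self._opts = replace(self._opts, voice=voice)

    def stream(self) -> "SynthesizeStream":
        # each stream takes its own immutable copy at creation time
        return SynthesizeStream(self._opts)


class SynthesizeStream:
    def __init__(self, opts: _TTSOptions) -> None:
        self._opts = opts


tts = TTS(voice="KATIE")
s1 = tts.stream()                # created before the update
tts.update_options(voice="MAX")
s2 = tts.stream()                # created after the update
print(s1._opts.voice, s2._opts.voice)  # KATIE MAX
```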

Contributor Author:

> instead of using options from tts instance, we may still want each tts stream has a copy of the options.

I agree with this. It makes sense for stream options to be immutable once instantiated.

> maybe we should allow to create a new tts_node in the llm_node with the updated options

What about a tts_node.restart() or tts_node.refresh() of some sort? I could also create a new tts_node from within the llm_node, but it is less clear to me how I would do that. I will take a look later this week.

         token_pkt["context_id"] = context_id
         token_pkt["transcript"] = ev.token + " "
         token_pkt["continue"] = True
         self._mark_started()
         await ws.send_str(json.dumps(token_pkt))

-    end_pkt = base_pkt.copy()
+    end_pkt = _to_cartesia_options(self._tts._opts, streaming=True)
     end_pkt["context_id"] = context_id
     end_pkt["transcript"] = " "
     end_pkt["continue"] = False