Closed
23 changes: 23 additions & 0 deletions vllm_omni/entrypoints/openai/serving_speech.py
@@ -1183,6 +1183,29 @@ def _build_voxtral_prompt(self, request: OpenAICreateSpeechRequest) -> dict[str,
mistral_tokenizer = cached_tokenizer_from_config(self.engine_client.model_config)
self._tts_tokenizer = mistral_tokenizer.instruct
 if voice is not None:
+    # For custom uploaded voices, mistral_common doesn't know the voice name.
+    # Resolve to reference audio data stored at upload time instead.
+    voice_lower = voice.lower()
+    if voice_lower in self.uploaded_speakers:
+        speaker_info = self.uploaded_speakers[voice_lower]
Comment on lines +1186 to +1190
Copilot AI Apr 7, 2026

Please add a unit test covering Voxtral inference with an uploaded voice name: ensure _build_voxtral_prompt() resolves the voice to the stored reference audio and calls encode_speech_request() with ref_audio (not voice). This prevents regressions of #2479 and covers the new branch.
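
A minimal sketch of what such a test could look like, under stated assumptions: `build_prompt_stub` and `FakeSpeechRequest` are hypothetical stand-ins that mirror the branch in this diff, not part of vllm_omni or mistral_common. A real test would presumably call `_build_voxtral_prompt()` on the serving class with its tokenizer patched.

```python
import base64
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
from unittest.mock import MagicMock


@dataclass
class FakeSpeechRequest:
    """Hypothetical stand-in for the SpeechRequest used in the diff."""
    input: str
    voice: Optional[str] = None
    ref_audio: Optional[str] = None


def build_prompt_stub(uploaded_speakers, tokenizer, text, voice):
    """Hypothetical stand-in for the uploaded-voice branch under review."""
    info = uploaded_speakers.get(voice.lower())
    if info is not None and Path(info["file_path"]).exists():
        raw = Path(info["file_path"]).read_bytes()
        ref_audio = base64.b64encode(raw).decode("utf-8")
        request = FakeSpeechRequest(input=text, ref_audio=ref_audio)
    else:
        # Built-in voices are still resolved by name.
        request = FakeSpeechRequest(input=text, voice=voice)
    return tokenizer.encode_speech_request(request)


def test_uploaded_voice_uses_ref_audio():
    with tempfile.TemporaryDirectory() as tmp:
        wav = Path(tmp) / "custom.wav"
        wav.write_bytes(b"fake-audio")
        tokenizer = MagicMock()
        speakers = {"myvoice": {"file_path": str(wav)}}
        # Mixed-case name checks the lower-casing of the lookup key.
        build_prompt_stub(speakers, tokenizer, "hello", "MyVoice")

    request = tokenizer.encode_speech_request.call_args.args[0]
    assert request.ref_audio == base64.b64encode(b"fake-audio").decode("utf-8")
    assert request.voice is None
```

The key assertion is on the request object handed to `encode_speech_request()`: it should carry `ref_audio` and leave `voice` unset, which is the regression described in #2479.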

+        file_path = Path(speaker_info["file_path"])
+        if file_path.exists():
+            with open(file_path, "rb") as f:
+                audio_bytes = f.read()
+            audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
+            mime_type = speaker_info.get("mime_type", "audio/wav")
Comment on lines +1191 to +1196
Copilot AI Apr 7, 2026

This branch triggers for any entry in uploaded_speakers, including voices uploaded via upload_voice_embedding where file_path points to a .safetensors and mime_type is not audio. Reading and passing that file as ref_audio will produce invalid audio input for Voxtral. Gate this logic to audio-backed uploads (e.g., speaker_info.get("embedding_source") == "audio" or mime_type.startswith("audio/")) and return a clear error for embedding-only voices on Voxtral.

Suggested change
-        file_path = Path(speaker_info["file_path"])
-        if file_path.exists():
-            with open(file_path, "rb") as f:
-                audio_bytes = f.read()
-            audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
-            mime_type = speaker_info.get("mime_type", "audio/wav")
+        mime_type = speaker_info.get("mime_type", "audio/wav")
+        embedding_source = speaker_info.get("embedding_source")
+        is_audio_backed = embedding_source == "audio" or mime_type.startswith("audio/")
+        if not is_audio_backed:
+            raise ValueError(
+                f"Uploaded voice '{voice}' is embedding-only and cannot be used as Voxtral "
+                "reference audio. Please provide an audio-backed uploaded voice or pass ref_audio."
+            )
+        file_path = Path(speaker_info["file_path"])
+        if file_path.exists():
+            with open(file_path, "rb") as f:
+                audio_bytes = f.read()
+            audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")

+            ref_audio = f"data:{mime_type};base64,{audio_b64}"
+            # Strip data URI prefix for mistral_common
+            _, _, ref_audio = ref_audio.partition(",")
Comment on lines +1193 to +1199
Copilot AI Apr 7, 2026

This reimplements the same file-read/base64 logic as _get_uploaded_audio_data(), then immediately strips the data-URI prefix. Consider reusing _get_uploaded_audio_data(voice) and then partition(',') (or have a helper that returns raw base64) to avoid duplication and keep behavior consistent across call sites.
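
One way the second option (a helper that returns raw base64) might look is sketched below. The names `read_uploaded_audio_b64` and `to_data_uri` are hypothetical and not taken from the codebase, and the sketch assumes `speaker_info` has the `file_path` and `mime_type` keys used in the diff.

```python
import base64
from pathlib import Path


def read_uploaded_audio_b64(speaker_info: dict) -> str:
    """Return the stored reference audio as raw base64, with no data-URI prefix."""
    file_path = Path(speaker_info["file_path"])
    return base64.b64encode(file_path.read_bytes()).decode("utf-8")


def to_data_uri(speaker_info: dict, audio_b64: str) -> str:
    """Wrap raw base64 in a data URI for call sites that need one."""
    mime_type = speaker_info.get("mime_type", "audio/wav")
    return f"data:{mime_type};base64,{audio_b64}"
```

With a split like this, the Voxtral path could pass the raw base64 straight to `SpeechRequest(ref_audio=...)`, while call sites that need a data URI would wrap it, so the prefix is never built only to be stripped again.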

+            tokenized = self._tts_tokenizer.encode_speech_request(
+                SpeechRequest(input=text, ref_audio=ref_audio)
+            )
+            audio = tokenized.audios[0]
+            return {
+                "prompt_token_ids": tokenized.tokens,
+                "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
+            }
+        # Fall through to voice-name path if file is missing
Comment on lines +1192 to +1208
Copilot AI Apr 7, 2026

Falling through to the voice-name path when the uploaded voice file is missing will still fail for custom voices (the tokenizer doesn’t recognize the name). Instead of falling through, raise a user-facing error indicating the uploaded voice’s reference audio is missing/unreadable (and possibly suggest re-upload).

Suggested change
-        if file_path.exists():
-            with open(file_path, "rb") as f:
-                audio_bytes = f.read()
-            audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
-            mime_type = speaker_info.get("mime_type", "audio/wav")
-            ref_audio = f"data:{mime_type};base64,{audio_b64}"
-            # Strip data URI prefix for mistral_common
-            _, _, ref_audio = ref_audio.partition(",")
-            tokenized = self._tts_tokenizer.encode_speech_request(
-                SpeechRequest(input=text, ref_audio=ref_audio)
-            )
-            audio = tokenized.audios[0]
-            return {
-                "prompt_token_ids": tokenized.tokens,
-                "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
-            }
-        # Fall through to voice-name path if file is missing
+        if not file_path.exists():
+            raise ValueError(
+                f"Reference audio for uploaded voice '{voice}' is missing. "
+                "Please re-upload the voice sample and try again."
+            )
+        try:
+            with open(file_path, "rb") as f:
+                audio_bytes = f.read()
+        except OSError as e:
+            raise ValueError(
+                f"Reference audio for uploaded voice '{voice}' could not be read. "
+                "Please re-upload the voice sample and try again."
+            ) from e
+        audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
+        mime_type = speaker_info.get("mime_type", "audio/wav")
+        ref_audio = f"data:{mime_type};base64,{audio_b64}"
+        # Strip data URI prefix for mistral_common
+        _, _, ref_audio = ref_audio.partition(",")
+        tokenized = self._tts_tokenizer.encode_speech_request(
+            SpeechRequest(input=text, ref_audio=ref_audio)
+        )
+        audio = tokenized.audios[0]
+        return {
+            "prompt_token_ids": tokenized.tokens,
+            "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
+        }

     tokens = self._tts_tokenizer.encode_speech_request(SpeechRequest(input=text, voice=voice)).tokens
     return {
         "prompt_token_ids": tokens,