
fix: use ref_audio path for uploaded custom voices in voxtral TTS inference (#2479) #2547

Closed
passionworkeer wants to merge 1 commit into vllm-project:main from passionworkeer:fix-uploaded-custom-voice-ref-audio

@passionworkeer

Fixes issue #2479 - Custom voice upload fails at inference with mistral_common tokenizer.

Problem

When a custom voice is uploaded via /audio/voices/upload, it is added to supported_speakers and uploaded_speakers, but _build_voxtral_prompt() calls encode_speech_request() with the voice name string. The mistral_common tokenizer only knows built-in voice names, not custom uploaded ones, causing the tokenizer to fail.

Fix

In _build_voxtral_prompt(), when voice is found in uploaded_speakers, resolve it to the stored reference audio file and use SpeechRequest(input=text, ref_audio=...) instead of SpeechRequest(input=text, voice=voice). This mirrors how the CosyVoice3 and FishSpeech paths already handle uploaded voices.

Changed file

  • vllm_omni/entrypoints/openai/serving_speech.py: Modified _build_voxtral_prompt() to check whether the voice is in uploaded_speakers and, if so, read the stored audio file and use it as ref_audio for the tokenizer.

Closes #2479

Copilot AI review requested due to automatic review settings April 7, 2026 09:10
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Contributor

Copilot AI left a comment

Pull request overview

Fixes Voxtral TTS inference for custom uploaded voices by resolving uploaded voice names to their stored reference audio before tokenization (avoiding mistral_common’s “unknown voice” failures).

Changes:

  • Update _build_voxtral_prompt() to detect uploaded voices and pass ref_audio (base64) to encode_speech_request() instead of voice.
  • Load the stored uploaded voice sample from disk and attach it as multimodal audio input for the Voxtral prompt.

Comment on lines +1191 to +1196
file_path = Path(speaker_info["file_path"])
if file_path.exists():
    with open(file_path, "rb") as f:
        audio_bytes = f.read()
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    mime_type = speaker_info.get("mime_type", "audio/wav")

Copilot AI Apr 7, 2026

This branch triggers for any entry in uploaded_speakers, including voices uploaded via upload_voice_embedding where file_path points to a .safetensors and mime_type is not audio. Reading and passing that file as ref_audio will produce invalid audio input for Voxtral. Gate this logic to audio-backed uploads (e.g., speaker_info.get("embedding_source") == "audio" or mime_type.startswith("audio/")) and return a clear error for embedding-only voices on Voxtral.

Suggested change
- file_path = Path(speaker_info["file_path"])
- if file_path.exists():
-     with open(file_path, "rb") as f:
-         audio_bytes = f.read()
-     audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
-     mime_type = speaker_info.get("mime_type", "audio/wav")
+ mime_type = speaker_info.get("mime_type", "audio/wav")
+ embedding_source = speaker_info.get("embedding_source")
+ is_audio_backed = embedding_source == "audio" or mime_type.startswith("audio/")
+ if not is_audio_backed:
+     raise ValueError(
+         f"Uploaded voice '{voice}' is embedding-only and cannot be used as Voxtral "
+         "reference audio. Please provide an audio-backed uploaded voice or pass ref_audio."
+     )
+ file_path = Path(speaker_info["file_path"])
+ if file_path.exists():
+     with open(file_path, "rb") as f:
+         audio_bytes = f.read()
+     audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")

Comment on lines +1192 to +1208
if file_path.exists():
    with open(file_path, "rb") as f:
        audio_bytes = f.read()
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    mime_type = speaker_info.get("mime_type", "audio/wav")
    ref_audio = f"data:{mime_type};base64,{audio_b64}"
    # Strip data URI prefix for mistral_common
    _, _, ref_audio = ref_audio.partition(",")
    tokenized = self._tts_tokenizer.encode_speech_request(
        SpeechRequest(input=text, ref_audio=ref_audio)
    )
    audio = tokenized.audios[0]
    return {
        "prompt_token_ids": tokenized.tokens,
        "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
    }
# Fall through to voice-name path if file is missing

Copilot AI Apr 7, 2026

Falling through to the voice-name path when the uploaded voice file is missing will still fail for custom voices (the tokenizer doesn’t recognize the name). Instead of falling through, raise a user-facing error indicating the uploaded voice’s reference audio is missing/unreadable (and possibly suggest re-upload).

Suggested change
- if file_path.exists():
-     with open(file_path, "rb") as f:
-         audio_bytes = f.read()
-     audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
-     mime_type = speaker_info.get("mime_type", "audio/wav")
-     ref_audio = f"data:{mime_type};base64,{audio_b64}"
-     # Strip data URI prefix for mistral_common
-     _, _, ref_audio = ref_audio.partition(",")
-     tokenized = self._tts_tokenizer.encode_speech_request(
-         SpeechRequest(input=text, ref_audio=ref_audio)
-     )
-     audio = tokenized.audios[0]
-     return {
-         "prompt_token_ids": tokenized.tokens,
-         "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
-     }
- # Fall through to voice-name path if file is missing
+ if not file_path.exists():
+     raise ValueError(
+         f"Reference audio for uploaded voice '{voice}' is missing. "
+         "Please re-upload the voice sample and try again."
+     )
+ try:
+     with open(file_path, "rb") as f:
+         audio_bytes = f.read()
+ except OSError as e:
+     raise ValueError(
+         f"Reference audio for uploaded voice '{voice}' could not be read. "
+         "Please re-upload the voice sample and try again."
+     ) from e
+ audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
+ mime_type = speaker_info.get("mime_type", "audio/wav")
+ ref_audio = f"data:{mime_type};base64,{audio_b64}"
+ # Strip data URI prefix for mistral_common
+ _, _, ref_audio = ref_audio.partition(",")
+ tokenized = self._tts_tokenizer.encode_speech_request(
+     SpeechRequest(input=text, ref_audio=ref_audio)
+ )
+ audio = tokenized.audios[0]
+ return {
+     "prompt_token_ids": tokenized.tokens,
+     "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
+ }

Comment on lines +1193 to +1199
with open(file_path, "rb") as f:
    audio_bytes = f.read()
audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
mime_type = speaker_info.get("mime_type", "audio/wav")
ref_audio = f"data:{mime_type};base64,{audio_b64}"
# Strip data URI prefix for mistral_common
_, _, ref_audio = ref_audio.partition(",")

Copilot AI Apr 7, 2026

This reimplements the same file-read/base64 logic as _get_uploaded_audio_data(), then immediately strips the data-URI prefix. Consider reusing _get_uploaded_audio_data(voice) and then partition(',') (or have a helper that returns raw base64) to avoid duplication and keep behavior consistent across call sites.
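The "helper that returns raw base64" idea could look roughly like the sketch below. The function name `data_uri_to_raw_base64` is hypothetical and not part of this PR or the codebase; the audio bytes are placeholders. It only illustrates how `str.partition(",")` splits a data URI, and that a bare `partition` yields an empty payload when the input has no prefix.

```python
import base64


def data_uri_to_raw_base64(data_uri: str) -> str:
    """Return the base64 payload of a data URI (hypothetical helper).

    If the input contains no comma, it is assumed to already be raw base64
    and is returned unchanged.
    """
    head, sep, payload = data_uri.partition(",")
    # partition() returns (original, "", "") when the separator is absent,
    # so a bare `_, _, x = s.partition(",")` would give x == "" in that case.
    return payload if sep else head


# Round trip with placeholder bytes standing in for a real audio file.
fake_audio = b"RIFF....WAVEfmt "
uri = "data:audio/wav;base64," + base64.b64encode(fake_audio).decode("utf-8")
raw = data_uri_to_raw_base64(uri)
```

Because the standard base64 alphabet contains no comma, checking the separator is enough to tell a prefixed value from a raw one.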

Comment on lines +1186 to +1190
# For custom uploaded voices, mistral_common doesn't know the voice name.
# Resolve to reference audio data stored at upload time instead.
voice_lower = voice.lower()
if voice_lower in self.uploaded_speakers:
    speaker_info = self.uploaded_speakers[voice_lower]

Copilot AI Apr 7, 2026

Please add a unit test covering Voxtral inference with an uploaded voice name: ensure _build_voxtral_prompt() resolves the voice to the stored reference audio and calls encode_speech_request() with ref_audio (not voice). This prevents regressions of #2479 and covers the new branch.
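One possible shape for such a test is sketched below. It does not import the real serving class: `FakeServing` and the local `SpeechRequest` dataclass are stand-ins written for illustration, and the uploaded-voice store is simplified to a dict of data URIs. The part that would carry over to a real test is the pair of assertions on what `encode_speech_request()` receives.

```python
from dataclasses import dataclass
from typing import Optional
from unittest.mock import MagicMock


@dataclass
class SpeechRequest:
    """Local stand-in for the tokenizer's request type (assumption)."""
    input: str
    voice: Optional[str] = None
    ref_audio: Optional[str] = None


class FakeServing:
    """Stand-in that copies only the uploaded-voice branch under review."""

    def __init__(self, tokenizer, uploaded_audio):
        self._tts_tokenizer = tokenizer
        # Maps lower-cased voice name -> data URI, replacing the on-disk store.
        self.uploaded_speakers = uploaded_audio

    def _get_uploaded_audio_data(self, voice):
        return self.uploaded_speakers.get(voice.lower())

    def _build_voxtral_prompt(self, text, voice):
        if voice.lower() in self.uploaded_speakers:
            data_uri = self._get_uploaded_audio_data(voice)
            _, _, ref_audio = data_uri.partition(",")
            request = SpeechRequest(input=text, ref_audio=ref_audio)
        else:
            request = SpeechRequest(input=text, voice=voice)
        tokenized = self._tts_tokenizer.encode_speech_request(request)
        return {"prompt_token_ids": tokenized.tokens}


def test_uploaded_voice_uses_ref_audio():
    tokenizer = MagicMock()
    serving = FakeServing(tokenizer, {"my_voice": "data:audio/wav;base64,QUJD"})

    serving._build_voxtral_prompt("hello", "My_Voice")

    # The tokenizer must see ref_audio (raw base64), not the voice name.
    request = tokenizer.encode_speech_request.call_args.args[0]
    assert request.ref_audio == "QUJD"
    assert request.voice is None


test_uploaded_voice_uses_ref_audio()
```

In the real test suite the stand-in would be replaced by the actual serving object with its tokenizer mocked and its stored-audio lookup patched.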

@passionworkeer
Author

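@copilot Thanks for the review! All four suggestions are useful:

  1. Filter out non-audio uploads with embedding_source == "audio" ✓
  2. Raise a user-facing error directly when the file is missing, so the failure does not surface as a confusing tokenizer error ✓
  3. Reuse _get_uploaded_audio_data() to reduce duplication ✓
  4. I will add a unit test covering the Voxtral + uploaded voice branch ✓

I'll go update the code.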

Collaborator

@lishunyang12 lishunyang12 left a comment

Review: request changes

The fix addresses a real bug (custom uploaded voices failing with the mistral_common tokenizer), but the implementation has several issues that should be addressed before merging.

1. Code duplication — reuse _get_uploaded_audio_data() (major)

The new code manually reads the file, base64-encodes it, constructs a data URI, then immediately strips the data URI prefix. This duplicates logic already in _get_uploaded_audio_data() (lines 458-486). The method should be reused:

if voice_lower in self.uploaded_speakers:
    ref_audio = self._get_uploaded_audio_data(voice)
    if ref_audio is not None:
        # Strip data URI prefix for mistral_common
        _, _, ref_audio = ref_audio.partition(",")
        tokenized = self._tts_tokenizer.encode_speech_request(
            SpeechRequest(input=text, ref_audio=ref_audio)
        )
        audio = tokenized.audios[0]
        return {
            "prompt_token_ids": tokenized.tokens,
            "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
        }

2. Pointless data URI construction then immediate stripping

ref_audio = f"data:{mime_type};base64,{audio_b64}"
# Strip data URI prefix for mistral_common
_, _, ref_audio = ref_audio.partition(",")

This builds a data URI only to tear it apart on the very next line. If you reuse _get_uploaded_audio_data(), at least the duplication is gone and the strip makes sense (since that method returns a data URI). Otherwise, just use audio_b64 directly.

3. No error handling on file read

_get_uploaded_audio_data() wraps file I/O in try/except and logs errors. The new code has a bare open(file_path, "rb") with no exception handling. If the file exists but is unreadable (permissions, disk error), this will raise an unhandled exception through the request path.

4. Silent fallthrough on missing file is problematic

# Fall through to voice-name path if file is missing

If the uploaded voice's file is missing, silently falling through to encode_speech_request(SpeechRequest(input=text, voice=voice)) will produce a confusing tokenizer error since the custom voice name is not a built-in voice. It would be better to raise a clear error or return a descriptive error so the user knows the upload file is missing.

5. Consider fixing this in the validator instead (architectural suggestion)

The Fish Speech path handles this in _validate_fish_speech_request() by resolving request.voice → request.ref_audio before the prompt builder runs. The voxtral validator (_validate_voxtral_tts_request) does not do this. Adding the resolution there would be more consistent with the existing patterns, and _build_voxtral_prompt would not need any changes since it already handles ref_audio correctly (lines 1342-1343 strip the data URI prefix, lines 1356-1361 encode with ref_audio).

This approach is cleaner because the prompt builder stays simple and the uploaded-voice resolution logic lives in one place per model type (the validator).
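A rough sketch of what validator-side resolution might look like follows. Everything here is an assumption made for illustration: `SpeechRequestLike`, `resolve_uploaded_voice`, the "return an error string or None" convention, and clearing `voice` after resolution are not taken from the codebase, and the real Fish Speech validator may behave differently.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class SpeechRequestLike:
    """Hypothetical minimal view of the incoming speech request."""
    input: str
    voice: Optional[str] = None
    ref_audio: Optional[str] = None


def resolve_uploaded_voice(
    request: SpeechRequestLike,
    uploaded_speakers: Mapping[str, dict],
    get_uploaded_audio_data: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Rewrite an uploaded voice name into ref_audio before prompt building.

    Returns an error message, or None when the request can proceed
    (assumed convention for this sketch).
    """
    # An explicit ref_audio wins; a request without a voice needs no lookup.
    if request.ref_audio is not None or request.voice is None:
        return None
    # Built-in voice names are left for the tokenizer to handle.
    if request.voice.lower() not in uploaded_speakers:
        return None
    data_uri = get_uploaded_audio_data(request.voice)
    if data_uri is None:
        # Fail here with a clear message instead of inside the tokenizer.
        return (
            f"Reference audio for uploaded voice '{request.voice}' is missing "
            "or unreadable. Please re-upload the voice sample."
        )
    request.ref_audio = data_uri
    request.voice = None  # assumption: the prompt builder keys off ref_audio
    return None


store = {"my_voice": {"file_path": "/tmp/my_voice.wav"}}  # placeholder entry
req = SpeechRequestLike(input="hello", voice="My_Voice")
error = resolve_uploaded_voice(req, store, lambda name: "data:audio/wav;base64,QUJD")
```

With resolution done at this stage, the prompt builder only ever sees either a built-in voice name or a ref_audio data URI, which is the simplification the comment above argues for.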


Summary: The bug fix is needed, but please (a) reuse _get_uploaded_audio_data() instead of duplicating its logic, (b) add error handling, (c) avoid the silent fallthrough, and ideally (d) consider resolving in the validator for architectural consistency.

@Gaohan123
Collaborator

@passionworkeer Hello, any updates?
