fix: use ref_audio path for uploaded custom voices in voxtral TTS inference (#2479) by passionworkeer · Pull Request #2547 · vllm-project/vllm-omni

passionworkeer · 2026-04-07T09:10:47Z

Fixes issue #2479 - Custom voice upload fails at inference with mistral_common tokenizer.

Problem

When a custom voice is uploaded via /audio/voices/upload, it is added to supported_speakers and uploaded_speakers, but _build_voxtral_prompt() calls encode_speech_request() with the voice name string. The mistral_common tokenizer only knows built-in voice names, not custom uploaded ones, causing the tokenizer to fail.

Fix

In _build_voxtral_prompt(), when voice is found in uploaded_speakers, resolve it to the stored reference audio file and use SpeechRequest(input=text, ref_audio=...) instead of SpeechRequest(input=text, voice=voice). This mirrors how the CosyVoice3 and FishSpeech paths already handle uploaded voices.
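
The branching described above could look roughly like the following sketch. `SpeechRequestSketch` is a plain dataclass standing in for the real request type, and `load_ref_audio` is an illustrative callable; neither name comes from vllm-omni or mistral_common.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SpeechRequestSketch:
    """Hypothetical stand-in for the tokenizer's speech request type."""
    input: str
    voice: Optional[str] = None
    ref_audio: Optional[str] = None


def build_request(
    text: str,
    voice: str,
    uploaded_speakers: dict,
    load_ref_audio: Callable[[str], str],
) -> SpeechRequestSketch:
    """Use ref_audio for uploaded voices and the voice name for built-in ones."""
    if voice.lower() in uploaded_speakers:
        # Custom voice: the tokenizer does not know this name, so pass audio.
        return SpeechRequestSketch(input=text, ref_audio=load_ref_audio(voice))
    # Built-in voice: the tokenizer can resolve the name itself.
    return SpeechRequestSketch(input=text, voice=voice)
```

Only one of `voice` or `ref_audio` is set on the returned request, so the tokenizer never sees a custom name it cannot resolve.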

Changed file

vllm_omni/entrypoints/openai/serving_speech.py: Modified _build_voxtral_prompt() to check if the voice is in uploaded_speakers and, if so, read the stored audio file and use it as ref_audio for the tokenizer.

Closes #2479


chatgpt-codex-connector · 2026-04-07T09:10:57Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Copilot

Pull request overview

Fixes Voxtral TTS inference for custom uploaded voices by resolving uploaded voice names to their stored reference audio before tokenization (avoiding mistral_common’s “unknown voice” failures).

Changes:

Update _build_voxtral_prompt() to detect uploaded voices and pass ref_audio (base64) to encode_speech_request() instead of voice.
Load the stored uploaded voice sample from disk and attach it as multimodal audio input for the Voxtral prompt.


Copilot · 2026-04-07T09:14:33Z

+                file_path = Path(speaker_info["file_path"])
+                if file_path.exists():
+                    with open(file_path, "rb") as f:
+                        audio_bytes = f.read()
+                    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
+                    mime_type = speaker_info.get("mime_type", "audio/wav")


This branch triggers for any entry in uploaded_speakers, including voices uploaded via upload_voice_embedding where file_path points to a .safetensors and mime_type is not audio. Reading and passing that file as ref_audio will produce invalid audio input for Voxtral. Gate this logic to audio-backed uploads (e.g., speaker_info.get("embedding_source") == "audio" or mime_type.startswith("audio/")) and return a clear error for embedding-only voices on Voxtral.

Suggested change

-                file_path = Path(speaker_info["file_path"])
-                if file_path.exists():
-                    with open(file_path, "rb") as f:
-                        audio_bytes = f.read()
-                    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
-                    mime_type = speaker_info.get("mime_type", "audio/wav")
+                mime_type = speaker_info.get("mime_type", "audio/wav")
+                embedding_source = speaker_info.get("embedding_source")
+                is_audio_backed = embedding_source == "audio" or mime_type.startswith("audio/")
+                if not is_audio_backed:
+                    raise ValueError(
+                        f"Uploaded voice '{voice}' is embedding-only and cannot be used as Voxtral "
+                        "reference audio. Please provide an audio-backed uploaded voice or pass ref_audio."
+                    )
+                file_path = Path(speaker_info["file_path"])
+                if file_path.exists():
+                    with open(file_path, "rb") as f:
+                        audio_bytes = f.read()
+                    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")

Copilot · 2026-04-07T09:14:33Z

+                if file_path.exists():
+                    with open(file_path, "rb") as f:
+                        audio_bytes = f.read()
+                    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
+                    mime_type = speaker_info.get("mime_type", "audio/wav")
+                    ref_audio = f"data:{mime_type};base64,{audio_b64}"
+                    # Strip data URI prefix for mistral_common
+                    _, _, ref_audio = ref_audio.partition(",")
+                    tokenized = self._tts_tokenizer.encode_speech_request(
+                        SpeechRequest(input=text, ref_audio=ref_audio)
+                    )
+                    audio = tokenized.audios[0]
+                    return {
+                        "prompt_token_ids": tokenized.tokens,
+                        "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
+                    }
+                # Fall through to voice-name path if file is missing


Falling through to the voice-name path when the uploaded voice file is missing will still fail for custom voices (the tokenizer doesn’t recognize the name). Instead of falling through, raise a user-facing error indicating the uploaded voice’s reference audio is missing/unreadable (and possibly suggest re-upload).

Suggested change

-                if file_path.exists():
-                    with open(file_path, "rb") as f:
-                        audio_bytes = f.read()
-                    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
-                    mime_type = speaker_info.get("mime_type", "audio/wav")
-                    ref_audio = f"data:{mime_type};base64,{audio_b64}"
-                    # Strip data URI prefix for mistral_common
-                    _, _, ref_audio = ref_audio.partition(",")
-                    tokenized = self._tts_tokenizer.encode_speech_request(
-                        SpeechRequest(input=text, ref_audio=ref_audio)
-                    )
-                    audio = tokenized.audios[0]
-                    return {
-                        "prompt_token_ids": tokenized.tokens,
-                        "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
-                    }
-                # Fall through to voice-name path if file is missing
+                if not file_path.exists():
+                    raise ValueError(
+                        f"Reference audio for uploaded voice '{voice}' is missing. "
+                        "Please re-upload the voice sample and try again."
+                    )
+                try:
+                    with open(file_path, "rb") as f:
+                        audio_bytes = f.read()
+                except OSError as e:
+                    raise ValueError(
+                        f"Reference audio for uploaded voice '{voice}' could not be read. "
+                        "Please re-upload the voice sample and try again."
+                    ) from e
+                audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
+                mime_type = speaker_info.get("mime_type", "audio/wav")
+                ref_audio = f"data:{mime_type};base64,{audio_b64}"
+                # Strip data URI prefix for mistral_common
+                _, _, ref_audio = ref_audio.partition(",")
+                tokenized = self._tts_tokenizer.encode_speech_request(
+                    SpeechRequest(input=text, ref_audio=ref_audio)
+                )
+                audio = tokenized.audios[0]
+                return {
+                    "prompt_token_ids": tokenized.tokens,
+                    "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
+                }

Copilot · 2026-04-07T09:14:34Z

+                    with open(file_path, "rb") as f:
+                        audio_bytes = f.read()
+                    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
+                    mime_type = speaker_info.get("mime_type", "audio/wav")
+                    ref_audio = f"data:{mime_type};base64,{audio_b64}"
+                    # Strip data URI prefix for mistral_common
+                    _, _, ref_audio = ref_audio.partition(",")


This reimplements the same file-read/base64 logic as _get_uploaded_audio_data(), then immediately strips the data-URI prefix. Consider reusing _get_uploaded_audio_data(voice) and then partition(',') (or have a helper that returns raw base64) to avoid duplication and keep behavior consistent across call sites.
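
A helper that returns raw base64, as suggested here, might look like the sketch below. The name `to_raw_base64` is hypothetical and the input is assumed to be either a `data:` URI or an already-raw base64 string.

```python
import base64


def to_raw_base64(ref_audio: str) -> str:
    """Return the base64 payload of a data URI, or the input if it is not one."""
    if ref_audio.startswith("data:"):
        # Everything after the first comma is the payload.
        _, sep, payload = ref_audio.partition(",")
        if not sep:
            raise ValueError("Malformed data URI: missing ',' separator")
        return payload
    # Already raw base64: pass it through unchanged.
    return ref_audio


data_uri = "data:audio/wav;base64," + base64.b64encode(b"RIFF").decode("utf-8")
print(to_raw_base64(data_uri))
```

The explicit `startswith("data:")` check matters because a bare `partition(",")` on a string with no comma returns an empty third element, which would silently turn raw base64 input into an empty string.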

Copilot · 2026-04-07T09:14:34Z

+            # For custom uploaded voices, mistral_common doesn't know the voice name.
+            # Resolve to reference audio data stored at upload time instead.
+            voice_lower = voice.lower()
+            if voice_lower in self.uploaded_speakers:
+                speaker_info = self.uploaded_speakers[voice_lower]


Please add a unit test covering Voxtral inference with an uploaded voice name: ensure _build_voxtral_prompt() resolves the voice to the stored reference audio and calls encode_speech_request() with ref_audio (not voice). This prevents regressions of #2479 and covers the new branch.
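
One possible shape for such a test is sketched below. `build_prompt_sketch` is a hypothetical stand-in for the prompt builder (the real method lives on the serving class and takes different arguments), and the tokenizer is a `MagicMock`, so nothing here exercises mistral_common itself.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def build_prompt_sketch(tokenizer, text, voice, uploaded_audio):
    """Hypothetical stand-in for the prompt builder under test."""
    ref_audio = uploaded_audio.get(voice.lower())
    if ref_audio is not None:
        request = SimpleNamespace(input=text, ref_audio=ref_audio, voice=None)
    else:
        request = SimpleNamespace(input=text, ref_audio=None, voice=voice)
    tokenized = tokenizer.encode_speech_request(request)
    return {"prompt_token_ids": tokenized.tokens}


def test_uploaded_voice_uses_ref_audio():
    tokenizer = MagicMock()
    tokenizer.encode_speech_request.return_value = SimpleNamespace(tokens=[1, 2, 3])

    prompt = build_prompt_sketch(tokenizer, "hello", "MyVoice", {"myvoice": "QUJD"})

    # The tokenizer must receive ref_audio, never the custom voice name.
    request = tokenizer.encode_speech_request.call_args.args[0]
    assert request.ref_audio == "QUJD"
    assert request.voice is None
    assert prompt["prompt_token_ids"] == [1, 2, 3]
```

Asserting on the recorded call argument, rather than only on the returned prompt, is what guards against a regression to the voice-name path.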

passionworkeer · 2026-04-14T09:05:01Z

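@copilot Thanks for the review! All four suggestions are useful:

Filter out non-audio uploads with embedding_source == "audio" ✓
Raise a user-facing error directly when the file is missing, so the request does not fail confusingly inside the tokenizer ✓
Reuse _get_uploaded_audio_data() to reduce duplication ✓
I will add a unit test covering the Voxtral + uploaded voice branch ✓

I'll update the code.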

lishunyang12

Review: request changes

The fix addresses a real bug (custom uploaded voices failing with the mistral_common tokenizer), but the implementation has several issues that should be addressed before merging.

1. Code duplication — reuse `_get_uploaded_audio_data()` (major)

The new code manually reads the file, base64-encodes it, constructs a data URI, then immediately strips the data URI prefix. This duplicates logic already in _get_uploaded_audio_data() (lines 458-486). The method should be reused:

if voice_lower in self.uploaded_speakers:
    ref_audio = self._get_uploaded_audio_data(voice)
    if ref_audio is not None:
        # Strip data URI prefix for mistral_common
        _, _, ref_audio = ref_audio.partition(",")
        tokenized = self._tts_tokenizer.encode_speech_request(
            SpeechRequest(input=text, ref_audio=ref_audio)
        )
        audio = tokenized.audios[0]
        return {
            "prompt_token_ids": tokenized.tokens,
            "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
        }

2. Pointless data URI construction then immediate stripping

ref_audio = f"data:{mime_type};base64,{audio_b64}"
# Strip data URI prefix for mistral_common
_, _, ref_audio = ref_audio.partition(",")

This builds a data URI only to tear it apart on the very next line. If you reuse _get_uploaded_audio_data(), at least the duplication is gone and the strip makes sense (since that method returns a data URI). Otherwise, just use audio_b64 directly.

3. No error handling on file read

_get_uploaded_audio_data() wraps file I/O in try/except and logs errors. The new code has a bare open(file_path, "rb") with no exception handling. If the file exists but is unreadable (permissions, disk error), this will raise an unhandled exception through the request path.
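
The log-and-return pattern described here can be sketched as follows. This is an illustrative helper with a hypothetical name (`read_audio_as_base64`), not the body of `_get_uploaded_audio_data()`.

```python
import base64
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def read_audio_as_base64(file_path: Path) -> Optional[str]:
    """Return the file's contents as base64, or None if it cannot be read."""
    try:
        audio_bytes = file_path.read_bytes()
    except OSError as e:
        # Covers missing files, permission errors, and other I/O failures.
        logger.error("Could not read reference audio %s: %s", file_path, e)
        return None
    return base64.b64encode(audio_bytes).decode("utf-8")
```

Callers then check for `None` and turn it into a clear error, instead of letting an `OSError` propagate through the request path.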

4. Silent fallthrough on missing file is problematic

# Fall through to voice-name path if file is missing

If the uploaded voice's file is missing, silently falling through to encode_speech_request(SpeechRequest(input=text, voice=voice)) will produce a confusing tokenizer error since the custom voice name is not a built-in voice. It would be better to raise a clear error or return a descriptive error so the user knows the upload file is missing.

5. Consider fixing this in the validator instead (architectural suggestion)

The Fish Speech path handles this in _validate_fish_speech_request() by resolving request.voice → request.ref_audio before the prompt builder runs. The voxtral validator (_validate_voxtral_tts_request) does not do this. Adding the resolution there would be more consistent with the existing patterns, and _build_voxtral_prompt would not need any changes since it already handles ref_audio correctly (lines 1342-1343 strip the data URI prefix, lines 1356-1361 encode with ref_audio).

This approach is cleaner because the prompt builder stays simple and the uploaded-voice resolution logic lives in one place per model type (the validator).
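
A rough sketch of validator-side resolution, under stated assumptions: `TTSRequestSketch` is a hypothetical request object, the lookup callable mirrors `_get_uploaded_audio_data()` only in that it returns a data URI or `None`, and returning an error string (rather than raising) is an assumption about the validator convention.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TTSRequestSketch:
    """Hypothetical request object; the real one has more fields."""
    input: str
    voice: Optional[str] = None
    ref_audio: Optional[str] = None


def resolve_uploaded_voice(
    request: TTSRequestSketch,
    uploaded_speakers: dict,
    get_uploaded_audio_data: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Rewrite request.voice to request.ref_audio; return an error message or None."""
    if request.ref_audio is not None or request.voice is None:
        return None  # Nothing to resolve.
    if request.voice.lower() not in uploaded_speakers:
        return None  # Built-in voice: leave it for the tokenizer.
    ref_audio = get_uploaded_audio_data(request.voice)
    if ref_audio is None:
        return f"Reference audio for uploaded voice '{request.voice}' is missing or unreadable."
    # Hand the prompt builder a request it already knows how to handle.
    request.ref_audio = ref_audio
    request.voice = None
    return None
```

With the request rewritten before the prompt builder runs, the builder only ever sees built-in voice names or explicit `ref_audio`.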

Summary: The bug fix is needed, but please (a) reuse _get_uploaded_audio_data() instead of duplicating its logic, (b) add error handling, (c) avoid the silent fallthrough, and ideally (d) consider resolving in the validator for architectural consistency.

Gaohan123 · 2026-05-05T03:58:31Z

@passionworkeer Hello, any updates?

fix: use ref_audio path for uploaded custom voices in voxtral TTS inf…

19dfeed

…erence (#2479)

passionworkeer requested a review from hsliuustc0106 as a code owner April 7, 2026 09:10

Copilot AI review requested due to automatic review settings April 7, 2026 09:10

Copilot started reviewing on behalf of passionworkeer April 7, 2026 09:11

Copilot AI reviewed Apr 7, 2026


passionworkeer mentioned this pull request Apr 14, 2026

fix: handle uploaded voice as ref_audio in Voxtral TTS #2790

Closed

lishunyang12 requested changes Apr 16, 2026


yenuo26 linked an issue Apr 28, 2026 that may be closed by this pull request

Voxtral-4B-TTS-2603 fails to load with acoustic_transformer not found in MistralForCausalLM #2388

Open

1 task

passionworkeer closed this by deleting the head repository May 5, 2026

Swifty57 mentioned this pull request May 7, 2026

Voxtral-4B-TTS-2603 fails to load with acoustic_transformer not found in MistralForCausalLM #2388

Open

1 task

passionworkeer wants to merge 1 commit into vllm-project:main from passionworkeer:fix-uploaded-custom-voice-ref-audio
