fix: use ref_audio path for uploaded custom voices in voxtral TTS inference (#2479) #2547
Pull request overview
Fixes Voxtral TTS inference for custom uploaded voices by resolving uploaded voice names to their stored reference audio before tokenization (avoiding mistral_common’s “unknown voice” failures).
Changes:
- Update _build_voxtral_prompt() to detect uploaded voices and pass ref_audio (base64) to encode_speech_request() instead of voice.
- Load the stored uploaded voice sample from disk and attach it as multimodal audio input for the Voxtral prompt.
    file_path = Path(speaker_info["file_path"])
    if file_path.exists():
        with open(file_path, "rb") as f:
            audio_bytes = f.read()
        audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
        mime_type = speaker_info.get("mime_type", "audio/wav")
This branch triggers for any entry in uploaded_speakers, including voices uploaded via upload_voice_embedding where file_path points to a .safetensors and mime_type is not audio. Reading and passing that file as ref_audio will produce invalid audio input for Voxtral. Gate this logic to audio-backed uploads (e.g., speaker_info.get("embedding_source") == "audio" or mime_type.startswith("audio/")) and return a clear error for embedding-only voices on Voxtral.
Suggested change:
- file_path = Path(speaker_info["file_path"])
- if file_path.exists():
-     with open(file_path, "rb") as f:
-         audio_bytes = f.read()
-     audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
-     mime_type = speaker_info.get("mime_type", "audio/wav")
+ mime_type = speaker_info.get("mime_type", "audio/wav")
+ embedding_source = speaker_info.get("embedding_source")
+ is_audio_backed = embedding_source == "audio" or mime_type.startswith("audio/")
+ if not is_audio_backed:
+     raise ValueError(
+         f"Uploaded voice '{voice}' is embedding-only and cannot be used as Voxtral "
+         "reference audio. Please provide an audio-backed uploaded voice or pass ref_audio."
+     )
+ file_path = Path(speaker_info["file_path"])
+ if file_path.exists():
+     with open(file_path, "rb") as f:
+         audio_bytes = f.read()
+     audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    if file_path.exists():
        with open(file_path, "rb") as f:
            audio_bytes = f.read()
        audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
        mime_type = speaker_info.get("mime_type", "audio/wav")
        ref_audio = f"data:{mime_type};base64,{audio_b64}"
        # Strip data URI prefix for mistral_common
        _, _, ref_audio = ref_audio.partition(",")
        tokenized = self._tts_tokenizer.encode_speech_request(
            SpeechRequest(input=text, ref_audio=ref_audio)
        )
        audio = tokenized.audios[0]
        return {
            "prompt_token_ids": tokenized.tokens,
            "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
        }
    # Fall through to voice-name path if file is missing
Falling through to the voice-name path when the uploaded voice file is missing will still fail for custom voices (the tokenizer doesn’t recognize the name). Instead of falling through, raise a user-facing error indicating the uploaded voice’s reference audio is missing/unreadable (and possibly suggest re-upload).
Suggested change:
- if file_path.exists():
-     with open(file_path, "rb") as f:
-         audio_bytes = f.read()
-     audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
-     mime_type = speaker_info.get("mime_type", "audio/wav")
-     ref_audio = f"data:{mime_type};base64,{audio_b64}"
-     # Strip data URI prefix for mistral_common
-     _, _, ref_audio = ref_audio.partition(",")
-     tokenized = self._tts_tokenizer.encode_speech_request(
-         SpeechRequest(input=text, ref_audio=ref_audio)
-     )
-     audio = tokenized.audios[0]
-     return {
-         "prompt_token_ids": tokenized.tokens,
-         "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
-     }
- # Fall through to voice-name path if file is missing
+ if not file_path.exists():
+     raise ValueError(
+         f"Reference audio for uploaded voice '{voice}' is missing. "
+         "Please re-upload the voice sample and try again."
+     )
+ try:
+     with open(file_path, "rb") as f:
+         audio_bytes = f.read()
+ except OSError as e:
+     raise ValueError(
+         f"Reference audio for uploaded voice '{voice}' could not be read. "
+         "Please re-upload the voice sample and try again."
+     ) from e
+ audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
+ mime_type = speaker_info.get("mime_type", "audio/wav")
+ ref_audio = f"data:{mime_type};base64,{audio_b64}"
+ # Strip data URI prefix for mistral_common
+ _, _, ref_audio = ref_audio.partition(",")
+ tokenized = self._tts_tokenizer.encode_speech_request(
+     SpeechRequest(input=text, ref_audio=ref_audio)
+ )
+ audio = tokenized.audios[0]
+ return {
+     "prompt_token_ids": tokenized.tokens,
+     "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
+ }
    with open(file_path, "rb") as f:
        audio_bytes = f.read()
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    mime_type = speaker_info.get("mime_type", "audio/wav")
    ref_audio = f"data:{mime_type};base64,{audio_b64}"
    # Strip data URI prefix for mistral_common
    _, _, ref_audio = ref_audio.partition(",")
This reimplements the same file-read/base64 logic as _get_uploaded_audio_data(), then immediately strips the data-URI prefix. Consider reusing _get_uploaded_audio_data(voice) and then partition(',') (or have a helper that returns raw base64) to avoid duplication and keep behavior consistent across call sites.
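The "helper that returns raw base64" idea could look roughly like the sketch below. The function name raw_base64_from_data_uri is hypothetical and not part of the codebase; the only assumption taken from the thread is that _get_uploaded_audio_data() returns a data URI of the form data:<mime>;base64,<payload>.

```python
def raw_base64_from_data_uri(data_uri: str) -> str:
    """Return the base64 payload of a data URI.

    Hypothetical helper: input that does not start with "data:" is assumed
    to be raw base64 already and is returned unchanged.
    """
    if not data_uri.startswith("data:"):
        return data_uri
    # The base64 alphabet never contains ",", so the first comma always
    # separates the "data:<mime>;base64" header from the payload.
    _, separator, payload = data_uri.partition(",")
    if not separator:
        raise ValueError("Malformed data URI: missing ',' separator")
    return payload
```

Centralizing the strip in one helper would let every call site share the same handling of malformed input instead of repeating partition(",").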
| # For custom uploaded voices, mistral_common doesn't know the voice name. | ||
| # Resolve to reference audio data stored at upload time instead. | ||
| voice_lower = voice.lower() | ||
| if voice_lower in self.uploaded_speakers: | ||
| speaker_info = self.uploaded_speakers[voice_lower] |
Please add a unit test covering Voxtral inference with an uploaded voice name: ensure _build_voxtral_prompt() resolves the voice to the stored reference audio and calls encode_speech_request() with ref_audio (not voice). This prevents regressions of #2479 and covers the new branch.
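A rough shape for such a test is sketched below. FakeServing is a hypothetical stand-in written only for this example: it mimics the uploaded-voice branch under discussion but passes a plain dict to the tokenizer where the real code builds a SpeechRequest, and it is not the project's serving class. A real test would instantiate the actual class and patch its _tts_tokenizer.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


class FakeServing:
    """Hypothetical stand-in exposing only what the test exercises."""

    def __init__(self, tokenizer, uploaded_audio):
        self._tts_tokenizer = tokenizer
        # Maps lower-cased voice name -> data URI (assumed storage shape).
        self.uploaded_speakers = uploaded_audio

    def _get_uploaded_audio_data(self, voice):
        return self.uploaded_speakers.get(voice.lower())

    def _build_voxtral_prompt(self, text, voice):
        if voice.lower() in self.uploaded_speakers:
            data_uri = self._get_uploaded_audio_data(voice)
            _, _, ref_audio = data_uri.partition(",")
            request = {"input": text, "ref_audio": ref_audio}
        else:
            request = {"input": text, "voice": voice}
        tokenized = self._tts_tokenizer.encode_speech_request(request)
        return {"prompt_token_ids": tokenized.tokens}


def test_uploaded_voice_uses_ref_audio():
    tokenizer = MagicMock()
    tokenizer.encode_speech_request.return_value = SimpleNamespace(tokens=[1, 2, 3])
    serving = FakeServing(tokenizer, {"my_voice": "data:audio/wav;base64,QUJD"})

    prompt = serving._build_voxtral_prompt("hello", "My_Voice")

    # The tokenizer must receive the raw base64 audio, never the voice name.
    (request,), _ = tokenizer.encode_speech_request.call_args
    assert request == {"input": "hello", "ref_audio": "QUJD"}
    assert prompt["prompt_token_ids"] == [1, 2, 3]
```

Asserting on the arguments passed to encode_speech_request() is what guards against a regression of #2479, since the original failure came from the voice name reaching the tokenizer.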
@copilot Thanks for the review! All 4 suggestions are very practical.
I'll go update the code.
lishunyang12
left a comment
Review: request changes
The fix addresses a real bug (custom uploaded voices failing with the mistral_common tokenizer), but the implementation has several issues that should be addressed before merging.
1. Code duplication — reuse _get_uploaded_audio_data() (major)
The new code manually reads the file, base64-encodes it, constructs a data URI, then immediately strips the data URI prefix. This duplicates logic already in _get_uploaded_audio_data() (lines 458-486). The method should be reused:
    if voice_lower in self.uploaded_speakers:
        ref_audio = self._get_uploaded_audio_data(voice)
        if ref_audio is not None:
            # Strip data URI prefix for mistral_common
            _, _, ref_audio = ref_audio.partition(",")
            tokenized = self._tts_tokenizer.encode_speech_request(
                SpeechRequest(input=text, ref_audio=ref_audio)
            )
            audio = tokenized.audios[0]
            return {
                "prompt_token_ids": tokenized.tokens,
                "multi_modal_data": {"audio": [(audio.audio_array, audio.sampling_rate)]},
            }
2. Pointless data URI construction then immediate stripping
    ref_audio = f"data:{mime_type};base64,{audio_b64}"
    # Strip data URI prefix for mistral_common
    _, _, ref_audio = ref_audio.partition(",")
This builds a data URI only to tear it apart on the very next line. If you reuse _get_uploaded_audio_data(), at least the duplication is gone and the strip makes sense (since that method returns a data URI). Otherwise, just use audio_b64 directly.
3. No error handling on file read
_get_uploaded_audio_data() wraps file I/O in try/except and logs errors. The new code has a bare open(file_path, "rb") with no exception handling. If the file exists but is unreadable (permissions, disk error), this will raise an unhandled exception through the request path.
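As an illustration of the guarded read described here, a minimal sketch follows. The function name read_audio_as_base64 is hypothetical, and the log-and-return-None behaviour is an assumption modelled on how the comment describes _get_uploaded_audio_data(); the real method may differ in detail.

```python
import base64
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def read_audio_as_base64(file_path: Path) -> Optional[str]:
    """Read a stored voice sample, returning base64 text or None on failure."""
    try:
        audio_bytes = file_path.read_bytes()
    except OSError as exc:
        # OSError covers missing files, permission errors, and disk errors,
        # so none of them propagate unhandled through the request path.
        logger.error("Could not read reference audio %s: %s", file_path, exc)
        return None
    return base64.b64encode(audio_bytes).decode("utf-8")
```

Catching OSError also removes the need for a separate exists() check, which would otherwise leave a window between the check and the read.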
4. Silent fallthrough on missing file is problematic
    # Fall through to voice-name path if file is missing
If the uploaded voice's file is missing, silently falling through to encode_speech_request(SpeechRequest(input=text, voice=voice)) will produce a confusing tokenizer error since the custom voice name is not a built-in voice. It would be better to raise a clear error or return a descriptive error so the user knows the upload file is missing.
5. Consider fixing this in the validator instead (architectural suggestion)
The Fish Speech path handles this in _validate_fish_speech_request() by resolving request.voice → request.ref_audio before the prompt builder runs. The voxtral validator (_validate_voxtral_tts_request) does not do this. Adding the resolution there would be more consistent with the existing patterns, and _build_voxtral_prompt would not need any changes since it already handles ref_audio correctly (lines 1342-1343 strip the data URI prefix, lines 1356-1361 encode with ref_audio).
This approach is cleaner because the prompt builder stays simple and the uploaded-voice resolution logic lives in one place per model type (the validator).
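To make the validator-side idea concrete, here is a minimal sketch under stated assumptions. SpeechRequestStub and resolve_uploaded_voice are hypothetical names for this example only; the error-string return convention and the clearing of voice after resolution are assumptions, not confirmed behaviour of _validate_fish_speech_request() or _validate_voxtral_tts_request().

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class SpeechRequestStub:
    """Hypothetical stand-in for the API request object."""
    input: str
    voice: Optional[str] = None
    ref_audio: Optional[str] = None


def resolve_uploaded_voice(
    request: SpeechRequestStub,
    uploaded_speakers: Mapping[str, object],
    get_uploaded_audio_data: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Rewrite an uploaded voice name into ref_audio; return an error or None."""
    if request.ref_audio is not None or request.voice is None:
        return None  # Nothing to resolve.
    if request.voice.lower() not in uploaded_speakers:
        return None  # Built-in voice: leave it for the tokenizer.
    data_uri = get_uploaded_audio_data(request.voice)
    if data_uri is None:
        return (
            f"Reference audio for uploaded voice '{request.voice}' is missing "
            "or unreadable. Please re-upload the voice sample."
        )
    # Keep the data URI form; the prompt builder is described as stripping
    # the prefix itself. Clearing voice is an assumption of this sketch.
    request.ref_audio = data_uri
    request.voice = None
    return None
```

With resolution done at validation time, a missing file surfaces as a request error before any tokenization happens, and the prompt builder only ever sees built-in voice names or ref_audio.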
Summary: The bug fix is needed, but please (a) reuse _get_uploaded_audio_data() instead of duplicating its logic, (b) add error handling, (c) avoid the silent fallthrough, and ideally (d) consider resolving in the validator for architectural consistency.
@passionworkeer Hello, any updates?
Fixes issue #2479 - Custom voice upload fails at inference with mistral_common tokenizer.
Problem
When a custom voice is uploaded via /audio/voices/upload, it is added to supported_speakers and uploaded_speakers, but _build_voxtral_prompt() calls encode_speech_request() with the voice name string. The mistral_common tokenizer only knows built-in voice names, not custom uploaded ones, causing the tokenizer to fail.
Fix
In _build_voxtral_prompt(), when voice is found in uploaded_speakers, resolve it to the stored reference audio file and use SpeechRequest(input=text, ref_audio=...) instead of SpeechRequest(input=text, voice=voice). This mirrors how the CosyVoice3 and FishSpeech paths already handle uploaded voices.
Changed file
ref_audio for the tokenizer.
Closes #2479