[BugFix][VoxCPM2]: split multichar Chinese tokens to match training tokenization#2832
Conversation
Tested this on H20 (h20-server-1, GPU 1) on the PR branch (commit 64cf5ce = your fix rebased onto current main). I reverted my own prior tokenizer wrapper and any other local hacks before testing. The server log confirms the fix code is live, but Whisper still reports garbled Chinese.
For comparison, the broken pre-fix output:
One hypothesis worth ruling out: the split map is built from
Server start command used:
Happy to share the WAVs if useful. cc @Sy0307
The fix looks correct. The lazy initialization of the split map is clean, and the performance data shows no regression. What's missing is an automated regression test for the tokenization behavior — manual ASR verification is good but not sufficient to prevent regressions. A unit test asserting that tokenized input "你好" produces the expected single-char token IDs would catch future regressions.
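Such a regression test could be sketched as below. The helper `split_multichar_chinese` and the toy `SPLIT_MAP` are stand-ins for illustration; a real test would import the PR's helper and build the map from the actual VoxCPM2 tokenizer vocab.

```python
# Sketch of the regression test the review asks for. Names are assumed,
# not the PR's exact identifiers.

def split_multichar_chinese(token_ids, split_map):
    """Replace each multi-char Chinese token ID with its single-char IDs."""
    out = []
    for tid in token_ids:
        out.extend(split_map.get(tid, [tid]))
    return out

# IDs from the PR description: "你好" id=23523 -> "你" id=59496, "好" id=59495.
SPLIT_MAP = {23523: [59496, 59495]}

def test_nihao_splits_to_single_char_ids():
    assert split_multichar_chinese([23523], SPLIT_MAP) == [59496, 59495]

def test_split_is_idempotent():
    # Running the split twice must give the same result as running it once.
    once = split_multichar_chinese([1, 23523, 2], SPLIT_MAP)
    assert split_multichar_chinese(once, SPLIT_MAP) == once
```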
VoxCPM2 was trained with mask_multichar_chinese_tokens which splits
multi-character Chinese tokens (e.g. "你好" id=23523) into single-char
IDs ("你" id=59496, "好" id=59495). The HuggingFace openbmb/VoxCPM2
model repo ships a plain LlamaTokenizerFast without this splitting,
causing garbled Chinese audio output via the /v1/audio/speech API.
Add _split_multichar_chinese() in preprocess() to fix up token IDs
before they reach the model. The split map is lazily built from the
tokenizer vocab on first request. The operation is idempotent so it
works correctly regardless of whether the tokenizer already does
char-level splitting.
Signed-off-by: Sy03 <1370724210@qq.com>
64cf5ce to 7df2dc9
I continued using the gradio_demo.py program and found that when not using voice cloning, the generated streaming and non-streaming output audio had no issues, except that on the webpage the streaming output audio played twice in a row. However, when using voice cloning, the content of both streaming and non-streaming output changed. These are the results of my tests: 6.mp4, 5.mp4. The branch used is fix-voxcpm2-chinese-tokenizer. Below is the operation log I used:

(voxcpm-omni) root@AS-4124GS-TNR:/home/www# git clone https://github.com/vllm-project/vllm-omni.git
(voxcpm-omni) root@AS-4124GS-TNR:/home/www# cd vllm-omni
(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni# git checkout fix-voxcpm2-chinese-tokenizer
(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni# git branch --show-current
(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni# pip install -e .
(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni# pip show vllm-omni
(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni#

These are the terminal logs from the recent test.
Hi @Sy0307, I am also facing issues with the voice-cloning flow, but only in English.
lishunyang12 left a comment
Review: LGTM
Clean, well-scoped fix for the multichar Chinese token mismatch. The approach of splitting at the serving layer and fail-fast validating at the model layer is sound.
Correctness
- `is_cjk_char` covers the main CJK Unicode blocks. It misses Extensions E/F/G/I (U+2B820..U+323AF), but those are vanishingly rare in practice — fine to add later if needed.
- `build_cjk_split_map` correctly strips the sentencepiece `▁` prefix, validates that all constituent chars have non-UNK IDs, and caches the result.
- `split_multichar_chinese` is a clean O(n) pass, idempotent as documented.
- The switch from `{"prompt": text}` to `{"prompt_token_ids": ids}` correctly preserves BOS handling (the preprocess already strips the leading BOS).
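For concreteness, here is a hedged sketch of the kind of range check the review refers to; the PR's actual `is_cjk_char` may differ. The ranges are the standard Unicode CJK blocks, with Extensions E/F/G/I deliberately left out to mirror the review's note.

```python
# Illustrative CJK membership check; not the PR's exact implementation.
_CJK_RANGES = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    # Extensions E/F/G/I (U+2B820..) are not covered, per the review note.
]

def is_cjk_char(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _CJK_RANGES)
```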
Thread safety
`_voxcpm2_encode` lazy-inits `_voxcpm2_tokenizer` in the async event loop with no await between the None-check and the assignment, so there is no race. Good.
Performance
Lazy one-time map build + O(n) per-request scan on a small number of text tokens — negligible overhead, consistent with the benchmark numbers in the PR.
Minor suggestions (non-blocking)
- Duplicate tokenizer load: `_voxcpm2_encode` calls `AutoTokenizer.from_pretrained(model_name)`, which loads the tokenizer a second time in the serving process. If there's a way to reuse the engine's tokenizer (e.g. via `self.engine_client`), that would save memory and startup time. Not critical since it's a one-time cost.
- `_get_multichar_zh_split()` in the preprocess hot path: the lazy build is fine, but `any(tid in split_map for tid in token_ids)` runs on every prefill. Since the serving layer is now responsible for splitting, this check should never fire in normal operation. Consider gating it behind a debug/assert mode if profiling ever shows it matters (unlikely with current token counts).
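One way the debug gating suggested above could look, as a sketch; the env var `VOXCPM2_STRICT_SPLIT_CHECK` and function names are hypothetical, not part of the PR.

```python
import os

# Hypothetical opt-in flag: only pay the O(n) membership scan when debugging.
_STRICT = os.environ.get("VOXCPM2_STRICT_SPLIT_CHECK", "0") == "1"

def maybe_check_split(token_ids, split_map):
    # In normal operation the serving layer has already split multichar
    # Chinese tokens, so this should never fire.
    if _STRICT and any(tid in split_map for tid in token_ids):
        raise ValueError("multichar Chinese token ID reached the model layer")
```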
Tested logic looks correct. Approving.
I redeployed the latest merged project code to the server for testing, deleted the model, cleared the cache, and re-downloaded it. During testing I found that audio output is normal when not using voice cloning, but when using voice cloning, whether in Chinese or English and whether streaming output is enabled or not, there are issues — all the audio is noisy and garbled. When not using voice cloning but enabling streaming output, the generated audio data overlaps, producing two identical outputs. Below is an example video from my test; hopefully it helps: output.mp4
I will handle the bug in voice-clone mode ASAP; thanks for your report. @gesla2024
…okenization (vllm-project#2832) Signed-off-by: Sy03 <1370724210@qq.com>
Purpose
Fix garbled Chinese audio output from VoxCPM2 via the `/v1/audio/speech` API.

Root cause: VoxCPM2 was trained with `mask_multichar_chinese_tokens`, which splits multi-character Chinese tokens (e.g. "你好" id=23523) into single-character IDs ("你" id=59496, "好" id=59495). The HuggingFace `openbmb/VoxCPM2` model repo ships a plain `LlamaTokenizerFast` without this splitting, so the model receives token IDs it was never trained on, producing garbled Chinese output.

Related: #2758 (comment)
Test Plan
Tested on NVIDIA H20 with latest main (50ae1de), without the custom `tokenization_voxcpm2.py` that was previously masking the bug:

vllm-omni serve openbmb/VoxCPM2 --stage-configs-path vllm_omni/model_executor/stage_configs/voxcpm2.yaml --omni --trust-remote-code
curl /v1/audio/speech -d '{"input": "你好,这是一个测试程序。", ...}'

Test Result
Correctness (whisper-base ASR):
Performance (A/B on H20, origin/main, torch.compile + CUDA Graph enabled):
Zero performance impact — the split map is lazily built once and the per-request lookup runs only during prefill on a few dozen tokens.
cc @linyueqian @gesla2024