[BugFix][VoxCPM2]: split multichar Chinese tokens to match training tokenization #2832

Merged: linyueqian merged 2 commits into vllm-project:main from Sy0307:fix/voxcpm2-chinese-tokenizer on Apr 16, 2026

Conversation

@Sy0307 (Contributor) commented Apr 15, 2026

Purpose

Fix garbled Chinese audio output from VoxCPM2 via the /v1/audio/speech API.

Root cause: VoxCPM2 was trained with mask_multichar_chinese_tokens which splits multi-character Chinese tokens (e.g. "你好" id=23523) into single-character IDs ("你" id=59496, "好" id=59495). The HuggingFace openbmb/VoxCPM2 model repo ships a plain LlamaTokenizerFast without this splitting, so the model receives token IDs it was never trained on, producing garbled Chinese output.
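
For reviewers, here is a minimal sketch of the idea (illustrative only; the helper names and the sentencepiece handling are simplified, not the PR's literal code):

```python
# Illustrative sketch, not the PR's exact implementation.
def _is_cjk(c: str) -> bool:
    return 0x4E00 <= ord(c) <= 0x9FFF  # basic CJK Unified Ideographs block only

def build_split_map(tokenizer) -> dict[int, list[int]]:
    """Map each multi-char Chinese token ID to its single-char token IDs."""
    unk_id = tokenizer.unk_token_id
    split_map: dict[int, list[int]] = {}
    for token, token_id in tokenizer.get_vocab().items():
        text = token.lstrip("\u2581")  # drop the sentencepiece word-boundary prefix
        if len(text) < 2 or not all(_is_cjk(c) for c in text):
            continue
        char_ids = [tokenizer.convert_tokens_to_ids(c) for c in text]
        if all(cid is not None and cid != unk_id for cid in char_ids):
            split_map[token_id] = char_ids  # e.g. "你好" 23523 -> [59496, 59495]
    return split_map

def split_multichar_chinese(token_ids: list[int], split_map: dict[int, list[int]]) -> list[int]:
    """Expand multi-char Chinese token IDs; idempotent on already-split input."""
    out: list[int] = []
    for tid in token_ids:
        out.extend(split_map.get(tid, [tid]))
    return out
```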

Related: #2758 (comment)

Test Plan

Tested on NVIDIA H20 with latest main (50ae1de), without the custom tokenization_voxcpm2.py that was previously masking the bug:

  1. Start server: vllm-omni serve openbmb/VoxCPM2 --stage-configs-path vllm_omni/model_executor/stage_configs/voxcpm2.yaml --omni --trust-remote-code
  2. Send Chinese TTS: curl /v1/audio/speech -d '{"input": "你好,这是一个测试程序。", ...}'
  3. ASR verify with whisper-base
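
Steps 2 and 3 can be scripted roughly as follows (a sketch; the port and the request fields beyond "input" are assumptions, so adjust them to match your serve command):

```python
# Sketch of the TTS request + whisper-base ASR check; the port and the
# "model" field are assumptions, not taken from the PR.
import requests
import whisper  # openai-whisper

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"model": "openbmb/VoxCPM2", "input": "你好,这是一个测试程序。"},
)
resp.raise_for_status()
with open("out.wav", "wb") as f:
    f.write(resp.content)

asr = whisper.load_model("base")  # whisper-base, as in the test plan
print(asr.transcribe("out.wav", language="zh")["text"])
```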

Test Result

Correctness (whisper-base ASR):

| Input | ASR Output | Status |
|---|---|---|
| 你好,这是一个测试程序。 | 你好,这是一个测试程序 | Pass |
| 人工智能正在深刻改变...AI技术的应用范围越来越广泛。 | 人工智能正在深刻改变...AI技术的应用范围越来越广泛 | Pass |
| Hello, this is a quick test of VoxCPM2 synthesis. | Hello, this is a quick test of Vox CPM2 synthesis. | Pass |

Performance (A/B on H20, origin/main, torch.compile + CUDA Graph enabled):

| Metric | Baseline (no fix) | With fix | Diff |
|---|---|---|---|
| Avg RTF | 0.111 | 0.108 | -2.7% (noise) |

Zero performance impact — the split map is lazily built once and the per-request lookup runs only during prefill on a few dozen tokens.
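
The lazy build is a plain cache-on-first-use guard, roughly (class and attribute names are illustrative; build_split_map is the sketch from the description above):

```python
# Rough shape of the lazy one-time build; names are illustrative.
class VoxCPM2Talker:
    def __init__(self, tokenizer):
        self._tokenizer = tokenizer
        self._zh_split_map: dict[int, list[int]] | None = None

    def _get_multichar_zh_split(self) -> dict[int, list[int]]:
        # Built from the tokenizer vocab on the first request only;
        # every later request reuses the cached dict.
        if self._zh_split_map is None:
            self._zh_split_map = build_split_map(self._tokenizer)
        return self._zh_split_map
```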

cc @linyueqian @gesla2024

@Sy0307 Sy0307 marked this pull request as ready for review April 15, 2026 19:48
@Sy0307 Sy0307 requested a review from hsliuustc0106 as a code owner April 15, 2026 19:48
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@linyueqian added the ready label (label to trigger buildkite CI) Apr 15, 2026
@linyueqian (Collaborator)

Tested this on H20 (h20-server-1, GPU 1) on the PR branch (commit 64cf5ce = your fix rebased onto current main). Reverted my own prior tokenizer wrapper and any other local hacks before testing.

Server log confirms the fix code is live:

voxcpm2_talker.py:1014  VoxCPM2: built multichar Chinese split map (19400 entries)

But Whisper still reports garbled Chinese:

| Input | Audio length | whisper-large-v3 | whisper-small (forced lang=zh) |
|---|---|---|---|
| 你好,这是一个测试程序。 | 2.24s | ふっガタン (lang=ja) | 方日遊 |
| Hello, this is a test program. | 3.36s | correct EN | n/a |
| 你好,这是一个voxcpm2 测试程序 在vllm-omni 0.19 中测试的。 | 6.56s | mostly garbled | wichtig订阅今天节目点个 Esen 辑入 |

For comparison, the broken pre-fix zh_only.wav was ~1.12s. With your fix it's 2.24s, so the split is doing something (the model is consuming a longer/different ID sequence), but the audio itself is still unintelligible.

One hypothesis worth ruling out: the split map is built from self.tts.text_tokenizer.tokenizer.get_vocab(), but the IDs reaching preprocess() come from vllm's request-side tokenizer (the one wired through cached_tokenizer_from_config for /v1/audio/speech). If those two tokenizers don't share the exact same vocab IDs (e.g. different added_tokens.json ordering), the split map would expand the wrong IDs and the model would still see out-of-distribution sequences. Could you double-check that self.tts.text_tokenizer.tokenizer is <the one vllm uses for input encoding>, or at minimum that tokenizer.encode("你好") returns the same ID list in both paths?
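
A quick probe for that hypothesis (a sketch; pass in however each side's tokenizer is actually constructed):

```python
# Sanity probe for the vocab-mismatch hypothesis. serving_tok would be the
# request-side tokenizer, engine_tok e.g. self.tts.text_tokenizer.tokenizer.
def check_tokenizer_parity(serving_tok, engine_tok, samples=("你好", "测试程序")):
    for text in samples:
        a = serving_tok.encode(text, add_special_tokens=False)
        b = engine_tok.encode(text, add_special_tokens=False)
        assert a == b, f"tokenizer mismatch for {text!r}: {a} != {b}"
```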

Server start command used:

CUDA_VISIBLE_DEVICES=1 HF_HOME=/mnt/data4/huggingface \
  vllm serve openbmb/VoxCPM2 \
    --stage-configs-path vllm_omni/model_executor/stage_configs/voxcpm2.yaml \
    --omni --port 8071 --trust-remote-code --enforce-eager \
    --gpu-memory-utilization 0.8

Happy to share the WAVs if useful. cc @Sy0307

@hsliuustc0106 (Collaborator)

Fix looks correct. The lazy initialization of the split map is clean and the performance data shows no regression.

One gap: there is no automated regression test for the tokenization behavior — manual ASR verification is good but not sufficient to prevent regressions. A unit test asserting that "你好" tokenizes to the expected single-char token IDs would catch future breakage; a sketch follows.
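
Something along these lines would do (a sketch; the expected IDs are the ones quoted in the PR description, and the helper names follow the sketch there rather than the PR's actual module layout):

```python
# Regression-test sketch; the IDs (23523 -> [59496, 59495]) come from the
# PR description, and the helper names from the sketch given there.
from transformers import AutoTokenizer

def test_multichar_chinese_token_is_split():
    tok = AutoTokenizer.from_pretrained("openbmb/VoxCPM2", trust_remote_code=True)
    split_map = build_split_map(tok)
    assert split_map.get(23523) == [59496, 59495]  # "你好" -> "你", "好"
    assert split_multichar_chinese([23523], split_map) == [59496, 59495]
    # Idempotence: already-split IDs must pass through unchanged.
    assert split_multichar_chinese([59496, 59495], split_map) == [59496, 59495]
```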

Sy0307 added 2 commits April 16, 2026 11:18
…zation

VoxCPM2 was trained with mask_multichar_chinese_tokens which splits
multi-character Chinese tokens (e.g. "你好" id=23523) into single-char
IDs ("你" id=59496, "好" id=59495). The HuggingFace openbmb/VoxCPM2
model repo ships a plain LlamaTokenizerFast without this splitting,
causing garbled Chinese audio output via the /v1/audio/speech API.

Add _split_multichar_chinese() in preprocess() to fix up token IDs
before they reach the model. The split map is lazily built from the
tokenizer vocab on first request. The operation is idempotent so it
works correctly regardless of whether the tokenizer already does
char-level splitting.

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307 Sy0307 force-pushed the fix/voxcpm2-chinese-tokenizer branch from 64cf5ce to 7df2dc9 Compare April 16, 2026 03:52
@gesla2024

I continued testing with the gradio_demo.py program. Without voice cloning, both the streaming and non-streaming output audio were fine, except that on the web page the streaming audio played twice back to back. With voice cloning, however, the content of both the streaming and non-streaming output changed.

These are the results of my tests.

6.mp4
5.mp4

The branch used is: fix-voxcpm2-chinese-tokenizer

Below is the log of the operations I ran.

(voxcpm-omni) root@AS-4124GS-TNR:/home/www# git clone https://github.com/vllm-project/vllm-omni.git
Cloning into 'vllm-omni'...
remote: Enumerating objects: 22046, done.
remote: Counting objects: 100% (456/456), done.
remote: Compressing objects: 100% (364/364), done.
remote: Total 22046 (delta 283), reused 92 (delta 92), pack-reused 21590 (from 4)
Receiving objects: 100% (22046/22046), 22.74 MiB | 4.12 MiB/s, done.
Resolving deltas: 100% (14865/14865), done.

(voxcpm-omni) root@AS-4124GS-TNR:/home/www# cd vllm-omni
(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni# git fetch origin pull/2832/head:fix-voxcpm2-chinese-tokenizer
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 16 (delta 14), reused 16 (delta 14), pack-reused 0 (from 0)
Unpacking objects: 100% (16/16), 3.27 KiB | 418.00 KiB/s, done.
From https://github.com/vllm-project/vllm-omni

 * [new ref]         refs/pull/2832/head -> fix-voxcpm2-chinese-tokenizer

(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni# git checkout fix-voxcpm2-chinese-tokenizer
Switched to branch 'fix-voxcpm2-chinese-tokenizer'

(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni# git branch --show-current
fix-voxcpm2-chinese-tokenizer

(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni# pip install -e .
Looking in indexes: https://mirrors.pku.edu.cn/pypi/web/simple
Obtaining file:///home/www/vllm-omni
....
....
Created wheel for vllm-omni: filename=vllm_omni-0.19.0rc2.dev143+g7df2dc9ac-0.editable-py3-none-any.whl size=11373 sha256=9695c3d794f9da5329cdc4279ac6e8340bf4ecbc0209828ddeab6e267cb449ee
Stored in directory: /tmp/pip-ephem-wheel-cache-cx0eq_ew/wheels/40/d1/3d/b5974f53b81623adf2d6340c122f931855062fa6d82ca19137
Successfully built vllm-omni
Installing collected packages: vllm-omni
Successfully installed vllm-omni-0.19.0rc2.dev143+g7df2dc9ac

(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni# pip show vllm-omni
Name: vllm-omni
Version: 0.19.0rc2.dev143+g7df2dc9ac
Summary: A framework for efficient model inference with omni-modality models
Home-page: https://github.com/vllm-project/vllm-omni
Author: vLLM-Omni Team
Author-email:
License-Expression: Apache-2.0
Location: /root/miniconda3/envs/voxcpm-omni/lib/python3.12/site-packages
Editable project location: /home/www/vllm-omni
Requires: accelerate, aenum, av, cache-dit, diffusers, einops, fa3-fwd, imageio, janus, omegaconf, onnxruntime, openai-whisper, prettytable, pydub, pyzmq, resampy, soundfile, sox, torchsde, tqdm, x-transformers
Required-by:

(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni#

(voxcpm-omni) root@AS-4124GS-TNR:/home/www/vllm-omni# vllm-omni serve /home/VoxCPM/models/VoxCPM2 \
    --stage-configs-path /home/www/vllm-omni/vllm_omni/model_executor/stage_configs/voxcpm2.yaml \
    --omni \
    --port 8071 \
    --trust-remote-code \
    --enforce-eager \
    --gpu-memory-utilization 0.8
(APIServer pid=577724) INFO 04-16 12:20:30 [utils.py:299] vLLM server version 0.19.0, serving model /home/VoxCPM/models/VoxCPM2
(APIServer pid=577724) INFO 04-16 12:20:30 [utils.py:233] non-default args: {'model_tag': '/home/VoxCPM/models/VoxCPM2', 'port': 8071, 'model': '/home/VoxCPM/models/VoxCPM2', 'trust_remote_code': True, 'enforce_eager': True, 'gpu_memory_utilization': 0.8}
(APIServer pid=577724) INFO 04-16 12:20:30 [omni_base.py:93] [AsyncOmni] Initializing with model /home/VoxCPM/models/VoxCPM2
(APIServer pid=577724) INFO 04-16 12:20:30 [async_omni_engine.py:244] [AsyncOmniEngine] Initializing with model /home/VoxCPM/models/VoxCPM2
(APIServer pid=577724) WARNING 04-16 12:20:30 [async_omni_engine.py:1290] stage_configs_path is set — the following top-level engine args are ignored (per-stage YAML takes precedence): attention_config, compilation_config, enforce_eager, eplb_config, gpu_memory_utilization, kernel_config, profiler_config, reasoning_parser_plugin, structured_outputs_config, trust_remote_code
(APIServer pid=577724) WARNING 04-16 12:20:30 [utils.py:115] Filtered out 1 callable object(s) from base_engine_args that are not compatible with OmegaConf: ['dispatch_function'].
(APIServer pid=577724) INFO: Started server process [577724]
(APIServer pid=577724) INFO: Waiting for application startup.
(APIServer pid=577724) INFO: Application startup complete.
(APIServer pid=577724) INFO 04-16 12:22:02 [serving_speech.py:831] VoxCPM2 serving: built multichar split map (19789 entries)
(APIServer pid=577724) INFO 04-16 12:22:02 [serving_speech.py:1599] TTS speech request speech-8da1d309698701d6: text='你好吗,我是一个测试程序', model=voxcpm2
(APIServer pid=577724) INFO: 160.213.131.122:31987 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(APIServer pid=577724) WARNING 04-16 12:22:02 [input_processor.py:235] Passing raw prompts to InputProcessor is deprecated and will be removed in v0.18. You should instead pass the outputs of Renderer.render_cmpl() or Renderer.render_chat().
(APIServer pid=577724) INFO 04-16 12:22:02 [orchestrator.py:670] [Orchestrator] _handle_add_request: stage=0 req=speech-8da1d309698701d6 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=0 num_sampling_params=1
(APIServer pid=577724) INFO 04-16 12:22:02 [stage_engine_core_client.py:170] [StageEngineCoreClient] Stage-0 adding request: speech-8da1d309698701d6
(Worker pid=578523) WARNING 04-16 12:22:02 [gpu_model_runner.py:350] additional_information on request data is deprecated, use model_intermediate_buffer
(Worker pid=578523) INFO 04-16 12:22:03 [voxcpm2_talker.py:1037] VoxCPM2: built multichar Chinese split map (19789 entries)
(Worker pid=578523) WARNING 04-16 12:22:03 [gpu_model_runner.py:1378] _merge_additional_information_update is deprecated, use _update_intermediate_buffer
(Worker pid=578523) INFO 04-16 12:22:03 [voxcpm2_talker.py:574] VoxCPM2: torch.compile applied to: LocDiT, feat_encoder, AudioVAE, scaffold+residual (CUDA Graph, skipping compile), projections
(APIServer pid=577724) INFO 04-16 12:22:13 [async_omni.py:272] [AsyncOmni] Request speech-8da1d309698701d6 aborted.
(APIServer pid=577724) INFO 04-16 12:22:13 [serving_speech.py:1208] Streaming request speech-8da1d309698701d6 cancelled by client
(APIServer pid=577724) INFO 04-16 12:22:13 [orchestrator.py:868] [Orchestrator] Aborted request(s) ['speech-8da1d309698701d6']
(APIServer pid=577724) INFO 04-16 12:22:14 [serving_speech.py:1599] TTS speech request speech-9018a3c9a173f56c: text='你好吗,我是一个测试程序', model=voxcpm2
(APIServer pid=577724) INFO: 160.213.131.122:28370 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(APIServer pid=577724) INFO 04-16 12:22:14 [orchestrator.py:670] [Orchestrator] _handle_add_request: stage=0 req=speech-9018a3c9a173f56c prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=0 num_sampling_params=1
(APIServer pid=577724) INFO 04-16 12:22:14 [stage_engine_core_client.py:170] [StageEngineCoreClient] Stage-0 adding request: speech-9018a3c9a173f56c
(Worker pid=578523) INFO 04-16 12:22:17 [voxcpm2_talker.py:647] CUDA Graph captured for scaffold (batch_size=1)
(Worker pid=578523) INFO 04-16 12:22:17 [voxcpm2_talker.py:647] CUDA Graph captured for residual (batch_size=1)
(APIServer pid=577724) INFO 04-16 12:22:18 [omni_base.py:162] [Summary] {}
(APIServer pid=577724) INFO 04-16 12:22:25 [serving_speech.py:1599] TTS speech request speech-928c90092de4b36f: text='你好吗,我是一个测试程序', model=voxcpm2
(APIServer pid=577724) INFO: 160.213.131.122:43881 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(APIServer pid=577724) INFO 04-16 12:22:25 [orchestrator.py:670] [Orchestrator] _handle_add_request: stage=0 req=speech-928c90092de4b36f prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=0 num_sampling_params=1
(APIServer pid=577724) INFO 04-16 12:22:25 [stage_engine_core_client.py:170] [StageEngineCoreClient] Stage-0 adding request: speech-928c90092de4b36f
(APIServer pid=577724) INFO 04-16 12:22:25 [omni_base.py:162] [Summary] {}
(APIServer pid=577724) INFO 04-16 12:22:43 [serving_speech.py:1599] TTS speech request speech-9ade2235cc52453f: text='你好吗,我是一个测试程序', model=voxcpm2
(APIServer pid=577724) INFO: 160.213.131.122:8409 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(APIServer pid=577724) INFO 04-16 12:22:43 [orchestrator.py:670] [Orchestrator] _handle_add_request: stage=0 req=speech-9ade2235cc52453f prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=0 num_sampling_params=1
(APIServer pid=577724) INFO 04-16 12:22:43 [stage_engine_core_client.py:170] [StageEngineCoreClient] Stage-0 adding request: speech-9ade2235cc52453f
(APIServer pid=577724) INFO 04-16 12:22:44 [omni_base.py:162] [Summary] {}
(APIServer pid=577724) INFO 04-16 12:23:25 [serving_speech.py:1599] TTS speech request speech-87ea32c87659cc30: text='你好吗,我是一个测试程序', model=voxcpm2
(APIServer pid=577724) INFO: 160.213.131.122:47383 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(APIServer pid=577724) INFO 04-16 12:23:25 [orchestrator.py:670] [Orchestrator] _handle_add_request: stage=0 req=speech-87ea32c87659cc30 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=0 num_sampling_params=1
(APIServer pid=577724) INFO 04-16 12:23:25 [stage_engine_core_client.py:170] [StageEngineCoreClient] Stage-0 adding request: speech-87ea32c87659cc30
(APIServer pid=577724) INFO 04-16 12:24:04 [async_omni.py:272] [AsyncOmni] Request speech-87ea32c87659cc30 aborted.
(APIServer pid=577724) INFO 04-16 12:24:04 [serving_speech.py:1208] Streaming request speech-87ea32c87659cc30 cancelled by client
(APIServer pid=577724) INFO 04-16 12:24:04 [orchestrator.py:868] [Orchestrator] Aborted request(s) ['speech-87ea32c87659cc30']
(APIServer pid=577724) INFO 04-16 12:24:07 [serving_speech.py:1599] TTS speech request speech-934c0a9617435fff: text='你好吗,我是一个测试程序', model=voxcpm2
(APIServer pid=577724) INFO 04-16 12:24:07 [orchestrator.py:670] [Orchestrator] _handle_add_request: stage=0 req=speech-934c0a9617435fff prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=0 num_sampling_params=1
(APIServer pid=577724) INFO 04-16 12:24:07 [stage_engine_core_client.py:170] [StageEngineCoreClient] Stage-0 adding request: speech-934c0a9617435fff
(APIServer pid=577724) INFO 04-16 12:24:13 [omni_base.py:162] [Summary] {}
(APIServer pid=577724) INFO: 160.213.131.122:6553 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(APIServer pid=577724) INFO 04-16 12:25:13 [serving_speech.py:1599] TTS speech request speech-9f875e40994aa22f: text='你好吗,我是一个测试程序', model=voxcpm2
(APIServer pid=577724) INFO: 160.213.131.122:45180 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(APIServer pid=577724) INFO 04-16 12:25:13 [orchestrator.py:670] [Orchestrator] _handle_add_request: stage=0 req=speech-9f875e40994aa22f prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=0 num_sampling_params=1
(APIServer pid=577724) INFO 04-16 12:25:13 [stage_engine_core_client.py:170] [StageEngineCoreClient] Stage-0 adding request: speech-9f875e40994aa22f
(APIServer pid=577724) INFO 04-16 12:25:20 [omni_base.py:162] [Summary] {}
(APIServer pid=577724) INFO 04-16 12:25:35 [serving_speech.py:1599] TTS speech request speech-a5b415b8015a0b79: text='你好吗,我是一个测试程序', model=voxcpm2
(APIServer pid=577724) INFO 04-16 12:25:35 [orchestrator.py:670] [Orchestrator] _handle_add_request: stage=0 req=speech-a5b415b8015a0b79 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=0 num_sampling_params=1
(APIServer pid=577724) INFO 04-16 12:25:35 [stage_engine_core_client.py:170] [StageEngineCoreClient] Stage-0 adding request: speech-a5b415b8015a0b79
(APIServer pid=577724) INFO 04-16 12:25:41 [omni_base.py:162] [Summary] {}
(APIServer pid=577724) INFO: 160.213.131.122:31840 - "POST /v1/audio/speech HTTP/1.1" 200 OK
^C(Worker pid=578523) WARNING 04-16 12:29:08 [multiproc_executor.py:871] WorkerProc was terminated
(Worker pid=578523) INFO 04-16 12:29:08 [multiproc_executor.py:764] Parent process exited, terminating worker queues
(APIServer pid=577724) INFO 04-16 12:29:08 [omni_base.py:295] [AsyncOmni] Shutting down
(APIServer pid=577724) INFO 04-16 12:29:08 [async_omni_engine.py:1622] [AsyncOmniEngine] Shutting down Orchestrator
(APIServer pid=577724) INFO 04-16 12:29:08 [orchestrator.py:212] [Orchestrator] Received shutdown signal
(APIServer pid=577724) INFO 04-16 12:29:08 [orchestrator.py:941] [Orchestrator] Shutting down all stages
(APIServer pid=577724) INFO: Shutting down
(Worker pid=578523) (APIServer pid=577724) INFO: Waiting for application shutdown.
(APIServer pid=577724) INFO: Application shutdown complete.
(APIServer pid=577724) INFO: Finished server process [577724]
(APIServer pid=577724) INFO 04-16 12:29:08 [omni_base.py:295] [AsyncOmni] Shutting down

These are the terminal logs from the recent test.

@codeHackeR321

Hi @Sy0307, I am also facing issues with the voice-cloning flow, but only in English.

@lishunyang12 (Collaborator) left a comment

Review: LGTM

Clean, well-scoped fix for the multichar Chinese token mismatch. The approach of splitting at the serving layer and fail-fast validating at the model layer is sound.

Correctness

  • is_cjk_char covers the main CJK Unicode blocks. It is missing Extensions E/F/G/I (U+2B820..U+323AF), but those are vanishingly rare in practice — fine to add later if needed (see the sketch after this list).
  • build_cjk_split_map correctly strips the sentencepiece prefix, validates all constituent chars have non-UNK IDs, and caches the result.
  • split_multichar_chinese is a clean O(n) pass, idempotent as documented.
  • The switch from {"prompt": text} to {"prompt_token_ids": ids} correctly preserves BOS handling (the preprocess already strips leading BOS).
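
For reference, a range-table version of the check that also covers the extensions flagged above (illustrative; the PR's actual ranges may differ):

```python
# Illustrative is_cjk_char including the extension blocks noted above.
_CJK_RANGES = (
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2EE5F),  # Extensions C through F, plus I
    (0x30000, 0x323AF),  # Extensions G and H
)

def is_cjk_char(c: str) -> bool:
    cp = ord(c)
    return any(lo <= cp <= hi for lo, hi in _CJK_RANGES)
```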

Thread safety

_voxcpm2_encode lazy-inits _voxcpm2_tokenizer in the async event loop with no await between the None-check and assignment, so no race. Good.

Performance

Lazy one-time map build + O(n) per-request scan on a small number of text tokens — negligible overhead, consistent with the benchmark numbers in the PR.

Minor suggestions (non-blocking)

  1. Duplicate tokenizer load: _voxcpm2_encode calls AutoTokenizer.from_pretrained(model_name) which loads the tokenizer a second time in the serving process. If there's a way to reuse the engine's tokenizer (e.g. via self.engine_client), that would save memory and startup time. Not critical since it's a one-time cost.

  2. _get_multichar_zh_split() in preprocess hot path: The lazy build is fine, but any(tid in split_map for tid in token_ids) runs on every prefill. Since the serving layer is now responsible for splitting, this check should never fire in normal operation. Consider gating it behind a debug/assert mode (sketched below) if profiling ever shows it matters (unlikely with current token counts).
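
Suggestion 2 could look roughly like this (the env-var name is hypothetical):

```python
# Opt-in defensive check; the flag name is hypothetical.
import os

_DEBUG_SPLIT_CHECK = os.environ.get("VOXCPM2_DEBUG_SPLIT_CHECK") == "1"

def warn_if_unsplit(token_ids, split_map, logger):
    # With the serving layer doing the split, this should never fire;
    # keeping it behind a flag keeps the prefill path clean.
    if _DEBUG_SPLIT_CHECK and any(tid in split_map for tid in token_ids):
        logger.warning("VoxCPM2: multichar Chinese token IDs reached preprocess()")
```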

Tested logic looks correct. Approving.

@linyueqian linyueqian merged commit 7d64a7c into vllm-project:main Apr 16, 2026
8 checks passed
@gesla2024

I redeployed the latest merged code to the server for testing, deleted the model, cleared the cache, and re-downloaded it. During testing I found that the audio output is normal when not using voice cloning, but when using voice cloning there are issues whether the text is Chinese or English and whether streaming output is enabled or not: all the audio is noisy and garbled.

When not using voice cloning but enabling streaming output, the generated audio data overlaps, producing two identical outputs back to back.

Below is an example video I tested, hopefully it helps;

output.mp4

@codeHackeR321 codeHackeR321 mentioned this pull request Apr 17, 2026
1 task
@Sy0307 (Contributor, Author) commented Apr 17, 2026

I will handle the voice-clone-mode bug ASAP. Thanks for your report, @gesla2024.

lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026