
[Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False #1913

Merged
hsliuustc0106 merged 11 commits into vllm-project:main from JuanPZuluaga:feat/talker-cuda-graph-batched on Mar 20, 2026

Conversation

@JuanPZuluaga
Contributor


Purpose

In the Qwen3TTS Talker we run the CodePredictor on every decode step to generate the remaining 15 residual codebook tokens. It is currently compiled with torch.compile(mode="default", dynamic=True), which still leaves noticeable per-step overhead in high-concurrency settings.

This PR removes part of that overhead by switching to fixed-shape CUDA graph capture: mode="reduce-overhead" with dynamic=False. We pad the input to the fixed shape [bucket_bsz, 17, H] and use is_causal=True so that Inductor captures its own internal CUDA graphs; by design the Talker has a max_seq_len of 16, so the code predictor input is [talker_hidden, code_0_embed, code_1_embed, ..., code_N_embed].

We also add batch-size bucketing: the batch dimension is padded to power-of-two buckets [1, 2, 4, 8, 16], matching what vLLM already does. On top of that, we pre-allocate proj_buf for the maximum batch size and cache position_ids per bucket to avoid per-step allocations. A sketch of the bucketing and compile setup is shown below.
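
Below is a minimal, self-contained sketch of the idea. Everything here (ToyCodePredictor, bucket_bsz, decode_step, the hidden size) is a hypothetical stand-in rather than the actual implementation; only the torch.compile settings, the power-of-two buckets, and the [bucket_bsz, 17, H] padding mirror what the PR does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BUCKETS = [1, 2, 4, 8, 16]   # power-of-two batch buckets, matching vLLM
MAX_SEQ = 17                 # [talker_hidden, code_0_embed, ..., code_N_embed]
HIDDEN = 1024                # stand-in hidden size

def bucket_bsz(bsz: int) -> int:
    # Round the batch size up to the next bucket so the compiled graph
    # only ever sees a handful of fixed shapes.
    for b in BUCKETS:
        if bsz <= b:
            return b
    raise ValueError(f"batch size {bsz} exceeds largest bucket {BUCKETS[-1]}")

class ToyCodePredictor(nn.Module):
    # Stand-in for the real CodePredictor: a single causal SDPA block.
    def __init__(self, hidden: int) -> None:
        super().__init__()
        self.qkv = nn.Linear(hidden, 3 * hidden)
        self.out = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ToyCodePredictor(HIDDEN).to(device)
# Fixed shapes + reduce-overhead let Inductor capture its own CUDA graphs.
compiled = torch.compile(model, mode="reduce-overhead", dynamic=False)

# Pre-allocate proj_buf once for the largest bucket; cache position_ids per
# bucket (unused by the toy attention, shown only to mirror the PR).
proj_buf = torch.zeros(BUCKETS[-1], MAX_SEQ, HIDDEN, device=device)
position_ids = {b: torch.arange(MAX_SEQ, device=device).expand(b, -1) for b in BUCKETS}

def decode_step(talker_hidden: torch.Tensor) -> torch.Tensor:
    bsz = talker_hidden.shape[0]
    padded = bucket_bsz(bsz)
    proj_buf[:padded].zero_()          # avoid stale embeddings in padded slots
    proj_buf[:bsz, 0] = talker_hidden  # codes would fill positions 1..16 here
    out = compiled(proj_buf[:padded])
    return out[:bsz]

print(decode_step(torch.randn(3, HIDDEN, device=device)).shape)  # torch.Size([3, 17, 1024])
```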

Test Plan

Run evaluation as in #1852 and #1797

I used this YAML config:

stage_args:
  - stage_id: 0
    stage_type: llm
    runtime:
      devices: "0"
      max_batch_size: 16

  - stage_id: 1
    stage_type: llm
    runtime:
      devices: "0"
      max_batch_size: 16

runtime:
  max_inflight: 16
  connectors:
    connector_of_shared_memory:
      codec_streaming: true
      codec_chunk_frames: 32
      codec_left_context_frames: 32

Test Result

See the results and plots below; the CUDA-graph path improves every metric at every concurrency level.

Benchmark Results

| Metric | Concurrency | cuda_graph | main |
|---|---|---|---|
| TTFP (ms) | 4 | 153.9 | 193.1 |
| TTFP (ms) | 8 | 340.7 | 423.3 |
| TTFP (ms) | 16 | 1078.3 | 1242.0 |
| E2E (ms) | 4 | 1766.4 | 2321.5 |
| E2E (ms) | 8 | 2776.9 | 3317.0 |
| E2E (ms) | 16 | 4577.7 | 5194.0 |
| RTF | 4 | 0.313 | 0.405 |
| RTF | 8 | 0.556 | 0.702 |
| RTF | 16 | 0.797 | 0.885 |
| Throughput (audio-s/s) | 4 | 12.66 | 9.79 |
| Throughput (audio-s/s) | 8 | 15.27 | 11.94 |
| Throughput (audio-s/s) | 16 | 19.26 | 17.38 |

Improvement (cuda_graph vs main)

| Metric | Concurrency | Improvement |
|---|---|---|
| TTFP | 4 | +20.3% |
| TTFP | 8 | +19.5% |
| TTFP | 16 | +13.2% |
| E2E | 4 | +23.9% |
| E2E | 8 | +16.3% |
| E2E | 16 | +11.9% |
| RTF | 4 | +22.7% |
| RTF | 8 | +20.7% |
| RTF | 16 | +9.9% |

Plot saved to vllm_omni/comparison.png

(comparison plot)

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
@JuanPZuluaga changed the title from "[Optim][Qwen3TTS][CodePredictor] support torch.compile reduce-overhead with fixed-shapes" to "[Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False" on Mar 16, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 01ce53a615


Comment on lines +507 to +508
lm_heads = self._lm_heads_list
codec_embeds = self._codec_embeds_list


P1: Initialize cached heads in no-compile fallback

When supports_torch_inductor() is false, _setup_compile() now returns early after setting only _compiled_model_fwd, but forward() unconditionally reads _lm_heads_list and _codec_embeds_list from these cached fields. In that environment (e.g., CPU or unsupported GPU), those fields stay None, so the first decode step fails with a NoneType subscript error instead of using the previous working path. Please populate these caches in the fallback branch (or avoid relying on them when compile is disabled).
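
A self-contained toy illustrating the suggested fix. Only the attribute names (_compiled_model_fwd, _lm_heads_list, _codec_embeds_list) are taken from this comment; the class, the supports_torch_inductor() stub, and the list contents are hypothetical stand-ins.

```python
def supports_torch_inductor() -> bool:
    return False  # e.g. CPU or an unsupported GPU

class CodePredictorLike:
    def __init__(self) -> None:
        self._compiled_model_fwd = None
        self._lm_heads_list = None
        self._codec_embeds_list = None
        self._setup_compile()

    def _setup_compile(self) -> None:
        heads = [f"lm_head_{i}" for i in range(16)]      # stand-ins for real modules
        embeds = [f"codec_embed_{i}" for i in range(16)]
        if not supports_torch_inductor():
            self._compiled_model_fwd = lambda x: x       # eager fallback
            # Without these two lines, forward() hits a NoneType subscript
            # on the first decode step in the no-compile environment.
            self._lm_heads_list = heads
            self._codec_embeds_list = embeds
            return
        # Compiled path populates the same caches.
        self._lm_heads_list = heads
        self._codec_embeds_list = embeds

    def forward(self, step: int):
        return self._lm_heads_list[step], self._codec_embeds_list[step]

print(CodePredictorLike().forward(0))
```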


JuanPZuluaga added 2 commits March 16, 2026 08:17
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
@univa-HARRY

I’m following this and trying to reproduce it exactly based on the corresponding feature branch, but the server crashes when concurrency reaches around 3–4.

Is there any stage-config.yaml option I should pay attention to besides the settings you shared?

The error message is as follows.

[Stage-0] ERROR 03-16 09:43:22 [omni_stage.py:1514] Failed on request speech-beacc4826ddfc154: EngineCore encountered an issue. See stack trace (above) for the root cause.
[Stage-0] ERROR 03-16 09:43:22 [omni_stage.py:1514] Traceback (most recent call last):
[Stage-0] ERROR 03-16 09:43:22 [omni_stage.py:1514]   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/omni_stage.py", line 1507, in generation_single_request
[Stage-0] ERROR 03-16 09:43:22 [omni_stage.py:1514]     async for res in cast(AsyncLLM, stage_engine).generate(ein, llm_sampling_params, rid):
[Stage-0] ERROR 03-16 09:43:22 [omni_stage.py:1514]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 564, in generate
[Stage-0] ERROR 03-16 09:43:22 [omni_stage.py:1514]     q = await self.add_request(
[Stage-0] ERROR 03-16 09:43:22 [omni_stage.py:1514]         ^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 03-16 09:43:22 [omni_stage.py:1514]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 309, in add_request
[Stage-0] ERROR 03-16 09:43:22 [omni_stage.py:1514]     raise EngineDeadError()
[Stage-0] ERROR 03-16 09:43:22 [omni_stage.py:1514] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=7) ERROR 03-16 09:43:22 [async_omni.py:742] [AsyncOrchestrator] Stage 0 error on request speech-beacc4826ddfc154: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700] Streaming speech generation failed for speech-beacc4826ddfc154: {'request_id': 'speech-beacc4826ddfc154', 'stage_id': 0, 'error': 'EngineCore encountered an issue. See stack trace (above) for the root cause.'}
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700] Traceback (most recent call last):
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700]   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/openai/serving_speech.py", line 646, in _generate_audio_chunks
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700]     async for res in generator:
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700]   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 437, in generate
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700]     async for output in self._process_async_results(
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700]   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 597, in _process_async_results
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700]     engine_outputs, finished, output_to_yield = self._process_single_result(
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700]                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700]   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 745, in _process_single_result
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700]     raise RuntimeError(result)
(APIServer pid=7) ERROR 03-16 09:43:22 [serving_speech.py:700] RuntimeError: {'request_id': 'speech-beacc4826ddfc154', 'stage_id': 0, 'error': 'EngineCore encountered an issue. See stack trace (above) for the root cause.'}
(APIServer pid=7) ERROR:    Exception in ASGI application
(APIServer pid=7)   + Exception Group Traceback (most recent call last):
(APIServer pid=7)   |   File "/usr/local/lib/python3.12/dist-packages/starlette/_utils.py", line 81, in collapse_excgroups
(APIServer pid=7)   |     yield
(APIServer pid=7)   |   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 270, in __call__
(APIServer pid=7)   |     async with anyio.create_task_group() as task_group:
(APIServer pid=7)   |                ^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   |   File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 783, in __aexit__
(APIServer pid=7)   |     raise BaseExceptionGroup(
(APIServer pid=7)   | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
(APIServer pid=7)   +-+---------------- 1 ----------------
(APIServer pid=7)     | Traceback (most recent call last):
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 416, in run_asgi
(APIServer pid=7)     |     result = await app(  # type: ignore[func-returns-value]
(APIServer pid=7)     |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
(APIServer pid=7)     |     return await self.app(scope, receive, send)
(APIServer pid=7)     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1160, in __call__
(APIServer pid=7)     |     await super().__call__(scope, receive, send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
(APIServer pid=7)     |     await self.middleware_stack(scope, receive, send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
(APIServer pid=7)     |     raise exc
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
(APIServer pid=7)     |     await self.app(scope, receive, _send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 87, in __call__
(APIServer pid=7)     |     await self.app(scope, receive, send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/prometheus_fastapi_instrumentator/middleware.py", line 177, in __call__
(APIServer pid=7)     |     raise exc
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/prometheus_fastapi_instrumentator/middleware.py", line 175, in __call__
(APIServer pid=7)     |     await self.app(scope, receive, send_wrapper)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
(APIServer pid=7)     |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=7)     |     raise exc
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=7)     |     await app(scope, receive, sender)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
(APIServer pid=7)     |     await self.app(scope, receive, send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
(APIServer pid=7)     |     await self.middleware_stack(scope, receive, send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
(APIServer pid=7)     |     await route.handle(scope, receive, send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
(APIServer pid=7)     |     await self.app(scope, receive, send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 130, in app
(APIServer pid=7)     |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=7)     |     raise exc
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=7)     |     await app(scope, receive, sender)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 117, in app
(APIServer pid=7)     |     await response(scope, receive, send)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 269, in __call__
(APIServer pid=7)     |     with collapse_excgroups():
(APIServer pid=7)     |          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)     |   File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
(APIServer pid=7)     |     self.gen.throw(value)
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/_utils.py", line 87, in collapse_excgroups
(APIServer pid=7)     |     raise exc
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 273, in wrap
(APIServer pid=7)     |     await func()
(APIServer pid=7)     |   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 253, in stream_response
(APIServer pid=7)     |     async for chunk in self.body_iterator:
(APIServer pid=7)     |   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/openai/serving_speech.py", line 646, in _generate_audio_chunks
(APIServer pid=7)     |     async for res in generator:
(APIServer pid=7)     |   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 437, in generate
(APIServer pid=7)     |     async for output in self._process_async_results(
(APIServer pid=7)     |   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 597, in _process_async_results
(APIServer pid=7)     |     engine_outputs, finished, output_to_yield = self._process_single_result(
(APIServer pid=7)     |                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)     |   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 745, in _process_single_result
(APIServer pid=7)     |     raise RuntimeError(result)
(APIServer pid=7)     | RuntimeError: {'request_id': 'speech-beacc4826ddfc154', 'stage_id': 0, 'error': 'EngineCore encountered an issue. See stack trace (above) for the root cause.'}
(APIServer pid=7)     +------------------------------------
(APIServer pid=7) 
(APIServer pid=7) During handling of the above exception, another exception occurred:
(APIServer pid=7) 
(APIServer pid=7) Traceback (most recent call last):
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 416, in run_asgi
(APIServer pid=7)     result = await app(  # type: ignore[func-returns-value]
(APIServer pid=7)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
(APIServer pid=7)     return await self.app(scope, receive, send)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1160, in __call__
(APIServer pid=7)     await super().__call__(scope, receive, send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
(APIServer pid=7)     await self.middleware_stack(scope, receive, send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
(APIServer pid=7)     raise exc
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
(APIServer pid=7)     await self.app(scope, receive, _send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 87, in __call__
(APIServer pid=7)     await self.app(scope, receive, send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/prometheus_fastapi_instrumentator/middleware.py", line 177, in __call__
(APIServer pid=7)     raise exc
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/prometheus_fastapi_instrumentator/middleware.py", line 175, in __call__
(APIServer pid=7)     await self.app(scope, receive, send_wrapper)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
(APIServer pid=7)     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=7)     raise exc
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=7)     await app(scope, receive, sender)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
(APIServer pid=7)     await self.app(scope, receive, send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
(APIServer pid=7)     await self.middleware_stack(scope, receive, send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
(APIServer pid=7)     await route.handle(scope, receive, send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
(APIServer pid=7)     await self.app(scope, receive, send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 130, in app
(APIServer pid=7)     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=7)     raise exc
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=7)     await app(scope, receive, sender)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 117, in app
(APIServer pid=7)     await response(scope, receive, send)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 269, in __call__
(APIServer pid=7)     with collapse_excgroups():
(APIServer pid=7)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
(APIServer pid=7)     self.gen.throw(value)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/_utils.py", line 87, in collapse_excgroups
(APIServer pid=7)     raise exc
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 273, in wrap
(APIServer pid=7)     await func()
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 253, in stream_response
(APIServer pid=7)     async for chunk in self.body_iterator:
(APIServer pid=7)   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/openai/serving_speech.py", line 646, in _generate_audio_chunks
(APIServer pid=7)     async for res in generator:
(APIServer pid=7)   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 437, in generate
(APIServer pid=7)     async for output in self._process_async_results(
(APIServer pid=7)   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 597, in _process_async_results
(APIServer pid=7)     engine_outputs, finished, output_to_yield = self._process_single_result(
(APIServer pid=7)                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/vllm-workspace/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 745, in _process_single_result
(APIServer pid=7)     raise RuntimeError(result)
(APIServer pid=7) RuntimeError: {'request_id': 'speech-beacc4826ddfc154', 'stage_id': 0, 'error': 'EngineCore encountered an issue. See stack trace (above) for the root cause.'}

@JuanPZuluaga
Contributor Author

Hello, my configuration is:

# Qwen3-TTS batch_size=4 config (streaming with async_chunk)
# Enables concurrent request processing with max_inflight=4
# 2-stage pipeline: Talker -> Code2Wav
async_chunk: true
stage_args:
  - stage_id: 0
    stage_type: llm
    runtime:
      devices: "0"
      max_batch_size: 16
    engine_args:
      model_stage: qwen3_tts
      model_arch: Qwen3TTSTalkerForConditionalGeneration
      hf_overrides:
        architectures: [Qwen3TTSTalkerForConditionalGeneration]
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      enforce_eager: false
      trust_remote_code: true
      async_scheduling: false
      enable_prefix_caching: false
      engine_output_type: latent
      gpu_memory_utilization: 0.3
      distributed_executor_backend: "mp"
      max_num_batched_tokens: 4096
      max_model_len: 4096
      custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk
    output_connectors:
      to_stage_1: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.9
      top_k: 50
      max_tokens: 4096
      seed: 42
      detokenize: false
      repetition_penalty: 1.05
      stop_token_ids: [2150]

  - stage_id: 1
    stage_type: llm
    runtime:
      devices: "0"
      max_batch_size: 16
    engine_args:
      model_stage: code2wav
      model_arch: Qwen3TTSCode2Wav
      hf_overrides:
        architectures: [Qwen3TTSCode2Wav]
      worker_type: generation
      scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
      enforce_eager: true
      trust_remote_code: true
      async_scheduling: false
      enable_prefix_caching: false
      engine_output_type: audio
      gpu_memory_utilization: 0.2
      distributed_executor_backend: "mp"
      max_num_batched_tokens: 8192
      max_model_len: 32768
    engine_input_source: [0]
    final_output: true
    final_output_type: audio
    input_connectors:
      from_stage_0: connector_of_shared_memory
    tts_args:
      max_instructions_length: 500
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 65536
      seed: 42
      detokenize: true
      repetition_penalty: 1.0

runtime:
  enabled: true
  defaults:
    window_size: -1
    max_inflight: 16

  connectors:
    connector_of_shared_memory:
      name: SharedMemoryConnector
      extra:
        shm_threshold_bytes: 65536
        codec_streaming: true
        connector_get_sleep_s: 0.01
        connector_get_max_wait_first_chunk: 3000
        connector_get_max_wait: 300
        codec_chunk_frames: 32
        codec_left_context_frames: 32

  edges:
    - from: 0
      to: 1
      window_size: -1

Could you try this config to see whether it solves your issue? @univa-HARRY

@linyueqian linyueqian self-requested a review March 16, 2026 17:10
@univa-HARRY

@JuanPZuluaga

I guess the issue was max_num_batched_tokens in stage 0. Setting max_num_batched_tokens: 512 -> 8192 and max_model_len: 4096 -> 8192 resolves the problem. Thank you.

@Sy0307
Contributor

Sy0307 commented Mar 17, 2026

This needs more consideration: torch.compile(mode="reduce-overhead", dynamic=False) may crash with enforce_eager: false due to repeated CUDA graph capture/replay. I have tried this, so please think twice. @JuanPZuluaga

@JuanPZuluaga
Contributor Author

This needs more consideration: torch.compile(mode="reduce-overhead", dynamic=False) may crash with enforce_eager: false due to repeated CUDA graph capture/replay. I have tried this, so please think twice. @JuanPZuluaga

@Sy0307 good point, this would be an issue if the compiled module were inside the CUDAGraphWrapper scope. But the CodePredictor module is explicitly excluded from vLLM's CUDA graphs (_cudagraph_mode = CUDAGraphMode.NONE in _talker_mtp_forward), so the two graph systems are independent and should not conflict with each other. This is also borne out by the benchmarks above, which were run at concurrency 4/8/16 with enforce_eager: false on the Talker.

However, if there's a specific experiment you'd like me to try, I can run that as well.

@Sy0307
Contributor

Sy0307 commented Mar 17, 2026

This needs more consideration: torch.compile(mode="reduce-overhead", dynamic=False) may crash with enforce_eager: false due to repeated CUDA graph capture/replay. I have tried this, so please think twice. @JuanPZuluaga

@Sy0307 good point, this would be an issue if the compiled module were inside the CUDAGraphWrapper scope. But the CodePredictor module is explicitly excluded from vLLM's CUDA graphs (_cudagraph_mode = CUDAGraphMode.NONE in _talker_mtp_forward), so the two graph systems are independent and should not conflict with each other. This is also borne out by the benchmarks above, which were run at concurrency 4/8/16 with enforce_eager: false on the Talker.

However, if there's a specific experiment you'd like me to try, I can run that as well.

Makes sense. This addresses the main concern. Thanks for the improvement.

@univa-HARRY

@JuanPZuluaga
Your PR and the other two optimization PRs you mentioned are independent of each other, so if all three are applied, is it correct to expect a multiplicative improvement in latency?

Also, when do you expect these optimizations to be incorporated?

@JuanPZuluaga
Contributor Author

@JuanPZuluaga Your PR and the other two optimization PRs you mentioned are independent of each other, so if all three are applied, is it correct to expect a multiplicative improvement in latency?

Also, when do you expect these optimizations to be incorporated?

I think these optimizations should be added soon.

@linyueqian linyueqian added the ready label to trigger buildkite CI label Mar 18, 2026
@linyueqian linyueqian added this to the v0.18.0 milestone Mar 18, 2026
@linyueqian
Collaborator

I was looking at the change from growing seq_len to always passing max_seq=17, and one thing concerns me: proj_buf is pre-allocated and never zeroed between requests. The old code only passed proj_buf[:bsz, :step+1, :] so stale data was invisible. Now the full proj_buf[:padded_bsz, :max_seq, :] goes into the compiled forward every step, meaning positions step+2 through 16 still hold leftover embeddings from whatever ran in that batch slot last time. The causal mask should prevent attention to those positions, but if there's any off-by-one in the mask, the stale values would silently corrupt the output with no error, just subtly wrong audio. Might be worth adding a proj_buf[:padded_bsz].zero_() at the top of each forward call, or at least a test that verifies identical output between the old growing-window path and the new fixed-window path across consecutive requests with different batch sizes.
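
A quick sanity sketch of the mask argument, using assumed toy shapes rather than the real CodePredictor: with a correct causal mask, outputs up to the current step are unaffected by stale data in later slots, and the proposed zero_() guard makes that explicit.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bsz, max_seq, hidden, step = 2, 17, 8, 5

buf = torch.randn(bsz, max_seq, hidden)      # pretend slots > step hold stale data
clean = buf.clone()
clean[:, step + 1:] = 0.0                    # the proposed zero_() guard

full_stale = F.scaled_dot_product_attention(buf, buf, buf, is_causal=True)
full_clean = F.scaled_dot_product_attention(clean, clean, clean, is_causal=True)
prefix = F.scaled_dot_product_attention(
    buf[:, :step + 1], buf[:, :step + 1], buf[:, :step + 1], is_causal=True
)

# With a correct causal mask, both fixed-window variants agree with the
# growing-window result for every position up to the current step.
print(torch.allclose(full_stale[:, :step + 1], prefix, atol=1e-5))  # True
print(torch.allclose(full_clean[:, :step + 1], prefix, atol=1e-5))  # True
```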

@JuanPZuluaga
Contributor Author

JuanPZuluaga commented Mar 19, 2026

I am adding the new results after merging main:

EDIT: updated the benchmark results

(updated comparison plot)

I used the same YAML config; the relevant params that changed are:

  - stage_id: 0
      max_batch_size: 16
      max_num_batched_tokens: 4096
  - stage_id: 1
      max_batch_size: 16
....
    max_inflight: 16
....
        codec_chunk_frames: 25
        codec_left_context_frames: 25

@JuanPZuluaga
Contributor Author

JuanPZuluaga commented Mar 19, 2026

Just added the proj_buf[:padded_bsz].zero_() to the code. @linyueqian

Collaborator

@linyueqian linyueqian left a comment


lgtm

@hsliuustc0106 hsliuustc0106 merged commit b4342fb into vllm-project:main Mar 20, 2026
7 checks passed
@JuanPZuluaga JuanPZuluaga deleted the feat/talker-cuda-graph-batched branch March 20, 2026 05:09
@evezhier evezhier mentioned this pull request Mar 20, 2026
hsliuustc0106 added a commit to hsliuustc0106/vllm-omni-skills that referenced this pull request Mar 22, 2026
### vllm-omni-audio-tts
- Source: [PR #1913](vllm-project/vllm-omni#1913) - [Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False

### vllm-omni-perf
- Source: [PR #1913](vllm-project/vllm-omni#1913) - [Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…erhead and dynamic False (vllm-project#1913)

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
Co-authored-by: JuanPZuluaga <juanz9312@gmal.com>
