[NPU] Support code predictor NPU graph #2695
Conversation
Impressive performance improvement! Have you checked the accuracy of the output audio?
hsliuustc0106
left a comment
Review blocked by gate failures.
- DCO: ACTION_REQUIRED
- pre-commit: FAILURE
Please fix both before this can proceed.
Preliminary note: the hardcoded 2048×2048 fusion causal mask has no guard — if _num_groups + 1 > 2048, npu_fusion_attention will silently misbehave. Consider asserting or at minimum logging a warning when max_seq exceeds the mask size.
NPU fusion attention + NPUGraph for code predictor. Two issues:
# Ascend SDPA is_causal migration example uses a fixed 2048x2048
# compressed causal mask with sparse_mode=2.
fusion_mask = torch.triu(
    torch.ones(2048, 2048, dtype=torch.bool),
    diagonal=1,
)
ditto on the 2048 — please add an assert against self._num_groups + 1 at minimum.
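For illustration, a minimal sketch of the kind of guard being requested. The attribute names (num_groups, a 2048-entry mask) mirror the discussion above rather than the exact code in this PR:

```python
import logging

logger = logging.getLogger(__name__)

FUSION_MASK_SIZE = 2048  # size of the precomputed compressed causal mask


def check_fusion_mask_bounds(num_groups: int, max_seq_len: int) -> None:
    """Fail fast, or at least warn, before npu_fusion_attention can silently misbehave."""
    required = num_groups + 1
    assert required <= FUSION_MASK_SIZE, (
        f"compressed causal mask is {FUSION_MASK_SIZE}x{FUSION_MASK_SIZE}, "
        f"but num_groups + 1 = {required} exceeds it"
    )
    if max_seq_len > FUSION_MASK_SIZE:
        logger.warning(
            "max_seq_len=%d exceeds the %dx%d fusion causal mask; "
            "npu_fusion_attention may silently misbehave",
            max_seq_len, FUSION_MASK_SIZE, FUSION_MASK_SIZE,
        )
```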
I only ran a functional test with a single case, and the generated audio matches the eager-mode output. Is there any dataset available for an accuracy test? In addition, I will continue to complete the performance and accuracy tests for the three model types.
@hsliuustc0106 @lishunyang12 Thanks! I will update in a follow-up commit.
Signed-off-by: XIN GAO <1037396230@qq.com>
Signed-off-by: XIN GAO <1037396230@qq.com>
Force-pushed from e714a89 to 2c021e3
Which platform: 910B or 910C?
Signed-off-by: XIN GAO <1037396230@qq.com>
910B
@gcanlin @hsliuustc0106 @lishunyang12 The previously mentioned issues have all been fixed. I also reran the functional and performance benchmarks and updated the PR description.
@gxxx-hum Thanks for the clear benchmark! I ran a benchmark for Qwen3-TTS on v0.18.0 and got about 1.5 RTF. Why does this PR get 2.7~3.1 RTF at concurrency 1? Any idea why?
We can align the parameters. The configuration I used is below; note that I also set async_chunk: true.
stage_args:
  - stage_id: 0
    stage_type: llm
    is_comprehension: true
    runtime:
      devices: "0"
    engine_args:
      model_stage: qwen3_tts
      max_num_seqs: 1
      model_arch: Qwen3TTSTalkerForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      enforce_eager: true
      trust_remote_code: true
      async_scheduling: false
      enable_prefix_caching: false
      engine_output_type: latent
      gpu_memory_utilization: 0.6
      distributed_executor_backend: "mp"
      max_num_batched_tokens: 512
      max_model_len: 4096
    custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk
    # Use named connector to apply runtime.connectors.extra.
    output_connectors:
      to_stage_1: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.9
      top_k: 50
      max_tokens: 4096
      seed: 42
      detokenize: false
      repetition_penalty: 1.05
      stop_token_ids: [2150]
  - stage_id: 1
    stage_type: llm
    runtime:
      devices: "0"
    engine_args:
      model_stage: code2wav
      max_num_seqs: 1
      model_arch: Qwen3TTSCode2Wav
      worker_type: generation
      scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
      enforce_eager: true
      trust_remote_code: true
      async_scheduling: false
      enable_prefix_caching: false
      engine_output_type: audio
      gpu_memory_utilization: 0.2
      distributed_executor_backend: "mp"
      # Must be divisible by num_code_groups and cover (left_context + chunk).
      max_num_batched_tokens: 32768
      # async_chunk appends windows per step; max_model_len must cover accumulated stream.
      max_model_len: 32768
    engine_input_source: [0]
    final_output: true
    final_output_type: audio
    # Distributed connector configuration
    input_connectors:
      from_stage_0: connector_of_shared_memory
    tts_args:
      max_instructions_length: 500
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 65536
      seed: 42
      detokenize: true
      repetition_penalty: 1.0
runtime:
  enabled: true
  defaults:
    window_size: -1
    max_inflight: 1
  connectors:
    connector_of_shared_memory:
      name: SharedMemoryConnector
      extra:
        shm_threshold_bytes: 65536
        # Frame-aligned codec streaming transport.
        codec_streaming: true
        # Connector polling / timeout (unit: loop count, sleep interval in seconds).
        connector_get_sleep_s: 0.01
        connector_get_max_wait_first_chunk: 3000
        connector_get_max_wait: 300
        # Align with Omni: small chunks with sufficient context overlap.
        codec_chunk_frames: 25
        codec_left_context_frames: 25
  edges:
    - from: 0
      to: 1
      window_size: -1
Good job! I ran it on an A3 and got even more impressive performance.
I might be wrong, but the 910C seems closer to 2 x 910B, designed with shared memory and an on-package interconnect. @hahadashi
The correct commit in the vllm-omni repository here should be 32af3af. |
I plan to land this PR first even though it adds new hardware-specific hardcoding. Before abstracting the attention backend (see #2967), we need at least two concrete instances, and the CUDA hardcoding has existed for a long time, so integrating the NPU hardcoding temporarily makes sense. I will abstract them ASAP. @hsliuustc0106
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
gcanlin
left a comment
LGTM. I updated the latest test results in the PR description. There are still some follow-up items (see the sketch after this list):
- abstract the code predictor attention into a CustomOp;
- abstract the sub-model graph wrapper;
- analyze why performance improves when use_cuda_graph is enabled for the code predictor of Qwen3-Omni.
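As a rough illustration of the first follow-up item, here is a sketch of a CustomOp-style code predictor attention with per-backend forwards. The class, method names, and the "BNSD" layout choice are assumptions for the sketch, not the PR's actual code; the Ascend path mirrors the npu_fusion_attention + compressed-mask usage discussed above:

```python
import torch
import torch.nn.functional as F


class CodePredictorAttention(torch.nn.Module):
    """Sketch of a CustomOp-like wrapper: one public forward, per-backend impls."""

    def __init__(self, num_heads: int, fusion_mask: torch.Tensor | None = None):
        super().__init__()
        self.num_heads = num_heads
        # e.g. the 2048x2048 compressed causal mask used with sparse_mode=2
        self.fusion_mask = fusion_mask

    def forward(self, q, k, v):  # q/k/v: (batch, heads, seq, head_dim)
        if q.device.type == "npu" and self.fusion_mask is not None:
            return self.forward_npu(q, k, v)
        return self.forward_native(q, k, v)

    def forward_native(self, q, k, v):
        # Portable fallback: PyTorch SDPA with a causal mask.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)

    def forward_npu(self, q, k, v):
        # Ascend path: fused attention with the compressed causal mask.
        import torch_npu
        out = torch_npu.npu_fusion_attention(
            q, k, v, self.num_heads, "BNSD",
            atten_mask=self.fusion_mask,
            scale=q.shape[-1] ** -0.5,
            sparse_mode=2,
        )
        return out[0]  # the first element is the attention output
```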
Signed-off-by: XIN GAO <1037396230@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com>


Purpose
Optimize Qwen3-TTS code predictor inference on NPU with NPU fusion attention and NPU graph capture.
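At a high level, the change runs the code predictor's attention through npu_fusion_attention and captures its fixed-shape forward pass as an NPU graph so it can be replayed without per-step dispatch overhead. A minimal capture/replay sketch, assuming torch_npu exposes a CUDA-graph-like NPUGraph API and using a hypothetical code_predictor module with static input shapes:

```python
import torch
import torch_npu  # Ascend PyTorch adapter


def build_npu_graph(code_predictor: torch.nn.Module, example_input: torch.Tensor):
    """Capture one fixed-shape forward pass and return a replay function."""
    static_input = example_input.clone()

    # Warm up on a side stream before capture, as with CUDA graphs.
    stream = torch_npu.npu.Stream()
    with torch_npu.npu.stream(stream):
        for _ in range(3):
            code_predictor(static_input)
    torch_npu.npu.current_stream().wait_stream(stream)

    graph = torch_npu.npu.NPUGraph()
    with torch_npu.npu.graph(graph):
        static_output = code_predictor(static_input)

    def replay(new_input: torch.Tensor) -> torch.Tensor:
        # Copy into the captured input buffer, replay, and read the captured output.
        static_input.copy_(new_input)
        graph.replay()
        return static_output

    return replay
```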
Test Plan
Functional Test
Used examples/online_serving/qwen3_tts/openai_speech_client.py for smoke tests and compared Eager vs NPUGraph outputs.
Performance Test
A3 (tested by @gcanlin)
Qwen3-TTS Performance Comparison
Note:
- Metrics reported: E2EL, AUDIO_TTFP, AUDIO_RTF, Request throughput, and Audio throughput.
- The runs used different request counts (main: 50 requests, current PR: 10 requests), so throughput and tail-latency results should be interpreted with that in mind.
Qwen3-Omni Performance Comparison
Note:
- Metrics reported: E2EL, TTFT, TPOT, ITL, AUDIO_TTFP, and AUDIO_RTF.
Concurrency = 1
Key Takeaways for Concurrency = 1
- Per-token decode latency (TPOT/ITL) is mostly unchanged, though P99 ITL regressed.
Concurrency = 4
Key Takeaways for Concurrency = 4
- Improvements in prefill and decode latency (TTFT, TPOT, mean ITL), but P99 ITL regressed.
Overall Summary
Across both concurrency settings, the current PR shows:
The main remaining regression to watch is:
Key Takeaways
Compared with main, the current PR shows:
Short PR Summary
Used benchmarks/qwen3_tts/vllm_omni/bench_tts_serve.py for performance tests and compared Eager vs NPUGraph, using Base as an example.
Test Result
Functional Test
Summary
Generated audio from Base, CustomVoice, and VoiceDesign was tested with both short/long texts and Chinese/English inputs. The spoken content was consistent and no obvious noise or artifacts were observed.
For CustomVoice, minor speaking-rate differences were observed. Running the same script multiple times in both Eager and NPUGraph modes also produced audio with slightly different speaking speeds, so this is likely due to CustomVoice being more sensitive to the sampling parameters rather than an NPUGraph-specific regression.
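For a quick numerical cross-check on top of listening, a small sketch that compares an eager waveform against its NPUGraph counterpart (assumes the soundfile and numpy packages; file names match the Base pair listed below):

```python
import numpy as np
import soundfile as sf


def compare_wavs(eager_path: str, graph_path: str) -> None:
    eager, sr_eager = sf.read(eager_path)
    graph, sr_graph = sf.read(graph_path)
    assert sr_eager == sr_graph, "sample rates differ"
    n = min(len(eager), len(graph))  # lengths can differ slightly between runs
    diff = np.abs(eager[:n] - graph[:n])
    print(f"max abs diff: {diff.max():.6f}, mean abs diff: {diff.mean():.6f}")


compare_wavs("tts_output_long_base.wav", "tts_output_long_base_graph.wav")
```

Note that sampling in the talker stage is stochastic, so this is only a rough sanity check rather than a bit-exact comparison.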
Base
tts_output_long_base.wav
tts_output_long_base_graph.wav
CustomVoice
Eager
customvoice_long_zh_long.wav
NPUGraph
customvoice_long_zh_long_graph.wav
VoiceDesign
Eager
voicedesign_loli_joke.wav
NPUGraph
voicedesign_loli_joke_graph.wav
Performance Test
Summary
Base: Clear improvements are observed at concurrency 1/4/10. At low concurrency, request throughput and audio throughput nearly doubled; at concurrency 10, throughput still improved by over 50%. E2E and Audio RTF both dropped by around 50% at low concurrency, and still improved by around 35% at concurrency 10. TTFP dropped by around 50% at low concurrency and around 10% at concurrency 10.
CustomVoice: Clear improvements are observed at concurrency 1/4/10. At low concurrency, throughput improved by up to 133%, and at concurrency 10, it still improved by up to 95%. E2E and Audio RTF dropped by around 56%-48% at concurrency 1/4/10. TTFP decreased by around 38%-32%.
VoiceDesign: Clear improvements are observed at concurrency 1/4/10. At low concurrency, throughput improved by around 130%, and at concurrency 10, it still improved by around 30%. E2E dropped by around 50% at low concurrency and around 20% at concurrency 10. TTFP decreased by around 45% at low concurrency and around 32% at concurrency 10.
Base
CustomVoice
VoiceDesign
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation edits to ./docs.