[Perf][Bugfix] cache hot buffers in qwen3_tts talker; fall back on evicted state#3688
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 11f6dc321f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| key = (orig_sr, target_sr) | ||
| if key not in self._resampler_cache: | ||
| self._resampler_cache[key] = AudioResampler(target_sr=target_sr) |
There was a problem hiding this comment.
The new _resampler_cache never evicts entries, and its key includes the request-provided source sample rate (orig_sr), so a long-lived service can accumulate one AudioResampler per distinct input rate indefinitely. In contexts where users can upload arbitrary audio metadata, this becomes an unbounded memory growth path that did not exist before this change (previously the resampler was ephemeral per call). Consider capping this cache (LRU) or normalizing/whitelisting accepted sample rates before caching.
Useful? React with 👍 / 👎.
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
11f6dc3 to
29b4199
Compare
|
Tagging @linyueqian, as this should be aligned wit #3535 |
Verification (H20-3e + L20X, c=128 / 512 reqs each)Ran A/B against H20-3e (
L20X (same deploy,
Server logs grepped for [Low] The defensive fallback is wired into LGTM otherwise. Nice cleanup of the per-step allocations and the |
| @lru_cache(maxsize=8) | ||
| def _cached_mel_filter_bank(sampling_rate: int, n_fft: int, n_mels: int, fmin: int, fmax: int | None) -> torch.Tensor: | ||
| return mel_filter_bank(sr=sampling_rate, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax) | ||
|
|
||
|
|
||
| @lru_cache(maxsize=8) | ||
| def _cached_hann_window(win_size: int) -> torch.Tensor: | ||
| return torch.hann_window(win_size) | ||
|
|
||
|
|
There was a problem hiding this comment.
Should we add some kind of global cache for these kind of objects? Like there are other systems that also extract mel_filter_bank or do resampling. @linyueqian
There was a problem hiding this comment.
make sense to me. this should be a common module.
There was a problem hiding this comment.
ok, i'll work on another PR on making this for all the audio models.
|
Fresh A/B on 2x L20X with the bundled default_voice (1.7B-CustomVoice, seed-tts EN bucket text<=50)
voice_clone (0.6B-Base, same bucket)
c=4/c=64 default and c=16 voice_clone reproduce the win on this hardware class (different from your single-GPU H100 baseline). No regression at high c. Approving. One follow-up though: the wins flatten at c>=64 voice_clone and c>=96 default, which makes sense since your PR targets per-prefill CPU-side allocation (mel/hann/resampler) and the AR decode step dominates once concurrency saturates the GPU. Would you have appetite for a second pass at the decode-bound regime, e.g. fused MTP kernel, fuller decode CUDA graph capture, or trimming D2H syncs that remain in |
linyueqian
left a comment
There was a problem hiding this comment.
See bench numbers above. Approving.
|
@linyueqian i could work on this in a follow up PR. |
Additional Qwen3-TTS validationI ran an additional A/B validation. Setup:
default_voice: Qwen3-TTS-12Hz-1.7B-CustomVoice
For voice_clone: Qwen3-TTS-12Hz-0.6B-Base
For Mean / median observationsThis does not look like a p99-only improvement. In the
For ConclusionLGTM from my side. The results are directionally consistent with the earlier validation: this PR helps the Qwen3-TTS hot path, especially for One non-blocking caveat: the benefit is workload- and saturation-regime-dependent. In this run, Thanks for nice work ;) |
…icted state (vllm-project#3688) Signed-off-by: JuanPZuluaga <juanz9312@gmal.com> Co-authored-by: JuanPZuluaga <juanz9312@gmal.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
…icted state (vllm-project#3688) Signed-off-by: JuanPZuluaga <juanz9312@gmal.com> Co-authored-by: JuanPZuluaga <juanz9312@gmal.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
few optimizations for the
talkersatge of Qwen3TTS, which give some benefits inconcurrency=64and fewer:mel_spectrogramreallocated the mel filter bank and Hann window on every call; the voice-cloning path constructed a freshAudioResamplerper request..item()host sync inref_code_lenemit. Forced a D2H stall per request span.tts_pad_embedandhidden_states['last']could be evicted by_update_states'finished_req_idscleanup before the talker's final decode step, raisingRuntimeError("Missing tts_pad_embed in additional_information; prefill must run first.")and killing in-flight streams.talker_mtp's residual codebook path: Q-1 separateembeddingcalls per step instead of a single gather over a stacked weight.Test Plan
Test Result
I used
bench_tts_continuity.pyfrom @linyueqian for the underrun rate.Comparison with qwen3_tts_high_concurrency.yaml
I used: https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/deploy/qwen3_tts_high_concurrency.yaml
but modified the config yaml where both stages are using the same GPU=0 (as I don't have access to 2).
ASR quality at c=16: 16/16 pass at CER 0.012 (non-stream), 16/16 at 0.011 (stream). No regression.
Headline gains (decode-heavy regime):
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)