[Bugfix] bound transfer save queue and tighten orchestrator idle backoff#3690
[Bugfix] bound transfer save queue and tighten orchestrator idle backoff#3690JuanPZuluaga wants to merge 1 commit into
Conversation
|
please also provide the status before this commit so that we can see the improvement |
|
@hsliuustc0106 thanks for the comments, I updated the PR description; the benefits come at high concurrency. Tagging @linyueqian, as this should be aligned with #3535 (btw, thanks @Sy0307 for the amazing improvement in performance with: #3662) |
|
Re-ran an A/B on a 2-GPU H20 split with the harness from #3618 (Qwen3-TTS-Base voice_clone,
Two things to flag:
My guess on the c=128 throughput dip: the global Net take: queue cap is the right idea and the underrun improvement is real and reproduces. Would be good to surface the c=128 throughput tradeoff in the PR description, and if you want the merge to be unconditional improvement, expose the cap as a config knob so the high-c regime can opt out. Happy to share the raw JSONs if useful. |
|
On making 512 adaptive: I don't think it's worth doing on a single global counter since one slow stream would still drag the cap for everyone, so I'd suggest just exposing it as a |
|
Re-ran on a different box (H200-hsliu, single-GPU slim deploy, GPU 3) to see if the c=128 picture is hardware-sensitive. Same bench-base vs bench-pr3690 setup, only Juan's 4-file commit differs.
c=64 is a robust win on both boxes. c=128 is the one that doesn't generalize: on H200 the single positive signal from H20 (p99 underrun −22.6%) flips to +7.7%, with throughput, E2EL, and mean underrun also slightly worse. That's consistent with the slim H200 deploy already being in a vocoder-saturated regime where there's nothing useful for the global queue cap to do. Reinforces what I think the right move is: keep the PR small but expose the cap as a Raw JSONs available if you want them. |
|
I don't object to bounding the save queue, but this backpressure point is too coarse: it turns downstream consumption pressure into a global block on the scheduler / producer path.
self._save_semaphore.acquire()This is not a wait inside a background worker; it is a wait on the producer path. If the real bottleneck is the downstream stage, vocoder, or connector, blocking the upstream producer does not make the downstream side faster. It only moves the waiting from the We previously tested a similar save/transfer-side throttling mechanism on Fish S2 / FastSR. It did not show a clear win, and in some configurations it slightly regressed, around ~5%. My read is that this is the same failure mode: when the downstream side is already close to saturation, global backpressure does not reduce actual compute; it only shrinks the producer-consumer parallelism window. It can also introduce head-of-line blocking: one slow stream or slow consumer can make the global semaphore throttle all following streams, instead of throttling only itself. So I don't think this cap should be hard-coded to 512. This value is counted by task, not by payload bytes or by stream, so its meaning changes across models, hardware, and stage outputs. At minimum, it should be exposed as a Separately, |
|
@JuanPZuluaga any thoughts on @Sy0307's 5-20 comment? Two things line up with my c=128 H20/H200 numbers and I think both should land before merge:
What's your read? |
|
thanks for the comments! @Sy0307, I'm working on making it configurable. @linyueqian |
47f259a to
0245ea7
Compare
|
I computed the numbers again @linyueqian
|
c4e7529 to
dd38710
Compare
|
@linyueqian i updated the PR body with the latest evaluation. |
linyueqian
left a comment
There was a problem hiding this comment.
lgtm @Sy0307 ptal as well
|
@Sy0307 here's an 8× H20 141GB sweep against your Method: Dual-card (
|
| c | E2EL_p99 Δ | TTFP_p99 Δ | RTF_p99 Δ |
|---|---|---|---|
| 8 | +8.8% 🔴 | -2.1% 🟢 | +1.9% 🟡 |
| 16 | +1.5% 🟡 | +12.1% 🔴 | +2.9% 🟡 |
| 32 | +0.7% | -1.7% 🟢 | -14.5% 🟢 |
| 48 | +3.5% 🟡 | +2.2% 🟡 | +0.6% |
| 64 | +1.1% 🟡 | +2.3% 🟡 | +4.0% 🟡 |
| 96 | -0.9% | -4.2% 🟢 | +0.4% |
| 128 | +3.7% 🟡 | +5.5% 🔴 | -2.5% 🟢 |
Single-card (both stages co-located on one GPU, matching the PR description)
| c | E2EL_p99 Δ | TTFP_p99 Δ | RTF_p99 Δ |
|---|---|---|---|
| 8 | -5.4% 🟢 | -4.9% 🟢 | -3.2% 🟢 |
| 16 | -4.2% 🟢 | -0.9% | +7.6% 🔴 |
| 32 | -0.4% | -7.2% 🟢 | -4.3% 🟢 |
| 48 | +1.1% 🟡 | -3.9% 🟢 | -4.5% 🟢 |
| 64 | +5.8% 🔴 | -0.8% | -1.7% 🟢 |
| 96 | +1.9% 🟡 | +1.6% 🟡 | +1.5% 🟡 |
| 128 | +0.8% | -1.2% 🟢 | -4.3% 🟢 |
Δ legend: 🟢 ≤−1%, neutral within ±1%, 🟡 +1..+5%, 🔴 ≥+5%. Lower is better for all three.
c=64 dual-card lands at +1.1% E2EL_p99 (within run-to-run noise). c=64 single-card sits at +5.8% E2EL_p99 with RTF and TTFP both flat, so if that's a real signal it's in tail total latency, not streaming keep-up. c=64 is also the most statistically stable cell in each table (n=256), so it's the best comparison point.
Two caveats worth flagging:
UR_ratenot measured.vllm bench servedoesn't emit underrun rate, so the PR description's-64% UR_rate at c=64headline is not verified here. The continuity bench that does compute it can't talk toQwen3-TTS-Basewithout a per-requestref_audio(instant HTTP 400 from the voice_clone validator), so verifying that claim needs the script extended to attach a fixed reference audio.- Low-c cells used 64–256 prompts vs the PR description's 512, so treat the big greens/reds at c=4 and c=8 as noisy.
Net: no dual-card blocker that I can see, but the c=64 single-card +5.8% E2EL is worth a second look. Happy to re-run any cell at num_prompts=512 if it helps tighten percentiles.
linyueqian
left a comment
There was a problem hiding this comment.
need furthur investigation
dd38710 to
2aa7f23
Compare
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
2aa7f23 to
1c1f13c
Compare
|
@linyueqian thanks for the detailed test. I'll keep working on this and see how we can improve things. Thanks! |
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Under high concurrency, the
OmniTransferAdapterBasesave path is unbounded:OmniChunkTransferAdapter.save_asyncappends to_pending_save_reqsfaster thansave_loopcan drain it, so the deque (and the per-task GPU references it holds) bloats as concurrency increases. Pair fix: the orchestrator's idlepoll path sleeps 1 ms between empty passes, which inflates TTFA at every batch boundary even when work is queued elsewhere.
This PR caps in-flight saves with a
Semaphore(512)(acquired insave_async, released insave_loop'sfinally) and drops the idle sleep to 0.1 ms. Surgical change — only the transfer adapter and orchestrator idle path are touched.Test Plan
Test Result
I used
bench_tts_continuity.pyfrom @linyueqian for the underrun rate.Comparison with
qwen3_tts_high_concurrency.yamlI used: https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/deploy/qwen3_tts_high_concurrency.yaml
but modified the config yaml where both stages are using the same
GPU=0(as I don't have access to 2).also:
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)