[Perf] Bagel KV-ready early forwarding and time step consistency for /v1/chat/completions (#2398)
Conversation
Can you also update …

This file also needs updating: …
They have been removed. @princepride
Purpose
Fix timestep mismatch for Bagel AR/DiT mode:
The /v1/chat/completions endpoint for disaggregated pipeline image generation only forwarded height and width from the request's extra_body to the diffusion stage sampling params, but ignored num_inference_steps. This caused the DiT stage to always fall back to the hardcoded default of 50 timesteps regardless of the client-specified value.
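The fix itself is small. As a rough illustration (the helper and class names below are placeholders, not the actual vllm-omni code), the diffusion-stage sampling params just need to pick up num_inference_steps from extra_body alongside height and width:

```python
# Illustrative sketch only: extra_body handling is simplified, and
# DiffusionSamplingParams / build_diffusion_sampling_params are placeholder
# names, not the actual vllm-omni implementation.
from dataclasses import dataclass


@dataclass
class DiffusionSamplingParams:
    height: int = 1024
    width: int = 1024
    num_inference_steps: int = 50  # hardcoded default the DiT stage fell back to


def build_diffusion_sampling_params(extra_body: dict) -> DiffusionSamplingParams:
    params = DiffusionSamplingParams()
    # Previously only height/width were copied from the request.
    params.height = extra_body.get("height", params.height)
    params.width = extra_body.get("width", params.width)
    # Fix: also forward the client-specified step count so the DiT stage
    # no longer silently falls back to 50 timesteps.
    params.num_inference_steps = extra_body.get(
        "num_inference_steps", params.num_inference_steps
    )
    return params
```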
Forward to next stage on KV-ready instead of decode-finished
In the disaggregated pipeline, the orchestrator previously waited for AR Stage-0 to fully finish decoding (up to max_tokens tokens) before forwarding the request to the DiT stage. However, the DiT stage only needs the prefill KV cache for conditioning and does not depend on decode outputs. This change makes the AR scheduler emit a kv_ready signal as soon as KV cache extraction completes, and the orchestrator immediately forwards the request to the DiT stage upon receiving this signal, eliminating the unnecessary wait for AR decode to finish. For Bagel with max_tokens=2048, this reduces disaggregated t2i end-to-end latency from ~22s to ~19.7s (matching single-stage baseline) and disaggregated i2i from ~35.8s to ~27.3s at 50 timesteps.
_mark_request_for_kv_transfer(req_id, snapshot_len)
↓
model_runner: extract KV cache
↓
model_runner_output.kv_extracted_req_ids includes req_id
↓
scheduler: emits kv_ready signal
↓
orchestrator._handle_kv_ready_raw_outputs: receives the KV-ready signal
↓
orchestrator._forward_to_next_stage: forwards the request to the DiT stage
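A minimal sketch of the orchestrator side of this signal path, using simplified placeholder names rather than the exact vllm-omni code:

```python
# Illustrative sketch only: pending_requests, forwarded, and the handler
# signatures are simplified placeholders mirroring the flow above.
class StageOrchestrator:
    def __init__(self):
        self.pending_requests = {}  # req_id -> original request payload
        self.forwarded = set()      # req_ids already sent to the DiT stage

    def _handle_kv_ready_raw_outputs(self, raw_outputs) -> None:
        # The AR scheduler reports req_ids whose prefill KV cache has been
        # extracted; this arrives well before decoding of max_tokens finishes.
        for req_id in raw_outputs.kv_extracted_req_ids:
            if req_id in self.pending_requests and req_id not in self.forwarded:
                # Forward on KV-ready instead of waiting for decode-finished.
                self._forward_to_next_stage(req_id, self.pending_requests[req_id])
                self.forwarded.add(req_id)

    def _forward_to_next_stage(self, req_id, request) -> None:
        # Hand the request (and its extracted KV cache) to the DiT stage.
        ...
```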
Test Plan

Test Result
Using the default max_tokens setting, on an H800 GPU:
text to image
t2i Prompt:
"A cute cat wearing sunglasses"
size: 1024x1024
image to image
input image:

i2i Prompt:
"Transform this photo into a soft watercolor illustration while preserving the original composition, natural lighting, fur details, and face. Keep balanced exposure and realistic contrast."
size: 1024x1024
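For reference, the requests for these runs look roughly like the following (client setup and model id are placeholders; the extra_body fields are the ones this PR forwards):

```python
# Illustrative request only: base_url and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Bagel",  # placeholder model id
    messages=[{"role": "user", "content": "A cute cat wearing sunglasses"}],
    extra_body={
        "height": 1024,
        "width": 1024,
        # With this PR, the DiT stage honors the requested step count
        # instead of falling back to the hardcoded 50.
        "num_inference_steps": 50,
    },
)
```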
50 Time Steps
Before
[t2i and i2i output images]
After
[t2i and i2i output images]

10 Time Steps

Before
[t2i and i2i output images]
After
[t2i and i2i output images]
@princepride @hsliuustc0106