[Perf] Bagel KV-ready early forwarding and time step consistency for /v1/chat/completions (#2398)
Conversation
Can you also update …

This file also needs updating: …
They have been removed. @princepride
Purpose
Fix timestep mismatch for Bagel AR/DiT mode:
The /v1/chat/completions endpoint for disaggregated pipeline image generation only forwarded height and width from the request's extra_body to the diffusion stage sampling params, but ignored num_inference_steps. This caused the DiT stage to always fall back to the hardcoded default of 50 timesteps regardless of the client-specified value.
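The fix itself is small. As a rough illustration (the helper and class names below are placeholders, not the actual vllm-omni code), the diffusion-stage sampling params just need to pick up num_inference_steps from extra_body alongside height and width:

```python
# Illustrative sketch only: extra_body handling is simplified, and
# DiffusionSamplingParams / build_diffusion_sampling_params are placeholder
# names, not the actual vllm-omni implementation.
from dataclasses import dataclass


@dataclass
class DiffusionSamplingParams:
    height: int = 1024
    width: int = 1024
    num_inference_steps: int = 50  # hardcoded default the DiT stage fell back to


def build_diffusion_sampling_params(extra_body: dict) -> DiffusionSamplingParams:
    params = DiffusionSamplingParams()
    # Previously only height/width were copied from the request.
    params.height = extra_body.get("height", params.height)
    params.width = extra_body.get("width", params.width)
    # Fix: also forward the client-specified step count so the DiT stage
    # no longer silently falls back to 50 timesteps.
    params.num_inference_steps = extra_body.get(
        "num_inference_steps", params.num_inference_steps
    )
    return params
```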
Forward to next stage on KV-ready instead of decode-finished
In the disaggregated pipeline, the orchestrator previously waited for AR Stage-0 to fully finish decoding (up to max_tokens tokens) before forwarding the request to the DiT stage. However, the DiT stage only needs the prefill KV cache for conditioning and does not depend on decode outputs. This change makes the AR scheduler emit a kv_ready signal as soon as KV cache extraction completes, and the orchestrator immediately forwards the request to the DiT stage upon receiving this signal, eliminating the unnecessary wait for AR decode to finish. For Bagel with max_tokens=2048, this reduces disaggregated t2i end-to-end latency from ~22s to ~19.7s (matching single-stage baseline) and disaggregated i2i from ~35.8s to ~27.3s at 50 timesteps.
_mark_request_for_kv_transfer(req_id, snapshot_len)
↓
model_runner: extract KV cache
↓
model_runner_output.kv_extracted_req_ids includes req_id
↓
scheduler: emits kv_ready signal
↓
orchestrator._handle_kv_ready_raw_outputs: receives the KV-ready signal
↓
orchestrator._forward_to_next_stage: forwards the request to the DiT stage
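A minimal sketch of the orchestrator side of this signal path, using simplified placeholder names rather than the exact vllm-omni code:

```python
# Illustrative sketch only: pending_requests, forwarded, and the handler
# signatures are simplified placeholders mirroring the flow above.
class StageOrchestrator:
    def __init__(self):
        self.pending_requests = {}  # req_id -> original request payload
        self.forwarded = set()      # req_ids already sent to the DiT stage

    def _handle_kv_ready_raw_outputs(self, raw_outputs) -> None:
        # The AR scheduler reports req_ids whose prefill KV cache has been
        # extracted; this arrives well before decoding of max_tokens finishes.
        for req_id in raw_outputs.kv_extracted_req_ids:
            if req_id in self.pending_requests and req_id not in self.forwarded:
                # Forward on KV-ready instead of waiting for decode-finished.
                self._forward_to_next_stage(req_id, self.pending_requests[req_id])
                self.forwarded.add(req_id)

    def _forward_to_next_stage(self, req_id, request) -> None:
        # Hand the request (and its extracted KV cache) to the DiT stage.
        ...
```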
Test Plan

Test Result
Using the default max_tokens setting, on an H800 GPU:
text to image
t2i Prompt:
"A cute cat wearing sunglasses"
size: 1024x1024
image to image
input image:

i2i Prompt:
"Transform this photo into a soft watercolor illustration while preserving the original composition, natural lighting, fur details, and face. Keep balanced exposure and realistic contrast."
size: 1024x1024
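For reference, the requests for these runs look roughly like the following (client setup and model id are placeholders; the extra_body fields are the ones this PR forwards):

```python
# Illustrative request only: base_url and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Bagel",  # placeholder model id
    messages=[{"role": "user", "content": "A cute cat wearing sunglasses"}],
    extra_body={
        "height": 1024,
        "width": 1024,
        # With this PR, the DiT stage honors the requested step count
        # instead of falling back to the hardcoded 50.
        "num_inference_steps": 50,
    },
)
```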
50 Time Steps
Before
[t2i and i2i output images]
After
[t2i and i2i output images]

10 Time Steps

Before
[t2i and i2i output images]
After
[t2i and i2i output images]
@princepride @hsliuustc0106