[Bagel]: Support multistage img2img#1669
Conversation
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29692a7c51
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: princepride <wangzhipeng628@gmail.com>
hsliuustc0106
left a comment
There was a problem hiding this comment.
do we have any test for this multistage example?
| @@ -0,0 +1,1009 @@ | |||
| from collections import deque | |||
There was a problem hiding this comment.
is this file the wapper for the bagel model for AR part?
There was a problem hiding this comment.
We can't wrapper vLLM's implementation anymore, whether through direct calls or by inheriting modules. First, the vllm's img2text task only requires ViT, whereas our img2img task needs both ViT and VAE. This means we have to modify both the mm_processor and the vision encoder. During implementation, I also discovered another issue: the vision latents for the VAE part actually use DiT model weights rather than AR weights. We must load the full weights into the AR stage.
hsliuustc0106
left a comment
There was a problem hiding this comment.
Review
Rating: 7.5/10 | Verdict:
Summary
Comprehensive implementation of Bagel multistage img2img support with AR wrapper, VAE encoder, MoT routing, and KV transfer metadata. The feature is well-designed but has a critical issue that needs to be addressed before merge.
🔴 Critical Issue: FIFO Queue Metadata Mismatch
Location: vllm_omni/model_executor/models/bagel/bagel.py:513
def get_kv_transfer_metadata(self, req_id: str) -> dict[str, Any] | None:
if self._ropes_queue:
return self._ropes_queue.popleft() # Ignores req_id!
return NoneProblem: The req_id parameter is completely ignored. Metadata is returned in FIFO order regardless of which request is asking. When multiple requests are in-flight and finish out of order, a request can receive another request's ropes/image_shape, corrupting downstream diffusion conditioning.
Suggested Fix: Use a dict keyed by req_id instead of a queue:
self._ropes_metadata: dict[str, dict[str, Any]] = {}
def get_kv_transfer_metadata(self, req_id: str) -> dict[str, Any] | None:
return self._ropes_metadata.pop(req_id, None)🟡 Questions
1. Independent Batching Support
Does this PR support independent (decoupled) batching for multistage inference? Specifically:
- Can AR and diffusion stages have different
max_batch_size? - Are requests queued and processed independently at each stage?
This is important for production scenarios where AR is lightweight but diffusion is memory-intensive. True independent batching would allow e.g., AR with batch_size=8 and diffusion with batch_size=2.
2. Tests for Multistage Example
Do we have any automated tests for this multistage img2img flow? I noticed the PR description has manual test commands, but I don't see corresponding test files. Adding tests for:
- Multi-request scenarios with the metadata flow
- E2E img2img + text2img in multistage config
- Regression tests for single-stage mode
would help ensure stability.
3. Patch Alternatives (re: my inline comment)
The patch.py modification to ModelConfig.is_mm_prefix_lm works, but are there plans to:
- Contribute this upstream to vLLM so Bagel is in the built-in list?
- Use a HF config flag instead of patching?
Not blocking, but worth considering for maintainability.
🟢 Strengths
- Complete Feature - Supports img2img, text2img, text2text, img2text
- Clean Architecture - AR wrapper + VAE encoder + MoT routing
- Good Documentation - PR has test screenshots and commands
Action Items:
- Fix FIFO queue → dict keyed by req_id
- Add tests for multistage flow
- Clarify independent batching support
Thanks for the implementation! Once the metadata issue is fixed, this will be a great addition. 🦐
Signed-off-by: princepride <wangzhipeng628@gmail.com>
|
@natureofnature PTAL |
I can add an e2e test for img2img, but it requires an image as input. Previously, in vLLM, Simon helped me upload test images to S3 bucket for CI/CD purposes, but I’m not sure where we are storing our test images in vLLM-Omni now. @congw729 |
Signed-off-by: princepride <wangzhipeng628@gmail.com>
We haven't tried uploading test input images to the S3 bucket. Can you show me the PR/link you just mentioned? I will try to satisfied your need. |
vllm-project/vllm#23229 (comment) Last time, Simon Mo helped me upload images to the S3 bucket. |
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
| logger.warning(f"Request {req_id} has no block IDs, skipping") | ||
| continue | ||
|
|
||
| custom_metadata = data.get("custom_metadata") |
There was a problem hiding this comment.
why we need introduce the custom_metadata? is there any doc to explain this arg? o.w., this many bring poos user/dev experiences.
There was a problem hiding this comment.
Because we need transfer RoPE and image shape from ar stage to dit stage
| """ | ||
| if isinstance(layer_kv, torch.Tensor): | ||
| if layer_kv.ndim < 3 or layer_kv.shape[0] != 2: | ||
| if layer_kv.ndim >= 3 and layer_kv.shape[0] == 2: |
There was a problem hiding this comment.
this looks like a kv layer issue? can we make it as a util function?
There was a problem hiding this comment.
Sure, I will do it. In vLLM, different backend will return different shape of paged attention block, for example in: vllm/v1/attention/backends/flash_attn.py
@staticmethod
def get_kv_cache_shape(
num_blocks: int,
block_size: int,
num_kv_heads: int,
head_size: int,
cache_dtype_str: str = "auto",
) -> tuple[int, ...]:
if block_size % 16 != 0:
raise ValueError("Block size must be a multiple of 16.")
return (2, num_blocks, block_size, num_kv_heads, head_size)
However, in: vllm/v1/attention/backends/flashinfer.py
@staticmethod
def get_kv_cache_shape(
num_blocks: int,
block_size: int,
num_kv_heads: int,
head_size: int,
cache_dtype_str: str = "auto",
) -> tuple[int, ...]:
return (num_blocks, 2, block_size, num_kv_heads, head_size)
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com> Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com> Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com> Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Purpose
Fully support Bagel multi-stage deploy.
Test Plan
img2img(stage0, stage1)
text2img(stag0, stage1)
text2text(stage0)
img2text(stage0)