[Feature] Implements batching support for batch processing to qwen-image by farbodbj · Pull Request #390 · vllm-project/vllm-omni

farbodbj · 2025-12-20T08:12:30Z


Purpose

Addresses #388
For now, this PR only implements the batching logic in vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py.
To finalize this and make it usable end to end, this line in vllm_omni/diffusion/worker/gpu_worker.py should also be updated to work with batches. I did not make that change because it would break existing models. Once batching is implemented for the other models, the change is trivial.

Test Plan

Test Result


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


chatgpt-codex-connector · 2025-12-20T08:14:39Z

+                model, subfolder="vae", local_files_only=local_files_only
+            ).to(self.device)
+            logger.info("Loaded Qwen-Image VAE successfully")
+            self.transformer = QwenImageTransformer2DModel()


Pipeline fails to construct transformer

The constructor now calls QwenImageTransformer2DModel() without any arguments, but QwenImageTransformer2DModel.__init__ requires an od_config parameter (see vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py lines 483-500). Instantiating QwenImagePipeline will therefore raise a TypeError before any model weights are loaded or the new batching logic runs, blocking all uses of the pipeline.
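The failure mode described in this review can be reproduced with stand-in classes. The sketch below is not the vllm-omni code: `FakeTransformer` and `FakeConfig` are hypothetical names that only mirror the required `od_config` parameter mentioned above.

```python
class FakeConfig:
    """Hypothetical stand-in for the config object the real model expects."""


class FakeTransformer:
    """Hypothetical stand-in whose __init__ requires od_config, as the review describes."""

    def __init__(self, od_config):
        self.od_config = od_config


def try_build(**kwargs):
    """Return the constructed object, or the TypeError raised by a missing argument."""
    try:
        return FakeTransformer(**kwargs)
    except TypeError as exc:
        return exc


missing = try_build()                    # mirrors calling the constructor with no arguments
ok = try_build(od_config=FakeConfig())   # mirrors passing od_config=od_config
print(type(missing).__name__, type(ok).__name__)
```

Because the error is raised inside the constructor, nothing after that line in the pipeline's `__init__` would run.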


ZJY0516

I’m not sure: do you want to batch multiple requests, or batch multiple prompts within a single request?

ZJY0516 · 2025-12-20T09:07:45Z

-            self.device
-        )
-        self.transformer = QwenImageTransformer2DModel(od_config=od_config)
+        logger.info("Loaded Qwen-Image scheduler successfully")


ZJY0516 · 2025-12-20T09:07:48Z

    ):
        super().__init__()
        self.od_config = od_config
-        self.weights_sources = [


Why do we need to change this?

ZJY0516 · 2025-12-20T09:10:55Z

        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
        prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
-        prompt_embeds_mask = prompt_embeds_mask.repeat(1, num_images_per_prompt, 1)
+        prompt_embeds_mask = prompt_embeds_mask.repeat(1, num_images_per_prompt)


This is copied from diffusers. Why do we need to change it?
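One possible reason for the change is row ordering when more than one prompt is batched. The sketch below emulates `repeat` and `view` with plain lists, so it is an illustration rather than the pipeline code. It assumes the mask is 2-D with shape (batch_size, seq_len) and is followed by a `view` to (batch_size * num_images_per_prompt, seq_len); neither is visible in the excerpt above. It relies on the PyTorch rule that `repeat` with more sizes than dimensions treats the tensor as having leading dimensions of size 1.

```python
def flatten(nested):
    """Flatten an arbitrarily nested list into a flat list (row-major order)."""
    if not isinstance(nested, list):
        return [nested]
    out = []
    for item in nested:
        out.extend(flatten(item))
    return out


def view_2d(nested, rows, cols):
    """Emulate tensor.view(rows, cols): reinterpret row-major data as rows x cols."""
    flat = flatten(nested)
    assert len(flat) == rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def repeat_2arg(mask, n):
    """Emulate mask.repeat(1, n) on a (B, S) mask: result has shape (B, S * n)."""
    return [row * n for row in mask]


def repeat_3arg(mask, n):
    """Emulate mask.repeat(1, n, 1) on a (B, S) mask.

    The mask is treated as (1, B, S), so the result has shape (1, B * n, S).
    """
    return [mask * n]


# Two prompts (B=2), three tokens each (S=3); entries are tagged by prompt.
mask = [["a", "a", "a"], ["b", "b", "b"]]
B, S, n = 2, 3, 2

new_rows = view_2d(repeat_2arg(mask, n), B * n, S)
old_rows = view_2d(repeat_3arg(mask, n), B * n, S)

print([row[0] for row in new_rows])  # owner of each mask row, 2-argument form
print([row[0] for row in old_rows])  # owner of each mask row, 3-argument form
```

Under these assumptions the 2-argument form keeps each prompt's copies adjacent, which matches how the 3-D `prompt_embeds` rows are laid out after their own `repeat` and `view`, while the 3-argument form interleaves prompts. With a single prompt the two forms give the same result, which may be why the difference does not show up without batching.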

ZJY0516 · 2025-12-20T09:11:24Z


            # Broadcast timestep to match batch size
            timestep = t.expand(latents.shape[0]).to(device=latents.device, dtype=latents.dtype)
-


This may break TeaCache.

farbodbj · 2025-12-20T09:25:52Z

I’m not sure: do you want to batch multiple requests, or batch multiple prompts within a single request?

My end-to-end use case is to run the following code:

model.generate(
    prompt=prompts,
    height=height,
    width=width,
    num_inference_steps=generation_steps,
)

where prompts is a list of strings. IMHO, the batching logic (as in vLLM dynamic batching) should be handled from the outside.

ZJY0516 · 2025-12-20T09:27:52Z

The problem is that batching may not yield a performance gain.
For example (qwen-image):

1 prompt per request: 66s
2 prompts per request: 131s
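The arithmetic behind these two numbers can be checked directly. The snippet below uses only the timings quoted in this comment; the variable names are illustrative.

```python
# prompts per request -> wall-clock seconds, taken from the comment above
timings_s = {1: 66.0, 2: 131.0}

# Average time spent per prompt in each configuration.
per_prompt_s = {n: t / n for n, t in timings_s.items()}

# Throughput in prompts per second, and the relative gain from batching two prompts.
throughput = {n: n / t for n, t in timings_s.items()}
gain = throughput[2] / throughput[1] - 1.0

print(per_prompt_s)
print(f"throughput gain from batching 2 prompts: {gain:.1%}")
```

Per prompt, the batched run takes 65.5s instead of 66s, a throughput gain of under 1%.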

hsliuustc0106 · 2025-12-20T15:28:17Z

The problem is that batching may not yield a performance gain. For example (qwen-image):

1 prompt per request: 66s
2 prompts per request: 131s

Since it's already compute bound for 1 request :)

farbodbj · 2025-12-20T20:00:56Z

The problem is that batching may not yield a performance gain. For example (qwen-image):

1 prompt per request: 66s
2 prompts per request: 131s

Since it's already compute bound for 1 request :)

Consider the case where there are enough CUDA cores available for more than one request: doesn't processing one request at a time underutilize the GPU and hence reduce throughput? Can processing a single prompt really use the full capacity?

In a scenario where the serving logic accumulates requests using a good heuristic, I think processing them in batches provides better throughput. Correct me if I'm wrong, though; you are the experts here :)
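As an illustration of what "accumulating requests with a heuristic" could mean, here is a minimal offline sketch. It is not how vllm-omni schedules requests: `accumulate_batches`, `max_batch_size`, and `max_wait_s` are hypothetical names, and the heuristic (close a batch when it is full or when the next request arrives too long after the batch was opened) is just one simple choice.

```python
def accumulate_batches(arrivals, max_batch_size, max_wait_s):
    """Group (arrival_time_s, prompt) pairs into batches.

    A batch is closed when it reaches max_batch_size, or when the next
    request arrives more than max_wait_s after the batch's first request.
    """
    batches, current, opened_at = [], [], None
    for arrival_s, prompt in sorted(arrivals):
        too_full = len(current) >= max_batch_size
        too_late = bool(current) and arrival_s - opened_at > max_wait_s
        if current and (too_full or too_late):
            batches.append(current)
            current = []
        if not current:
            opened_at = arrival_s  # this request opens a new batch
        current.append(prompt)
    if current:
        batches.append(current)
    return batches


arrivals = [(0.0, "cat"), (0.1, "dog"), (0.2, "fox"), (2.0, "owl"), (2.1, "bee")]
batches = accumulate_batches(arrivals, max_batch_size=2, max_wait_s=0.5)
print(batches)
```

Whether such grouping improves throughput still depends on whether a single request already saturates the GPU, which is the point raised in the replies above.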

hsliuustc0106 · 2025-12-21T08:00:04Z

I recommend reading RFC #290.

SamitHuang · 2025-12-21T14:54:41Z

I think we should first update vllm-omni/vllm_omni/diffusion/worker/gpu_worker.py to support multi-request batching:

    def execute_model(self, reqs: list[OmniDiffusionRequest], od_config: OmniDiffusionConfig) -> DiffusionOutput:
        """
        Execute a forward pass.
        """
        assert self.pipeline is not None
        # TODO: dealing with first req for now
        req = reqs[0]

I have made this change locally, but simply dropping it in here would cause problems with other models.

Thanks, what are the detailed problems?

Currently, the model implementations need some changes to support batching; changing this now would cause inference errors.

farbodbj · 2025-12-22T06:03:06Z

I recommend reading RFC #290.

Thanks. I ran the benchmark. From 9:40 until 10:00 I enabled batch mode (batch_size = 4) and kept a constant load using 16 workers; you can see the response times and queue size too. Right after the first experiment (from about 10:10) I used the no-batch mode with the same load. At the last peak I used batch_size = 8. Batching seems to improve stability, reduce queue size, and increase memory utilization, but it does not seem to improve throughput.

An important thing I want to note is that this section of the code should emit at least a warning when it is passed multiple requests. It took me some time to find this part of the code and work out what was wrong with my inference server:

    def execute_model(self, reqs: list[OmniDiffusionRequest], od_config: OmniDiffusionConfig) -> DiffusionOutput:
        """
        Execute a forward pass.
        """
        assert self.pipeline is not None
        # TODO: dealing with first req for now
        req = reqs[0]
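A minimal sketch of the suggested warning, written as a standalone helper so it can run outside the worker. `select_request` is a hypothetical name and is not part of vllm-omni; it mirrors the `req = reqs[0]` line quoted above and uses the standard `warnings` module, though the project's own logger could be used instead.

```python
import warnings


def select_request(reqs):
    """Return the request that will be processed, warning if others are dropped."""
    if not reqs:
        raise ValueError("execute_model received an empty request list")
    if len(reqs) > 1:
        warnings.warn(
            f"Received {len(reqs)} requests but only the first will be processed; "
            f"{len(reqs) - 1} request(s) are ignored.",
            RuntimeWarning,
            stacklevel=2,
        )
    return reqs[0]


# Capture warnings so the behaviour can be inspected in a script.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chosen = select_request(["req-a", "req-b", "req-c"])

print(chosen, len(caught))
```

A warning keeps existing single-request callers working while making the silent truncation visible to anyone who passes a list.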

lishunyang12 · 2026-02-21T07:31:39Z

Should we close this PR, since the batching mechanism has been implemented in #797?

farbodbj requested a review from hsliuustc0106 as a code owner December 20, 2025 08:12

chatgpt-codex-connector Bot reviewed Dec 20, 2025


farbodbj force-pushed the main branch from 3da4072 to cbe7224 Compare December 20, 2025 08:16

hsliuustc0106 requested a review from ZJY0516 December 20, 2025 08:59

ZJY0516 reviewed Dec 20, 2025


[Feature] Implements batching support for batch processing to qwen-image

945ab6f

farbodbj force-pushed the main branch from cbe7224 to 945ab6f Compare December 20, 2025 15:14

SamitHuang reviewed Dec 21, 2025


This was referenced Dec 23, 2025

[RFC]: Support batch request of diffusion models #427

Closed

[RFC]: Support batch request of diffusion models #427 JiusiServe/vllm-omni#9

Closed

hsliuustc0106 mentioned this pull request Jan 1, 2026

Add Batch Processing Pipeline - New Feature for Qwen-Image QwenLM/Qwen-Image#186

Open

fhfuih mentioned this pull request Jan 16, 2026

[Frontend][Model] Support batch request with refined OmniDiffusionReq… #797

Merged

5 tasks

hsliuustc0106 closed this Mar 13, 2026

