
[Feature]: Bitsandbytes Quantization Support for Diffusion Pipelines#1528

Draft
dongbo910220 wants to merge 8 commits into vllm-project:main from dongbo910220:feat/diffusion-bnb-quant

Conversation

@dongbo910220
Contributor

@dongbo910220 dongbo910220 commented Feb 26, 2026

Purpose

This PR implements Bitsandbytes (BNB) 4-bit quantization support for Diffusion Pipelines in vllm-omni, as proposed in RFC #1527.

By integrating BNB quantization, we significantly reduce the peak VRAM requirements for diffusion model inference, enabling high-quality image generation on GPUs with limited memory (e.g., consumer-grade cards) and allowing for larger batch sizes in production environments.

Co-author: @Michael-Zzq

Users can enable quantization by specifying the backend:

vllm serve Tongyi-MAI/Z-Image-Turbo --omni --quantization bitsandbytes
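
For reference, the 4-bit path maps onto the Hugging Face BitsAndBytesConfig shown below. This is an illustrative sketch of the equivalent HF-side settings; the NF4/bf16 defaults are an assumption, not necessarily this PR's exact wiring:

import torch
from diffusers import BitsAndBytesConfig

# BNB 4-bit: NF4 weight quantization with bf16 compute (assumed defaults,
# not this PR's code)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)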

Test Plan

pytest tests/diffusion/test_bitsandbytes_quantization.py

Test Result

passed

Qualitative Comparison (VRAM Profile)

Baseline Output (baseline_vram image): Peak ~24.5 GiB
BNB 4-bit Output (bnb4bit_vram image): Peak ~17.1 GiB

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts and commands. Please state the reason if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste a before/after results comparison, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation edits to ./docs.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


dongbo910220 and others added 4 commits February 25, 2026 20:09
Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
Collaborator

@lishunyang12 lishunyang12 left a comment


Left a few comments. The main concerns are the duplicated alias maps (three copies of the same normalization table) and the global mutable state for the from_pretrained monkey-patch.

Comment thread vllm_omni/diffusion/data.py Outdated
Comment thread vllm_omni/diffusion/quantization/bitsandbytes.py
Comment thread vllm_omni/entrypoints/cli/serve.py Outdated
Comment thread vllm_omni/diffusion/worker/diffusion_model_runner.py
Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
Collaborator

@lishunyang12 lishunyang12 left a comment


Left a few comments — the thread-safety issue in the patch mechanism and the CLI conflict handler are the main blockers.
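
For context on the thread-safety point, one race-free shape for a refcounted install/uninstall looks roughly like this (a sketch with illustrative names, not this PR's implementation):

import threading
from transformers import PreTrainedModel

_patch_lock = threading.Lock()
_patch_depth = 0
_saved_descriptor = None

def acquire_patch(wrapper):
    # First caller installs the wrapper; later callers only bump the refcount.
    global _patch_depth, _saved_descriptor
    with _patch_lock:
        if _patch_depth == 0:
            _saved_descriptor = PreTrainedModel.__dict__["from_pretrained"]
            PreTrainedModel.from_pretrained = classmethod(wrapper)
        _patch_depth += 1

def release_patch():
    # Last caller out restores the original classmethod descriptor.
    global _patch_depth
    with _patch_lock:
        _patch_depth -= 1
        if _patch_depth == 0:
            setattr(PreTrainedModel, "from_pretrained", _saved_descriptor)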

Comment thread vllm_omni/diffusion/quantization/bitsandbytes.py Outdated
Comment thread vllm_omni/entrypoints/cli/serve.py Outdated
Comment thread vllm_omni/diffusion/data.py Outdated
Comment thread vllm_omni/diffusion/quantization/bitsandbytes.py Outdated
Comment thread tests/diffusion/test_bitsandbytes_quantization.py
Comment thread vllm_omni/diffusion/quantization/bitsandbytes.py
Comment thread vllm_omni/diffusion/quantization/bitsandbytes.py Outdated
Comment thread vllm_omni/diffusion/model_loader/diffusers_loader.py Outdated
Comment thread vllm_omni/diffusion/model_loader/diffusers_loader.py
Comment thread vllm_omni/diffusion/data.py Outdated
Comment thread vllm_omni/diffusion/offloader/sequential_backend.py Outdated
Comment thread vllm_omni/entrypoints/omni.py
@lishunyang12
Collaborator

Left a second round of comments. A few observations beyond the inline feedback:

  1. The monkeypatch of PreTrainedModel.from_pretrained via global state + context vars + refcounting works, but it is fragile and hard to maintain long-term.
  2. The offloader integration communicates via ad-hoc dynamic attrs (_bnb_quantized_components, _bnb_offload_skip_components) on nn.Module; these can be lost during .to() / deepcopy / serialization. See the sketch after this comment.
  3. conflict_handler="resolve" in the CLI is a design issue, not just a bug: it silently clobbers any upstream arg with the same name. It needs a different approach (e.g. a diffusion-specific flag name).

I'd suggest iterating on the CLI flag and the patch mechanism before merging. Happy to re-review once those are addressed. cc @david6666666
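
On point 2, a dedicated side table keyed by the module avoids the dynamic-attr pitfalls; a minimal sketch (names are illustrative, not this PR's code):

import weakref
from dataclasses import dataclass, field

import torch.nn as nn

@dataclass
class BnbOffloadState:
    quantized_components: set = field(default_factory=set)
    offload_skip_components: set = field(default_factory=set)

# Weakly keyed: entries vanish when the module is garbage-collected.
# State survives .to() (which returns the same module object) and never
# leaks into state_dict() or pickling.
_offload_state = weakref.WeakKeyDictionary()

def get_offload_state(module: nn.Module) -> BnbOffloadState:
    return _offload_state.setdefault(module, BnbOffloadState())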

Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
@dongbo910220
Contributor Author

dongbo910220 commented Mar 12, 2026

@lishunyang12 Addressed the patch race, moved offload tracking to a dedicated state, and switched to diffusion-specific CLI flags.

@lishunyang12
Collaborator

Thanks for the updates. Tried to test this E2E but the branch conflicts with main — hitting ModuleNotFoundError: No module named 'vllm.renderers.protocol' on import. Can you rebase onto main?

Two inline comments still open (the _is_vllm_linear duplication and getattr default in sequential_backend.py).

Will do a final pass + E2E test after the rebase.

Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
@dongbo910220
Contributor Author

dongbo910220 commented Mar 14, 2026

@lishunyang12 updated. Merged origin/main into the branch to resolve the conflict and unblock E2E.

@dongbo910220 dongbo910220 force-pushed the feat/diffusion-bnb-quant branch from 079b7aa to f7962d6 on March 15, 2026 14:43
Signed-off-by: dongbo910220 <1275604947@qq.com>
Collaborator

@lishunyang12 lishunyang12 left a comment


Looks good; all the substantive issues from earlier rounds are resolved. I will test it locally.

@lishunyang12
Collaborator

lishunyang12 commented Mar 17, 2026

Verified:

BF16: ~25 GB (baseline image)
Bitsandbytes 4-bit: ~18 GB (bnb4bit image)

Collaborator

@lishunyang12 lishunyang12 left a comment


LGTM

@dongbo910220 dongbo910220 marked this pull request as ready for review March 17, 2026 15:04

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00bf76ace1


if backend == "bitsandbytes_8bit":
    bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=bool(enable_cpu_offload),


P1: Honor llm_int8 CPU-offload setting in BnB config

DiffusionBitsAndBytesConfig defines llm_int8_enable_fp32_cpu_offload, but this load-time path wires BitsAndBytesConfig.llm_int8_enable_fp32_cpu_offload to enable_cpu_offload (the diffusion offload mode) instead of the quantization config field. As a result, users who set llm_int8_enable_fp32_cpu_offload=true in quantization_config are ignored unless pipeline offload is also enabled, and enabling pipeline offload forces this flag on even when explicitly false, which changes memory/device placement behavior unexpectedly during HF from_pretrained quantization.
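
A sketch of the fix this points at, assuming the field names from the PR description (quant_config standing in for the parsed DiffusionBitsAndBytesConfig, so treat it as an assumption):

from diffusers import BitsAndBytesConfig

def build_bnb_config(backend: str, quant_config) -> BitsAndBytesConfig | None:
    if backend == "bitsandbytes_8bit":
        return BitsAndBytesConfig(
            load_in_8bit=True,
            # Honor the user's quantization_config field instead of
            # deriving it from the pipeline offload mode (enable_cpu_offload)
            llm_int8_enable_fp32_cpu_offload=(
                quant_config.llm_int8_enable_fp32_cpu_offload
            ),
        )
    return None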


Comment on lines +289 to +291
    type=str,
    default=None,
    dest="quantization_config",


P2: Parse diffusion quantization JSON before stage-config merge

--diffusion-quantization-config is registered as type=str, so it can remain a raw JSON string in engine_args. Only the default-stage construction path normalizes this field, but YAML stage-config flows merge base_engine_args directly and can pass that string into OmniDiffusionConfig, which rejects non-dict/non-config quantization_config with a TypeError; this makes the new flag fail for --stage-configs-path users unless they use the older parsed --quantization-config option.
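
A sketch of the normalization this suggests, run once before base_engine_args is merged into any stage config (the dict key is an assumption based on the comment):

import json

def normalize_quantization_config(engine_args: dict) -> None:
    # --diffusion-quantization-config arrives as a raw JSON string; parse
    # it here so every downstream path, including YAML stage-config merges,
    # sees a dict rather than a str.
    value = engine_args.get("quantization_config")
    if isinstance(value, str):
        engine_args["quantization_config"] = json.loads(value)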


@Gaohan123
Collaborator

Please fix conflicts. Thanks

@dongbo910220
Contributor Author

dongbo910220 commented Mar 23, 2026

Please fix conflicts. Thanks

Found an issue with the INT8 scenario; after fixing it, I'll resolve the conflict. Thanks!

@dongbo910220 dongbo910220 marked this pull request as draft March 23, 2026 16:58
@dongbo910220
Contributor Author

dongbo910220 commented Mar 28, 2026

@Michael-Zzq please fix the issue with the INT8 scenario, thanks

@lishunyang12
Collaborator

@dongbo910220 Take note of vllm-project/vllm#39583

