
[Feature]: Bitsandbytes Quantization Support for Diffusion Pipelines#1528

Draft
dongbo910220 wants to merge 8 commits into vllm-project:main from dongbo910220:feat/diffusion-bnb-quant

Conversation

@dongbo910220
Contributor

@dongbo910220 dongbo910220 commented Feb 26, 2026

Purpose

This PR implements Bitsandbytes (BNB) 4-bit quantization support for Diffusion Pipelines in vllm-omni, as proposed in RFC #1527.

By integrating BNB quantization, we significantly reduce the peak VRAM requirements for diffusion model inference, enabling high-quality image generation on GPUs with limited memory (e.g., consumer-grade cards) and allowing for larger batch sizes in production environments.

Co-author: @Michael-Zzq

Users can enable quantization by specifying the backend:

vllm serve Tongyi-MAI/Z-Image-Turbo --omni --quantization bitsandbytes
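
For reference, the 4-bit path maps onto the Hugging Face BitsAndBytesConfig shown below. This is an illustrative sketch of the equivalent HF-side settings; the NF4/bf16 defaults are an assumption, not necessarily this PR's exact wiring:

import torch
from diffusers import BitsAndBytesConfig

# BNB 4-bit: NF4 weight quantization with bf16 compute (assumed defaults,
# not this PR's code)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)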

Test Plan

pytest tests/diffusion/test_bitsandbytes_quantization.py

Test Result

passed

Qualitative Comparison (VRAM Profile)

Baseline Output (baseline_vram image): Peak ~24.5 GiB
BNB 4-bit Output (bnb4bit_vram image): Peak ~17.1 GiB

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts and commands. Please state the reason if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste a before/after results comparison, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation edits to ./docs.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


dongbo910220 and others added 4 commits February 25, 2026 20:09
Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
Collaborator

@lishunyang12 lishunyang12 left a comment


Left a few comments. The main concerns are the duplicated alias maps (three copies of the same normalization table) and the global mutable state for the from_pretrained monkey-patch.

Comment thread vllm_omni/diffusion/data.py Outdated
Comment thread vllm_omni/diffusion/quantization/bitsandbytes.py
Comment thread vllm_omni/entrypoints/cli/serve.py Outdated
Comment thread vllm_omni/diffusion/worker/diffusion_model_runner.py
Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
Collaborator

@lishunyang12 lishunyang12 left a comment


Left a few comments — the thread-safety issue in the patch mechanism and the CLI conflict handler are the main blockers.
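
For context on the thread-safety point, one race-free shape for a refcounted install/uninstall looks roughly like this (a sketch with illustrative names, not this PR's implementation):

import threading
from transformers import PreTrainedModel

_patch_lock = threading.Lock()
_patch_depth = 0
_saved_descriptor = None

def acquire_patch(wrapper):
    # First caller installs the wrapper; later callers only bump the refcount.
    global _patch_depth, _saved_descriptor
    with _patch_lock:
        if _patch_depth == 0:
            _saved_descriptor = PreTrainedModel.__dict__["from_pretrained"]
            PreTrainedModel.from_pretrained = classmethod(wrapper)
        _patch_depth += 1

def release_patch():
    # Last caller out restores the original classmethod descriptor.
    global _patch_depth
    with _patch_lock:
        _patch_depth -= 1
        if _patch_depth == 0:
            setattr(PreTrainedModel, "from_pretrained", _saved_descriptor)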

Comment thread vllm_omni/diffusion/quantization/bitsandbytes.py Outdated
Comment thread vllm_omni/entrypoints/cli/serve.py Outdated
Comment thread vllm_omni/diffusion/data.py Outdated
Comment thread vllm_omni/diffusion/quantization/bitsandbytes.py Outdated
Comment thread tests/diffusion/test_bitsandbytes_quantization.py
Comment thread vllm_omni/diffusion/quantization/bitsandbytes.py
Comment thread vllm_omni/diffusion/quantization/bitsandbytes.py Outdated
Comment thread vllm_omni/diffusion/model_loader/diffusers_loader.py Outdated
Comment thread vllm_omni/diffusion/model_loader/diffusers_loader.py
Comment thread vllm_omni/diffusion/data.py Outdated
Comment thread vllm_omni/diffusion/offloader/sequential_backend.py Outdated
Comment thread vllm_omni/entrypoints/omni.py
@lishunyang12
Collaborator

Left a second round of comments. A few observations beyond the inline feedback:

  1. The monkeypatch of PreTrainedModel.from_pretrained via global state + context vars + refcounting works, but it is fragile and hard to maintain long-term.
  2. The offloader integration communicates via ad-hoc dynamic attrs (_bnb_quantized_components, _bnb_offload_skip_components) on nn.Module; these can be lost during .to() / deepcopy / serialization. See the sketch after this comment.
  3. conflict_handler="resolve" in the CLI is a design issue, not just a bug: it silently clobbers any upstream arg with the same name. It needs a different approach (e.g. a diffusion-specific flag name).

I'd suggest iterating on the CLI flag and the patch mechanism before merging. Happy to re-review once those are addressed. cc @david6666666
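
On point 2, a dedicated side table keyed by the module avoids the dynamic-attr pitfalls; a minimal sketch (names are illustrative, not this PR's code):

import weakref
from dataclasses import dataclass, field

import torch.nn as nn

@dataclass
class BnbOffloadState:
    quantized_components: set = field(default_factory=set)
    offload_skip_components: set = field(default_factory=set)

# Weakly keyed: entries vanish when the module is garbage-collected.
# State survives .to() (which returns the same module object) and never
# leaks into state_dict() or pickling.
_offload_state = weakref.WeakKeyDictionary()

def get_offload_state(module: nn.Module) -> BnbOffloadState:
    return _offload_state.setdefault(module, BnbOffloadState())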

Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
@dongbo910220
Contributor Author

dongbo910220 commented Mar 12, 2026

@lishunyang12 Addressed the patch race, moved offload tracking to a dedicated state, and switched to diffusion-specific CLI flags.

@lishunyang12
Collaborator

Thanks for the updates. Tried to test this E2E but the branch conflicts with main — hitting ModuleNotFoundError: No module named 'vllm.renderers.protocol' on import. Can you rebase onto main?

Two inline comments still open (the _is_vllm_linear duplication and getattr default in sequential_backend.py).

Will do a final pass + E2E test after the rebase.

Signed-off-by: dongbo910220 <1275604947@qq.com>

Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
@dongbo910220
Contributor Author

dongbo910220 commented Mar 14, 2026

@lishunyang12 updated. Merged origin/main into the branch to resolve the conflict and unblock E2E.

@dongbo910220 dongbo910220 force-pushed the feat/diffusion-bnb-quant branch from 079b7aa to f7962d6 on March 15, 2026 14:43
Signed-off-by: dongbo910220 <1275604947@qq.com>
Collaborator

@lishunyang12 lishunyang12 left a comment


Looks good; all the substantive issues from earlier rounds are resolved. I will test it locally.

@lishunyang12
Collaborator

lishunyang12 commented Mar 17, 2026

Verified:

BF16: ~25 GB (baseline image)
Bitsandbytes 4-bit: ~18 GB (bnb4bit image)

Collaborator

@lishunyang12 lishunyang12 left a comment


LGTM

@dongbo910220 dongbo910220 marked this pull request as ready for review March 17, 2026 15:04

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00bf76ace1


if backend == "bitsandbytes_8bit":
    bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=bool(enable_cpu_offload),


P1: Honor llm_int8 CPU-offload setting in BnB config

DiffusionBitsAndBytesConfig defines llm_int8_enable_fp32_cpu_offload, but this load-time path wires BitsAndBytesConfig.llm_int8_enable_fp32_cpu_offload to enable_cpu_offload (the diffusion offload mode) instead of the quantization config field. As a result, users who set llm_int8_enable_fp32_cpu_offload=true in quantization_config are ignored unless pipeline offload is also enabled, and enabling pipeline offload forces this flag on even when explicitly false, which changes memory/device placement behavior unexpectedly during HF from_pretrained quantization.
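
A sketch of the fix this points at, assuming the field names from the PR description (quant_config standing in for the parsed DiffusionBitsAndBytesConfig, so treat it as an assumption):

from diffusers import BitsAndBytesConfig

def build_bnb_config(backend: str, quant_config) -> BitsAndBytesConfig | None:
    if backend == "bitsandbytes_8bit":
        return BitsAndBytesConfig(
            load_in_8bit=True,
            # Honor the user's quantization_config field instead of
            # deriving it from the pipeline offload mode (enable_cpu_offload)
            llm_int8_enable_fp32_cpu_offload=(
                quant_config.llm_int8_enable_fp32_cpu_offload
            ),
        )
    return None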


Comment on lines +289 to +291
    type=str,
    default=None,
    dest="quantization_config",


P2: Parse diffusion quantization JSON before stage-config merge

--diffusion-quantization-config is registered as type=str, so it can remain a raw JSON string in engine_args. Only the default-stage construction path normalizes this field, but YAML stage-config flows merge base_engine_args directly and can pass that string into OmniDiffusionConfig, which rejects non-dict/non-config quantization_config with a TypeError; this makes the new flag fail for --stage-configs-path users unless they use the older parsed --quantization-config option.
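
A sketch of the normalization this suggests, run once before base_engine_args is merged into any stage config (the dict key is an assumption based on the comment):

import json

def normalize_quantization_config(engine_args: dict) -> None:
    # --diffusion-quantization-config arrives as a raw JSON string; parse
    # it here so every downstream path, including YAML stage-config merges,
    # sees a dict rather than a str.
    value = engine_args.get("quantization_config")
    if isinstance(value, str):
        engine_args["quantization_config"] = json.loads(value)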


@Gaohan123
Collaborator

Please fix conflicts. Thanks

@dongbo910220
Contributor Author

dongbo910220 commented Mar 23, 2026

Please fix conflicts. Thanks

Found an issue with the INT8 scenario; after fixing it, I'll resolve the conflict. Thanks!

@dongbo910220 dongbo910220 marked this pull request as draft March 23, 2026 16:58
@dongbo910220
Contributor Author

dongbo910220 commented Mar 28, 2026

@Michael-Zzq please fix the issue with the INT8 scenario, thanks

@lishunyang12
Collaborator

@dongbo910220 Take note of vllm-project/vllm#39583

