[Feature]: Bitsandbytes Quantization Support for Diffusion Pipelines #1528
dongbo910220 wants to merge 8 commits into vllm-project:main
Conversation
Signed-off-by: dongbo910220 <1275604947@qq.com> Co-authored-by: Michael-Zhou <08123338@cumt.edu.cn>
lishunyang12 left a comment
Left a few comments. The main concerns are the duplicated alias maps (three copies of the same normalization table) and the global mutable state for the from_pretrained monkey-patch.
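For context, a rough sketch of the shape I had in mind: one shared alias table plus a lock-guarded context manager around the patch, so nothing global leaks. The names below (`_BACKEND_ALIASES`, `patched_from_pretrained`, the alias keys) are illustrative, not the PR's actual identifiers.

```python
import threading
from contextlib import contextmanager

# One shared normalization table instead of three duplicated copies.
_BACKEND_ALIASES = {
    "bnb": "bitsandbytes_4bit",
    "bitsandbytes": "bitsandbytes_4bit",
    "bnb-8bit": "bitsandbytes_8bit",
}

_PATCH_LOCK = threading.Lock()


@contextmanager
def patched_from_pretrained(target_cls, make_wrapper):
    """Temporarily replace target_cls.from_pretrained; always restore it,
    and serialize patching so concurrent pipelines cannot race."""
    with _PATCH_LOCK:
        original = target_cls.from_pretrained
        target_cls.from_pretrained = make_wrapper(original)
        try:
            yield
        finally:
            target_cls.from_pretrained = original
```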
lishunyang12 left a comment
Left a few comments — the thread-safety issue in the patch mechanism and the CLI conflict handler are the main blockers.
Left a second round of comments. Beyond the inline feedback, I'd suggest iterating on the CLI flag and the patch mechanism before merging. Happy to re-review once those are addressed. cc @david6666666
@lishunyang12 Addressed the patch race, moved offload tracking to a dedicated state, and switched to diffusion-specific CLI flags.
Thanks for the updates. Tried to test this E2E, but the branch conflicts with main. Two inline comments are still open; will do a final pass + E2E test after the rebase.
@lishunyang12 updated. Merged origin/main into the branch to resolve the conflict and unblock E2E.
force-pushed from 079b7aa to f7962d6
Signed-off-by: dongbo910220 <1275604947@qq.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 00bf76ace1
```python
if backend == "bitsandbytes_8bit":
    bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=bool(enable_cpu_offload),
        # ...
    )
```
Honor llm_int8 CPU-offload setting in BnB config
DiffusionBitsAndBytesConfig defines llm_int8_enable_fp32_cpu_offload, but this load-time path wires BitsAndBytesConfig.llm_int8_enable_fp32_cpu_offload to enable_cpu_offload (the diffusion offload mode) instead of to the quantization config field. As a result, setting llm_int8_enable_fp32_cpu_offload=true in quantization_config is ignored unless pipeline offload is also enabled, and enabling pipeline offload forces this flag on even when it was explicitly set to false. That changes memory/device placement behavior unexpectedly during HF from_pretrained quantization.
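A sketch of the direction a fix could take, keeping the two settings independent. The names `quant_config` and `backend`, and the assumption that the snippet above uses diffusers' re-exported BitsAndBytesConfig, are guesses about the surrounding code, not the PR's actual implementation.

```python
from diffusers import BitsAndBytesConfig  # assumed to be the config class the snippet above constructs


def build_bnb_config(backend, quant_config):
    """Sketch: derive the HF-style config purely from the user's quantization
    config, leaving the pipeline-level offload mode out of it."""
    llm_int8_offload = bool(
        getattr(quant_config, "llm_int8_enable_fp32_cpu_offload", False)
    )
    if backend == "bitsandbytes_8bit":
        return BitsAndBytesConfig(
            load_in_8bit=True,
            # Honor the explicit quantization-config value; enable_cpu_offload
            # (the diffusion offload mode) stays a separate concern.
            llm_int8_enable_fp32_cpu_offload=llm_int8_offload,
        )
    return None
```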
```python
    type=str,
    default=None,
    dest="quantization_config",
```
Parse diffusion quantization JSON before stage-config merge
--diffusion-quantization-config is registered as type=str, so it can remain a raw JSON string in engine_args. Only the default-stage construction path normalizes this field, but YAML stage-config flows merge base_engine_args directly and can pass that string into OmniDiffusionConfig, which rejects non-dict/non-config quantization_config with a TypeError; this makes the new flag fail for --stage-configs-path users unless they use the older parsed --quantization-config option.
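One way to address this would be to normalize the field wherever engine args are merged, not only on the default-stage path. The helper name and commented call site below are illustrative:

```python
import json


def normalize_quantization_config(value):
    """Accept either an already-parsed dict or the raw JSON string from
    --diffusion-quantization-config and always return a dict (or None)."""
    if value is None or isinstance(value, dict):
        return value
    if isinstance(value, str):
        return json.loads(value)
    raise TypeError(f"Unsupported quantization_config type: {type(value)!r}")


# Applied to base_engine_args before they are merged into a stage config:
# engine_args["quantization_config"] = normalize_quantization_config(
#     engine_args.get("quantization_config"))
```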
Please fix conflicts. Thanks
Found an issue with the INT8 scenario; after fixing it, I'll resolve the conflict. Thanks!
@Michael-Zzq pls fix the issue with the INT8 scenario, thanks
@dongbo910220 Take note of vllm-project/vllm#39583
Purpose
This PR implements Bitsandbytes (BNB) 4-bit quantization support for Diffusion Pipelines in vllm-omni, as proposed in RFC #1527.
By integrating BNB quantization, we significantly reduce the peak VRAM requirements for diffusion model inference, enabling high-quality image generation on GPUs with limited memory (e.g., consumer-grade cards) and allowing for larger batch sizes in production environments.
Co-author: @Michael-Zzq
Users can enable quantization by specifying the backend:
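For illustration only (the exact keys are still being discussed in review); the snippet below uses the backend and field names mentioned in this thread rather than a finalized API:

```python
# Illustrative sketch; key names follow the review discussion on this PR,
# not a finalized public API.
quantization_config = {
    "backend": "bitsandbytes_4bit",              # or "bitsandbytes_8bit"
    "llm_int8_enable_fp32_cpu_offload": False,   # relevant to the 8-bit path only
}

# Equivalent CLI usage, passing the same payload as JSON:
#   --diffusion-quantization-config '{"backend": "bitsandbytes_4bit"}'
```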
Test Plan
`pytest tests/diffusion/test_bitsandbytes_quantization.py`
Test Result
passed
Qualitative Comparison (VRAM Profile)