[Feature] Add quantization support for NextStep-1.1 model #2482
banparth wants to merge 3 commits into
Signed-off-by: Parth Bansal <parthbansal127@gmail.com>
cc @xin3he
Please show a comparison of the output images and run the necessary tests; refer to #1470. Also, which GPU device are you using?
Hi @lishunyang12, LPIPS is quite high compared to the diffusion models that you shared, most likely because of the autoregressive nature of this model. Some of the images also appear to be just noise. I am also not seeing any speedup, most likely because of the overhead of quantizing the activation tensors. Here is the summary:
Are these results expected? Do you see anything wrong here? I would appreciate any insights. Thanks! Also, I am using an H100.
Hi @banparth, I saw that this PR supports INT8 and FP8 on-the-fly quantization in vllm-omni. Do you plan to support quantized W4A16 models, such as INCModel/NextStep-1.1-W4A16-AutoRound?
In this case, I think we should turn to offline quantization, in which the weights are well calibrated. We can keep this PR as a reference.
lishunyang12 left a comment
Review: [Feature] Add quantization support for NextStep-1.1 model
Summary
The change is straightforward and follows the established pattern used by other diffusion models in the repo (Flux, Helios, Bagel, etc.) for threading quant_config through the model constructor chain. The plumbing is correct: pipeline -> NextStepModel -> LlamaDecoderLayer -> LlamaAttention / LlamaMLP -> vLLM parallel linear layers.
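To make the plumbing concrete, here is a minimal sketch of how a single quant_config object can be threaded from the top-level model down to every linear layer. All class names below are hypothetical stand-ins written in plain Python; they are not the classes in this PR and they do not use the vLLM API.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class QuantConfigStub:
    """Hypothetical stand-in for a quantization config (not the vLLM class)."""
    method: str


class LinearStub:
    """Stand-in for a parallel linear layer that accepts a quant_config."""

    def __init__(self, in_features: int, out_features: int,
                 quant_config: Optional[QuantConfigStub] = None) -> None:
        self.in_features = in_features
        self.out_features = out_features
        self.quant_config = quant_config


class AttentionSketch:
    def __init__(self, hidden: int,
                 quant_config: Optional[QuantConfigStub] = None) -> None:
        # Every projection receives the same config object.
        self.qkv_proj = LinearStub(hidden, 3 * hidden, quant_config)
        self.o_proj = LinearStub(hidden, hidden, quant_config)


class MLPSketch:
    def __init__(self, hidden: int, intermediate: int,
                 quant_config: Optional[QuantConfigStub] = None) -> None:
        self.gate_up_proj = LinearStub(hidden, 2 * intermediate, quant_config)
        self.down_proj = LinearStub(intermediate, hidden, quant_config)


class DecoderLayerSketch:
    def __init__(self, hidden: int, intermediate: int,
                 quant_config: Optional[QuantConfigStub] = None) -> None:
        self.self_attn = AttentionSketch(hidden, quant_config)
        self.mlp = MLPSketch(hidden, intermediate, quant_config)


class ModelSketch:
    def __init__(self, num_layers: int, hidden: int, intermediate: int,
                 quant_config: Optional[QuantConfigStub] = None) -> None:
        self.layers = [
            DecoderLayerSketch(hidden, intermediate, quant_config)
            for _ in range(num_layers)
        ]

    def linear_layers(self) -> Iterator[LinearStub]:
        """Yield every linear layer so the threading can be inspected."""
        for layer in self.layers:
            yield layer.self_attn.qkv_proj
            yield layer.self_attn.o_proj
            yield layer.mlp.gate_up_proj
            yield layer.mlp.down_proj
```

The point of the pattern is that no intermediate module interprets the config; each one only forwards it, so the decision about how to quantize stays inside the linear layers.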
Issues
1. Missing type annotations on quant_config parameter (minor)
Every new quant_config=None parameter across the three files lacks a type hint. Other models in this repo annotate it as quant_config: "QuantizationConfig | None" = None (see e.g. diffusion/layers/adalayernorm.py). Please add type annotations for consistency and to help static analysis tools.
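As an illustration of the requested annotation, the sketch below uses a quoted type hint together with a TYPE_CHECKING-only import, so the class is never imported at runtime. The class name AnnotatedMLPSketch is hypothetical, and the import path for QuantizationConfig is an assumption about the vLLM package layout; please match whatever the other models in this repo import.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Assumed import path; only evaluated by static type checkers.
    from vllm.model_executor.layers.quantization.base_config import (
        QuantizationConfig,
    )


class AnnotatedMLPSketch:
    """Hypothetical module showing the annotated quant_config parameter."""

    def __init__(
        self,
        hidden_size: int,
        quant_config: "QuantizationConfig | None" = None,
    ) -> None:
        self.hidden_size = hidden_size
        # Stored as-is; real code would pass it on to the linear layers.
        self.quant_config = quant_config
```

Because the annotation is a string, this form also works on Python versions that do not support the `X | None` syntax at runtime.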
2. Quality results are concerning (not blocking, but needs discussion)
Per the benchmark table in the comments, FP8 on all layers and FP8 skipping MLP both produce noise rather than usable images. INT8 configurations show high LPIPS (0.27–0.57). This is a meaningful quality regression.
Since this PR is described as "the first step" for issue #1815, I understand the intent is to land the infrastructure first. However, I would recommend:
- Adding a note in the PR description or code comments about which quantization configurations are known to work and which produce degraded output.
- Consider whether the default behavior (quantizing all linear layers) should be gated or warned about, given that FP8-all-layers produces noise for this specific model.
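One possible shape for such a warning is sketched below. The function name, the lookup table, and the model key are all hypothetical; the only facts taken from this thread are that the FP8 configurations tested so far produced noise for NextStep-1.1.

```python
import warnings
from typing import Dict, Optional, Tuple

# Hypothetical table of (model, quantization method) pairs that are known
# to degrade output. The single entry reflects the benchmark in this thread.
KNOWN_DEGRADED: Dict[Tuple[str, str], str] = {
    ("nextstep-1.1", "fp8"): (
        "FP8 quantization produced noise instead of usable images "
        "in the configurations tested so far."
    ),
}


def warn_if_known_degraded(model_name: str,
                           quant_method: Optional[str]) -> bool:
    """Emit a UserWarning for known-bad combinations.

    Returns True if a warning was emitted, False otherwise.
    """
    if quant_method is None:
        return False
    key = (model_name.lower(), quant_method.lower())
    reason = KNOWN_DEGRADED.get(key)
    if reason is None:
        return False
    warnings.warn(
        f"{model_name} with {quant_method} quantization: {reason}",
        UserWarning,
        stacklevel=2,
    )
    return True
```

A warning rather than a hard error keeps the infrastructure usable for experimentation while still making the limitation visible to users.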
3. lm_head is not quantized (informational)
NextStepModel.lm_head remains a plain nn.Linear. The existing comment says it is "not used during image generation", so this is likely intentional. Just flagging for awareness — if it is ever used in a future code path, it would silently remain unquantized.
Verdict
The code change itself is clean and minimal (+15/-8). The plumbing is correct and consistent with how other models handle quantization in this repo. The main concern is around output quality — the FP8 noise issue should at minimum be documented so users are aware.
LGTM with the minor suggestion to add type annotations. I am leaving this as a comment rather than an approval because:
- The type annotations should be added for consistency.
- The quality regression in the comments (FP8 producing noise) should be acknowledged in the PR description or as a code comment, so future users know the limitations.
Purpose
This PR is the first step toward issue #1815.
Thread quant_config through the NextStep-1.1 model constructor chain so that quantized model loading (FP8, INT8, etc.) is supported.
Test Plan
Run the following command to see the peak memory usage:
Peak GPU memory (this request): 29.58 GB
Peak GPU memory (this request): 17.36 GB
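For reference, the difference between the two readings can be computed as below. This assumes the first figure is the unquantized baseline and the second is the quantized run, which the output above does not state explicitly; the helper name is hypothetical.

```python
from typing import Tuple


def memory_reduction(before_gb: float, after_gb: float) -> Tuple[float, float]:
    """Return (absolute saving in GB, relative saving in percent)."""
    saved = before_gb - after_gb
    return saved, 100.0 * saved / before_gb


# Figures copied from the two "Peak GPU memory" lines above.
saved_gb, saved_pct = memory_reduction(29.58, 17.36)
print(f"saved {saved_gb:.2f} GB ({saved_pct:.1f}%)")
```

Under that assumption, the saving is about 12.22 GB, or roughly 41.3% of the baseline peak.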
Test Result