[Feature] Add quantization support for NextStep-1.1 model#2482

Open
banparth wants to merge 3 commits into
vllm-project:mainfrom
banparth:next-step-quant

@banparth banparth commented Apr 4, 2026

Purpose

This PR is the first step toward resolving issue #1815.

Thread quant_config through the NextStep-1.1 model constructor chain so that quantized model loading (FP8, INT8, etc.) is supported.
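
To illustrate what "threading" means here, the following is a minimal sketch with toy classes, not the actual NextStep-1.1 or vLLM code. All class names, the `ToyQuantConfig` dataclass, and the projection names are hypothetical; the only point is that every constructor accepts `quant_config` and forwards it unchanged until it reaches the leaf linear layers.

```python
from dataclasses import dataclass


@dataclass
class ToyQuantConfig:
    """Hypothetical stand-in for a real quantization config."""
    method: str  # e.g. "fp8" or "int8"


class ToyLinear:
    """Stand-in for a linear layer that consumes quant_config."""

    def __init__(self, name, quant_config=None):
        self.name = name
        # A real layer would select a quantized kernel here; we only record the choice.
        self.quant_method = quant_config.method if quant_config else "unquantized"


class ToyAttention:
    def __init__(self, quant_config=None):
        self.qkv_proj = ToyLinear("qkv_proj", quant_config)
        self.o_proj = ToyLinear("o_proj", quant_config)


class ToyMLP:
    def __init__(self, quant_config=None):
        self.gate_up_proj = ToyLinear("gate_up_proj", quant_config)
        self.down_proj = ToyLinear("down_proj", quant_config)


class ToyDecoderLayer:
    def __init__(self, quant_config=None):
        # Forward the same config object to both sub-blocks.
        self.self_attn = ToyAttention(quant_config)
        self.mlp = ToyMLP(quant_config)


class ToyModel:
    def __init__(self, num_layers, quant_config=None):
        self.layers = [ToyDecoderLayer(quant_config) for _ in range(num_layers)]


def leaf_methods(model):
    """Collect the quant method seen by every leaf linear layer."""
    out = []
    for layer in model.layers:
        for block in (layer.self_attn, layer.mlp):
            out.extend(lin.quant_method for lin in vars(block).values())
    return out


quantized = ToyModel(num_layers=2, quant_config=ToyQuantConfig("fp8"))
baseline = ToyModel(num_layers=2)
```

If any constructor in the chain dropped the argument, the layers beneath it would silently fall back to the unquantized path, which is why the change has to touch every level.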

Test Plan

Run the following command to see the peak memory usage:

.venv/bin/python examples/offline_inference/text_to_image/text_to_image.py --model stepfun-ai/NextStep-1.1 --prompt "a cat sitting on a windowsill" --height 512 --width 512 --num-inference-steps 20 --guidance-scale 7.5 --seed 42 --output nextstep_bf16.png

Peak GPU memory (this request): 29.58 GB

.venv/bin/python examples/offline_inference/text_to_image/text_to_image.py --model stepfun-ai/NextStep-1.1 --prompt "a cat sitting on a windowsill" --height 512 --width 512 --num-inference-steps 20 --guidance-scale 7.5 --seed 42 --output nextstep_bf16.png --quantization fp8

Peak GPU memory (this request): 17.36 GB

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

banparth added 2 commits April 4, 2026 10:27
Signed-off-by: Parth Bansal <parthbansal127@gmail.com>
Signed-off-by: Parth Bansal <parthbansal127@gmail.com>
@banparth banparth requested a review from hsliuustc0106 as a code owner April 4, 2026 12:10
@banparth banparth changed the title Next step quant [Feature] Add quantization support for NextStep-1.1 model Apr 4, 2026
Signed-off-by: Parth Bansal <parthbansal127@gmail.com>
@banparth
Author

banparth commented Apr 4, 2026

cc @xin3he

@lishunyang12
Collaborator

Please show a comparison of the output images and run the necessary tests; refer to #1470. Also, which GPU device are you using?

@banparth
Author

banparth commented Apr 5, 2026

Hi @lishunyang12,

LPIPS comes out quite high compared to the diffusion models that you shared. I think this is most likely due to the autoregressive nature of the model. Also, some of the images appear to be just noise. I am also not seeing any speedup, most likely because of the overhead of quantizing the activation tensors.

Here is the summary:

| Config | Peak Memory | Mem Reduction | Time | LPIPS |
|---|---|---|---|---|
| BF16 baseline | 29.58 GiB | | 56.3s | (ref) |
| FP8 skip self_attn | 20.08 GiB | 32% | 69.9s | 0.2701 |
| INT8 skip mlp | 27.11 GiB | 8% | 57.0s | 0.3343 |
| INT8 skip self_attn | 20.05 GiB | 32% | 59.7s | 0.5273 |
| INT8 all layers | 17.39 GiB | 41% | 56.5s | 0.5754 |
| FP8 all layers | ~16.7 GiB | 44% | | NOISE |
| FP8 skip mlp | 26.85 GiB | 9% | 70.3s | NOISE |

Are these results expected? Do you find anything wrong here? Would appreciate any insights. Thanks!

Also, I am using H100.
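
As a sanity check on the summary above, the memory-reduction column can be recomputed from the reported peak-memory figures, and the timings can be expressed as a slowdown relative to the BF16 baseline. This is only a small arithmetic sketch; the helper names are made up for illustration, and the numbers are copied from the table.

```python
BASELINE_GIB = 29.58
BASELINE_SECONDS = 56.3

# (config, peak memory in GiB, time in seconds or None if not reported)
runs = [
    ("FP8 skip self_attn", 20.08, 69.9),
    ("INT8 skip mlp", 27.11, 57.0),
    ("INT8 skip self_attn", 20.05, 59.7),
    ("INT8 all layers", 17.39, 56.5),
    ("FP8 all layers", 16.7, None),
    ("FP8 skip mlp", 26.85, 70.3),
]


def mem_reduction_pct(peak_gib, baseline_gib=BASELINE_GIB):
    """Peak-memory reduction relative to the baseline, rounded to a whole percent."""
    return round(100 * (baseline_gib - peak_gib) / baseline_gib)


def slowdown_pct(seconds, baseline_seconds=BASELINE_SECONDS):
    """Extra wall-clock time relative to the baseline, in percent."""
    return 100 * (seconds - baseline_seconds) / baseline_seconds


for name, peak, seconds in runs:
    timing = "n/a" if seconds is None else f"{slowdown_pct(seconds):+.1f}%"
    print(f"{name:<20} mem -{mem_reduction_pct(peak)}%  time {timing}")
```

Every configuration with a reported time is slower than the baseline, which is consistent with the observation that no speedup was seen.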

@xin3he

xin3he commented Apr 10, 2026

Hi @banparth, I saw that this PR supports INT8 and FP8 on-the-fly quantization in vllm-omni. Do you plan to support quantized W4A16 models, such as INCModel/NextStep-1.1-W4A16-AutoRound?

@lishunyang12
Collaborator

> Hi @banparth, I saw that this PR supports INT8 and FP8 on-the-fly quantization in vllm-omni. Do you plan to support quantized W4A16 models, such as INCModel/NextStep-1.1-W4A16-AutoRound?

In this case, I think we should turn to offline quantization, in which the weights are well calibrated. We can keep this PR as a reference.

Collaborator

@lishunyang12 lishunyang12 left a comment

Review: [Feature] Add quantization support for NextStep-1.1 model

Summary

The change is straightforward and follows the established pattern used by other diffusion models in the repo (Flux, Helios, Bagel, etc.) for threading quant_config through the model constructor chain. The plumbing is correct: pipeline -> NextStepModel -> LlamaDecoderLayer -> LlamaAttention / LlamaMLP -> vLLM parallel linear layers.

Issues

1. Missing type annotations on quant_config parameter (minor)

Every new quant_config=None parameter across the three files lacks a type hint. Other models in this repo annotate it as quant_config: "QuantizationConfig | None" = None (see e.g. diffusion/layers/adalayernorm.py). Please add type annotations for consistency and to help static analysis tools.
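
A minimal sketch of the requested annotation style follows. `ExampleBlock` is a hypothetical class, and the import path for `QuantizationConfig` is an assumption based on upstream vLLM; it sits under `TYPE_CHECKING`, so it is only resolved by static analysis tools and not at runtime.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Assumed upstream vLLM location; evaluated by type checkers only.
    from vllm.model_executor.layers.quantization.base_config import QuantizationConfig


class ExampleBlock:
    """Hypothetical module; only the constructor signature matters here."""

    def __init__(
        self,
        hidden_size: int,
        quant_config: "QuantizationConfig | None" = None,
    ) -> None:
        self.hidden_size = hidden_size
        # Stored so it can be forwarded to child layers.
        self.quant_config = quant_config
```

Because the annotation is a string, the module still imports cleanly in environments where the quantization package is unavailable.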

2. Quality results are concerning (not blocking, but needs discussion)

Per the benchmark table in the comments, FP8 on all layers and FP8 skipping MLP both produce noise rather than usable images. The remaining configurations (FP8 skipping self_attn and all INT8 variants) show high LPIPS (roughly 0.27 to 0.58). This is a meaningful quality regression.

Since this PR is described as "the first step" for issue #1815, I understand the intent is to land the infrastructure first. However, I would recommend:

  • Adding a note in the PR description or code comments about which quantization configurations are known to work and which produce degraded output.
  • Considering whether the default behavior (quantizing all linear layers) should be gated or warned about, given that FP8-all-layers produces noise for this specific model.
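
One possible shape for the warning suggested above, as a rough sketch only: `warn_if_known_bad`, the `KNOWN_NOISE_CONFIGS` set, and its key format are all hypothetical and not part of this PR or of vllm-omni. The two entries are the configurations reported as producing noise in the benchmark table.

```python
import warnings

# Configurations reported as NOISE in the benchmark table in this thread.
# Keys are (method, skipped module or None); this format is invented for the sketch.
KNOWN_NOISE_CONFIGS = {
    ("fp8", None),   # FP8 on all layers
    ("fp8", "mlp"),  # FP8 skipping mlp
}


def warn_if_known_bad(method, skipped=None):
    """Emit a UserWarning for configurations known to produce noise.

    Returns True if a warning was emitted, False otherwise.
    """
    key = (method.lower(), skipped)
    if key in KNOWN_NOISE_CONFIGS:
        warnings.warn(
            f"NextStep-1.1 with {method} (skip={skipped}) produced noise in reported tests.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

A warning rather than a hard error keeps the infrastructure usable for experimentation while still making the known limitation visible.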

3. lm_head is not quantized (informational)

NextStepModel.lm_head remains a plain nn.Linear. The existing comment says it is "not used during image generation", so this is likely intentional. Just flagging for awareness — if it is ever used in a future code path, it would silently remain unquantized.

Verdict

The code change itself is clean and minimal (+15/-8). The plumbing is correct and consistent with how other models handle quantization in this repo. The main concern is around output quality — the FP8 noise issue should at minimum be documented so users are aware.

LGTM with the minor suggestion to add type annotations. I am leaving this as a comment rather than an approval because:

  1. The type annotations should be added for consistency.
  2. The quality regression in the comments (FP8 producing noise) should be acknowledged in the PR description or as a code comment, so future users know the limitations.
