[Feature] Add quantization support for NextStep-1.1 model by banparth · Pull Request #2482 · vllm-project/vllm-omni

banparth · 2026-04-04T12:10:03Z


Purpose

This PR is the first step toward resolving issue #1815.

Thread quant_config through the NextStep-1.1 model constructor chain so quantized model loading (FP8, Int8, etc.) is supported.

Test Plan

Run the following commands to measure peak memory usage, first without quantization and then with --quantization fp8:

.venv/bin/python examples/offline_inference/text_to_image/text_to_image.py --model stepfun-ai/NextStep-1.1 --prompt "a cat sitting on a windowsill" --height 512 --width 512 --num-inference-steps 20 --guidance-scale 7.5 --seed 42 --output nextstep_bf16.png

Peak GPU memory (this request): 29.58 GB

.venv/bin/python examples/offline_inference/text_to_image/text_to_image.py --model stepfun-ai/NextStep-1.1 --prompt "a cat sitting on a windowsill" --height 512 --width 512 --num-inference-steps 20 --guidance-scale 7.5 --seed 42 --output nextstep_bf16.png --quantization fp8

Peak GPU memory (this request): 17.36 GB

Test Result


Signed-off-by: Parth Bansal <parthbansal127@gmail.com>

banparth · 2026-04-04T15:52:43Z

cc @xin3he

lishunyang12 · 2026-04-04T20:01:25Z

Please show a comparison of the output images and run the necessary tests; refer to #1470. Also, which GPU device are you using?

banparth · 2026-04-05T13:23:32Z

Hi @lishunyang12,

LPIPS is quite high compared with the diffusion models that you shared. I think this is most likely because of the autoregressive nature of the model. Some of the images also appear to be just noise. I am not seeing any speedup either, most likely because of the overhead of quantizing the activation tensors.
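To make the suspected overhead concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization. It is illustrative only and is not the kernel vLLM uses; the function names are hypothetical. The point is that a dynamic scheme has to scan each activation tensor for its maximum and rescale it on every forward pass, which is extra work the BF16 path does not do.

```python
def quantize_int8_per_tensor(values):
    """Symmetric per-tensor INT8 quantization of a list of floats (illustrative)."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


activations = [0.02, -1.5, 0.7, 3.0]
codes, scale = quantize_int8_per_tensor(activations)
restored = dequantize(codes, scale)
# Rounding to the nearest step bounds the error by half a step.
max_error = max(abs(a - r) for a, r in zip(activations, restored))
```

Small values such as 0.02 land on a coarse grid whose step is set by the largest value in the tensor, which is one generic way quantization error can accumulate; whether that explains the results here is not established in this thread.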

Here is the summary:

Config	Peak Memory	Mem Reduction	Time	LPIPS
BF16 baseline	29.58 GiB	—	56.3s	(ref)
FP8 skip self_attn	20.08 GiB	32%	69.9s	0.2701
INT8 skip mlp	27.11 GiB	8%	57.0s	0.3343
INT8 skip self_attn	20.05 GiB	32%	59.7s	0.5273
INT8 all layers	17.39 GiB	41%	56.5s	0.5754
FP8 all layers	~16.7 GiB	44%	—	NOISE
FP8 skip mlp	26.85 GiB	9%	70.3s	NOISE
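The "Mem Reduction" column can be re-derived from the peak-memory figures as a sanity check. The short sketch below uses only the numbers from the table; the variable and function names are illustrative.

```python
BASELINE_GIB = 29.58  # BF16 baseline peak memory from the table

peak_memory_gib = {
    "FP8 skip self_attn": 20.08,
    "INT8 skip mlp": 27.11,
    "INT8 skip self_attn": 20.05,
    "INT8 all layers": 17.39,
    "FP8 all layers": 16.7,  # reported as approximate (~16.7 GiB)
    "FP8 skip mlp": 26.85,
}


def reduction_percent(peak, baseline=BASELINE_GIB):
    """Peak-memory reduction relative to the baseline, in percent."""
    return (1.0 - peak / baseline) * 100.0


reductions = {
    name: round(reduction_percent(peak)) for name, peak in peak_memory_gib.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}%")
```

The rounded values agree with the column as reported.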

Are these results expected? Do you find anything wrong here? Would appreciate any insights. Thanks!

Also, I am using H100.

xin3he · 2026-04-10T03:09:02Z

Hi @banparth, I saw that this PR supports INT8 and FP8 on-the-fly quantization in vllm-omni. Do you plan to support quantized W4A16 models, such as INCModel/NextStep-1.1-W4A16-AutoRound?

lishunyang12 · 2026-04-11T00:30:04Z

> Hi @banparth, I saw that this PR supports INT8 and FP8 on-the-fly quantization in vllm-omni. Do you plan to support quantized W4A16 models, such as INCModel/NextStep-1.1-W4A16-AutoRound?

In that case, I think we should turn to offline quantization, in which the weights are well calibrated. We can keep this PR as a reference.

lishunyang12

Review: [Feature] Add quantization support for NextStep-1.1 model

Summary

The change is straightforward and follows the established pattern used by other diffusion models in the repo (Flux, Helios, Bagel, etc.) for threading quant_config through the model constructor chain. The plumbing is correct: pipeline -> NextStepModel -> LlamaDecoderLayer -> LlamaAttention / LlamaMLP -> vLLM parallel linear layers.
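The plumbing pattern described above can be sketched with toy classes. This is not the PR's code: every class here is a hypothetical stand-in (no vLLM imports), and the projection names are assumptions chosen for illustration. It only shows how one config object is passed down unchanged until it reaches the linear layers, which are the only place it is consumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyQuantConfig:
    """Hypothetical stand-in for a quantization config (not a vLLM class)."""
    method: str


class ToyLinear:
    """Stand-in for a parallel linear layer; it only records its quant method."""

    def __init__(self, prefix: str, quant_config: Optional[ToyQuantConfig] = None):
        self.prefix = prefix
        self.quant_method = quant_config.method if quant_config else "unquantized"


class ToyAttention:
    def __init__(self, prefix: str, quant_config: Optional[ToyQuantConfig] = None):
        self.qkv_proj = ToyLinear(f"{prefix}.qkv_proj", quant_config)
        self.o_proj = ToyLinear(f"{prefix}.o_proj", quant_config)


class ToyMLP:
    def __init__(self, prefix: str, quant_config: Optional[ToyQuantConfig] = None):
        self.gate_up_proj = ToyLinear(f"{prefix}.gate_up_proj", quant_config)
        self.down_proj = ToyLinear(f"{prefix}.down_proj", quant_config)


class ToyDecoderLayer:
    def __init__(self, prefix: str, quant_config: Optional[ToyQuantConfig] = None):
        # Each intermediate module forwards the config without inspecting it.
        self.self_attn = ToyAttention(f"{prefix}.self_attn", quant_config)
        self.mlp = ToyMLP(f"{prefix}.mlp", quant_config)


class ToyModel:
    def __init__(self, num_layers: int, quant_config: Optional[ToyQuantConfig] = None):
        self.layers = [
            ToyDecoderLayer(f"layers.{i}", quant_config) for i in range(num_layers)
        ]

    def linear_layers(self):
        """Yield every leaf linear layer in the model."""
        for layer in self.layers:
            yield layer.self_attn.qkv_proj
            yield layer.self_attn.o_proj
            yield layer.mlp.gate_up_proj
            yield layer.mlp.down_proj


quantized = ToyModel(num_layers=2, quant_config=ToyQuantConfig("fp8"))
plain = ToyModel(num_layers=2)
```

If any constructor in the chain drops the argument, every layer beneath it silently falls back to the unquantized path, which is why the chain has to be threaded end to end.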

Issues

1. Missing type annotations on quant_config parameter (minor)

Every new quant_config=None parameter across the three files lacks a type hint. Other models in this repo annotate it as quant_config: "QuantizationConfig | None" = None (see e.g. diffusion/layers/adalayernorm.py). Please add type annotations for consistency and to help static analysis tools.
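A minimal sketch of the requested annotation style follows. The class is hypothetical and only its signature matters; the import path for QuantizationConfig is an assumption that should be checked against the vLLM version in use. Because the import sits under TYPE_CHECKING and the annotation is a string, nothing is imported at runtime.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime.
    # Assumed import path; verify it against the installed vLLM version.
    from vllm.model_executor.layers.quantization.base_config import (
        QuantizationConfig,
    )


class ExampleAttention:
    """Hypothetical layer used only to show the annotated signature."""

    def __init__(
        self,
        hidden_size: int,
        quant_config: "QuantizationConfig | None" = None,
        prefix: str = "",
    ) -> None:
        self.hidden_size = hidden_size
        self.quant_config = quant_config
        self.prefix = prefix


layer = ExampleAttention(hidden_size=8)
```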

2. Quality results are concerning (not blocking, but needs discussion)

Per the benchmark table in the comments, FP8 on all layers and FP8 skipping MLP both produce noise rather than usable images. The configurations that do produce images (FP8 skipping self_attn and the three INT8 variants) show high LPIPS (0.27–0.58). This is a meaningful quality regression.

Since this PR is described as "the first step" for issue #1815, I understand the intent is to land the infrastructure first. However, I would recommend:

Adding a note in the PR description or code comments about which quantization configurations are known to work and which produce degraded output.
Considering whether the default behavior (quantizing all linear layers) should be gated or warned about, given that FP8-all-layers produces noise for this specific model.
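One possible shape for such a warning is sketched below. The helper and the lookup table are hypothetical and do not exist in vllm-omni; the entries come only from the benchmark results reported in this thread (H100, 512x512, 20 steps) and would need re-validation before being relied on.

```python
import warnings

# Hypothetical table of (method, skipped modules) pairs that produced noise
# in the benchmark reported in this thread.
KNOWN_NOISE_CONFIGS = {
    ("fp8", frozenset()),         # FP8 on all layers
    ("fp8", frozenset({"mlp"})),  # FP8 skipping mlp
}


def warn_if_known_bad(method, skipped_modules=()):
    """Warn and return True if this configuration is known to produce noise."""
    key = (method.lower(), frozenset(skipped_modules))
    if key in KNOWN_NOISE_CONFIGS:
        warnings.warn(
            f"Quantization '{method}' with skipped modules "
            f"{sorted(skipped_modules)} produced noise for NextStep-1.1 "
            "in earlier benchmarks; consider skipping self_attn.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


fp8_all_is_bad = warn_if_known_bad("fp8")
fp8_skip_attn_is_bad = warn_if_known_bad("fp8", ["self_attn"])
```

A warning keeps the default behavior unchanged while still surfacing the limitation; a hard gate would be the stricter alternative raised in the review.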

3. lm_head is not quantized (informational)

NextStepModel.lm_head remains a plain nn.Linear. The existing comment says it is "not used during image generation", so this is likely intentional. Just flagging for awareness — if it is ever used in a future code path, it would silently remain unquantized.

Verdict

The code change itself is clean and minimal (+15/-8). The plumbing is correct and consistent with how other models handle quantization in this repo. The main concern is around output quality — the FP8 noise issue should at minimum be documented so users are aware.

LGTM with the minor suggestion to add type annotations. I am leaving this as a comment rather than an approval because:

The type annotations should be added for consistency.
The quality regression in the comments (FP8 producing noise) should be acknowledged in the PR description or as a code comment, so future users know the limitations.

banparth added 2 commits April 4, 2026 10:27

[Feature] Add support for quant_config in NextStep

dfe5247

Signed-off-by: Parth Bansal <parthbansal127@gmail.com>

update

5d4326a

Signed-off-by: Parth Bansal <parthbansal127@gmail.com>

banparth requested a review from hsliuustc0106 as a code owner April 4, 2026 12:10

banparth changed the title ~~Next step quant~~ [Feature] Add quantization support for NextStep-1.1 model Apr 4, 2026

update

4694068

Signed-off-by: Parth Bansal <parthbansal127@gmail.com>

lishunyang12 mentioned this pull request Apr 15, 2026

[RFC]: Continuous Quantization Support #1854

Open

lishunyang12 reviewed Apr 16, 2026


[Feature] Add quantization support for NextStep-1.1 model #2482
banparth wants to merge 3 commits into vllm-project:main from banparth:next-step-quant

3 participants