[Online Quantization] Support memory-efficient online quantization via layerwise loading #34184

Closed

kylesayrs wants to merge 2 commits into vllm-project:main from neuralmagic:kylesayrs/layerwise-loading-online-quant

Conversation

@kylesayrs (Contributor) commented Feb 9, 2026

Purpose

  • Support online quantization in a more maintainable way by integrating with existing layerwise processing functionality

Changes

  • Change the layerwise logic to copy and re-place weights into kernel tensors only when reloading (see the sketch below)
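
To make the intent concrete, here is a minimal sketch of that control flow. The helper names (`quantize_fp8`, `process_linear`) and the toy per-tensor scheme are illustrative assumptions, not the PR's actual code:

```python
import torch

def quantize_fp8(weight: torch.Tensor) -> dict[str, torch.Tensor]:
    # Toy per-tensor FP8 quantization, for illustration only.
    scale = weight.abs().amax().clamp(min=1e-12) / torch.finfo(torch.float8_e4m3fn).max
    return {
        "weight": (weight / scale).to(torch.float8_e4m3fn),
        "weight_scale": scale,
    }

def process_linear(layer: torch.nn.Linear, new_weight: torch.Tensor, reloading: bool) -> None:
    kernel_tensors = quantize_fp8(new_weight)
    if reloading:
        # Reload path: copy values into the already-placed kernel tensors,
        # so references held elsewhere (e.g. captured CUDA graphs) stay valid.
        for name, value in kernel_tensors.items():
            getattr(layer, name).data.copy_(value)
    else:
        # Initial load: create and place the kernel tensors for the first time.
        for name, value in kernel_tensors.items():
            layer.register_parameter(
                name, torch.nn.Parameter(value, requires_grad=False)
            )
```

On the first load the kernel tensors are created and placed on the module; on a reload, values are copied into the existing tensors instead, which preserves tensor identity for anything already holding a reference.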

Weights with initialized values

Handling weights which require values placed at init time is a little tricky. One example is rotary embeddings, whose values are created at init time and are not loaded from disk. To avoid overwriting these values with materialized empty tensors, we explicitly exclude these modules from our restore/materialize process. Handling a weight which both initializes some values and loads others would be more complex, although not impossible. One way of doing this would be to load values directly into the weights and count those, rather than using get_numel_loaded.
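
As a rough illustration of the exclusion, the sketch below materializes meta tensors onto a real device while skipping such modules. The `INIT_VALUE_MODULES` list and the `materialize_module` helper are hypothetical, not taken from this PR:

```python
import torch

# Modules whose tensors carry values computed in __init__ rather than loaded
# from disk. This name list is illustrative, not the PR's actual list.
INIT_VALUE_MODULES = ("rotary_emb",)

def materialize_module(name: str, module: torch.nn.Module, device: torch.device) -> None:
    if any(part in INIT_VALUE_MODULES for part in name.split(".")):
        # Keep the init-time values; do not overwrite them with empty tensors.
        return
    for pname, param in list(module.named_parameters(recurse=False)):
        if param.device.type == "meta":
            # Replace the meta tensor with an uninitialized tensor on the
            # target device; real values arrive later via weight loading.
            materialized = torch.empty_like(param, device=device)
            module.register_parameter(
                pname, torch.nn.Parameter(materialized, requires_grad=False)
            )
```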

TODO

  • Rename "reload" utilities to just "layerwise"
  • Break out attention processing into a separate for loop
  • Check that there are no side effects from loading and processing the model in device context

Testing

  • Quantized reloading regression tests pass (test_reload.py)
  • Smoke tested online quantization:

```python
from vllm import LLM

llm = LLM("Qwen/Qwen3-0.6B", quantization="fp8", tensor_parallel_size=2)
llm.generate("Hello there")
```

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a memory-efficient online quantization method by implementing layerwise loading. This is a significant improvement that refactors specialized loading logic from Fp8OnlineLinearMethod into a more generic and reusable layerwise loading mechanism. The changes are well-structured and adapt the existing reloading infrastructure for initial model loading.

However, I've identified a critical issue where the new generic loading mechanism is enabled for all online FP8 quantization paths, but the necessary refactoring was not applied to Fp8OnlineMoEMethod. This will likely cause issues for MoE models. My detailed comment addresses this with a recommendation for a fix.

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
vkuzo pushed a commit to vkuzo/vllm that referenced this pull request Feb 11, 2026
Summary:

Copy of vllm-project#34184

Test Plan: TODO

Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>
@kylesayrs kylesayrs closed this Feb 11, 2026
