
[Bugfix][Quant] fix online fp8 quantization oom#32773

Closed
CSWYF3634076 wants to merge 2 commits intovllm-project:mainfrom
CSWYF3634076:fix-fp8-online

Conversation

@CSWYF3634076
Contributor

@CSWYF3634076 CSWYF3634076 commented Jan 21, 2026

Purpose

#29196 was intended to reduce GPU memory usage by using streaming online fp8 quantization, but in practice it did not achieve this goal. Instead, it increased peak memory usage (even exceeding the memory required by BF16), causing the model to fail to load weights properly.

  • Qwen3-30B-A3B and ERNIE-4.5-VL-28B-A3B-PT: Both can start on a single 80GB GPU with BF16, but fail to start with online FP8 due to OOM.
  • ERNIE-4.5-300B-A47B-PT: Previously could start with 8×80GB GPUs + online FP8, but now fails to start.

vLLM 0.13.0 and 0.14.0 are both affected.

Reverting #29196 fixes the OOM issue with online FP8 quantization. With this PR applied, Qwen3-30B-A3B and ERNIE-4.5-VL-28B-A3B-PT start normally again.
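The failure mode can be sketched with a back-of-the-envelope memory model (pure Python, hypothetical sizes, not vLLM code): quantizing into a freshly allocated FP8 buffer (e.g. via `torch.empty_like`) while the BF16 source tensor is still resident makes the transient peak footprint the sum of both buffers, which can exceed the BF16-only baseline on an 80GB GPU.

```python
# Hypothetical memory accounting for quantizing one set of MoE expert
# weights. All sizes in GiB and purely illustrative; the real allocation
# pattern depends on vLLM's weight-loading order.

def peak_memory_bf16_only(bf16_gib: float) -> float:
    # Baseline: just hold the BF16 weights.
    return bf16_gib

def peak_memory_copy_quantize(bf16_gib: float) -> float:
    # FP8 is 1 byte/element vs 2 bytes/element for BF16.
    fp8_gib = bf16_gib / 2
    # If the FP8 buffer is allocated before the BF16 source is freed,
    # both coexist at the moment of quantization.
    return bf16_gib + fp8_gib

weights_gib = 56.0  # e.g. ~56 GiB of BF16 expert weights on one GPU
print(peak_memory_bf16_only(weights_gib))     # 56.0 -> fits in 80 GiB
print(peak_memory_copy_quantize(weights_gib)) # 84.0 -> exceeds 80 GiB
```

Under these assumptions, a model that fits comfortably in BF16 can OOM during quantization unless the FP8 conversion frees (or reuses) the source buffer incrementally.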

Test Plan

# 1x 80GB NVIDIA H-series GPU
vllm serve Qwen/Qwen3-30B-A3B --port 8506 --quantization fp8

Test Result

main

(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]   File "/root/paddlejob/workspace/env_run/output/wangyafeng/myGithub/vllm/vllm/model_executor/models/qwen3_moe.py", line 590, in load_weights
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]     success = weight_loader(
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]   File "/root/paddlejob/workspace/env_run/output/wangyafeng/myGithub/vllm/vllm/model_executor/layers/quantization/fp8.py", line 1158, in patched_weight_loader
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]     self.process_weights_after_loading(layer)
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]   File "/root/paddlejob/workspace/env_run/output/wangyafeng/myGithub/vllm/vllm/model_executor/layers/quantization/fp8.py", line 1220, in process_weights_after_loading
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]     w2 = torch.empty_like(layer.w2_weight, dtype=fp8_dtype)
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 111.88 MiB is free. Process 1955871 has 78.98 GiB memory in use. Of the allocated memory 78.24 GiB is allocated by PyTorch, and 79.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This PR
Successfully started.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the bug Something isn't working label Jan 21, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses the Out-Of-Memory (OOM) issue encountered with online FP8 quantization by reverting the problematic streaming weight loading logic introduced in a previous PR. The removal of the CopyNumelCounter class and the patched_weight_loader functions, along with their associated state management, directly resolves the increased peak memory usage. The changes are consistent with the PR description and aim to restore stable operation for FP8 quantization. This is a critical bugfix that allows models like Qwen3-30B-A3B and ERNIE-4.5-VL-28B-A3B-PT to load and run correctly.

Signed-off-by: wangyafeng <wangyafeng@baidu.com>
@mergify

mergify bot commented Jan 21, 2026

Hi @CSWYF3634076, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: wangyafeng <wangyafeng@baidu.com>
@CSWYF3634076
Contributor Author

CSWYF3634076 commented Jan 21, 2026

@vkuzo Hello, I'm very sorry for reverting the code of #29196.
Regarding the online FP8 streaming quantization part of #29196, it seems a more comprehensive solution could be redesigned and thoroughly tested to ensure that peak GPU memory usage does not increase.

@vkuzo
Contributor

vkuzo commented Jan 21, 2026

@CSWYF3634076 , #31914 is up to fix the memory issue. If that PR can't get landed in time for branch cut, I agree with reverting the overall functionality (this PR).

@CSWYF3634076
Contributor Author

@CSWYF3634076 , #31914 is up to fix the memory issue. If that PR can't get landed in time for branch cut, I agree with reverting the overall functionality (this PR).

@vkuzo I think the current PR can be merged first to ensure normal usage. I tested #31914 and found that it throws an error when using --load_format dummy, while the current PR does not have this issue.

@vkuzo
Contributor

vkuzo commented Jan 22, 2026

@CSWYF3634076, it would be awesome if you could share what "normal usage" means to you, and I can help make sure we have test coverage for it to prevent future regressions. What else other than --load_format dummy do you need?

Here is a stack trace for the issue you just mentioned for posterity: https://gist.github.com/vkuzo/7fe1edf98a83d0eddae1f8bf2ebab30c, will look into making this work shortly.

@CSWYF3634076
Contributor Author

@CSWYF3634076 , would be awesome if you share what "normal usage" means to you and I can help making sure we have test coverage for it to prevent future regressions. What else other that --load_format dummy do you need?

Here is a stack trace for the issue you just mentioned for posterity: https://gist.github.com/vkuzo/7fe1edf98a83d0eddae1f8bf2ebab30c, will look into making this work shortly.

@vkuzo Thanks for the reply. “normal usage” didn’t mean any specific scenario. I just happened to discover this issue while frequently using --load_format dummy.

@vkuzo
Contributor

vkuzo commented Jan 23, 2026

got it, thanks! #31914 has been updated to handle --load_format dummy.
