
[Bugfix][Quant] fix online fp8 quantization oom#32773

Closed
CSWYF3634076 wants to merge 2 commits intovllm-project:mainfrom
CSWYF3634076:fix-fp8-online

Conversation

@CSWYF3634076
Contributor

@CSWYF3634076 CSWYF3634076 commented Jan 21, 2026

Purpose

#29196 was intended to reduce GPU memory usage by using streaming online fp8 quantization, but in practice it did not achieve this goal. Instead, it increased peak memory usage (even exceeding the memory required by BF16), causing the model to fail to load weights properly.

  • Qwen3-30B-A3B and ERNIE-4.5-VL-28B-A3B-PT: Both can start on a single 80GB GPU with BF16, but fail to start with online FP8 due to OOM.
  • ERNIE-4.5-300B-A47B-PT: Previously could start with 8×80GB GPUs + online FP8, but now fails to start.

vLLM 0.13.0 and 0.14.0 are both affected.

Reverting #29196 fixes the OOM issue with online FP8 quantization. With this PR applied, Qwen3-30B-A3B and ERNIE-4.5-VL-28B-A3B-PT start normally again.
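The failure mode can be sketched with a back-of-the-envelope memory model (pure Python, hypothetical sizes, not vLLM code): quantizing into a freshly allocated FP8 buffer (e.g. via `torch.empty_like`) while the BF16 source tensor is still resident makes the transient peak footprint the sum of both buffers, which can exceed the BF16-only baseline on an 80GB GPU.

```python
# Hypothetical memory accounting for quantizing one set of MoE expert
# weights. All sizes in GiB and purely illustrative; the real allocation
# pattern depends on vLLM's weight-loading order.

def peak_memory_bf16_only(bf16_gib: float) -> float:
    # Baseline: just hold the BF16 weights.
    return bf16_gib

def peak_memory_copy_quantize(bf16_gib: float) -> float:
    # FP8 is 1 byte/element vs 2 bytes/element for BF16.
    fp8_gib = bf16_gib / 2
    # If the FP8 buffer is allocated before the BF16 source is freed,
    # both coexist at the moment of quantization.
    return bf16_gib + fp8_gib

weights_gib = 56.0  # e.g. ~56 GiB of BF16 expert weights on one GPU
print(peak_memory_bf16_only(weights_gib))     # 56.0 -> fits in 80 GiB
print(peak_memory_copy_quantize(weights_gib)) # 84.0 -> exceeds 80 GiB
```

Under these assumptions, a model that fits comfortably in BF16 can OOM during quantization unless the FP8 conversion frees (or reuses) the source buffer incrementally.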

Test Plan

# 1x 80GB NVIDIA H-series GPU
vllm serve Qwen/Qwen3-30B-A3B --port 8506 --quantization fp8

Test Result

main

(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]   File "/root/paddlejob/workspace/env_run/output/wangyafeng/myGithub/vllm/vllm/model_executor/models/qwen3_moe.py", line 590, in load_weights
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]     success = weight_loader(
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]   File "/root/paddlejob/workspace/env_run/output/wangyafeng/myGithub/vllm/vllm/model_executor/layers/quantization/fp8.py", line 1158, in patched_weight_loader
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]     self.process_weights_after_loading(layer)
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]   File "/root/paddlejob/workspace/env_run/output/wangyafeng/myGithub/vllm/vllm/model_executor/layers/quantization/fp8.py", line 1220, in process_weights_after_loading
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]     w2 = torch.empty_like(layer.w2_weight, dtype=fp8_dtype)
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=641420) ERROR 01-21 18:57:04 [core.py:935] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 111.88 MiB is free. Process 1955871 has 78.98 GiB memory in use. Of the allocated memory 78.24 GiB is allocated by PyTorch, and 79.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This PR
Successfully started.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the bug Something isn't working label Jan 21, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses the Out-Of-Memory (OOM) issue encountered with online FP8 quantization by reverting the problematic streaming weight loading logic introduced in a previous PR. The removal of the CopyNumelCounter class and the patched_weight_loader functions, along with their associated state management, directly resolves the increased peak memory usage. The changes are consistent with the PR description and aim to restore stable operation for FP8 quantization. This is a critical bugfix that allows models like Qwen3-30B-A3B and ERNIE-4.5-VL-28B-A3B-PT to load and run correctly.

Signed-off-by: wangyafeng <wangyafeng@baidu.com>
@mergify

mergify bot commented Jan 21, 2026

Hi @CSWYF3634076, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: wangyafeng <wangyafeng@baidu.com>
@CSWYF3634076
Contributor Author

CSWYF3634076 commented Jan 21, 2026

@vkuzo Hello, I'm very sorry for reverting the code of #29196.
Regarding the online FP8 streaming quantization part of #29196, it seems a more comprehensive solution could be redesigned and thoroughly tested to ensure that peak GPU memory usage does not increase.

@vkuzo
Contributor

vkuzo commented Jan 21, 2026

@CSWYF3634076 , #31914 is up to fix the memory issue. If that PR can't get landed in time for branch cut, I agree with reverting the overall functionality (this PR).

@CSWYF3634076
Contributor Author

@CSWYF3634076 , #31914 is up to fix the memory issue. If that PR can't get landed in time for branch cut, I agree with reverting the overall functionality (this PR).

@vkuzo I think the current PR can be merged first to ensure normal usage. I tested #31914 and found that it throws an error when using --load_format dummy, while the current PR does not have this issue.

@vkuzo
Contributor

vkuzo commented Jan 22, 2026

@CSWYF3634076, it would be awesome if you could share what "normal usage" means to you, and I can help make sure we have test coverage for it to prevent future regressions. What else other than --load_format dummy do you need?

Here is a stack trace for the issue you just mentioned for posterity: https://gist.github.com/vkuzo/7fe1edf98a83d0eddae1f8bf2ebab30c, will look into making this work shortly.

@CSWYF3634076
Contributor Author

@CSWYF3634076 , would be awesome if you share what "normal usage" means to you and I can help making sure we have test coverage for it to prevent future regressions. What else other that --load_format dummy do you need?

Here is a stack trace for the issue you just mentioned for posterity: https://gist.github.com/vkuzo/7fe1edf98a83d0eddae1f8bf2ebab30c, will look into making this work shortly.

@vkuzo Thanks for the reply. “normal usage” didn’t mean any specific scenario. I just happened to discover this issue while frequently using --load_format dummy.

@vkuzo
Contributor

vkuzo commented Jan 23, 2026

got it, thanks! #31914 has been updated to handle --load_format dummy.
