[Bugfix][Quant] fix online fp8 quantization oom#32773
CSWYF3634076 wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request effectively addresses the Out-Of-Memory (OOM) issue encountered with online FP8 quantization by reverting the problematic streaming weight loading logic introduced in a previous PR. The removal of the CopyNumelCounter class and the patched_weight_loader functions, along with their associated state management, directly resolves the increased peak memory usage. The changes are consistent with the PR description and aim to restore stable operation for FP8 quantization. This is a critical bugfix that allows models like Qwen3-30B-A3B and ERNIE-4.5-VL-28B-A3B-PT to load and run correctly.
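For intuition, here is a toy memory-accounting sketch of how a weight-loading path that keeps BF16 source tensors alive after quantization can push peak memory above the plain BF16 footprint. This is purely hypothetical and illustrative: it is not vLLM's actual loader code, and the "leaked BF16 copies" branch is one plausible mechanism, not a diagnosis of the real bug in the reverted PR.

```python
# Toy model: load N weight tensors, quantizing each from BF16 (2 bytes/param)
# to FP8 (1 byte/param). All numbers are simplified assumptions, not
# measurements of vLLM's allocator.

def peak_bytes(param_counts, free_bf16_after_quant):
    resident = 0  # bytes that stay allocated across tensors
    peak = 0
    for n in param_counts:
        bf16, fp8 = 2 * n, n
        # While quantizing, the BF16 source and its FP8 copy coexist briefly.
        peak = max(peak, resident + bf16 + fp8)
        resident += fp8  # only the FP8 weight should stay resident
        if not free_bf16_after_quant:
            resident += bf16  # buggy path: BF16 sources are never freed
    return peak

sizes = [1_000] * 4
bf16_total = 2 * sum(sizes)  # footprint if all weights simply stayed BF16
print(peak_bytes(sizes, True))   # healthy path: peak stays below bf16_total
print(peak_bytes(sizes, False))  # leaky path: peak exceeds bf16_total
```

Under these assumptions the healthy path peaks at 6,000 bytes versus an 8,000-byte BF16 footprint, while the leaky path peaks at 12,000 bytes, which matches the reported symptom of online FP8 exceeding BF16 memory.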
Signed-off-by: wangyafeng <wangyafeng@baidu.com>
Hi @CSWYF3634076, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Force-pushed from 737c21c to 7938c32
Signed-off-by: wangyafeng <wangyafeng@baidu.com>
@CSWYF3634076, #31914 is up to fix the memory issue. If that PR can't land in time for the branch cut, I agree with reverting the overall functionality (this PR).
@vkuzo I think the current PR can be merged first to ensure normal usage. I tested #31914 and found that it throws an error when using
@CSWYF3634076, it would be awesome if you could share what "normal usage" means to you, so I can help make sure we have test coverage for it and prevent future regressions. What else other than

Here is a stack trace for the issue you just mentioned, for posterity: https://gist.github.com/vkuzo/7fe1edf98a83d0eddae1f8bf2ebab30c. I will look into making this work shortly.
@vkuzo Thanks for the reply. "Normal usage" didn't mean any specific scenario; I just happened to discover this issue while frequently using
Got it, thanks! #31914 has been updated to handle
Purpose
#29196 was intended to reduce GPU memory usage by using streaming online fp8 quantization, but in practice it did not achieve this goal. Instead, it increased peak memory usage (even exceeding the memory required by BF16), causing the model to fail to load weights properly.
- Qwen3-30B-A3B and ERNIE-4.5-VL-28B-A3B-PT: both can start on a single 80GB GPU with BF16, but fail to start with online FP8 due to OOM.
- ERNIE-4.5-300B-A47B-PT: previously could start with 8×80GB GPUs + online FP8, but now fails to start.
- vLLM 0.13.0 and 0.14.0 are affected.
By reverting #29196, the OOM issue with online FP8 quantization is fixed. With the current PR, Qwen3-30B-A3B and ERNIE-4.5-VL-28B-A3B-PT can now start normally.

Test Plan
# 80G*1 nvidia H GPU
vllm serve Qwen/Qwen3-30B-A3B --port 8506 --quantization fp8

Test Result
main: fails to start (OOM)
This PR: successfully started
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.