[Bugfix][FP8] Fix dynamic FP8 Marlin quantization #7219

mgoin · 2024-08-06T20:05:10Z

We simply weren't expanding the per-tensor scales to channelwise for the fp16->fp8 dynamic quant case, which is needed for FP8 Marlin.
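For readers unfamiliar with the distinction, here is a minimal pure-Python sketch of what expanding a per-tensor scale to channelwise means. It is an illustration, not the code changed in this PR: the helper `expand_to_channelwise`, the shard widths, and the scale value are all hypothetical, and the real code operates on tensors rather than lists.

```python
from typing import List


def expand_to_channelwise(scales: List[float], logical_widths: List[int]) -> List[float]:
    """Repeat each per-shard scale once per output channel of that shard.

    Hypothetical helper for illustration only.
    """
    if len(scales) != len(logical_widths):
        raise ValueError("need exactly one scale per logical shard")
    channel_scales: List[float] = []
    for scale, width in zip(scales, logical_widths):
        # Every output channel in this shard shares the shard's scale.
        channel_scales.extend([scale] * width)
    return channel_scales


# Dynamic quantization produces a single per-tensor scale. A kernel that
# expects one scale per output channel needs that value repeated, first
# once per logical shard (e.g. fused Q, K, V), then once per channel.
per_tensor_scale = 0.02            # made-up value
logical_widths = [4, 2, 2]         # hypothetical shard widths
per_shard = [per_tensor_scale] * len(logical_widths)
channel_scales = expand_to_channelwise(per_shard, logical_widths)
```

The result has one entry per output channel (eight here), all equal to the original per-tensor scale, so the quantized values are interpreted the same way and only the shape of the scale changes.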

Also adds the VLLM_TEST_FORCE_FP8_MARLIN environment variable so we can force testing of FP8 Marlin on GPUs with hardware support for FP8.
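A rough sketch of how such an override could gate kernel selection is below. Only the variable name comes from this PR. The function `should_use_fp8_marlin`, the capability threshold of 89 (compute capability 8.9), and the accepted values "1" and "true" are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# Assumption: native FP8 support starts at compute capability 8.9,
# encoded here as major * 10 + minor.
FP8_HARDWARE_MIN_CAPABILITY = 89


def should_use_fp8_marlin(capability: int,
                          environ: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: pick the Marlin path when the GPU lacks
    native FP8 support, or when the test override is set."""
    env = os.environ if environ is None else environ
    raw = env.get("VLLM_TEST_FORCE_FP8_MARLIN", "0")
    # Assumption: "1" or "true" (case-insensitive) enables the override.
    forced = raw.strip().lower() in ("1", "true")
    return forced or capability < FP8_HARDWARE_MIN_CAPABILITY
```

With this shape, a test on a GPU that does support FP8 can set the variable (for example `VLLM_TEST_FORCE_FP8_MARLIN=1`) and still exercise the Marlin code path that would otherwise only run on older hardware.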

github-actions · 2024-08-06T20:05:22Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

Comment /ready on the PR
Add ready label to the PR
Enable auto-merge.

🚀

mgoin · 2024-08-06T23:52:58Z

Hi @youkaichao, I'm not sure about the state of the build right now, but the failing tests seem to be unrelated. Let me know if you think I should investigate further.

youkaichao · 2024-08-07T07:07:03Z

Yes, the test failures are unrelated. I will hand it over to @robertgshaw2-neuralmagic for review on the quantization side.


Fix dynamic FP8 Marlin quantization

c48f545

mgoin requested a review from robertgshaw2-redhat August 6, 2024 20:05

Add override env var for testing

e0d2f00

mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 6, 2024

mgoin mentioned this pull request Aug 6, 2024

[Bug]: GPTQ Marlin with cpu-offload-gb fails on 0.5.4 #7204

Closed

.

35af046

mgoin requested a review from youkaichao August 6, 2024 23:51

youkaichao reviewed Aug 7, 2024

tests/quantization/test_fp8.py (review comment outdated, resolved)

mgoin added 2 commits August 7, 2024 14:10

Update to VLLM_TEST_FORCE_FP8_MARLIN

b607902

Format

5492a68

robertgshaw2-redhat approved these changes Aug 7, 2024


youkaichao merged commit 5223199 into main Aug 7, 2024
50 of 52 checks passed

youkaichao deleted the fix-dynamic-fp8-marlin branch August 7, 2024 18:23

DrNochi mentioned this pull request Aug 9, 2024

[BUG] Running FP8 quantized model fails on NVIDIA L4 (repack_fp8_for_marlin) huggingface/text-generation-inference#2388

Open

4 tasks

alxiang mentioned this pull request Aug 28, 2024

[Bug] Dynamic FP8 quantization fails due to incorrect tensor shape sgl-project/sglang#1178

Closed

5 tasks

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (vllm-project#7219)

d86dcf3

Signed-off-by: Alvant <[email protected]>

LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025

[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (vllm-project#7219)

b372aeb

Signed-off-by: LeiWang1999 <[email protected]>
