[Quantization] Quark MXFP4 format loading #16943
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py
@mgoin thanks for taking a look! This PR is now ready for review. More PRs will follow.
vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py
Can we remove the env var and always do weight decompression at runtime? That is the expected behavior of other quantization methods, so it feels strange not to decompress here.
We found ahead-of-time (AOT) weight dequantization to be more efficient for emulation evaluations. That said, this option can be removed once more efficient dequant kernels are supported. I would prefer to keep it for now, but let me know if you feel strongly about it.
Okay, we can keep it for now, but let's aim to remove it over time. We want to keep the env var list from ever growing unless there is a good reason.
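For illustration, the tradeoff being discussed can be sketched as below: dequantize the packed weights once at load time (AOT) versus on every forward call. This is a toy sketch, not the code in this PR; the class name, the env var name `QUARK_EMULATE_DEQUANT_AT_LOAD`, and the `dequant_fn` hook are all hypothetical stand-ins.

```python
import os

# Hypothetical knob mirroring the env var discussed above; the real
# variable name in the PR may differ.
DEQUANT_AT_LOAD = os.environ.get("QUARK_EMULATE_DEQUANT_AT_LOAD", "1") == "1"

class EmulatedMXFP4Linear:
    """Toy emulated-quantized linear layer illustrating the tradeoff:
    dequantize once ahead of time (AOT) vs. on every forward call."""

    def __init__(self, packed_weight, scales, dequant_fn,
                 dequant_at_load=DEQUANT_AT_LOAD):
        self._dequant = dequant_fn
        if dequant_at_load:
            # AOT path: pay the dequant cost once at load time and keep a
            # high-precision copy (more memory, faster steady-state forward).
            self.weight = dequant_fn(packed_weight, scales)
            self.packed = None
        else:
            # Runtime path: keep the compressed tensors and dequantize on
            # every forward (less memory, slower without fast kernels).
            self.weight = None
            self.packed = (packed_weight, scales)

    def forward(self, x):
        w = self.weight if self.weight is not None else self._dequant(*self.packed)
        return x @ w
```

Both paths compute the same result; the AOT path simply trades memory for not re-running the dequant in the hot loop, which is why it helps emulation-only evaluations.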
vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py
Seems reasonable, thanks. Is there a small model that you could add for testing?
This pull request has merge conflicts that must be resolved before it can be merged.
Test added. Skipped for now until the model is publicly released.
@mgoin I have addressed most of the comments. Please take a look again, thanks!
Commit history (squashed):
- wip
- wip & debug
- update
- cleanup
- use quark realquantizer for pack/quant/dequant
- comment on cudagraph issue; remove prints
- Keep only 1 place importing quark
- cudagraph issue resolved; dq weight at load time for efficiency
- lint
- turn on emulation based on platform
- add fused moe support (ugly wip running version)
- Add env var for whether to dequant weight at load time
- Mxfp4 memory leak fixes (#2)
- Add test
- revert rope local fix
- remove print
- rename scale calculation mode

Signed-off-by: Bowen Bao <bowenbao@amd.com>
Signed-off-by: Felix Marty <felmarty@amd.com>
@mgoin friendly ping. We have a few more PRs lined up after this one; we would greatly appreciate it if you could take another look!
Thank you for the ping! Will look today.

mgoin left a comment:
This looks good to me for now as a skeleton. I think it would be good to eventually get at least a basic emulation implementation in as a reference for when kernel tests are added, like https://github.com/vllm-project/vllm/blob/621ca2c0aba8268d72d380fa3e479ddafa529479/tests/kernels/quantization/test_nvfp4_quant.py
This can be done in follow-up work, though.
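As a rough sketch of what such a reference could look like (standalone stand-ins written for this discussion, not vLLM's or Quark's actual helpers), an E2M1 quant/dequant pair can be checked for exact round-trips on exactly-representable values, in the spirit of `test_nvfp4_quant.py`:

```python
import numpy as np

# Signed E2M1 (FP4) magnitudes; the 4th bit of a code is the sign bit.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quant_e2m1_ref(x: np.ndarray) -> np.ndarray:
    """Round each value to the nearest signed E2M1 code (sign bit | 3-bit index)."""
    mag_idx = np.argmin(np.abs(np.abs(x)[:, None] - E2M1[None, :]), axis=1)
    sign_bit = (x < 0).astype(np.uint8) << 3
    return sign_bit | mag_idx.astype(np.uint8)

def dequant_e2m1_ref(codes: np.ndarray) -> np.ndarray:
    """Decode 4-bit codes back to float32 via table lookup."""
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    return sign * E2M1[codes & 0x7]

# Every exactly-representable value must survive a quant/dequant round trip.
vals = np.array([0.0, 0.5, -1.5, 4.0, -6.0], dtype=np.float32)
assert np.array_equal(dequant_e2m1_ref(quant_e2m1_ref(vals)), vals)
```

A real kernel test would additionally compare such a reference against the CUDA/HIP kernel's output on random tensors.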
@mgoin makes sense. We will submit follow-ups regarding both suggestions.
Initial PR to integrate loading of MXFP4 models quantized by Quark.
This PR supports running MXFP4 emulation on devices where the micro-scaling datatype is not natively supported.
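As a rough illustration of what emulation means here (a from-scratch sketch following the OCP MX layout, not the code in this PR): MXFP4 stores 4-bit E2M1 elements plus one shared E8M0 power-of-two scale per 32-element block, so emulated dequantization reduces to a table lookup times a per-block scale.

```python
import numpy as np

# E2M1 (FP4) magnitudes: low 3 bits of each code index this table,
# the 4th bit is the sign. A from-scratch emulation sketch, not a kernel.
FP4_E2M1_LUT = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
                        dtype=np.float32)

BLOCK = 32  # MX block size: one shared E8M0 scale per 32 elements

def dequant_mxfp4(codes: np.ndarray, scale_e8m0: np.ndarray) -> np.ndarray:
    """Dequantize 4-bit codes (one uint8 per element, low nibble used)
    with one E8M0 scale exponent per 32-element block."""
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    mag = FP4_E2M1_LUT[codes & 0x7]
    # E8M0 stores a biased exponent: scale = 2 ** (e - 127)
    scales = np.exp2(scale_e8m0.astype(np.float32) - 127.0)
    return (sign * mag).reshape(-1, BLOCK) * scales[:, None]
```

On hardware with native micro-scaling support this lookup-and-scale is done by the matmul unit itself; emulation materializes it in a wider dtype first.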
Next Steps