[Quantization] Quark MXFP4 format loading #16943
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py
@mgoin thanks for taking a look! This PR is now ready for review. More PRs will follow.
vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py
Can we remove the env var and always do weight decompression at runtime? That is the expected behavior of other quantization methods, so it feels strange not to decompress here.
We found ahead-of-time (AOT) weight dequantization to be more efficient for emulation evaluations. That said, this option can be removed once more efficient dequant kernels are supported. I would prefer to keep it for now, but let me know if you feel strongly about it.
Okay, we can keep it for now, but let's aim to remove it over time. We want to keep the env var list from ever growing unless there is a good reason.
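For illustration, the tradeoff being discussed can be sketched as below: dequantize the packed weights once at load time (AOT) versus on every forward call. This is a toy sketch, not the code in this PR; the class name, the env var name `QUARK_EMULATE_DEQUANT_AT_LOAD`, and the `dequant_fn` hook are all hypothetical stand-ins.

```python
import os

# Hypothetical knob mirroring the env var discussed above; the real
# variable name in the PR may differ.
DEQUANT_AT_LOAD = os.environ.get("QUARK_EMULATE_DEQUANT_AT_LOAD", "1") == "1"

class EmulatedMXFP4Linear:
    """Toy emulated-quantized linear layer illustrating the tradeoff:
    dequantize once ahead of time (AOT) vs. on every forward call."""

    def __init__(self, packed_weight, scales, dequant_fn,
                 dequant_at_load=DEQUANT_AT_LOAD):
        self._dequant = dequant_fn
        if dequant_at_load:
            # AOT path: pay the dequant cost once at load time and keep a
            # high-precision copy (more memory, faster steady-state forward).
            self.weight = dequant_fn(packed_weight, scales)
            self.packed = None
        else:
            # Runtime path: keep the compressed tensors and dequantize on
            # every forward (less memory, slower without fast kernels).
            self.weight = None
            self.packed = (packed_weight, scales)

    def forward(self, x):
        w = self.weight if self.weight is not None else self._dequant(*self.packed)
        return x @ w
```

Both paths compute the same result; the AOT path simply trades memory for not re-running the dequant in the hot loop, which is why it helps emulation-only evaluations.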
vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py
Seems reasonable, thanks. Is there a small model that you could add for testing?
This pull request has merge conflicts that must be resolved before it can be merged.
Test added. Skipped for now until the model is publicly released.
@mgoin I have addressed most of the comments. Please take a look again, thanks!
Commit history (squashed):
- wip
- wip & debug
- update
- cleanup
- use quark realquantizer for pack/quant/dequant
- comment on cudagraph issue; remove prints
- Keep only 1 place importing quark
- cudagraph issue resolved; dq weight at load time for efficiency
- lint
- turn on emulation based on platform
- add fused moe support (ugly wip running version)
- Add env var for whether to dequant weight at load time
- Mxfp4 memory leak fixes (#2)
- Add test
- revert rope local fix
- remove print
- rename scale calculation mode

Signed-off-by: Bowen Bao <bowenbao@amd.com>
Signed-off-by: Felix Marty <felmarty@amd.com>
@mgoin friendly ping. We have a few more PRs lined up after this one; we would greatly appreciate it if you could take another look!
Thank you for the ping! Will look today.

mgoin left a comment:
This looks good to me for now as a skeleton. I think it would be good to eventually get at least a basic emulation implementation in as a reference for when kernel tests are added, like https://github.com/vllm-project/vllm/blob/621ca2c0aba8268d72d380fa3e479ddafa529479/tests/kernels/quantization/test_nvfp4_quant.py
This can be done in follow-up work, though.
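As a rough sketch of what such a reference could look like (standalone stand-ins written for this discussion, not vLLM's or Quark's actual helpers), an E2M1 quant/dequant pair can be checked for exact round-trips on exactly-representable values, in the spirit of `test_nvfp4_quant.py`:

```python
import numpy as np

# Signed E2M1 (FP4) magnitudes; the 4th bit of a code is the sign bit.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quant_e2m1_ref(x: np.ndarray) -> np.ndarray:
    """Round each value to the nearest signed E2M1 code (sign bit | 3-bit index)."""
    mag_idx = np.argmin(np.abs(np.abs(x)[:, None] - E2M1[None, :]), axis=1)
    sign_bit = (x < 0).astype(np.uint8) << 3
    return sign_bit | mag_idx.astype(np.uint8)

def dequant_e2m1_ref(codes: np.ndarray) -> np.ndarray:
    """Decode 4-bit codes back to float32 via table lookup."""
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    return sign * E2M1[codes & 0x7]

# Every exactly-representable value must survive a quant/dequant round trip.
vals = np.array([0.0, 0.5, -1.5, 4.0, -6.0], dtype=np.float32)
assert np.array_equal(dequant_e2m1_ref(quant_e2m1_ref(vals)), vals)
```

A real kernel test would additionally compare such a reference against the CUDA/HIP kernel's output on random tensors.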
@mgoin makes sense. We will submit follow-ups regarding both suggestions.
Initial PR to integrate loading of MXFP4 models quantized by Quark.
This PR supports running MXFP4 emulation on devices where the micro-scaling datatype is not natively supported.
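As a rough illustration of what emulation means here (a from-scratch sketch following the OCP MX layout, not the code in this PR): MXFP4 stores 4-bit E2M1 elements plus one shared E8M0 power-of-two scale per 32-element block, so emulated dequantization reduces to a table lookup times a per-block scale.

```python
import numpy as np

# E2M1 (FP4) magnitudes: low 3 bits of each code index this table,
# the 4th bit is the sign. A from-scratch emulation sketch, not a kernel.
FP4_E2M1_LUT = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
                        dtype=np.float32)

BLOCK = 32  # MX block size: one shared E8M0 scale per 32 elements

def dequant_mxfp4(codes: np.ndarray, scale_e8m0: np.ndarray) -> np.ndarray:
    """Dequantize 4-bit codes (one uint8 per element, low nibble used)
    with one E8M0 scale exponent per 32-element block."""
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    mag = FP4_E2M1_LUT[codes & 0x7]
    # E8M0 stores a biased exponent: scale = 2 ** (e - 127)
    scales = np.exp2(scale_e8m0.astype(np.float32) - 127.0)
    return (sign * mag).reshape(-1, BLOCK) * scales[:, None]
```

On hardware with native micro-scaling support this lookup-and-scale is done by the matmul unit itself; emulation materializes it in a wider dtype first.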
Next Steps