[mxfp4] Remove unnecessary process_weights_after_loading handling in case simulation is used#26111
Conversation
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
0b64348 to
d6c0f13
Compare
There was a problem hiding this comment.
Code Review
This pull request correctly removes unnecessary and buggy weight processing logic from process_weights_after_loading for the mxfp4 emulation (simulation) mode. The removed code was dequantizing the weights prematurely during the model loading phase. This conflicted with the apply_weights method, which is designed to perform dequantization at runtime during the forward pass. By removing this block, the bug is fixed, and weights are correctly handled, remaining in their quantized format until they are needed for computation. The change is clean, well-justified, and improves the correctness of the implementation. No issues were found in this change.
BowenBao
left a comment
There was a problem hiding this comment.
LGTM. Should not dequant weight during loading. It is handled at inference time by dq_w = dequant_mxfp4(layer.weight, layer.weight_scale, x.dtype).
|
cc @maleksan85 |
|
did you check that fp4 model runs on MI300 where dequantization is needed? |
|
@maleksan85 Yes, there are tests in e.g. vllm/tests/quantization/test_quark.py Line 183 in 824a3f4 For context: #17888 (comment) |
|
if you don't mind, please correct implementation to offline dequant seems better for perf then runtime. cc @BowenBao if this option is being supported. Thanks. |
|
This pull request has merge conflicts that must be resolved before it can be |
|
Hi @maleksan85 ideally we'd prefer the current option as it keeps low memory consumption. This help with experimenting large models on mi300x. |
probably the worth to keep a flag what is preferable... I mean either speed or memory with default to memory. |
|
This fix was merged as part of #21166, see https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/quark/schemes/quark_ocp_mx.py The flag indeed existed, but was removed as per Michael request (#17888 (comment)). I agree it may be useful to have it, but at the same time the inflation of rocm env variables is something we'd wish to limit #21138 |
As per title.
#25135 added some code in the simulation (QDQ) case for
process_weights_after_loading, which is not necessary, and which was not really the purpose of the PR.See my comments #25135 (comment) and #25135 (comment)
cc @maleksan85