Conversation
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Code Review
This pull request is part of a larger refactoring effort for Mixture-of-Experts (MoE) layers, specifically focusing on integrating the Marlin FP8 kernel into the modular kernel framework. The changes introduce a new quantization configuration function fp8_w8a16_moe_quant_config and wire it up for the Marlin backend in Fp8MoEMethod. While the overall direction of the refactoring is sound, I've found a critical issue in the implementation of the new configuration function that needs to be addressed.
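As a rough illustration of what such a quantization-config function produces, here is a minimal sketch. The `MoEQuantConfig` shape, field names, and the `fp8_w8a16_moe_quant_config` signature shown here are assumptions for illustration only; the real vLLM implementation may differ.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class MoEQuantConfig:
    """Hypothetical minimal config for an MoE quantization scheme:
    here, FP8 weights ("w8") with 16-bit activations ("a16")."""
    quant_dtype: str
    w1_scale: Any  # scale for the fused w1/w3 (gate/up) projection weights
    w2_scale: Any  # scale for the w2 (down) projection weights


def fp8_w8a16_moe_quant_config(w1_scale: Any, w2_scale: Any) -> MoEQuantConfig:
    """Build a config describing FP8-quantized weights with
    unquantized 16-bit activations (sketch, not the vLLM API)."""
    return MoEQuantConfig(quant_dtype="fp8", w1_scale=w1_scale, w2_scale=w2_scale)
```

A backend such as Marlin would then consume this config object rather than reading per-layer scale attributes directly.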
opening to run the ci
Hi @robertgshaw2-redhat, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
unblocked various MoE tests
1) Quick fix for upstream changes: [PR30684](vllm-project/vllm#30684)
2) Fix for upstream changes: vllm-project/vllm#28891 (Port: [PR751](#751))
3) Fix for vllm-project/vllm#31036 issue: failed test case run_qwen3_compressed_tensor_dynamic_scaling_test

```
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1487, in ensure_moe_quant_config_init
(EngineCore_DP0 pid=5792)     self.quant_method.get_fused_moe_quant_config(self)
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/quantization/fp8.py", line 1225, in get_fused_moe_quant_config
(EngineCore_DP0 pid=5792)     w1_scale=layer.w13_weight_scale,
(EngineCore_DP0 pid=5792)     ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5792)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
(EngineCore_DP0 pid=5792)     raise AttributeError(
(EngineCore_DP0 pid=5792) AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv'
```

This issue was already present, but it was not detected because Marlin was disabled. After the MoE refactor in vllm-project/vllm#31036, the parameter `self.use_marlin` was replaced by `self.fp8_backend`; `self.fp8_backend` is now disabled.

---------

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
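The failure mode above can be reproduced and patched in miniature: the config builder asks the layer for `w13_weight_scale`, but some checkpoints register the scale as `w13_weight_scale_inv`, so the attribute lookup raises `AttributeError`. A hedged sketch of one possible fallback (the `FusedMoELayer` stand-in class and `get_weight_scale` helper are hypothetical, not vLLM code):

```python
class FusedMoELayer:
    """Toy stand-in for a FusedMoE layer whose checkpoint registered
    only the `_inv`-suffixed scale (hypothetical, for illustration)."""
    def __init__(self):
        self.w13_weight_scale_inv = [1.0, 1.0, 1.0]


def get_weight_scale(layer, name):
    """Fetch `name` from the layer, falling back to the `_inv` variant
    when the plain attribute name is absent."""
    try:
        return getattr(layer, name)
    except AttributeError:
        return getattr(layer, f"{name}_inv")


layer = FusedMoELayer()
# Plain lookup `layer.w13_weight_scale` would raise AttributeError;
# the helper falls back to `w13_weight_scale_inv` instead.
scale = get_weight_scale(layer, "w13_weight_scale")
```

Which attribute name a given checkpoint uses depends on the quantization scheme, so the actual fix in the PR may instead gate on the backend selection rather than fall back at lookup time.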
SUMMARY:
TEST PLAN:
TEST RESULT: