
[MoE Refactor][4/N] Marlin Fp8 Mk #31036

Merged
mgoin merged 14 commits into main from marlin-fp8-mk
Dec 21, 2025

Conversation

@robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat commented Dec 19, 2025

SUMMARY:

  • Convert Marlin FP8 to the modular-kernel (mk) path in fp8.py

TEST PLAN:

MODEL := "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
PORT := "8000"
GPUS := "1"

launch_fi:
	VLLM_USE_DEEP_GEMM=0 VLLM_USE_FLASHINFER_MOE_FP8=1 chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} --enforce-eager -tp {{GPUS}} --max-model-len 8192 --trust-remote-code --port {{PORT}}

launch_dg:
	VLLM_USE_DEEP_GEMM=1 VLLM_MOE_USE_DEEP_GEMM=1 chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} --enforce-eager -tp {{GPUS}} --max-model-len 8192 --trust-remote-code --port {{PORT}}

launch_triton:
	VLLM_USE_DEEP_GEMM=0 VLLM_MOE_USE_DEEP_GEMM=0 chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} --enforce-eager -tp {{GPUS}} --max-model-len 8192 --trust-remote-code --port {{PORT}}

launch_marlin:
	VLLM_TEST_FORCE_FP8_MARLIN=1 VLLM_USE_DEEP_GEMM=0 chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} --enforce-eager -tp {{GPUS}} --max-model-len 8192 --trust-remote-code --port {{PORT}}
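The four launch recipes differ only in their environment flags. As a rough illustration of the kind of flag-driven backend dispatch they exercise, here is a minimal sketch (the flag names are taken from the recipes above, but the selection order and function name are assumptions, not vLLM's actual dispatch logic):

```python
import os

def select_fp8_moe_backend(env=os.environ):
    """Hypothetical sketch: pick an FP8 MoE backend from env flags."""
    def flag(name):
        return env.get(name, "0") == "1"

    if flag("VLLM_TEST_FORCE_FP8_MARLIN"):
        return "marlin"
    if flag("VLLM_USE_FLASHINFER_MOE_FP8"):
        return "flashinfer"
    if flag("VLLM_USE_DEEP_GEMM") and flag("VLLM_MOE_USE_DEEP_GEMM"):
        return "deep_gemm"
    return "triton"
```

Each recipe corresponds to one branch: `launch_marlin` forces the first, `launch_fi` the second, `launch_dg` the third, and `launch_triton` falls through to the default.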

eval LIMIT:
	lm_eval \
		--model local-completions \
		--tasks gsm8k \
		--model_args "model={{MODEL}},base_url=http://localhost:{{PORT}}/v1/completions,num_concurrent=1000,tokenized_requests=False" \
		--limit {{LIMIT}}

TEST RESULT:

local-completions (model=Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8,base_url=http://localhost:8003/v1/completions,num_concurrent=1000,tokenized_requests=False), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   | 0.91|±  |0.0288|
|     |       |strict-match    |     5|exact_match|   | 0.90|±  |0.0302|

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is part of a larger refactoring effort for Mixture-of-Experts (MoE) layers, specifically focusing on integrating the Marlin FP8 kernel into the modular kernel framework. The changes introduce a new quantization configuration function fp8_w8a16_moe_quant_config and wire it up for the Marlin backend in Fp8MoEMethod. While the overall direction of the refactoring is sound, I've found a critical issue in the implementation of the new configuration function that needs to be addressed.
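To make the review comment concrete: a quant-config factory for fp8 weights with 16-bit activations would essentially bundle the per-expert weight scales, with no activation scales needed. The sketch below is hypothetical — the dataclass fields and factory signature are modeled on the PR description, not copied from vLLM's actual `FusedMoEQuantConfig`:

```python
from dataclasses import dataclass

@dataclass
class FusedMoEQuantConfig:
    """Hypothetical stand-in for vLLM's MoE quant config."""
    quant_dtype: str   # weight dtype, e.g. "fp8"
    w1_scale: object   # per-expert scale for the w13 (gate/up) projection
    w2_scale: object   # per-expert scale for the w2 (down) projection

def fp8_w8a16_moe_quant_config(w1_scale, w2_scale):
    # fp8 weights, 16-bit activations: only weight scales are carried.
    return FusedMoEQuantConfig(quant_dtype="fp8",
                               w1_scale=w1_scale,
                               w2_scale=w2_scale)
```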

Robert Shaw added 4 commits December 19, 2025 12:15
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@robertgshaw2-redhat
Collaborator Author

Opening to run the CI.

Robert Shaw added 4 commits December 19, 2025 19:32
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@mergify

mergify bot commented Dec 20, 2025

Hi @robertgshaw2-redhat, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
@robertgshaw2-redhat
Collaborator Author

Unblocked various MoE tests.

@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 20, 2025
Member

@mgoin left a comment


Nice

@mgoin mgoin merged commit b471092 into main Dec 21, 2025
60 checks passed
@mgoin mgoin deleted the marlin-fp8-mk branch December 21, 2025 17:37
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
adobrzyn pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Dec 23, 2025
1) Quick fix for upstream changes:
[PR30684](vllm-project/vllm#30684)
2) Fix for upstream changes:
vllm-project/vllm#28891 (Port:
[PR751](#751))
3) Fix for vllm-project/vllm#31036
issue: failed test case run_qwen3_compressed_tensor_dynamic_scaling_test
```
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1487, in ensure_moe_quant_config_init
(EngineCore_DP0 pid=5792)     self.quant_method.get_fused_moe_quant_config(self)
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/quantization/fp8.py", line 1225, in get_fused_moe_quant_config
(EngineCore_DP0 pid=5792)     w1_scale=layer.w13_weight_scale,
(EngineCore_DP0 pid=5792)              ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5792)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
(EngineCore_DP0 pid=5792)     raise AttributeError(
(EngineCore_DP0 pid=5792) AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv'
```

This issue was already present, but it went undetected because Marlin was disabled. After the MoE refactor in vllm-project/vllm#31036, the parameter self.use_marlin was replaced by self.fp8_backend; self.fp8_backend is disabled for now.

---------

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
iboiko-habana added a commit to iboiko-habana/vllm-gaudi that referenced this pull request Dec 23, 2025
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
rajanintel24 pushed a commit to rajanintel24/vllm-gaudi that referenced this pull request Feb 11, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
