
[MoE Refactor][4/N] Marlin Fp8 Mk #31036

Merged
mgoin merged 14 commits into main from marlin-fp8-mk
Dec 21, 2025

Conversation

@robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat commented Dec 19, 2025

SUMMARY:

  • Convert Marlin FP8 to the modular-kernel (mk) path in fp8.py

TEST PLAN:

MODEL := "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
PORT := "8000"
GPUS := "1"

launch_fi:
	VLLM_USE_DEEP_GEMM=0 VLLM_USE_FLASHINFER_MOE_FP8=1 chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} --enforce-eager -tp {{GPUS}} --max-model-len 8192 --trust-remote-code --port {{PORT}}

launch_dg:
	VLLM_USE_DEEP_GEMM=1 VLLM_MOE_USE_DEEP_GEMM=1 chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} --enforce-eager -tp {{GPUS}} --max-model-len 8192 --trust-remote-code --port {{PORT}}

launch_triton:
	VLLM_USE_DEEP_GEMM=0 VLLM_MOE_USE_DEEP_GEMM=0 chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} --enforce-eager -tp {{GPUS}} --max-model-len 8192 --trust-remote-code --port {{PORT}}

launch_marlin:
	VLLM_TEST_FORCE_FP8_MARLIN=1 VLLM_USE_DEEP_GEMM=0 chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} --enforce-eager -tp {{GPUS}} --max-model-len 8192 --trust-remote-code --port {{PORT}}
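The four launch recipes differ only in their environment flags. As a rough illustration of the kind of flag-driven backend dispatch they exercise, here is a minimal sketch (the flag names are taken from the recipes above, but the selection order and function name are assumptions, not vLLM's actual dispatch logic):

```python
import os

def select_fp8_moe_backend(env=os.environ):
    """Hypothetical sketch: pick an FP8 MoE backend from env flags."""
    def flag(name):
        return env.get(name, "0") == "1"

    if flag("VLLM_TEST_FORCE_FP8_MARLIN"):
        return "marlin"
    if flag("VLLM_USE_FLASHINFER_MOE_FP8"):
        return "flashinfer"
    if flag("VLLM_USE_DEEP_GEMM") and flag("VLLM_MOE_USE_DEEP_GEMM"):
        return "deep_gemm"
    return "triton"
```

Each recipe corresponds to one branch: `launch_marlin` forces the first, `launch_fi` the second, `launch_dg` the third, and `launch_triton` falls through to the default.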

eval LIMIT:
	lm_eval \
		--model local-completions \
		--tasks gsm8k \
		--model_args "model={{MODEL}},base_url=http://localhost:{{PORT}}/v1/completions,num_concurrent=1000,tokenized_requests=False" \
		--limit {{LIMIT}}

TEST RESULT:

local-completions (model=Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8,base_url=http://localhost:8003/v1/completions,num_concurrent=1000,tokenized_requests=False), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   | 0.91|±  |0.0288|
|     |       |strict-match    |     5|exact_match|   | 0.90|±  |0.0302|

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is part of a larger refactoring effort for Mixture-of-Experts (MoE) layers, specifically focusing on integrating the Marlin FP8 kernel into the modular kernel framework. The changes introduce a new quantization configuration function fp8_w8a16_moe_quant_config and wire it up for the Marlin backend in Fp8MoEMethod. While the overall direction of the refactoring is sound, I've found a critical issue in the implementation of the new configuration function that needs to be addressed.
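To make the review comment concrete: a quant-config factory for fp8 weights with 16-bit activations would essentially bundle the per-expert weight scales, with no activation scales needed. The sketch below is hypothetical — the dataclass fields and factory signature are modeled on the PR description, not copied from vLLM's actual `FusedMoEQuantConfig`:

```python
from dataclasses import dataclass

@dataclass
class FusedMoEQuantConfig:
    """Hypothetical stand-in for vLLM's MoE quant config."""
    quant_dtype: str   # weight dtype, e.g. "fp8"
    w1_scale: object   # per-expert scale for the w13 (gate/up) projection
    w2_scale: object   # per-expert scale for the w2 (down) projection

def fp8_w8a16_moe_quant_config(w1_scale, w2_scale):
    # fp8 weights, 16-bit activations: only weight scales are carried.
    return FusedMoEQuantConfig(quant_dtype="fp8",
                               w1_scale=w1_scale,
                               w2_scale=w2_scale)
```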

Robert Shaw added 4 commits December 19, 2025 12:15
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@robertgshaw2-redhat
Collaborator Author

Opening to run the CI.

Robert Shaw added 4 commits December 19, 2025 19:32
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@mergify

mergify bot commented Dec 20, 2025

Hi @robertgshaw2-redhat, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
@robertgshaw2-redhat
Collaborator Author

Unblocked various MoE tests.

@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 20, 2025
Member

@mgoin left a comment


Nice

@mgoin mgoin merged commit b471092 into main Dec 21, 2025
60 checks passed
@mgoin mgoin deleted the marlin-fp8-mk branch December 21, 2025 17:37
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
adobrzyn pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Dec 23, 2025
1) Quick fix for upstream changes:
[PR30684](vllm-project/vllm#30684)
2) Fix for upstream changes:
vllm-project/vllm#28891 (Port:
[PR751](#751))
3) Fix for vllm-project/vllm#31036
issue: failed test case run_qwen3_compressed_tensor_dynamic_scaling_test
```
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1487, in ensure_moe_quant_config_init
(EngineCore_DP0 pid=5792)     self.quant_method.get_fused_moe_quant_config(self)
(EngineCore_DP0 pid=5792)   File "/root/logs/vllm/vllm/model_executor/layers/quantization/fp8.py", line 1225, in get_fused_moe_quant_config
(EngineCore_DP0 pid=5792)     w1_scale=layer.w13_weight_scale,
(EngineCore_DP0 pid=5792)              ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5792)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
(EngineCore_DP0 pid=5792)     raise AttributeError(
(EngineCore_DP0 pid=5792) AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv'
```

This issue was already present, but it went undetected because Marlin was disabled. After the MoE refactor in vllm-project/vllm#31036, the parameter self.use_marlin was replaced by self.fp8_backend; self.fp8_backend is disabled for now.

---------

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
iboiko-habana added a commit to iboiko-habana/vllm-gaudi that referenced this pull request Dec 23, 2025
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
rajanintel24 pushed a commit to rajanintel24/vllm-gaudi that referenced this pull request Feb 11, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
