[MoE Refactor] MXFP4 Cutlass Experts to MK #34542
Merged
vllm-bot
merged 29 commits into
vllm-project:main
from
zyongye:mxfp4_refactor_cutlass_experts
Feb 26, 2026
Commits
- `16e7e13` adding mxfp4 quant key (zyongye)
- `0c6da53` runnable but not correct (zyongye)
- `0ed72eb` remove unused variable (zyongye)
- `8c7da24` bug fix (zyongye)
- `eff5f48` convert bf16 (zyongye)
- `51e8b0d` revert back scalar dtype (zyongye)
- `6d80baa` fix trtllm moe (zyongye)
- `7a088dc` add tune size to flashinfer experts (zyongye)
- `d10d307` move kernel setup to process_weight (zyongye)
- `2935058` only cast when act is fp8 (zyongye)
- `25aac05` add topk_ids contiguous assertion (zyongye)
- `2a3aa22` add testing infrastructure (zyongye)
- `8a5885b` fix pre-commit (zyongye)
- `ae3105e` change parameter inside the kernels (zyongye)
- `8cd30a3` change ci to h100 (zyongye)
- `094fc4c` add back quant function parameters (zyongye)
- `28ae123` add back dep interface (zyongye)
- `a258129` add back dep interface (zyongye)
- `b353697` fixing trtllm moe and pre commit (zyongye)
- `cd93ad9` assert not using dep (zyongye)
- `70b025f` bring back dep (zyongye)
- `03e0a58` pre-commit (zyongye)
- `8ec4e1d` update ci tests (zyongye)
- `6de17ac` update device to use in moe config (zyongye)
- `275db17` move fake scale into init (zyongye)
- `91f8c70` add dtype into scales (zyongye)
- `a501137` unifing moe_mk interface (zyongye)
- `c17109a` adding activation type to experts (zyongye)
- `2618960` fix typos and update tests (zyongye)
@@ -0,0 +1,49 @@

# GPQA Evaluation using GPT-OSS

This directory contains GPQA evaluation tests using the GPT-OSS evaluation package and vLLM server.

## Usage

### Run tests with pytest (like buildkite)

```bash
# H200
pytest -s -v tests/evals/gpt_oss/test_gpqa_correctness.py \
    --config-list-file=configs/models-h200.txt

# B200
pytest -s -v tests/evals/gpt_oss/test_gpqa_correctness.py \
    --config-list-file=configs/models-b200.txt
```

## Configuration Format

Model configs in `configs/` directory use this YAML format:

```yaml
model_name: "openai/gpt-oss-20b"
metric_threshold: 0.568  # Minimum expected accuracy
reasoning_effort: "low"  # Reasoning effort level (default: "low")
server_args: "--tensor-parallel-size 2"  # Server arguments
startup_max_wait_seconds: 1800  # Max wait for server startup (default: 1800)
env:  # Environment variables (optional)
  SOME_VAR: "value"
```

The `server_args` field accepts any arguments that can be passed to `vllm serve`.

The `env` field accepts a dictionary of environment variables to set for the server process.

## Adding New Models

1. Create a new YAML config file in the `configs/` directory
2. Add the filename to the appropriate `models-*.txt` file

## Tiktoken Encoding Files

The tiktoken encoding files required by the vLLM server are automatically downloaded from OpenAI's public blob storage on first run:

- `cl100k_base.tiktoken`
- `o200k_base.tiktoken`

Files are cached in the `data/` directory. The `TIKTOKEN_ENCODINGS_BASE` environment variable is automatically set to point to this directory when running evaluations.
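The download-once-then-cache behaviour described in the README's Tiktoken section could look roughly like the sketch below. This is not the PR's implementation: `ensure_encodings`, `http_fetch`, the injectable `fetch` callback, and `ASSUMED_BASE_URL` are names and values assumed for illustration. Only the two file names, the idea of a `data/` cache directory, and the `TIKTOKEN_ENCODINGS_BASE` variable come from the README.

```python
import os
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

# Assumed download location; the README only says "OpenAI's public blob storage".
ASSUMED_BASE_URL = "https://openaipublic.blob.core.windows.net/encodings"
ENCODING_FILES = ("cl100k_base.tiktoken", "o200k_base.tiktoken")


def http_fetch(url: str) -> bytes:
    """Default downloader: a plain HTTP GET returning the response body."""
    with urlopen(url) as resp:
        return resp.read()


def ensure_encodings(
    cache_dir: Path, fetch: Callable[[str], bytes] = http_fetch
) -> Path:
    """Download any missing encoding file into cache_dir, then export its path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for name in ENCODING_FILES:
        target = cache_dir / name
        if not target.exists():  # files already on disk are never re-downloaded
            target.write_bytes(fetch(f"{ASSUMED_BASE_URL}/{name}"))
    # Point consumers of the encodings at the cache directory.
    os.environ["TIKTOKEN_ENCODINGS_BASE"] = str(cache_dir)
    return cache_dir
```

Passing the downloader in as a parameter keeps the caching logic testable without network access; a caller would normally just use `ensure_encodings(Path("data"))`.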
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project | ||
| model_name: "openai/gpt-oss-20b" | ||
| metric_threshold: 0.568 | ||
| reasoning_effort: "low" | ||
| server_args: "--tensor-parallel-size 2" |
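A minimal config like this one relies on the documented defaults for the fields it omits. The sketch below shows one way those defaults and the accuracy threshold might be applied once the YAML has been loaded into a dictionary. `EvalConfig`, `from_dict`, and `passes` are hypothetical names rather than the test suite's API, and treating `model_name` and `metric_threshold` as the only required keys is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EvalConfig:
    """Hypothetical in-memory form of one GPQA config file."""

    model_name: str
    metric_threshold: float  # minimum expected accuracy
    reasoning_effort: str = "low"  # default documented in the README
    server_args: str = ""  # assumed default: no extra server arguments
    startup_max_wait_seconds: int = 1800  # default documented in the README
    env: Dict[str, str] = field(default_factory=dict)  # optional

    @classmethod
    def from_dict(cls, raw: dict) -> "EvalConfig":
        # Assumption: only these two keys are mandatory.
        missing = {"model_name", "metric_threshold"} - raw.keys()
        if missing:
            raise ValueError(f"missing required keys: {sorted(missing)}")
        return cls(**raw)

    def passes(self, measured_accuracy: float) -> bool:
        """True when the measured accuracy meets the configured minimum."""
        return measured_accuracy >= self.metric_threshold


# The baseline config above, as it would look after YAML loading.
baseline = EvalConfig.from_dict(
    {
        "model_name": "openai/gpt-oss-20b",
        "metric_threshold": 0.568,
        "reasoning_effort": "low",
        "server_args": "--tensor-parallel-size 2",
    }
)
```

Here `baseline.startup_max_wait_seconds` falls back to 1800 and `baseline.env` to an empty dictionary, since the file sets neither.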
8 changes: 8 additions & 0 deletions
tests/evals/gpt_oss/configs/gpt-oss-20b-flashinfer-mxfp4-bf16.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project | ||
| model_name: "openai/gpt-oss-20b" | ||
| metric_threshold: 0.568 | ||
| reasoning_effort: "low" | ||
| server_args: "--tensor-parallel-size 2" | ||
| env: | ||
| VLLM_USE_FLASHINFER_MOE_MXFP4_BF16: "1" |
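To make the roles of `server_args` and `env` concrete, here is a small sketch of how a test harness might turn this config into a `vllm serve` invocation. `build_server_launch` is a hypothetical helper, not code from the PR; the only facts taken from the source are that `server_args` holds arguments for `vllm serve` and that `env` is applied to the server process.

```python
import os
import shlex
from typing import Dict, List, Optional, Tuple


def build_server_launch(
    model_name: str,
    server_args: str,
    env: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, environment) for launching the server as a subprocess."""
    # shlex.split honours shell-style quoting inside the server_args string.
    argv = ["vllm", "serve", model_name, *shlex.split(server_args)]
    # Start from the caller's environment and layer the config's values on top.
    child_env = dict(os.environ)
    child_env.update(env or {})
    return argv, child_env


argv, child_env = build_server_launch(
    "openai/gpt-oss-20b",
    "--tensor-parallel-size 2",
    {"VLLM_USE_FLASHINFER_MOE_MXFP4_BF16": "1"},
)
# argv and child_env could then be handed to subprocess.Popen(argv, env=child_env).
```

Copying `os.environ` instead of mutating it keeps one config's variables from leaking into the next server started by the same test run.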
8 changes: 8 additions & 0 deletions
tests/evals/gpt_oss/configs/gpt-oss-20b-flashinfer-mxfp4-mxfp8-cutlass.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project | ||
| model_name: "openai/gpt-oss-20b" | ||
| metric_threshold: 0.568 | ||
| reasoning_effort: "low" | ||
| server_args: "--tensor-parallel-size 2" | ||
| env: | ||
| VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS: "1" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project | ||
| model_name: "openai/gpt-oss-20b" | ||
| metric_threshold: 0.568 | ||
| reasoning_effort: "low" | ||
| server_args: "--tensor-parallel-size 2" | ||
| env: | ||
| VLLM_MXFP4_USE_MARLIN: "1" |
8 changes: 8 additions & 0 deletions
tests/evals/gpt_oss/configs/gpt-oss-20b-sm100-fi-mxfp4-mxfp8-trtllm.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project | ||
| model_name: "openai/gpt-oss-20b" | ||
| metric_threshold: 0.568 | ||
| reasoning_effort: "low" | ||
| server_args: "--tensor-parallel-size 2" | ||
| env: | ||
| VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: "1" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| # B200 model configurations for GPQA evaluation | ||
| # Tests different environment variable combinations | ||
| gpt-oss-20b-flashinfer-mxfp4-bf16.yaml | ||
| gpt-oss-20b-flashinfer-mxfp4-mxfp8-cutlass.yaml | ||
| gpt-oss-20b-sm100-fi-mxfp4-mxfp8-trtllm.yaml |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| # H100 model configurations for GPQA evaluation | ||
| # Tests different environment variable combinations | ||
| gpt-oss-20b-baseline.yaml | ||
| gpt-oss-20b-flashinfer-mxfp4-bf16.yaml | ||
| gpt-oss-20b-marlin.yaml |
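A list file like the one above is simple enough to parse in a few lines. The sketch below assumes that lines starting with `#` are comments and that blank lines are ignored; `parse_config_list` is a hypothetical name, and the real loader behind `--config-list-file` may behave differently.

```python
from typing import List


def parse_config_list(text: str) -> List[str]:
    """Return config filenames, skipping blank lines and '#' comment lines."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


# Contents of the H100 list file shown above.
h100_list = """\
# H100 model configurations for GPQA evaluation
# Tests different environment variable combinations
gpt-oss-20b-baseline.yaml
gpt-oss-20b-flashinfer-mxfp4-bf16.yaml
gpt-oss-20b-marlin.yaml
"""

configs = parse_config_list(h100_list)
```

Each returned name would then be resolved against the `configs/` directory and run as one parametrized test case.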
If you are going to add these here, please remove the duplicated ones in misc.yaml