Add a flag to use FusedMoE kernel in compressed quantization #23442
chenxi-yang wants to merge 2 commits into vllm-project:main
Conversation
Force-pushed 2a4d41e to f7876e3
This pull request was exported from Phabricator. Differential Revision: D80552023
This pull request introduces a new environment variable, VLLM_USE_FUSED_MOE_KERNEL_IN_COMPRESSED_QUANTIZATION, to allow forcing the use of the fused MoE kernel for compressed quantization. The changes correctly implement this new flag. My main feedback is to refactor duplicated code in vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py to improve maintainability. The changes also include an unrelated but correct refactoring in vllm/model_executor/models/glm4_1v.py.
Force-pushed f7876e3 to f4c0c63
Force-pushed f4c0c63 to 0d750f6
Force-pushed 0d750f6 to 1cf0baa
Force-pushed 1cf0baa to 77195d7
Force-pushed 77195d7 to 6731b28
Force-pushed 6731b28 to 93d9371
Force-pushed 93d9371 to fae752b
Force-pushed fae752b to 4a0c609
@yewentao256 any chance you can help review and get this merged? @chenxi-yang is working on very high priority projects and these PRs are critical for them.
I don't think this is needed for common users, since …
Hi, could you elaborate a bit about …
```diff
-            # small-batch fallback on SM100
-            if self.is_fp8_w8a8_sm100 and topk_ids.shape[0] <= 8:
+            # fused_moe flag or small-batch fallback on SM100
+            if envs.VLLM_USE_FUSED_MOE_KERNEL_IN_COMPRESSED_QUANTIZATION or (
```
Probably this env var name is not appropriate. Shall we call it FUSED_MOE_BACKEND, so users can force picking a particular fused MoE backend?
I am thinking we don't need an env var here, if we do find the fused MoE supports SM90 and there is no need for …
Force-pushed 5f58b2c to e6ef397
@chenxi-yang has exported this pull request. If you are a Meta employee, you can view the originating diff in D80552023.
Cleaned up the condition with SM90 checking. Please feel free to review.
```diff
-            # small-batch fallback on SM100
-            if self.is_fp8_w8a8_sm100 and topk_ids.shape[0] <= 8:
+            # SM90 or small-batch fallback on SM100
+            if self.is_fp8_w8a8_sm90 or (
```
Would like to confirm: do we feel that, in general, SM90 should go down this path?
Yes. I am also worried about whether `topk_ids.shape[0] <= 8` should be applied here as well.
Could you please add lm_eval results and any relevant tests to show this is validated?
Thanks for the comments!
I'm benchmarking Triton-fused MoE and CUTLASS MoE to better understand the differences. Is there an existing script for benchmarking cutlass_moe (and _fp8)? I've been using benchmark_moe.py on GLM, Llama, Kimi, and DeepSeek for Triton-fused MoE, and would like to compare against CUTLASS across models. Otherwise, I'll write one (just want to avoid duplication).
For lm_eval: I'm new to vLLM; is there a guideline or README for adding lm_eval and kernel-level tests? I'd be happy to follow the recommended process.
Just FYI, I also noticed a similar observation with Llama-Scout, where fused MoE (fp8) with op-config was faster than CUTLASS MoE: #19714
For the benchmark script, if you can't find one in the benchmark folder, feel free to write one yourself (GPT is really good at this).
For lm_eval, you can take a look at the document at vllm/docs/features/quantization/fp8.md. Note that you should pick a model that is actually using the fused MoE path.
@yewentao256 Hi Wentao, here is the lm_eval result for Llama-4-Maverick-17B-128E-Instruct-FP8, which originally used …
lm_eval command:

```shell
MODEL=$MODEL_DIR/Llama-4-Maverick-17B-128E-Instruct-FP8
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True,tensor_parallel_size=8 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
```
lm_eval without this PR:
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.924|± |0.0168|
| | |strict-match | 5|exact_match|↑ |0.924|± |0.0168|
lm_eval with this PR:
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.916|± |0.0176|
| | |strict-match | 5|exact_match|↑ |0.928|± |0.0164|
What kind of test do we expect? Separately, I was wondering why we may want to add `if topk_ids.shape[0] <= 8`. Is the concern accuracy-related or throughput-related? (Does a larger batch size lead to numerical precision issues for fused MoE, or to a performance cliff?)
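As an aside, the gsm8k deltas above look like noise rather than a regression; a quick two-sample z-check on the flexible-extract numbers from the tables (no data beyond what is shown there) makes that concrete:

```python
import math

# flexible-extract: 0.924 ± 0.0168 (without this PR) vs 0.916 ± 0.0176 (with it)
baseline, se_baseline = 0.924, 0.0168
with_pr, se_with_pr = 0.916, 0.0176

# Two-sample z statistic: difference over the combined standard error.
z = (baseline - with_pr) / math.hypot(se_baseline, se_with_pr)
print(f"z = {z:.2f}")  # ~0.33, far below the ~1.96 threshold for p < 0.05
```

So the 0.008 drop is well within one standard error and is not statistically significant on its own.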
Hi Chenxi @chenxi-yang,
For the accuracy test, the main thing we want to make sure of is that this won't affect the accuracy of other models. Results are expected to be nearly the same with both methods (fused MoE / CUTLASS); the test should compare accuracy for both methods on the same model.
I am not 100% sure about the context for `if topk_ids.shape[0] <= 8`; perhaps @mgoin knows more details.
Thanks for the explanation!
For now, I am planning to test the following for glm-4.5v-fp8 and Llama-4-Maverick-17B-128E-Instruct-FP8:

1. lm_eval both models' accuracy with and without this PR (the Maverick-FP8 results are already shown above).
2. Benchmark `fused_moe()` (with optimal kernel config) and `cutlass_moe_fp8()` for glm-4.5v-fp8 and Llama-4-Maverick-17B-128E-Instruct-FP8.

What do you think? @yewentao256 @houseroad
@yewentao256 @houseroad I added the CUTLASS MoE fp8 benchmark here: #25302, PTAL. The GLM config is here: #24911
Triton fused MoE with the optimal config is generally more than 20% faster than CUTLASS MoE. However, Triton fused is worse than CUTLASS when using the default config, so the gain seems to come from a combination of fusion and kernel tuning.
In the future, there may be some optimization options: 1) add a fused CUTLASS MoE; 2) customize vLLM MoE options.
Here are the details (benchmark result screenshots attached in the PR):

- glm4.5v-fp8 setting: Triton with op config; Triton with default config
- Llama-4-Maverick-17B-128E-Instruct-FP8 setting: Triton with op config; Triton with default config
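A minimal timing harness in the spirit of that comparison, with NumPy matmuls standing in for the real Triton/CUTLASS kernels (benchmark_moe.py remains the authoritative script for actual kernel measurements), might look like:

```python
import time
import numpy as np

def bench(fn, warmup=3, iters=20):
    """Median wall-clock time of fn() over `iters` runs, after warmup."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return float(np.median(times))

# Stand-in workload for the two MoE paths being compared.
x = np.random.default_rng(0).standard_normal((256, 512)).astype(np.float32)
w = np.random.default_rng(1).standard_normal((512, 512)).astype(np.float32)
t = bench(lambda: x @ w)
print(f"median: {t * 1e6:.1f} us")
```

Warmup iterations matter for kernel benchmarks in particular, since the first calls can include JIT compilation and cache effects.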
> lm_eval both models' accuracy with and without this PR. (the maverick-fp8 results are already shown above)

This looks good to me, could you add a report for that?
This pull request has merge conflicts that must be resolved before it can be merged.
Summary: Pull Request resolved: vllm-project#23442. Allows fused MoE kernel usage in compressed-tensor quantization on SM90.
Signed-off-by: Chenxi Yang <cxyang@meta.com>
Test plan and before/after benchmark numbers are identical to those in the PR description at the end of this thread. The accuracy is 0.74 (on par with and without this diff).
Reviewed By: zzh142857, wangwenchen0407
Differential Revision: D80552023
Force-pushed 8920fa2 to 996dc60
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Hi @chenxi-yang, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, …
This pull request has merge conflicts that must be resolved before it can be merged.
The ability to override MoE kernel selection is now available via the …
Summary: Allows customizing kernel usage in compressed-tensor quantization.
Test Plan:

```shell
CUDA_VISIBLE_DEVICES=6,7 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_MQ_MAX_CHUNK_BYTES_MB=256 \
VLLM_GPU_MEMORY_UTILIZATION=0.85 \
buck2 run @//mode/{opt,inplace} \
  -c fbcode.enable_vllm=true \
  -c fbcode.enable_gpu_sections=true \
  -c fbcode.nvcc_arch=h100a \
  //smart/inference_platform_sp/llm_predictor_gpu:service -- \
  --local_cache_dir "$HOME/local/models/GLM-4.5V-FP8" \
  --try_local_cache \
  --max_seq_len=16384 \
  --max_batch_size 192 \
  --thrift_server_port 12345 \
  --enable_warmup=true \
  --model_mf_bucket=llm_inference \
  --model_mf_path=tree/oss/GLM-4.5V-FP8 \
  --force_llm_format=true \
  --allow_custom_stop_tokens \
  --model_parallel_size 2 \
  --vllm_engine \
  --cpu_offload_gb=0 \
  --kv_cache_quantization 8
```
| Metric | Before | After |
|---|---:|---:|
| QPS | 1.26 | 1.86 |
| Avg latency | 49.998 s | 92.238 s |
| Avg TTFT (client) | 1679.44 ms | 1856.53 ms |
| P50 TTFT (client) | 1584.17 ms | 1912.39 ms |
| P99 TTFT (client) | 5748.46 ms | 2694.97 ms |
| Avg TTIT (client) | 48.32 ms | 70.38 ms |
| P50 TTIT (client) | 48.21 ms | 75.11 ms |
| P99 TTIT (client) | 59.81 ms | 76.74 ms |
| Avg TTFT (server) | 2481.96 ms | 2984.87 ms |
| Avg TTIT (server) | 48.06 ms | 77.98 ms |
| Avg/P50/P99 prefill len | 2643.00 tokens | 2643.00 tokens |
| Avg/P50/P99 decode len | 1000.00 tokens | 1000.00 tokens |
Rollback Plan:
Differential Revision: D80552023