Add a flag to use FusedMoE kernel in compressed quantization#23442

Closed
chenxi-yang wants to merge 2 commits into vllm-project:main from chenxi-yang:export-D80552023

Conversation

@chenxi-yang chenxi-yang commented Aug 22, 2025

Summary: Allows customizing kernel usage in compressed-tensor quantization.

Test Plan:
```shell
CUDA_VISIBLE_DEVICES=6,7 \
  VLLM_DISABLE_COMPILE_CACHE=1 \
  VLLM_MQ_MAX_CHUNK_BYTES_MB=256 \
  VLLM_GPU_MEMORY_UTILIZATION=0.85 \
  buck2 run @//mode/{opt,inplace} \
    -c fbcode.enable_vllm=true \
    -c fbcode.enable_gpu_sections=true \
    -c fbcode.nvcc_arch=h100a \
    //smart/inference_platform_sp/llm_predictor_gpu:service -- \
    --local_cache_dir "$HOME/local/models/GLM-4.5V-FP8" \
    --try_local_cache \
    --max_seq_len=16384 \
    --max_batch_size 192 \
    --thrift_server_port 12345 \
    --enable_warmup=true \
    --model_mf_bucket=llm_inference \
    --model_mf_path=tree/oss/GLM-4.5V-FP8 \
    --force_llm_format=true \
    --allow_custom_stop_tokens \
    --model_parallel_size 2 \
    --vllm_engine \
    --cpu_offload_gb=0 \
    --kv_cache_quantization 8
```

Before:

QPS: 1.26
Avg latency: 49.998s
Avg TTFT (client): 1679.44ms
P50 TTFT (client): 1584.17ms
P99 TTFT (client): 5748.46ms
Avg TTIT (client): 48.32ms
P50 TTIT (client): 48.21ms
P99 TTIT (client): 59.81ms
Avg TTFT (server): 2481.96ms
Avg TTIT (server): 48.06ms
Avg prefill len: 2643.00 tokens
P50 prefill len: 2643.00 tokens
P99 prefill len: 2643.00 tokens
Avg decode len: 1000.00 tokens
P50 decode len: 1000.00 tokens
P99 decode len: 1000.00 tokens

After:

QPS: 1.86
Avg latency: 92.238s
Avg TTFT (client): 1856.53ms
P50 TTFT (client): 1912.39ms
P99 TTFT (client): 2694.97ms
Avg TTIT (client): 70.38ms
P50 TTIT (client): 75.11ms
P99 TTIT (client): 76.74ms
Avg TTFT (server): 2984.87ms
Avg TTIT (server): 77.98ms
Avg prefill len: 2643.00 tokens
P50 prefill len: 2643.00 tokens
P99 prefill len: 2643.00 tokens
Avg decode len: 1000.00 tokens
P50 decode len: 1000.00 tokens
P99 decode len: 1000.00 tokens

Rollback Plan:

Differential Revision: D80552023

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D80552023


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new environment variable, VLLM_USE_FUSED_MOE_KERNEL_IN_COMPRESSED_QUANTIZATION, to allow forcing the use of the fused MoE kernel for compressed quantization. The changes correctly implement this new flag. My main feedback is to refactor duplicated code in vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py to improve maintainability. The changes also include an unrelated but correct refactoring in vllm/model_executor/models/glm4_1v.py.
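The gating described above can be illustrated with a small sketch. This is not the actual vllm implementation (the real code reads the flag through `vllm.envs` inside `compressed_tensors_moe.py`); the `"0"/"1"` parsing convention and the function name here are assumptions for illustration:

```python
import os

# Illustrative only: the real vllm code reads this flag via vllm.envs and
# combines it with hardware checks in compressed_tensors_moe.py.
USE_FUSED = os.environ.get(
    "VLLM_USE_FUSED_MOE_KERNEL_IN_COMPRESSED_QUANTIZATION", "0") == "1"


def select_moe_kernel(is_fp8_w8a8_sm100: bool, num_tokens: int) -> str:
    """Hypothetical kernel picker: the env flag short-circuits the
    usual small-batch SM100 condition from the diff under review."""
    if USE_FUSED or (is_fp8_w8a8_sm100 and num_tokens <= 8):
        return "fused_moe"
    return "cutlass_moe_fp8"
```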


@talwolman

@yewentao256 any chance you can help review and get this merged? @chenxi-yang is working on very high priority projects and these PRs are critical for them

@yewentao256
Member

I don't think this is needed for common users, since envs.VLLM_USE_FUSED_MOE_KERNEL_IN_COMPRESSED_QUANTIZATION is a short path and it ignores the other conditions needed for fused_moe.
What do you think @mgoin

@chenxi-yang
Contributor Author

> I don't think this is needed for common users, since envs.VLLM_USE_FUSED_MOE_KERNEL_IN_COMPRESSED_QUANTIZATION is a short path and it ignores the other conditions needed for fused_moe. What do you think @mgoin

Hi, could you elaborate a bit about other conditions needed for fused moe? Thank you!

```diff
-# small-batch fallback on SM100
-if self.is_fp8_w8a8_sm100 and topk_ids.shape[0] <= 8:
+# fused_moe flag or small-batch fallback on SM100
+if envs.VLLM_USE_FUSED_MOE_KERNEL_IN_COMPRESSED_QUANTIZATION or (
```
Collaborator


This env var name is probably not appropriate. Shall we call it FUSED_MOE_BACKEND, so users can force a particular fused MoE backend?

@yewentao256
Member

I am thinking we don't need an env here: if we do find that fused MoE supports SM90 and there is no need for topk_ids.shape[0] <= 8, we can directly update the condition there. An environment variable here seems to be a hack for a specific user only.

@facebook-github-bot

@chenxi-yang has exported this pull request. If you are a Meta employee, you can view the originating diff in D80552023.

@chenxi-yang
Contributor Author

> I am thinking we don't need an env here: if we do find that fused MoE supports SM90 and there is no need for topk_ids.shape[0] <= 8, we can directly update the condition there. An environment variable here seems to be a hack for a specific user only.

Cleaned up the condition with sm90 checking. Please feel free to review.

```diff
-# small-batch fallback on SM100
-if self.is_fp8_w8a8_sm100 and topk_ids.shape[0] <= 8:
+# SM90 or small-batch fallback on SM100
+if self.is_fp8_w8a8_sm90 or (
```
Collaborator


I would like to confirm: do we feel that, in general, SM90 should go down this path?

Member


Yes, and I am worried about whether topk_ids.shape[0] <= 8 should be applied as well.
Could you please add lm_eval and other tests to show this is validated?
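For reference, the combined condition being debated can be written as a small predicate (the attribute names follow the snippets above; this is an illustrative restatement, not the exact vllm code):

```python
def use_fused_moe(is_fp8_w8a8_sm90: bool,
                  is_fp8_w8a8_sm100: bool,
                  num_tokens: int) -> bool:
    """SM90 always takes the fused MoE path; SM100 falls back to it only
    for small batches (the topk_ids.shape[0] <= 8 case in question)."""
    return is_fp8_w8a8_sm90 or (is_fp8_w8a8_sm100 and num_tokens <= 8)
```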

Contributor Author

@chenxi-yang chenxi-yang Sep 14, 2025


Thanks for the comments!

I’m benchmarking Triton-fused MoE and CUTLASS MoE to better understand the differences. Is there an existing script for benchmarking cutlass_moe (and _fp8)? I’ve been using benchmark_moe.py on GLM, llama, Kimi, and DeepSeek for Triton-fused MoE, and would like to compare against CUTLASS across models. Otherwise, I’ll write one (just want to avoid duplication).

For lm_eval: I’m new to vllm — is there a guideline or README for adding lm_eval and kernel-level tests? I’d be happy to follow the recommended process.

Just FYI, I also noticed a similar observation in Llama-Scout, where fused MoE (fp8) with an op config was faster than CUTLASS MoE: #19714.
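For a rough first comparison before wiring up a full benchmark, a minimal wall-clock harness could look like the sketch below (illustrative only, not vllm's benchmark_moe.py; real GPU kernels should instead be timed with torch.cuda.Event plus synchronization):

```python
import time
from typing import Callable


def bench(fn: Callable[[], object], warmup: int = 3, iters: int = 10) -> float:
    """Mean wall-clock seconds per call of fn()."""
    for _ in range(warmup):
        fn()  # warm caches / lazy compilation before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


# Stand-ins for two MoE implementations under comparison.
fast_kernel = lambda: sum(range(1_000))
slow_kernel = lambda: sum(range(100_000))
print(f"fast: {bench(fast_kernel):.2e}s/iter, slow: {bench(slow_kernel):.2e}s/iter")
```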

Member


For the benchmark script, if you can't find one in the benchmark folder, feel free to write one yourself (GPT is really good at this).
For lm_eval, take a look at the document at vllm/docs/features/quantization/fp8.md. Note that you should pick a model that uses FusedMoE.

Contributor Author


@yewentao256 Hi Wentao, here is lm_eval for Llama-4-Maverick-17B-128E-Instruct-FP8. Llama-4-Maverick-17B-128E-Instruct-FP8 originally used


lm_eval command:

```shell
MODEL=$MODEL_DIR/Llama-4-Maverick-17B-128E-Instruct-FP8
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True,tensor_parallel_size=8 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
```

lm_eval without this PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.924|±  |0.0168|
|     |       |strict-match    |     5|exact_match|↑  |0.924|±  |0.0168|

lm_eval with this PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.916|±  |0.0176|
|     |       |strict-match    |     5|exact_match|↑  |0.928|±  |0.0164|

What kind of test do we expect? Separately, I was wondering why we may want the topk_ids.shape[0] <= 8 condition. Is the concern accuracy- or throughput-related? (Does a larger batch size lead to numerical-precision issues for fused MoE, or to a performance cliff?)

Member


Hi Chenxi @chenxi-yang
For the accuracy test, the main thing we want to verify is that this won't affect the accuracy of other models. Accuracy is expected to be nearly the same with both methods (fused MoE / CUTLASS); the test should compare the accuracy of both methods on the same model.

Member


I am not 100% sure of the context for topk_ids.shape[0] <= 8; perhaps @mgoin knows more details.

Contributor Author


Thanks for the explanation!
For now, I am planning to test the following for glm-4.5v-fp8 and Llama-4-Maverick-17B-128E-Instruct-FP8:

  1. lm_eval both models' accuracy with and without this PR. (the maverick-fp8 results are already shown above)
  2. benchmark the fused_moe() [with optimal kernel config] and cutlass_moe_fp8() for glm-4.5v-fp8 and Llama-4-Maverick-17B-128E-Instruct-FP8.

What do you think? @yewentao256 @houseroad

Contributor Author

@chenxi-yang chenxi-yang Sep 20, 2025


@yewentao256 @houseroad I added the cutlass moe fp8 benchmark here: #25302, PTAL. glm config is here: #24911

Triton fused MoE with an optimal kernel config is generally more than 20% faster than CUTLASS MoE; with the default config, however, Triton fused MoE is slower than CUTLASS. The gain seems to come from a combination of fusion and kernel tuning.

In the future, there may be some optimization options: (1) add a fused CUTLASS MoE; (2) customize the vLLM MoE options.
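The "op config" vs "default config" gap comes from vllm's tuned fused-MoE configs, which are JSON files of Triton launch parameters keyed by token count. A simplified sketch of that lookup (the keys and field names here are made up for illustration):

```python
# Hypothetical tuned-config table: batch size (tokens) -> Triton launch params.
TUNED_CONFIGS = {
    1: {"BLOCK_SIZE_M": 16},
    64: {"BLOCK_SIZE_M": 64},
    1024: {"BLOCK_SIZE_M": 128},
}


def get_moe_config(num_tokens: int) -> dict:
    """Pick the tuned entry whose batch-size key is closest to num_tokens,
    mirroring how per-shape kernel configs are selected."""
    best = min(TUNED_CONFIGS, key=lambda k: abs(k - num_tokens))
    return TUNED_CONFIGS[best]
```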

Here are the details:

glm4.5v-fp8 setting

- triton with op config: [benchmark screenshot]
- triton with default config: [benchmark screenshot]

Llama-4-Maverick-17B-128E-Instruct-FP8 setting

- triton with op config: [benchmark screenshot]
- triton with default config: [benchmark screenshot]

Member


> lm_eval both models' accuracy with and without this PR. (the maverick-fp8 results are already shown above)

This looks good to me; could you add a report for that?

@mergify

mergify bot commented Sep 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chenxi-yang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Add a flag to use FusedMoE kernel in compressed quantization (vllm-project#23442)

Summary:
Pull Request resolved: vllm-project#23442

Allows fused moe kernel usage in compressed-tensor quantization on sm90.

Signed-off-by: Chenxi Yang <cxyang@meta.com>


The accuracy is 0.74 (on par with and without this diff).

Reviewed By: zzh142857, wangwenchen0407

Differential Revision: D80552023
@facebook-github-bot

@chenxi-yang has exported this pull request. If you are a Meta employee, you can view the originating diff in D80552023.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Dec 26, 2025
@mergify

mergify bot commented Dec 26, 2025

Hi @chenxi-yang, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

```shell
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint
```

@github-actions github-actions bot added unstale Received activity after being labelled stale and removed stale Over 90 days of inactivity labels Dec 28, 2025
@mergify

mergify bot commented Dec 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chenxi-yang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 28, 2025
@hmellor
Member

hmellor commented Mar 6, 2026

The ability to override MoE kernel selection is now available via the --moe-backend CLI argument, added in #33807 (merged 2026-02-26).

@hmellor hmellor closed this Mar 6, 2026