Add a flag to use FusedMoE kernel in compressed quantization #23442
chenxi-yang wants to merge 2 commits into vllm-project:main
Conversation
Force-pushed 2a4d41e to f7876e3
This pull request was exported from Phabricator. Differential Revision: D80552023
This pull request introduces a new environment variable, VLLM_USE_FUSED_MOE_KERNEL_IN_COMPRESSED_QUANTIZATION, to allow forcing the use of the fused MoE kernel for compressed quantization. The changes correctly implement this new flag. My main feedback is to refactor duplicated code in vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py to improve maintainability. The changes also include an unrelated but correct refactoring in vllm/model_executor/models/glm4_1v.py.
Force-pushed f7876e3 to f4c0c63
Force-pushed f4c0c63 to 0d750f6
Force-pushed 0d750f6 to 1cf0baa
Force-pushed 1cf0baa to 77195d7
Force-pushed 77195d7 to 6731b28
Force-pushed 6731b28 to 93d9371
Force-pushed 93d9371 to fae752b
Force-pushed fae752b to 4a0c609
@yewentao256 any chance you can help review and get this merged? @chenxi-yang is working on very high priority projects and these PRs are critical for them.
I don't think this is needed for common users, since …
Hi, could you elaborate a bit about …
```diff
-            # small-batch fallback on SM100
-            if self.is_fp8_w8a8_sm100 and topk_ids.shape[0] <= 8:
+            # fused_moe flag or small-batch fallback on SM100
+            if envs.VLLM_USE_FUSED_MOE_KERNEL_IN_COMPRESSED_QUANTIZATION or (
```
Probably this env var name is not appropriate. Shall we call it FUSED_MOE_BACKEND, so users can force picking a particular fused MoE backend?
I am thinking we don't need an env var here, if we do find the fused MoE supports SM90 and there is no need for …
Force-pushed 5f58b2c to e6ef397
@chenxi-yang has exported this pull request. If you are a Meta employee, you can view the originating diff in D80552023.
Cleaned up the condition with SM90 checking. Please feel free to review.
```diff
-            # small-batch fallback on SM100
-            if self.is_fp8_w8a8_sm100 and topk_ids.shape[0] <= 8:
+            # SM90 or small-batch fallback on SM100
+            if self.is_fp8_w8a8_sm90 or (
```
Would like to confirm: do we feel that, in general, SM90 should go down this path?
Yes. I am also worried about whether `topk_ids.shape[0] <= 8` should be applied here as well.
Could you please add lm_eval results and any relevant tests to show this is validated?
Thanks for the comments!
I'm benchmarking Triton-fused MoE and CUTLASS MoE to better understand the differences. Is there an existing script for benchmarking cutlass_moe (and _fp8)? I've been using benchmark_moe.py on GLM, Llama, Kimi, and DeepSeek for Triton-fused MoE, and would like to compare against CUTLASS across models. Otherwise, I'll write one (just want to avoid duplication).
For lm_eval: I'm new to vLLM; is there a guideline or README for adding lm_eval and kernel-level tests? I'd be happy to follow the recommended process.
Just FYI, I also noticed a similar observation with Llama-Scout, where fused MoE (fp8) with op-config was faster than CUTLASS MoE: #19714
For the benchmark script, if you can't find one in the benchmark folder, feel free to write one yourself (GPT is really good at this).
For lm_eval, you can take a look at the document at vllm/docs/features/quantization/fp8.md. Note that you should pick a model that is actually using the fused MoE path.
@yewentao256 Hi Wentao, here is the lm_eval result for Llama-4-Maverick-17B-128E-Instruct-FP8, which originally used …
lm_eval command:

```shell
MODEL=$MODEL_DIR/Llama-4-Maverick-17B-128E-Instruct-FP8
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True,tensor_parallel_size=8 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
```
lm_eval without this PR:
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.924|± |0.0168|
| | |strict-match | 5|exact_match|↑ |0.924|± |0.0168|
lm_eval with this PR:
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.916|± |0.0176|
| | |strict-match | 5|exact_match|↑ |0.928|± |0.0164|
What kind of test do we expect? Separately, I was wondering why we may want to add `if topk_ids.shape[0] <= 8`. Is the concern accuracy-related or throughput-related? (Does a larger batch size lead to numerical precision issues for fused MoE, or to a performance cliff?)
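As an aside, the gsm8k deltas above look like noise rather than a regression; a quick two-sample z-check on the flexible-extract numbers from the tables (no data beyond what is shown there) makes that concrete:

```python
import math

# flexible-extract: 0.924 ± 0.0168 (without this PR) vs 0.916 ± 0.0176 (with it)
baseline, se_baseline = 0.924, 0.0168
with_pr, se_with_pr = 0.916, 0.0176

# Two-sample z statistic: difference over the combined standard error.
z = (baseline - with_pr) / math.hypot(se_baseline, se_with_pr)
print(f"z = {z:.2f}")  # ~0.33, far below the ~1.96 threshold for p < 0.05
```

So the 0.008 drop is well within one standard error and is not statistically significant on its own.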
Hi Chenxi @chenxi-yang,
For the accuracy test, the main thing we want to make sure of is that this won't affect the accuracy of other models. Results are expected to be nearly the same with both methods (fused MoE / CUTLASS); the test should compare accuracy for both methods on the same model.
I am not 100% sure about the context for `if topk_ids.shape[0] <= 8`; perhaps @mgoin knows more details.
Thanks for the explanation!
For now, I am planning to test the following for glm-4.5v-fp8 and Llama-4-Maverick-17B-128E-Instruct-FP8:

1. lm_eval both models' accuracy with and without this PR (the Maverick-FP8 results are already shown above).
2. Benchmark `fused_moe()` (with optimal kernel config) and `cutlass_moe_fp8()` for glm-4.5v-fp8 and Llama-4-Maverick-17B-128E-Instruct-FP8.

What do you think? @yewentao256 @houseroad
@yewentao256 @houseroad I added the CUTLASS MoE fp8 benchmark here: #25302, PTAL. The GLM config is here: #24911
Triton fused MoE with the optimal config is generally more than 20% faster than CUTLASS MoE. However, Triton fused is worse than CUTLASS when using the default config, so the gain seems to come from a combination of fusion and kernel tuning.
In the future, there may be some optimization options: 1) add a fused CUTLASS MoE; 2) customize vLLM MoE options.
Here are the details (benchmark result screenshots attached in the PR):

- glm4.5v-fp8 setting: Triton with op config; Triton with default config
- Llama-4-Maverick-17B-128E-Instruct-FP8 setting: Triton with op config; Triton with default config
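A minimal timing harness in the spirit of that comparison, with NumPy matmuls standing in for the real Triton/CUTLASS kernels (benchmark_moe.py remains the authoritative script for actual kernel measurements), might look like:

```python
import time
import numpy as np

def bench(fn, warmup=3, iters=20):
    """Median wall-clock time of fn() over `iters` runs, after warmup."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return float(np.median(times))

# Stand-in workload for the two MoE paths being compared.
x = np.random.default_rng(0).standard_normal((256, 512)).astype(np.float32)
w = np.random.default_rng(1).standard_normal((512, 512)).astype(np.float32)
t = bench(lambda: x @ w)
print(f"median: {t * 1e6:.1f} us")
```

Warmup iterations matter for kernel benchmarks in particular, since the first calls can include JIT compilation and cache effects.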
> lm_eval both models' accuracy with and without this PR. (the maverick-fp8 results are already shown above)

This looks good to me, could you add a report for that?
This pull request has merge conflicts that must be resolved before it can be merged.
Summary: Pull Request resolved: vllm-project#23442. Allows fused MoE kernel usage in compressed-tensor quantization on SM90.
Signed-off-by: Chenxi Yang <cxyang@meta.com>
Test plan and before/after benchmark numbers are identical to those in the PR description at the end of this thread. The accuracy is 0.74 (on par with and without this diff).
Reviewed By: zzh142857, wangwenchen0407
Differential Revision: D80552023
Force-pushed 8920fa2 to 996dc60
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Hi @chenxi-yang, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, …
This pull request has merge conflicts that must be resolved before it can be merged.
The ability to override MoE kernel selection is now available via the …
Summary: Allows customizing kernel usage in compressed-tensor quantization.
Test Plan:

```shell
CUDA_VISIBLE_DEVICES=6,7 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_MQ_MAX_CHUNK_BYTES_MB=256 \
VLLM_GPU_MEMORY_UTILIZATION=0.85 \
buck2 run @//mode/{opt,inplace} \
  -c fbcode.enable_vllm=true \
  -c fbcode.enable_gpu_sections=true \
  -c fbcode.nvcc_arch=h100a \
  //smart/inference_platform_sp/llm_predictor_gpu:service -- \
  --local_cache_dir "$HOME/local/models/GLM-4.5V-FP8" \
  --try_local_cache \
  --max_seq_len=16384 \
  --max_batch_size 192 \
  --thrift_server_port 12345 \
  --enable_warmup=true \
  --model_mf_bucket=llm_inference \
  --model_mf_path=tree/oss/GLM-4.5V-FP8 \
  --force_llm_format=true \
  --allow_custom_stop_tokens \
  --model_parallel_size 2 \
  --vllm_engine \
  --cpu_offload_gb=0 \
  --kv_cache_quantization 8
```
| Metric | Before | After |
|---|---:|---:|
| QPS | 1.26 | 1.86 |
| Avg latency | 49.998 s | 92.238 s |
| Avg TTFT (client) | 1679.44 ms | 1856.53 ms |
| P50 TTFT (client) | 1584.17 ms | 1912.39 ms |
| P99 TTFT (client) | 5748.46 ms | 2694.97 ms |
| Avg TTIT (client) | 48.32 ms | 70.38 ms |
| P50 TTIT (client) | 48.21 ms | 75.11 ms |
| P99 TTIT (client) | 59.81 ms | 76.74 ms |
| Avg TTFT (server) | 2481.96 ms | 2984.87 ms |
| Avg TTIT (server) | 48.06 ms | 77.98 ms |
| Avg/P50/P99 prefill len | 2643.00 tokens | 2643.00 tokens |
| Avg/P50/P99 decode len | 1000.00 tokens | 1000.00 tokens |
Rollback Plan:
Differential Revision: D80552023