[Model] Force use of triton compressed_tensor_moe instead of cutlass #22345
access2rohit wants to merge 4 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a new environment variable, VLLM_TRITON_COMPRESSED_TENSORS_MOE_KERNEL, to allow forcing the use of the Triton-based kernel for compressed tensor Mixture of Experts (MoE) layers. This is intended to improve performance for certain models. The implementation correctly adds the environment variable and uses it to control the kernel selection logic. My main feedback concerns the robustness of parsing this new environment variable, as the current method can lead to a ValueError if an invalid string is provided, causing a crash. I've suggested a safer parsing method to prevent this.
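The reviewer's concern is that parsing the variable with something like `bool(int(os.environ[...]))` raises `ValueError` on any non-numeric string. A minimal sketch of a lenient parser that avoids the crash (`env_flag` is a hypothetical helper, not vLLM's actual parsing code):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Leniently parse a boolean environment variable.

    Unlike bool(int(os.environ[name])), unrecognized values fall
    back to `default` instead of raising ValueError.
    """
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes", "on")

# Hypothetical usage for the flag discussed in this PR:
use_triton = env_flag("VLLM_TRITON_COMPRESSED_TENSORS_MOE_KERNEL")
```

With this shape, `VLLM_TRITON_COMPRESSED_TENSORS_MOE_KERNEL=garbage` silently resolves to the default rather than crashing server startup.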
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Force-pushed from 3aad897 to 2a2a3c1
yewentao256 left a comment:
Thanks for the work!
Could you show more data about the performance improvement? Not sure if this is actually needed.
This pull request has merge conflicts that must be resolved before it can be merged.
… cutlass: this improves performance for llama4
Signed-off-by: Rohit Kumar Srivastava <srivastava.141@osu.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Rohit Kumar Srivastava <141.srivastava@gmail.com>
Force-pushed from 2a2a3c1 to 11f21bd
Hi @access2rohit, I had a similar-purpose PR here: #23442
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This change is no longer needed since vLLM now automatically defaults to triton instead of cutlass. Perhaps the next step is simply to run perf benchmarks to see whether triton still outperforms cutlass or not.
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for the contribution! As you noted, this change is no longer needed since vLLM now automatically defaults to triton instead of cutlass. We're closing this PR accordingly. Thank you for your work on improving Llama 4 performance!
… this improves performance for llama4
Essential Elements of an Effective PR Description Checklist
- supported_models.md and examples for a new model.

Purpose
This PR improves the performance of Llama 4 by forcing use of the triton-based compressed tensor MoE kernel via a flag. Changed the following files:
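For context, a minimal sketch (not vLLM's actual dispatch code; `select_moe_kernel` is a hypothetical helper) of how such a flag could gate the choice between the cutlass and triton MoE paths:

```python
import os

# Sketch only: read the proposed flag leniently, defaulting to off.
FORCE_TRITON_MOE = os.environ.get(
    "VLLM_TRITON_COMPRESSED_TENSORS_MOE_KERNEL", "0"
).strip().lower() in ("1", "true")

def select_moe_kernel(force_triton: bool, cutlass_supported: bool) -> str:
    """Pick the MoE kernel backend: use triton when forced by the flag,
    or when cutlass is unsupported for this device/shape; else cutlass."""
    if force_triton or not cutlass_supported:
        return "triton"
    return "cutlass"
```

The flag only adds a forced override on top of the existing fallback logic, so default behavior is unchanged when the variable is unset.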
Test Plan
Ran with an uploaded Scout-based EAGLE setup to test E2E.
Example cmd:

```shell
CUDA_VISIBLE_DEVICES=4,5,6,7 VLLM_USE_V1=1 python examples/offline_inference/spec_decode.py \
    --num_spec_tokens 7 --num_prompts 1 --method eagle \
    --model_dir /home/$USER/local/models/scout_base_HF_20250605_201140 \
    --eagle_dir /home/$USER/local/models/scout_draft_HF_20250605_202942 \
    --tp 4
```

Unit test:

```shell
python -m pytest tests/v1/e2e/test_spec_decode.py
```
vllm serve + benchmarking
EAGLE server cmd
```shell
#!/bin/bash
# Configuration of environment variables
export CUDA_VISIBLE_DEVICES=4,5,6,7
export VLLM_USE_V1=1

# Command to run the vllm server
spec_dec_config='{"method": "eagle", "model": "/home/$USER/local/models/scout_draft_HF_20250605_202942", "prefill_token_shift": false, "num_speculative_tokens": 3, "draft_tensor_parallel_size": 4, "max_model_len": 32768}'

vllm serve /home/$USER/local/models/scout_base_HF_20250605_201140 \
    --disable-log-requests \
    -tp 4 \
    --max-num-seqs 128 \
    --max_num_batched_tokens=80000 \
    --max-model-len=32768 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --speculative-config="$spec_dec_config" \
    --num-lookahead-slots=3 \
    2>&1 | tee /data/users/$USER/logs/server/vllm_17b16e_vllm_serving$(date +%Y%m%d_%H%M%S).log
```
Base cmd = EAGLE server cmd with the line `--speculative-config="$spec_dec_config" \` removed.
benchmarking

```shell
python benchmarks/benchmark_serving.py --backend vllm \
    --model /home/$USER/local/models/scout_base_HF_20250605_201140 \
    --dataset-name hf --dataset-path philschmid/mt-bench \
    --seed 0 --max-concurrency 16 \
    2>&1 | tee /data/users/$USER/tmp/vllm_17b16e_vllm_loadgen$(date +%Y%m%d_%H%M%S).log
```
Test Result
[WIP]
(Optional) Documentation Update