[FEAT][ROCm] Integrate Fused MoE Kernels from AITER #14967
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
vllm/envs.py (outdated)
| "VLLM_ROCM_USE_AITER_MOE": | ||
| lambda: | ||
| (os.getenv("VLLM_ROCM_USE_AITER", "False").lower() in | ||
| ("true", "1") and os.getenv("VLLM_ROCM_USE_AITER_MOE", "True").lower() in | ||
| ("true", "1")), | ||
|
|
||
| # use aiter block scaled moe op if aiter ops are enabled. | ||
| # by default this is disabled. | ||
| "VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE": | ||
| lambda: | ||
| (os.getenv("VLLM_ROCM_USE_AITER", "False").lower() in | ||
| ("true", "1") and os.getenv("VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE", | ||
| "false").lower() in ("true", "1")), |
Let's keep vllm.envs simple by not doing any cascading here. The cascading logic should belong somewhere else (e.g. in the platform class, or in the place where it's actually being used)
I agree that the cascading logic is a bit much for the vllm.envs, but I don't think that the platforms class is really the right place for kernel selection logic. I'd prefer to keep all of these environment variable checks down in the "layer" level where we are actually selecting kernels.
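To make the layer-level alternative concrete, a gate along these lines could live next to the kernel-selection code; the function name and the assumption that vllm.envs exposes these variables as plain booleans are illustrative, not the PR's actual implementation.

```python
# Hypothetical layer-level gate (illustrative name; assumes vllm.envs exposes
# these variables as plain booleans instead of cascading them in envs.py).
import vllm.envs as envs


def use_aiter_fused_moe(block_scaled_fp8: bool) -> bool:
    if not envs.VLLM_ROCM_USE_AITER:  # parent switch, off by default
        return False
    if block_scaled_fp8:
        # The block-scaled fp8 path is opt-in on top of the parent switch.
        return envs.VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE
    return envs.VLLM_ROCM_USE_AITER_MOE  # on by default once AITER is enabled
```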
@DarkLight1337 @SageMoore The suggestions above have been addressed in this commit.
I have two high-level requests for this PR. The first is that we remove AITER enablement in any unit test that does not exercise this kernel. It's important that we have a good understanding of where this kernel is being unit tested, and that's hard to figure out in this PR's current state. The second is that you include lm_eval results for any models that should be supported by this kernel. It sounds like that's just Deepseek V3 and Mixtral? Regardless, we need to make sure that accuracy is maintained with those models before we merge. Thank you so much for the contribution and for working with us to get this merged. We are very excited about the Deepseek performance improvements!
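As a hedged illustration of how such lm_eval numbers might be collected, the snippet below uses the lm-evaluation-harness Python API (v0.4+); the task choice, model arguments, and tensor-parallel size are assumptions for the example, not the settings used in this PR.

```python
# Example accuracy check with lm-evaluation-harness against the vLLM backend.
# Task, model args, and parallelism below are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=("pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,"
                "tensor_parallel_size=8"),
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```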
Hi @SageMoore, can we prioritize merging this PR ASAP? This is a very important feature. Thanks.
SageMoore
left a comment
This looks reasonable to me. Thanks for cleaning up the tests and running lm_eval.
This pull request has merge conflicts that must be resolved before it can be merged.
DarkLight1337
left a comment
Stamp
This PR integrates fused MoE kernels from AITER (AI Tensor Engine for ROCm).
Several fused MoE kernels have been integrated for different scenarios:
- The `ck_moe` kernel from AITER is integrated for unquantized model weights. It is enabled by default when `VLLM_ROCM_USE_AITER=1` is set. It can be specifically enabled or disabled using the dedicated environment variable `VLLM_ROCM_USE_AITER_MOE`. This is suitable for MoE models such as Mixtral.
- The `asm_moe` kernel from AITER is integrated for dynamic per-tensor quantization model weights. It is enabled by default when `VLLM_ROCM_USE_AITER=1` is set. It can be specifically enabled or disabled using the dedicated environment variable `VLLM_ROCM_USE_AITER_MOE`. This is suitable for MoE models such as Mixtral with fp8 quantization.
- The `fmoe_fp8_block_scaled` kernel from AITER is integrated for the block fp8 quantization method. Unlike the above features, this is disabled by default even when the parent switch (`VLLM_ROCM_USE_AITER=1`) is enabled. To use this kernel, both the parent switch and its dedicated environment variable `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE` must be enabled. This kernel is suitable for DeepSeek models.

These MoE kernels are integrated in `/vllm/model_executor/layers/fused_moe/fused_moe.py`. The necessary processing steps required for these kernels are included in their respective MoE methods: for unquantized weights (`UnquantizedMoEMethod`) in `/vllm/model_executor/layers/fused_moe/layer.py` and for FP8-quantized weights (`FP8MoEMethod`) in `/vllm/model_executor/layers/quantization/fp8.py`.
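To summarize the scenario-to-kernel mapping described above, here is a rough sketch of the selection logic it implies; the function, the quantization labels, and the `default_fused_moe` fallback name are illustrative assumptions, not the PR's actual dispatch code.

```python
# Illustrative mapping from weight/quantization scenario to the AITER kernel
# described above (names are placeholders, not vLLM's real dispatch code).
import os
from typing import Optional


def _on(name: str, default: str) -> bool:
    return os.getenv(name, default).lower() in ("true", "1")


def select_fused_moe_kernel(quant: Optional[str]) -> str:
    if not _on("VLLM_ROCM_USE_AITER", "False"):
        return "default_fused_moe"  # non-AITER fallback path
    if quant == "fp8_block_scaled":
        # Disabled by default; needs its own opt-in on top of the parent switch.
        if _on("VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE", "false"):
            return "fmoe_fp8_block_scaled"  # DeepSeek-style block fp8
        return "default_fused_moe"
    if not _on("VLLM_ROCM_USE_AITER_MOE", "True"):
        return "default_fused_moe"
    if quant == "fp8_per_tensor":
        return "asm_moe"  # dynamic per-tensor fp8 (e.g. Mixtral fp8)
    return "ck_moe"  # unquantized weights (e.g. Mixtral fp16)
```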
Performance Improvement Tables

Mixtral-8x7B-FP8
Mixtral-8x7B-FP16
DeepSeekV3 Throughput
DeepSeekV3 Latency
AITER Operations Testing Overview
1. High-Level Integration Tests
The integration of AITER ops is tested at a higher module level in the following files under `/tests/models/decoder_only/language`: `test_models.py` and `test_mistral.py`. These tests involve running various models to ensure overall functionality.
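The on/off switching pattern behind these tests could look roughly like the pytest sketch below; the test name and body are illustrative, not the PR's actual test code.

```python
# Hypothetical sketch of toggling the AITER switch in a model test
# (structure and names are illustrative only).
import pytest


@pytest.mark.parametrize("use_rocm_aiter", [True, False])
def test_model_with_and_without_aiter(use_rocm_aiter: bool, monkeypatch):
    # Set the parent switch before the engine is built so the AITER fused-MoE
    # path is exercised in one parametrization and skipped in the other.
    monkeypatch.setenv("VLLM_ROCM_USE_AITER", "1" if use_rocm_aiter else "0")
    # ... construct the model runner here and compare outputs to a reference ...
```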
2. AITER MoE Specific Test
`/tests/kernels/test_moe.py`

3. Quantization Testing

`/tests/quantization/test_fp8.py`

4. Kernel Function Dispatch Testing

`/tests/model_executor/test_enabled_custom_ops.py`

lm_eval results
mistralai/Mixtral-8x7B-Instruct-v0.1
mistralai/Mixtral-8x22B-Instruct-v0.1
Deepseek-V3