
Conversation

@vllmellm (Contributor) commented Mar 17, 2025

This PR integrates fused MoE kernels from AITER (AI Tensor Engine for ROCm).

Several fused MoE kernels have been integrated for different scenarios:

  1. The ck_moe kernel from AITER is integrated for unquantized model weights. It is enabled by default when VLLM_ROCM_USE_AITER=1 is set. It can be specifically enabled or disabled using the dedicated environment variable VLLM_ROCM_USE_AITER_MOE. This is suitable for MoE models such as Mixtral.

  2. The asm_moe kernel from AITER is integrated for model weights using dynamic per-tensor quantization. It is enabled by default when VLLM_ROCM_USE_AITER=1 is set and can be specifically enabled or disabled using the same dedicated environment variable, VLLM_ROCM_USE_AITER_MOE. This is suitable for FP8-quantized MoE models such as Mixtral.

  3. The fmoe_fp8_block_scaled kernel from AITER is integrated for the block FP8 quantization method. Unlike the features above, it is disabled by default even when the parent switch (VLLM_ROCM_USE_AITER=1) is enabled. To use this kernel, both the parent switch and its dedicated environment variable VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE must be enabled. This kernel is suitable for DeepSeek models.

These MoE kernels are integrated in /vllm/model_executor/layers/fused_moe/fused_moe.py. The necessary processing steps required for these kernels are included in their respective MoE Methods for both Unquantized (UnquantizedMoEMethod) in /vllm/model_executor/layers/fused_moe/layer.py and FP8 quantized (FP8MoEMethod) in /vllm/model_executor/layers/quantization/fp8.py.
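
To make the switching behaviour above concrete, here is a minimal sketch (not the exact code from this PR) of how the flag cascading and kernel selection fit together; `_env_flag`, `use_aiter_moe`, `use_aiter_block_fp8_moe`, and `dispatch_fused_moe` are hypothetical helper names used only for illustration:

```python
import os


def _env_flag(name: str, default: str) -> bool:
    """Interpret an environment variable as a boolean switch."""
    return os.getenv(name, default).lower() in ("true", "1")


def use_aiter_moe() -> bool:
    # Child switch defaults to on, but only matters when the parent
    # VLLM_ROCM_USE_AITER switch is enabled.
    return (_env_flag("VLLM_ROCM_USE_AITER", "False")
            and _env_flag("VLLM_ROCM_USE_AITER_MOE", "True"))


def use_aiter_block_fp8_moe() -> bool:
    # Disabled by default even when VLLM_ROCM_USE_AITER=1.
    return (_env_flag("VLLM_ROCM_USE_AITER", "False")
            and _env_flag("VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE", "false"))


def dispatch_fused_moe(weights_are_quantized: bool,
                       uses_block_fp8: bool) -> str:
    """Pick a fused-MoE backend name from the flags above."""
    if uses_block_fp8 and use_aiter_block_fp8_moe():
        return "aiter.fmoe_fp8_block_scaled"  # e.g. DeepSeek-V3
    if use_aiter_moe():
        # asm_moe covers per-tensor FP8 weights, ck_moe unquantized weights.
        return "aiter.asm_moe" if weights_are_quantized else "aiter.ck_moe"
    return "triton_fused_moe"  # default non-AITER path
```

In the actual change, this selection happens inside fused_moe.py and the MoE method classes listed above rather than in standalone helpers like these.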

Performance Improvement Tables

Mixtral-8x7B-FP8

| Summary | Performance Improvement Over No AITER |
|---|---|
| With Fused MoE | -14% to 75% |

Mixtral-8x7B-FP16

| Summary | Performance Improvement Over No AITER |
|---|---|
| With Fused MoE | -11% to 2% |

DeepSeekV3 Throughput

| Summary | Performance Improvement Over No AITER |
|---|---|
| fmoe_fp8_block_scaled (VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=1) | 8% to 26.7% |

DeepSeekV3 Latency

| Summary | SpeedUp in TPOT | SpeedUp in TTFT |
|---|---|---|
| fmoe_fp8_block_scaled (VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE=1) | -2% | 41% |

AITER Operations Testing Overview

1. High-Level Integration Tests

The integration of AITER ops is tested at a higher module level in the following files under /tests/models/decoder_only/language:

  • test_models.py
  • test_mistral.py

These tests involve running various models to ensure overall functionality.

2. AITER MoE Specific Test

  • The AITER Mixture of Experts (MoE) is specifically tested for the Mixtral model in:
    /tests/kernels/test_moe.py

3. Quantization Testing

  • Quantization methods for AITER-enabled modules are tested in:
    /tests/quantization/test_fp8.py

4. Kernel Function Dispatch Testing

  • The correct dispatching of kernel functions (AITER-enabled or not) is verified in:
    /tests/model_executor/test_enabled_custom_ops.py
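
For a rough idea of what the dispatch test in section 4 asserts, here is a hedged pytest-style sketch; it targets the hypothetical `dispatch_fused_moe` helper from the sketch earlier in this description (assumed to live in a module named `moe_dispatch_sketch`), not the real code in test_enabled_custom_ops.py:

```python
import pytest

# Hypothetical module name; see the dispatch sketch earlier in this description.
from moe_dispatch_sketch import dispatch_fused_moe


@pytest.mark.parametrize("aiter, moe, expected", [
    ("0", "1", "triton_fused_moe"),  # parent switch off -> no AITER
    ("1", "1", "aiter.ck_moe"),      # both switches on -> AITER ck_moe
    ("1", "0", "triton_fused_moe"),  # child switch off -> fallback
])
def test_unquantized_moe_dispatch(monkeypatch, aiter, moe, expected):
    monkeypatch.setenv("VLLM_ROCM_USE_AITER", aiter)
    monkeypatch.setenv("VLLM_ROCM_USE_AITER_MOE", moe)
    assert dispatch_fused_moe(weights_are_quantized=False,
                              uses_block_fp8=False) == expected
```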

lm_eval results

mistralai/Mixtral-8x7B-Instruct-v0.1

| Tasks | Version | Filter | n-shot | Metric | Quantization | Value (Without AITER) | Stderr (Without AITER) | Value (With AITER) | Stderr (With AITER) |
|---|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | Unquantized | 0.6338 | ±0.0133 | 0.6475 | ±0.0132 |
| gsm8k | 3 | strict-match | 5 | exact_match ↑ | Unquantized | 0.6315 | ±0.0133 | 0.6437 | ±0.0132 |
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | FP8 | 0.6399 | ±0.0132 | 0.6376 | ±0.0132 |
| gsm8k | 3 | strict-match | 5 | exact_match ↑ | FP8 | 0.6353 | ±0.0133 | 0.6323 | ±0.0133 |

mistralai/Mixtral-8x22B-Instruct-v0.1

| Tasks | Version | Filter | n-shot | Metric | Quantization | Value (Without AITER) | Stderr (Without AITER) | Value (With AITER) | Stderr (With AITER) |
|---|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | Unquantized | 0.8544 | ±0.0097 | 0.8522 | ±0.0098 |
| gsm8k | 3 | strict-match | 5 | exact_match ↑ | Unquantized | 0.8415 | ±0.0101 | 0.8415 | ±0.0101 |
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | FP8 | 0.8506 | ±0.0098 | 0.8552 | ±0.0097 |
| gsm8k | 3 | strict-match | 5 | exact_match ↑ | FP8 | 0.8378 | ±0.0102 | 0.8469 | ±0.0099 |

Deepseek-V3

| Tasks | Version | Filter | n-shot | Metric | Value (Without AITER) | Stderr (Without AITER) | Value (With AITER) | Stderr (With AITER) |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.9469 | ±0.0062 | 0.9492 | ±0.0060 |
| gsm8k | 3 | strict-match | 5 | exact_match ↑ | 0.9477 | ±0.0061 | 0.9484 | ±0.0061 |

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@vllmellm vllmellm marked this pull request as ready for review March 18, 2025 04:28
…o that the models unit tests would be triggered when aiter envs are switched on and off

Signed-off-by: vllmellm <[email protected]>
vllm/envs.py Outdated
Comment on lines 535 to 547
"VLLM_ROCM_USE_AITER_MOE":
lambda:
(os.getenv("VLLM_ROCM_USE_AITER", "False").lower() in
("true", "1") and os.getenv("VLLM_ROCM_USE_AITER_MOE", "True").lower() in
("true", "1")),

# use aiter block scaled moe op if aiter ops are enabled.
# by default this is disabled.
"VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE":
lambda:
(os.getenv("VLLM_ROCM_USE_AITER", "False").lower() in
("true", "1") and os.getenv("VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE",
"false").lower() in ("true", "1")),
@DarkLight1337 (Member) commented Mar 18, 2025

Let's keep vllm.envs simple by not doing any cascading here. The cascading logic should belong somewhere else (e.g. in the platform class, or in the place where it's actually being used)

Contributor

I agree that the cascading logic is a bit much for the vllm.envs, but I don't think that the platforms class is really the right place for kernel selection logic. I'd prefer to keep all of these environment variable checks down in the "layer" level where we are actually selecting kernels.
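
For illustration only, one way the cascading could sit at the layer level while vllm.envs stays flat is sketched below; `is_rocm_aiter_moe_enabled` and `is_rocm_aiter_block_fp8_moe_enabled` are hypothetical helper names, and the sketch assumes vllm.envs exposes each flag as an independent boolean:

```python
import vllm.envs as envs


def is_rocm_aiter_moe_enabled() -> bool:
    # Cascade here, next to the kernel selection, so vllm.envs can expose
    # each flag independently without combining them.
    return envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_MOE


def is_rocm_aiter_block_fp8_moe_enabled() -> bool:
    return (envs.VLLM_ROCM_USE_AITER
            and envs.VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE)
```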

@vllmellm (Contributor, Author) commented

@DarkLight1337 @SageMoore These have been addressed in this commit.

@SageMoore (Contributor) commented

I have two high level requests for this PR. The first is that we remove AITER enablement in any unit test that does not exercise this kernel. It's important that we have a good understanding of where this kernel is being unit tested and that's hard to figure out in this PR's current state. The second is that you include lm_eval results for any models that should be supported by this kernel. It sounds like that's just Deepseek V3 and Mixtral? Regardless, we need to make sure that accuracy is maintained with those models before we merge.

Thank you so much for the contribution and for working with us to get this merged. We are very excited about the Deepseek performance improvements!

@hongxiayang added the rocm (Related to AMD ROCm) label Mar 24, 2025

@hongxiayang (Collaborator) commented

Hi @SageMoore, can we prioritize merging this PR ASAP? It is a very important feature. Thanks.

@SageMoore (Contributor) left a comment

This looks reasonable to me. Thanks for cleaning up the tests and running lm_eval.

@mergify (bot) commented Mar 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vllmellm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 25, 2025
@hongxiayang added the ready (ONLY add when PR is ready to merge/full CI is needed) label Mar 25, 2025
@mergify mergify bot removed the needs-rebase label Mar 26, 2025
…o its default value which is false

Signed-off-by: vllmellm <[email protected]>
@DarkLight1337 (Member) left a comment

Stamp

@DarkLight1337 DarkLight1337 merged commit 5ebf667 into vllm-project:main Mar 26, 2025
40 checks passed
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Signed-off-by: Louis Ulmer <[email protected]>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
@tjtanaa tjtanaa deleted the aiter-fmoe-integration branch May 16, 2025 16:27
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)