[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE#35180

Merged

gshtras merged 11 commits into vllm-project:main from ROCm:fused_aiter_triton_rope_kvcache on Feb 25, 2026

Conversation

@Rohan138 (Contributor) commented Feb 24, 2026

Purpose

Follow-up to #25135 and #33443 to fix the AITER RoPE custom op, enable it by default, and enable RoPE+KVCache fusion. During prefill (q.shape[0] > 256) we now use the unfused AITER RoPE kernel instead of the vLLM-native custom op, which gives another ~1% uplift on gpt-oss.

This is also a mild prerequisite for MLA RoPE+KVCache fusion on ROCm, since there is currently no vLLM-native custom op for DeepseekScalingRotaryEmbedding.
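The prefill/decode dispatch described above can be sketched as follows. This is an illustrative sketch only; the function name, the `fusion_enabled` parameter, and the kernel-name strings are hypothetical, not vLLM's actual API. Only the threshold (256 tokens) comes from the PR description.

```python
# Illustrative sketch (names are hypothetical, not vLLM's actual code) of the
# dispatch described above: large token batches (prefill) take the unfused
# AITER RoPE kernel, small ones (decode) can take the fused RoPE+KVCache path.
PREFILL_TOKEN_THRESHOLD = 256  # q.shape[0] > 256 is treated as prefill

def select_rope_kernel(num_tokens: int, fusion_enabled: bool) -> str:
    """Pick a RoPE kernel variant based on the number of query tokens."""
    if num_tokens > PREFILL_TOKEN_THRESHOLD:
        # Prefill: the unfused AITER RoPE kernel is used.
        return "aiter_rope_unfused"
    # Decode: the fused RoPE+KVCache kernel applies when fusion is enabled.
    return "aiter_rope_kvcache_fused" if fusion_enabled else "aiter_rope_unfused"
```

For example, a 1024-token prefill batch would select the unfused kernel even with fusion enabled, while an 8-token decode batch would select the fused one.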

Test Plan

Before:

============ Serving Benchmark Result ============
Successful requests:                     320       
Benchmark duration (s):                  118.01    
Total input tokens:                      294646    
Total generated tokens:                  295581    
Request throughput (req/s):              2.71      
Output token throughput (tok/s):         2504.76   
Total Token throughput (tok/s):          5001.60  

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.90|±  |0.0213|
|     |       |strict-match    |     5|exact_match|↑  | 0.17|±  |0.0266|

Test Result

After:

============ Serving Benchmark Result ============
Successful requests:                     320       
Benchmark duration (s):                  116.47    
Total input tokens:                      294646    
Total generated tokens:                  295581    
Request throughput (req/s):              2.75      
Output token throughput (tok/s):         2537.77   
Total Token throughput (tok/s):          5067.52

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.90|±  |0.0213|
|     |       |strict-match    |     5|exact_match|↑  | 0.19|±  |0.0278|


Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
@gemini-code-assist bot left a comment

Code Review

This pull request enables the AITER RoPE custom op and RoPE+KVCache fusion for ROCm. The changes look good overall, enabling the new custom op and updating relevant parts of the code, including tests and configuration. However, I found a critical issue in the configuration that prevents the new RoPE+KVCache fusion from being enabled by default, which contradicts the goal of this PR. Please see the detailed comment.

Comment on lines +112 to +115

    return (
        cfg.compilation_config.is_custom_op_enabled("rotary_embedding")
        and cfg.compilation_config.use_inductor_graph_partition
    )

critical

The condition `and cfg.compilation_config.use_inductor_graph_partition` causes this function to return False at the default optimization levels (O2, O3), where use_inductor_graph_partition is set to False. This effectively disables the fuse_rope_kvcache feature this pull request intends to enable. To ensure the fusion is enabled by default as intended, this condition should be removed:

    return cfg.compilation_config.is_custom_op_enabled("rotary_embedding")


mergify bot commented Feb 24, 2026

Hi @Rohan138, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint


    return (
        rocm_aiter_ops.is_enabled()
        and cfg.compilation_config.is_custom_op_enabled("rotary_embedding")
        and cfg.compilation_config.use_inductor_graph_partition
    )

Collaborator:

So are you guys using inductor graph partition on ROCm by default? Otherwise we should also return true here if using the dynamo partition and the kv cache op is not in splitting_ops.

@Rohan138 (Contributor, Author) replied Feb 24, 2026:

(Somehow GH ate my original PR comment that explained this)

This PR is necessary but not sufficient to actually enable this fusion by default. We also need:

return true if dynamo partition and kv cache op not in splitting ops

https://github.com/vllm-project/vllm/blob/main/vllm/config/compilation.py#1001 is called in https://github.com/vllm-project/vllm/blob/main/vllm/config/vllm.py#L961, after the defaults are set in https://github.com/vllm-project/vllm/blob/main/vllm/config/vllm.py#L807. So if inductor partition is not enabled, this check would return true, and we would then append the kv cache op to splitting_ops, which would silently break the fusion.
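The ordering problem described above can be sketched as follows. Everything here is illustrative: the function name, the op string, and the return-value shape are hypothetical stand-ins for vLLM's config flow, used only to show why the check can pass and then be silently invalidated.

```python
# Illustrative sketch (hypothetical names) of the ordering problem above:
# the fusion check runs before the splitting-op defaults are appended, so it
# can return True and then be invalidated when the kv cache op is added.
KV_CACHE_OP = "kv_cache_update"  # stand-in name for the kv cache custom op

def finalize_config(splitting_ops: list, use_inductor_partition: bool) -> bool:
    # Step 1: the fusion check runs here, before defaults are applied.
    fuse_ok = True
    # Step 2: defaults are set afterwards; without inductor partition,
    # the kv cache op gets appended to splitting_ops.
    if not use_inductor_partition:
        splitting_ops.append(KV_CACHE_OP)
    # Step 3: the graph is now split at the kv cache op, so the earlier
    # "fusion is OK" answer no longer holds; fusion silently breaks.
    return fuse_ok and (KV_CACHE_OP not in splitting_ops)
```

Running this with `use_inductor_partition=False` shows the check passing at step 1 but the fusion ending up broken by step 3.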

Collaborator:

Links are broken, but I know what you mean. However, if splitting_ops=[] is passed, the kv cache op won't be added, so it should still work. So this check should be: if inductor_partition or len(splitting_ops) == 0.
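The check proposed above could be sketched like this. Note this is not the merged code; the function and parameter names are hypothetical, and it only encodes the rule stated in the comment: allow the fusion when inductor graph partition is on, or when the user explicitly passed an empty splitting_ops list.

```python
# Illustrative sketch (hypothetical names) of the proposed gating rule:
# fuse when inductor graph partition is enabled, or when the user explicitly
# passed splitting_ops=[] so the kv cache op will never be split out.
def can_fuse_rope_kvcache(use_inductor_graph_partition: bool,
                          splitting_ops) -> bool:
    if use_inductor_graph_partition:
        return True
    # None models "not set by the user" (defaults will be appended later),
    # while an explicit empty list suppresses the default splitting ops.
    return splitting_ops is not None and len(splitting_ops) == 0
```

Under this rule, an unset splitting_ops without inductor partition still disables the fusion, while an explicit empty list enables it.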

@gshtras gshtras enabled auto-merge (squash) February 24, 2026 22:57
@github-actions github-actions bot added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) Feb 24, 2026
@gshtras gshtras merged commit f38f8c9 into vllm-project:main Feb 25, 2026
67 of 68 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Feb 25, 2026
tom-zju pushed a commit to tom-zju/vllm that referenced this pull request Feb 26, 2026
@Rohan138 Rohan138 deleted the fused_aiter_triton_rope_kvcache branch February 27, 2026 21:19
flutist pushed a commit to flutist/vllm_custom_dataset_img_support_base64 that referenced this pull request Feb 28, 2026
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026

tjtanaa commented Mar 5, 2026

@Rohan138 is the lm eval score from DeepSeek-R1? The GSM8K score should be at 0.95. 0.9 seems low.

askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
Labels

ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)
