[Model][Hardware][AMD]: Part 1/2 -> Enable e2e QK Norm + RoPE + KV Cache runtime fusion for Qwen3-30B-A3B on ROCM_AITER_FA, and ROCM_AITER_UNIFIED_ATTN #42749
Open: jhu960213 wants to merge 52 commits into vllm-project:main from jhu960213:jhu96/optimize-qwen30b-part1
Changes from 18 commits
Commits
e3ce8aa Add fused QK norm + RoPE + KV cache fusion infrastructure (jhu960213)
034e4de Add fused QK norm + RoPE + KV cache support for ROCM_AITER_FA and ROC… (jhu960213)
cc9e330 Add unit test for QK norm + RoPE + KV cache fusion pass (jhu960213)
803b694 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
5697b3a Fix matcher_utils.py merge conflict resolution (jhu960213)
f61f6a2 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
d7f4c57 Dropped hip_ convention for the fused kernel and removed auto enable … (jhu960213)
439e265 Revamped fused qk norm rope kvcacheactivation from compilation and pa… (jhu960213)
b96c04b Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
9eba884 Moved the caching of CPU scalar copies of k/v scales into the Attenti… (jhu960213)
14b30bb Fixed ruff auto staged (jhu960213)
74c9c6a Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
039055f Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
72801be Restaged ruff formatted file (jhu960213)
5b18c74 Removed MatcherRMSNorm from matcher utils since we use vllm ir forwar… (jhu960213)
f95b157 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
a96f91b Ruff strip trailing whitespace from merg confict resolution (jhu960213)
65d599f Clean up code (jhu960213)
ae4a653 Parametrized custom ops list in UT for qk norm rope kvcache (jhu960213)
322b9e9 Restaging ruff formatted (jhu960213)
8180635 Cleaned up more code (jhu960213)
32406f2 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
4da1ddf Removed the no ops from ROCM_ATTN for part1 and added to unified attn… (jhu960213)
2f45e60 Added partial rotary embedding support (jhu960213)
9ed3810 Added partial rotary embedding triage in qk norm rope cache unit test (jhu960213)
e3eeaa6 Removed comments etc (jhu960213)
b102dfc Renamed unit test to align with rocm aiter convention (jhu960213)
ea7dc41 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
45309aa Removed some more comments (jhu960213)
e4dc639 Some more cleanup (jhu960213)
93ac152 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
98b313a Parmetrized token num and shuffle write for this fusion and also lose… (jhu960213)
9b3d9ce Turned shuffle kv cache write off when qk norm + rope cache pts quant… (jhu960213)
b00eb44 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
d67effd Added another pattern and replacement logic for fp8 KV + ROCM_AITER_U… (jhu960213)
79492b7 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
4f813fd Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
d0f260d Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
0014ca6 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
f8fc70c Used correct per tensor quant call for both pattern and replacement f… (jhu960213)
4288bd4 Fixed fp8 quant query pattern matching logic for unified attn (jhu960213)
0f89797 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
b30d0cf Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
dcfc7f8 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
0d709e0 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (dllehr-amd)
66b4875 Added fix for kv scales device to host synching when model is loaded … (jhu960213)
5dc9a1a Added guard for this fusion to only fire when head dims are in the su… (jhu960213)
322d6ed Zero out q and k outs when fusion is skipped during profiling runs (jhu960213)
f9407b7 Added a guard to ensure that layers only fire with this fusion if the… (jhu960213)
c5ec548 Refactored this fusion to use a common helper func in _aiter_ops.py f… (jhu960213)
cfa1208 Fixed fp8 quant query pattern match issue in unified attention (jhu960213)
49e3d15 Merge branch 'main' into jhu96/optimize-qwen30b-part1 (jhu960213)
tests/compile/passes/test_qk_norm_rope_kvcache_fusion.py (430 changes: 430 additions & 0 deletions)
Does this need to be in qk_norm_rope_kvcache_fusion.py? Or does the fused kernel not have the head dim issue? Also, will this be restrictive for the existing qk_norm_rope use case? I figured we'd have hit an error by now if we were restricted to just the 3 head dim sizes.
So, for the QK Norm + RoPE fusion, this fusion pass uses the code at these locations:
- vllm/csrc/fused_qknorm_rope_kernel.cu, line 552 in 2d6b348
- vllm/vllm/_custom_ops.py, line 330 in 2d6b348
- vllm/csrc/fused_qknorm_rope_kernel.cu, line 591 in 2d6b348
I added SUPPORTED_FUSED_QK_NORM_ROPE_HEAD_DIMS so the pass skips pattern registration when head_size isn’t supported, instead of matching/replacing and then failing at runtime. Previously there was no pass-level guard; we only had the C++ switch. This doesn’t apply to fuse_qk_norm_rope_kvcache, since the AITER KV-cache fusion uses a different kernel.
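
To make the pass-level guard concrete, here is a minimal sketch of the idea, not the code in this PR. Only the name SUPPORTED_FUSED_QK_NORM_ROPE_HEAD_DIMS comes from the comment above; the three values in the set, the `PatternRegistry` class, and `maybe_register_qk_norm_rope` are hypothetical stand-ins chosen for illustration.

```python
# Illustrative sketch only. The set contents are ASSUMED placeholder values;
# the real constant lives in the PR and may differ.
SUPPORTED_FUSED_QK_NORM_ROPE_HEAD_DIMS = frozenset({64, 128, 256})


class PatternRegistry:
    """Hypothetical stand-in for the pass's pattern-matcher registry."""

    def __init__(self):
        self.registered = []

    def register(self, name, head_size):
        self.registered.append((name, head_size))


def maybe_register_qk_norm_rope(registry, head_size):
    """Register the fusion pattern only for supported head sizes.

    Returns True if the pattern was registered, False if it was skipped.
    Skipping here means the graph is left unfused, so an unsupported
    head size never reaches the kernel's own dispatch check.
    """
    if head_size not in SUPPORTED_FUSED_QK_NORM_ROPE_HEAD_DIMS:
        return False
    registry.register("qk_norm_rope", head_size)
    return True


registry = PatternRegistry()
results = {hs: maybe_register_qk_norm_rope(registry, hs) for hs in (64, 96, 128)}
```

With a guard of this shape, a layer with an unsupported head size simply keeps the unfused ops rather than being rewritten to a kernel call that would fail at runtime.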