[Perf] Optimize reshape_and_cache CUDA Kernel
#25955
Signed-off-by: zjy0516 <[email protected]>
Code Review
This pull request optimizes the reshape_and_cache CUDA kernel by vectorizing the key cache update. The change splits the main loop, creating a separate, vectorized loop for key updates to achieve coalesced memory access, while the value update logic remains in a separate loop. This is a sound optimization strategy. My review found one issue: an unnecessary header file is included, which should be removed to improve code hygiene and reduce dependencies.
yewentao256
left a comment
Thanks for the work!
Could you also add more test results like #22036?
Unit test, model eval, kernel perf comparison etc.
We already have a unit test in
And do you have any suggestions for further optimization? I'm not sure if this is sufficient, and I'm keen to explore any potential improvements you might see.
@ZJY0516 I used
Don't worry about perf too much for now. First set up the pipeline to validate correctness and measure performance; later experiments will be much easier.
cc @Liu-congo
Awesome! Thank you so much!
yewentao256
left a comment
Looks good to me, thanks for the work!
Signed-off-by: zjy0516 <[email protected]> Co-authored-by: Liu-congo <[email protected]> Signed-off-by: yewentao256 <[email protected]>
Signed-off-by: zjy0516 <[email protected]> Co-authored-by: Liu-congo <[email protected]> Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: zjy0516 <[email protected]> Co-authored-by: Liu-congo <[email protected]> Signed-off-by: Karan Goel <[email protected]>
Signed-off-by: zjy0516 <[email protected]> Co-authored-by: Liu-congo <[email protected]>
Signed-off-by: zjy0516 <[email protected]> Co-authored-by: Liu-congo <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Purpose
FIX #25705
Optimize reshape_and_cache CUDA Kernel.
- Separate key/value loops: allows specialized indexing for each
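To illustrate why splitting the loops helps, here is a minimal host-side C++ sketch of the index arithmetic. It is not the kernel from this PR. It assumes a key cache laid out as [num_blocks, num_heads, head_size/x, block_size, x] and a value cache laid out as [num_blocks, num_heads, head_size, block_size]. The `CacheShape` struct and both function names are hypothetical, `float` stands in for the real element type, and `std::memcpy` of `x` elements stands in for the vectorized load/store a CUDA kernel would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical shape descriptor (assumed layout, not taken from the PR).
struct CacheShape {
  int num_heads;
  int head_size;
  int block_size;
  int x;  // contiguous elements in the innermost key-cache dimension
};

// Start of the x-wide key-cache chunk for (block, head, x_idx, block_offset).
static std::size_t key_chunk_base(const CacheShape& s, int block_idx, int head,
                                  int x_idx, int block_offset) {
  std::size_t idx = static_cast<std::size_t>(block_idx) * s.num_heads + head;
  idx = idx * (s.head_size / s.x) + x_idx;
  idx = idx * s.block_size + block_offset;
  return idx * s.x;
}

// Value-cache element for (block, head, head_offset, block_offset).
static std::size_t value_index(const CacheShape& s, int block_idx, int head,
                               int head_offset, int block_offset) {
  std::size_t idx = static_cast<std::size_t>(block_idx) * s.num_heads + head;
  idx = idx * s.head_size + head_offset;
  return idx * s.block_size + block_offset;
}

// Fused form: one loop, one scalar write to each cache per iteration.
void reshape_and_cache_fused(const std::vector<float>& key,
                             const std::vector<float>& value,
                             std::vector<float>& key_cache,
                             std::vector<float>& value_cache,
                             const CacheShape& s, int slot) {
  const int block_idx = slot / s.block_size;
  const int block_offset = slot % s.block_size;
  const int n = s.num_heads * s.head_size;
  for (int i = 0; i < n; ++i) {
    const int head = i / s.head_size;
    const int off = i % s.head_size;
    const std::size_t k_dst =
        key_chunk_base(s, block_idx, head, off / s.x, block_offset) + off % s.x;
    key_cache[k_dst] = key[i];
    value_cache[value_index(s, block_idx, head, off, block_offset)] = value[i];
  }
}

// Split form: the key loop moves x contiguous elements per iteration,
// while the value loop keeps its own scalar, strided indexing.
void reshape_and_cache_split(const std::vector<float>& key,
                             const std::vector<float>& value,
                             std::vector<float>& key_cache,
                             std::vector<float>& value_cache,
                             const CacheShape& s, int slot) {
  assert(s.head_size % s.x == 0);
  const int block_idx = slot / s.block_size;
  const int block_offset = slot % s.block_size;
  const int n = s.num_heads * s.head_size;
  const int chunks_per_head = s.head_size / s.x;

  // Key loop: source and destination are both contiguous over x elements.
  for (int c = 0; c < n / s.x; ++c) {
    const int head = c / chunks_per_head;
    const int x_idx = c % chunks_per_head;
    const std::size_t k_dst =
        key_chunk_base(s, block_idx, head, x_idx, block_offset);
    std::memcpy(&key_cache[k_dst],
                &key[static_cast<std::size_t>(c) * s.x],
                sizeof(float) * s.x);
  }

  // Value loop: destinations are block_size elements apart, so stay scalar.
  for (int i = 0; i < n; ++i) {
    const int head = i / s.head_size;
    const int off = i % s.head_size;
    value_cache[value_index(s, block_idx, head, off, block_offset)] = value[i];
  }
}
```

Under these assumptions both functions write identical cache contents. The difference is that the key loop in the split form touches `x` adjacent elements per iteration, which is the access pattern a vectorized store can exploit, whereas consecutive value writes land `block_size` elements apart.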
Test
gsm8k
Performance
test on L40