Optimize device_select, device_partition for MI3xx multi-stream performance#1012
Closed
stanleytsang-amd wants to merge 4 commits into
Conversation
Contributor
Author
Closing in favour of #1041
stanleytsang-amd
added a commit
that referenced
this pull request
Aug 2, 2025
…rmance (#1041) Closed #1012 because the branch name was in the wrong format. Credit goes to @Naraenda and @NB4444 for writing the original fix for 6.4. Under certain conditions, device_select and device_partition can slow down on MI3xx GPUs when running on multiple streams. This fix mitigates the slowdown by using atomic counters instead of flat block IDs to assign work. To preserve performance when the user knows in advance that multiple streams are not in use, or on non-MI3xx architectures, this optimization is opt-in only, via a template argument.
xiaohuguo2023
pushed a commit
to xiaohuguo2023/rocm-libraries
that referenced
this pull request
Aug 3, 2025
Swathi9494
pushed a commit
that referenced
this pull request
Aug 5, 2025
…rmance (#1041) Closed #1012 because the branch name was in the wrong format. Credit goes to @Naraenda and @NB4444 for writing the original fix for 6.4. Under certain conditions, device_select and device_partition can slow down on MI3xx GPUs when running on multiple streams. This fix mitigates the slowdown by using atomic counters instead of flat block IDs to assign work. To preserve performance when the user knows in advance that multiple streams are not in use, or on non-MI3xx architectures, this optimization is opt-in only, via a template argument.
assistant-librarian Bot
pushed a commit
that referenced
this pull request
Aug 21, 2025
This PR adds a missing header file that is required to compile rocsolver in debug mode.
ammallya
pushed a commit
that referenced
this pull request
Sep 18, 2025
This PR adds a missing header file that is required to compile rocsolver in debug mode. [ROCm/rocSOLVER commit: 1b07f97]
ammallya
pushed a commit
that referenced
this pull request
Sep 26, 2025
This PR adds a missing header file that is required to compile rocsolver in debug mode. (cherry picked from commit 1b07f97)
ammallya
pushed a commit
that referenced
this pull request
Sep 26, 2025
Credit goes to @Naraenda and @NB4444 for writing the original fix for 6.4.
Under certain conditions, device_select and device_partition can slow down on MI3xx GPUs when running on multiple streams. This fix mitigates the slowdown by using atomic counters instead of flat block IDs to assign work. To preserve performance when the user knows in advance that multiple streams are not in use, or on non-MI3xx architectures, this optimization is opt-in only, via a template argument.
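
To make the difference between the two assignment schemes concrete, here is a minimal host-side C++ sketch. It is not the rocPRIM implementation: `assign_tiles`, `predecessors_start_first`, and the `UseAtomicTileIds` switch are hypothetical names used only for illustration, and the loop over `start_order` merely simulates the order in which blocks begin executing. The sketch also assumes a motivation that the description above does not spell out, namely that a tile may need to wait on the tile before it, so it helps if the block holding the earlier tile is already running.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical opt-in switch, standing in for the template argument
// mentioned above. The real parameter name in rocPRIM may differ.
template <bool UseAtomicTileIds>
std::vector<unsigned int> assign_tiles(const std::vector<unsigned int>& start_order)
{
    // start_order[k] is the flat block ID of the k-th block to begin executing.
    std::atomic<unsigned int> next_tile{0};
    std::vector<unsigned int> tile_of_block(start_order.size());
    for (unsigned int flat_block_id : start_order)
    {
        assert(flat_block_id < start_order.size());
        tile_of_block[flat_block_id] = UseAtomicTileIds
            ? next_tile.fetch_add(1u, std::memory_order_relaxed) // ticket taken when the block starts
            : flat_block_id;                                      // fixed when the kernel is launched
    }
    return tile_of_block;
}

// Returns true if, for every tile, the block owning the previous tile
// started no later than the block owning this tile.
bool predecessors_start_first(const std::vector<unsigned int>& start_order,
                              const std::vector<unsigned int>& tile_of_block)
{
    std::vector<std::size_t> start_pos_of_tile(start_order.size());
    for (std::size_t k = 0; k < start_order.size(); ++k)
        start_pos_of_tile[tile_of_block[start_order[k]]] = k;
    for (std::size_t t = 1; t < start_pos_of_tile.size(); ++t)
        if (start_pos_of_tile[t - 1] > start_pos_of_tile[t])
            return false;
    return true;
}
```

With a simulated start order of `{2, 0, 3, 1}`, the flat scheme hands tile 2 to the first block that runs even though tiles 0 and 1 have not started, whereas the atomic scheme hands out tiles 0, 1, 2, 3 in start order. The trade-off is one extra atomic operation per block, which is consistent with the optimization being opt-in.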