Create tensor descriptor inside kernel to improve performance of small/tiny GEMM cases #596
Closed
carlushuang wants to merge 16 commits into
Contributor: @carlushuang Could you measure the kernel argument size with this improvement?
asroy reviewed on Mar 2, 2023
asroy reviewed on Mar 3, 2023
Contributor (Author): @zjing14 Updated in the PR description.
* clean up
* fast gelu using builtin function
* clean
* clean
* clean
* clean:
* clean
* fix compilation
* clean
* clean

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>

* fix a bug blocking wmma_gemm_multipleD
* Utilize matrix padder in device_wmma_op
* cosmetic change for gemmpadding format
* clang format
* Change gridwise gemm from FIFO to KMN loop fashion

* suppress the reserved-identifier warnings
* keep BUILD_DEV=On and use -Werror by default

* add new parallel stage on navi node
* dont run performance tests on navi, get rid of 9110 compiler
* only run navi build when not doing QA
* fix syntax
* use navi21 label
* dont stash profiler on navi nodes, scp deb package to ginger
* disable tests on navi nodes
* test posting a binary to ginger
* add sshpass and use it to copy deb package
* fix the scp example
* fix syntax
* debug the scp issues
* add jenkins user to docker
* dont try whoami
* change jenkins uid and add user with uid=1002
* try scp from the last stage on micimaster
* rename and stash the package, scp from micimaster
Contributor (Author): Depends on #616 to fix the atomic bug.
asroy suggested changes on Mar 15, 2023
asroy (Contributor) left a comment:
This PR needs to be refactored:
- Move the file to the correct location (device op and gridwise op).
- Update the existing gridwise and device ops instead of adding new ones.
- Update the example to use the new device op and gridwise op. There is no need to add a new "simplified_karg" example.
https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/develop/example/35_splitK_gemm
Contributor (Author): Closing this PR since it has been refactored into #644.
CK_EXPERIMENTAL_USE_BUFFER_ATOMIC_ADD_OOB_CHECK_OFFSET_TRICK has a bug when the out-of-bounds (OOB) check is used with atomic add, so it has to be set to 0 (see #616).
For 2D GEMM, the original kernel argument (karg) is 292 bytes; the simplified karg is 68 bytes.
Up to 10%+ performance improvement on small/tiny GEMM cases.
