Adding support for RDNA2 (gfx103X) cards#1629
Conversation
[composable_kernel] FAILED: library/src/tensor_operation_instance/gpu/quantization/CMakeFiles/device_quantization_instance.dir/conv2d_fwd/device_conv2d_dl_bias_perchannel_quantization_int8_instance.cpp.o
[composable_kernel] /therock/output/build/core/clr/dist/lib/llvm/bin/clang++ -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_GFX1030_SUPPORT -DCK_TILE_USE_WMMA=0 -DCK_TIME_KERNEL=1 -DDL_KERNELS -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -I/therock/src/ml-libs/composable_kernel/library/include -I/therock/src/ml-libs/composable_kernel/include -I/therock/output/build/ml-libs/composable_kernel/build/include -I/therock/output/build/profiler/roctracer/stage/include -I/therock/output/build/base/half/stage/include -isystem /therock/output/build/core/clr/dist/include -Wno-documentation-unknown-command -Wno-documentation-pedantic -Wno-unused-command-line-argument -Wno-explicit-specialization-storage-class --hip-path=/therock/output/build/core/clr/dist --hip-device-lib-path=/therock/output/build/core/clr/dist/lib/llvm/amdgcn/bitcode -O3 -DNDEBUG -std=c++20 -fPIC -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -Wno-nrvo -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-coerce-illegal-types=1 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -Werror -Weverything -fcolor-diagnostics --offload-compress -x hip --offload-arch=gfx1030 --offload-arch=gfx1031 --offload-arch=gfx1032 --offload-arch=gfx1033 --offload-arch=gfx1034 --offload-arch=gfx1035 --offload-arch=gfx1036 --offload-arch=gfx1030 --offload-arch=gfx1031 --offload-arch=gfx1032 --offload-arch=gfx1033 --offload-arch=gfx1034 --offload-arch=gfx1035 --offload-arch=gfx1036 -MD -MT library/src/tensor_operation_instance/gpu/quantization/CMakeFiles/device_quantization_instance.dir/conv2d_fwd/device_conv2d_dl_bias_perchannel_quantization_int8_instance.cpp.o -MF library/src/tensor_operation_instance/gpu/quantization/CMakeFiles/device_quantization_instance.dir/conv2d_fwd/device_conv2d_dl_bias_perchannel_quantization_int8_instance.cpp.o.d -o library/src/tensor_operation_instance/gpu/quantization/CMakeFiles/device_quantization_instance.dir/conv2d_fwd/device_conv2d_dl_bias_perchannel_quantization_int8_instance.cpp.o -c /therock/src/ml-libs/composable_kernel/library/src/tensor_operation_instance/gpu/quantization/conv2d_fwd/device_conv2d_dl_bias_perchannel_quantization_int8_instance.cpp
[composable_kernel] In file included from /therock/src/ml-libs/composable_kernel/library/src/tensor_operation_instance/gpu/quantization/conv2d_fwd/device_conv2d_dl_bias_perchannel_quantization_int8_instance.cpp:4:
[composable_kernel] In file included from /therock/src/ml-libs/composable_kernel/library/src/tensor_operation_instance/gpu/quantization/conv2d_fwd/device_conv2d_dl_int8_instance.hpp:7:
[composable_kernel] In file included from /therock/src/ml-libs/composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_fwd_dl_multiple_d_nhwc_kyxc_nhwk.hpp:12:
[composable_kernel] In file included from /therock/src/ml-libs/composable_kernel/include/ck/utility/common_header.hpp:37:
[composable_kernel] /therock/src/ml-libs/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:32:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
[composable_kernel] 32 | wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
[composable_kernel] | ^
[composable_kernel] /therock/src/ml-libs/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
[composable_kernel] 47 | wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
[composable_kernel] | ^
[composable_kernel] /therock/src/ml-libs/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:60:22: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
[composable_kernel] 60 | auto flags = CK_BUFFER_RESOURCE_3RD_DWORD;
[composable_kernel] | ^
[composable_kernel] /therock/src/ml-libs/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:72:22: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
[composable_kernel] 72 | auto flags = CK_BUFFER_RESOURCE_3RD_DWORD;
[composable_kernel] | ^
[composable_kernel] 4 errors generated when compiling for gfx1033.I wanted to add the gfx1033 and gfx1034 alongside these but currently the build fails when building composable_kernel for the gfx1033. Will do more testing locally on these architectures. edit: Added the gfx1033 and gfx1034, disabled ck in the gfx1033 build. |
Probably need to add gfx1033 here at the minimum: |
Eh, all these older RNDA2 cards apart from the flagship gfx1030 have lackluster support for ROCm. It was probably just forgotten, the gfx1031 is the most salient target barring the gfx1030 and even it was excluded. Though the gfx1033 is different in that it doesn't support composable kernel currently. Not sure if anyone will bother to fix it though, it's an igpu and an old one at that. |
Also, regarding the Since all RNDA2 architectures have the same instruction set, I think this is fine for the most part. However, I'm not sure if these two values: |
|
I can see where the 63 for MaxVmcnt comes from in the ISA: https://rocm.docs.amd.com/projects/llvm-project/en/latest/LLVM/llvm/html/AMDGPU/gfx1030_waitcnt.html gfx1100 for comparision: https://rocm.docs.amd.com/projects/llvm-project/en/latest/LLVM/llvm/html/AMDGPU/gfx11_waitcnt.html ISA documentation: https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna2-shader-instruction-set-architecture.pdf some of the capabilities that are set to false also seem to be available eg v_fmac_f16 |
Thanks, I'll definitely look into this and test it locally. Looks like the original config for the gfx1030 was done haphazardly, I'll see if there's anything else I can enable based on the sources you provided. |
|
Yeah the LGKM count seems to be a legacy of gfx8/9 where it was 15 (4 bits). Does not seem to be fixed for gfx10, 11 or 12. Unless these capabilities have nothing to with the hardware ISA capabilities and are from a lack of software implementation. There are derived capabilities available using the assembler (LGKM count is still hardcoded to 15 here) in Tensile/Common.py#L2012 |
c137a82 to
b0a83a8
Compare
marbre
left a comment
There was a problem hiding this comment.
@umarinkovic thanks for working on this. Can we rather try to push the main patch to rocm-libraries (might need to split it up into smaller chunks). We try to burn down patches in TheRock and keep only absolute necessary patches but I would like to not add new ones if there is a way around. Patches, especially if that large regularly break and need to be rebased manually when bumping the submodules.
FYI @jammm @ScottTodd
|
Great work!
Ideally we shouldn't have to set this at all. I think rocBLAS needs PR's to include the relevant tensile file(s) to support these archs. copy-pasting ones from other navi2 archs seems to work as an initial step. E.g., for gfx1103 which copied from gfx1100 iirc ROCm/rocm-libraries#1320 |
No problem, I'll see to it that I open a PR in rocm-libraries.
Yup, already got burnt with that 😅 |
Thanks!
I think that all these archs are already supported with this patch actually. What I meant was, I don't have access to gfx1032, gfx1033 etc. cards. I only have access to a gfx1031 and gfx1036. However, any RDNA2 card will run any gfx103X kernel, as long as you set the |
Feel free to ping me on the PR if there is help needed to route this or to raise attention. |
|
Link to upstream PR: ROCm/rocm-libraries#1943 |
|
Hi, any update on RDNA2 support? My current gpu is RX 6700s. |
This is the upstream PR we're working on merging: ROCm/rocm-libraries#1943 It adds support for your card. We're still waiting on one reviewer to finish reviewing so hopefully that will be merged soon, after-which this PR will go through as well. In the meantime, you can check-out this PR, and the upstream PR in rocm-libraries, and build locally. Though, just to warn you, a full build does require robust hardware capacities, you'll need plenty of RAM and CPU power and even then it might run you a couple of hours. If you are on Linux, you can also build for gfx1030 and set the env variable to edit: feel free to ping me if you need any help, also, if you'd like to leave a comment on the upstream PR so that the reviewers see there is interest in this card you could help speed up the process of reviewing 😄 |
Thank you so much for writing back. I am so happy to see they are still cosidering RDNA2 gpus and working on it. I have been meaning to install GROMACS for my work but everytime I come back to same ROCm hurdle, not supporting anything. I had literally given any hope on it, but I am so happy to see it's still alive.
I'm currently on windows 11 25H2 with WSL2. Will that help? I am okay letting it build for hours. But I do not know how to build it locally. My gpu is just sitting idle doing nothing. |
I'm not entirely sure you can run ROCm on WSL, officially TheRock supports building natively on Windows but WSL is still experimental AFAIK. Maybe someone else can chime in? I can help you with building on Windows/Linux natively if that is of any help to you. Perhaps it'd be best to move our discussion to #1125, so as to not clutter this PR. |
|
Looks like ROCm/rocm-libraries#1943 merged! |
c612b5b to
d9c5396
Compare
|
Just to catch up with this issue, were we last trying to get enough machines to enable the tests? |
IMHO, we do not need to run tests in the CI (at least not for landing this) but it would be good to test once prior to landing. I can't because I don't have access to the appropriate HW. Lower hanging fruit, can we at least opt-in the CI to build this @amd-justchen? |
|
@amd-justchen any updates here? Otherwise, @geomin12 can you help here with at least building this for gfx103X. I think you recently added labels. |
|
Finally decided to try compiling this on my pc, for building rocm : Besides the default vsbuild install I had to modify my Visual Studio 2022 BuildTools installation Also winget install bloodrock.pkg-config-lite ; for dvl here are the resulting rocm packages alongside with pytorch packages. https://app.mediafire.com/folder/mvrwkgj96lkua I have confirmation that they work for rx 6800 (my system), rx 6600 and rx 6750xt at least. So besides the gfx1030 and gfx1032 , gfx1031 seems to be successful. Don't have other gpu's so can't test them all but they all seems to be listed in the libraries. So... my point is ... add them already :) |
Can you give the detailing procedure for the same? I have tried the official method and umarinkovic process, but I am not able to understand it properly. I have rx 6700s. I can confirm for mine too on windows 11 then. |
here , patientx/ComfyUI-Zluda#435 and check the other issue I linked there for more details. |
Just confirming this works for RX 6700s as well on windows 11. I was able to use pytorch with cuda successfully. |
|
@marbre , @amd-justchen ; I built the rocm and later pytorch packages for the whole rdna2 (#1629 (comment)) using this pull request and got confirmation from most of the gpu's in the target's as working so can you now merge this ? |
|
The label added Since this run two months ago, Linux and Windows fail: https://github.com/ROCm/TheRock/actions/runs/19926599541/job/58326544022 , https://github.com/ROCm/TheRock/actions/runs/19926599541/job/58326544134 Can we re-trigger the CI to see if it passes? |
if true, can you post logs proving this works in the PR description? |
I am sorry , I am not sure what you mean. What logs are you talking about , I didn't build it on a server or cloud I built it in my pc over two days, interrupting and continuing a few times. I just put them out there and people are using it. I haven't took any logs either , though the build folder is still there if I can get anything from it. EDIT : it seems it succeeded on your tests too. |
There was a problem hiding this comment.
looks like it builds here!
https://github.com/ROCm/TheRock/actions/runs/21878278407/job/63155329548?pr=1629 / https://github.com/ROCm/TheRock/actions/runs/21878278407/job/63155329393?pr=1629
We will add testing when we get test machines
For the time being, these artifacts can be used for folks who have these GPUs
Other errors are unrelated / flaky
Hi, thanks for the approval. If my understanding is correct, currently the bottleneck is due to having no test machines for the RDNA2 (gfx103X) cards? Is there anything that can be handled from my side to speed things up? |
@umarinkovic any test results from your end would be helpful. Nonetheless, this PR can still be merged as the CI machines from our side is an ongoing effort with no precise ETA - @marbre is this ready to be merged? |
hmm, I don't have any remaining logs but I did do extensive testing for rocBLAS on gfx1031/gfx1036. You can see the discussion in the corresponding rocm-libraries PR that was merged some time ago: ROCm/rocm-libraries#1943 |
marbre
left a comment
There was a problem hiding this comment.
@lucbruni-amd yeah, ready to merge. Testing would be nice but the minimal thing I wanted to see (and where Geo pointed to) is a passing build.
|
It will be great if change nightly build from gfx103x-dgpu to gfx103x-all. |
you can probably inquire about this on their discord, I doubt it'll be seen here since the PR has been merged |
## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> gfx103X builds appear to be failing since #2300 got merged. This should address these build errors and unblock gfx103X releases. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> - rocprofiler-compute does not support gfx103X architectures, so we added the project to the `EXCLUDE_TARGET_PROJECTS` list for the gfx103X family in `therock_amdgpu_targets.cmake` - It appears that `EXCLUDE_TARGET_PROJECTS` entries for rocprofiler-compute gfx1031, gfx1033, and gfx1034 were missing - This was due to #1629 being merged two weeks ago, and the changes related to `therock_amdgpu_targets.cmake` in the #2300 PR happening before this new inclusion - Failures were not found before merging due to PR not running gfx103X builds ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> Ensure that gfx103X builds are able to pass without any errors from rocprofiler-compute ## Test Result <!-- Briefly summarize test outcomes. --> - Ran a manual gfx103X workflow with these changes: https://github.com/ROCm/TheRock/actions/runs/22497338111/job/65175257319 ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
@umarinkovic I noticed that in TheRock nightly builds for gfx103X-dgpu there are ONLY Navi31 hipBLASLt kernels present:
All RDNA2 targets are excluded from hipBLASLt TheRock builds https://github.com/ROCm/TheRock/blob/main/cmake/therock_amdgpu_targets.cmake#L102 and I am not sure how/why Navi31 hipBLASLt kernels ended up in gfx103X package... |
|
That is unfortunately a quirk of TheRock / hipBLASLt. hipBLASLt can't be built without passing an architecture, so TheRock defaults to passing it I tried introducing a feature that removed these targets from builds that did not support them but it went nowhere (rightfully so, I have since come to realize that the aforementioned paths are the only viable way to perform this): As far as excluding libhipcxx from some but not all architectures, that is probably a mistake on my part. Seeing as how this PR was blocked for ~2 months on this PR: ROCm/rocm-libraries#1943, and then blocked for another ~2 months because the maintainers had their hands full elsewhere, there were changes to projects included in TheRock with varying support that I did not accommodate for. Since #2946 was merged very soon after this PR, removing libhipcxx from the excluded list of the rest of the targets, this has since been fixed and is consistent both between architectures and with the actual targets supported by the project. As far as CK exclusion from gfx1033 is concerned, it was excluded from that target because it (CK) fails to build for that target and builds for the rest. An issue for that should probably be opened upstream because it is not a problem with TheRock but with that particular project, whether there is a bug or this specific target was excluded for a valid reason. The goal of this PR was not to enable hipBLASLt or CK or any other library for these cards but to simply catch these cards up to speed with the capabilities that they currently have. There is still work to be done and this was just the initial push to unblock them so that users can at least use something. In this regard, the gfx1030, gfx1031 and gfx1032, are the most important imo. I think that it is understandable for users that targets such as gfx1033, which is an integrated gpu, are less supported than the more powerful discrete cards. |
@umarinkovic thank you for your detailed answer and for all work that you've done for gfx103X enablement |


Motivation
Progress on #1564
Closes #1198, #1443
Relates to #1125
Relates to #1002
Different approach to: #1565
Technical Details
Added all missing gfx103X (RDNA2) architectures to the list of allowed targets, following the changes in: ROCm/rocm-libraries#1943 that allowed these architectures to build rocBLAS.
Test Plan
Built locally with gfx103X-all as target and ran smoke-tests on available GPUs, on both Windows and Linux.
Test Result
Tested the resulting rocBLAS kernels on gfx1031 and gfx1036 GPUs both natively and by overriding
HSA_OVERRIDE_GFX_VERSIONto use kernels for other targets, rocBLAS tests no longer fail.Submission Checklist