Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

naive_conv_nonpacked_fwd_nchw_half_double_half in KDB cache breaks after #2863 #3023

Open
junliume opened this issue Jun 5, 2024 · 5 comments

Comments

@junliume
Copy link
Collaborator

junliume commented Jun 5, 2024

[Observations}:
Reproducer:

./bin/MIOpenDriver convfp16 -n 128 -c 3 -H 224 -W 224 -k 96 -y 7 -x 7 -p 0 -q 0 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

and

MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: naive_conv_nonpacked_fwd_nchw_half_double_half
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_nonpacked_fwd_nchw_half_double_half, global_work_dim = { 3145728, 1, 1 }, local_work_dim = { 256, 1, 1 }
GPU core dump failed
:0:rocdevice.cpp            :2958: 129010028118 us: [pid:163363 tid:0x7f35d0dff640] Callback: Queue 0x7ef5c8a00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
Aborted (core dumped)

However, if we delete the system KDB, this workload can run through

[Analysis]
Something in #2863 has caused the incompatibility

FYI: @JehandadKhan @atamazov


@junliume
Copy link
Collaborator Author

junliume commented Jun 5, 2024

It should be changes in this file which are causing this problem:
https://github.com/ROCm/MIOpen/blob/9064d096005af004c64d76a8c4c326a6cfd01c0f/src/solver/conv_direct_naive_conv_fwd.cpp

and similar issues might be in other directions too

@junliume
Copy link
Collaborator Author

junliume commented Jun 5, 2024

By removing these lines:

alpha_val,
beta_val,

It starts to work properly again.

So the KDB cache does not accept these parameters. However, why the refreshed KDB still does not accept them? @cderb

@junliume
Copy link
Collaborator Author

junliume commented Jun 5, 2024

@atamazov FYI, this issue and the other GPU Target embedded in compiler issues are related. We will branch soon and need to regenerate quite a few KDBs with all fixes merged in develop branch first.

@atamazov
Copy link
Contributor

atamazov commented Jun 6, 2024

@junliume I recommend reverting #2863, if we have this possibility. Adding alpha/beta to the existing convolution primitive is a nontrivial thing and should be carefully analyzed prior implementing.

@atamazov
Copy link
Contributor

@junliume

[Analysis] Something in #2863 has caused the incompatibility

Of course! If a PR changes the number of input arguments, or their types, then KDB must be regenerated. And changing the number of input arguments is a substantial change (see https://github.com/ROCm/MIOpen/pull/2863/files#r1641574996).

[Note] Unfortunately, I see that some deeper redesign is required in order to make this bilinear stuff (alpha/beta) working correctly, see #2863 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants