[XNACK] GPU is asleep during the copy and not waking back up when it should. #2941

Open
junliume opened this issue May 2, 2024 · 7 comments

Comments

@junliume
Collaborator

junliume commented May 2, 2024

The cause appears to be that the GPU is asleep during the copy and not waking back up when it should.

Changing the grub options allowed these tests to pass on my test machine.
ROCm/ROCm#2418 (comment)

Originally posted by @cderb in #2864 (comment)

@junliume
Collaborator Author

junliume commented May 2, 2024

@cderb let's check whether we already have an internal ticket for this; if not, we should create one for the runtime and driver. Thanks! FYI: @JehandadKhan @atamazov

@atamazov
Contributor

atamazov commented May 2, 2024

@cderb What is the base driver version? 5.6, as mentioned in ROCm/ROCm#2418?

@cderb
Contributor

cderb commented May 6, 2024

@atamazov this test Docker image was on ROCm 6.1.0-82

@atamazov
Contributor

atamazov commented May 6, 2024

@cderb Thanks, but the base driver is not included in the image. Can you please provide the output of one of the following commands (run it outside the container):

modinfo amdgpu | grep -i -E "(version:)|(vermagic:)"

or

/opt/rocm/bin/rocm-smi --showdriverversion
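
As a rough sketch of how the requested fields could be isolated, the helper below filters `modinfo`-style text read from stdin. The function name `driver_fields` is hypothetical and not part of ROCm or any tool mentioned in this thread. It anchors the pattern at the start of the line, so `srcversion:` is not matched the way it is by the unanchored `version:` pattern.

```shell
# Hypothetical helper (not part of ROCm): keep only the "version:" and
# "vermagic:" lines from modinfo-style output on stdin, and collapse the
# padding after the field name to a single space.
driver_fields() {
  grep -i -E '^(version|vermagic):' | sed -E 's/:[[:space:]]+/: /'
}
```

On the host, outside the container, it would be used as `modinfo amdgpu | driver_fields`.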

@cderb
Contributor

cderb commented May 6, 2024

@atamazov

version:        5.18.13
srcversion:     7D4E7C8EA7D467BB8AED6A1
vermagic:       5.15.0-105-generic SMP mod_unload modversions

Perhaps that means updating the base driver on this machine could resolve this issue?

@cderb
Contributor

cderb commented May 6, 2024

However, the base version on our CI nodes is 6.2.4, and I believe we observe the same issue there.

@atamazov
Contributor

atamazov commented May 6, 2024

@cderb Hmm... The most recent ROCm release is 6.1.0, so how can the CI nodes have 6.2.4 installed?


4 participants