Realm: Assertion `finder != device_functions.end()' failed #1682
Comments
@muraj Can you take a look at this?
Could really use some help with this, as I'm trying to get some Gordon Bell runs done on Perlmutter.
After talking with @syamajala, we figured out the issue: we build Realm with CUDART_HIJACK=ON, but the Cray wrappers link cudart automatically, so the hijack is not active at runtime and none of these kernels are registered with Realm.
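To illustrate the failure mode described above, here is a minimal sketch in plain C++. It is not Realm's actual code: `KernelRegistry`, `launch_finds_kernel`, and `my_kernel_stub` are hypothetical names made up for this example, and the only thing borrowed from the report is the shape of the failing check (`finder != device_functions.end()`). The assumption is that kernels end up in the runtime's table only if the runtime's own registration hook is the one that gets called.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a runtime's table of device functions,
// keyed by the address of each kernel's host-side stub.
struct KernelRegistry {
  std::map<const void *, std::string> device_functions;

  // Only reached when the runtime's own registration hook is the one called.
  void register_function(const void *host_fun, const std::string &name) {
    assert(host_fun != nullptr);
    device_functions[host_fun] = name;
  }

  // Same shape as the failing check: true when the kernel is known at launch.
  bool knows(const void *host_fun) const {
    auto finder = device_functions.find(host_fun);
    return finder != device_functions.end();
  }
};

// Placeholder object whose address plays the role of a kernel's host stub.
static const char my_kernel_stub = 0;

// Simulates startup registration followed by a launch-time lookup.
// hijack_active == false models another library receiving the registration
// call, so the runtime's table stays empty.
bool launch_finds_kernel(bool hijack_active) {
  KernelRegistry registry;
  if (hijack_active) {
    registry.register_function(&my_kernel_stub, "my_kernel");
  }
  return registry.knows(&my_kernel_stub);
}
```

In this sketch, `launch_finds_kernel(true)` succeeds, while `launch_finds_kernel(false)` corresponds to the case where the lookup comes back empty and an assertion on it would fire.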
Yeah, the Cray wrappers are broken and have no way to turn off linking against cudart. I opened a NERSC ticket about this a year ago. Their solution was for me to link things by hand and just remove [...]. I am able to do my runs now.
@syamajala is there a reason you are using the cudart hijack? Is there something that Realm can do to make the cudart hijack not necessary for your use case?
#1059 is why we still need the hijack.
Roger that. @elliottslaughter, @magnatelee can we get progress on #1059 so we can close out all these related issues? @syamajala I don't think there is anything here that Realm or Legion can do to resolve this conflict, as this is a known limitation of the cudart hijack. Can we close this issue and direct you to helping prioritize #1059 so we can remove the cudart hijack?
I think we should keep it open for now because I had completely forgotten what happens when we try to build/run the TDB branch of S3D on Perlmutter. I will likely forget again. We can close it when #1059 is fixed.
Understood. I'll at least label this with a cudart_hijack label so it won't be prioritized until #1059 is dealt with.
This bug is open because I need to urgently run S3D on Perlmutter for unrelated reasons. The linking hack is obnoxious enough that I might just rip out the CUDA hijack in Regent, but it will depend on what ends up being easier.
Now that https://gitlab.com/StanfordLegion/legion/-/merge_requests/1502 is available, I plan to retest this in that branch, and if resolved, I'll close this issue.
The Regent changes have merged. There are still a few application-level changes required to run properly without the hijack (because the application includes hand-written CUDA kernels) that I am currently testing.
I've ported the application to avoid needing the hijack as well, so I think we're done here. The error does not occur if the application does not use the hijack.
I'm hitting the following assertion in Realm: `finder != device_functions.end()`.
This only seems to be happening on Perlmutter. I was able to run on blaze and sapling without any problems. I tried CUDA 11.7, 12.0, and 12.2 on Perlmutter, and all three hit the same assertion.
Here is a stack trace: