Data corruption with CUDA hijack off #1782

elliottslaughter · 2024-10-25T18:21:21Z

This is a follow-on from #1682. I'm building S3D on a variety of machines, and the behavior I currently see is:

Sapling:
- CUDA hijack on: works
- CUDA hijack off: data/memory corruption (see below)
Perlmutter:
- CUDA hijack on: hits Realm: Assertion `finder != device_functions.end()' failed #1682 (comment)
- CUDA hijack off: data/memory corruption (see below)
Frontier: works (this is HIP so hijack on/off is sort of irrelevant)

In the case with the CUDA hijack off, errors look like:

DID NOT CONVERGE!!!!
[2 - 7f0703e40000]    5.668820 {6}{gpu}: CUDA error reported on GPU 0: an illegal memory access was encountered (CUDA_ERROR_ILLEGAL_ADDRESS)
*** Caught a fatal signal (proc 2): SIGABRT(6)

Because it goes away with hijack (at least on Sapling), this smells like a synchronization issue.

What's notable to me here is that Frontier and Perlmutter (with hijack off) should be following the same code paths. HIP, as you may recall, has never really had a hijack, so users have always been required to query the task HIP stream for kernel launches. I have now gone and unified the code so that, as much as possible, we are running identical code in the CUDA case. It should be difficult or impossible for the HIP code to be obtaining the task stream but not doing so in the CUDA case. However, we still hit the issue above.

Since I suspected a synchronization issue, I went and commented out every instance of set_task_ctxsync_required(false) in the application. If I understand correctly, this should force a synchronization after every task. This is the primary difference between hijack and non-hijack modes, so it seems like the more likely culprit. To be really sure, I also applied the following diff to Realm:

diff --git a/runtime/realm/cuda/cuda_module.cc b/runtime/realm/cuda/cuda_module.cc
index f06e9cd21..628fe70e0 100644
--- a/runtime/realm/cuda/cuda_module.cc
+++ b/runtime/realm/cuda/cuda_module.cc
@@ -4218,6 +4218,7 @@ namespace Realm {
 
     void CudaModule::set_task_ctxsync_required(bool is_required)
     {
+      abort(); // Elliott: make sure we never hit this
       // if we're not in a gpu task, setting this will have no effect
       ThreadLocal::context_sync_required = (is_required ? 1 : 0);
     }

If I understand correctly, this ensures that we do not hit this code path, anywhere in the application. But after rebuilding I'm still hitting the error above.

At this point, I wonder if cuCtxRecordEvent (from #1730 (comment)) is somehow not having the behavior we expect? Again, I don't see what else could be different between the hijack and non-hijack builds. We are running the same application code and Legion version. In the case of Sapling, I can literally rebuild with one flag set.

Is there a way to shut off the cuCtxRecordEvent code path and just do a plain old cuCtxSynchronize? Unless someone else has another suggestion, this seems like the next thing to try.

@muraj for visibility.

The text was updated successfully, but these errors were encountered:

elliottslaughter · 2024-10-25T18:29:05Z

Here's the driver on the node in question. I guess it's too old to have cuCtxRecordEvent.

$ srun -n 1 -N 1 nvidia-smi
Fri Oct 25 11:28:03 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

That was my last idea. Suggestions?

Edit: actually, I guess the driver is distinct from the CUDA version. Can someone check if this driver has cuCtxRecordEvent?

muraj · 2024-10-26T03:13:09Z

It does not, you need to compile with at least 12.5 and the driver needs to be at least r550 I believe. I don't think cuCtxRecordEvent is the problem here. My guess is there is some race happening somewhere that the hijack makes less likely somehow. I would try to look at the address that is causing the failure and backtrack where it came from and why it's invalid.

lightsighter · 2024-10-26T06:47:32Z

@elliottslaughter Can you run with compute-sanitizer --lineinfo ... to see which line of which kernel is doing the illegal memory access?

muraj · 2024-12-04T21:37:40Z

@elliottslaughter any updates on this?

elliottslaughter · 2024-12-10T00:03:24Z

Unfortunately, I need this working sooner rather than later, so I will probably hack the build so I can turn the hijack back on.

I'll try to find some time to respond to the specific debug suggestion as well, but first priority is to get the app up and running.

muraj self-assigned this Oct 26, 2024

syamajala mentioned this issue Nov 1, 2024

Freeze in S3D TDB on Marlowe #1788

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data corruption with CUDA hijack off #1782

Data corruption with CUDA hijack off #1782

elliottslaughter commented Oct 25, 2024

elliottslaughter commented Oct 25, 2024 •

edited

Loading

muraj commented Oct 26, 2024

lightsighter commented Oct 26, 2024

muraj commented Dec 4, 2024

elliottslaughter commented Dec 10, 2024

Data corruption with CUDA hijack off #1782

Data corruption with CUDA hijack off #1782

Comments

elliottslaughter commented Oct 25, 2024

elliottslaughter commented Oct 25, 2024 • edited Loading

muraj commented Oct 26, 2024

lightsighter commented Oct 26, 2024

muraj commented Dec 4, 2024

elliottslaughter commented Dec 10, 2024

elliottslaughter commented Oct 25, 2024 •

edited

Loading