Realm: Profiling breaks with CUPTI #1800
Comments
Right, the manifestation for this is a request for a
The other problem here is that the CI currently doesn't have a job that runs the profiler on logs generated from an application running on the GPU. @elliottslaughter to add a new CI job to ensure that we catch cases like this in the future.
I feel like the issue is that we forgot to set the
Maybe run with
@elliottslaughter can you send me the application that this occurs on? This shouldn't happen, since we push the CUPTI correlation activity record for each GPU task, so it'll show up in the CUPTI buffer when we flush (when the task completes). Until the buffer is flushed, yes, the GPU timestamps will be invalid since they aren't known yet.
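To make the flush behavior described above concrete, here is a toy model in Python. It is only a sketch, not Realm's code and not the CUPTI API: the `PendingGpuTask` class, the `flush` function, and the dictionary standing in for the CUPTI buffer are all hypothetical, and using 0x8000000000000000 as the "not yet known" placeholder is an assumption based on the value seen in the reported record.

```python
# Toy model (hypothetical names throughout): GPU timestamps stay at a
# placeholder until a flush delivers the activity record for that task.

# Assumption: the placeholder matches the value seen in the reported record.
INVALID_GPU_TIMESTAMP = 0x8000000000000000


class PendingGpuTask:
    """A GPU task whose timings are unknown until its record is flushed."""

    def __init__(self, correlation_id):
        self.correlation_id = correlation_id
        self.gpu_start = INVALID_GPU_TIMESTAMP
        self.gpu_stop = INVALID_GPU_TIMESTAMP

    def is_resolved(self):
        # A task is resolved once neither timestamp is the placeholder.
        return (self.gpu_start != INVALID_GPU_TIMESTAMP
                and self.gpu_stop != INVALID_GPU_TIMESTAMP)


def flush(pending, activity_buffer):
    """Fill in timestamps from a flushed buffer.

    activity_buffer maps correlation_id -> (start, stop). Returns the tasks
    that had no matching record and therefore still carry the placeholder.
    """
    unresolved = []
    for task in pending:
        times = activity_buffer.get(task.correlation_id)
        if times is None:
            unresolved.append(task)
            continue
        task.gpu_start, task.gpu_stop = times
    return unresolved
```

In this model, a task that is logged while it is still in the list returned by `flush` would be written out with both timestamps at the placeholder, which is the shape of the record reported in this issue.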
I think Elliott was seeing it on Regent's implementation of Pennant. I saw it once too when running with Pennant C++, but that was on the only run that I was successfully able to get through the machine without hitting #1802 or #1803, so I wouldn't suggest trying to use Pennant C++ until those are fixed.
@muraj I have a reproducer sitting on Sapling here:
It includes a script to run the application:
Then you just run the profiler on the resulting logs to see it crash. If you want a reproducer you can rebuild, let me know and I can provide those instructions as well.
When running a Legion GPU program with CUPTI support, I observed that the following GPU task record was generated:
GPUTaskInfo with invalid GPU time
Note the invalid timestamps for gpu_stop and gpu_start (9223372036854775808, aka 0x8000000000000000). This currently causes the profiler to choke.

I observed that running the program with -cuda:cupti 0 causes the problem to go away, and the profile renders successfully.

There are a couple things I think we need to do:
I produced this on Sapling with CUDA 12.1. Logs are available at /scratch/eslaught/legion-retreat-2024/language/pennant.run1/logs_cupti. Up one directory from there is the script (and binaries) to reproduce this issue.

CC @lightsighter for visibility.
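For illustration, a profiler-side guard against these records might look like the following Python sketch. The field names gpu_start and gpu_stop and the sentinel value come from the record above; the dictionary record layout and the helper names `has_valid_gpu_times` and `split_records` are hypothetical and do not reflect the actual profiler code.

```python
# Sentinel from the reported record:
# 9223372036854775808 == 0x8000000000000000 == 2**63.
INVALID_GPU_TIMESTAMP = 0x8000000000000000


def has_valid_gpu_times(record):
    """Return True if both GPU timestamps look usable.

    `record` is a hypothetical dict with "gpu_start" and "gpu_stop" keys.
    """
    start = record["gpu_start"]
    stop = record["gpu_stop"]
    if start == INVALID_GPU_TIMESTAMP or stop == INVALID_GPU_TIMESTAMP:
        return False
    # A stop time before the start time is also treated as unusable.
    return start <= stop


def split_records(records):
    """Partition records into (valid, invalid) lists, preserving order."""
    valid, invalid = [], []
    for record in records:
        if has_valid_gpu_times(record):
            valid.append(record)
        else:
            invalid.append(record)
    return valid, invalid
```

Separating out the invalid records rather than discarding them silently would let a tool report how many GPU tasks lacked timings instead of crashing on the first one.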