Realm GPU Profiling Is Not Precise #1732
cperry/cupti should be nearly production-ready to deal with this. I'm using CUPTI to provide the tight bounds on the timestamps. Unfortunately, due to some limitations with CUPTI, we have to enable recording of most of the activities fairly early on and for the duration of the application, which can cause performance issues. So we'll need to expose a flag or something to enable this, and fall back to the old way when we can't.
For Legion applications this should be fine, since we turn on profiling for the whole run; it's either off or on. How much overhead would CUPTI cause if, say, I only added it to the first task launch and then didn't use it for the rest of the run?
And how much overhead are we actually talking about here (in ballpark terms)? If it's really excessive, then I think a tradeoff of precision vs. overhead would be appreciated. If it's small enough, then we may live with just having it on all the time.
@elliottslaughter Actually, on further research, I was able to limit the activity recording to essentially just the window when the task requests it, and only for the context in question. The GPU time recording will incur launch overhead at each call for the recording, but compared to what we do with the cudart hijack this should be minimal. I don't have a proper measurement of the overhead at the moment; I'll have to ask someone with a proper test to confirm, as we really only have rudimentary tests for this right now.

Ballparking (as requested): launch latency should add perhaps another 0.25us of GPU time to each launch (in order to record the start and stop of each kernel, memcpy, memset, etc.), and that cost is introduced by CUPTI, not by us. On the CPU side, retrieving the buffer is done in another thread and we wake up and wait for that thread to flush the profiling records from CUPTI, so the startup overhead is not significant, but the end-of-task overhead may be noticeable depending on how many CUDA operations are issued in the same GPU context. I don't have a number for this yet, and I've tried narrowing the scope of this as much as I could with CUPTI.

The biggest issue is that I have to enable activity records for every driver and runtime API call, which can quickly fill the buffers. All we really need are the kernel, memcpy, and memset records with their corresponding external correlation records; unfortunately, CUPTI won't emit the external correlation records without enabling all the API records as well.
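To make the tradeoff concrete, here is a minimal sketch (not the actual branch code) of the enablement described above; it assumes the activity-buffer callbacks are registered elsewhere via cuptiActivityRegisterCallbacks, and error codes are ignored for brevity:

```cpp
#include <cupti.h>

// Sketch: enable just the activity kinds needed for per-task GPU timing.
// Kernel/memcpy/memset records are scoped to the task's context; the
// external-correlation records (which tie GPU work back to a task) require
// the driver/runtime API kinds to be enabled too, which is what fills the
// activity buffers so quickly.
void enable_gpu_timing(CUcontext ctx) {
  // Precise GPU-side start/stop timestamps come from these records.
  cuptiActivityEnableContext(ctx, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
  cuptiActivityEnableContext(ctx, CUPTI_ACTIVITY_KIND_MEMCPY);
  cuptiActivityEnableContext(ctx, CUPTI_ACTIVITY_KIND_MEMSET);

  // Needed to map each GPU record back to the task that issued it.
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_EXTERNAL_CORRELATION);

  // CUPTI only emits external-correlation records alongside the API
  // activity records, so these must be on as well (the unwanted cost).
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_DRIVER);
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_RUNTIME);
}
```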
@lightsighter In the current iteration of the branch, there shouldn't be any overhead if you don't explicitly request it. I do have a push/pop of the external correlation id at each task begin, end, and switch, but those are thread-local operations that are super fast, and I don't really have a good way to enable/disable them as needed.
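For reference, the push/pop bracketing mentioned above might look roughly like this (a sketch; the real branch presumably threads Realm's own task identifiers through):

```cpp
#include <cupti.h>
#include <cstdint>

// Sketch: bracket a task's execution with a thread-local external
// correlation id so CUPTI tags every CUDA API call the task makes.
void on_task_begin(uint64_t task_id) {
  cuptiActivityPushExternalCorrelationId(
      CUPTI_EXTERNAL_CORRELATION_KIND_CUSTOM0, task_id);
}

void on_task_end() {
  uint64_t popped = 0;  // receives the id that was on top of the stack
  cuptiActivityPopExternalCorrelationId(
      CUPTI_EXTERNAL_CORRELATION_KIND_CUSTOM0, &popped);
}
// A task switch would pop the outgoing task's id and push the incoming one.
```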
Some more details @muraj and I discussed, which I'll document here for everyone: in the case that Realm does not see any GPU activity during the task, it is going to return the default (invalid) timestamps. Currently Realm is only going to detect kernel launches and memcpy operations done on the GPU. There are lots of other kinds of CUDA activity that we could detect using CUPTI, but it adds a bunch of complexity and code that is probably not going to be exercised very often. Consequently, I've given my blessing that we just capture kernels and memcpy operations for now, since that should handle >95% of the things we want Realm to profile on the GPU anyway. If there's something else that we should profile, people should speak up and say what it is.
I think we can live with that.
@lightsighter by default the start and end timestamps were originally set to INVALID_TIMESTAMP, which is -INT_MAX. Do we want to keep that instead of zero, or should we change INVALID_TIMESTAMP to zero?
I'd be fine with keeping it as is with the INVALID_TIMESTAMP sentinel.
Any sentinel value should definitely not be something that can occur in real data. I think we looked at this a while ago and concluded that, while a zero timestamp could conceivably occur in real data, LLONG_MIN cannot. Having said that, I believe the sentinel we're actually using is LLONG_MIN, not -INT_MAX.
Ah, you're right, LLONG_MIN is what we're using.
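In other words, something along these lines (illustrative only; the real definition lives in Realm's headers):

```cpp
#include <climits>  // LLONG_MIN

typedef long long timestamp_t;  // nanoseconds since some epoch

// LLONG_MIN cannot occur as a real timestamp, unlike zero, which is at
// least representable as a legitimate time value.
const timestamp_t INVALID_TIMESTAMP = LLONG_MIN;

inline bool is_valid(timestamp_t t) { return t != INVALID_TIMESTAMP; }
```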
https://gitlab.com/StanfordLegion/legion/-/merge_requests/1403 is ready for review; please let me know if this is acceptable. Thanks!
Realm provides a profiling response for measuring the upper and lower bounds on when kernels launched by a GPU task are executed. The current implementation, though, is imprecise on the start time: it marks the start time at the point where the task starts running on the host, not when the first kernel is actually launched on the GPU. This is true both with and without the CUDA runtime hijack. It leads to profiles that mistakenly suggest the GPU is busy running kernels when it is not.
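For context, requesting this measurement looks roughly like the following sketch (assuming Realm's OperationTimelineGPU measurement from realm/profiling.h; the task function IDs are placeholders):

```cpp
#include "realm.h"
#include "realm/profiling.h"

using namespace Realm;

// Sketch: ask Realm to report GPU start/end bounds for a task. The
// gpu_task and profile_task IDs are assumed registered elsewhere.
Event launch_with_gpu_timeline(Processor gpu_proc, Processor profile_proc,
                               Processor::TaskFuncID gpu_task,
                               Processor::TaskFuncID profile_task) {
  ProfilingRequestSet prs;
  prs.add_request(profile_proc, profile_task)
     .add_measurement<ProfilingMeasurements::OperationTimelineGPU>();
  // The start_time in the resulting OperationTimelineGPU is what this issue
  // is about: it should mark the first kernel on the GPU, not the moment
  // the task body begins running on the host.
  return gpu_proc.spawn(gpu_task, /*args=*/nullptr, /*arglen=*/0, prs);
}
```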
@jjwilke for visibility.