Session-aware and thread-safe CUDA profiler#13706
Conversation
| std::string parent_name; | ||
|
|
||
| // Tracers may not use Jan 1 1970 as an epoch for timestamps. | ||
| // So, we need to adjust the timestamp to something sensible. |
There was a problem hiding this comment.
I am not sure how this works actually, can you elaborate a bit?
There was a problem hiding this comment.
Sure. This code is essentially trying to account for the fact that the timestamps reported by tracers (specifically, the ROCm tracer) for the beginning and end of a kernel are not guaranteed to use the "standard" epoch of Jan 1 1970. I believe the ROCm tracer uses system boot time as the epoch for timestamps (but I couldn't find this documented anywhere). Now, when we subtract the profiling start time (which is the number of ticks from Jan 1 1970), we end up with a large negative value, which messes up the visualization in a trace viewer as you observed. This code essentially tries to work around this behavior by ignoring the timestamps from tracers, and filling in something that is logical.
Note that this does not affect the durations of events: those are calculated simply as duration = end_time_ns - start_time_ns. It only affects the start timestamp. I'd argue that despite the inaccuracy, the start time is closer to the actual start time than if we simply used the tracer timestamps.
| evt.args["parent_name"] = parent_name; | ||
| } | ||
| evt.ts = last_ts + increment; | ||
| increment += evt.dur; |
There was a problem hiding this comment.
Doesn't this should be increment += <curr_event_ts> - <prev_event_ts> somehow? Otherwise, it makes all children events appears to be back to back, that is a critical skew in the trace view and will give the profiler users some illusion about the event timing.
Alternatively, is it viable just substract the timestamp diff between CPU events' time epoch and GPU events' time epoch? That is, we convert to unify them directly, instead of compute and try to reconstruct?
There was a problem hiding this comment.
Doesn't this should be increment += <curr_event_ts> - <prev_event_ts> somehow? Otherwise, it makes all children events appears to be back to back, that is a critical skew in the trace view and will give the profiler users some illusion about the event timing.
Thanks! This is a much neater way of reconciling the two time streams: we preserve the relative timing between kernel launches. I've implemented a similar idea, by approximating the offset of the first kernel event, and then using that offset to adjust the subsequent kernel events.
As for the alternative approach you suggested: that's challenging, primarily because I could not find documentation about the epoch used by roctracer. I'm presuming that it's system boot time, but I could be wrong.
| while (event_iter != event_end && | ||
| (event_iter->ts < ts || | ||
| (event_iter->ts == ts && | ||
| (event_iter + 1) != event_end && | ||
| (event_iter + 1)->ts == ts))) { | ||
| merged_events.emplace_back(*event_iter); | ||
| ++event_iter; | ||
| } |
There was a problem hiding this comment.
It seems that you are doing a linear scan here, does events guaranteed to be sorted by timestamp?
There was a problem hiding this comment.
From my understanding of the code, events is sorted by timestamp. That said, I've only refactored existing code, which assumed that events is sorted in non-decreasing order.
### Description The existing CUDA profiler is neither session-aware, nor thread-safe. This PR ensures both. ### Motivation and Context [PR 13549](#13549) brought thread-safety and session-awareness to the ROCm profiler. This PR brings the same goodness to the CUDA profiler as well. Sample outputs of a profiling run from the StableDiffusion model (this model was chosen because it requires orchestration of multiple sessions, and verifies that the profilers are now indeed session-aware) on both CUDA and ROCm EPs are attached, along with a script that checks that the trace files generated by the profile are well-formed. Update 11/29: Updated the profile outputs. The older profile outputs exhibited an issue where some timestamps were wildly out of range, leading to problems visualizing the traces. The bug has been fixed and the profile outputs have been updated, along with an update to the check script to ensure that timestamps are monotonically increasing. [sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz) [sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz) [check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip) Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
### Description The existing CUDA profiler is neither session-aware, nor thread-safe. This PR ensures both. ### Motivation and Context [PR 13549](microsoft#13549) brought thread-safety and session-awareness to the ROCm profiler. This PR brings the same goodness to the CUDA profiler as well. Sample outputs of a profiling run from the StableDiffusion model (this model was chosen because it requires orchestration of multiple sessions, and verifies that the profilers are now indeed session-aware) on both CUDA and ROCm EPs are attached, along with a script that checks that the trace files generated by the profile are well-formed. Update 11/29: Updated the profile outputs. The older profile outputs exhibited an issue where some timestamps were wildly out of range, leading to problems visualizing the traces. The bug has been fixed and the profile outputs have been updated, along with an update to the check script to ensure that timestamps are monotonically increasing. [sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz) [sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz) [check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip) Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
### Description #### TLDR This PR ports the existing C++ [EpProfiler](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/framework/execution_provider.h#L359) interfaces used by provider-bridge EPs to the binary-stable C APIs for plugin EPs. It introduces C/C++ APIs for creating/querying profiling events, a container for appending EP events, and callback hooks (`StartEvent`/`StopEvent`) that give EPs access to ORT event metadata in real-time. #### Changes to the original C++ API The original `EpProfiler` C++ interface was adapted for the C API with the following intentional changes: 1. **`StartProfiling`** now receives an offset indicating the elapsed time since profiling started, as opposed to receiving an absolute/epoch-dependent profiling start time. This prevents EPs from having to do epoch conversions. Credit to @edgchen1 for the idea. 2. **`StartEvent`/`StopEvent` receive an absolute, epoch-based correlation ID (`ort_event_correlation_id`)** instead of a relative ORT event ID. The `PluginEpProfiler` bridge layer automatically converts the C++ `relative_ort_event_id` (microseconds since profiling start) to an absolute `ort_event_correlation_id` by adding the epoch-based profiling start time. This means plugin EPs can use the correlation ID directly with profiling utilities like CUPTI or ROCTracer without computing the conversion themselves. 3. **`StopEvent` now receives the completed ORT event as a parameter.** This allows EPs to optionally inspect ORT event metadata (e.g., `op_name`, `event_name`) at the time the event ends, facilitating annotation of correlated EP events. 4. **`EndProfiling` only allows EPs to *append* events (via `OrtProfilingEventsContainer`), not read or modify the full events array.** This is motivated by: - Prevent any one EP from modifying events generated by ORT or another EP. - Certain EPs (VitisAI and WebGPU) already only append events without reading the entire events array. - The CUDA EP reads the entire events array solely to merge/sort its own EP events next to correlated ORT events and add `parent_name`/`op_name` metadata. However: - Merging/sorting is mostly unnecessary since trace viewers that load these files do their own event sorting. - This merging/sorting step was previously required to augment CUDA EP events with metadata from the correlated ORT event. However, that can now be obtained more simply via the new `StopEvent` parameter that provides the EP with the full correlated ORT event. - The [merge algorithm used by CUDA EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397) **incorrectly** assumes ORT events are sorted by non-decreasing *start* time, but they are actually sorted by [non-decreasing *end* time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91) (also see #13706 (comment)). Fixing this would require sorting the entire Events array before asking a provider-bridge EP to merge in its events into the global events array. Not sure this is worth the runtime cost. #### Naming conventions for ORT event IDs - **C++ `EpProfiler` interface** (existing): Uses `relative_ort_event_id` — a timestamp offset in microseconds relative to profiling start. - **C API `OrtEpProfilerImpl`** (new in this PR): Uses `ort_event_correlation_id` — an absolute, epoch-based timestamp in microseconds computed from `std::chrono::high_resolution_clock` (platform-defined epoch). Unique across concurrent profiling sessions within the same process. - **Conversion**: The `PluginEpProfiler` bridge class (in `ep_event_profiling.cc`) performs `ort_event_correlation_id = relative_ort_event_id + profiling_start_time_epoch_us_`, mirroring the pattern in `GPUTracerManager::PushCorrelation`. ### New C APIs | API | Description | |-----|-------------| | `CreateProfilingEvent` | Create a profiling event with category, process/thread IDs, name, timestamp, duration, and key-value args | | `ReleaseProfilingEvent` | Release a profiling event | | `ProfilingEvent_GetCategory` | Get event category (`SESSION`, `NODE`, `KERNEL`, `API`) | | `ProfilingEvent_GetName` | Get event name | | `ProfilingEvent_GetTimestampUs` | Get event start timestamp (µs) | | `ProfilingEvent_GetDurationUs` | Get event duration (µs) | | `ProfilingEvent_GetArgValue` | Get an event argument value by key | | `ProfilingEventsContainer_AddEvents` | Append an array of EP events to the output container | | `OrtEp::CreateProfiler` | Returns an instance of the EP's profiler implementation | | `OrtEpProfilerImpl::StartProfiling` | Called by ORT to start a profiling session. Receives elapsed time offset (ns) since ORT profiling started | | `OrtEpProfilerImpl::StartEvent` | Called by ORT to notify that an ORT event has started. Receives an absolute `ort_event_correlation_id` | | `OrtEpProfilerImpl::StopEvent` | Called by ORT to notify that an ORT event has ended. Receives the same `ort_event_correlation_id` and ORT event metadata | | `OrtEpProfilerImpl::EndProfiling` | Called by ORT to end the profiling session and collect EP events into the output container | | `OrtEpProfilerImpl::Release` | Release the profiler instance | ### New C++ wrapper classes | Class | Description | |-------|-------------| | `Ort::ConstProfilingEvent` | Non-owning const wrapper for reading fields from an `OrtProfilingEvent` (e.g., in `StopEvent`) | | `Ort::ProfilingEvent` | Owning wrapper that creates and manages an `OrtProfilingEvent` (e.g., for `EndProfiling`) | | `Ort::UnownedProfilingEventsContainer` | Non-owning wrapper for adding events to an `OrtProfilingEventsContainer` during `EndProfiling` | ### Example EP profiling implementation This PR updates an example plugin EP to use the new profiling APIs: - Plugin EP code: [test/autoep/library/example_plugin_ep_kernel_registry](https://github.com/microsoft/onnxruntime/tree/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry) - `OrtEpProfilerImpl` implementation: [ep_profiling.h](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.h) / [ep_profiling.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.cc) - `OrtEp::CreateProfiler()` implementation: [ep.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep.cc) ### Existing bugs found Not fixed in this PR. - The [merge algorithm used by CUDA EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397) **incorrectly** assumes ORT events are sorted by non-decreasing *start* time, but they are actually sorted by [non-decreasing *end* time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91) (also see #13706 (comment)). - Run profilers do not handle subgraphs (e.g., subgraph of a control-flow operator). Has been the case since run profilers were [introduced](#26846). ### Motivation and Context Allows plugin EPs to generate profiling events, further closing the functionality gap between provider-bridge EPs and plugin EPs. --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Description
The existing CUDA profiler is neither session-aware, nor thread-safe. This PR ensures both.
Motivation and Context
PR 13549 brought thread-safety and session-awareness to the ROCm profiler. This PR brings the same goodness to the CUDA profiler as well.
Sample outputs of a profiling run from the StableDiffusion model (this model was chosen because it requires orchestration of multiple sessions, and verifies that the profilers are now indeed session-aware) on both CUDA and ROCm EPs are attached, along with a script that checks that the trace files generated by the profile are well-formed.
Update 11/29: Updated the profile outputs. The older profile outputs exhibited an issue where some timestamps were wildly out of range, leading to problems visualizing the traces. The bug has been fixed and the profile outputs have been updated, along with an update to the check script to ensure that timestamps are monotonically increasing.
sd_profile_outputs_cuda.tar.gz
sd_profile_outputs_rocm.tar.gz
check_profile_output_well_formedness.zip